THE KLUWER INTERNATIONAL SERIES IN ENGINEERING AND COMPUTER SCIENCE
OFFICE OF NAVAL RESEARCH Advanced Book Series
Consulting Editor: André M. van Tilborg
Other titles in the series:

FOUNDATIONS OF DEPENDABLE COMPUTING: Paradigms for Dependable Applications, edited by Gary M. Koob and Clifford G. Lau, ISBN: 0-7923-9485-2
FOUNDATIONS OF DEPENDABLE COMPUTING: System Implementation, edited by Gary M. Koob and Clifford G. Lau, ISBN: 0-7923-9486-0
PARALLEL ALGORITHM DERIVATION AND PROGRAM TRANSFORMATION, edited by Robert Paige, John Reif and Ralph Wachter, ISBN: 0-7923-9362-7
FOUNDATIONS OF KNOWLEDGE ACQUISITION: Cognitive Models of Complex Learning, edited by Susan Chipman and Alan L. Meyrowitz, ISBN: 0-7923-9277-9
FOUNDATIONS OF KNOWLEDGE ACQUISITION: Machine Learning, edited by Alan L. Meyrowitz and Susan Chipman, ISBN: 0-7923-9278-7
FOUNDATIONS OF REAL-TIME COMPUTING: Formal Specifications and Methods, edited by André M. van Tilborg and Gary M. Koob, ISBN: 0-7923-9167-5
FOUNDATIONS OF REAL-TIME COMPUTING: Scheduling and Resource Management, edited by André M. van Tilborg and Gary M. Koob, ISBN: 0-7923-9166-7
FOUNDATIONS OF DEPENDABLE COMPUTING
Models and Frameworks for Dependable Systems
edited by
Gary M. Koob
Clifford G. Lau
Office of Naval Research
KLUWER ACADEMIC PUBLISHERS Boston / Dordrecht / London
Distributors for North America: Kluwer Academic Publishers 101 Philip Drive Assinippi Park Norwell, Massachusetts 02061 USA Distributors for all other countries: Kluwer Academic Publishers Group Distribution Centre Post Office Box 322 3300 AH Dordrecht, THE NETHERLANDS
Library of Congress Cataloging-in-Publication Data

Foundations of dependable computing. Models and frameworks for dependable systems / [edited by Gary M. Koob, Clifford G. Lau].
p. cm. -- (The Kluwer international series in engineering and computer science ; SECS 0283)
Includes bibliographical references and index.
ISBN 0-7923-9484-4
1. Electronic digital computers--Reliability. 2. Real-time data processing. 3. Fault-tolerant computing. I. Koob, Gary M., 1958- . II. Lau, Clifford. III. Series: Kluwer international series in engineering and computer science ; SECS 0283.
QA76.5.F6238 1994
004.2'1--dc20   94-29126 CIP
1. FRAMEWORKS FOR DEPENDABLE SYSTEMS ................................................. 1

1.1 A Consensus-Based Framework and Model for the Design of Responsive Computing Systems ............................................................................... 3
M. Malek

1.2 A Methodology for Adapting to Patterns of Faults .......................................... 23
G. Agha and D. C. Sturman

2. REQUIREMENTS MODELS .............................................................................. 61

2.1 Military Fault-Tolerant Requirements ............................................................ 63
T. P. Monaghan

2.3 Derivation and Use of Deadline Information in Real-Time Control Systems ........ 77
K. G. Shin and H. Kim

3. SYSTEM VALIDATION .................................................................................. 111

3.1 Software-Implemented Fault Injection of Transient Hardware Errors ............... 113
C. R. Yount and D. P. Siewiorek

3.2 REACT: An Integrated Tool for the Design of Dependable Computer Systems ... 169
J. A. Clark and D. K. Pradhan

4. SYSTEM EVALUATION ................................................................................. 193

4.1 Measurement-Based Dependability Evaluation of Operational Computer Systems ... 195
R. K. Iyer and D. Tang

4.2 Modeling and Evaluation of Opto-Electronic Computing Systems .................... 235
T. Y. Lin
Dependability has long been a central concern in the design of space-based and military systems, where survivability for the prescribed mission duration is an essential requirement, and is becoming an increasingly important attribute of govemment and commercial systems where reduced availability may have severe financial consequences or even lead to loss of life. Historically, research in the field of dependable computing has focused on the theory and techniques for preventing hardware and environmentally induced faults through increasing the intrinsic reliability of components and systems (fault avoidance), or surviving such faults through massive redundancy at the hardware level (fault tolerance). Recent advances in hardware, software, and measurement technology coupled with new insights into the nature, scope, and fundamental principles of dependable computing, however, contributed to the creation of a challenging new research agenda in the late eighties aimed at dramatically increasing the power, effectiveness, and efficiency of approaches to ensuring dependability in critical systems At the core of this new agenda was a paradigm shift spurred by the recognition that dependability is fundamentally an attribute of applications and services--not platforms. Research should therefore focus on (1) developing a scientific understanding of the manifestations of faults at the application level in terms of their ultimate impact on the correctness and survivability of the application; (2) innovative, application-sensitive approaches to detecting and mitigating this impact; and (3) hierarchical system support for these new approaches. Such a paradigm shift necessarily entailed a concomitant shift in emphasis away from inefficient, inflexible, hardware-based approaches toward higher level, more efficient and flexible software-based solutions. Consequently, the role of hardware-based mechanisms was redefined to that of providing and implementing the abstractions required to support the higher level software-based mechanisms in an integrated, hierarchical approach to ultradependable system design. This shift was furthermore compatible with an expanded view of "dependability," which had evolved to mean "the ability of the system to deliver the specified (or expected) service." Such a definition encompasses not only survival of traditional single hardware faults and enviromnental disturbances but more complex and less-well understood phenomena, as well: Byzantine faults, correlated errors, timing faults, software design and process interaction errors, and--most significantly--the unique issues encountered in real-
time systems in which faults and transient overload conditions must be detected and handled under hard deadline and resource constraints. As sources of service disruption multiplied and focus shifted to their ultimate effects, traditional frameworks for reasoning about dependability had to be rethought. The classical fault/error/failure model, in which underlying anomalies (faults) give rise to incorrect values (errors), which may ultimately cause incorrect behavior at the output (failures), required extension to capture timing and performance issues. Graceful degradation, a long-standing principle codifying performance/dependability trade-offs, must be more carefully applied in real-time systems, where individual task requirements supersede general throughput optimization in any assessment. Indeed, embedded real-time systems--often characterized by interaction with physical sensors and actuators--may possess an inherent ability to tolerate brief periods of incorrect interaction, either in the values exchanged or the timing of those exchanges. Thus, a technical failure of the embedded computer does not necessarily imply a system failure. The challenge of capturing and modeling dependability for such potentially complex requirements is matched by the challenge of successfully exploiting them to devise more intelligent and efficient--as well as more complete--dependability mechanisms.

The evolution to a hierarchical, software-dominated approach would not have been possible without several enabling advances in hardware and software technology over the past decade:

(1) Advances in VLSI technology and RISC architectures have produced components with more chip real estate available for incorporation of efficient concurrent error detection mechanisms and more on-chip resources permitting software management of fine-grain redundancy;

(2) The emergence of practical parallel and distributed computing platforms possessing inherent coarse-grain redundancy of processing and communications resources--also amenable to efficient software-based management by either the system or the application;

(3) Advances in algorithms and languages for parallel and distributed computing leading to new insights in and paradigms for problem decomposition, module encapsulation, and module interaction, potentially exploitable in refining redundancy requirements and isolating faults;

(4) Advances in distributed operating systems allowing more efficient interprocess communication and more intelligent resource management;
(5) Advances in compiler technology that permit efficient, automatic instrumentation or restructuring of application code, program decomposition, and coarse and fine-grain resource management; and

(6) The emergence of fault-injection technology for conducting controlled experiments to determine the system- and application-level manifestations of faults and evaluating the effectiveness or performance of fault-tolerance methods.

In response to this challenging new vision for dependable computing research, the advent of the technological opportunities for realizing it, and its potential for addressing critical dependability needs of Naval, Defense, and commercial systems, the Office of Naval Research launched a five-year basic research initiative in 1990 in Ultradependable Multicomputers and Electronic Systems to accelerate and integrate progress in this important discipline. The objective of the initiative is to establish the fundamental principles as well as practical approaches for efficiently incorporating dependability into critical applications running on modern platforms. More specifically, the initiative sought increased effectiveness and efficiency through

(1) Intelligent exploitation of the inherent redundancy available in modern parallel and distributed computers and VLSI components;

(2) More precise characterization of the sources and manifestations of errors;

(3) Exploitation of application semantics at all levels--code, task, algorithm, and domain--to allow optimization of fault-tolerance mechanisms to both application requirements and resource limitations;

(4) Hierarchical, integrated software/hardware approaches; and

(5) Development of scientific methods for evaluating and comparing candidate approaches.

Implementation of this broad mandate as a coherent research program necessitated focusing on a small cross-section of promising application-sensitive paradigms (including language, algorithm, and coordination-based approaches), their required hardware, compiler, and system support, and a few selected modeling and evaluation projects. In scope, the initiative emphasizes dependability primarily with respect to an expanded class of hardware and environment (both physical and operational) faults. Many of the efforts furthermore explicitly address issues of dependability unique to the domain of embedded real-time systems.

The success of the initiative and the significance of the research is demonstrated by the ongoing associations that many of our principal investigators have forged with a variety of military, Government, and commercial projects whose critical needs are leading to the rapid assimilation of concepts, approaches, and expertise arising from this initiative. Activities influenced to date include the FAA's Advanced Automation System for air traffic control, the Navy's AX project and Next Generation Computing Resources standards program, the Air Force's Center for Dependable Systems, the OSF/1 project, the space station Freedom, the Strategic
Defense Initiative, and research projects at GE, DEC, Tandem, the Naval Surface Warfare Center, and the MITRE Corporation.

This book series is a compendium of papers summarizing the major results and accomplishments attained under the auspices of the ONR initiative in its first three years. Rather than providing a comprehensive text on dependable computing, the series is intended to capture the breadth, depth, and impact of recent advances in the field, as reflected through the specific research efforts represented, in the context of the vision articulated here. Each chapter does, however, incorporate appropriate background material and references. In view of the increasing importance and pervasiveness of real-time concerns in critical systems that impact our daily lives--ranging from multimedia communications to manufacturing to medical instrumentation--the real-time material is woven throughout the series rather than isolated in a single section or volume.

The series is partitioned into three volumes, corresponding to the three principal avenues of research identified at the beginning of this preface. While many of the chapters actually address issues at multiple levels, reflecting the comprehensive nature of the associated research project, they have been organized into these volumes on the basis of the primary conceptual contribution of the work. Agha and Sturman, for example, describe a framework (reflective architectures), a paradigm (replicated actors), and a prototype implementation (the Screed language and Broadway runtime system). But because the salient attribute of this work is the use of reflection to dynamically adapt an application to its environment, it is included in the Frameworks volume.

Volume I, Models and Frameworks for Dependable Systems, presents two comprehensive frameworks for reasoning about system dependability, thereby establishing a context for understanding the roles played by specific approaches presented throughout the series. This volume then explores the range of models and analysis methods necessary to design, validate, and analyze dependable systems.

Volume II, Paradigms for Dependable Applications, presents a variety of specific approaches to achieving dependability at the application level. Driven by the higher level fault models of Volume I and built on the lower level abstractions implemented in Volume III, these approaches demonstrate how dependability may be tuned to the requirements of an application, the fault environment, and the characteristics of the target platform. Three classes of paradigms are considered: protocol-based paradigms for distributed applications, algorithm-based paradigms for parallel applications, and approaches to exploiting application semantics in embedded real-time control systems.

Volume III, System Implementation, explores the system infrastructure needed to support the various paradigms of Volume II. Approaches to implementing
support mechanisms and to incorporating additional appropriate levels of fault detection and fault tolerance at the processor, network, and operating system level are presented. A primary concern at these levels is balancing cost and performance against coverage and overall dependability. As these chapters demonstrate, low overhead, practical solutions are attainable and not necessarily incompatible with performance considerations. The section on innovative compiler support, in particular, demonstrates how the benefits of application specificity may be obtained while reducing hardware cost and run-time overhead.

This first volume in the series covers system architectures or frameworks that serve as the foundation for dependable system design and the various models required at each layer of the system hierarchy and stage of its lifecycle to guide design decisions and evaluate their effectiveness. Section 1 presents two frameworks for the study and design of dependable systems. Malek emphasizes the layered view of dependability advocated throughout this series and presents the concept of universal consensus for realizing dependability at each level in the context of distributed real-time systems. Agha and Sturman introduce the concepts of reflection and encapsulation as vehicles for tuning dependability to a dynamically changing fault environment and application requirements while maintaining the transparency of dependability mechanisms to the application itself. The concepts are made concrete through the example of an actor-based language and run-time system, highlighting the importance of language hooks in granting users enhanced control over detection and recovery mechanisms.

Given these frameworks, Section 2 addresses the issue of mathematically characterizing dependability requirements in a manner exploitable by them. Monaghan introduces the section by outlining the real requirements demanded by typical military systems and the difficulty of precisely translating those requirements for systems designers and verifying the results. In real-time systems, dependability encompasses timeliness as well as correctness. Shin presents an approach to deriving precise deadline constraints from the application semantics to provide dependability mechanisms with maximum flexibility based on true requirements rather than specifications of undocumented origin. Shin also extends the layered view of dependability to the larger system level: an erroneous output from an embedded computer, while technically a failure of that computer, may still be recoverable at the system level if the controlled process is robust enough.

Once a system is designed using appropriate high-level abstractions and fault models, the problem remains to validate the design against the types of faults anticipated in actual operation. An emerging approach to this critical problem is fault injection, in which the response of the system to low-level injected errors is gauged. Efficiency of this process demands an intermediate model to guide injection that preserves coverage while simplifying and accelerating the testing. One such approach
and the issues involved in applying it are examined by Yount and Siewiorek in Section 3. Clark and Pradhan complete the picture by describing the REACT testbed for modeling and validating dependable system designs.

Whereas the models presented thus far capture the behavior of the system in response to particular fault scenarios, a global, quantitative analysis of system dependability in terms of the probabilistic measures of reliability, availability, and performability is necessary in order to judge whether the overall requirements have been met and to guide allocation of resources to the most critical system components. Iyer and Tang take an empirical approach, using data from operational systems to drive their models, identify trends, and capture the shifting focus of dependability concerns as hardware and software technology evolve. Lin takes an analytical approach to developing a quantitative method for evaluating design alternatives in the context of opto-electronic interconnection networks.

Gary M. Koob
Mathematical, Computer and Information Sciences Division
Office of Naval Research

Clifford G. Lau
Electronics Division
Office of Naval Research
ACKNOWLEDGEMENTS
The editors regret that, due to circumstances beyond their control, two planned contributions to this series could not be included in the final publications: "Compiler Generated Self-Monitoring Programs for Concurrent Detection of Run-Time Errors," by J.P. Shen and "The Hybrid Fault Effects Model for Dependable Systems," by C.J. Walter, M.M. Hugue, and N. Suri. Both represent significant, innovative contributions to the theory and practice of dependable computing and their omission diminishes the overall quality and completeness of these volumes.

The editors would also like to gratefully acknowledge the invaluable contributions of the following individuals to the success of the Office of Naval Research initiative in Ultradependable Multicomputers and Electronic Systems and this book series: Joe Chiara, George Gilley, Walt Heimerdinger, Robert Holland, Michelle Hugue, Miroslaw Malek, Tim Monaghan, Richard Scalzo, Jim Smith, André van Tilborg, and Chuck Weinstock.
ERRATA
1. Due to a late editing decision, Section 2.2 was removed from this volume. The somewhat anomalous numbering scheme employed in Section 2 reflects the original organization.

2. The following notes were inadvertently omitted from Section 3.2:

• This section is based on research sponsored, in part, by the Office of Naval Research under grants N00014-91-J-1404 and N00014-92-J-1366 and conducted at the University of Massachusetts at Amherst and Texas A&M University.

• Jeffrey A. Clark is with the MITRE Corporation, Bedford, Massachusetts.

• Dhiraj K. Pradhan is with Texas A&M University, College Station, Texas.
SECTION 1
FRAMEWORKS FOR DEPENDABLE SYSTEMS
SECTION 1.1
A Consensus-Based Framework and Model for the Design of Responsive Computing Systems

Miroslaw Malek
Abstract

The emerging discipline of responsive systems demands fault-tolerant and real-time performance in uniprocessor, parallel, and distributed computing environments. A new proposal for a measure of responsiveness is presented, followed by an introduction of a consensus-based framework and model for responsive computing. The model, called CONCORDS (CONsensus/COmputation model for Responsive Distributed Systems), is based on the integration of various forms of consensus and computation (management, then progress or recovery). The consensus tasks include clock synchronization, phase initialization and termination, diagnosis, checkpointing, resource allocation, and scheduling.

Index Terms: consensus, distributed computing, fault tolerance, operating system, parallel computing, real time, responsive systems.

This research was supported in part by the Office of Naval Research Contract No. N00014-88-K-0543, Grant No. N00014-91-J-1858 and the Texas Advanced Technology Grant 386.

Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, Texas 78712. Phone: 1-512-471-5704, Fax: 1-512-471-0954, Email: malek@ece.utexas.edu
1.1.1
Introduction
The integration of the traditionally separate disciplines of fault-tolerant computing, real-time systems, and parallel/distributed computing may provide solutions to one of the greatest challenges of this decade; namely, the design and implementation of responsive computer systems. A responsive computer system [7] is a fault-tolerant, real-time system whose objective is timely and correct execution of programs in uniprocessor, parallel, and distributed computing environments, even in the presence of faults. In a nutshell, responsive systems are expected to maximize the probability of correct, on-time program execution despite faults.

Society, business owners, and government officials worldwide have been sold on the promise that open, distributed computer systems would deliver improved reliability and timeliness. The reality seems to be to the contrary, and effective solutions for making these systems responsive are badly needed. Responsive computer systems are crucial in many applications. Communication, control, avionic systems, robotics, multimedia, decision support systems, banking, air traffic control, and even point-of-sale terminals are just a few examples of systems that require a high degree of responsiveness.

The objective of this chapter is to introduce a concrete model for optimizing both timeliness and classical measures of dependability, such as reliability and availability, in the multicomputing environment. Our philosophy is based on the observation that high responsiveness can only be achieved by developing an ultrareliable kernel, whose correctness is provable, and combining it with application-specific responsive system design techniques. This is due to the diversity of applications and environments that make highly responsive systems extremely difficult to design.

Although several fault-tolerant systems have been designed in the past, only two of them focused directly on both fault tolerance and real time: the Multicomputer Architecture for Fault Tolerance (MAFT) [5] and the Maintainable Real-Time System (MARS) [6]. While MAFT manages redundancy via Byzantine agreement protocols, MARS uses duplicated processors and relies on a fail-silent fault model. MAFT
separates scheduling, synchronization, and voting (SSV) from computing in space by using separate processors for SSV and computing. Both systems rely heavily on static scheduling, which limits their application domain.

Our approach, called CORE (Consensus for REsponsiveness), intends to support fault tolerance and real time in dynamic environments where fault and task arrival times, durations, and deadlines can be modeled by specific distributions. CORE is a software layer, ideally a kernel. Our approach focuses on the identification of the fundamental functions that must be incorporated in a responsive system kernel and the development of application-specific techniques for maximizing responsiveness. This approach is based on two further observations: the omnipresence of consensus (defined here as an agreement among computers) and the inevitability of probability. We view synchronization, initialization, termination, communication, data access and storage protocols, diagnosis, selected error detection and error correction codes, scheduling, replicated data management, checkpointing, and reconfiguration as consensus problems. Therefore, the consensus concept is fundamental in any multicomputer environment. The inherent complexity of the system, combined with the random occurrence of faults, implies that probability cannot be avoided, and, even in so-called "hard" real-time systems, only probabilistic guarantees can be given.

The chapter is divided into the following parts. In Section 2, we discuss a consensus-based framework for responsive computer systems. Next, a new proposal for a measure of responsiveness is presented. Section 4 introduces the model and some alternatives for responsive computing and outlines design options for implementing them in a distributed computing environment. Section 5 gives the specification of the universal consensus, while Section 6 presents the conclusions and summarizes our model.
1.1.2
The Consensus-Based Framework
The consensus-based framework [10] for responsive systems aims to define the core concepts and functions that lead to comprehensive design methods for responsive computer systems. We propose a Consensus for REsponsiveness (CORE) software layer that may be incorporated on top of, or as a kernel (or microkernel) in, systems such as Unix, Mach, and Chorus in order to improve their responsiveness.

Any successful design requires quantitative and/or qualitative goals that can be verified through measurement. The most successful designs are based on particular models that are accurate abstractions of reality. Of course, the ultimate model is a copy of the given system itself; however, with the high complexity of today's systems, such a model is frequently unattainable. Therefore, models for these systems tend to focus on a specific aspect of system behavior or a specific layer of system design.

We propose a definition of responsiveness which combines availability and timeliness (a specific proposal for this measure is outlined in the next section) and introduce a design framework in which characteristics such as synchronicity, message order or lack of it, and bounded or unbounded communication delay are well defined for a specific environment. This framework [10] is based on the consensus problem [2] and is, in our opinion, fundamental in the design of responsive multicomputer systems.

In multicomputer systems, the consensus problem is omnipresent. It is necessary for handling synchronization and reliable communication, and it appears in resource allocation, task scheduling, fault diagnosis, checkpointing, and reconfiguration. Consensus tasks take many forms in multicomputer systems. Figure 1.1.1 shows the necessary functions for fault and time management in a multicomputer environment, in which each function represents a separate consensus problem, with the exception of the application-specific methods at the top layer (although some of these techniques, such as coding, might also be viewed as consensus problems).

At the base of the model is the synchronization level. For a system to be responsive (i.e., fault tolerant and real time), there must be an agreement about time for task execution, fault detection, and checkpointing. The next layer represents the requirement for reliable communication.
Figure 1.1.1: A framework for responsive systems design. (The figure shows the layers, from top to bottom: Application-Specific Responsiveness; Responsiveness Coordinator (Scheduler and Resource Allocator); Fault Recovery (Reconfiguration/Repair); Checkpointing; Fault Diagnosis or Fault Masking; Reliable Broadcast; Synchronization. The layers below the application-specific methods constitute the kernel or responsiveness layer.)
Responsive computers must agree on how and when information is exchanged and how many messages can be considered delivered or lost. Also, timeliness of communication is crucial in achieving a high level of responsiveness. The third layer, checkpointing, ensures that a consistent system state can be saved after a set of operations. The fourth layer, diagnosis, is fundamental to both real time and fault tolerance, for agreements must be reached on task scheduling and on who is faulty and who is not. Finally, the fifth layer illustrates the need for agreement on resource allocation and reconfiguration for efficient task execution and recovery from potential faults. In addition, we add a scheduler that manages space (resources) and time, and an availability manager. The scheduler and availability manager form the responsiveness coordinator, which is responsible for maximizing responsiveness by ensuring that tasks are executed on time (scheduling and coordinating reconfiguration), even in the presence of faults. Application-specific responsive design methods are placed on top, as shown in Figure 1.1.1.

An example implementation of this framework is illustrated in Figure 1.1.2, in which the functions of CORE are incorporated in the kernel and support applications, armed with application-specific techniques for responsive system design. With the variety and complexity of the numerous applications in multicomputer systems today, we insist on this approach, as it is our belief that general techniques have some limitations and, when used alone, cannot assure a high level of responsiveness. We believe that, although the small generic kernel may be proved to be correct, the correctness of real-world applications, in most cases, cannot be proven. Hence, application-specific techniques are necessary.

In responsive system design, all of the consensus problems, as well as computations, should be accomplished in a timely and reliable manner. To design a responsive system, we need responsive synchronization, responsive communication, responsive task scheduling, responsive fault diagnosis, responsive resource allocation, and responsive reconfiguration. This means each layer should incorporate timeliness and dependability measures, as well as the relevant fault model. This requirement opens a number of research issues, and only a few of them have been studied from the perspective of responsiveness, which is discussed in the next section.
Figure 1.1.2: An example implementation of a responsive systems design framework.
1.1.3
The Responsiveness Measure
In principle, there are two approaches to dealing with responsiveness: one in which the optimization process takes a pair of parameters, such as timeliness and availability, and attempts to optimize them as a pair using, for example, an integer programming formulation of the problem; and another in which timeliness and availability are integrated into a single responsiveness measure. Optimizing a timeliness-availability (reliability) pair hypothetically gives a designer an opportunity for a separation of concerns but, in reality, there is usually a relation, often complicated, between timeliness and availability, and the designer might want to resort to an integrated responsiveness measure.

The responsiveness measure combines the traditional measures of fault tolerance and real time; namely, availability and the ability to complete a task set by some deadline, with the objective of co-optimizing these dependent system qualities. This is done by examining the system from a task and service perspective. Consider that a system consists of a set of tasks, each having a particular execution time and deadline that may or may not be met. Whether or not a task's deadline is met depends on the ability of the system to perform the task set, given that the system hardware and software have some probability of failing. By taking this approach, we acknowledge that responsiveness is the ability to correctly execute the system's required functions (i.e., its task set or a service) in a timely manner, even in the presence of faults.

Responsiveness, then, is dependent on the user's needs with respect to timeliness, a realistic fault model, and his perception of correctness--which should ultimately be the system's goals--and is the probability that these needs will be fulfilled. Unless these needs are very simple, responsiveness will be less than one. The user must balance how often the system can fail against how much the system will cost. This makes building a responsive system an iterative process of the user specifying what task set must be performed with a particular responsiveness, and the system designer creating a system at a certain cost that will meet the user's requirements. Thus, the goal for the responsiveness measure is that it be sufficiently intuitive so the user may easily use it to specify his requirements, and sufficiently rigorous so the designer may analytically evaluate the system's responsiveness.
Figure 1.1.3: Combining system attributes with service attributes.

Notice that reliability and other specific measures of dependability have traditionally been perceived as system attributes, while timeliness has been viewed as a service attribute (see Figure 1.1.3). The challenge is to optimize the two measures in tandem or to combine them to maximize the probability of correct execution on time. In general, one may assume that system responsiveness is a function of timeliness and some dependability measure such as availability, both of which are time-dependent, and can be defined as r(t) = f[a(t), p(t)]. Deriving a precise responsiveness criterion for a multicomputer environment, as well as methods for evaluating and measuring responsiveness, is still an open issue.

If a single-processor system is required to run a single task, its responsiveness r(t) is the probability a(t) that the processor performs the task without error (i.e., the instantaneous availability, or reliability, of the processor) multiplied by the probability p(t) that the processor can complete the task before its deadline (i.e., the task is schedulable at a given time t). In short, r(t) = a(t)p(t). If this system is required to perform n tasks, the probability that a particular task is completed is related to its governing scheduling policy and the other tasks in the system. In addition, if multiple tasks are executed on a single- or multiple-processor system, the responsiveness measure could be defined as
    r(t) = (1/n) Σ_{i=1}^{n} a_i(t) p_i(t),
where a_i is the lifetime availability of task i and p_i is the probability that task i will be completed before its deadline [9]. Availability, a_i, is calculated using the standard evaluation techniques for the configuration of equipment performing task i and the duration of task i. Timeliness, p_i, is the probability that task i will complete before its deadline, given that it is part of some task set scheduled on the system. Note that responsiveness r(t) varies between zero and one, with higher values of r(t) indicating a greater likelihood that the task set will be completed within its deadlines.

For a gracefully degradable system, such as a consensus-based system, a_i and p_i may be calculated to any desired precision by computing a weighted average of these probabilities over the various configurations. For example, a four-processor, single-fault-tolerant, consensus-based system would have a_i and p_i calculated for the two cases when there are four and three fault-free processors (two, one, or zero fault-free processors amounts to a system failure and a_i = 0), then weighted and averaged depending on the expected time the system will spend in these various configurations.

We view the proper definition of responsiveness as the key to any optimization that might be done in responsive systems design. Only experience will show whether r(t) = f[a(t), p(t)] is a measure that can accurately portray a system's ability to satisfy the user's needs, yet can still be useful to the system designer. It should be noted that responsiveness is a function of the current time t, a task set T (with its timing characteristics), the available time to deadline (d = t_d - t), and an execution strategy (replication in time or replication in space).

One of the principal goals in responsive computing is to accurately define responsiveness and make it useful, easily modeled, and easily measurable. Complete and accurate evaluation of responsiveness may require sophisticated models that can cross the bounds of the design hierarchy without the usual state space explosion. We are currently studying system and service models that focus on responsiveness. Since it seems that no existing modeling paradigms are suitable, we extend Petri Nets [3], limit the state explosion by using partitioning and hierarchy, and incorporate both exponential and delay functions.
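To make the weighted-average calculation concrete, the following C sketch evaluates the responsiveness of a four-processor, single-fault-tolerant configuration as described above. All numeric values (configuration weights, availabilities, and deadline-meeting probabilities) are illustrative assumptions, not data from this chapter.

```c
#include <stdio.h>

/* Illustrative responsiveness calculation: a weighted average of
 * availability x timeliness over the system configurations.  Two,
 * one, or zero fault-free processors count as system failure, so
 * those configurations contribute a = p = 0. */
int main(void) {
    const double weight[] = { 0.90, 0.08, 0.02 };   /* expected fraction of time: 4 ok, 3 ok, <=2 ok */
    const double avail[]  = { 0.9999, 0.9990, 0.0 };/* availability in each configuration            */
    const double timely[] = { 0.99, 0.95, 0.0 };    /* probability the task set meets its deadlines  */

    double r = 0.0;
    for (int i = 0; i < 3; i++)
        r += weight[i] * avail[i] * timely[i];      /* weighted a_i(t) * p_i(t) */

    printf("responsiveness r(t) = %.4f\n", r);
    return 0;
}
```

With these assumed numbers the result is close to one, reflecting a system that spends most of its time fully operational; the same loop extends directly to more configurations or tasks.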
1.1.4
The Model
The model for responsive computing should facilitate the goals of timely and correct execution of programs in parallel and distributed environments. The proposed model, called CONCORDS (CONsensus/COmputation model for Responsive Distributed Systems), can be described as follows. In CONCORDS, computing is viewed as a collection of tasks (units of computation) whose execution time and deadline are specified. Tasks may arrive periodically, or sporadically at random, at one or a number of processing elements (PEs) with an ability to send messages among themselves. In order to perform responsive computing, both logical and physical synchronization are necessary. Reliable communication, fault diagnosis, and checkpointing are a must. All of these processes can be executed as consensus. Therefore, consensus plus, of course, computation and recovery are fundamental in responsive systems design.

A parallel or distributed computation is simply a group of these PEs working towards a common goal. We extend Valiant's Bulk-Synchronous Parallel (BSP) model [11] and model such a computation as a sequence of two alternating phases; namely, local execution and scheduling/consensus. The local execution might represent progress (a computation towards achieving expected results) or a recovery from a fault. A computation involves each PE performing its portion of a number of tasks (including recovery procedures) until it must send or receive information from the other PEs involved. Regardless of the nature of this information, an agreement is needed, i.e., consensus is executed. The UNIversal CONsensus protocol, UNICON [1], is a recognition of this, as any nonlocal requirements of a PE are met by a particular instance of it. Therefore, we may model a distributed computation as an alternating sequence of local executions and calls to the UNICON protocol. These phases (consensus/computation) alternate indefinitely or until the application is completed.

As an illustration, consider Figure 1.1.4, which shows the execution profiles of four processors A, B, C, and D. The column below a processor label represents when consensus is being performed (marked in white) and when a local computation is being performed (marked in gray).
14
time
B
D
consensus type clock synchronization and diagnosis (computation) checkpointing
(computation) computation termination
Figure 1.1.4: A computation/consensus view of a distributed program.
A horizontal bar is used to collect related executions of consensus. For example, the first consensus, resulting in clock synchronization and diagnosis, involves all four processors, as marked by the horizontal bar connecting all four columns.

As Valiant argues that the BSP model eases the programmer's task in writing parallel computations, we feel this computation/consensus paradigm will ease the development of responsive operating systems and applications. A consensus-based approach towards the development of responsive operating systems was presented in [10]. The primary rationale is that we can never rely on a single resource; thus, it is necessary to progress via consensus. Therefore, the kernel of a responsive operating system consists of a number of consensus tasks, as illustrated in Figure 1.1.1.

We are currently working on an implementation of universal consensus [1] within the CORE framework, expecting to obtain some definite advantages. First, we can encompass many differing fault models in the system model, and second, UNICON, once a fault model is defined, can be based on a single algorithm. The result is a simplicity and functionality that allows for more reliable coding, low installation overhead, and ease of software maintenance. Moreover, we anticipate that our model will ease the estimation of timeliness of the operating system tasks and applications.

An open question is that of efficiency. While the UNICON protocols, by their very nature of combining numerous system functions in a single message or pass throughout the network, are bound to be efficient, the restrictive model of consensus and computation phases might not be, especially in a large system. To improve system efficiency, the following variations of the CONCORDS model may alleviate the problem:

1. Application-specific CONCORDS. In this model, processors involved in a specific application (computation) perform consensus/computation phases alternately, regardless of other processors in the system. The consensus, scheduling, and reconfiguration tasks might be more efficient, as they involve only part of the system.

2. Exception-driven CONCORDS. In this variation, consensus occurs only on demand, i.e., if one of the processes/processors raises an exception. Note that this model is mainly applicable to a fail-stop fault model where a faulty processor stops and notifies others about its
failure.

While the suggested alternative models may improve efficiency, the potential gains may disappear due to the potentially high overhead of concurrency control for some applications. Another interesting problem, from a scheduling perspective, is to analyze and carry out a comparative analysis of dynamic versus static scheduling and then compare them with respect to a single global versus multiple task queue(s). Also, the placement of software for support of responsive computing remains an open problem. Three possibilities exist: a kernel with an operating system on top of it, a microkernel cooperating with an operating system, and a responsiveness layer on top of an operating system. It remains an open question as to which approach is most cost effective but, obviously, all three of them trade off cost and effectiveness of maximizing responsiveness.
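The alternating consensus/computation structure described in this section can be summarized in a few lines of control flow. The following C sketch is only an illustration of the CONCORDS execution loop; the function names are hypothetical placeholders, not part of the CORE or UNICON interfaces.

```c
#include <stdbool.h>

/* Hypothetical stubs standing in for application work and for an
 * instance of the universal consensus protocol. */
static bool application_complete(void) { return true; }
static void local_execution(void)      { /* progress or recovery */ }
static void unicon_phase(void)         { /* synchronization, diagnosis,
                                             checkpointing, scheduling */ }

/* Each processing element alternates local execution with a
 * consensus phase until the application terminates. */
void concords_loop(void) {
    while (!application_complete()) {
        local_execution();   /* computation towards results, or recovery */
        unicon_phase();      /* agreement on time, faults, and resources */
    }
}
```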
1.1.5
Specification of Universal Consensus
We assume that all of the processors in the system are (logically) completely connected by an underlying communication medium. Moreover, there exists a distribution of message delays such that a processor may estimate the time that a message was sent from the time it was received and from which processor it was sent [4]. We assume that there exist mechanisms for scheduling and preempting tasks on a processor. We also assume that the number of faults tolerated is bounded according to the fault model employed. The processor clocks are assumed to be synchronized by UNICON's synchronization task.

For consensus to take place, at least five questions must be answered. First, who is to be involved in the consensus? The answer to this would be a list of processors. Second, what are the characteristics of those involved? That is, the fault models of the processors are needed. Third, what is to be agreed upon? The consensus task must know what information is to be collected. Fourth, what is the desired timeliness of the task? Fifth, what structure should the consensus take? In other words, do all the processors involved need all of the consensus
results, or will a subset of them be sufficient?

From the defining requirements for consensus, we envision the declaration of UNICON as

    unicon(con_id *id, PE_set members, con_top topology, con_type type, con_time priority)

where id is set to the identification number of this instance of consensus and members is the set of processors involved in the consensus. We assume that the set members refers to a specific set of physical processors, but it could refer to a logical set maintained by the operating system. The structure, topology, possesses two fields; namely, structure, which takes values from the set {single-level, partitioned, unspecified}, and restriction, which takes values from the set {global, application-specific}. The variable type is enumerated from a set including {synchronization, configuration, diagnosis, communication scheduling, checkpointing, termination}. This set can obviously be extended. We may also consider the case in which type is a union of the various members of this set, such as when multiple purposes may be served by a single instance of UNICON. For this chapter we assume that each consensus has a distinct goal but, in practice, various information items could be combined in the consensus messages in order to perform multiple functions at once. The parameter priority determines the timeliness requirements of the responsive consensus.

From this simple system call, it is possible to invoke any number of the consensus protocols. By examining the fault models of the processors in members, the system can choose a suitable consensus algorithm; by examining the number of processors in members, the system can decide whether a partitioning method is useful or, alternatively, the structure may be specified with topology. The consensus algorithm is abstracted across synchronization, configuration, diagnosis, communication scheduling, checkpointing, termination, and consensus information requirements with the type variable. The priority parameter defines the timeliness requirements of a particular instance of UNICON. It is a structure with members time, which is an absolute, real-time value as set by the synchronization
process; periodic, which is a boolean stating whether or not the consensus should be scheduled periodically; duration, which, in the case that the consensus should be scheduled periodically, is the length of that period; and sched, which is of the set {urgent, deadline, asap, lazy}. Note that specifying that the consensus should be completed in 10 seconds implies that time should be set to the current time plus 10 seconds. Also, if a consensus is periodic, then after each deadline the time variable should be increased by duration and the task rescheduled. The meanings of these elements are as follows:

urgent: The consensus is to be completed by time even if this implies that all other scheduled tasks must be preempted. If time has passed, then the consensus task should be the sole purpose of the processor until its completion.

deadline: The consensus should be scheduled during the spare capacity of the system to complete by time. If time passes before completion, then the incomplete results should be provided.

asap: The consensus should be begun and completed as soon after time as the processor's schedule allows. By analogy to an interrupt hierarchy, it could be called a polite consensus.

lazy: The consensus should be scheduled during the processor's spare capacity, but if it is not completed by time then it should be abandoned with the incomplete results reported.

As we assume that all processor clocks are synchronized to within some bound called synch_error, time has meaning, although somewhat inconsistent, throughout the consensus. When we initiate UNICON, therefore, it is necessary to let other processes know that a result is needed by time - synch_error in order that the result will be available at the initiating process by time.
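To show how these parameters fit together, the following C sketch reconstructs plausible type declarations from the description above and issues one call to unicon. The type layouts, enumerator names, and numeric values are assumptions made for illustration; they are not the actual CORE or UNICON headers.

```c
typedef int           con_id;   /* placeholder for the consensus identifier   */
typedef unsigned long PE_set;   /* placeholder bit set of processing elements */

typedef enum { SINGLE_LEVEL, PARTITIONED, UNSPECIFIED } con_structure;
typedef enum { GLOBAL, APPLICATION_SPECIFIC }           con_restriction;

typedef struct {                /* topology: structure and restriction fields */
    con_structure   structure;
    con_restriction restriction;
} con_top;

typedef enum {
    SYNCHRONIZATION, CONFIGURATION, DIAGNOSIS,
    COMMUNICATION_SCHEDULING, CHECKPOINTING, TERMINATION
} con_type;

typedef enum { URGENT, DEADLINE, ASAP, LAZY } con_sched;

typedef struct {                /* priority: timeliness requirements          */
    double    time;             /* absolute real-time deadline                 */
    int       periodic;         /* boolean: reschedule after each deadline     */
    double    duration;         /* period length when periodic                 */
    con_sched sched;            /* urgent, deadline, asap, or lazy             */
} con_time;

int unicon(con_id *id, PE_set members, con_top topology,
           con_type type, con_time priority);

/* Example: a periodic checkpointing consensus over all processors,
 * scheduled in spare capacity to complete every 10 seconds.  The
 * deadline is shifted by synch_error so the result is available at
 * the initiator by the requested time. */
void request_checkpoint(PE_set all_pes, double now, double synch_error) {
    con_id   id;
    con_top  top = { SINGLE_LEVEL, GLOBAL };
    con_time pri = { now + 10.0 - synch_error, 1, 10.0, DEADLINE };
    unicon(&id, all_pes, top, CHECKPOINTING, pri);
}
```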
A prototype for CORE runs as a set of library functions on top of AIX (the IBM version of the UNIX operating system) on a network of RS/6000 workstations. It provides the application developer with support for incorporating fault tolerance and real-time requirements. The earliest-deadline-first scheduler had to be adapted to coexist with the AIX system. UNICON's protocol is implemented as a set of functions that are called by the system and the application to perform communication and consensus. We anticipate that the prototype will enable us to perform experiments that will give us insights into an ultimate implementation of CORE as a UNIX microkernel.
1.1.6
Conclusion
There is an urgent need to provide the user community with responsive processing in parallel and distributed computing environments. The thesis of this chapter is that the proposed framework and model will facilitate the development of systems which would improve the feasibility of achieving two crucial qualities: reliability and timeliness. The concepts of consensus and computation progress are fundamental. The universal consensus algorithm for synchronization, reliable communication, diagnosis, scheduling, checkpointing, and reconfiguration, combined with a consensus/computation model, illustrates a promising approach to responsive computing. In addition, a responsive systems layer such as CORE should be integrated with application-specific methods such as those presented in [7, 8], because high responsiveness can only be achieved by combining an ultrareliable layer (ideally a kernel) with application-specific techniques. As computer systems proliferate and our dependence on them increases, responsiveness will become the most sought-after quality in computer and communication systems.
Acknowledgment I would like to acknowledge my student Mike Barborak for contributing to the specification of universal consensus.
References

[1] M. Barborak and M. Malek. UNICON -- a UNIversal CONsensus for responsive computer systems. Technical Report TR-92-36, Department of Computer Sciences, The University of Texas at Austin, October 1992.

[2] M. Barborak, M. Malek, and A.T. Dahbura. Consensus problem in fault-tolerant computing. ACM Computing Surveys, 25(2):171-220, June 1993.

[3] G. Brat and M. Malek. Incorporating delays and exponential distributions in Petri Nets for responsive systems. Technical report, The University of Texas at Austin, March 1993.

[4] F. Cristian. Probabilistic clock synchronization. Distributed Computing, 3:146-158, 1989.

[5] R. Kieckhafer et al. The MAFT architecture for distributed fault tolerance. IEEE Transactions on Computers, 37(4):398-405, April 1988.

[6] H. Kopetz et al. Distributed fault-tolerant real-time systems: The MARS approach. IEEE Micro, 9(1):25-40, February 1989.

[7] L. Laranjeira, M. Malek, and R. Jenevein. On tolerating faults in naturally redundant algorithms. In the 10th Symposium on Reliable Distributed Systems, pages 118-127, September 1991. Pisa, Italy.

[8] L. Laranjeira, M. Malek, and R. Jenevein. Nest: A nested-predicate scheme for fault tolerance. IEEE Transactions on Computers, 42(11):1303-1324, November 1993.

[9] M. Malek. Responsive systems: A challenge for the nineties. In Euromicro 90, Sixteenth Symposium on Microprocessing and Microprogramming, pages 9-16, August 1990. Amsterdam, The Netherlands, Keynote Address.
[10] M. Malek. A consensus-based framework for responsive computer system design. In The NATO Advanced Study Institute on Real-Time Systems, October 5-18, 1992. Springer-Verlag, St. Martin, West Indies.

[11] L. Valiant. A bridging model for parallel computation. Communications of the ACM, 33(8):103-111, August 1990.
SECTION 1.2
A Methodology for Adapting to Patterns of Faults*

Gul Agha and Daniel C. Sturman†
Abstract

In this paper, we present a language framework for describing dependable systems. Our framework emphasizes modularity and composition. The dependability and functionality aspects of an application may be described independently, providing separation of design concerns. Furthermore, the dependability protocols of an application may be constructed bottom-up as simple protocols that are composed into more complex protocols. Composition makes it easier to reason about the behavior of complex protocols and supports the construction of generic reusable dependability schemes. A significant aspect of our language framework is that dependability protocols may be loaded into a running application and installed dynamically. Dynamic installation makes it possible to impose additional dependability protocols on a server as clients with new dependability demands are integrated into a system. Similarly, if a given dependability protocol is only necessary during some particular phase of execution, it may be installed during that period only.
1.2.1
Introduction
*The research described has been made possible by support from the Office of Naval Research (ONR contract numbers N00014-90-J-1899 and N00014-93-1-0273), by an Incentives for Excellence Award from the Digital Equipment Corporation Faculty Program, and by joint support from the Department of Defense Advanced Research Projects Agency and the National Science Foundation (NSF CCR 90-07195).

†Authors' address: Department of Computer Science, University of Illinois at Urbana-Champaign, 1304 W. Springfield Avenue, Urbana, Illinois 61801, USA. Email: {agha | sturman}@cs.uiuc.edu

A number of systems have been developed to support the development of dependable computing applications. Such support is given in terms of failure
semantics, which specify legal ways in which a component can fail [11]. Failure semantics are enforced through the use of dependability protocols which guarantee that the probability of a failure of a type not specified in the semantics is acceptably small. However, existing systems assume that the failure semantics of a service are static and, therefore, that the dependability protocols used may be fixed. In many computer systems, it is either unsatisfactory to adhere to a static failure semantics or impossible to adequately enforce the semantics with a fixed group of dependability protocols. We illustrate this concept with two example systems:

• Consider an embedded system which is required to function over a long duration, yet is fault-prone due to the uncertain environment in which it operates. If this system is physically isolated, such as in the control system of a satellite, physical modification of system components is often infeasible. In such a system, a change in the physical environment may result in protocols designed for the old environment failing to uphold the failure semantics in the new environment. A different group of dependability protocols may then be required to enforce the desired failure semantics of the system.

• Consider an open system. Open systems allow interactions with the external environment; in particular, new services may be added or deleted dynamically from the system in response to external events. Consequently, it may not be possible to statically determine the system configuration. Without knowing the system configuration, it may not be possible to determine what failure semantics a process must have, or what protocols are necessary to enforce these semantics, until after the process actually joins the system. Furthermore, the addition of new services may require a change in the failure semantics of existing components. For example, a file server may initially address safety only by check-pointing the files to stable storage. New clients that are added to the system, however, may require the server to also provide persistence, and a protocol to support replication may need to be added.

In this paper, we describe a methodology for the modular specification of systems that adapt to patterns of faults. We call the resulting systems adaptively dependable. We present a methodology which allows the transparent installation and reuse of dependability protocols as well as their dynamic installation in a system. Our methodology, when combined with a suitably structured exception handling mechanism and fault detection, allows for the development of
fault handlers which can maintain consistent failure semantics within a changing environment and can alter failure semantics as system needs change. We have provided programmer support for our methodology in the language Screed which is implemented on our run-time system Broadway.

We employ reflection as the enabling technology for dynamic installation of dependability protocols. Reflection means that an application can reason about and manipulate a representation of its own behavior. This representation is called the application's meta-level. The components of an object that may be customized at the meta-level are referred to as the meta-architecture. In our case, the meta-level contains a description which implements the failure semantics of an executing application; reflection thus allows dynamic changes in the execution of an application with respect to dependability.

Besides supporting dynamic installation, our meta-architecture supports transparency and reuse of dependability protocols. For example, the meta-architecture allows protocols to be expressed as abstract operations on messages. Since the individual fields of a particular message are never examined, the same protocol may be used with different applications.

Given this technique for dynamic modification of the dependability protocols used in a system, we describe how fault-detection and exception handling may be used in conjunction with our meta-architecture to support adaptive dependability. We model both failures and exceptions as objects. Each type of fault which may be detected is described as a specific system exception. We construct managers — exception handlers with meta-level capabilities — to address system exceptions. Managers serve three purposes:

• A manager may correct for recoverable faults. The corrections allow the system to continue to function despite a fault. This role is generally referred to as performing forward error recovery.

• Managers provide failure prevention. When a manager discovers a pattern of component failures, it dynamically installs protocols which mask future failures or facilitate future fault-correction by expanding the set of recoverable faults. In this manner, we have taken forward error recovery one step further; rather than simply adjusting the system state, the actual dependability characteristics of the system may be modified.

• Managers support reconfiguration of the dependability protocols in a system. This may be done either to alter the system's failure semantics or to correctly enforce these semantics once the environment changes. Thus,
we can develop dependable long duration systems whose fault patterns are not known at start-up time.

A prototype implementation of a run-time system which tests these ideas is described: the system Broadway supports our meta-architecture as well as failure detection and a set of system exceptions. On top of Broadway, we have implemented the language Screed. Screed is a prototype concurrent actor language which provides complementary constructs for both fault-detection through exception handling and dynamic installation of protocols through a meta-architecture. Screed is presented as a demonstration of how such constructs may be added to existing languages.

This paper is organized as follows. Section 1.2.2 discusses related research in the areas of reflection, exception handling, and languages for fault-tolerance. Section 1.2.3 provides a brief description of the concepts of reflection, the Actor model, and object-orientation. Section 1.2.4 provides a guide to the syntax of Screed to assist in understanding our examples. Section 1.2.5 discusses our meta-level architecture and how it may be used to construct dependability protocols. We also discuss the effect of our meta-level architecture on protocol performance. Section 1.2.6 describes exception handling in Screed and how exception handling may be used in conjunction with our meta-level architecture to implement adaptively dependable systems. We then illustrate this technique with an example of a system adapting to a change in environment. Section 1.2.7 describes how these constructs together support adaptive dependability.
1.2.2
Related Work
A number of languages and systems offer support for constructing fault-tolerant systems. In Argus [23], Avalon [15] and Arjuna [31], the concept of nested transactions is used to structure distributed systems. Consistency and resilience are ensured by atomic actions whose effects are check-pointed at commit time. The focus in [27], [9] and [7] is to provide a set of protocols that represent common communication patterns found in fault-tolerant systems. None of the above systems support the factorization of fault tolerance characteristics from the application-specific code. In [38] and [28], replication can be described separately from the service being replicated. Our approach is more flexible: fault tolerance schemes may not only be described separately, they may be attached and detached dynamically. Another unique aspect of our approach is that different fault tolerance schemes may be composed in a modular fashion. For example, check-pointing may be composed with replication without requiring that the representation of either protocol know about the other.
Non-reflective systems which support customization do so only on a system-wide basis. For example, customization in a micro-kernel based system [1] affects all the objects collectively. In an object-oriented system such as Choices [8], frameworks may be customized for a particular application. However, once customized, the characteristics may not change dynamically. Reflection in an object-based system allows customization of the underlying system independently for each object. Because different protocols are generally required for very specific subsets of the objects in a system, this flexibility is required for implementing dependability protocols.

Reflection has been used to address a number of issues in concurrent systems. For example, the scheduling problem of the Time Warp algorithm for parallel discrete event simulation is modeled by means of reflection in [40]. A reflective implementation of object migration is reported in [37]. Reflection has been used in the Muse Operating System [39] for dynamically modifying the system behavior. Reflective frameworks for the Actor languages MERING IV and Rosette have been proposed in [16] and [35], respectively. In MERING IV, programs may access meta-instances to modify an object or meta-classes to change a class definition. In Rosette, the meta-level is described in terms of three components: a container, which represents the acquaintances and script; a processor, which acts as the scheduler for the actor; and a mailbox, which handles message reception.

The concept of unifying exception handling and fault detection was originally proposed in [30] and then refined in [29]. In these papers, detected failures are considered asynchronous events much as exceptional conditions are treated in distributed programming languages. Therefore, exception handling constructs provide a natural way to incorporate failure-response code into an application.

Goodenough introduced the idea of exceptions and exception handling in [19]. Since then, many different exception handling mechanisms have been proposed. Exception handling constructs have been developed for object-based languages such as CLU [24] and Ada [12]. Dony [13] describes an approach for object-oriented languages and its implementation in Smalltalk. In this approach, exceptions are implemented as objects much as we do. Exception handling for C++ is discussed in [33]. A good overview of techniques proposed for other object-oriented languages can be found in [14]. A critical difference between object-oriented approaches to exception handling and non-object-oriented approaches such as CLU [24] or Ada [12] is that, in the latter, the exception object is represented by a set of parameters to a function. Therefore, on generating the signal, a parameter list must provide all possible information used by the handler.
For concurrent systems, another technique has been proposed for languages which use RPC communication [10]; the technique is based on synchronized components which allows the exception handling constructs to be closer to that of a sequential system than an asynchronous system. Exception handling mechanisms have been proposed for other Actor languages. An exception handling mechanism was proposed for ABCL/1 and for Acore [21, 26]; the mechanism uses complaint addresses to support exception handling. A complaint address is a specific location, specified with each message, to which all signals are dispatched.
1.2.3
Background
Before discussing our meta-architecture and how we use it to support adaptive dependability, we first discuss in greater detail some concepts that are important in our framework. The organization of this section is as follows. First, we briefly discuss some of the advantages of object-oriented programming and how they are useful with our methodology. Secondly, we describe the Actor model of concurrent computation. We chose the Actor model as the basis of our work due to the ease with which it may be extended. Finally, we give a more in-depth discussion of reflection and how it relates to a programming language.
Object Orientation

In an object-oriented language, a program is organized as a collection of objects. Each object is an encapsulated entity, representing an instance of an abstract data type. The local data comprising each object may only be accessed through an interface specified as a set of methods. The operations carried out by a method are not visible outside the object. Objects communicate with messages which invoke a method in the receiving object. The local data of another object cannot otherwise be accessed or modified.

Objects are instantiated from classes. A class is a user-defined abstraction. Classes may be thought of as types and objects as elements of that type. Instantiation is the creation of an object of a particular class. Classes contain the description (code) of the methods and of the instance variables for objects instantiated from that class. Classes may inherit from other classes. Inheritance provides the inheriting class with the properties (the methods and instance variables) of the ancestor class. The inheriting class can then utilize these properties as well as augment them with new instance variables or methods. Methods may be inherited directly or redefined, facilitating code reuse.
Object-oriented languages allow for modular development of systems. The implementation of each component is hidden from other components: only the interface is known. In this way, a component's implementation may change without affecting other components. Code may also be reused efficiently since components may share code by inheriting from a common ancestor class. Note that our use of classes and inheritance differs from that in sequential object-oriented languages in that we do not support class variables.
The Actor Model

We illustrate our approach using the Actor model [2, 3]. Actors can be thought of as an abstract representation for multicomputer architectures. An actor is an encapsulated object that communicates with other actors through asynchronous point-to-point message passing. Specifically, an actor language supports three primitive operators:

send Actors communicate through asynchronous, point-to-point message passing. The send operator is used to communicate a message asynchronously to another actor. Each message invokes a method (or procedure) at the destination. Upon reception of the message at the destination, the message will be buffered in a mail queue. Each actor has a unique mail address which is used to specify a target for communication. Mail addresses may also be communicated in a message, allowing for a dynamic communication topology.

new Actors may dynamically create other actors. The new operator takes an actor behavior (class name) as a parameter, creates a new actor with the correct behavior and returns its mail address. The mail address is initially known only by the creating actor. However, the creator may subsequently include this new mail address in future messages.

become The become operator marks the end of state modifications in the execution of a method. Once a become has executed in a method, the actor may continue to modify state local to the method. However, such state changes do not affect the way in which the actor may process the next message. Therefore, once this operator is executed, the actor may begin processing its next pending message. Judicious use of the become operator may improve performance by allowing internal concurrency; i.e., multiple threads of execution within a single actor.

It is important to note that the idea of using reflection to describe dependability is not tied to any specific programming language. Our methodology
assumes only that these three operators are in some way incorporated into the language; we require that new actors may be created dynamically and that the communication topology of a system is reconfigurable. In fact, the actor operators may be used to extend almost any standard sequential language to provide coordination and communication in a distributed environment; local computations may still be expressed in terms of the sequential language. The level at which the sequential and actor constructs are integrated determines the amount of concurrency available in the system.

An actor language may be used to "wrap" existing sequential programs, serving as an interconnection language. With this approach, each method in an actor class invokes a subroutine, or set of routines, written in a sequential language and dispatches messages based on the values returned. Such an approach was taken by the Carnot project at MCC [34]. In Carnot, the actor language Rosette "glues" sequential components together to facilitate heterogeneous distributed computing.

A complementary approach is to actually integrate the actor operators into an existing language. Broadway, the run-time platform we use to implement the ideas in this paper, supports C++ calls for both send and new; the become operator is implicit at the end of each method. Using Broadway, developers of distributed programs may use a well-known language, C++, to develop distributed programs. Actor operators have also been combined with functional languages. Specifically, actor operators have been added to the call-by-value λ-calculus [5]. In this case, the local computation is modeled as a sequential functional computation. An operational semantics is developed for the resulting language. The semantics supports operational reasoning. In [36], the semantics is extended to support formal reasoning about meta-architectures such as the one we describe here.

If necessary, the actor operators may also be extended to support more complex functionality. In particular, the communication model may be modified to support more complex message passing constructs. The asynchronous point-to-point communication model for actors has been extended to include pattern-based multicasts using ActorSpaces [6]. Furthermore, remote procedure calls may be transformed into a set of asynchronous messages using a concurrent analog of the continuation passing style [22].

Synchronization constraints [17] and multi-object constraints [18] are two other extensions of the actor operators which greatly simplify distributed programming. Constraints allow the programmer to specify "when" asynchronous
events may occur based on the state of a single object or the occurrence of other events in the system. Using these techniques, the non-determinism of asynchronous communication may be constrained to maintain a consistent system state without requiring an overly restrictive communication model.
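As a minimal illustration, the following sketch shows the new and send operators using the Screed-style syntax introduced in Section 1.2.4; the Coordinator and Worker classes and the process method are hypothetical, and in Screed the become operator is implicit at the end of each method.

class Coordinator {
  actor worker;

  start() {
    worker = new Worker;    /* new: create an actor and obtain its mail address */
    worker.process(42);     /* send: asynchronous, point-to-point message */
  }                         /* become is implicit here: the actor may accept its next message */
}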
Reflection
Figure 1.2.1: Through reflection, an application may modify the system by modifying its meta-objects. Meta-objects are a system-level description of the base-level application objects.
Reflection means that a system can manipulate a causally connected description of itself [32, 25]. Causal connection implies that changes to the description have an immediate effect on the described object. In a reflective system, a change in these descriptions or meta-objects results in a change in how objects are implemented. The object for which a meta-object represents certain aspects of the implementation is called the base object. This relationship is shown in Figure 1.2.1.

Meta-objects may be thought of as objects which logically belong in the underlying run-time system. For example, a meta-object might control the message lookup scheme that maps incoming messages to operations in the base object. Another meta-object may modify how values are read from memory. Using reflection, such implementation-level objects can be accessed and examined,
and user-defined meta-objects may be installed, yielding a potentially customizable run-time system within a single language framework.

The reflective capabilities which are provided by a language are referred to as the meta-level architecture of the language. The meta-level architecture may provide variable levels of sophistication, depending on the desired level of customization. The most general meta-level architecture is comprised of complete interpreters, thus allowing customization of all aspects of the implementation of objects. In practice, this generality is not always needed and, furthermore, defining a more restrictive meta-level architecture may allow reflection to be realized in a compiled language. The choice of a meta-level architecture is part of the language design. Customizability of a language implementation must be anticipated when designing the run-time structure. Although a restrictive meta-level architecture limits flexibility, it provides greater safety and structure. If all aspects of the implementation were mutable, an entirely new semantics for the language could be defined at run-time; in this case, reasoning about the behavior of a program would be difficult.

We limit our meta-level to contain only the aspects that are relevant to dependability. Application-specific functionality is described in the form of base objects and dependability protocols are described in terms of meta-objects. Thus, dependability is modeled as a special way of implementing the application in question. Our methodology gives modularity since functionality and dependability are described in separate objects. Since meta-objects can be defined and installed dynamically, the objects in a system can dynamically change the protocols enforcing their failure semantics as system needs change. Furthermore, new dependability protocols may be defined while a system is running and put into effect without stopping and recompiling the system. For example, if a communication line within a system shows potential for unacceptable error rates, more dependable communication protocols may be installed without stopping and recompiling the entire system.

Since meta-objects are themselves objects, they can also have meta-objects associated with them, giving customizable implementation of meta-objects. In this way, meta-objects realizing a given dependability protocol may again be subject to another dependability protocol. This scenario implies a hierarchy of meta-objects where each meta-object contributes a part of the dependability characteristics for the application in question. Each meta-object may be defined separately and composed with other meta-objects in a layered structure supporting reuse and incremental construction of dependability protocols.

Because installation of a malfunctioning meta-level may compromise the dependability of a system, precautions must be taken to protect against erroneous
or malicious meta-objects. To provide the needed protection of the meta-level, we introduce the concept of privileged objects called managers. Only managers may install meta-objects. Using operating system terminology, a manager should be thought of as a privileged process which can dynamically load new modules (meta-objects) into the kernel (meta-level). It should be observed that, because of the close resemblance to the operating system world, many of the operating system protection strategies can be reused in our design. We will not discuss particular mechanisms for enforcing the protection provided by the managers in greater detail here. Because only managers may install meta-objects, special requirements can be enforced by the managers on the structure of objects which may be installed as meta-objects. For example, managers may only allow installation of meta-objects instantiated from special verified and trusted libraries. Greater or fewer restrictions may be imposed on the meta-level, depending on the dependability and security requirements that a given application must meet.
1.2.4
Screed
Screed is an object-oriented actor language that compiles applications for Broadway. Screed will be used to illustrate examples in this paper. Screed is an object-oriented language: programs are written in terms of class definitions. Each class defines a single actor behavior and consists of a set of variable declarations and a set of method definitions. Screed supports inheritance. A class for which a parent class is not specified will, by default, inherit from the system-defined Object class. At any point, a parent method may be referenced by sending a message to the "object" parent. Inheritance is specified when the class is defined:

class MyMailQueue : MailQueue {
  ... instance variables ...

  get() { ... method body ... }
  put() { ... method body ... }
}

In this example, the class MyMailQueue with the methods get and put is defined. It inherits from the class MailQueue.
Classes may be instantiated using the new command which returns a new actor address. For example:

foo = new MyMailQueue;

This statement creates a new actor with the behavior MyMailQueue and returns the address of this actor, which is assigned to foo.

There are five primitive types in Screed. The types int, real, and string are self-explanatory. The type actor holds the address of any actor, regardless of class. The type method can have the value of any legal method name. In addition, one-dimensional arrays of any of these types may be specified. Arrays are defined and used as in C++:

actor myReplicas[5];
myReplicas[2] = new ...;

Actors communicate through asynchronous message passing. In the current implementation of Broadway message ordering (from a given source to the same destination) is guaranteed, although actor semantics enforce no such requirement. Messages are sent by specifying a method and an actor address with the appropriate parameters:

foo.get();

In this case, the method get is invoked on the actor foo without any parameters. Since methods are first-class values, it would be possible to specify a variable instead of the name of a particular method. Parameters are dynamically type-checked upon reception at the message destination. Note that since we are using asynchronous message passing, this method invocation does not block.

Although asynchronous message passing provides improved performance and concurrency, a drawback is the difficulty in providing return values: since the method does not block upon sending a message, it is necessary to specify a return address and method in the message itself. Therefore, method invocations may return a value, thereby acting as a remote procedure call (rpc). For example:

A
x = foo.get();
B

With rpc-communication, the current method invocation will block. The instructions in A will execute, followed by a message send to foo. B will not execute until a return value arrives and the value is assigned to x.
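The following fragment pulls these constructs together; it is only an illustrative sketch, and the Calculator class, the complex class and its divide and scale methods are hypothetical (divide also appears in the example of Figure 1.2.10).

class Calculator {
  actor result;

  compute() {
    actor a;
    actor b;

    a = new complex(2,3);     /* create two new actors */
    b = new complex(4,5);
    result = a.divide(b);     /* rpc-style invocation: blocks until a value is returned */
    b.scale(2);               /* plain asynchronous send: does not block */
  }
}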
In an asynchronous system, the programmer may want to prevent certain methods from executing based on an actor's state. Therefore, we support synchronization constraints [17] in Screed. Using synchronization constraints, the programmer will be able to specify exactly which methods may not be invoked. Maximal concurrency is then preserved since only the minimal synchronization — as specified by the programmer, not the language — will be enforced. The other constructs which comprise expressions in Screed (if, while, etc.) are similar to those in C; we do not describe them further.
1.2.5
Meta-level Architecture for Ultra-dependability
In this section we introduce MAUD (Meta-level Architecture for Ultra Dependability) [4]. MAUD supports the development of reusable dependability protocols. These protocols may then be installed during the execution of an application. MAUD has been implemented on Broadway, our run-time environment for actors. We begin with a discussion of MAUD's structure. We then discuss how transparency and reusability of protocols are supported by MAUD and provide an example to illustrate the concepts. We finish this section by demonstrating how MAUD also allows the composition of protocols and give an example of composition.
A Meta-Level Architecture

As previously mentioned, MAUD is designed to support the structures that are necessary to implement dependability. In MAUD, there are three meta-objects for each actor: dispatcher, mail queue and acquaintances. In the next three paragraphs we describe the structure of meta-objects in MAUD. Note that MAUD is a particular system developed for use with actors. It may be possible, however, to develop similar systems for other models.

The dispatcher and mail queue meta-objects customize the communication primitives of objects so that their interaction can be modified for a variety of dependability characteristics. The dispatcher meta-object is a representation of the implementation of the message-send action. Whenever the base object issues a message send, the run-time system calls the transmit method on the installed dispatcher. The dispatcher performs whatever actions are needed to send the given message. Installing dispatchers to modify the send behavior makes it possible to implement customized message delivery patterns.
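As an illustration only, a dispatcher that tags every outgoing message (in the spirit of the Tagger used later in Example 1) might be sketched as follows; the Dispatcher parent class and the message operations add_arg and send are assumptions for this sketch, not part of the system as described.

class Tagger : Dispatcher {
  int next_tag;

  transmit(msg m) {            /* called by the run-time system for every send */
    m.add_arg(next_tag);       /* attach a sequence tag to the outgoing message */
    next_tag = next_tag + 1;
    m.send();                  /* pass the tagged message on */
  }
}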
A mail queue meta-object represents the mail queue holding the incoming messages sent to an actor. A mail queue is an object with get and put operations. After installation of a mail queue meta-object, its get operation is called by the run-time system whenever the base object is ready to process a message. The put operation on a mail queue is called by the run-time system whenever a message for the base object arrives. By installing a mail queue at the meta-level, it is possible to customize the way messages flow into the base object.

The acquaintances meta-object is a list representing the acquaintances of a base object. In an actor system, all entities are actors. Although they may be implemented as local state, even primitive data objects, such as integers or strings, are considered acquaintances in an actor system. Therefore, in an actor language the acquaintances and the mail queue comprise the complete state of an actor. The acquaintances meta-object allows for check-pointing of actors.

Meta-objects are examined and installed by means of meta-operations. Meta-operations are defined in the class called Object which is the root of the inheritance hierarchy. All classes in the system inherit from Object implying that meta-operations can be called on each actor in the system. The meta-operations change_mailQueue and change_dispatcher install mail queues and dispatchers for the object on which they are called. Similarly, the meta-operations get_mailQueue, get_dispatcher and get_acquaintances return the meta-objects of a given actor. If no meta-objects have been previously installed, an object representing the built-in, default, implementation is returned. Such default meta-objects are created in a lazy fashion when a meta-operation is actually called.
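For illustration, a manager holding the address of some actor a might install and inspect meta-objects roughly as follows; this is only a sketch, and the variable names are hypothetical.

actor q;
actor d;

q = new MyMailQueue;        /* the mail queue class defined in Section 1.2.4 */
a.change_mailQueue(q);      /* install a custom mail queue for actor a */
d = a.get_dispatcher();     /* inspect the current dispatcher; the built-in default is returned if none was installed */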
Transparency and Reuse

By describing our dependability protocols in terms of meta-level dispatchers and mail queues, we are able to construct protocols in terms of operations on messages where we treat each message as an integral entity. There are several advantages to developing dependability protocols in this manner. The first advantage is the natural way in which protocols may now be expressed. When dependability protocols are described in the literature, they are described in terms of abstract operations on messages, i.e. the contents of the messages are not used in determining the nature of the manipulation to be performed. Therefore, it is logical to code protocols in a manner more closely resembling their natural language description.

Secondly, because the protocols are expressed in terms of abstract messages
and because every object may have a meta-level mail queue and dispatcher, a library of protocols may be developed which may be used with any object in the system. Such a library would consist of protocols expressed in terms of a mail queue and dispatcher pair. The meta-objects may then be installed on any object in the system. Since the protocols deal only with entire messages, the actual data of such messages is irrelevant to the operation of the protocol. Only fields common to every message, such as source, destination, time sent, etc., need be inspected.

The libraries could also be used with other systems, allowing the reuse of dependability protocols. One set of developers could be responsible for the dependability of multiple software systems and develop a protocol library for use with all of them. Since protocols implemented with MAUD are transparent to the application, other development teams, who are responsible for development of the application programs, need not be concerned with dependability. In the final system, protocols from the library may be installed on objects in the application, providing dependability in the composed system.

Example 1: A Replicated Server

In this section, we provide an example of how a protocol may be described using MAUD. In a distributed system, an important service may be replicated to maintain availability despite processor faults. In this section, we will give an example of how MAUD can be used in an actor domain to develop a modular and application-independent implementation of a protocol which uses replication to protect against crash failures.

The protocol we describe is quite simple: each message sent to the server is forwarded to a backup copy of the server. In this manner, there is an alternate copy of the server in case of a crash. Reply messages from both the original and backup servers are then tagged and the client eliminates duplicate messages. Figure 1.2.2 shows the resulting actions occurring when a message is sent to the replicated service. The original server is actor S1. When a message is received by the Forwarder, the message is forwarded to the backup S2. S2 is initialized with the same behavior and state as S1. Since they will receive the same messages in the same order, their state will remain consistent. Therefore, any replies will be identical and in the same order. The replies are tagged by the dispatchers of class Tagger and only the first copy of each message is passed on to the client by Eliminator. Forwarding messages to the backup server is implemented using a meta-level mail queue. The Screed code for this mail queue is presented in Figure 1.2.3.
Figure 1.2.2: When a message is sent by the clients A or B to the replicated service S1, the message is received by the Forwarder and a copy is forwarded to the backup S2. When the servers reply, the Tagger dispatchers tag each message so that the Eliminator mail queues may remove duplicate results. If S1 crashes, manager actors will install S2 as the new server.
Using a dispatcher, each reply message of the server is tagged to allow the elimination of duplicate replies by the client. A mail queue at the client performs this duplicate elimination. The code for this mail queue is shown in Figure 1.2.4.

We assume that managers themselves install appropriate meta-objects realizing a given dependability protocol. Therefore, we specify the relevant dependability protocols by describing the behavior of the initiating manager as well as the installed mail queues and dispatchers. A manager in charge of replicating a service takes the following actions to achieve the state shown in Figure 1.2.2:

1. The specified server is replicated by a manager by creating an actor with the same behavior and state.

2. A mail queue is installed for the original server to make it act as the Forwarder described above.

3. The mail queues of the original clients are modified to act as the Eliminator described above.
class Forwarder : MailQueue {
  actor backup;
  actor server;

  put(msg m) {
    m.base_send();
    m.set_dest(backup);
    m.send();
  }
}

Figure 1.2.3: Code for the server-end mail queue which implements replication. The mail queue Forwarder sends a copy of each message to a backup copy of the server.
4. The dispatchers of the servers are changed to tag all messages so that the Eliminator may remove copies of the same message.

5. Upon detection of a crash of S1, the manager takes appropriate action to ensure all further requests to the server are directed to S2. The manager may also create another backup at this time.

Although this example is simple, it does illustrate some of the benefits of our approach. The manager initiating the replication protocol needs no advance knowledge of the service to be replicated nor does the replicated service need to know that it is being replicated. Because the clients using the replicated service are not modified in any way, this gives us the flexibility to dynamically replicate and unreplicate services while the system is running.
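Purely as an illustration of how a manager might carry out steps 1-4 with the meta-operations described earlier, consider the following sketch. The ReplicationManager class, the replicate method, the clients array, the constant NUMCLIENTS, the Server and Tagger classes are all hypothetical, and the creation of a backup with the same state as the original server is abbreviated to a simple new.

class ReplicationManager {
  actor clients[NUMCLIENTS];

  replicate(actor server) {
    actor backup;
    actor fwd;
    actor elim;
    actor tag1;
    actor tag2;
    int i;

    backup = new Server;                    /* step 1: create the backup (state transfer omitted) */

    fwd = new Forwarder;                    /* step 2: forward the server's incoming messages */
    server.change_mailQueue(fwd);

    for (i = 0; i < NUMCLIENTS; i = i + 1) {
      elim = new Eliminator;                /* step 3: clients discard duplicate replies */
      clients[i].change_mailQueue(elim);
    }

    tag1 = new Tagger;                      /* step 4: tag replies from both servers */
    tag2 = new Tagger;
    server.change_dispatcher(tag1);
    backup.change_dispatcher(tag2);
  }
}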
Composition of Dependability Characteristics

In some cases, dependability can only be guaranteed by using several different protocols. For example, a system employing replication to avoid possible processor faults may also need to guarantee consensus on multi-party transactions through the use of three-phase commit or some similar mechanism. Unfortunately, writing one protocol which has the functionality of multiple protocols can lead to very complex code. In addition, the number of possible permutations of protocols grows exponentially — making it necessary to predict all possibly needed combinations in a system. Therefore, it is desirable to be able to compose two protocols written independently.
class Eliminator : Mailq {
  int tag;
  actor members[NUMREP];
  actor client;

  /* No get method is required since we use
   * the default behavior inherited from Mailq. */

  put(msg m) {
    int i;

    for (i = 0; i < NUMREP; i = i + 1)
      if (m.get_src() == members[i])
        /* Since the message was from a replica,
         * we know that the first argument is a tag and
         * the second is the original message. */
        if (m.arg[0] < tag)
          /* Discard message */
          return;
        else if (m.arg[0] == tag) {
          self.enqueue(m.arg[1]);
          tag = tag + 1;
        }
  }
}

Figure 1.2.4: Code for the client-end mail queue which implements replication. The mail queue Eliminator removes tags (which have been added to all server replies by some other dispatcher) and takes the first message labeled by a new tag.
add_mailq(actor aMailq) {
  if (mailq == nil)
    self.change_mailq(aMailq);
  else
    mailq.add_mailq(aMailq);
}

add_dispatcher(actor aDispatcher) {
  if (dispatcher == nil)
    self.change_dispatcher(aDispatcher);
  else
    dispatcher.add_dispatcher(aDispatcher);
}

Figure 1.2.5: The additional methods which must be inherited to allow for protocol composition.
In some cases this may not be possible due to a conflict in the semantics of the two protocols. In other cases, performance may depend greatly on the way in which two protocols are composed. For many common protocols such as replication, checksum error detection, message encryption, or check-pointing, composition is possible.

Because the meta-components of an object are themselves objects in a reflective system, there is a general solution for composing two protocols using MAUD. A simple change to the meta-operations inherited from the Object class, along with a few restrictions on the construction of mail queues and dispatchers, allows us to layer protocols in a general way. Figure 1.2.5 shows how an add_mailq method could be expressed in terms of the other meta-operations to allow layering. Because the mail queue and the dispatcher are objects, we can send a message to install meta-objects customizing their mail queue or dispatcher.

By adding protocols in the above manner, the outer mail queue functionality will be performed on incoming messages before they are passed on to the "inner" mail queues. For the send behaviors, the process is reversed with the innermost send behavior being performed first and the outermost behavior last, thereby creating an onion-like model with the newest layer closest to the outside world.

To preserve the model, however, several restrictions must be applied to the behavior of dispatchers and mail queues. We define the partner of a mail queue as being the dispatcher which handles the output of a protocol and the partner of a dispatcher as being the mail queue which receives input for the protocol. In Figure 1.2.6, B and C are partners as well as E and D. Each pair implements one protocol. It is possible for a meta-object to have a null partner.
Figure 1.2.6: Partners and owner relationships. A is the owner of all other actors in the figure. Dispatcher B and mail queue C are partners, as are dispatcher D and mail queue E.
The owner application of a meta-object is inductively defined as either its base object, if its base object is not a meta-object, or the owner application of its base object. For example, in Figure 1.2.6, A is the owner application of meta-objects B, C, D, and E. With the above definition we can restrict the communication behavior of the actors so that:

• A mail queue or dispatcher may send or receive messages from its partner or an object created by itself or its partner.

• A dispatcher may send messages to the outside world, i.e. to an object which is not a mail queue or dispatcher of the owner application (although the message might be sent through the dispatcher's dispatcher). A dispatcher may receive transmit messages from its base object and otherwise may only receive messages from its mail queue partner. Therefore, a dispatcher with a null mail queue partner may only receive transmit messages from its base object or communicate with actors it created.

• A mail queue may receive messages from the outside world (through its own mail queue), and send put messages when responding to get messages from its base object. Mail queues may otherwise only send messages
to its dispatcher partner or actors it created. Therefore, a mail queue with a null dispatcher partner may only send put messages to its base object or communicate with actors it created.

• Objects created by a mail queue or dispatcher may communicate with each other, their creator, or their creator's partner.

Because of the above restrictions, regardless of the number of protocols added to an object there is exactly one path which incoming messages follow — starting with the outermost mail queue — and exactly one path for outgoing messages in each object — ending with the outermost dispatcher. Therefore, when a new dispatcher is added to an object, all outgoing messages from the object must pass through the new dispatcher. When a new mail queue is installed, it will handle all incoming messages before passing them down to the next layer. Thus, a model of objects resembling the layers of an onion is created; each addition of a protocol adds a new layer in the same way regardless of how many layers currently exist. With the above rules, protocols can be composed without any previous knowledge that the composition was going to occur, and protocols can now be added and removed as needed without regard either to the actor itself or to the protocols already installed.

In Figure 1.2.6, actors B and C are initially installed as one "layer." Messages come into the layer only through C and leave through B. Therefore, D and E may be installed with the add_mailq and add_dispatcher messages as if they were being added to a single actor. Now messages coming into the composite object through E are then received by C. Messages sent are first processed by B and then by D.

Example 2: Composing Two Protocols

Figure 1.2.7 shows the result of imposing the protocol described in Example 1 on a set of actors already using a checksum routine to guarantee message correctness. Originally, each actor had a corresponding Check-In mail queue and a Check-Out dispatcher. When server S1 is replicated, its meta-level objects are also replicated. The Forwarder mail queue is installed as the meta-level mail queue of S1's mail queue. It will forward all messages to S2. A Tagger dispatcher is installed for each of the two servers and the Eliminator mail queue removes duplicate messages at the client. Although this protocol would be difficult to write as one entity, composition allows their modular, and therefore simpler, development.

In terms of our onion-layer model, each Check-In/Check-Out pair forms a layer.
Figure 1.2.7: System resulting from the composition of a replication protocol and message checksum protocol. When a message is sent by the client A (1), the Check-Out dispatcher adds the checksum information to the message. The message is then forwarded to the replica as described in Example 1 (2-3). The checksum information is removed by the Check-In mail queue (4) and the messages are processed, resulting in a reply (5). The reply messages both have the checksum information added (6) before they are tagged and sent to the client (7). At the client, duplicate messages are removed, the checksum information is checked, and the message is delivered.
For example, the innermost layer for server S1 consists of a Check-Out dispatcher and a Check-In mail queue. The outermost layer at S1 is comprised of a Tagger dispatcher and a Forwarder mail queue. The client A also has two layers. However, its outer layer consists solely of the Eliminator; this mail queue has a null dispatcher partner. Similarly, at server S2, the outermost layer consists only of a Tagger dispatcher with a null mail queue partner.

As can be seen in the above example, the onion-layer model only provides consistency for mail queue and dispatcher installation at a single node: a manager that follows the above rules may still install protocols incorrectly. Such an error may occur if the protocols are installed in one order at one node and in a different order at another node. For example, if the manager installed the Eliminator mail queue at client A as the innermost layer rather than the outermost, the system would not operate correctly. An area of current research is developing methods for specifying managers which simplify protocol installation and guarantee global consistency.
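As a rough sketch of how the composition in Example 2 might be initiated, a manager could layer the replication meta-objects on top of the existing checksum layer using the add_mailq and add_dispatcher methods of Figure 1.2.5. The fragment below is illustrative only; initialization of the meta-objects' acquaintances is omitted and the variable names are hypothetical.

actor fwd;
actor tag;
actor elim;

fwd = new Forwarder;
tag = new Tagger;
elim = new Eliminator;

server.add_mailq(fwd);        /* new outermost layer for messages arriving at S1 */
server.add_dispatcher(tag);   /* new outermost layer for replies leaving S1 */
client.add_mailq(elim);       /* outer layer at the client: discard duplicate replies */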
Performance

In implementing MAUD in Broadway, we have found that, in most cases, the additional message overhead caused by reflection is small compared to the actual cost in messages accrued by the protocols themselves. Using MAUD, there are additional messages, proportional to n, upon message reception, where n is the number of protocols composed together to provide dependability for a single object. Upon reception of a message, the message is routed up the chain of meta-level mail queues (n messages) and then works its way down through a series of get and put messages. For message transmission, there are n additional transmit messages. Since each object is usually protected by only a small number (1 or 2) of protocols, this cost is not great. Since meta-level objects are most likely to be local to the base actor, most messages to meta-objects will be local and inexpensive.

Furthermore, we use caching of the highest level mail queue to eliminate n of the messages: the system records the address of the top level mail queue and directs all messages intended for the base object to this mail queue. To preserve correctness with caching, meta-object installation is made atomic. Once a protocol installation begins, no messages are processed until the protocol is installed. This optimization is especially critical if some meta-objects need to be on a separate node. Placement of meta-objects on a different node from the base object is only done when physical separation is necessary for dependability: in this case, the inter-node communication from meta-mail queue to base-object
or base-object to meta-dispatcher would normally be required by the protocol, regardless of implementation technique. On the other hand, the communication cost from base-object to meta-mail queue is only due to the nature of using reflection. Therefore, caching eliminates this additional expense.
1.2.6
Exception Handling
Given a meta-level such as MAUD, it is still necessary for a programming language to provide flexible constructs supporting adaptive dependability. In particular, it is important to convey information to the correct entities when system failures occur. We have chosen exception handling as the medium through which managers are informed of problems in the system. This technique has been used extensively with forward error recovery; we simply extend the notion by having our managers prevent future failures through dynamic protocol installation. In this section, we describe the exception handling mechanism in Screed, our prototype actor language. To support adaptive dependability, faults and exceptions have been unified as one concept and exception handlers may be shared between objects. Broadway provides a set of system exceptions, some of which are notifications of failures. For example, when an actor attempts to communicate with an unreachable node, a crash exception is generated. We begin with a discussion of the general structure of exception handling in Screed followed by a specific illustration of the syntax used. We then show how this structure may be used with the meta-architecture to design adaptively dependable systems.
Exception Handling Components

Exceptions are signaled whenever an unexpected condition is encountered. An exception may be signaled either by the run-time system or by the application. The former are referred to as system exceptions and the latter as user-defined exceptions.

Exceptions in Screed are represented as objects, as proposed in [13] for sequential object-oriented languages. Although none of the other concurrent languages discussed above have taken this approach, we feel representing exceptions as objects allows for more flexible and efficient exception handling; all the information needed by a handler is contained in one object. All system exceptions are derived, through inheritance, from the class exception. User-defined exceptions may inherit from the exception class or from any other node
on the system exception inheritance tree. Below, we discuss the parties involved in the generation of an exception and then the structure of system exceptions.

There are four roles involved in handling any exceptional condition: invoker, signaler, exception, and handler (see Figure 1.2.8). Each role is represented as an object in the system. The invoker initiates the method of the signaler which results in an exception. The occurrence of an exception generates a signal. When a signal occurs, a new exception object is created. The signaler notifies the appropriate handler object of the exception's mail address. The handler must then diagnose the exception and perform any appropriate actions for handling the exception.

Exception handlers are constructed by the programmer as Screed actor classes. For each exception a handler accepts, a method must exist with the same name as the exception and which takes an instance of the exception class as a parameter. In all other ways, handlers are identical to other actor classes: they may have a set of instance variables, inherit from other classes, and may communicate with any of their acquaintances. They may also have other, non-exception methods.
Figure 1.2.8: The four roles involved with an exceptional condition. The invoker initiated the method in the signaler which caused the exception. An object of the appropriate exception class is created and the handler is notified of the exception's mail address. The handler may then interact with the invoker and/or the signaler as well as the exception object to resolve the exception.
All exceptions must inherit from the class exception. When an exception is signaled, an object of the appropriate exception class is instantiated and initialized with any information needed by the handler to process the exception. Some of the initialization fields are supplied by the run-time system. These
fields are contained in the exception class from which all exception objects inherit, and are utilized through the methods inherited from the exception class. Additional arguments for the initialization of an exception may be specified by the objects raising a signal. For example, an arithmetic exception which is initiated by an application could be initialized when signaled with the values of the operands in the arithmetic operation. This exception object would still have default values specified by the system. Methods defined in the exception class make use of the system-supplied values. These methods are:

name returns the name of the exception as a method value. Since method names are first-class values in Screed, this method enables the automatic calling of the correct method to handle it.

invoker returns the mail address of the actor which invoked the method resulting in the generation of the signal.

signaler returns the mail address of the signal generator.

source returns the name of the method in which the signal was generated.

arguments returns a list of the arguments that were passed to the method in which the signal was generated.

request returns TRUE if the invoker is waiting on a reply, FALSE otherwise.

reply allows a handler to reply to a request that was interrupted by the signal. The reply method can be used to supply an acceptable value to the invoker, thereby allowing the continuation of the computation.

Each exception handler may utilize only a few of these fields. However, since our environment is asynchronous, we want to preserve all available information. There are no guarantees that this information will be retained by either the invoker or the signaler. Use of exception objects provides us with the flexibility to include a large amount of information without creating complicated function calls or messages: all the information is packed into an object and is referenced through a standard interface. In a procedural approach, long parameter lists would be necessary to achieve the same effect.

Broadway currently supports three different system exceptions. All three inherit directly from the class exception. A bad-method exception is instantiated when an actor receives a message it cannot process. The bad-method
exception class provides the behavior of the destination actor. In general, there is very little the run-time system can do to correct such an error, but this information allows a handler to provide meaningful error messages to the user. An arithmetic exception is generated whenever Broadway traps an arithmetic error. Currently, this exception provides the state under which the exception occurred. We plan to expand this exception to include a string representing the expression being evaluated.

Broadway also provides some failure detection capabilities. Each node on Broadway has a failure detector which uses a watch-dog timer approach to detect the failure of, or inability to communicate with, other nodes. A crash exception is generated whenever an actor attempts to communicate with an actor on a non-existent or unreachable node. A crash exception consists of the original message and the identity of the node which cannot be reached. Notice that, although Broadway has detected a component failure, it is treated similarly to any other system exception. It is also possible for an object to subscribe to a failure detector. In this case, the subscriber's handler will automatically receive an exception whenever a failure is detected, even if the object did not try to communicate with the failed node.

Besides detecting node crashes, Broadway will also handle the failure of individual actors. If an actor crashes due to an error that is trapped by Broadway, that actor address will be marked as crashed. Currently, only arithmetic errors are trapped by Broadway and, therefore, this is the only manner in which a single actor may crash. If the defunct actor receives a message, a dead-actor exception will be generated. The dead-actor exception inherits from the crash exception. It also contains a reference to the exception generated when the actor crashed. (Currently, this is always an arithmetic exception.)
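For illustration, a handler for crash exceptions could be written as an ordinary Screed actor class along the following lines. This is only a sketch: the CrashHandler class, the manager acquaintance and its record_failure method are hypothetical, while crash and the methods request, reply and signaler are those described above.

class CrashHandler {
  actor manager;

  crash(crash e) {
    manager.record_failure(e.signaler());   /* let a manager track the emerging fault pattern */
    if (e.request())
      e.reply(nil);                         /* unblock an invoker that is waiting on a reply */
  }
}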
Exception Handling in Screed

In this section, we describe our two syntactic additions to Screed which enable exception handling: the handle statement which associates exceptions with handlers, and the signal statement which generates an exception.

In Screed, handlers can be associated with exceptions for either entire actor classes or for arbitrary code segments within a method. Figure 1.2.9 gives the syntax for a handle statement. The statement defines a scope over which specific exceptions are associated with a particular handler. If any method invocation contained within the code block of the handle statement results in an exception, the signal is routed to the correct handler as specified by the with
handle (exception1 with handler1,
        exception2 with handler2)
{
  /* Any block of code goes here */
} Figure 1.2.9: The structure of a handle block in Screed. exceptionl, exception^ are actor class names, handler is the name of an object.
bindings. As explained above, the exceptions are specified as class names and the handlers are addresses of objects. Handler statements may be nested. In this case, when an exception is generated, the innermost scope is searched first for an appropriate handler. If a handler for the exception does not exist then higher level scopes are checked.

handle (arithmetic with arithhandler, bad-method with aborthandler) {
  actor A;
  actor B;
  actor E;

  A = new complex(2,3);
  B = A.divide(C);
  handle (arithmetic with myhandler) {
    E = B.divide(D);
    myNum = res;
  }
}

Figure 1.2.10: An example of handler scopes and their effect. The outermost handle statement provides handlers for arithmetic and bad-method exceptions. The inner statement overrides the outer scope in that all arithmetic exceptions will be handled by myhandler.

Figure 1.2.10 demonstrates the scoping rules. In the scope of the outer handle statement, if in computing B (by dividing A by C), an arithmetic exception is generated (possibly by dividing by zero), the signal will be passed
to arithhandler. The computation of E through the division of B by D, however, is in the scope of the second handle statement. Therefore, any arithmetic signals generated by this action are sent to myhandler. Conversely, if our complex objects do not have a divide method, our actions will generate a bad-method signal which will be handled by aborthandler.

Unlike the complaint address based schemes [21, 26], our syntactic mechanisms do not require explicit specification of a handler's address with each message. For any given scope, including a single message send, handlers — our equivalent of complaint addresses — may be specified for each individual exception or for any group of exceptions. One handler need not be specified for all exceptions. Additionally, our method takes greater advantage of the available inheritance mechanisms as well as the general structure of object-oriented languages: both exceptions and handlers are expressed as objects in our system.

The above constructs work well within methods. However, there are two levels of scoping above the method level in Screed: the global and class levels. Exception handling at the class level is specified through the use of a handler statement which encloses several method definitions. In this manner, exception handling may be specified for an entire class by enclosing all methods in one handler statement. Such a construction does not prohibit handler statements inside the methods. A handle statement may not be defined across class boundaries as that would require the use of shared variables between class instances.

However, to provide exception handling at the global level, Screed supports the system-defined handler class Default-Handler. An instance of this class handles all signals which are not caught by another handler. Default system behavior is for a signal to be simply reported to the terminal. Default-Handler may be overwritten by a programmer defining a custom class of the same name. In this way, a final level of exception handling may be defined by the programmer. This type of facility is especially useful for writing debuggers. Any exception not defined in a custom Default-Handler class is handled by the system-default. Note that the system creates only one instance of the Default-Handler class: all otherwise unhandled signals are delivered to this instance.

As mentioned previously, exceptions may be generated as user-defined signals. A signal is generated by a signal statement.
exception-class-name(_args . . . ) ;
The signal statement generates a message to the appropriate exception handler. The arguments are used for initialization of the exception as defined by
the interface of the particular exception class. The signal does not interrupt the flow of control in the code, although a Screed return statement could follow the signal to end the method. In many cases, it is necessary for the signaler of the exception to await a response from the handler before proceeding. Signal statements are treated, syntactically, as message sends to a handler. Therefore, signal statements may act as a remote procedure call in the same manner as Screed message-sends. Thus, the handler may return a value to be used by the signaler. Such a case would be:
res = signal div-zero();

For this example, the exception handler would return an actor address as the value res. Then, the rest of the signalling method may compute.
In other systems, a special construct exists for generating signals within the current context, i.e., generating a signal which is caught by the handle statement in whose scope the statement occurs. An example of such a construct is the exit statement in CLU [24]. In Screed, such a construct is not necessary: the actor can explicitly send a message to the appropriate exception handler.
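Screed itself is not reproduced here in runnable form. As a rough, language-neutral sketch of the pattern just described, namely a signal that behaves like a message send to a third-party handler and may return a value to the signaler, the following Python fragment may help; the class and method names are invented for illustration and are not taken from the chapter.

class DivZeroException:
    """Exception object initialized with the arguments of the signal statement."""
    def __init__(self, numerator):
        self.numerator = numerator

class ArithHandler:
    """Third-party handler: receives exception objects and may return a value
    that the signaler uses to continue computing (forward error recovery)."""
    def handle(self, exc):
        # A hypothetical recovery policy: substitute a safe default result.
        return 0.0

def divide(a, b, handler):
    if b == 0:
        # The "signal" is just a message send to the handler; its reply is the
        # value the signalling method continues with (like res = signal ...).
        return handler.handle(DivZeroException(a))
    return a / b

print(divide(2.0, 0.0, ArithHandler()))   # prints the handler-supplied value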
1.2.7
Supporting Adaptive Dependability
A significant difference between exception handling in Screed and other languages is the use of third-party exception handlers. In languages such as CLU [24], SR [20], and Ada [12], exception handling routines are defined within the scope of the invoking objects. We refer to this approach as two-party exception handling (the invoker and the signaler) and our approach as three-party exception handling (the invoker, the signaler and an independent handler). We have found that two-party exception handling is unsatisfactory for modeling the complex relationships between objects such as those required for adaptively dependable systems. The key difference between two- and three-party systems is in the potential knowledge of a given object. With two-party exception handling, the exception handler, which is also the invoker, may know only of exceptions its invocations generated. Therefore, in such a system it is very difficult to develop a global picture of the fault pattern in the system. In a three-party system, such monitoring may be naturally expressed in terms of the exception handler since it may handle exceptions for a large group of objects or even the entire system.
Figure 1.2.11: A prototypical adaptively dependable system. In this case, the manager M receives input in the form of exceptions from application objects and notices from the failure detector. Upon determination that the Primary Node is unstable, M allocates the Backup Node and creates the appropriate objects to replace the Primary Node. Note that, in actuality, M is probably a replicated entity to ensure its resilience and availability.

Furthermore, an autonomous exception handler may subscribe to any available failure detectors, thereby augmenting the knowledge received through exception signals. A third-party exception handler may also be designated as an object with special rights. In this manner the system may be safely modified in response to exceptions and failures. Since it is dangerous to allow the arbitrary modification of one actor by another, most two-party systems can express reconfiguration of the system only by mimicking a three-party system, i.e., they must notify a third object with special rights of the exceptions they encounter, and this object may then reconfigure the system.
Thus, in adaptively dependable systems, the resulting system architectures will look quite similar to Figure 1.2.11. Such a system may allow the dynamic installation of dependability protocols or may simply support the reconfiguration of several objects in response to exceptions. In either case, the system will
have a manager with special rights to modify other objects. The manager will act as the exception handler for some group of objects being monitored. Furthermore, the manager may also subscribe to failure-detection services. Upon receiving enough input to determine that an unacceptable condition exists, the manager reconfigures the objects over which it has authority.

Example 3: Reconfiguration to Preserve Failure Semantics
Figure 1.2.12: This figure shows the topology of the network on a satellite. The shaded circles represent actors and the manager for memory errors is denoted by the replicated actor M. The memory uses checksums to avoid returning incorrect values. Instead, if an incorrect value is found, a special exception value is returned which may then trigger an exception in the application. The manager is replicated to guarantee a high level of fault tolerance.

To illustrate the concepts we described above, consider a distributed system which is operating in a hostile environment. A good example of such a system is a satellite with multiple processors, each with its own memory. Assume that the memory of these processors was developed to never return incorrect data to a read. Instead, the memory will detect the error through some checksum algorithm and return an error condition value. This reserved value can then be used to signal an exception and initiate forward error recovery. The specifications for this system state that such memory errors should occur only with a very small specified probability per read. Such a system is shown in Figure 1.2.12.
Once the system is launched, these memory components seem to operate correctly and the manager responsible for memory errors occasionally performs forward error recovery on the objects. However, the rate at which these memory errors occur is unacceptably high, exceeding the specified tolerance, and system performance degrades significantly due to repeated memory faults and subsequent error recovery. Therefore, the manager installs the replication protocol described in Example 1. The resulting system is shown in Figure 1.2.13.
Figure 1.2.13: The applications on each node have now been replicated using the meta-level protocol described in Example 1. Since we are protecting only against the corruption of individual memory locations, the replica may reside on the same node as the original actor, thereby improving the performance of the protocol.

Since both the original actor and the replica will be reading values from different memory locations, the probability that a memory error affects both copies is now the square of the single-copy probability, well within the specified tolerance. When an exception occurs, the manager will still have to perform some corrections, but the system can keep computing during this time and the error recovery will be simplified due to the existence of the replica. Considering the nature of the faults, the replica may be placed on the same node as the original actor. However, if instead of signalling an exception, nodes crashed when they could not read memory, the replicas would have been placed on different nodes.
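The following Python sketch is only a rough illustration of the manager pattern used in Example 3; it is not the Broadway/Screed implementation, and the class names, threshold, and install_replication stub are all hypothetical.

class MemoryErrorManager:
    """Third-party handler shared by many application objects.  It aggregates
    memory-error exceptions to build a global picture of the fault pattern and
    reconfigures the system when the observed error rate becomes unacceptable."""

    def __init__(self, threshold_rate, install_replication):
        self.threshold_rate = threshold_rate          # errors per read considered acceptable
        self.install_replication = install_replication
        self.reads = 0
        self.errors = 0
        self.replicated = False

    def report_read(self):
        self.reads += 1

    def report_memory_error(self, node_id):
        """Called when an application object signals a memory-error exception."""
        self.errors += 1
        self.recover(node_id)                         # forward error recovery (stubbed)
        rate = self.errors / max(self.reads, 1)
        if not self.replicated and rate > self.threshold_rate:
            self.install_replication()                # install the Example 1 protocol
            self.replicated = True

    def recover(self, node_id):
        pass  # placeholder for forward error recovery on the affected objects

A manager instance would be shared by all monitored objects, e.g. MemoryErrorManager(1e-5, install_replication=some_protocol_installer), so that its counters reflect the system-wide fault pattern rather than a single invoker's view.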
1.2.8
Conclusion
In this paper, we have described a methodology for the development of adaptively dependable systems. Adaptively dependable systems may function over a long duration despite a changing execution environment. Whether the changes are due to a variance in the components comprising the system or to a change in the physical environment in which the component operates, the use of dynamic protocol installation combined with exception handling allows fault tolerance to be guaranteed by the system.
Dynamic protocol installation is enabled through the use of a meta-level architecture. Our meta-level architecture, MAUD, allows the customization of an object's communication behavior on a per-object basis. By describing protocols in terms of modifications to the communication behavior, protocols may be dynamically installed on objects as necessary. Furthermore, if a protocol is no longer
required, it may be removed. Through the use of caching and atomic protocol installation, meta-level descriptions of protocols may be implemented with a minimal cost in performance.
We also support composition of protocols. Provided there are no inherent semantic conflicts between two protocols and both protocols are implemented using MAUD, these two protocols may then be composed without foreknowledge that a composition may occur. In this manner, protocols may be constructed in a modular fashion and later combined to provide the desired level of fault tolerance.
To provide adaptive dependability, we combine dynamic protocol installation with exception handling. We make extensive use of third-party exception handlers which are shared between multiple objects. Since these handlers have privileges to modify meta-level objects, they are termed managers. A single manager will be informed of all exceptions related to a particular problem. This knowledge may be augmented through subscription to failure-detection services. In this manner, the manager will have all information necessary for a correct diagnosis of fault patterns.
The concepts described in this chapter have been implemented on Broadway — our actor run-time system — and are accessed through the language Screed. Screed provides exception handling constructs which support managers and provides access to the meta-level architecture implemented in Broadway.
One problem not solved by our framework is guaranteeing the consistent installation of protocols on multiple actors. We are currently developing a Protocol Description Language which will allow a protocol to be expressed as a single entity. A protocol compiler can convert these protocols into MAUD mail queues and dispatchers. To address the installation problem, the compiler creates objects that guarantee correct installation of the protocol. Managers use these objects to install protocols on all actors involved in the protocol.
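As a very loose illustration of the flavor of per-object protocol installation and removal (not MAUD itself, whose actual interfaces are described earlier in the chapter and are not reproduced here), the following Python sketch wraps an object's outgoing-communication behavior with installable filters; all names are invented.

class Actor:
    """Base object whose communication behavior can be customized per object
    by installing dispatcher wrappers around send()."""
    def __init__(self, name):
        self.name = name
        self._dispatchers = []           # installable meta-level filters

    def install(self, dispatcher):
        self._dispatchers.append(dispatcher)

    def remove(self, dispatcher):
        self._dispatchers.remove(dispatcher)

    def send(self, target, message):
        for dispatch in self._dispatchers:   # each installed protocol sees the send
            message = dispatch(target, message)
        print(f"{self.name} -> {target}: {message}")

def logging_protocol(target, message):
    """A trivially composable 'protocol': tag every outgoing message."""
    return {"payload": message, "logged": True}

a = Actor("A")
a.install(logging_protocol)
a.send("B", "hello")        # the message passes through the installed protocol
a.remove(logging_protocol)  # the protocol is no longer required, so remove it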
Acknowledgments
The research described in this paper has benefitted from earlier collaboration with Svend Frølund and Rajendra Panwar. The authors would also like to acknowledge helpful comments from Christian Callsen, Svend Frølund, WooYoung Kim, Rajendra Panwar, Anna Patterson, Shangping Ren, Carolyn Talcott, Nalini Venkatasubramaniam, and Takuo Watanabe, among others.
References
[1] M. Acceta, R. Baron, W. Bolosky, D. Golub, R. Rashid, A. Tevanian, and M. Young. Mach: A New Kernel Foundation for UNIX Development. In USENIX 1986 Summer Conference Proceedings, June 1986.
[2] G. Agha. Actors: A Model of Concurrent Computation in Distributed Systems. MIT Press, 1986.
[3] G. Agha. Concurrent Object-Oriented Programming. Communications of the ACM, 33(9):125-141, September 1990.
[4] G. Agha, S. Frølund, R. Panwar, and D. Sturman. A Linguistic Framework for the Dynamic Composition of Dependability Protocols. In C. E. Landwehr, B. Randell, and L. Simoncini, editors, Dependable Computing for Critical Applications 3, volume VIII of Dependable Computing and Fault-Tolerant Systems, pages 345-363. IFIP Transactions, Springer-Verlag, 1993.
[5] G. Agha, I. Mason, S. Smith, and C. Talcott. Towards a Theory of Actor Computation. In R. Cleaveland, editor, The Third International Conference on Concurrency Theory (CONCUR '92). Springer-Verlag, 1992. LNCS (forthcoming).
[6] Gul Agha and Christian Callsen. ActorSpace: An Open Distributed Programming Paradigm. In Fourth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, San Diego, California, May 1993. Also published as a special issue of SIGPLAN Notices, vol. 28, no. 7, pages 23-32, July 1993.
[7] K. P. Birman and T. A. Joseph. Communication Support for Reliable Distributed Computing. In Fault-Tolerant Distributed Computing. Springer-Verlag, 1987.
[8] Roy Campbell, Nayeem Islam, David Raila, and Peter Madany. Designing and Implementing Choices: An Object-Oriented System in C++. Communications of the ACM, pages 117-126, September 1993.
[9] E. Cooper. Programming Language Support for Multicast Communication in Distributed Systems. In Tenth International Conference on Distributed Computer Systems, 1990.
[10] Antonio Corradi, Paola Mello, and Antonio Natali. Error Recovery Mechanisms for Remote Procedure Call-Based Systems. In 8th Annual International Phoenix Conference on Computers and Communication Conference Proceedings, pages 502-507, Phoenix, Arizona, March 1989. IEEE Computer Society Press.
[11] Flaviu Cristian. Understanding Fault-Tolerant Distributed Systems. Communications of the ACM, 34(2):56-78, 1991.
[12] Qian Cui and John Gannon. Data-Oriented Exception Handling in Ada. IEEE Transactions on Software Engineering, 18:98-106, May 1992.
[13] Christophe Dony. Improving Exception Handling with Object-Oriented Programming. In Proceedings of the 14th Annual International Computer Software and Applications Conference, pages 36-42, Chicago, 1990. IEEE Computer Society, IEEE.
[14] Christophe Dony, Jan Purchase, and Russel Winder. Exception Handling in Object-Oriented Systems. OOPS Messenger, 3(2):17-29, April 1992.
[15] Jeffrey L. Eppinger, Lily B. Mummert, and Alfred Z. Spector, editors. CAMELOT AND AVALON: A Distributed Transaction Facility. Morgan Kaufmann Publishers, Inc., 1991.
[16] Jacques Ferber and Jean-Pierre Briot. Design of a Concurrent Language for Distributed Artificial Intelligence. In Proceedings of the International Conference on Fifth Generation Computer Systems, volume 2, pages 755-762. Institute for New Generation Computer Technology, 1988.
[17] S. Frølund. Inheritance of Synchronization Constraints in Concurrent Object-Oriented Programming Languages. In Proceedings of ECOOP 1992. Springer-Verlag, 1992. LNCS 615.
[18] S. Frølund and G. Agha. A Language Framework for Multi-Object Coordination. In Proceedings of ECOOP 1993. Springer-Verlag, 1993. LNCS 707.
[19] John B. Goodenough. Exception Handling: Issues and a Proposed Notation. Communications of the ACM, 18(12):683-696, December 1975.
[20] Daniel T. Huang and Ronald A. Olsson. An Exception Handling Mechanism for SR. Computer Languages, 15(3):163-176, 1990.
[21] Yuuji Ichisugi and Akinori Yonezawa. Exception Handling and Real Time Features in an Object-Oriented Concurrent Language. In A. Yonezawa and T. Ito, editors, Concurrency: Theory, Language, and Architecture, pages 92-109. Springer-Verlag, Oxford, UK, September 1989. LNCS 491.
[22] Wooyoung Kim and Gul Agha. Compilation of a Highly Parallel Actor-Based Language. In U. Banerjee, D. Gelernter, A. Nicolau, and D. Padua, editors, Proceedings of the 5th International Workshop on Languages and Compilers for Parallel Computing, volume 757 of Lecture Notes in Computer Science, pages 1-15. Springer-Verlag, 1992.
[23] Barbara Liskov. Distributed Programming in Argus. Communications of the ACM, 31(3):300-312, March 1988.
[24] Barbara Liskov and Alan Snyder. Exception Handling in CLU. IEEE Transactions on Software Engineering, 5(6):546-558, November 1979.
[25] P. Maes. Computational Reflection. Technical Report 87-2, Artificial Intelligence Laboratory, Vrije University, 1987.
[26] Carl Manning. ACORE: The Design of a Core Actor Language and its Compiler. Master's thesis, MIT, Artificial Intelligence Laboratory, August 1987.
[27] S. Mishra, L. L. Peterson, and R. D. Schlichting. Consul: A Communication Substrate for Fault-Tolerant Distributed Programs. Technical report, University of Arizona, Tucson, 1991.
[28] M. H. Olsen, E. Oskiewicz, and J. P. Warne. A Model for Interface Groups. In Tenth Symposium on Reliable Distributed Systems, Pisa, Italy, 1991.
[29] Richard D. Schlichting, Flaviu Cristian, and Titus D. M. Purdin. A Linguistic Approach to Failure Handling in Distributed Systems. In A. Avizienis and J. C. Laprie, editors, Dependable Computing for Critical Applications, pages 387-409. IFIP, Springer-Verlag, 1991.
[30] Richard D. Schlichting and Titus D. M. Purdin. Failure Handling in Distributed Programming Languages. In Proceedings: Fifth Symposium on Reliability in Distributed Software and Database Systems, pages 59-66, Los Angeles, CA, January 1986. IEEE Computer Society Press.
[31] Santosh Shrivastava, Graeme Dixon, and Graham Parrington. An Overview of the Arjuna Distributed Programming System. IEEE Software, pages 66-73, January 1991.
[32] B. C. Smith. Reflection and Semantics in a Procedural Language. Technical Report 272, Massachusetts Institute of Technology, Laboratory for Computer Science, 1982.
[33] Bjarne Stroustrup. The C++ Programming Language. Addison-Wesley, second edition, 1991.
[34] C. Tomlinson, P. Cannata, G. Meredith, and D. Woelk. The Extensible Services Switch in Carnot. IEEE Parallel and Distributed Technology: Systems and Applications, 1(2), May 1993.
[35] C. Tomlinson and V. Singh. Inheritance and Synchronization with Enabled-Sets. In OOPSLA Proceedings, 1989.
[36] Nalini Venkatasubramanian and Carolyn Talcott. A Meta-Architecture for Distributed Resource Management. In Proceedings of the Hawaii International Conference on System Sciences. IEEE Computer Society Press, January 1993.
[37] T. Watanabe and A. Yonezawa. An Actor-Based Metalevel Architecture for Group-Wide Reflection. In J. W. de Bakker, W. P. de Roever, and G. Rozenberg, editors, Foundations of Object-Oriented Languages, pages 405-425. Springer-Verlag, 1990. LNCS 489.
[38] C. T. Wilkes and R. J. LeBlanc. Distributed Locking: A Mechanism for Constructing Highly Available Objects. In Seventh Symposium on Reliable Distributed Systems, Ohio State University, Columbus, Ohio, 1988.
[39] Y. Yokote, A. Mitsuzawa, N. Fujinami, and M. Tokoro. The Muse Object Architecture: A New Operating System Structuring Concept. Technical Report SCSL-TR-91-002, Sony Computer Science Laboratory Inc., February 1991.
[40] A. Yonezawa, editor. ABCL: An Object-Oriented Concurrent System, chapter Reflection in an Object-Oriented Concurrent Language, pages 45-70. MIT Press, Cambridge, Mass., 1990.
SECTION 2
REQUIREMENTS MODELS
SECTION 2.1
Military Fault Tolerant Requirements
Timothy P. Monaghan*

* Naval Air Warfare Center, Aircraft Division, Warminster, Pa.

2.1.1
Trends in Modern Weapon System Procurement
The DoD is entering a radically new era of weapon systems procurement. The Cold War is over and new scenarios and new force structures are changing the requirements of the next generation of weapon systems. The major drivers shaping this new era of weapon system procurement are: decreasing military budgets, stretched-out development cycles, an emphasis on prototypes and demonstrations, open systems standards, and intensive modeling and simulation before any hardware is built. The next generation of computer-based weapon systems will be designed using a high percentage of commercial off-the-shelf parts. A key focus in this next generation of computer-based weapon systems will be the fault tolerant features of the system's design. Studies have shown that a system can actually be more dependable, even though it uses less reliable parts, if adequate attention is paid to the fault handling features of the system's design [1]. A key issue in this new era of defense procurement is: how can the DoD manage the procurement of a dependable and cost-effective, computer-based weapon system?
The most critical task a program office faces in the development of a dependable computer-based weapon system is the Specification and Statement of Work (SOW) package. The government does not design the weapon system or give direct supervision to the ongoing work. The government's role is to specify the features of the system it requires and to validate that the evolving system and the final delivered system meet those requirements. The specifications of the system describe the requirements and features of the system to be built (what the system shall do). The SOW describes the deliverables that the contractor shall deliver to the government at each critical milestone of the evolving system's design and development (what the contractor shall do
at each milestone to ensure that the system will meet the government's specifications). The SOW also calls for specific validation information to be delivered in reports and for the holding of periodic design review meetings.
Design review meetings occur at the system's critical milestones. Their focus is to provide the government with the assurance that the contractors have considered all relevant trade-offs for dependability and fault tolerance and that adequate attention is being paid to dependability and fault tolerance in the system's design. For the government to ensure that the final system is a dependable computing system, it is essential to have accurate and clear fault tolerant specification language and proven methods for their validation. The courts have generally held any ambiguities in the specifications and SOW against the government, and the burden is on the government to come up with "reasonably precise" contract design language [2].
The topic of what constitutes an adequate description of fault tolerant requirements is currently in transition within the fault tolerant community. The rest of this chapter will describe this trend and give examples of specific fault tolerant language in a current, progressive, mission avionics Specification and SOW.

2.1.2
Specifications
What legal language should be in the Specification to clarify to the contractor the exact fault tolerant and dependability requirements that the government expects to see in the future system? Past avionics fault tolerant specifications have described vague requirements such as: "The contractor shall identify and discuss high risk areas..." and "single point failure modes that significantly degrade mission performance or prevent successful mission accomplishment shall be addressed...". The specifications also include descriptions of reconfiguration and degraded modes as well as the traditional reliability, availability, and maintainability numbers. But very few of these requirements translate into a directly testable quality of the design like a performance benchmark.
In the extreme time pressures of modern complex weapon system design, vague language describing broad fault tolerance and dependability requirements is too easily put on the back burner and will not steer a contractor to produce a dependable, cost-effective system for the government to own and operate. Dependability requirements are often demonstrated after the design is complete, when
non-compliance is easier to waive than to fix. The fault tolerant community is seeking to make the fault tolerance and dependability features of a system more visible, quantifiable, and verifiable throughout the development cycle. Legal, formal specifications are the only way the government can define the computer performance and dependability requirements of the system it is procuring. Examples of more precise fault tolerant specification requirements are:

(a) Fault Set
The fault tolerant system design shall define the functional and physical partitioning of the system's fault containment regions (FCR). Each FCR shall handle hard faults, transient faults (including massive power supply transient faults), intermittent faults, coincident faults (a second fault occurring while recovering from the first fault), timing faults, minor battle-induced physical faults and operator faults. All faults will be recorded in the error history log.
The frequency of hard faults is decreasing for both military and commercial digital parts but transient fault rates are increasing. The next generation of systems needs to be designed to automatically handle transient faults, notify the operator only when reconfiguring around hard faults, yet record all fault activity to ensure quick diagnosis of failing parts.

(b) Fault Characterization
Each fault from the fault set shall be characterized by: (1) timing - the probability density of its expected frequency, (2) extent - the extent of damage it causes to information or hardware in the system, (3) duration - the length of time that the fault persists, and (4) error model - the incorrect logical behavior expected when the fault produces an error. (An illustrative way to record such a characterization is sketched after item (d) below.) The redundancy management of the system shall be characterized by:
(c) Automatic Reconfiguration
The system shall automatically reconfigure to circumvent a failed component and select a backup component or data path to redistribute the processing. This reconfiguration shall be transparent to the operator and shall be accomplished within the real-time constraints of the system (typically 20 ms to 50 ms for an avionics system). The operator shall have the ability to be notified of recovery from hard faults. The system shall also be capable of automatic and transparent reconfiguration when a failed component is restored to operation if the failure was due to a transient fault. This process shall continue as long as the processing and component resources continue to meet the demands of the reconfigured system. When system resources are inadequate to meet the real-time system demands, the system shall enter the degraded mode state.
(d) Degraded Mode
Multiple processors and distributed processing architectures shall permit the automatic, controlled and graceful degradation of system capabilities when errors occur and the real-time processing demands exceed system resources. Functional capabilities shall be automatically redistributed where possible to the rest of the system. Impact on mission operations shall be kept to a minimum for as long as possible when operating in the degraded mode.
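One possible way to record the fault set and the per-fault characterization called for in items (a) and (b) is sketched below in Python; the field names and enumeration values are illustrative only and are not taken from any actual specification.

from dataclasses import dataclass, field
from enum import Enum, auto

class FaultClass(Enum):
    HARD = auto()
    TRANSIENT = auto()
    INTERMITTENT = auto()
    COINCIDENT = auto()
    TIMING = auto()
    BATTLE_INDUCED = auto()
    OPERATOR = auto()

@dataclass
class FaultCharacterization:
    fault_class: FaultClass
    expected_frequency_pdf: str      # (1) timing: description of the probability density
    extent: str                      # (2) extent of damage to information or hardware
    duration_seconds: float          # (3) how long the fault persists
    error_model: str                 # (4) incorrect logical behavior when an error results
    containment_region: str = "FCR-unassigned"
    log_entries: list = field(default_factory=list)   # all fault activity is recorded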
2.1.3
SOW
The SOW specifies what the contractor shall do at each stage of the system's design to assure that the system being built will be a dependable and cost effective system for the government to own and operate. The SOW really describes what reports and interactions with the contractor the government expects to see as it oversees the development of the system. The government's oversight of the development of a complex weapons system is divided up into five basic development phases:
the Concept Exploration Phase, the Demonstration and Validation Phase, Full Scale Development, Procurement, and Deployment [3,4].
The first major design review is the System Requirements Review (SRR). At this review the requirements of the system are carefully gone over to ensure that both the government and the contractor have a mutual understanding of the system requirements and that adequate attention has been paid to software requirements, the system's test requirements, and the applicable architectural and fault tolerant approaches.
In the next review, the System Design Review (SDR), the contractor presents the results of the trade-off studies and the rationale for the architectural approach and fault tolerant strategy that has been selected. This stage of design validation and evaluation by the government is critical. The contractor has selected an architectural approach for the system and, after this design review, will be committing the resources of the project to that approach. Now is the time to carefully consider the rationale of that architectural and fault tolerant approach before major funding is committed to it.
At the next major review, the Preliminary Design Review (PDR), the top level design efforts have been completed and the detailed design is about to begin. At PDR, the contractor presents the preliminary hardware and software design of the system and the initial performance and fault tolerance evaluations. At the Critical Design Review (CDR), the contractor will provide a completed hardware and executive software design, a refined analysis of performance and fault tolerance, and plans to demonstrate and test the brassboard.
The final stages of the system design are the actual building of a brassboard, a Test Readiness Review (TRR), and the demonstration and formal testing of the system's performance and dependability features.
Throughout each of the stages of the system's design, from SRR to the final testing, the government can influence the system's design. But the later the stage, the weaker that influence becomes due to the increasing cost to change major design features once the program has heavily
invested in that particular design. The government can raise the probability that the final system will be a dependable system by carefully describing the dependability exit criteria to be satisfied at each stage of the system's design and by requiring clear proofs of that testing. These proofs shall be delivered at each of the design reviews.
Figure 2.1.1: The major program phases (Concept Exploration, Demonstration and Validation, Full Scale Development) and the design review milestones (SRR, SDR, SSR, PDR, CDR, TRR) from program initiation through go-ahead.

2.1.4
Analytic Modeling and Fault Injection
Current SOW dependability and fault tolerant requirements stress reliability prediction and Failure Modes, Effects, and Criticality Analysis (FMECA) as described in MIL-STD-785, MIL-STD-756B and MIL-STD-1629. The parameters used in reliability prediction or analytic models are failure rates, coverage estimates, repair rates, and costs of repair and parts. Reliability prediction modeling can be combinatorial or non-combinatorial. The combinatorial elements are reliability graphs, which aid in predicting the hard failure rate of the current system from the statistical past failure rates of similar components, and fault trees, which are trees of conditions that lead to certain system failures. The non-combinatorial modeling elements are Markov chains, Petri nets, and Monte Carlo simulations [5].
While analytical studies such as these are useful as the system design is evolving, these mathematical models have numerous sources of error and need to be validated with fault injection [6]. The initial Markov, Petri net, or combinatorial models of the system describe how the designer assumes the system will handle faults, but the current functional model, and the fault injections into it, describe how the evolving system actually handles faults.
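As a toy illustration of the combinatorial elements mentioned above (not a substitute for the MIL-STD analyses), the following Python fragment computes the reliability of a series system in which one unit is duplicated in parallel, from assumed constant failure rates; all numbers are hypothetical.

import math

def component_reliability(failure_rate_per_hour, hours):
    """R(t) = exp(-lambda * t) for a constant failure rate."""
    return math.exp(-failure_rate_per_hour * hours)

def series(*reliabilities):
    r = 1.0
    for x in reliabilities:
        r *= x
    return r

def parallel(*reliabilities):
    q = 1.0
    for x in reliabilities:
        q *= (1.0 - x)          # all redundant copies must fail
    return 1.0 - q

# Hypothetical 10-hour mission with three units; the processor is duplicated.
r_proc = component_reliability(2e-4, 10.0)
r_bus = component_reliability(5e-5, 10.0)
r_io = component_reliability(1e-4, 10.0)
print(series(parallel(r_proc, r_proc), r_bus, r_io))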
Fault injection experiments are performed on the functional simulation of the system that is developed from the high level conceptual model of the system. The functional system model is gradually transformed from behavioral descriptions to RTL descriptions to structural models to the final brassboard implementation. As the design is further refined, faults are injected into the current functional representation of the system and the results of the fault injection experiments are used to validate and refine the initial gross assumptions of the Markov, Petri net, or combinatorial models. The functional system model can never be exhaustively tested, but a further refined analytic model can reveal which system parameters are most sensitive to change and need closer functional fault injection experiments. Thus, the analytic modeling can enable the test engineer to wisely inject faults in the short amount of time allowed for the fault injection testing.
Fault injection techniques provide a measure of confidence that the system's fault tolerance will perform according to specifications, but fault injection is a probabilistic means of validation and cannot provide a guaranteed deterministic validation. Thus, a variety of fault injection techniques (random and functional) are used along with analytic techniques to provide a complete system fault tolerant validation. Random fault injections should also be carried out to allow for any fault handling assumptions that have been overlooked in the analytic model. The overriding goal should be an adequate test set at each stage of a system's design, given the amount of time allocated for testing, to ensure that the evolving system will be dependable and continue to deliver its services in the presence of transient and hard faults.
Figure 2.1.2 shows these two complementary streams of analytic modeling and fault injection that should be carried out through the evolving system's design and development. Analytic modeling provides the insight for the fault injection experiments, and the results of the fault injection experiments are used to calibrate the parameters of the analytic modeling. Thus, the results of both streams are constantly being reconciled and the estimates of dependability become more accurate as the design approaches a brassboard implementation. Carefully planned fault injection experiments in the brassboard are the final proof of the system's dependability. The SOW should specify the specific fault injection experiments and analytic modeling results the government requires to be delivered at each of the major milestones.
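As a rough illustration of how fault-injection results can calibrate an analytic model (this is not taken from any SOW), the following Python sketch estimates a coverage parameter from a simulated injection campaign and plugs it into a simple reliability model in which only uncovered faults cause system failure; all numbers and names are hypothetical.

import math
import random

def estimate_coverage(num_injections, true_recovery_prob, seed=0):
    """Simulated fault-injection campaign: inject faults into a functional
    model and count how many are detected and recovered from."""
    rng = random.Random(seed)
    recovered = sum(rng.random() < true_recovery_prob for _ in range(num_injections))
    return recovered / num_injections

def mission_reliability(failure_rate_per_hour, coverage, mission_hours):
    """Simple analytic model: uncovered faults, arriving at rate
    lambda * (1 - c), lead directly to system failure."""
    effective_rate = failure_rate_per_hour * (1.0 - coverage)
    return math.exp(-effective_rate * mission_hours)

if __name__ == "__main__":
    c_hat = estimate_coverage(num_injections=10_000, true_recovery_prob=0.98)
    r = mission_reliability(failure_rate_per_hour=1e-4, coverage=c_hat, mission_hours=10.0)
    print(f"estimated coverage = {c_hat:.4f}, 10-hour mission reliability = {r:.6f}")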
Figure 2.1.2: Two complementary validation streams across the design phases (investigating candidate architectures, architecture selection, implementation; SRR, SDR, PDR, CDR): functional simulation and fault injection (architectural-level simulation, VHDL behavioral/gate models, test prototype) and reliability prediction/analytic modeling (failure rates, repair rates, coverage estimates, and costs extrapolated from field fault data of similar systems; combinatorial elements such as reliability graphs and fault trees; non-combinatorial elements such as Markov models and Petri nets; tools such as CARE III, HARP, and SURE), with fault-injection results feeding model validation.
2.1.5
Design Reviews
Each design review requires specific exit criteria for that design stage. As well as the standard reliability prediction and FMECA studies, the next generation of computer based weapon systems will also specify fault injection. Examples of some milestone requirements from a current, progressive mission avionics SOW that stress fault injection studies throughout the system's development are given below:

(a) Functional Fault Analysis
The fault tolerance of the design shall be validated at each stage of the system's evolution. This evaluation shall be done by mapping the specified fault set onto each subsystem partition and then identifying the fault detection, isolation, removal and recovery mechanisms of the system that will re-establish system operation. This functional fault analysis (FFA) shall be done at each stage of the system design and development, and any single points of failure that degrade mission performance or prevent successful mission accomplishment shall be rectified [7].

(b) Exit Criteria for SDR
- Preliminary architectural description, architectural partitioning and functional allocation
- A system development plan including:
1. Modeling tools for the evaluation of system operational performance and system fault tolerance;
2. Identification of a baseline set of application based benchmarks;
3. Identification of critical state information which must be protected under all fault conditions for transparent fault recovery.

(c) Exit Criteria for SSR
- Software architecture description and functional allocation
- Justification for architectural approach including:
1. Preliminary operational performance estimates using the benchmarks defined at the SDR;
2. System fault tolerance paradigms, approaches, maximum acceptable delay estimates and redundancy required for transparent fault recovery.

(d) Exit Criteria for PDR
For development items the following:
- A complete VHDL behavioral description of the architecture which provides accurate timing estimates;
- A description of the planned logic layout which provides accurate prediction of circuit complexities, failure modes, error detection methods, and detection coverage and latency.

For non-development and development items:
- A description and simulation of the executive software including inter-process and inter-node communication mechanisms;
- An integrated functional and fault tolerant description and analysis of the system demonstrating how the system mechanisms work together to achieve the desired performance and automatic recovery in the presence of faults;
- A functional fault analysis of the system which shall include the mapping of representative faults from the specified fault set to each hardware and software component within each Error Containment Region (ECR) and analyzing the effectiveness of the ECRs to contain errors. The system's error handling procedures shall be characterized in terms of coverage, delay times and system-wide effects.

(e) Exit Criteria for CDR
For development items:
- A complete description of the final design including IC and logic layout, packaging, executive software and representative application programs.

For non-development and development items:
- A refined prediction of the design's performance and fault tolerance including detailed circuit analysis and fault injection simulations to determine fault coverage and recovery times;
- The time required and probability of recovery from each fault type in the specified fault class;
- A plan for the detailed brassboard construction and evaluation shall be presented. The detailed design shall include provisions to allow experimental fault injection testing of the brassboard to verify that the physical system will perform in the fashion assumed in the previous simulations and reliability models.

(f) Exit Criteria for TRR
- A full-up demonstration of the brassboard with the executive software and a representative application.
- Timing measurements from an agreed set of benchmarks shall be taken and reconciled with previous timing and throughput estimates.
- Representative faults shall be injected into the brassboard to verify that the previous fault tolerant predictions accurately reflect the behavior of the implemented system. Representative faults from the specified fault set shall be selected by the contractor and the Navy. These faults will be inserted randomly and in selected locations.
- The laboratory fault injection experiments shall be reconciled with the simulation and analytic results to provide assurance that the previous dependability and availability predictions are correct. Any discrepancies found at this stage shall be analyzed and corrected.
2.1.6
Conclusion
The DoD is giving increased attention to what precise legal language is needed for its future Specification and SOW requirements of complex computer based weapon systems. The DoD is seeking to specify
clear and quantifiable validation techniques at each stage of the system's design. This will allow the DoD to be an informed customer able to quantify a design's dependability and fault tolerant metrics and reasonably ensure at each stage of the design that the evolving system will be a dependable system to own and operate.
During the Gulf War, Naval aircraft sustained high Full Mission Capable (FMC) rates (in the high 90% range), but these high FMC rates were sustained at the cost of high Maintenance Man Hours per Flight Hour (MMH/FH) (30 to 65 MMH/FH), high spares usage and high false alarm rates (consistently in the 30% to 35% range). Future complex weapon systems will increasingly rely on digital sub-systems, and the dependability of these digital designs will play a critical role in the effectiveness of those systems in the field. In future combat situations, the rate of spares usage, the time to isolate and repair the equipment, and the false alarm rates will play a critical role in the effectiveness of those systems. As budgets decrease, fewer systems are purchased, and commercial parts are used more widely, the dependability of those systems will be a critical factor in the effectiveness of the next generation of computer based combat systems.

References
[1] Phillip Babcock. Advanced Launch System Life-Cycle Cost Optimization Studies. The Charles Stark Draper Laboratory, Inc., Cambridge, Massachusetts.
[2] Boyle v. United Technologies Corp., 487 U.S. 500 (1988).
[3] Defense Systems Management College. Systems Engineering Management Guide, January 1990, U.S. Printing Office.
[4] SDIO BM/C3 Processor and Algorithm Working Group, D. A. Rennels, ed. "Volume Two, Management Issues: Contractor Milestones and Evaluation," Application of Fault Tolerance Technology, October 20, 1987.
[5] A. Johnson, M. Malek. "Survey of Software Tools for Evaluating Reliability, Availability, and Serviceability," ACM Computing Surveys, Vol. 20, No. 4, Dec. 1988.
[6] D. Lee, G. Gilley. "Minimizing Error In Predicting The Reliability Of On-Board Satellite Systems," The Aerospace Corporation, April 1991.
[7] G. Gilley. "Validation of Fault Tolerant Designs," IEEE/AIAA 10th Digital Avionics Systems Conference, October 1991.
SECTION 2.3
Derivation and Use of Deadline Information in Real-Time Control Systems¹
Kang G. Shin and Hagbae Kim²

¹The work reported was supported in part by the Office of Naval Research under Grant N00014-91-J-1115 and NASA under Grant NAG-1220. Any opinions, findings, and conclusions or recommendations expressed in this section are those of the authors and do not necessarily reflect the view of the funding agencies.
²The authors are with the Department of Electrical Engineering and Computer Science, The University of Michigan, Ann Arbor, MI 48109-2122.

Abstract
This section presents a method for deriving the hard deadline of a real-time control system, which captures the interplay between a controlled process and its controller computer by formally specifying the need of the controlled process in a form understandable to the controller computer. Several examples of deriving and applying the hard-deadline information are also presented to demonstrate the importance of the deadline information.
2.3.1
Introduction
The use of digital computers for such applications as aircraft, automated factories and nuclear reactors has become commonplace due to the availability of inexpensive, powerful computers and the increasing need to control more sophisticated processes. A traditional control system consists of simple components, such as an electrical, mechanical, or hydraulic hard-wired controller, in the feedback loop of the controlled process. These controllers have the virtue of simplicity in designing, implementing, and evaluating their performances. By contrast, recent and
future developments — which are likely to be intrinsically unstable to achieve other purposes — such as a highly fuel-efficient "fly-by-wire" aircraft demand the controllers to react fast enough to maintain stability. Digital computers are thus playing a more important role in maintaining safety-critical functions while improving control system performance significantly.
Computers for controlling life-threatening processes must be highly reliable to avoid any catastrophe. For example, the computer used in commercial aircraft requires a failure probability lower than 10^-10 per hour. To guarantee the safe operation of these processes in the presence of computer-component failure(s), some form of fault tolerance should be provided with the controller computer. Fault-tolerant controller computers are far more complex and difficult to design than non-fault-tolerant counterparts due to the stringent reliability required and their interaction with the operating environment as well as the controlled processes. When designing these fault-tolerant controller computers, one must resolve such issues as control-task scheduling and allocation, optimal queue control, and optimal redundancy management, and also must specify and evaluate the controller's performance in the context of the controlled processes.
A real-time control system may fail not only because of the slow execution of critical tasks but also because of massive hardware or software component failures. The design and evaluation of highly-reliable controller computers gets complicated due to the difficulty in expressing the needs of the controlled processes and due to the difficulty in predicting the execution time of control programs in the presence of component failure(s).
A controller computer calculates the control input by executing a sequence of instructions, thereby introducing an unavoidable delay — called the computation-time delay — to the controlled process. This is an extra delay in addition to the system delay commonly seen in the control literature. Computation time is an important component of the delay in the feedback loop, which also includes contributions related to measurement or sensing, A/D and D/A conversion, and actuation. However, these other delay contributions are usually constant, and thus easy to model. Due to data-dependent loops and conditional branches, and contention in resource sharing during the execution of programs that implement
control algorithms, the computation-time delay is a continuous random variable which is usually much smaller than one sampling period, Ts, if no failure occurs to the controller computer. When a component failure or environmental disruption such as an electromagnetic interference (EMI) occurs, the time taken for error detection, fault location, and recovery must be added to the execution time of a control program, thus increasing the computation-time delay significantly. This increase in task execution time seriously degrades the system performance and may even lead to a dynamic failure by either violating the necessary conditions for system stability or causing the system to leave the region of the state space defined by the specification (i.e., the allowed state space) if the delay exceeds a certain limit called the hard deadline [18]. The hard deadline can formally specify the need of the controlled process in a form that is understandable to the controller computer. The notion of hard deadline thus captures the interplay between a controlled process and its controller computer.
This section presents a method for deriving the hard deadline of a real-time control system, which is defined as the maximum tolerable duration of a controller-computer failure, by using the dynamic equation of the controlled process, information about failure occurrence rates, environment characteristics, fault-tolerance features, and properties of the control algorithms employed. The dynamic equations are modified to account for the presence of some controller-computer failures that result in computation-time delays. Using these modified equations, we compute the hard deadline iteratively by examining the necessary conditions for (asymptotic) system stability or the residence of system states in the allowed state space.
We also extend this method to cover the case of nonzero error latency [19] owing to an imperfect error detection coverage. There may be control input disturbances due to erroneous computation during the error latency, in addition to the computation-time delay incurred due to the occurrence of component failure or environmental interference.
To demonstrate the application of the derived deadline information, we include (i) a design example that optimizes time-redundancy recovery methods such as retry or rollback, and (ii) an evaluation example that assesses the system reliability by using a Markov chain model augmented with new states representing system inertia.
The rest of this section is organized as follows. Section 2.3.2 surveys the previous work related to the reliability or safety of control systems. In Section 2.3.3, we address the generic problem for qualitative analysis of computation-time delay and/or input disturbance effects and review the basic definition of a hard deadline in real-time digital control systems. Section 2.3.4 presents a method for modifying the state difference equation in the presence of the computation-time delay only, and then analyzes system stability by examining the pole positions of the modified state difference equation. In Section 2.3.5, we extend the method proposed in Section 2.3.3 to the case covering the effects of input disturbances. Hard deadlines are derived both deterministically and stochastically. In Section 2.3.6, the hard deadline associated with the one-shot event model is derived, when even one occurrence of a computer failure for a long period may drive the controlled process out of its allowed state space, i.e., a dynamic failure occurs. In Section 2.3.7, several simple linear systems are examined to demonstrate our approach. Section 2.3.8 deals with the application of the hard-deadline information by presenting not only a design example that optimizes time redundancy such as retry or rollback, but also an evaluation example that assesses the system reliability by using the derived deadline information.
2.3.2
Related Work
Failures in actuators, sensors, mechanical parts, A/D and D/A converters may also induce a system failure. In [14], a mathematical framework was presented to describe the interactions between the detection/isolation devices for component³ failures and the reconfiguration of control algorithms. The authors of [22] focused on the design of fault-tolerant control systems to enhance system reliability. In contrast to the work cited above, the main intent of this chapter is

³That is, actuator, sensor, or computers.
to (a) analyze the interplay between a controlled process and its fault-tolerant controller computer, and (b) derive the deadline information of the controlled process that is useful for the design and evaluation of the controller computer.
Several researchers have attempted to analyze the effects of computation-time delay on the performance or stability of a control system. The authors of [24] considered the qualitative effect of feedback delay on a multivariable, computer-controlled linear system by proposing an algorithm to compensate for the delay effect. The sufficient (necessary) conditions of stability with a feedback delay and the delay effects on quadratic performance indices were presented for a linear control system [3] (for a nonlinear robot control system [16]). In [15], a more detailed analysis of the stability of a digital control system with a feedback delay was carried out by modifying the state difference equation. However, all of these analyses are based on the assumption that the feedback delay is fixed or constant. Although the stability problem with a variable feedback delay was investigated in [4], it was still based on a regular and periodic (thus deterministic) pattern of delays. In [2], a control system with a random time-varying delay in the feedback loop was modeled with a stochastic-delay differential equation, and sufficient conditions were derived for the almost-sure sample stability under which almost every possible differential equation for an ensemble of stochastic systems has a stable solution. However, it did not show any explicit relation between the performance (or stability) and the magnitude of delay but, instead, gave a condition on the coefficients of the state equations and the average rate of change of delay for sample stability. Furthermore, this work assumed a delay to be bounded by the 'worst-case' inter-sample period. In [18], the hard deadline in controlling the elevator deflection of the aircraft landing problem was obtained numerically by using the concept of allowed state space.
The problem associated with disturbances to the control input was treated by analyzing the observability of a linear system with a dynamic feedback controller under unknown disturbances in the control input [5] and by experimentally testing the functional error modes of computer-based control systems in a "harsh" operating environment [1].
Figure 2.3.1: A computer-controlled system. (The controlled process, with input u(t) and output y(t), is sampled through an A/D converter; the controller computer uses x(kTs) and the reference r(kTs) to compute u(kTs), which is applied to the process through a D/A converter and latch circuit.)

2.3.3
The Generic Problem of Controller-Computer Failures
As shown in Fig. 2.3.1, a linear time-invariant controlled process is generally represented by the vector difference equation as shown in Eq. (2.3.1) and is equipped with a well-designed controller that stabilizes the overall control system and optimizes the given control objective:
x(k+1) = A x(k) + B u(k),   (2.3.1)

where k is the time index, one unit of time represents the sampling interval Ts, and x ∈ R^n and u ∈ R^r are the state and input vectors, respectively. The coefficient matrices, A ∈ R^{n×n} and B ∈ R^{n×r}, are obtained from those of the corresponding continuous-time model:

A = e^{A_c T_s},   B = \int_0^{T_s} e^{A_c (T_s - s)} B_c \, ds,   (2.3.2)

where A_c and B_c are the corresponding coefficient matrices of the continuous-time model. The (digital) controller computer reads sensor values, compares them with desired values, and calculates the control input once every Ts seconds
according to a programmed control strategy. The control input, which is held constant within each sampling interval by a latch circuit, is applied to the continuous-time controlled process.
Eq. (2.3.1) must be modified to account for the delays associated with the measurement or sensing, A/D and D/A conversion, and the execution of control algorithms. The sum of these delays is usually much smaller than one sampling period, Ts, in the absence of computer failure(s) or external interferences like an EMI (called the delay problem in [16]). For the delay problem, where the magnitude of delay, Δ, is smaller than Ts, one can change the state equation (2.3.1) to:

x(k+1) = A x(k) + B_1 u(k) + B_2 u(k-1),   (2.3.3)

where B_1 = \int_Δ^{T_s} e^{A_c (T_s - s)} B_c \, ds and B_2 = \int_0^{Δ} e^{A_c (T_s - s)} B_c \, ds.
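If one wants to evaluate Eqs. (2.3.2)-(2.3.3) numerically, the following Python sketch (using NumPy/SciPy, and assuming the B1/B2 split reconstructed above rather than anything taken verbatim from the chapter) approximates the integrals by a simple Riemann sum.

import numpy as np
from scipy.linalg import expm

def discretize_with_delay(Ac, Bc, Ts, delta, steps=1000):
    """Zero-order-hold discretization of dx/dt = Ac x + Bc u with a
    computation-time delay delta (0 <= delta < Ts), following the split
    x(k+1) = A x(k) + B1 u(k) + B2 u(k-1)."""
    A = expm(Ac * Ts)
    ds = Ts / steps
    B1 = np.zeros(Bc.shape)
    B2 = np.zeros(Bc.shape)
    for i in range(steps):
        s = (i + 0.5) * ds                      # midpoint of the sub-interval
        term = expm(Ac * (Ts - s)) @ Bc * ds    # e^{Ac (Ts - s)} Bc ds
        if s < delta:
            B2 += term                          # old input u(k-1) is still latched
        else:
            B1 += term                          # new input u(k) has taken effect
    return A, B1, B2

Note that B1 + B2 equals the delay-free input matrix B of Eq. (2.3.2), which is a quick consistency check for any implementation of this kind.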
As mentioned earlier, digital computers are highly susceptible to transient EMI such as lightning, high intensity radio frequency fields (HIRF), and nuclear electromagnetic pulses (NEMP). The main problem caused by EMI is functional error modes — or computer failures — often without component damage. When a computer failure occurs and it is detected upon its occurrence (i.e., with a zero error latency measured from the occurrence to the detection of an error [19]), a certain recovery process is invoked.⁴ Then, the computation-time delay resulting from the corresponding recovery actions may be large relative to Ts. This was termed the loss problem in [16].
Note that detection schemes are, in reality, imperfect (i.e., the detection probability is smaller than one), resulting in a non-zero error latency. During this error latency, the control input may be updated erroneously until the fault inducing the computer failure disappears, or the computer failure is detected and handled properly. Suppose a computer failure occurs at time k0, it is detected n1 sampling intervals after its occurrence, and the subsequent recovery takes n2 sampling intervals, during which the control input will be held constant at u(k0 + n1 + 1) by the D/A converter and latch circuits.

⁴A fault-tolerance mechanism consists of error detection, fault location, system reconfiguration, and recovery. General recovery processes are retry, rollback, and reconfiguration, each of which takes a finite time. See [21] for a more detailed account of fault-tolerance mechanisms.
The control inputs during this period are formally represented as:

u(k0+1) I_Δ, u(k0+2) I_Δ, ..., u(k0+n1) I_Δ, u(k0+n1+1), u(k0+n1+1), ..., u(k0+n1+1), u(k0+n1+n2+1), ...,

where I_Δ is a diagonal matrix with Diag[I_Δ]_ii = 1 + Δu_i, and Δu_i is a random sequence which is modeled by the output of a dynamic system with a white-noise input.
Since faults/interferences occur randomly during the mission lifetime, their occurrences are considered stochastic perturbations to the controlled process, which can be modeled depending on the fault characteristics. When the environment is assumed to be stochastically stationary, the occurrence and duration of computer failure(s), and the magnitude of disturbances in the control input, can be represented by several probability density functions. The relative frequency of disturbance and delay due to such computer failure(s) depends upon the coverage (the probability of detecting an error induced by an arbitrary fault), which is determined by failure-detection mechanisms [19].
Stationary occurrences of controller-computer failures/interferences not only degrade the performance of the controlled process but may also lead to loss of system stability if their active duration exceeds the hard deadline. Even one occurrence of this abnormality with a long active duration in the one-shot event model may make the system leave its allowed state space [17]. The hard deadline of the stationary model is defined as the maximum allowed duration of the failure-handling process without losing system stability [17]. More formally, we have the following definition.

Definition 1: Let x_e denote an equilibrium state of the system represented by Eq. (2.3.1). Then, x_e is said to be stable at time k0 if for each ε > 0 there exists a δ(ε, k0) > 0 such that

||x(k0) − x_e|| < δ  ⟹  ||x(k) − x_e|| < ε,  ∀ k ≥ k0.   (2.3.4)

The equilibrium state x_e is said to be asymptotically stable at time k0 if it is stable at time k0 and there exists a δ1(k0) > 0 such that

||x(k0) − x_e|| < δ1(k0)  ⟹  ||x(k) − x_e|| → 0  as  k → ∞.   (2.3.5)
In linear time-invariant systems, stability can be checked simply by using the pole positions of the controlled process in the presence of random computer failures. Using this information one can derive hard deadlines stochastically or deterministically with the sample(s) and the ensemble average of the controlled process:

D(N) = \inf_{C_{env}} \sup \{ N : \| \lambda(N) \| < 1 \},   (2.3.6)
where \lambda(N) is the eigenvalue of the controlled process in the presence of computer failures of the maximum duration N T_s, and C_{env} represents all the environmental characteristics that cause controller-computer failures.

Consider a state trajectory evolved from time k_0 to k_f. Let X_A(k) and U_A be the allowed state space at time k and the admissible input space, respectively. If a computer failure, which occurred at k_1 (k_0 < k_1 < k_f) and was detected N_1 T_s later, is recovered within N_2 T_s, then the control input during these N = N_1 + N_2 sampling periods is:

u^A(k) = u(k) I_\Delta \, \Pi_{k_1}(N_1) + u(k_1 + N_1) \, \Pi_{k_1+N_1}(N_2), \quad k_1 \le k \le k_1 + N,

where \Pi_m(n) is a rectangular function on the interval [m, m+n], i.e., \Pi_m(n) = \mu(k - m) - \mu(k - m - n), where \mu is the unit step function. Then, the system hard deadline during [k_0 T_s, k_f T_s] is also defined as:

D(N, x(k_0)) = \inf_{C_{env}} \sup \{ N : \phi(k, k_0, x(k_0), u^A(k)) \in X_A(k), \; k_0 \le k \le k_f \},   (2.3.7)

where the state trajectory is assumed to evolve as:

x(k) = \phi(k, k_0, x(k_0), u^A(k)).   (2.3.8)

2.3.4 Deadlines of Stationary Model Due to Feedback Delays
In our model, the controlled processes represented by Eq. (2.3.1) are usually unstable in the absence of feedback control, i.e., the magnitude of at least one eigenvalue of A is greater than one.
For example, the aircraft designer may push his design to the edge of instability to improve the fuel efficiency of a future aircraft, where fast, accurate, and consistent control is required [18]. The state-feedback control input that stabilizes such unstable systems can be calculated by using the observed (or estimated) states according to their own control objectives, such as time-optimal control with an energy constraint, optimal state-tracking, and the optimal linear regulator [12]. Suppose the feedback control input is computed by u(k) = -F x(k), where the feedback matrix F depends upon the control objective used. We want to derive necessary conditions under which the system will remain (asymptotically) stable even in the presence of random failures of the controller computer. These conditions require the knowledge of the system dynamic equations, the control algorithm to be used, and the environmental characteristics. Consider a simple controlled process represented by a linear time-invariant differential equation, which can be converted to a discrete-time problem by using Eq. (2.3.2), and a quadratic performance index:
J = \frac{1}{2} \left\{ \sum_{k=k_0}^{k_f - 1} \left[ x^T(k) Q x(k) + u^T(k) R u(k) \right] + x^T(k_f) Q x(k_f) \right\},   (2.3.9)

where the matrices Q \in \mathcal{R}^{n \times n} and R \in \mathcal{R}^{m \times m} are positive semidefinite and positive definite, respectively, and are determined by the control objective of interest. The optimal control input is calculated by minimizing J while maintaining system stability, that is, u(k) = -F x(k), where the state feedback gain matrix F is obtained by solving a discrete Riccati equation [12].

Let the control input have been updated at t = m N T_s. If the control input had not been updated for i (0 < i \le N) sampling periods due to a long computation-time delay, the corresponding state equations for the group of intervals during which the system failed to update the control input become:

x(mN+1) = A x(mN) + B u(mN)
x(mN+2) = A^2 x(mN) + (A + I) B u(mN)
Figure 2.3.2  Time index for augmented vectors representing the stationary occurrence of computer failures (delays) when N = 3.
...
x(mN+i) = A^i x(mN) + \sum_{j=0}^{i-1} A^j B u(mN)
x(mN+i+1) = A^{i+1} x(mN) + \sum_{j=1}^{i} A^j B u(mN) + B u(mN+i)
...
x((m+1)N) = A^N x(mN) + \sum_{j=N-i}^{N-1} A^j B u(mN) + \sum_{j=0}^{N-i-1} A^j B u(mN + N - j - i),   (2.3.10)
where m is the time index for the groups of N sampling intervals each. Let X(m) = [x_1, x_2, \ldots, x_N]^T = [x(mN+1), x(mN+2), \ldots, x((m+1)N)]^T and U(m) = [u_1, u_2, \ldots, u_N]^T = [u(mN+1), u(mN+2), \ldots, u((m+1)N)]^T. That is, X(m) and U(m) are respectively the augmented state and control vectors at the group of sampling intervals during which the system failed to update the control inputs (see Fig. 2.3.2). When the delay is equal to i sampling periods, the following augmented state equations result:

X(m+1) = A_D X(m) + B^1_{D_i} U(m) + B^2_{D_i} U(m+1),   (2.3.11)
U(m) = -F_D X(m),   (2.3.12)
where A_D, F_D, B^1_{D_i}, and B^2_{D_i} are the augmented block matrices obtained by stacking Eq. (2.3.10) over one group of N sampling intervals: A_D contains the powers A, A^2, \ldots, A^N in its last block column, F_D is built from the feedback gain F, and B^1_{D_i} and B^2_{D_i} collect the input terms B, AB + B, \ldots, \sum_j A^j B of Eq. (2.3.10).   (2.3.13)
Suppose the occurrence of computer failure(s) is binomially distributed with parameter P at each sampling time. Let q_0, q_1, \ldots, q_N be the probabilities of delays 0, T_s, \ldots, N T_s, respectively, such that \sum_{i=1}^{N} q_i = P, where the maximum delay is assumed to be N T_s. This delay distribution can be derived in practice from the knowledge of the failure-handling policy and environmental characteristics. Since the delay is equal to the time interval from failure occurrence/detection to its recovery, called the Fault-Tolerance Latency (FTL), it can be measured via fault-injection experiments or can be evaluated analytically by a certain pdf (probability density function) considering all fault-tolerance features [9]. From the experimental samples or the fitted pdf of the FTL, one can estimate the parameters of the multinomial distribution (the q_i's). By combining the state equations (2.3.11) and (2.3.12) with the delay distribution, the state equation including the effects of delays no greater than N T_s becomes:

X(m+1) = A_D X(m) + \sum_{i=0}^{N} \xi_i \left( B^1_{D_i} U(m) + B^2_{D_i} U(m+1) \right),   (2.3.14)
where \xi_i \in \{0, 1\} is a binomially distributed random variable with parameter q_i, i.e., \Pr[\xi_i = 1] = q_i. Then, the first moment of Eq. (2.3.14) is:

X(m+1) = A_D X(m) + \sum_{i=0}^{N} q_i \left( B^1_{D_i} U(m) + B^2_{D_i} U(m+1) \right).   (2.3.15)
Since the period of the index m is N, the complex variable z in the z-transform of Eq. (2.3.15) corresponds to z_k^N, where z_k is the complex variable in the z-transforms of equations with index k; that is, the period of Eq. (2.3.15) is \omega/N in the frequency domain, where \omega indicates the period of equations with index k in the frequency domain (sub-sampling). Using Eqs. (2.3.12) and (2.3.15), the characteristic equation of the control system in the presence of stationary occurrences of feedback delay is represented by:

\det \left[ \left( I + \sum_{i=0}^{N} q_i B^2_{D_i} F_D \right) z^N - A_D + \sum_{i=0}^{N} q_i B^1_{D_i} F_D \right] = 0.   (2.3.16)
The asymptotic stability can be tested with the pole positions of Eq. (2.3.16), whose characteristic equation reduces to a simple form due to the simple structure of A_D despite its augmented dimension. The characteristic equation of the zero-delay case (i.e., q_0 = 1 and thus P = \sum_{i=1}^{N} q_i = 0) is:

\det \left[ z^N I - (A - BF)^N \right] = 0.   (2.3.17)
Further, one can get the following characteristic equation for the worst case in which q_N = 1, or the control input is updated only once every N sampling intervals due to a periodic delay of active duration N T_s:

\det \left[ z^N I - A^N + \sum_{i=0}^{N-1} A^i B F \right] = \det \left[ z^N I - Q \left( D^N - \sum_{i=0}^{N-1} D^i Q^{-1} B F Q \right) Q^{-1} \right] = 0,   (2.3.18)

where A = Q D Q^{-1} by the similarity transformation, D = diag[d_1, \ldots, d_n], and the d_i's are the eigenvalues of A. The conditions for asymptotic stability are used to derive the hard deadline. Let N T_s and D T_s be the assumed maximum and the actual maximum delays, respectively. Then, the hard deadline can be obtained by iteratively testing the necessary conditions for system stability in the modified characteristic equation (2.3.16) while changing N from 1 to D.
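A minimal sketch of this iterative test, restricted to the worst-case form of Eq. (2.3.18): with a periodic delay of N sampling periods, the control input u(mN) = -F x(mN) is held over the whole group, so the group-to-group transition matrix is M_N = A^N - \sum_{i=0}^{N-1} A^i B F, and stability holds while the spectral radius of M_N stays below one. The matrices A, B, and F are assumed given; this is an illustration, not the chapter's implementation.

```python
# Sketch (assumed A, B, F): largest delay N (in sampling periods) for which the
# worst-case characteristic equation (2.3.18) still has all roots inside the
# unit circle, obtained by checking the spectral radius of
# M_N = A^N - sum_{i=0}^{N-1} A^i B F for N = 1, 2, ..., D_max.
import numpy as np

def max_stable_delay(A, B, F, D_max):
    """Return the largest N in 1..D_max with rho(M_N) < 1, or 0 if none."""
    last_stable = 0
    for N in range(1, D_max + 1):
        S = sum(np.linalg.matrix_power(A, i) @ B @ F for i in range(N))
        M = np.linalg.matrix_power(A, N) - S
        if np.max(np.abs(np.linalg.eigvals(M))) < 1.0:
            last_stable = N
        else:
            break
    return last_stable
```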
2.3.5 Deadlines in the Presence of Delays/Disturbances
The hard-deadline information is also derived for a linear time-invariant controlled process in the presence of disturbances as well as delays in the control input. We use the same method as in Section 2.3.4, i.e., iteratively testing for the stability of the modified (stochastic) dynamic equation while changing N from 1 to D, where N T_s and D T_s are redefined as the assumed maximum and the actual maximum duration of delays and/or disturbances, respectively. To modify the given state equation (2.3.1) by incorporating all stochastic behaviors of controller-computer failures, we need the following definitions and assumptions:

•  d is the conditional probability of successful detection given that a computer failure has occurred.

•  q_i^s is the conditional probability of an input disturbance lasting i sampling intervals (\sum_{i=1}^{N} q_i^s = 1) if a computer failure occurs and is not detected until its disappearance. It can be estimated by using experimental data or an analytic model of the error latency.

•  g_{\Delta u} is the probability density function (pdf) of \Delta u, the magnitude of a control input disturbance at time k T_s, i.e., u_{actual}(k) = u_{desired}(k) I_\Delta. The mean and variance of \Delta u are given as \mu_{\Delta u} and \sigma^2_{\Delta u}, respectively.

•  The probability that two transient failures occur sequentially within a small number of sampling intervals, (N - i) T_s, where the delay (recovery duration) or the duration of the erroneous control input (active duration of a transient failure) is i sampling intervals and N T_s is the assumed maximum value of such intervals, is so small as to be ignored. Thus, we consider only one computer failure possible during a group of N intervals.

•  Any random sequence is assumed to be independent and identically distributed (i.i.d.) over the time index k.
The control input at (mN+i) T_s is updated to be u(mN+i) I_\Delta for the disturbance case or u(mN) for the delay case. We then obtain the augmented state equations by using the augmented state and control vectors for a group of N sampling intervals:

X(m+1) = A_D X(m) + B^1_{D_i} U(m) + B^2_{D_i} U(m+1),   (2.3.19)
U(m) = -F_D X(m),   (2.3.20)
where [B^1_{D_i}, B^2_{D_i}] becomes [B^1_{D_0}, B^2_{D_0}] for normal behavior, [B^{1d}_{D_i}, B^{2d}_{D_i}] for the delay case, and [B^{1s}_{D_i}, B^{2s}_{D_i}] for the disturbance case, respectively. (In [7], these augmented matrices are described in full.) Eqs. (2.3.19) and (2.3.20) are, in turn, modified by combining the random sequences representing the behavior of computer failures:

X(m+1) = A_D X(m) + \left( (1-\psi) B^1_{D_0} + \psi(1-\phi) \sum_{i=1}^{N} \zeta_i^s B^{1s}_{D_i} + \psi \phi \sum_{i=1}^{N} \zeta_i^d B^{1d}_{D_i} \right) U(m) + \left( (1-\psi) B^2_{D_0} + \psi(1-\phi) \sum_{i=1}^{N} \zeta_i^s B^{2s}_{D_i} + \psi \phi \sum_{i=1}^{N} \zeta_i^d B^{2d}_{D_i} \right) U(m+1),   (2.3.21)

where \psi, \phi \in \{0, 1\} are binomially distributed random sequences with probabilities P and d, and \zeta_i^s, \zeta_i^d \in \{0, 1\} are multinomially distributed random sequences with probabilities q_i^s and q_i^d, i.e., \Pr[\zeta_i^s = 1] = q_i^s and \Pr[\zeta_i^d = 1] = q_i^d. The asymptotic stability of Eq. (2.3.21) can be examined deterministically or stochastically.

A. Deterministic Approach

Similar to the method used in Section 2.3.4, the deterministic value of the hard deadline is obtained by examining the pole positions of the first moment (ensemble average) of Eq. (2.3.21). Although the resulting hard deadline has little practical meaning, it can show the trend of the ensemble system behavior with an uncertainty (in the state and output) which can be characterized by the second moment of Eq. (2.3.21). The first moment of Eq. (2.3.21) is:

X(m+1) = A_D X(m) + \left( (1-P) B^1_{D_0} + P(1-d) \sum_{i=1}^{N} q_i^s B^{1s}_{D_i} + P d \sum_{i=1}^{N} q_i^d B^{1d}_{D_i} \right) U(m) + \left( (1-P) B^2_{D_0} + P(1-d) \sum_{i=1}^{N} q_i^s B^{2s}_{D_i} + P d \sum_{i=1}^{N} q_i^d B^{2d}_{D_i} \right) U(m+1).   (2.3.22)
The characteristic equation of the zero-delay case (i.e., no computer failure, or p = 0) is; det [;•''/ - (.4 - BFf] = 0, (2.3.24) where the magnitudes of eigenvalues are equal to those obtained from: dei[zI-{A-
BF)] = Q
(2.3.25)
which is the characteristic equation of the original state equation (Eq. (2.3.1)) in the absence of computer failures. B. Stochastic Approach The effectiveness of the deterministic approach decreases as the variance of q^iu gets large. In such a case, we can derive the probability mass function (prnf) of the hard deadline with respect to q^u rather than the deterministic value of the hard deadline based on the mean of q/^u- Now, the mapping between the hard deadlines and the magnitudes of disturbances (Au's) is not one-to-one, and the hard deadlines can be derived numerically for each sample value of Aw's. In all but simplest cases it is impossible to derive a closed-form expression for the prnf of the hard deadline or the exact relation between the hard deadline and AM. The method we use is therefore to quantize uniformly the ^AU continuum in the interval [a, 6], where / q^^^dAu = 7. Let this quantization result in M equal-length subintervals (cells). There is a tradeoff between the accuracy and the amount of computation in determining 7 and M, and a and b are determined
93 appropriately according to 7. Then, points are allocated to the quantized cells, and let the point of the i-ih cell ([a + (i— 1 ) ; ^ , a + ? j ^ ] ) be Au, which corresponds to the midpoint of the cell, i.e., Aui — a + ~',\_l\ • then the probability a + i-M-
of the point is calculated as 5^^ = J ,, ^T.JUL QAuis)!^^.
A hard deadline is
derived for each Aui, and let it be Di whose probability is equal to that of q'^^. F'inally, the pmf
of the hard deadline is derived numerically by multiplying
Di and q'^^^, 1 < « < M . The accuracy of the resulting pmf
depends on o . t ,
and M, which must be determined by considering the controlled-process state equations, the environment, and the amount of computation. Although the above method uses a stochastic approach in deriving the
pmf
of the hard deadline, it is still based on the mean values of binomially- and multinomially- distributed random sequences since the hard deadlines of all possible samples cannot be derived due to the excessive number of possible samples. However, the stability of each individual system (i.e.. samples) has more practical meaning than the stability of the average system or the ensemble of all possible systems which might be built. Thus, in addition to the deterministic analysis (or combined with the stochastic analysis) of the averaged system stability, we will a t t e m p t to stochastically analyze system stability by using the almost-sure
sample stability concept (which is actually almost deterministic).
D e f i n i t i o n 2:
•
Probabilistic
Stability
: For every pair of positive numbers a and 6, there
exists a positive number d(a,b.to) P[sup 11^(11 > a]
such that
for xt„ such that ||i-,J| <
rf,
(2.3.26)
«>«o
where Xf = {x{s) : to < s < t} is a. segment of the past history. •
Almost-Sure
Stability:
For every pair of positive numbers a and 6, there
exists a positive number rf(a,6,
sup ||i;t|| ) > a] < b. \t>to
(2.3.27)
94 No control u p d a t e during this period
y
failure(s) occur ^'
time t
l - ^ - « k :
0
F i g u r e 2.3.3
1
successful recovery
^
—I ko ko + I
^
)(-
•
. , — H
ko + N
Time index for the one-shot event/delay model.
In fact, the almost-sure sample stability means that almost every possible difference equation for a given ensemble of such systems has a state which is stable in the Lyapunov sense.
2.3.6
One-Shot Event Model
T h e pole locations do not change in case of only one failure with a relatively long ( > T,) active period (Fig. 2.3.3). The (asymptotic or global) stability condition discussed thus far is therefore no longer applicable.
Instead, the
terminal state constraints can be used to test whether or not the system leaves its allowed state space. Note that every critical process must operate within the state space circumscribed by given constraints, i.e., the allowed state space. When the control input is not updated correctly for a period exceeding the hard deadline, the system may leave the allowed state space, thus causing a dynamic failure. T h e allowed state space consists of two sets of states A'^ and X\
defined
as follows:
•
X\:
the set of states in which the system must stay to avoid an
immediate
dynamic failure, e.g., a civilian aircraft flying upside down is viewed as an immediate dynamic failure. This set can usually be derived o priori
from
the physical constraints. •
X\:
the set of states that can lead to meeting the terminal constraints
with appropriate control inputs. This set is determined by the terminal constraints, the dynamic equation, and the control algorithm used.
95 The system must not leave X\ nor X]^ in order to prevent catastrophic failure. Assuming that some computer failure may not be detected upon its occurrence but every detected failure can always be recovered successfully, we can consider three cases for the analysis of the effects of computer failures during a finite time: (i) delay: when a computer failure is detected upon its occurrence, (ii) disturbance: when a computer failure is not detected until its disappearance, and (iii) disturbance and delay: when a computer failure is detected at some time after its occurrence but before its disappearance. Let ko,kf, Ni, and A'^2 denote the indices for the failure occurrence time, the mission completion time, and the period of disturbance, the period of delay measured in sampling periods, respectively, where N = N1 + N2, 0 < Ni,N2 < N. The dynamic equation for a one-shot event model is; - /)nte+N,(^2)] (2.3.28) where TIkgiN) is the rectangular function as defined in Section 2, and A'^i and A'^2 are random variables and determined by the conditional probability of successful detection (d) \i N is given; :E(^+1)
= Axik)+B
PT[NI
[u{k) + (uiko) - uik))Uko(Ni) +
- i] =
Fi[N2 = i] = Pr[7Vi=7V]
=
d(l -dy
«(A:)(/A
0
d{l - d)^-'
N
-I
l
Pr[7Vi = 0 ] = l - r f ( l - r f ) ^ - ^
(2.3.29)
Thus, the first moment of Eq. (2.3.28) is; /
x{k+i)
=
Ax{k) + V
N
Bluik)+Y^[qi{u(ko)-u{k))nk,{Ni) »=0
+ qr^mi^ - mko+N,{N2)]).
(2.3.30)
Using Eq. (2.3.30), one can derive the states at time ko + N and kj, and examine whether or not the state trajectory satisfies the immediate and terminal constraints, iteratively for each N {I < N < D). ko+N-l
x{ko + N)
=
A^x{ko)+
Yl i=ko
A''°+^-'-^B[qiu{ko)
+
qTuii)I^],
96 fco + N - l « = A;o
ki-\ i = ko+N kf-1
=
A'''-'"'-^x{ko
+ N)+
Y,
In addition to this deterministic approach, the pmf
A''^-'-'^Bu{i).
(2.3.31)
of the hard deadline t h a t
depends on q^^u is also derived by using the same stochastic approach employed in the stationary model. Without using the first moment of the state equation (2.3.30) the probability of the hard deadline being N (i.e., Pr[A'^ = D]) is thus obtained as the sum of the probabilities^ of the sample equations (2.3.28) in which a dynamic failure occurs at time N. One must continue this process until a dynamic failure occurs for all samples (i.e., Pi[N = D] = l ) . T h e following pseudo-code is used to iteratively derive the hard deadline for the system with the initial state XQ at time ko, where Td is the duration of abnormality due to a computer failure.
\+ Recursive testing the allowed state space while increasing Td *\ m = I \* sub-samphng period, mA *\ D = integer.time{Nttm) \* integer times of T, *\ if (D < Num) t h e n while (m ^ Ts/A or test^constraintsO ?iTRUE) d o Td := {D- l)Td + mA it test.constatnts{Td) then return Td \* hard deadhne is Tj *\ else m := m -|- 1 \* test larger Td by increasing m *\ end_whlle else return no hard deadline
integer-time(N(,m) N =1 while {N # Nt„n or test.constaints() Td := NT, ' N^ testings are required for each N.
?^TRUE) d o
97 if test.consiaints{Td)
then return A^ \* D = N max is obtained *\ \* test larger N *\
else A^ := JV + 1 end_while test.constraints{Td) compute x({ko + N)Ts) from xo if (xiiko + N)Ts) e X i ) then begin compute x[kjTs) from either x{[ko + N)Tg) or xo if(x{kfTs) 6 X{) then return FAILURE else return TRUE end else return TRUE Xf € X^ can be tested indirectly by the following relation: x{kf)
eX{<^:^
x{ko + N) e Xl,
(2.3.32)
kj-l where X^
^ { x \ [A'''-''"~^X
+
^
A''^-"'Bu{i)]
G A'^ }.
i = ko+N
In practice, it is difficult to obtain X ^ .
Although there may be a one-to-
one mapping between x(ko + N) and x{kj),
X'^ is usually a continuum, which
requires an excessive amount of computation due to the curse of dimensionality. The size of X\ size of X\
will decrease as either A' increases or k approaches kj, but the
is usually larger than that of Xj^ due to the (asymptotic) stability
of a controlled process.
2.3.7
Examples
To demonstrate the concept of hard deadline, we derive the deadlines for two simple, yet practical, example control systems. E x a m p l e 7.1: To show the hard deadline of a linear time-invariant control system in the presence of stationary occurrences of input delays/disturbances due t o computer failures, we consider a simple controlled process: xi{k+l)
where the coefficient matrices of a quadratic performance index are given as: ^xr
—
10 0
0 10
•>
^Miu
5
0
0
5
One can derive the optimal feedback control gain matrix F that stabilizes the controlled process by solving a discrete Riccati equation as: 3.1251 -1.0791
0.3090 0.5512
This feedback control changes the system eigenvalues from {0.95,11.02} to {0.0777,0.2101}. We then derived deterministically the change of poles as a result of iterativeiy incrementing N for the occurrence of the largest delay possible [p = q^ = 0.045). The results are given in Table 2.3.1, where the first case represents the perfect coverage (d = 1) and the second case represents the presence of an input disturbance {d = 0.9 and fi^-u = —5). The deterministic value of the hard deadline is D = 6r, in the absence of input disturbances under instant failure detection, whereas it decreases to D = 5T, with some (infrequent) input disturbances. The pmf of hard deadline is plotted in Fig. 2.3.4 along with the pmf of the magnitude of input disturbances. Example 7.2: The physical meaning of hard deadline based on the one-shot delay model (with the assumption of perfect detection coverage) can be explained with the real-time controller of a robotic manipulator, where the obstacles in the robot's workplace are translated into state constraints. That is, the system states (robot's positions) must be constrained to avoid collision with the obstacles. In addition to these state constraints, there are usually control input constraints due to the bounds on joint motor torques and energy. In [6], a point
99
-9
0
Input disturbance (Au) -6 -5 -4 -3
-8
-7
1
1
1
1
1
1
-z 1
-1
0.30
0
:
0
-
-
-
-
0.27 0.S4 0.21
0
So go
0.18; 0.15 I
I 1°
Au a
/
\
0.12 "
b
i
0.09
n
•
D
0,
0.06
I"
_j
0.03
0, 1
0,
3
0,
I
1
i /
1
0.00
4 5 6 7 Hard deadline [ft] (D)
F i g u r e 2.3.4 P r o b a b i l i t y m a s s f u n c t i o n s of A u , w h i c h is g i v e n a priori, D, w h i c h is d e r i v e d .
and
robot of Cartesian-coordinate class weis modeled by a set of decoupled double integrators: i V
' 0 /' 0
u
X V
+
' 0
0 ' ' 0 '
1 0
u
where x, v, and u are 2-dimensional position, velocity, and control vectors, respectively, and I £ "R.2 x 2 . The robot's end-effector is guided by the control input determined by (2.3.34) mm Vd-w| 2 subject to the control constraints |w,| < u^"^ for i — 1,2, and the state constraints specified by the obstacles. The Obstacle Avoidance Strategy (OVS) used in [6] considers the dynamic environment and real-time control needs, vs^here the obstacle-related state constraints are transformed into state-dependent control constraints by mapping each end-effector's position relative to an obstacle into the P-funciionals:
where the obstacle is assumed to be centered at the origin and covered by a circle of radius r.
100 / y
4 obstacles (circles)
o
Xi
delay for 3 -
.-^
s f
^ o/& 1,2,3/^
/
delay for 3
1
1
1
3
3 xl(k)
F i g u r e 2,3.5 State trajectories for a point robot with four obstacles, 1) in the absence of delay and 2,3) in the presence of delay equal to DTs.
The k in the P-functional must be chosen such that the set of permissible controls remains nonempty, and the system state and each obstacle's position determine x — \i:\ — a, X2 — b] , wliere (a, 6) is the coordinate of a 2D obstacle. According to Theorem I in [6], k is set to 0.5 so that the control input constraint set may remain nonempty over the duration from the detection of a potential collision to the disappearance of this collision danger. A collision-free trajectory is guaranteed if P{t) > 0 ^t. A potential collision was detected by checking P(i) = P ( 0 ) + tPiO) + -P„,ar
> 0 V< ^
Pmax > \p,Q.
,
(2.3.35)
where i — —P{0)/Pmax is the time at which the minimum of P occurs. If the condition (2.3.35) is violated, the instantaneous value of Pmax is calculated, and an additional constraint (Eq. (2.3.36)) is included until P becomes positive, i.e., the danger of collision disappears. ^(X, V) = {u : Pju
where P^ = (x' x)
-1/2
J-,
Px.
+ li^PxJv
l ^ - ( X X^
X
V
> Pmax},
x) -1/2 ,
(2.3.36)
and m{x) = 2k(x'^x)''
^.
101
1
J u I y
1
V
1
1
0.5
1
1
1.5
2
'
3.5
xl
Figure 2.3.6 Hard deadlines of state trajectory 1 of Fig. 2.3.5 in the absence of delay without terminal constraints.
The intersection of these constraints with the admissible control set Q results in a polygonal control space: A= The Optimal
Decision
Strategies
^(x,v)r]Q. (ODS) in [13] can be used to solve such a
constrained minimization problem, similarly to a class of pointwise-optima] control laws constrained by hard control bounds. The hard deadline is derived using a pointwise-optimal control law with the OVS. Since the computationtime delay causes the failure to update the control input, a collision (or a dynamic failure) occurs if the computation-time delay is longer than a certain threshold, which is a function of the system state and time. In this example, the state constraints change with system state (time), i.e., the state-dependent control constraints. Thus, the control input must be updated on the basis of new information to avoid any collision. The trajectories derived in both the absence and presence of delay DT^, and the hard deadlines on these trajectories in the absence of delay are plotted in Figs. 2.3.5 and 2.3.6.
102 Controlled Process: System Dynamics
"
Controller Computer:
Environment: Fault behaviors
Hard Deadline
Control Algorithms
H/W Design
S/W Design 4 Evaluation (Reliability)
Figure 2.3.7 The source and application of hard-deadline information in a real-time control system.
2.3.8
Application of Deadline Information
T h e hard-deadline information allows us to deduce the timing constraints of the controlled process. It can be used to evaluate the fault-tolerance requirements of a real-time control system in terms of its timing constraints. Fig. 2.3.7 shows the source and application of hard-deadline information. As we shall see, this deadline information about the controlled process is quite useful for the design and evaluation of a controller computer. When designing a controller computer, one has to make many design decisions in the context of controlled processes that are characterized by their hard deadlines and cost functions [18], including:
hardware design issues dealing with the number of processors and the type of interconnection network to be used, and how to synchronize the processors.
103 •
software design issues related to the implementation of control algorithms, task assignment and scheduling, redundancy management, error detection and recovery.
Since the timing constraints of the controlled processes are manifested as hard deadlines, the deadline information is also essential to evaluate the system reliability, an important yardstick to measure the goodness of the controller computer. To illustrate the general idea of applying the knowledge of the deadline information (i.e., system inertia), we consider two specific examples; (i) a design example that optimizes time-redundancy recovery methods such as retry or rollback, and (ii) an evaluation example that assesses the system reliability. Example 8.1: When an error is detected the simplest recovery method is to re-execute the previous instruction, called simply retry, which is effective in case of immediate error detection [10, 11]. When retrying an instruction, one must determine a retry period, which is long enough for the present fault(s) to die away. If the retry does not succeed in recovering from the error, we have to use an alternative recovery method like rollback or restart. So, the retry period must also be short enough not to miss the deadline by considering the amount of time to be taken by the subsequent recovery method in case of an unsuccessful retry. Let Tt, TQ, and tr be the "nominal" task execution time in the absence of error, the actual task execution time, and the retry period, respectively. Then, one can obtain a set of samples ofTa'. Ta E {Tt,Tt +
where T,, T, and j - are the resetting time, the mean occurrence time of an error, and the mean active duration of a fault. Since Ta has discrete values, the probability mass function (pmf) of Ta is: /^^
=
PT[Ta = k{f + Z+tr)
+ Tt+6^l
=
p',+'iTt)il-p,(U))''a-Pe{Tt)y-'ps{irY,
0
Se{0,l} (2.3.37)
104 where Pe{Tt) and Ps{tr) are the probability of occurrence of an error during Tt (i.e., after restart) and the probability of a successful retry with a retry period tr-^ Then, the probability of missing a hard deadline is:
Pmh(Tt,D)= where foiy)
Yl
(2.3.38)
is the probability density function of the hard deadline. When the
hard deadline is deterministic, foiu) PmhiTi,D)
fxM)lD{y)dy.
is a delta function and the corresponding
becomes simpler. Consequently, the optimal retry period can be
determined by minimizing ^ ^ ^ ( r t , D) with respect to tr, using the derived hard-deadline information
foiy)-
Similarly, the hard-deadline information is also useful to rollback recovery, where checkpoints must be placed optimally.
The checkpoints are usually
placed so as to minimize the mean execution cost [23]. However, the mean cost must be minimized while keeping the probability
of dynamic
failure
—
the probability of missing a deadline [18] — below a prespecified level in a real-time control system [20]. The hard-deadline information is necessary to compute the probability of dynamic failure, which can, in turn, be used for the optimal placement of checkpoints. E x a m p l e 8.2: We consider a simple example of a triple modular redundant ( T M R ) controller computer in which three identical processors execute the same set of cyclic tasks. The T M R controller computer updates, once every T, seconds or every sampling period, the control input to the controlled process (plant). T h a t is, the period of each cyclic task is equal to T,. T h e input of the cyclic task is a discretized output of the plant and the o u t p u t of the cyclic task will be used to control the plant during the next sampling interval. T h e o u t p u t of the T M R controller is correct for each task only if at least two of the three processors in the T M R controller produce correct outputs. A TMR failure
is
said to occur if more than one processor in the T M R controller fail during T,. T h u s , the o u t p u t of the T M R controller would not be correctly updated in case of a T M R failure. T h e condition for a system (dynamic) failure resulting from 'See [8] for the derivation of Pe(Tt) and pj((r) in a TMR system.
105 Pl3
Si:
No failure
52; O n e processor with failure(s) 53: T w o or three processors with failure(s) = O n e T M R failure 54: T w o sequential T M R failures Ss'. Dynamic failure
F i g u r e 2.3.8 tia.
A Markov reliability model with knowledge of the system iner-
controller-computer (TMR) failures^ is derived from the hard deadline, which is the allowable maximum duration of TMR failure(s). In other words, this condition gives us the knowledge about the controlled system's inertia against controller-computer failures. More than 90% of computer failures have been reported to be transient, especially with short active durations [21]. Thus, the controller computer may recover from most failures in a few sampling intervals, and it can correctly update the control input without causing any dynamic failure, if the active duration of controller-computer failure is smaller than the hard deadline. ^ As mentioned earlier, the other sources of system failure(s), such as failures in actuators or sensors or mechanical parts and failures of A/D and D/A converters, are not considered in this chapter, because the main intent is to analyze the coupling between a controlled process and its (fault-tolerant) controller computer.
106 Suppose the hard deadline derived from the controlled system is three sampling periods and a TMR controller computer is used. That is, no dynamic failure occurs if the faults inducing computer failures disappear (or are recovered by a fault-tolerance mechanism) within three sampling periods. Then, the reliability model for this controller computer (Fig. 2.3.8) is built by extending a Markov chain model whose parameters are to be estimated at a given level of confidence from empirical data. The additional states account for the system inertia, i.e., a dynamic failure results from only three consecutive erroneous (missing the update of) outputs of the controller computer or for a period of 3T,, not immediately from one or two erroneous outputs. Without the information of hard deadline, one can over-estimate the probability of a system failure under the assumption that the system has no delay-tolerance, i.e., one incorrect output can lead to a dynamic failure.
2.3.9
Conclusion
The hard deadline for a critical-control task is usually cissumed to be given a prion. This presupposes the existence of a precise definition of the hard deadline and a method to derive it, which, however, have not been addressed in detail. The knowledge of hard deadlines, which must be derived from real control applications, is very important for task assignment and scheduling, specification and evaluation of fault-tolerant controller computers. We derived hard deadlines for linear time-invariant control systems in the presence of the computation-time delay and/or the control input disturbances due to a controller-computer failure. First, when the occurrence of computer failures is stationary or represented by an appropriate probability density function, the hard deadline is obtained as an integer multiple of the sampling period by using the (asymptotic) stability condition for a modified state equation. Second, a heuristic algorithm is proposed to iteratively compute the hard deadline as a function of the system state and time by testing for the given (or derived) constraints of control inputs and states.
107 As an extension of this work, it would be interesting to derive the deadlines of nonlinear time-invariant control systems in the presence of control input disturbances. The derivation of hard deadlines for time-varying systems will also be a challenging task.
108
REFERENCES
[1] C. M. Belcastro, "Laboratory test methodology for evaluating the effects of electromagnetic disturbances on fault-tolerant control systems," NASA TM-101665, November 1989. [2] A. F. Belleisle, "Stability of systems with nonlinear feedback through randomly time-varying delays," IEEE Trans, on Automat. Conir., vol. A C 20, no. 1, pp. 67-75, February 1975. [3] A. Gosiewski and A. W. Olbrot, "The effect of feedback delays on the performance of multivariable linear control systems," IEEE Trans, on Automat. Contr., vol. AC-25, no. 4, pp. 729-734, August 1980. [4] K. Hirai and Y. Satoh, "Stability of a system with variable time delay," IEEE Trans, on Automai. Conir., vol. AC-25, no. 3, pp. 552-554, J u n e 1980. [5] G. Hostetter and J. S. Meditch, "Observing systems with unmeasurable inputs," IEEE Trans, on Automat. Contr., vol. AC- 18, pp. 306-307, June 1973. [6] S. Kheradpir and J. S. Thorp, "Real-time control of robot manipulators in the presence of obstacles," IEEE journal of Robotics and Automation, vol. 4, no. 6, pp. 687-698, December 1988. [7] H. Kim and K. G. Shin, "On the m a x i m u m feedback delay in a linear/nonlinear control system with input disturbances caused by controllercomputer failures," IEEE Trans, on Control Systems Technology (in press). [8] H. Kim and K. G. Shin, "Design and analysis of an optimal instructionretry policy for T M R controller computer," submitted for publication, 1993. [9] H. Kim and K. G. Shin, "Evaluation of fault-tolerance latency from realtime application's perspectives," submitted for publication, 1993. [10] I. Koren and Z. Koren, "Analysis of a class of recovery procedures," Trans, on Computers, vol. C-35, no. 8, pp. 703-712, August 1986.
IEEE
109 11] Y. H. Lee and K. G. Shin, "Optimal design and use of retry in faulttolerant computing systems," Journal of the ACM, vol. 35, pp. 45-69, J a n u a r y 1988. 12] G. Leitmann, An Introduction McG ray-Hill, 1969. 13] D. G. Luenberger, Optimization 1969.
to Optimal
Control,
New York, NY:
by vector space methods, New York, Wiley,
14] M. Mariton, "Detection delays, false alarm rates and the reconfiguration of control systems," Int. J. Control, vol. 49, no. 3, pp. 981-992, 1989. 15] Z. V. Rekasius, "Stability of digital control with computer interruption," IEEE Trans, on Automat. Contr., vol. A C - 3 1 , no. 4, pp. 3 5 6 3 5 9 , April 1986. 16] K. G. Shin and X. Cui, "Effects of computing time delay on real-time control systems," in Proc. of 1988 American Control Conf., pp. 1071-1076, 1988. 17] K. G. Shin and H. Kim, "Derivation and application of hard deadlines for real-time control systems," IEEE Trans, on Systems, Man, and Cybernetics, vol. 22, no. 6, pp. 1403-1413, Nov./Dec. 1992. 18] K. G. Shin, C. M. Krishna, and Y.-H. Lee, "A unified method for evaliating real-time computer controller and its application," IEEE Trans, on Automat. Contr., vol, AC-30, no. 4, pp. 3 5 7 3 6 6 , April 1985. 19] K. G. Shin and Y.-H. Lee, "Error detection process — model, design, and its impact on computer performance," IEEE Trans, on Computers, vol. C - 3 3 , no. 6, pp. 529-539, June 1984. 20] K. G. Shin, T.-H. Lin, and Y.-H. Lee, "Optimal checkpointing of realtime tasks," IEEE Trans, on Computers, vol. C-36, no. 11, pp. 1328-1341, November 1987. 21] D. P. Siewiorek and R. S. Swarz, The Theory and Practice of Reliable System Design, Digital Equipment Corporation, Bedford, MA, 1982. 22] D. D. Siljak, "Reliable control using multiple control systems," Int. Control, vol. 31, no. 2, pp. 303-329, 1980.
J.
23] A. Tantawi and M. Ruschitzka, "Performance analysis of checkpinting strategies," ACM Trans. Computer Systems, vol. 2, pp. 123-1441, 1984.
110 [24] K. Zahr and C. Slivinsky, "Delay in multivariable computer controlled linear systems," IEEE Trans, on Automat. Contr., vol. AC-19, no. 8. pp. 442-443, August 1974.
SECTION 3
SYSTEM VALIDATION
SECTION 3.1
Software-Implemented Fault Injection of TVansient Hardware Errors^ Charles R. Yount^ and Daniel P. Siewiorek^ ABSTRACT As computer applications extend to areas which require extreme dependability, their designs mai\date the ability to operate in the presence of faults. The problem of assuring that the design goals are achieved requires the observation and measurement of fault behavior parameters under various input conditions. One means to characterize systems is fault injection, but injection of internal faults is difficult due to the complexity and level of integration of contemporary VLSI implementations. This chapter explores the effects of gate-level faults on system operation as a basis for fault models at the program level. A new fault model for processors based on a register-transfer-level (RTL) description is presented. This model addresses time, cost, and accuracy limitations imposed by current fault-injection techniques. It is designed to be used with existing software-implemented fault-injection (SWIFT) tools, but the error patterns it generates are designed to be more representative of actual transient hardware faults than the ad-hoc patterns currently injected via most SWIFI experiments. 3.1.1 INTRODUCTION As persons become more and more dependent on computers in everyday life, and, in many cases, for life itself, the understanding of how those computers react to faults becomes more important [13], Today's systems are much too complex to use purely formal techniques to prove their correctness, much less to prove that recovery from faults is performed correctly. Also, present technology produces parts with mean times to failure (MTTF) of several years, so designers cannot
'The research described in this chapter was funded in part by the Office of Naval Research under grant N00014-91-J-4139. ^Inter-National Research Institute. ^Carnegie Mellon University Department of Electrical and Computer Engineering.
114 simply wait for faults to occur in the finished product to determine their effects. Consequently, engineers and computer scientists have turned to fault injection to artificially induce premature faults into a system. The research described in this chapter seeks to overcome the limitations of current fault injection techniques by allowing high-speed injection experiments without sacrificing accuracy. To achieve this goal, a fault model is developed which may be applied to a register-transfer-level (RTL) description of the processor under study. Experimental results show that such a model of the IBM ROMP is capable of reproducing 97% of the errors which are generated by a detailed gate-level model with a time reduction of over 500 to 1. 3.1.2 FAULT-D«JJECTION TECHNIQUES This section lays out a background for fault-injection studies. Some reliability terms are presented, and a fault injection taxonomy is given. Then, some recent research is described and categorized in that taxonomy. 3.1.2.1 Reliability Terms This chapter uses the following definition of fault pathology—the relationship between faults, errors, and failures [23]: A system failure occurs when the delivered service deviates from the specified service... The failure occurred because the system was erroneous: an error is that part of the system state which is liable to lead to failure... The cause—in its phenomenological sense—of an error is a fault. In addition, faults may be classified by their duration [34]: Permanent faults are continuous and stable; in hardware, permanent failures reflect irreversible physical change. Intermittent faults are only occasionally present due to unstable hardware or varying hardware or software states. These faults are potentially repairable by replacement or redesign. Transient faults result from temporary environmental conditions. Since the hardware is undamaged, these faults are not repairable. Figure 3.1.1 relates several sources of errors with possible durations and manifestations to errors and failures. Note that some of these errors are physical faults (physical defect, unstable or marginal hardware, and unstable environment) and some are human faults (incorrect design and operator mistake).
!15
Physical defect
Incorrect design
Unstable or marginal hardware
Unstable environment
Permanent fault
^
X
Error
Service failure
Intermittent fault
Transient fault
Operator mistake
Figure 3.1.1: Sources of errors and failures (Taken from [34].)
3.1.2.2 Responding to Faults Computer systems can respond to faults at a variety of levels. Non-redundant systems can only take advantage of fault-intolerant techniques such as increasing the quality of the design process or a circuit's constituent parts. Redundant systems can use fault detection or fault tolerance. Fault detection provides a level of integrity to the result of computation, and fault tolerance uses static or dynamic techniques to mask faulty components or reconfigure the system to operate without them. For example, if one or more processors in a multi-computer fails, the connectivity may be reconfigured to operate in a mode of degraded performance [30]. Table 3.1.1 shows a list of common reliability techniques from these categories. Grading fault-avoidance measures and validating complex fault-detection and fault-tolerant techniques are important but almost intractable tasks. To stimulate these mechanisms, engineers and computer scientists have used fault-injection techniques, the artificial introduction of faults or their manifestations into a sys-
Table 3.1.1: Classification of reliability techniques (Taken from [34].) "Single-error correction/double-error detection N-modular redundancy tern or a model of a system under study. The next section provides a taxonomy of techniques used in fault-injection studies. 3.1.2.3 A Fault-Injection Taxonomy Methods have been developed to inject faults into both software and hardware. Even though this chapter describes a model for hardware faults, it borrows ideas from software methods and combines them with existing hardware methods. Thus, this section provides a brief overview of software fault injection techniques
117 and then provides a two-way classification of hardware fault injection: by method of injection and by the level at which the fault is injected. Software Fault Injection. Fault injection has been used to test software by making modifications to the source code the programmers have written. Injection of faults into software falls into two major categories: error seeding and mutation analysis [1]. Error seeding is based on the assumption that artificially injected errors will be found at approximately the same rate as actual programmers' mistakes. Errors are injected during the debug phase, and when an acceptable number of them are found, the debug phase ends with the assumption that an acceptable number of actual errors were also found. Mutation analysis is used to determine the coverage of software test programs. Errors are injected, and the test program is used to attempt to detect them, providing coverage statistics. Hardware Fault Injection. A hardware fault injection experiment can be classified by both method of injection and the level of abstraction at which the faults are injected. First, consider a taxonomy of hardware fault injection methods containing four categories [9]: Software Simulation: The computer system under test (SUT) is simulated on another computer system. Faults are induced by altering logical values during simulation. Unfortunately, even the fastest simulators run several orders of magnitude slower than the actual SUT. Also, a device or gate-level simulation requires a full internal hardware description of the processor, not usually available to external testers because the designs are considered proprietary. Hardware Emulation: An engineering prototype, e.g., a wire-wrapped model, of the SUT is used, and faults are introduced into this model by using additional hardware to alter voltage values during execution. Again, a full hardware description is necessary, even more so than for the simulator model. In addition, the models are expensive to construct and are not exact replicas of the final system. Fault Emulation: The actual hardware of the SUT is used, and faults are emulated through special capabilities built into the hardware or through software (so-called software-implemented fault injection, or SWIFI). Often, though, with this method, there is no theoretical correspondence between the emulated faults and actual faults which might occur in the hardware.
118 Physical Fault Injection: Again, the actual hardware is used. FauUs may be injected by forcing voltage levels at the chip boundaries, but internal components of the processor will not be accessible. Alternatively, one may bombard the SUT with radioactive particles which cause internal faults, but this method is very expensive and difficult to control. As with fault emulation, there is often no theoretical correspondence between the injected faults and actual faults which might occur in the hardware. Table 3.1.2 summarizes the advantages and disadvantages of each of these methods. Fault injection methods
Advantages
Disadvantages
Software simulation
Hardware emulation
Fault emulation
Physical fault injection
Access to system at any level of detail. Fault types and control are unlimited.
Representative hardware with favorable access and monitoring.
True hardware and software in use.
True hardware and software in use.
Simulation time explosion.
Implementation and other parameters change with deployed system.
Fault types are limited and unknown.
VLSI limits access and monitoring points. Task is difficult.
Table 3.1.2: Trade-offs of fault-injection methods (Taken from [9].)
In addition to the method by which faults are injected, fault-injection techniques may also be categorized by the level of abstraction at which the faults are injected. High-level fault-detection or fault-tolerant mechanisms may be verified by injecting faults at any lower level of abstraction and allowing the errors to propagate to higher levels. In general, researchers prefer to inject faults at the lowest levels to assure accuracy, the degree to which the injected faults reflect actual faults. In practice, however, they are forced to inject faults at higher levels to attain a procedure that is tractable in terms of development cost and injection time. The following taxonomy presents a list of increasing levels of abstraction from the device level to the network level. In addition, they fall into two broad categories:
119 circuit-level abstractions at the lower end and functional-level abstractions at the higher end. Circuit: In this category, the physical makeup of the processor is considered. Device: The focus is on the actual transistor and how it operates. For a simulated injection, an analog simulator is used for each device. For a physical injection, the operation of the devices is affected with radiation or other physical stress. Gate: Simple logic gates (ANDs, ORs, multiplexors, and so forth) are faulted. Usually, a simple stuck-at (for permanent faults) or fixed-at (for non-permanent faults) model is assumed, although some experimenters use more accurate models including other effects such as coupling between lines or storage cells. Basic Block: The circuit is subdivided into functional groups such as adders, decoders, and registers. Here, the fault model may be ad-hoc or based on lower-level constituents. Chip: Faults are injected at the boundary of the integrated circuit. For microprocessors, this would include faults on the external address, data, and control lines. Functional: In this category, the circuit itself is no longer considered, but a functional description of the goal of the circuit is used. Micro-operation: The operation of the processor during each micro-instruction is faulted based on a fault model containing erroneous data transfers and micro-sequencing. Macro-operation: The operation of the processor's instructions are faulted by flipping bits in the instruction word or by using a fault model based on lower-level activity. System: Memory and I/O writes of the proces.sor are faulted because these are the activities which ultimately have an effect at the system level. Network: Messages and other accesses to shared resources are faulted. 3.1.2.4 Recent Research in Fault Injection Research has been performed in many of the areas formed by the cross-product of the two taxonomies of method and abstraction. Much of the fault-effect analysis work has actually been performed to gain insight to facilitate test generation. In this case, the faults themselves may not actually be injected; rather, the effects are studied to produce the appropriate tests to detect the faults.
120 Device. Within the device-level software-simulation area, a mixed-mode transient-fault simulator based on SPLICE has been used [6, 7, 12] to inject current transients corresponding to charge levels between 0.5 and 8.0 picoCoulombs. Also, simulation routines were developed [19] to accurately model single event upsets (SEUs) in CMOS SRAMS. Physical injection at the device level has included bombarding dynamic memories with alpha particles [24], a Zilog Z-80 with heavy ions and protons from a cyclotron [8], and a Motorola MC6809E with heavy ions from a ^^Californium source [16, 25]. A useful summary of ion bombardment studies is found in [26]. Also, performance at the device level has been affected indirectly through power supply disturbances [25]. Gate. Most gate-level injection is done by software simulation. For example, a recent fault-injection study of a 32-bit RISC used a register-level model in VHDL and injected 2,779 flip-to-1 and flip-to-0 faults via the simulator [29]. Another study [9, 10] injected transient faults into the IBM ROMP to determine how fault manifestations are related to workload. In addition, software fault simulation at the gate level has been the target of much optimization because of its popularity in testing. Exact solutions for permanent faults include concurrent, deductive, differential, and parallel fault simulation, and approximations include fast fault grading, critical path tracing, and statistical fault analysis. See [27] for a survey of these techniques. Basic-Block. Crossing the line between circuit and functional-level simulation is work done with basic circuit blocks [22] used to develop an approach for test generation which retains the accuracy of the gate-level while reducing computation time by using high-level primitives for fault propagation. Chip. At the chip level, most work has focused on physical fault injection such as altering the voltages on the output pins [2]. Micro-Operation. At the micro-operation level, researchers [4, 35, 36, 37] have developed test generation techniques which verify the functional behavior of microprocessors. Their work uses functional fault models on a graph-theory representation of a processor's data paths. Other techniques use a hardware description language (HDL) description of the processor to generate tests [28] or simulate faults [38]. However, the HDL models are often largely ad-hoc, reflecting HDL-level faults rather than hardware faults. Higher Levels. At the macro-operation, system, and network levels, fault emulation is the predominate method.
121 FIAT [3, 31] and FERRARI [20] are SWIFI tools which have been used to inject bit faults into a processor during execution of a workload. FIAT flips bits in an instruction word before its execution or flips bits in memory to simulating erroneous writes. FERRARI emulates line errors in the address or data lines when a data operand or opcode is fetched by modifying code sequences. It also injects errors into the condition codes. These errors can be transient or permanent. In [39], SWIFI is augmented with a hardware monitor for higher accuracy of evaluation. As stated earlier, these models are ad-hoc, chosen mainly because of their simplicity. The most recent attempt to increase the validity of these models is through current research into development of the EMAX process [21]. EMAX uses a Zycad hardware simulator to inject faults into a gate/transistor level description of a circuit. The error patterns are recorded and used to generate high-level models for injection by FERRARI. At the highest levels, studies have largely focused on the emulation of software faults (bugs). However, since the level of abstraction is so high, they also emulate many transient hardware faults which propagate through several levels of abstraction to manifest as errors in software structures. For example, in testing the FAA's Advanced Automation System [II], faults included corrupting virtual memory; corrupting, dropping, delaying or duplicating messages and timers; and terminating or delaying processes. A study using the DEPEND simulation environment [15] corrupted messages across a communications link. Similarly, a study to develop a deterministic fault injection-strategy injected message-passing faults by skipping transfers or sending extra, out-of order, out-of-time, or wrong-value messages [14]. Finally, as a means lo failure acceleration, whole pages of memory have been corrupted to simulate catastrophic errors in an IBM mainframe system [5]. Comparison. Table 3.1.3 shows a summary of these research projects, categorizing them by level of abstraction and the fault sources they attempt to model. 3.1.2.5 Section Summary Faults in hardware or software originate from many sources and may eventually manifest to failures in the encompassing computer system. The designers of a system may elect to incorporate fault-detection or fault-tolerant mechanisms to respond to these faults. To verify these mechanisms, the designers often use faultinjection techniques. Fault-injection techniques may be categorized by method and the level of abstraction at which the faults are injected. In general, lower levels of abstraction provide
Table 3.1.3: Summary of recent research
The table categorizes the studies cited in this section by level of abstraction (device through network) and by the fault sources from Figure 3.1.1 that each attempts to model; the SWIFI entries at the macro-instruction, system, and network levels include this research. No recent research in the area of operator mistakes was found.
3.1.2.5 Section Summary

Faults in hardware or software originate from many sources and may eventually manifest as failures in the encompassing computer system. The designers of a system may elect to incorporate fault-detection or fault-tolerant mechanisms to respond to these faults. To verify these mechanisms, the designers often use fault-injection techniques.

Fault-injection techniques may be categorized by method and the level of abstraction at which the faults are injected. In general, lower levels of abstraction provide high accuracy at high cost, and higher levels of abstraction provide various levels of accuracy at lower costs. The need for higher accuracy is based on two assumptions: (1) that a more accurate set of faults will propagate to a more accurate set of failures and (2) that a more accurate set of failures will produce a more accurate validation of fault-detection or fault-tolerant mechanisms.

The following section lays the framework for a low-cost SWIFI fault model for intermittent and transient hardware faults which provides a higher level of accuracy than is currently available.
3.1.3 DEVELOPING A NEW FAULT INJECTION METHODOLOGY FOR FAULT EMULATION

In order to determine system-level errors or to verify a fault-tolerant configuration with redundancy at the processor level, an engineer must be able to determine the effects of hardware failures on actual processor workloads. Therefore, he or she must inject faults which will manifest to errors at the processor level. To maintain confidence in the validity of the errors, the nature of these injections should be based on actual hardware faults. Such studies are hindered, however, by the limitations discussed in the previous section. To address these limitations, this research provides a new fault model which allows SWIFI tools to inject faults at the instruction level which will produce errors that are representative of those produced by actual hardware faults. An overview of this process is illustrated in Figure 3.1.2.

Since studies have shown that eighty [18] to eighty-five [32] or even ninety [34] percent of all computer failures are due to transient or intermittent faults, development of this new fault model focuses solely on modeling non-permanent faults. As such, it models faults caused by unstable or marginal hardware or an unstable environment.

3.1.3.1 Hybrid Fault Emulation

Any sequential network can be reduced to the simple model shown in Figure 3.1.3. This model consists only of (1) combinational logic which performs a boolean transformation on its input, (2) a collection of storage elements known as its "state," (3) an input vector, and (4) an output vector. Also, there exists one or more clocks which determine when the state can change. For a processor, the combinational network comprises the ALU, decoding logic, etc.; the state comprises visible and non-visible registers; and the inputs and outputs are signals coming into and generated by the processor on its I/O pins.

When injecting faults, it is important to note two things about any given interval between state changes. First, the only factors determining what happens during the interval are the current state and the input vector during the interval. Second, the only direct effects of these factors are the new state and the output vector during the interval. Subsequent states and outputs are only affected indirectly through the new state. Consequently, if a fault occurs and is contained completely within an interval, it can only directly alter the new state and the output vector. If a fault affects neither, it is called overwritten for this interval; an example of this might be a transient fault in the ALU at some time when it is not being used. It follows that if a fault is overwritten for any given interval, it will have no effect on subsequent intervals.
Figure 3.1.2: An overview of the new fault-injection process
The user inputs an instruction-set description of the processor under study. An automatic procedure generates instruction-level error manifestations of hardware faults in that processor. These manifestations are then used by fault-injection machinery to emulate faults during the execution of actual workloads.
Conversely, if the fault does affect the state or output, it is called non-overwritten for this interval, although it could become overwritten at some interval in the future. (A transient or intermittent fault, or multiple such faults, and certainly a permanent fault, may affect a circuit across more than one time interval. This research focuses only on single faults that are contained within a specified time interval.)

Figure 3.1.3: General sequential network
The network consists of combinational logic driven by an input vector and the current state, producing an output vector and the new state; a clock determines when the state can change.

Using software simulation, hardware emulation, or physical injection at the device, gate, or basic-block level, one may inject faults into all portions of a processor: the input, output, state, and combinational portions. When using SWIFI to avoid the time and accuracy limitations of these methods, however, one can only access the CPU through its instruction set. How, then, can transient faults in the combinational network be emulated?

Since the only factors that determine what happens during an interval T = [t1, t2] are the current state and input vector, one approach would be to set the current state at t1 and provide input during T in such a way that the combinational network will produce a new state and output to match the effects of the fault. This is referred to as forward fault emulation, FFE. Note, however, that FFE will only be possible in those cases where the combinational network is capable of producing those effects. A combinational network is a transformation from an input vector to an output vector, and in general, not all values of the output vector are producible. Also, if the fault is not directly associated with the processing of the current state or input, computing the required input values will require one to solve the inverse of the combinational network.

On the other hand, since faults affect only the new state and output vector, one can emulate a fault occurring during some interval T = [t1, t2] by simply generating the output during T and setting the new state at t2 to match the effects of the desired fault. This is referred to as reverse fault emulation, RFE. To perform RFE, one must choose the proper patterns to apply to the new state and output. This can be done in an ad-hoc manner by capturing the fault-free patterns and arbitrarily flipping bits to derive injection patterns. A more accurate method involves a hybrid fault-injection approach like the one described in this chapter.
With the hybrid approach, one allows the network to run until t1 and captures the state at that point. Then, another fault-injection technique is used to determine the proper patterns based on the captured state and an appropriate fault model. For example, one could initialize a gate-level simulation of the network with the captured state and simulate for the duration of T while injecting the desired gate-level faults. The results of the simulation can then be used to artificially stimulate the outputs and set the new state of the actual network. Thus, the hybrid approach uses the actual circuit to perform the fault injection, thereby gaining a speed advantage. However, it uses a lower-level model to determine the injection patterns for accuracy. Since the lower-level model is used over a limited interval, the time penalty is reduced. The natural interval T to consider for a processor when emulating faults via SWIFI is the time required to execute one instruction. Emulating faults using hybrid fault injection is illustrated in Figure 3.1.4.

Figure 3.1.4: A fault in an instruction stream
The processor is run at full speed up to the instruction in which the errors should be injected. The fault is emulated by altering only the new state at the end of that instruction and the outputs within the instruction (RFE), after which the processor again runs at full speed. Errors in outputs and states after the affected instruction are caused indirectly by the state at the end of that instruction.
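The interval structure can be made concrete with a toy example. In the self-contained C sketch below, a trivial accumulator stands in for the processor: its state is updated once per interval by a fast behavioral model, except for the one faulted interval, where the new state is taken from a slower bit-slice model with a transient carry fault injected. That division of labor is exactly the hybrid, reverse-fault-emulation idea described above. Everything about the toy machine (its state, its update rule, the fault location) is an assumption of the sketch and is not taken from the chapter.

#include <stdint.h>
#include <stdio.h>

/* Fast, behavioral model: stands in for the "actual circuit". */
static uint8_t step_fast(uint8_t acc, uint8_t in)
{
    return (uint8_t)(acc + in);
}

/* Slow, bit-slice model used only for the faulted interval: a ripple-carry
 * adder in which the carry between slice fault_slice and the next slice
 * can be flipped (fault_slice < 0 means no fault). */
static uint8_t step_bitslice(uint8_t acc, uint8_t in, int fault_slice)
{
    uint8_t result = 0;
    unsigned carry = 0;
    for (int i = 0; i < 8; i++) {
        unsigned a = (acc >> i) & 1u, b = (in >> i) & 1u;
        unsigned sum = a + b + carry;
        result |= (uint8_t)((sum & 1u) << i);
        carry = sum >> 1;
        if (i == fault_slice)
            carry ^= 1u;               /* transient fault on one internal carry */
    }
    return result;
}

int main(void)
{
    const uint8_t inputs[] = { 3, 5, 7, 11, 13 };
    const int faulted_interval = 2;    /* interval T in which the fault occurs */
    const int fault_slice = 1;

    uint8_t golden = 0, faulty = 0;
    for (int t = 0; t < 5; t++) {
        golden = step_fast(golden, inputs[t]);
        if (t == faulted_interval)
            /* RFE: for this one interval, the new state comes from the
             * lower-level model with the fault injected ... */
            faulty = step_bitslice(faulty, inputs[t], fault_slice);
        else
            /* ... while every other interval runs on the fast model. */
            faulty = step_fast(faulty, inputs[t]);
        printf("t=%d  golden=%3u  faulty=%3u\n", t, golden, faulty);
    }
    return 0;
}

Running the sketch shows the faulty state diverging from the golden run at the faulted interval and the resulting error persisting indirectly through the state in later intervals, as described above; the speed advantage of the hybrid approach comes from invoking the lower-level model for only that single interval.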
3.1.3.2 Developing a Fault Model

FIAT and FERRARI, mentioned earlier, are SWIFI platforms which provide environments in which many error coverage statistics may be obtained. The abstract
fault models (bit flips) often used on the platforms, though, are generated in an ad-hoc manner. There is no theoretical mapping between actual faults in the computer system and these fault models. One may obtain many statistics such as detection coverage (percentage of injected faults that were detected) and detection time, but there is no way to interpret these results with respect to the detection of actual faults.

To produce higher accuracy, one could invoke a complete device- or gate-level simulation based on the state at t1 to determine what outputs and state changes to make. Recall, however, the difficulties with using device- or gate-level simulations: (1) accurate descriptions are usually proprietary information not available to the user, (2) developing such models is costly, and (3) such simulations are extremely time-consuming, even given the time reduction of simulating only a limited time interval.

EMAX [21] seeks to alleviate the third limitation by using a Zycad MACH-1500, a hardware simulation engine, which is capable of executing a concurrent fault simulation algorithm on three models of the processor under test. The error patterns are recorded and injected via FERRARI. One current limitation of EMAX is that the MACH-1500 can only simulate permanent faults. Given this limitation, however, EMAX can be considered both accurate and fast. Of course, this technique also requires a gate-level model of the processor.

In order to provide even higher error-generation speed and lower cost, the experiments described in this chapter model the system at a higher level of abstraction, the register-transfer level (RTL). RTL can be considered slightly higher in abstraction than the micro-instruction level since a cycle-by-cycle description of the operation of the processor is not required. On the other hand, it is lower in abstraction than the bit-flipping models used previously for SWIFI since the functionality of the instruction is considered. The following section describes a tool built to support hybrid fault injection at the RTL level: ASPHALT.

3.1.3.3 An RTL Language

To support the fault-generation process, the user provides ASPHALT with a description of the processor from an architectural viewpoint. This includes a listing
(ASPHALT is a loose acronym for ASembly-language level FAULT simulator. It also punningly suggests that the tool can be used as an underlying model for fault-injection test-beds, some of which are named after automobiles, such as FIAT and FERRARI.)
of its architectural registers and an RTL description of the data transfers that take place for each instruction. The ASPHALT language is similar to the C programming language or the Verilog simulation language.

The basic data type in ASPHALT is the register, which must be declared with a certain bit-length. Registers may be accessed wholly or with a bit-range specification. Registers of equal size may be grouped into a register file and accessed with an array subscript. The language contains all the expected control constructs such as if/then/else, while, for, do/while, and a specialized switch-type construct known as decode/case. The decode statement takes an expression similar to C's switch, but the case statement is much more powerful: it allows matching to the result of the decode expression using a bit-field description containing fixed fields and wild cards, and the wild-card fields can automatically be assigned to variables. This feature makes opcode decoding straightforward and intuitive, as the sketch following the feature list below illustrates. Some other features which tailor the language to processor modeling are
• Built-in routines exist for memory and I/O read and write.
• Built-in routines exist to generate various types of processor traps: address, bus, program, privilege, and opcode.
• The add and subtract math operators allow an optional carry bit argument.
• All arithmetic and logical operators automatically generate negative, zero, carry, and overflow flags as appropriate.
• Bit lengths of operands are checked for compatibility; concatenation and extraction of bit fields is supported.
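The decode/case construct automates the familiar mask-and-match style of opcode decoding while extracting the wild-card fields for the user. The ASPHALT syntax itself is not reproduced here; the following C sketch only illustrates the underlying idea for a hypothetical 16-bit instruction format (it is not the ROMP encoding).

#include <stdint.h>
#include <stdio.h>

/* Mask/match decoding of a hypothetical 16-bit instruction word. */
static void decode(uint16_t ir)
{
    /* A pattern such as "0001nnnnmmmm----" becomes a fixed-bit mask,
     * a match value, and field extractions for the wild cards. */
    if ((ir & 0xF000u) == 0x1000u) {             /* case 0001nnnnmmmmiiii */
        unsigned n   = (ir >> 8) & 0xFu;         /* wild-card field n     */
        unsigned m   = (ir >> 4) & 0xFu;         /* wild-card field m     */
        unsigned imm =  ir       & 0xFu;         /* immediate field       */
        printf("op1  r%u, r%u, #%u\n", n, m, imm);
    } else if ((ir & 0xF000u) == 0x2000u) {      /* case 0010nnnniiiiiiii */
        unsigned n   = (ir >> 8) & 0xFu;
        unsigned imm =  ir       & 0xFFu;
        printf("op2  r%u, #%u\n", n, imm);
    } else {
        printf("opcode trap\n");                 /* no case matched       */
    }
}

int main(void)
{
    decode(0x1234);     /* op1 r2, r3, #4 */
    decode(0x2A55);     /* op2 r10, #85   */
    decode(0xF000);     /* opcode trap    */
    return 0;
}

In ASPHALT, the mask, the match value, and the wild-card field assignments are all derived automatically from the bit-field pattern written in the case statement, so the user never writes the shifting and masking by hand.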
The user may also write subroutines which use the call-by-name protocol, and recursion is allowed. Figure 3.1.5 shows an extract from the code used to model the ROMP, IBM's RISC-Oriented Micro-Processor. The complete description requires only about 400 lines of code.
/* RTL code for the RISC-Oriented Micro-Processor (ROMP) */

/* visible state variables */
reg [32] iar;        /* the instruction-address reg  */
reg [8]  cs;         /* the condition-status reg     */
reg [32] gpr[16];    /* the general-purpose reg file */

/* define bits of the condition-status reg */
#define lt cs{6}     /* less than    */
#define eq cs{5}     /* equal        */
#define gt cs{4}     /* greater than */
#define c0 cs{3}     /* carry        */
#define ov cs{1}     /* overflow     */
#define tb cs{0}     /* test bit     */

/* set condition status flags */
procedure set_cs_arith()
{
    ...

Figure 3.1.5: An extract from the RTL code used to model the ROMP
3.1.3.4 An RTL Fault Model

Once an RTL description of the processor is created, a fault model is applied to it to emulate the desired faults. This section develops a fault model from a taxonomy of actual hardware faults. The taxonomy of the hardware fault model and some of the hardware faults themselves are taken from [36]. For completeness, these fault model elements will be included in this section along with the new elements.

The processor is divided into two major sections: data processing and control. The data processing is subdivided into storage, transfer, and processing functional units. This section presents a fault model which can be applied at the RTL level for elements within each category. The basic goals of the fault model are to
• Produce faults which are representative of actual hardware faults, and
• Be implementation independent.
Obviously, to meet these goals, many assumptions and compromises must be made. This introduces the most crucial trade-off of the fault model: coverage versus overhead. Coverage is defined as the percentage of actual hardware faults that are modeled. Overhead is defined as the percentage of faults which were modeled but not produced by the actual hardware. A successful model will have a high coverage with low overhead. Unfortunately, an increase in one will produce an increase in the other. If assumptions are made to cover almost any implementation, the coverage will be high, but the increased overhead will make the model unusable. On the other hand, if very narrow assumptions are made about the implementation, the model will not cover any but the simplest processors. The experimental results will be evaluated against coverage and overhead.

For each section of the processor, a list of possible hardware faults is given. A discussion explains why a subset of these faults was chosen for emulation through the RTL model. Then, a list of the RTL faults for that section is presented. Although the fault model from [36] was intended to test for permanent hardware faults only, most of the faults from that work can describe permanent or transient faults. Furthermore, since 80 to 90% of hardware faults are transient or intermittent, only non-permanent faults are considered in the following model. At the RTL level, a transient or intermittent fault is modeled as being active during a single execution of an RTL statement. Only one fault per simulation will be injected. After the basic model is presented, this section explains how the coverage of certain invisible registers is affected. The experiments in the next section are intended to show how well this model meets the preceding goals when applied to a contemporary processor.

Data Storage and Transfer. The data storage units are the individual flip-flops, registers, and register files: anything declared in the RTL input file as a reg. (Addressing a register file is considered in following sections.) Possible hardware faults for these units are
132 •
One or more cells are stuck at 0 or 1 [36].
•
One or more cells fail to make a 0-1 or 1-0 transition [36].
•
One or more pairs of cells are coupled [36].
•
All or part of a register is loaded with the value on its input lines when it should not be (extraneous load).
•
All or part of a register is not loaded when it should be (missed load). The last two faults cover all faults related to generating and acting on the loadenable signal.
To avoid a combinatorial explosion, only a subset of "one or more" is chosen. Specifically, the faults considered in this model are single-bit faults and whole-register clear and set faults. This gives a fault space of n + 2 faults instead of 2^n − 1 faults for each n-bit register. The missed loads are easily modeled. The extraneous loads, however, are not. Since the RTL description does not specify an internal connectivity structure, there is no way to know what the input value to a register might be. If all possible or even probable inputs are chosen, the number of faults grows combinatorially. For example, the input to a register might be the ALU output. Since, in general, no default ALU function or input values can be assumed, the number of possible combinations is enormous. For this reason, extraneous loads are not modeled. Furthermore, missed loads are limited to the full register. Therefore, the faults injected by the RTL simulator are as follows:
• XOR one bit in storage; this is repeated for each bit in the register before each access.
• Set all bits to one.
• Set all bits to zero.
• Do not perform an assignment statement (except for any side-effects).
Temporally, for a given register, the first three of these faults are injected before each access, i.e., any place where its value is needed to resolve an expression, such as on the right-hand side of an assignment. Also, for each architectural variable (one which is declared as a global variable in the RTL input), the first three faults are injected at the end of the main routine to cover all remaining faults. Recall, of course, that only one fault is injected per faulty simulation. The example code in Figure 3.1.6 shows where and when register faults are injected.
Figure 3.1.6: Example injection of registers
The figure annotates a short RTL fragment involving two 16-bit global registers, A and B. The annotated injection sites are: register A not loaded (the initial value is zero); register B not loaded; register A not loaded at a second assignment; clear all bits, set all bits, and XOR each of the 16 bits of A before its access; clear all bits, set all bits, and XOR each of the 16 bits of B before its access; and clear all bits, set all bits, and XOR each of the 16 bits of all global registers (A and B) at the end of function main. For this code, there will be 75 faulty runs due to data storage faults: 3 missed loads, 4 clears, 4 sets, and 64 bit flips. There will be only one fault injected per run; for example, during the first run, A will not be loaded with 0xabcd.
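A minimal sketch of how such data-storage faults could be enumerated and applied for one access to a 16-bit register is shown below. It is written in C as an illustration only; the fault descriptor it uses is an assumption of the sketch rather than a structure taken from ASPHALT.

#include <stdint.h>
#include <stdio.h>

/* Hypothetical descriptor for data-storage faults on one register access. */
enum reg_fault_kind { BIT_XOR, SET_ALL, CLEAR_ALL, MISSED_LOAD };

struct reg_fault {
    enum reg_fault_kind kind;
    unsigned            bit;     /* valid only for BIT_XOR */
};

/* Apply one fault to a 16-bit register value read during an access. */
static uint16_t apply_read_fault(uint16_t value, const struct reg_fault *f)
{
    switch (f->kind) {
    case BIT_XOR:   return (uint16_t)(value ^ (1u << f->bit));  /* flip one bit */
    case SET_ALL:   return 0xFFFFu;                              /* all ones     */
    case CLEAR_ALL: return 0x0000u;                              /* all zeros    */
    default:        return value;  /* MISSED_LOAD is handled at the assignment  */
    }
}

int main(void)
{
    /* Enumerate the n + 2 faults injected before one access to a 16-bit register. */
    struct reg_fault faults[16 + 2];
    unsigned n = 0;

    for (unsigned bit = 0; bit < 16; bit++)
        faults[n++] = (struct reg_fault){ BIT_XOR, bit };
    faults[n++] = (struct reg_fault){ SET_ALL,   0 };
    faults[n++] = (struct reg_fault){ CLEAR_ALL, 0 };

    uint16_t golden = 0xABCD;
    for (unsigned i = 0; i < n; i++)
        printf("fault %2u: value read as 0x%04X\n",
               i, (unsigned)apply_read_fault(golden, &faults[i]));
    return 0;
}

With n = 16, each injection point (each access and each global register at the end of main) contributes the n + 2 = 18 faults enumerated here, and each assignment contributes one missed load; four injection points and three assignments give exactly the 75 faulty runs counted in Figure 3.1.6.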
For any multiplexer (mux), the following faults may occur for a given source:

• No source is selected [36].
• The wrong source is selected [36].
• More than one source is selected, and the mux output is either a wired-AND or a wired-OR function of the sources, depending on the technology [36].
Note that these faults cover muxes built from standard logic gates as well as those implemented with tri-state drivers onto a bus, as shown in Figure 3.1.7. There may be many muxes in a processor: selecting a register from a register file, selecting an input for the ALU, selecting a data value to be written to memory, and so forth. Only the fault of the wrong source selected in the context of a register file is emulated with the RTL fault model. Selecting the wrong source in other contexts leads to combinational explosion for the same reason as the extraneous load problem. Also, there is no way to predict what the value would be if no source were selected. Finally, selecting more than one source would result in an intractable number of combinations, even if confined to the register files. Therefore, the only fault injected is
• During a register array access, select an alternate register; repeat for each alternate register, giving n − 1 faulty simulations for an array of size n.

Figure 3.1.7: Two implementations of a logical mux
The first represents a logic-gate implementation, and the second represents gating onto a bus with tri-state drivers. In both, control lines select among registers R0 through Rn, and the output is the value of the selected register.
Decoder faults are associated with addressing into a register file. The possible faults are

• No destination is selected [36].
• Instead of, or in addition to, the correct destination, one or more other destinations are selected [36].

The "no destination" fault is covered by the missed-load fault discussed earlier. To avoid numerous combinations, the only other fault to be covered is that of one other destination selected. Therefore, the only additional fault injected is
• During an assignment to a register array, make the assignment to an alternate register; repeat for each alternate register, giving n − 1 faulty simulations for an array of size n.
Possible hardware faults in data buses are

• One or more lines are stuck at 0 or 1 [36].
• One or more lines form a wired-OR or wired-AND function due to shorts or spurious coupling [36].
Faults in internal buses will only manifest when data is being transferred internally. Therefore, all single-line bus faults will be covered under register bit faults and single-bit input and output faults in the functional units. Multiple-bit faults are not emulated. For the external buses, single-bit faults are injected on the address and data lines. (Bus control lines are covered later.) The RTL faults, then, are as follows:
• XOR one bit of the address before a memory or I/O read or write; repeat for each bit.
• XOR one bit of the data before a memory or I/O write or after a read; repeat for each bit.
Data Processing Functional Units. Data processing functional units include ALUs, shifters, and special-purpose computational units such as dedicated incrementers. Due to the large number of possible units, a general fault model is presented.

• One or more input, output, or internal data lines are stuck at 0 or 1 or are coupled.
• One or more gates produce an incorrect output.
• One or more condition codes are calculated incorrectly.
• The wrong function is performed.
• A pre-processing step is eliminated.
• An extraneous pre-processing step is performed.
In order to develop an RTL fault model, generic processing units must be developed. First, a generic ALU is presented. Figure 3.1.8 shows the external interface and a bit slice of such a unit.
Figure 3.1.8: Generic ALU model
(a) External interface; (b) bit slice. The external interface consists of two n-bit inputs, A and B; an n-bit result, R; control lines to select a function; and four condition codes (zero, overflow, negative, and carry). The bit slice calculates a result for each bit of input; some functions require a carry input and generate a carry output.

Externally, there may be a single-bit fault on any of the data inputs or outputs. Control and pre-processing faults are limited to simple permutations listed at the end of this section. Faults not modeled include the substitution of a binary operation for a unary one. In that case, a second input would have to be chosen, of which there are numerous possibilities. Note that single-bit errors in the condition codes are covered by status line faults in the control section.

Within a bit slice, no assumption is made about the independence of the generation of the carry and result bits. In other words, a single fault in a bit slice may cause errors in both the carry and the result. Therefore, the fault model includes single-bit faults in the carry, the result, and both simultaneously. It is easy to show that these faults cover all single-bit faults in the data inputs, except the carry input to bit 0 (C0), which is handled as a special case. For performance reasons, of course, most ALUs do not use a simple bit-slice implementation. Emulating a fault in the carry bit of each stage of a bit-slice model, however, covers all faults in the
generation of corresponding carry bits computed via one or more carry look-ahead circuits. The RTL faults for ALU operations are as follows:

• For all unary operations:
  - XOR a bit of the result; repeat for each bit.
  - Perform an alternate unary operation; repeat for each of unary minus and logical negation.
  - Pass the input directly to the output.
• For all binary operations:
  - XOR a bit of the result; repeat for each bit.
  - Perform an alternate binary operation; repeat for each of add, subtract, OR, NOR, XOR, XNOR, AND, and NAND.
  - Pass the A or B input directly.
  - Replicate the A input on the B input or vice-versa. For example, A+B becomes A+A or B+B.
  - Logically negate A or B before the operation. This covers missing negations as well as extraneous negations because ¬(¬A) = A.
  - Clear A or B before the operation.
  - For addition, subtraction, and unary minus only, XOR the carry input.
  - XOR a carry bit; repeat for each bit.
  - XOR both a carry bit and the corresponding result bit; repeat for each bit.
• For subtraction only, switch operands A and B. All other ALU operations are commutative, so this transformation would have no effect.
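A few of these ALU fault variants can be expressed directly as perturbations of the golden computation. The C sketch below shows a result-bit XOR, operand replication, operand negation, and a carry-bit fault between bit slices for a hypothetical 16-bit add; the adder is written out as a bit-slice chain only so that the carry fault can be placed between slices, in the spirit of the modeling approach described in the notes to Table 3.1.4 later in this section. The operand values and faulted positions are arbitrary choices for the example.

#include <stdint.h>
#include <stdio.h>

/* Ripple-carry add with an optional fault on the carry between bit slices.
 * fault_slice < 0 means no fault. */
static uint16_t add16_carry_fault(uint16_t a, uint16_t b, int fault_slice)
{
    uint16_t result = 0;
    unsigned carry = 0;
    for (int i = 0; i < 16; i++) {
        unsigned ai = (a >> i) & 1u, bi = (b >> i) & 1u;
        unsigned sum = ai + bi + carry;
        result |= (uint16_t)((sum & 1u) << i);
        carry = sum >> 1;
        if (i == fault_slice)
            carry ^= 1u;              /* flip the carry passed to slice i+1 */
    }
    return result;
}

int main(void)
{
    uint16_t a = 0x00FF, b = 0x0001;
    uint16_t golden = add16_carry_fault(a, b, -1);           /* 0x0100 */

    printf("golden A+B            = 0x%04X\n", (unsigned)golden);
    printf("result bit 3 XORed    = 0x%04X\n", (unsigned)(golden ^ (1u << 3)));
    printf("A replicated on B     = 0x%04X\n", (unsigned)(uint16_t)(a + a));
    printf("B negated before add  = 0x%04X\n", (unsigned)add16_carry_fault(a, (uint16_t)~b, -1));
    printf("carry fault, slice 7  = 0x%04X\n", (unsigned)add16_carry_fault(a, b, 7));
    return 0;
}

In this example, flipping the carry out of slice 7 changes the result from 0x0100 to 0x0000, i.e., by the weight of the bit that the faulted carry feeds, which is the behavior the carry-bit faults are intended to capture.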
Next, consider a generic shifter. For performance reasons, a unit capable of shifting a word an arbitrary number of bits is usually separate from the ALU, often in parallel with the ALU data paths. A typical interface to a shifter and an example of the internal circuitry are shown in Figure 3.1.9. Possible hardware faults in the shifter include the following:
• One or more bits of the input or output are stuck at 0 or 1 or coupled.
• One or more pass gates in the shift network are stuck open or closed.
• The wrong function is performed (left/right or logical/algebraic).
• Incorrect condition codes are generated.
• The amount value is decoded improperly, producing one or more active shifting signals in addition to, or instead of, the correct signal.

Figure 3.1.9: Generic shifter module
(a) External interface; (b) an example shift network. The shifter takes an input value X and a shift amount of log2 n bits and produces an output Z, along with two condition codes (zero and negative). The control lines indicate direction (left or right) and mode (logical or algebraic). The amount value is decoded into n lines, one of which is active. These lines are combined with the direction lines to select the proper bit. The example circuit shows the generation of bit 1 of Z with a word size of 4 bits, logical shifting only.
To reduce the number of RTL faults, the only bit faults modeled are single bit flips on the output, which cover single bit faults on the input (except for the sign bit during a right algebraic shift, which is handled as a special case) and single gate faults in the shift network. Also, faults in the decoder are limited to one active output instead of the correct signal. The resulting RTL faults, then, are as follows:
• XOR a bit of Z; repeat for each bit.
• For shift algebraic right only, XOR the sign bit prior to shifting.
• Shift the wrong amount; repeat for each incorrect amount between zero and the length of X.
• Perform an alternate operation; repeat for each of shift left, shift right, and shift right algebraic (shift left algebraic is equivalent to shift left).
Special-purpose units such as multipliers and dividers have not yet been modeled due to the large number of possible implementations. Their modeling can be approximated with iterative shift-and-add algorithms.

Control. There is a large variety of ways to build the control section of a processor: "hard-wired," micro-programmed, micro- and nano-programmed, or a combination of more than one of these methods. Also, the implementation of the control section is responsible (by exploiting the capabilities of the data section) for differences between various realizations of the same architecture. These are usually seen by the user as differences in execution time measured in nanoseconds or clock cycles and are achieved by differing amounts of parallelism, including pipelining. Because of these factors, only a high-level fault model is presented.
• For any instruction, any other instruction may be executed due to incorrect decoding [36].
• A control command c(i) is faulty:
  - c(i) is missing when it should have been issued [36].
  - c(i) is always present [36].
  - c(i) is issued extraneously in addition to, or instead of, some command c(j) whenever c(j) should be issued [36].
• Status signals could be faulty, in a fashion similar to control commands [36].
To model the first fault, all the alternative choices in a decode statement are executed in lieu of the correct one. In order to simulate all possibilities, then, each instruction must be coded as a separate case statement within a single decode statement. To model missing control statements, certain statements are not executed. A missing load was covered under register faults. Other missing statements include the test instruction (set the condition codes based on an expression), expressions executed only for their side-effects (i.e., expressions with no left hand side), and memory and I/O writes.
Faults where control statements are always present are only applicable for permanent fault models, so they are not modeled. Extraneous commands will be limited to exactly one command being executed instead of the correct one; other combinations would lead to combinational explosion. Most of these have already been covered: the wrong register selected for accesses into and out of a register file and incorrect control in the ALU or shifter. Additional faults modeled are the substitution of a memory read for an I/O read or vice-versa, the substitution of a memory write for an I/O write or vice-versa, and the substitution of a signed bit-extension for an unsigned one and vice-versa.

Obviously, there are many more conceivable extraneous control faults, for example, extraneous writes to memory or I/O and extraneous register loads. Also, faults such as erroneous branching within a micro-instruction sequence could cause extreme deviation from the expected behavior. As before, without implementation details, a high coverage of these faults would lead to unacceptable overhead. It is the purpose of the experiments described later to determine the impact of these restrictions.

Faults in the status signals are modeled by negating the results of logical values. This covers errors in branching decisions and condition-code calculations. The model for control faults which were not covered under previous sections is as follows:
• Do not perform a test instruction.
• Do not evaluate an evaluate-only expression.
• Do not perform a memory write.
• Do not perform an I/O write.
• Perform an I/O read instead of a memory read, and vice-versa.
• Perform an I/O write instead of a memory write, and vice-versa.
• Perform a signed bit extension instead of an unsigned one, and vice-versa.
• Invert the results of the following logical operations or values: logical NOT, OR, and AND; comparison tests (equal, not equal, less than, less than or equal, greater than, and greater than or equal); and the condition codes (negative, zero, carry, and overflow).
Implementation-Dependent Registers. An implementation-dependent register is one not specified in the architectural description of a processor. Therefore, no fault model can explicitly exist for one or for data paths or control sections associated
with it. Assumptions in the way these registers are used, however, make coverage of most such faults by the preceding fault model quite likely.

Fault Model Summary. The fault model is summarized in Table 3.1.4. This table reviews the list of transient hardware faults for each section of a processor; a list of RTL faults is given that corresponds to each hardware fault, with each entry mapping an original RTL construct to its faulted form (original RTL ⇒ faulted RTL). Representative entries include the following:

Register, missed load:               R ← expr  ⇒  no-op
Register, extraneous load:           none
Register, level change in storage:   R (on r.h.s.)  ⇒  R ← R ⊕ 2^i before reading R, for all i
Level change on input or output:     covered by bit faults on the output of the source or the input of the destination
Control command substitutions:       many are covered previously; others are write(addr, data) ⇒ output(addr, data), output(addr, data) ⇒ write(addr, data), read(addr) ⇒ input(addr), input(addr) ⇒ read(addr), sign extend expr to length i ⇒ extend expr to length i, and extend expr to length i ⇒ sign extend expr to length i; many others are not covered
Level change on status signal:       logical operations, comparisons, and condition codes ⇒ invert result
Shifter sign bit (shift right algebraic):  A op B  ⇒  (A ⊕ 2^n) op B, where n is the position of the sign bit in A, op is shift right algebraic, A is the value to shift, and B is the amount

Table 3.1.4: Summary of RTL fault model
Notes to the table: Only single-bit faults are considered; this is true for all "XOR" faults. Subtraction and unary minus are covered by the ALU carry faults because A − B is implemented as A + ¬B + 1, and −A is implemented as 0 + ¬A + 1. A fault in carry bit i is actually modeled by iteratively simulating each slice of the bit-slice chain and XORing the carry bit between slices i and i+1; the "± 2^i" term indicates that 2^i is added if the carry bit was 0 and became 1 due to the fault, and 2^i is subtracted if the bit was 1 and became 0.
3.1.3.5 Operation

The operation of ASPHALT is outlined in Figure 3.1.10 and broadly consists of four phases: reading the RTL code and the initial machine state, the golden run, the faulty runs, and post-processing.

begin
    read RTL code and initial machine state
    simulate golden run:
    begin
        initialize registers
        simulate instruction and keep RTL execution trace
        save final state and outputs
    end
    simulate faulty runs:
    for each RTL operation i in execution trace
        for each possible fault j on node i
        begin
            initialize registers
            simulate instruction with fault j on i
            save final state and outputs
        end
    post-processing:
    begin
        collapse error states
        report statistics
    end
end
Figure 3.1.10: Pseudo-code for ASPHALT operation

Reading the RTL Code and Initial State. ASPHALT starts by reading a description of the processor like the fragments shown in Figure 3.1.5. Then it inputs the values for all the visible registers captured from the actual processor just before the instruction to be faulted. It is also given access to the processor's memory image to obtain instruction and data values.

The Golden Run. To obtain a basis for comparison, ASPHALT first simulates the instruction without any faults. Such a fault-free standard is called a "golden" run. ASPHALT initializes itself by loading its internal copies of the visible registers
with the initial machine state. All other registers are set to zero. It then executes main and any routines called by main. During this execution, ASPHALT keeps an RTL execution trace, which is a list of all the RTL operations evaluated during the golden run. The final machine state and a list of all memory and I/O writes are saved. The actual processor memory or I/O space is not written to during the simulation.

The Faulty Runs. Now ASPHALT is ready to inject the RTL faults. It begins this process by sequentially stepping through the execution trace from the golden run. ASPHALT makes one faulty run for each possible fault for any given RTL operation. For example, if the current operation in the trace is the access of a 16-bit register from a 4-register file, 21 faulty runs are made: one for the XORing of each of the 16 bits, one for setting the register to all ones, one for setting it to all zeros, and one to access each of the three incorrect registers in the file. When each run completes, the final machine state and a list of memory and I/O writes are saved for that run. Alternatively, if the injected error causes one of the built-in fatal-error routines to be called, the type of error is saved: address, bus, program, privilege, or opcode trap.

Post-Processing (Error Collapsing). When all faulty runs are completed, there exists a list of machine states and outputs, one for each injected fault. It is quite probable that some of these states will be identical. When performing fault-injection studies, it would be pointless to inject the same error more than once, so the list is collapsed into a set of unique errors. There are two ways in which errors are considered identical: equivalent fatal errors or equivalent states.

If two injected faults cause the same fatal-error routine to be called, they are considered equivalent. For example, if two bit-flips in the instruction register cause the opcode trap to be invoked, the error states would be considered equivalent and would be identified only as an opcode-type error. It is also quite probable that two faults cause the exact same final machine state and outputs to be generated, i.e., all the bits in all the registers are the same, and the same data are written to the same memory and I/O addresses. In this case, these error states are considered equivalent and are identified by the state and outputs. A special case of this is the unique state which is identical to the state at the end of the golden run. The faults that generated this state are considered overwritten because they can never manifest into errors. A diagram showing the error collapsing process is shown in Figure 3.1.11.
Figure 3.1.11: Error collapsing
The diagram maps the injected RTL faults to their resultant states, which are then partitioned into overwritten faults, equivalent fatal errors, and equivalent error states. In this example, there are ten injected faults, three of which are overwritten, leaving seven non-overwritten faults. Of these seven, there are only five unique states because two faults produced the same fatal error, and two produced equivalent non-fatal error states. Only the equivalent fatal errors and equivalent error states need to be used in a fault-injection experiment.
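The collapsing step amounts to grouping per-fault outcomes by a key: the trap type for fatal errors, or the final state plus the output list otherwise, with outcomes identical to the golden run marked as overwritten. The C sketch below illustrates that grouping; the fixed-size outcome record and the linear search are simplifications assumed for the example and are not ASPHALT's actual data structures.

#include <stdio.h>
#include <string.h>

#define MAX_RUNS 64

/* Hypothetical outcome record for one faulty run. */
struct outcome {
    int  fatal;            /* nonzero: which trap (address, bus, ...)  */
    char state[32];        /* final machine state, serialized          */
    char outputs[32];      /* memory and I/O writes, serialized        */
};

static int same_outcome(const struct outcome *x, const struct outcome *y)
{
    if (x->fatal || y->fatal)
        return x->fatal == y->fatal;                 /* equivalent fatal errors */
    return strcmp(x->state, y->state) == 0 &&
           strcmp(x->outputs, y->outputs) == 0;      /* equivalent error states */
}

/* Collapse faulty-run outcomes against the golden run; returns the number of
 * unique errors and counts the overwritten faults. */
static int collapse(const struct outcome *golden,
                    const struct outcome runs[], int n_runs,
                    struct outcome unique[], int *overwritten)
{
    int n_unique = 0;
    *overwritten = 0;
    for (int i = 0; i < n_runs; i++) {
        if (same_outcome(&runs[i], golden)) {        /* identical to golden run */
            (*overwritten)++;
            continue;
        }
        int found = 0;
        for (int j = 0; j < n_unique && !found; j++)
            found = same_outcome(&runs[i], &unique[j]);
        if (!found)
            unique[n_unique++] = runs[i];
    }
    return n_unique;
}

int main(void)
{
    struct outcome golden = { 0, "r1=5 cs=0", "" };
    struct outcome runs[] = {
        { 0, "r1=5 cs=0", "" },       /* overwritten                */
        { 0, "r1=7 cs=0", "" },       /* unique error state         */
        { 0, "r1=7 cs=0", "" },       /* duplicate of the above     */
        { 5, "",          "" },       /* opcode trap                */
        { 5, "",          "" },       /* same trap: collapses       */
    };
    struct outcome unique[MAX_RUNS];
    int overwritten;
    int n = collapse(&golden, runs, 5, unique, &overwritten);
    printf("unique errors: %d, overwritten faults: %d\n", n, overwritten);  /* 2, 1 */
    return 0;
}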
The following section will describe experiments used to verify the preceding RTL fault model and injection methodology.

3.1.4 EXPERIMENTATION

The main goal of the experiments presented in this chapter is to determine the extent to which the RTL fault model covers actual hardware faults. To estimate the error manifestations of actual hardware faults, a gate-level simulation model was used. Any of the injection methods surveyed earlier could have been used for fault injection at the gate level, but the advantage of full control over fault types and injections made software simulation the preferred choice.

Figure 3.1.12 shows an overview of the experimentation process. First, both the gate-level and RTL models are initialized with the same instruction to be faulted and the same machine state. Second, both models are exhaustively injected with
transient faults: gate-level faults into the gate-level model and RTL faults into the RTL model. This produces a set of machine states for each model. Finally, these two sets can be compared.
Figure 3.1.12: Overview of experiments
The instruction to be faulted and the initial machine state (registers and memory) are supplied to both a gate-level simulation and an RTL simulation. Each produces a set of resultant machine states, and the two sets are then compared.
Figure 3.1.13 shows a Venn diagram of this comparison. The intersection between the two sets is the covered faults. Removing the covered faults from the gate-level set leaves the non-covered faults. Removing them from the RTL set leaves the overhead, those faults generated by the RTL model but not produced by the hardware. The goal, of course, is to maximize the coverage while minimizing non-coverage and overhead.
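Stated in set notation (a restatement of the definitions above, not a formula taken from the chapter): if G is the set of equivalent error states produced by the gate-level model and R is the set produced by the RTL model, then

    coverage = |G ∩ R| / |G|        overhead = |R − G| / |R|

As an illustration using the "a"-instruction data reported later in Table 3.1.6, |G| = 711 and |G ∩ R| = 692, giving a coverage of 692 / 711 ≈ 0.973; if the overhead is computed against the 834 RTL error states for that instruction, it is (834 − 692) / 834 ≈ 0.17.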
3.1.4.1 ROMP Overview

The IBM RISC-Oriented Micro-Processor (ROMP) was chosen as a basis for the experiments for a number of reasons:

• The gate-level model had already been developed by another researcher [9]. Therefore, since it was developed by a disinterested party, the model was not intentionally or unintentionally written to conform to the implementation assumptions of the RTL fault model.
• There exists a fault-injection test-bed based on the IBM RT [3]. The IBM RT employs the ROMP as its CPU.
• The ROMP contains a number of interesting features: RISC technology, pipelining, both hard-wired and micro-coded control logic, and an asynchronous instruction pre-fetch buffer.
Salient architectural features of the ROMP include sixteen 32-bit general-purpose registers and a four-gigabyte virtual address space broken into 16 segments of 256 megabytes each. The following description of the processor is taken directly from [17]:

The ROMP chip is a pipelined processor capable of executing a new instruction every cycle. While one instruction is being executed, the next instruction can be decoded, and following instructions can be fetched from memory and buffered. A one-cycle execution rate is prevented when either an instruction requiring multiple execution cycles is executed, or a hold-off occurs [i.e., when the processor waits for an instruction or data from memory]. Most instructions take only one cycle, but Store and I/O Write instructions each take two cycles...
The ROMP processor is partially microprogrammed. It uses ROM for control during the execution cycles, but hardwired control logic is used for instruction prefetching and for memory data requests, since those operations are usually overlapped with the execution of other instructions. [Figure 3.1.14] shows a block diagram of the ROMP processor data flow.
Figure 3.1.14: ROMP data flow
Taken from [9], as adapted from the original [17]. Major sections include the instruction pre-fetch buffer (IPB), the micro-instruction fetch (MIF), data fetch and storage (DFS), the ALU and shifter (ALU), and the ROMP Storage Channel interface (RSCI).
Instruction Fetching

The instruction fetch area includes the Instruction Prefetch Buffers (IPBs), the IPB Multiplexer (MUX), and the Instruction Address Register (IAR) and its incrementers. Four IPBs are provided to keep the processor supplied with instructions. Instructions are prefetched whenever an IPB is available and there is no other request for use of the RSC [ROMP Storage Channel]. Every cycle, each IPB that is waiting for an instruction ingates the data from the RSC. During the following cycle, the tag associated with that data is examined to determine if it was addressed to any of the IPBs. If so, then that IPB will hold its contents until that instruction word is gated by the MUX to the decode circuits...

Execution Unit

The execution unit includes the register file, the AI and BI latches, the ALU and shifter, and the ALU output latch. It also includes the MQ [Multiplier Quotient] register and the Condition Status register, which are both SCRs [System Control Registers]. To support a one-cycle execution rate, a 4-port register file is used... Two of the ports of the register file are used to read two operands simultaneously into the AI and BI latches. Another port is used to write the result back from the ALU output latch, and the fourth port is used to write data from memory or I/O... The ALU is used for the execution of all arithmetic and logical instructions, and for address calculations when executing Load, Store, I/O and Branch instructions...

RSC Interface

The RSC interface consists of the request area and the receive area. The request area arbitrates for use of the RSC, and it receives the acknowledgment signals after requests are sent. [Read requests are sent with a 5-bit tag which uniquely identifies the data when they return from the memory or I/O.] The RSC receive area contains buffer registers for one incoming data word and tag. Each cycle, they capture whatever data word and tag is on the RSC... During the following cycle, the tag is examined to determine if the word is addressed to the ROMP, and if so, whether it is an instruction or data. If it is an instruction, the tag will also identify which IPB has been assigned to it. If it is data, the tag will point to one of two descriptors which will control the alignment by the formatter and store it into the proper location in the register file...
Control Unit

The control unit includes the microcode ROM, the Control Register (C Reg), the instruction decoders, and the ROM and register file address latches and control circuits. It also includes circuits for detecting and handling interrupts, machine checks, and program checks, as well as the SCRs associated with these events. The ROM contains 256 control words of 34 bits each... The control words are needed only for the execution cycles of the instructions, since instruction prefetching is controlled by hardwired circuits, and the last control word of each instruction controls the decoding of the next instruction... The control words are fetched from the ROM into the C Reg during the cycle prior to the one when they are executed. During the last execution cycle of any instruction, the next instruction is selected from one of the IPBs by the MUX and decoded. This decode cycle is used to simultaneously fetch a control word from the ROM and fetch the two operands from the register file into the AI and BI latches. The operation code is taken directly from the output of the MUX and used as the ROM address... Also during the decode cycle, the register address for the result, called the destination address, is put into a two-stage pipeline to be used two cycles later for storing the result from the ALU output latch into the register file.

The ROMP provides 118 instructions in ten classes, as shown in Table 3.1.5. The branch instructions include "Branch with Execute" versions which allow overlap of the fetch of the branch target instruction with the instruction following the branch instruction (called the subject instruction). This eliminates dead cycles that would normally occur when flushing the pipeline.

    Instruction class        Number of instructions
    Memory Access                    17
    Address Computation               8
    Branch and Jump                  16
    Traps                             3
    Moves and Inserts                13
    Arithmetic                       21
    Logical                          16
    Shift                            15
    System Control                    7
    Input and Output                  2
    Total                           118

Table 3.1.5: ROMP instruction classes

3.1.4.2 The ROMP Hardware Model

The gate-level model of the ROMP was written in Verilog, a hardware description language and simulator, in approximately 2600 lines of code. To generate error manifestations from the gate-level model, there must be a gate-level fault model and a methodology of fault injection.

Fault Model. There are two portions of the fault model: gate-level faults and behavioral faults for the registers. The gate-level faults are single fixed-at-0 and fixed-at-1 faults. Fixed-at faults are distinguished from stuck-at faults by being transient as opposed to permanent. The lines are fixed to a logical zero or one value for one clock cycle using a forcing feature of the Verilog simulator (a "feature" actually intended to be used as a debugging tool). Faults are injected into only one line at a time; no multiple faults are simulated. These faults are injected exhaustively into all nets in the model (spatial) and during all clock cycles of the instruction at the fault-injection location (temporal).

Since the registers are modeled at a behavioral level, a behavioral fault model is necessary. The model is based on the list of transient hardware faults from the "Register" section of Table 3.1.4. The following faults are injected exhaustively:

Missed load: The register is not loaded when it should be.
Extraneous load: The register is loaded when it should not be. (Recall that these faults are not directly emulated in the RTL fault model.)
Level change in storage: A bit in the register is flipped.

It is not necessary to model level changes in the input or output lines, because they are accessible at the gate level. The procedure in Figure 3.1.15 is used to inject faults into a certain net or register. The process is repeated for all nets and registers in the Verilog model.
begin
    initialize machine state
    simulate instruction without faults (golden run)
    record state and outputs at each cycle
    for i = each cycle during instruction
    begin
        for j = each fault for the current net or reg
        begin
            initialize machine state
            simulate instruction with fault j in cycle i
            record state and outputs
        end
    end
end

Figure 3.1.15: Pseudo-code for gate-level fault injection
3.1.4.3 Results

Faults were injected into the ROMP "a" instruction (add) following the procedure outlined above. The results are shown in Table 3.1.6. First, by looking at the row labeled "Total gate level," note that the gate-level injection into the Verilog model produced 711 error states. Of these, 692 were successfully reproduced by the simpler RTL model.

Recall from the development of the RTL fault model that the faults for the control section were more general than those for the data section. The effects of this are evident in the results. Almost all (0.9970) of the data errors were reproduced, but only about two thirds (0.6885) of the control errors were covered. Fortunately, there are only 61 control errors compared to 669 data errors, so the average (0.9733) is weighted heavily by the data coverage.

An error coverage of over 0.97 is encouraging for this simple instruction, but how do these results compare to those for different opcodes? Similar experiments were done using 91 of the ROMP's 118 instructions by injecting over 1.3 million gate-level faults and 176 thousand RTL faults. The results are shown in Figure 3.1.16. The mean coverage is 0.970 with a standard deviation of 0.0138. Also note that the mean overhead is 0.205, indicating that about 20% of the errors generated by ASPHALT were not generated by the Verilog model. Detailed data for each instruction may be found in [40].

A Study of the Outlying Data. This section looks at the instructions which produced the lowest coverage values. The sources of repeating non-covered faults are discussed. To improve the coverage values, the RTL model is modified based on the observations, and the effects of the new model are presented.
    Source                         Total faults    Faults causing    Equivalent      Error states covered
                                   injected        fatal errors      error states    by RTL injection
    Gate level, control section     4452            60                 61            42 (68.85%)
    Gate level, data section        7686             4                669           667 (99.70%)
    Total gate level               12138            64                711           692 (97.33%)
    RTL level                       1718            40                834           N.A.

Table 3.1.6: Injection into "a" instruction
The first three rows show data from injection into the gate-level model: the first row shows the results of injection into the control section only; the second, the data section; and the third, both. The fourth row shows data from injection with ASPHALT. The first column of data shows the total number of faults injected over all cycles of the "a" instruction. The second column shows the number of those faults which caused fatal errors such as addressing or opcode traps. The third column shows the number of equivalent error states, as illustrated in Figure 3.1.11. The fourth column shows how many of the gate-level equivalent error states were also generated by ASPHALT.
From the graphs of the data, the four leftmost points are immediately identified as the most extreme outliers. These values are all between 0.90 and 0.94, whereas the other 87 are between 0.94 and 0.99. These values were produced by the four branch-and-link instructions: branch and link immediate (bali), branch and link (balr), and the two versions which execute the subject instruction (balix and balrx). Is there a property of the RTL fault model or the RTL ROMP model which accounts for these relatively low coverage values?

The branch-and-link instructions are so called because they copy the value of the updated IAR into a link register before branching, thereby providing a return address to be used at the end of a subroutine. An analysis of the resultant machine states from the gate-level simulation which were not covered by the RTL fault model reveals two separate effects: one for bali and balix and another for balr and balrx. These are discussed in turn below.
(Actually, there are six branch-and-link instructions in the ROMP. The two branch and link absolute instructions, bala and balax, were not implemented in the Verilog model.)
Figure 3.1.16: Dot chart of coverage data for 91 instructions
The opcodes are listed along the left-hand side, with coverage plotted along the horizontal axis.
First, in the set of non-covered states from both bali and balix, there were 32 single-bit faults in both IAR and the link register. These faults were caused by single-bit faults in IAR and fixed-at faults in the ALU inputs. If these 32 errors were covered in bali, for example, the coverage would rise from 0.915 to 0.962. Figure 3.1.17 shows the key RTL transfers that are executed by the RTL model during bali. From this sequence, it is evident that no single RTL fault can affect both IAR and r15, the link register, due to the temporary variable instruction_address used in the RTL code. By analyzing the microcode from the Verilog model, one can see that no such temporary variable is used. Instead, the following two transfers are made in parallel:

    r15 ← IAR + 4                             (3.1.1)
    IAR ← IAR + immediate_value               (3.1.2)
Thus, a single-bit fault can affect both r15 and IAR. To compensate for this discrepancy, a change is introduced in the RTL model as shown in Figure 3.1.18 to force a single-bit fault in the temporary register to affect both r15 and IAR.

    1. instruction_address ← IAR
    2. IR ← readhw(IAR)
    3. IAR ← IAR + 2
    4. word2 ← readhw(IAR)
    5. IAR ← IAR + 2
    6. r15 ← IAR
    7. IAR ← instruction_address + immediate_value
       (where immediate_value is a simple function of word2 and 5 bits of the IR)

Figure 3.1.17: Transfers due to RTL model for bali instruction
Line 1 saves IAR to be used to compute the branch address later in line 7. Lines 2 and 3 fetch the opcode and update IAR (readhw(x) is a function to read 16 bits from address x). Lines 4 and 5 fetch the remainder of the immediate operand because bali is a 32-bit instruction. Line 6 saves the link address in r15, and line 7 updates IAR with the branch address. Any single-bit failure that would affect both IAR and r15 would have to occur before IAR is saved in line 1, but that would cause a different instruction to be fetched and executed altogether.
    1. instruction_address ← IAR
    2. IR ← readhw(IAR)
    3. IAR ← IAR + 2
    4. word2 ← readhw(IAR)
    5. IAR ← IAR + 2
    6. r15 ← instruction_address + 4
    7. IAR ← instruction_address + immediate_value
       (where immediate_value is a simple function of word2 and 5 bits of the IR)

Figure 3.1.18: Transfers due to modified RTL model for bali instruction
Line 6 is modified to use the saved IAR value. Now, a single-bit failure in the instruction_address register immediately prior to line 6 affects both IAR and r15.

Now, the balr and balrx instructions are considered. In these, the predominant non-covered gate-level error is multiple-bit faults in IAR due to single-bit faults in the ALU inputs. A closer inspection reveals that these multiple faults are due to an erroneous carry being generated which propagates through several bits.
The balr and balrx instructions use the contents of a register as the branch address. An analysis of the microcode in the gate-level model reveals that the ALU is being used to transfer the value from the register to IAR by adding an immediate zero to the value. Since an add is being executed, faults which insert ones into the immediate zero may cause a carry to propagate along a sequence of ones in the other input. It is also observed that the branch-on-bit instructions where a register provides the branch address (bbr, bbrx, bnbr, and bnbrx) perform the same transfers through the ALU and exhibit similar non-covered error states. The RTL code, of course, does not use an addition of zero to transfer the register value to IAR. To mimic the gate-level behavior, however, a constant zero is added to the register during the transfer to IAR.

The next value that stands out in Figure 3.1.16 is that for the instruction onec, with a coverage proportion of only 0.950. All other arithmetic instructions have coverage values in excess of 0.970. Onec is a unary operation to take the one's complement (bit-wise negation) of a register. Recall the fault model for the ALU derived earlier. To reduce overhead, the fault model for unary operations was a special case. For example, the only alternative operation for negation is a unary minus or a direct passage of the argument. In the gate-level model, however, onec is implemented as subtraction from an immediate zero without the carry input being set. In other words, A − B is implemented as A + ¬B + 1; ¬B is implemented as 0 + ¬B + 0. To mimic this behavior in the RTL description, an immediate zero is added to the negated operand. Similarly, the other two unary instructions, twoc (two's complement) and abs (absolute value), are modified by adding or subtracting an immediate zero as required.
If the new data for the 11 modified instructions are substituted, the mean coverage for all 91 instructions is raised slightly from 0.970 to 0.972, and the standard deviation is reduced from 0.0142 to 0.00889. The more salient effect is that the extreme nature of the outlying data has been reduced to a large extent. Compare the resulting dot chart in Figure 3.1.19 to the previous data. The following section examines the time required for simulation at the RTL and gate levels.

3.1.4.4 Time Requirements

The experiments were performed on a DECstation 5000. The average CPU time required to simulate one instruction, and thus one fault, with the Verilog simulator is approximately 1.65 seconds. There were 1,327,004 gate-level faults injected across the 91 implemented instructions, which produced 61,286 unique errors, for an injection-to-error ratio of almost 21.7:1. Thus, the average CPU time devoted to producing one error is approximately 35.8 seconds.

In contrast, the average CPU time required to simulate one fault with ASPHALT is 24.2 milliseconds. This represents a per-fault gate-level to RTL simulation time ratio of 68.2:1. There were 176,245 RTL faults injected across the 91 implemented instructions, which produced 75,052 unique errors, for an injection-to-error ratio of 2.35:1. Thus, the average CPU time devoted to producing one error is approximately 56.9 milliseconds. Injecting the 1.3 million faults via Verilog consumed approximately 609 CPU hours. Injecting the 176 thousand faults via ASPHALT consumed approximately 1.19 CPU hours. Thus, the overall average ratio of gate-level to RTL simulation time is 512:1.

3.1.5 CHAPTER SUMMARY

Modern computer applications often require high levels of dependability. To meet these requirements, computer architects often use fault-detection and fault-tolerance mechanisms. In order to validate these mechanisms, these engineers often use fault-injection techniques. Major sources of faults are transient and intermittent hardware faults. Current methods to inject, simulate, or emulate these faults have significant time, cost, or accuracy limitations.
DECstation is a trademark of Digital Equipment Corporation.
Figure 3.1.19: Dot chart of coverage with new RTL code. Compare to Figure 3.1.16.
This chapter has discussed a new method to emulate faults using software-implemented fault injection (SWIFI) which addresses these concerns. By studying how faults propagate in general sequential networks, it was determined that reverse fault emulation via SWIFI could emulate a large majority of the faults that occur in processors. In order to determine how to emulate those faults, a fault model based on a register-transfer-level (RTL) description of the processor was proposed. This fault model was developed based on a proposed list of possible hardware faults in all sections of a processor: data storage and transfer, data processing units, and control. The goals of this model were to (1) obtain high coverage, the proportion of errors generated by gate-level faults that are successfully reproduced by the RTL fault model, and (2) minimize overhead, the proportion of errors that are produced by the RTL fault model but not by the gate-level model.

A prototype RTL simulator and fault injector, ASPHALT, was developed to obtain experimental data. As a paradigm for future fault-injection studies, a study of the IBM RISC-Oriented Micro-Processor (ROMP) was proposed. An RTL model was developed, and a gate-level model was used for comparison purposes. Faults in all implemented instructions were injected exhaustively. Coverage ranged from 0.90 to 0.99 with a mean of 0.97. Overhead ranged from 0.10 to 0.40 with a mean of 0.21.

The coverage value of 0.97 was derived from a naive RTL processor model which was coded with no knowledge of the implementation. Considering some simple implementation details allows a modest increase in coverage. To accomplish this, the instructions which exhibited the lowest coverage values were examined. A study of the microcode for those instructions revealed that certain RTL statements could be slightly revised to better emulate the actual data transfers. Minor changes for eight branch instructions and three unary arithmetic instructions increased the minimum coverage from 0.90 to 0.94.

An analysis of the time requirements found that it took an average of 1.65 CPU seconds to simulate a fault at the gate level using the Verilog simulator, but only 24.2 milliseconds when using ASPHALT, the RTL simulator, a ratio of 68:1. Exaggerating this effect is the fact that the injection-to-error ratio at the gate level is almost 22:1 but only 2.4:1 with RTL. This leads to an overall reduction of effort of 512:1.

3.1.5.1 Recommendations

In general, based on the above results, the following procedure is recommended for a fault-injection study:
1. Develop an RTL model of the processor.

2. If only preliminary results are requested or the specifications for coverage are not stringent, use only the RTL fault simulator with a SWIFI tool such as FIAT or FERRARI to perform the studies. Random sampling of instructions in the workloads might also be used to decrease injection time.

3. If the specifications are more demanding, obtain a gate-level description. Follow the example experiments performed on the ROMP to determine average coverage and overhead statistics. Based on these results, a partial gate-level simulation may be used, or the RTL fault model may be enhanced based on the gate-level knowledge. In any case, a substantial savings over exclusive gate-level modeling is to be expected.

3.1.5.2 Contributions

The work described in this chapter has contributed the following to the field of fault injection:

• A new fault model which operates on an RTL description of a processor and is based on lower-level fault models.

• An automated tool which injects these faults, producing error patterns over 500 times faster than a gate-level model of the same processor.

• Experimental results which show that this abstraction is capable of reproducing over 97% of the error patterns generated by a gate-level model of the chosen processor.

• A paradigm for developing future fault-injection studies on future processors.
3.1.5.3 Future Work

Future research can be carried out in at least three broad areas: additional verification, enhancements, and applications. First, there are several experimental parameters that were not varied due to time constraints. In order to more fully verify and quantify the RTL fault model, some additional experiments may be performed:

• The procedure can be repeated on other processors to show that the fault model has generality across processors. These results will be especially interesting when the experiments are performed on higher-performance CPUs with the latest performance-enhancing devices.

• The procedure can be repeated with various RTL models coded by independent persons. This will show any sensitivity to coding style.

Second, there are several alternatives for enhancing the RTL fault model:

• A broader spectrum of architectural knowledge, such as that applied to the branch and unary operations, could be applied to the RTL model for additional data.

• Similarly, if architectural knowledge is available, the RTL fault model itself may be tailored for a particular processor. This might include a richer set of alternatives for ALU operations based on microcode analysis, for example.

• Computer synthesis techniques might be applied to determine reasonable data and control paths for the processor being simulated. This could perhaps provide more intelligent control faults, increasing coverage and decreasing overhead.

Third, of course, there is a plethora of possible applications for this technology:

• The speed and accuracy advantages offered by this new fault model make fault-injection studies much more feasible. ASPHALT could be integrated into FIAT, FERRARI, or a similar system to perform these studies. The data gathered can not only facilitate verification of existing fault-detection and fault-tolerance techniques, but can also be used in the design cycle to help develop the next generation of highly dependable computing systems.

• Similarly, ASPHALT could be integrated into a robustness benchmark [33] to help derive reliability figures-of-merit for computer systems.
REFERENCES

[1] W. Richards Adrion, Martha A. Branstad, and John C. Cherniavsky. Validation, Verification, and Testing of Computer Software. ACM Computing Surveys, 14(2):159-192, June 1982.

[2] J. Arlat, Y. Crouzet, and Jean-Claude Laprie. Fault-Injection for Dependability Validation of Fault-Tolerant Computing Systems. 19th International Symposium on Fault-Tolerant Computing (FTCS-19), pages 348-355, June 1989.

[3] James H. Barton, Edward W. Czeck, Zary Segall, and Daniel P. Siewiorek. Fault Injection Experiments Using FIAT. IEEE Transactions on Computers, 39(4):576-582, April 1990.

[4] Dhananjay Brahme and Jacob A. Abraham. Functional Testing of Microprocessors. IEEE Transactions on Computers, C-33(6):475-485, June 1984.

[5] Ram Chillarege and Nicholas S. Bowen. Understanding Large System Failures—A Fault Injection Experiment. Technical Report RC 14233, IBM Corporation Research Division, IBM T.J. Watson Research Center, Yorktown Heights, NY, 1989.

[6] Gwan S. Choi and Ravishankar K. Iyer. FOCUS: An Experimental Environment for Fault Sensitivity Analysis. IEEE Transactions on Computers, 41(5), May 1992.

[7] Gwan S. Choi, Ravishankar K. Iyer, and Victor A. Carreno. Simulated Fault Injection: A Methodology to Evaluate Fault Tolerant Microprocessor Architectures. IEEE Transactions on Computers, 39(4):486-490, April 1990.

[8] J. Cusick. SEU Vulnerability of the ZILOG Z-80 and NSC-800 Microprocessors. IEEE Transactions on Nuclear Science, NS-32(6):4206-4211, December 1985.

[9] Edward Czeck. On the Prediction of Fault Tolerant Behavior Based on Workload. Ph.D. thesis, Carnegie Mellon University, December 1990.

[10] Edward Czeck and Daniel Siewiorek. Observations on the Effects of Fault Manifestation as a Function of Workload. IEEE Transactions on Computers, 41(5):559-564, May 1992.

[11] Thomas R. Dilenno, David A. Yaskin, and James H. Barton. Fault Tolerance Testing in the Advanced Automation System. 21st International Symposium on Fault-Tolerant Computing (FTCS-21), pages 18-25, June 1991.

[12] R. Duba and R.K. Iyer. Transient Fault Behavior in a Microprocessor: A Case Study. Proceedings of the 1988 IEEE International Conference on Computer Design: VLSI in Computers and Processors, October 1988.
[13] Joanne Bechta Dugan. On Measurement and Modeling of Computer Systems Dependability: A Dialog Among Experts. IEEE Transactions on Computers, 39(4):506-509, April 1990.

[14] Klaus Echtle and Yinong Chen. Evaluation of Deterministic Fault Injection for Fault-Tolerant Protocol Testing. 21st International Symposium on Fault-Tolerant Computing (FTCS-21), pages 418-425, June 1991.

[15] Kumar Goswami and Ravishankar K. Iyer. DEPEND: A Simulation-Based Environment for System Level Dependability Analysis. Technical Report UILU-ENG-92-2217/CRHC-92-11, Coordinated Science Laboratory, College of Engineering, University of Illinois at Urbana-Champaign, June 1992.

[16] Ulf Gunneflo, Johan Karlsson, and Jan Torin. Evaluation of Error Detection Schemes Using Fault Injection by Heavy-ion Radiation. 19th International Symposium on Fault-Tolerant Computing (FTCS-19), pages 340-347, June 1989.

[17] IBM RT Personal Computer Technology. Technical Report SA23-1057, IBM, 1986.

[18] Ravi K. Iyer and D.J. Rossetti. A Measurement-Based Model for Workload Dependence of CPU Errors. IEEE Transactions on Computers, C-35:511-519, June 1986.

[19] Richard L. Johnson Jr., Sherra E. Diehl-Nagle, and John R. Hauser. Simulation Approach for Modeling Single Event Upsets on Advanced CMOS SRAMS. IEEE Transactions on Nuclear Science, NS-32(6):4122-4127, December 1985.

[20] Ghani A. Kanawati, Nasser A. Kanawati, and Jacob A. Abraham. FERRARI: A Tool for the Validation of System Dependability Properties. 22nd International Symposium on Fault-Tolerant Computing (FTCS-22), pages 336-344, July 1992.

[21] Ghani A. Kanawati, Nasser A. Kanawati, and Jacob A. Abraham. A High-level Error Model Automatic Extractor. Technical Report UTCERC-TR-JAA93-01, Computer Engineering Research Center, University of Texas at Austin, January 1993.

[22] Ramachandra P. Kunda, Prakash Narain, Jacob A. Abraham, and Bharat Deep Rathi. Speed Up of Test Generation using High-Level Primitives. 27th ACM/IEEE Design Automation Conference, pages 594-599, 1990.

[23] Jean-Claude Laprie. Dependable Computing and Fault Tolerance: Concepts and Terminology. 15th International Symposium on Fault-Tolerant Computing (FTCS-15), pages 2-11, June 1985.

[24] T. May and M. Woods. Alpha-Particle-Induced Soft Errors in Dynamic Memories. IEEE Transactions on Electron Devices, ED-26:2-9, January 1979.

[25] Ghassem Miremadi, Johan Karlsson, Ulf Gunneflo, and Jan Torin. Two Software Techniques for On-line Error Detection. 22nd International Symposium on Fault-Tolerant Computing (FTCS-22), pages 328-335, July 1992.

[26] Donald K. Nichols et al. Trends in Parts Susceptibility to Single Event Upset from Heavy Ions. IEEE Transactions on Nuclear Science, NS-32(6):4189-4194, December 1985.

[27] Thomas M. Niermann, Wu-Tung Cheng, and Janak H. Patel. PROOFS: A Fast, Memory Efficient Sequential Circuit Fault Simulator. 27th ACM/IEEE Design Automation Conference, pages 535-540, 1990.

[28] Forrest E. Norrod. An Automatic Test Generation Algorithm for Hardware Description Languages. 26th ACM/IEEE Design Automation Conference, pages 429-434, 1989.
[29] Joakim Ohlsson, Marcus Rimen, and Ulf Gunneflo. A Study of the Effects of Transient Fault Injection into a 32-bit RISC with Built-in Watchdog. 22nd International Symposium on Fault-Tolerant Computing (FTCS-22), pages 316-325, July 1992.

[30] Michael Peercy and Prithviraj Banerjee. Design and Analysis of Software Reconfiguration Strategies for Hypercube Multicomputers under Multiple Faults. 22nd International Symposium on Fault-Tolerant Computing (FTCS-22), pages 448-455, July 1992.

[31] Z. Segall, D. Vrsalovic, D. Siewiorek, D. Yaskin, J. Kownacki, J. Barton, R. Dancey, A. Robinson, and T. Lin. FIAT—Fault Injection Based Automated Testing Environment. Proceedings of the 18th International Symposium on Fault-Tolerant Computing (FTCS-18), pages 102-107, The Computer Society of the IEEE, Washington, DC, June 1988.

[32] Daniel P. Siewiorek and Robert S. Swarz. The Theory and Practice of Reliable System Design. Digital Press, DEC, Bedford, MA, 1982.

[33] Daniel Siewiorek, John Hudak, Byung-Hoon Suh, and Zary Segall. Development of a Benchmark to Measure System Robustness: Experiences and Lessons Learned. 23rd International Symposium on Fault-Tolerant Computing (FTCS-23), July 1993.

[34] Daniel P. Siewiorek and Robert S. Swarz. Reliable Computer Systems: Design and Evaluation. Digital Press, DEC, Bedford, MA, second edition, 1992.

[35] Satish M. Thatte and Jacob A. Abraham. User Testing of Microprocessors. Spring '79 Compcon, 18th IEEE Computer Society International Conference, pages 108-114, 1979.

[36] Satish M. Thatte and Jacob A. Abraham. A Methodology for Functional Level Testing of Microprocessors. 8th International Symposium on Fault-Tolerant Computing (FTCS-8), pages 90-95, June 1978.

[37] Satish M. Thatte and Jacob A. Abraham. Test Generation for General Microprocessor Architectures. 9th International Symposium on Fault-Tolerant Computing (FTCS-9), pages 203-210, June 1979.

[38] P.C. Ward. Behavioral Fault Simulation in VHDL. 27th ACM/IEEE Design Automation Conference, pages 587-593, 1990.

[39] Luke T. Young and Ravishankar K. Iyer. A Hybrid Monitor Assisted Fault Injection Environment. Technical Report UILU-ENG-92-2207/CRHC-92-04, Coordinated Science Laboratory, College of Engineering, University of Illinois at Urbana-Champaign, February 1992.

[40] Charles R. Yount. The Automatic Generation of Instruction-Level Error Manifestations of Hardware Faults: A New Fault-Injection Model. Ph.D. thesis, Carnegie Mellon University, May 1993.
SECTION 3.2
REACT: An Integrated Tool for the Design of Dependable Computing Systems

Jeffrey A. Clark and Dhiraj K. Pradhan

3.1 Introduction

The REliable Architecture Characterization Tool (REACT) is a generalized software testbed for analyzing a variety of fault-tolerant multiprocessor systems. REACT abstracts a system at the architectural level and performs automated life testing through simulated fault injection to accurately and efficiently measure dependability. It integrates detailed system, workload, and fault/error simulation models that have been derived, in part, from the published results of computer performance studies and low-level fault-injection experiments. This chapter introduces REACT and discusses the theory behind its approach toward dependability modeling. The system, workload, and fault/error models it employs are developed in the next three sections. Section 4 describes the implementation of the testbed, including its basic operation and the derivation of its dependability measures. Section 5 explains the approach taken to verify and validate REACT. Performance limitations of the tool are considered in Section 6. Appendix 1 contains an overview of the REACT software. Sample input and output files appear in Appendix 2.
3.2 System Model

REACT will analyze most systems that can be represented by the model pictured in Figure 1. This class of architectures consists of one or more processor modules (P) interconnected via buses (B) to one or more memory modules (M) through a block of fault-tolerance mechanisms. Any number of processors and memories may be specified, and each can be designated as initially active, or as a hot or cold standby spare. Homogeneous groups of processors or memories can be defined in which all modules operate redundantly. A group of processors executes the same workload in lock-step synchronization. Memories in a group have identical contents and are accessed simultaneously. Multiple, non-overlapping groups may coexist within the system.

Figure 1: REACT System Model (processor modules P connected via buses B, through a block of fault-tolerance mechanisms and an interconnect, to memory modules M)

Processor and memory modules are abstracted at the architectural level. The operation of each processor is governed by a stochastic model discussed later in this chapter. The state of a processor determines what values are driven onto its address and data buses during memory accesses. Processor state changes when it experiences a fault or when an erroneous value is read from memory (through the fault-tolerance mechanisms). The state of a memory is defined by the contents of its bit-array. The functionality of its addressing logic is simulated on every memory access. Memory state changes when it experiences a fault or when an erroneous value is written into it by the processor (through the fault-tolerance mechanisms). In order to reduce the complexity of the system model, logical "0" and "1" are not differentiated. Individual bits may take either an error-free or erroneous value. Memory depth (N_words) is variable up to 256 Mwords per module, and a 32-bit word width has been assumed for memory and all data paths. Although caches are not explicitly represented within the REACT system model, the workload model it employs does account for the influence of caches on memory error latency and propagation. More detailed modeling of the processor and memory modules is possible, but may result in longer simulation run-times.

The fault-tolerance mechanisms provide the hardware necessary to detect, correct, or mask errors during memory accesses, and to reconfigure the system when modules fail. Hardware components typically found in this block of mechanisms include majority voters, comparators, switches, and error control codes. The operation of these mechanisms is simulated with a set of six software routines that delineate how addresses and data are transformed by the hardware when passing between the processors and memories during an access. Users of REACT have complete freedom in specifying the number, functionality, and interconnection of these mechanisms. This gives the system model the flexibility needed to represent a variety of multiprocessor architectures that utilize passive, active, or hybrid redundancy to achieve fault tolerance. Some examples of the systems that can be analyzed with REACT are:

• N-modular redundancy: C.vmp [31], FTMP [19], Integrity S2 [20]
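As a concrete illustration of the kind of transformation such mechanisms apply, the sketch below shows a bit-wise majority voter of the sort a triplicated memory group might use on the data path back to a processor. It is only an example; REACT's actual six-routine interface is not reproduced here.

```python
# Illustrative bit-wise majority voter for three redundant 32-bit memory words,
# the sort of fault-tolerance mechanism REACT can model on the data path.
# This is a stand-alone example, not REACT's actual routine interface.

def majority_vote(word_a, word_b, word_c):
    """Return the bit-wise 2-of-3 majority of three words."""
    return (word_a & word_b) | (word_a & word_c) | (word_b & word_c)

# A single erroneous copy is masked; two matching erroneous copies are not.
print(hex(majority_vote(0xDEADBEEF, 0xDEADBEEF, 0x00000000)))  # 0xdeadbeef
print(hex(majority_vote(0xDEADBEEF, 0x00000000, 0x00000000)))  # 0x0
```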
3.3 Workload Model

A synthetic workload is assumed for which the processors continually perform instruction cycles consisting of up to three memory references plus the simulated execution of an instruction. Real code and data are not used by REACT, but errors will propagate throughout the system as if an application program were actually being executed. Only those instruction cycles that change the state of the system, by propagating or overwriting errors, need to be simulated. It would be necessary to simulate every instruction, regardless of its impact, if real programs were to be used. The computational expense of simulating possibly trillions of instructions over the operational life of each system is obviously unacceptable.

Behavior of the application workload is specified by the mean instruction execution rate (R_instr), the mean rates at which main memory is read (R_Mread) and written (R_Mwrite) per instruction, and a locality-of-reference model. The following memory references are assumed to take place during every instruction cycle:

• One access to read the instruction
• One access to read data with probability P_data read
• One access to write data with probability P_data write

Memory references in real systems can access words in either the cache, located within each processor, or the (main) memory module. Because caches are not explicitly modeled, references to them are not simulated. However, the effect of accessing erroneous words in the cache is represented at a higher level of abstraction by the processor fault model, which is described in the next section. Main memory is accessed only when the cache does not contain the referenced word. The miss probability (P_miss) defines the likelihood that a referenced word does not reside in the cache.

Words in main memory are usually accessed in blocks, rather than individually. However, block accesses do not have to be simulated to correctly model the net flow of words between the processor and its associated memory. The workload model is specified in terms of time-averaged rates and probabilities that are independent of block size. All main memory references can therefore be assumed to read or write only one word, allowing the mean access rates to be computed as:

R_Mread = P_miss (1 + P_data read)    (1)

R_Mwrite = P_miss P_data write    (2)

Values for P_data read, P_data write, and P_miss may be obtained directly from memory reference traces of the uniprocessor that will be used in the realization of the fault-tolerant multiprocessor system being evaluated.
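As a worked example of Equations (1) and (2), the following snippet computes the mean access rates from illustrative probabilities chosen within the ranges quoted in this section; the numbers are not measurements from any particular machine.

```python
# Worked example of Equations (1) and (2); the numeric inputs are illustrative
# values only, not measurements from any particular machine.

def mean_access_rates(p_miss, p_data_read, p_data_write):
    r_mread = p_miss * (1.0 + p_data_read)   # Equation (1)
    r_mwrite = p_miss * p_data_write         # Equation (2)
    return r_mread, r_mwrite

r_read, r_write = mean_access_rates(p_miss=0.05, p_data_read=0.4, p_data_write=0.2)
print(r_read, r_write)   # 0.07 main-memory reads and 0.01 writes per instruction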
Table 1 lists typical values for the probabilities of accessing memory data on several different machines.

Table 1: Typical Values for the Data Access Probabilities (P_data read and P_data write), for machines including the VAX 8350, DEC Titan, MC88000, VAX 8800, VAX-11/780, Z8000, VAX, VAX (Lisp), IBM 370, IBM 360/91, CDC 6400, and VAX-11, with data from Agarwal [1], Borg [5], Bugge [7], Clark [13], Emer [16], Smith [32], and Wiecek [39].
The probability of performing a data read during an instruction ranges from approximately 0.2 to 0.8. The probability of performing a data write is usually about half that of reads. Cache miss probabilities have been observed to vary from 0.01 to 0.2 [18]. The miss probability for writes can be set independently of that for reads, to simulate different main memory update policies.

The locality-of-reference model determines which memory locations are accessed during an instruction cycle. REACT implements a first-order approximation of a model based on Bradford-Zipf distributions [8, 9]. The locality model assumes that α_locality × 100% of all references access β_locality × 100% of the memory, under the condition α_locality + β_locality = 1. Values for α_locality and β_locality can be extracted from a curve such as the one shown in Figure 2, which plots the cumulative frequency of word references against the logarithmic rank of the words. Heising first reported what is now commonly referred to as an "80/20 Rule", when parameter values α_locality = 0.8 and β_locality = 0.2 were observed for many commercial applications [17]. Reference addresses are assumed to be uniformly distributed inside and outside of the locality, and no attempt is made to separate code from data in memory with this model.

This locality-of-reference model was adopted for a specific reason. Unlike most models intended for trace-driven simulation [22, 33, 36], the Bradford-Zipf model does not require complicated state updates when the system clock is advanced over large periods of time. The low computational overhead of this model makes it very attractive for use in a tool that may simulate millions of systems over several years of operation.
Figure 2: Typical Bradford-Zipf Memory Reference Curve (cumulative frequency of word references plotted against the logarithmic rank of the words)
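A minimal sketch of the 80/20-style locality assumption follows; it is one simple way to realize the α_locality/β_locality model described above and is not REACT's actual implementation.

```python
# A minimal sketch (not REACT's code) of the 80/20-style locality assumption:
# a fraction alpha of references fall uniformly inside the first beta fraction
# of the address space, and the rest fall uniformly outside it.
import random

def sample_reference(n_words, alpha=0.8, beta=0.2, rng=random):
    locality_size = max(1, int(beta * n_words))
    if rng.random() < alpha:
        return rng.randrange(0, locality_size)          # inside the locality
    return rng.randrange(locality_size, n_words)        # outside the locality

refs = [sample_reference(1 << 20) for _ in range(100000)]
inside = sum(r < int(0.2 * (1 << 20)) for r in refs) / len(refs)
print(round(inside, 2))   # ~0.8 of references land in 20% of memory
```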
The last phase of the instruction cycle to consider is the simulated execution of an instruction. The purpose of the workload model is to propagate errors between the processor and memory modules during an access, as they would in a real system. The particular computation performed during an instruction cycle is irrelevant, and therefore does not require explicit simulation. An instruction is assumed to be successfully executed if no faults are active in the processor and no errors are read from memory during the instruction and data fetches. No errors will be written into memory when an instruction is successfully executed. If a fault is active in the processor or an error is read from memory, the instruction will not execute successfully, and any write into memory that takes place will be erroneous.
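The per-cycle rules just described can be summarized in a short sketch; the class and function names below are hypothetical stand-ins, not REACT's API.

```python
# Sketch of the per-instruction-cycle rules described above. All class and
# method names here are hypothetical stand-ins, not REACT's actual interface.
import random

class Processor:
    def __init__(self):
        self.fault_active = False

class Memory:
    def __init__(self):
        self.erroneous = set()            # addresses holding erroneous words

    def read_is_erroneous(self, addr):
        return addr in self.erroneous

    def write(self, addr, erroneous):
        if erroneous:
            self.erroneous.add(addr)
        else:
            self.erroneous.discard(addr)

def instruction_cycle(proc, mem, p_data_read, p_data_write, p_miss, n_words=1024):
    error_seen = proc.fault_active
    # Instruction fetch plus a probabilistic data read; main memory is touched
    # only on a cache miss.
    for p_access in (1.0, p_data_read):
        if random.random() < p_access and random.random() < p_miss:
            error_seen |= mem.read_is_erroneous(random.randrange(n_words))
    # A data write that reaches main memory is erroneous if the processor was
    # faulty or an erroneous word was read earlier in the cycle.
    if random.random() < p_data_write and random.random() < p_miss:
        mem.write(random.randrange(n_words), erroneous=error_seen)
    return not error_seen   # True when the instruction executed successfully

print(instruction_cycle(Processor(), Memory(), 0.4, 0.2, 0.05))
```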
3.4 Fault and Error Models

REACT will automatically inject both permanent and transient faults into the processors, memories, and fault-tolerance mechanisms of a system. Faults are assumed to occur independently within each component. Their inter-arrival times are randomly sampled from a Weibull distribution, Weibull(λ, α), with the following cumulative distribution function (CDF):

F_Weibull(t) = 1 - e^(-(λ t)^α)    (3)
The parameters λ and α are respectively known as the scale and shape parameters. Exponentially distributed inter-arrival times, which are assumed by many failure rate models, can be obtained simply by setting α = 1. Fault injections take place at the beginning of the instruction cycles specified by the CDF samples. When two or more faults simultaneously affect the same bit within a component, the fault with the longer duration takes precedence.

Table 2: Typical Values for the Latency Distribution Parameters (λ_latent, α_latent)

Repair times for failed components are randomly sampled from a log-normal distribution, LogNorm(μ, σ), with the following CDF:

F_log-norm(t) = ∫_0^t [1 / (x σ √(2π))] e^(-(ln x - μ)^2 / (2σ^2)) dx    (4)

The parameters μ and σ are the mean and standard deviation of the distribution, respectively. Repairs are initiated after a constant, user-specified logistics delay (T_logistics). Multiple failed components may be repaired either sequentially or concurrently. All faults in a component are removed during repair. The time required to resynchronize a repaired component (T_resynch) and reboot the system after a critical failure (T_reboot) are also constant and user specified.
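The event times used by the simulator can be drawn from these distributions by standard sampling techniques; the sketch below uses inverse-transform sampling for the Weibull CDF of Equation (3) and Python's built-in log-normal variate for repair times, and is illustrative rather than REACT's implementation.

```python
# Sketch of sampling event times consistent with Equations (3) and (4).
# Illustrative only; not REACT's implementation.
import math
import random

def weibull_interarrival(lam, alpha, rng=random):
    # Inverting F(t) = 1 - exp(-(lam*t)**alpha) gives t = (-ln(1-U))^(1/alpha) / lam.
    u = rng.random()
    return (-math.log(1.0 - u)) ** (1.0 / alpha) / lam

def lognormal_repair_time(mu, sigma, rng=random):
    return rng.lognormvariate(mu, sigma)

random.seed(1)
print(weibull_interarrival(lam=1e-4, alpha=1.0))   # exponential case (alpha = 1)
print(lognormal_repair_time(mu=1.0, sigma=0.5))
```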
3.4.1 Processor Fault Model

Processor operation in the presence of faults is governed by the stochastic model shown in Figure 3. Ovals in this diagram represent valid processor states, while arcs indicate what state changes may take place. Dashed boxes are temporary nodes that are passed through when moving into a valid state. They have been included only to clarify the state transitions of the model.

All processors initially start in the Fault Free state. Permanent faults occur at Weibull(λ_Pperm, α_Pperm) distributed inter-arrival times and become Latent. No errors are generated by Latent processor faults. A permanent fault will remain Latent for a period of time specified by a Weibull(λ_latent, α_latent) distribution, and then become Active. Table 2 lists typical values for the latency distribution parameters, measured through fault-injection experiments on several different processors. A processor is considered to be failed and will produce errors during every instruction cycle when it is in the Active state due to a permanent fault. It can only leave this state and return to the Fault Free state if it is repaired and resynchronized.

Transient processor faults occur at Weibull(λ_Ptran, α_Ptran) distributed inter-arrival times. A transient is assumed to generate a single logic upset which must be latched to become Latent.
Figure 3: Processor Fault Model (processor states Fault Free, Latent, Active, Control Upset, and Dormant, with Weibull-distributed holding times and the transition probabilities P_latch, P_upset, and P_term)
Table 3: Typical Values for the Transient Latching Probability

P_latch        Processor      Reference
0.54           HS1602         Choi [12]
0.29 - 0.36    IBM RT PC      Czeck [15]
0.23 - 0.44    Custom RISC    Ohlsson [29]
Table 4: Typical Values for the Control Upset Probability

P_upset        Processor      Reference
0.62           IBM RT PC      Czeck [15]
0.60 - 0.82    MC6809E        Miremadi [28]
0.32 - 0.35    Custom RISC    Ohlsson [29]
If not latched with probability P_latch, the processor will immediately return to the Fault Free state. Table 3 gives typical values for the transient latching probability of several different processors, again measured through fault-injection experiments. A Latent transient fault will upset the control flow of an executing program with probability P_upset, after a Weibull(λ_latent, α_latent) distributed period of time. This will cause the processor to diverge from the correct instruction stream and enter the Control Upset state. Typical values for the control upset probability, also measured through fault-injection experiments on several different processors, appear in Table 4. It is assumed that the processor will remain in the Control Upset state, producing errors during every instruction cycle, until it is resynchronized.

If the Latent transient does not result in a control flow upset, it will become Active after a Weibull(λ_latent, α_latent) distributed length of time. An Active transient fault is assumed to produce errors during one instruction cycle and then move into the Dormant state. With termination probability P_term, the processor will immediately return to the Fault Free state. Otherwise, it will hold in the Dormant state for a period of time that is Weibull(λ_active, α_active) distributed. After the holding time in the Dormant state expires, the transient will again become Active and repeat the same sequence of state changes. This cyclic behavior is intended to represent multiple errors being sourced from a single fault that eventually disappears. Register and cache faults may exhibit such behavior when they are read several times before being overwritten.

Parameters for the distribution of transient fault activity are not readily available in the research literature. However, several papers have published distributions of the time between errors (TBEs) in computer systems, from which λ_active and α_active can be estimated [25, 34, 35]. The TBE distributions were obtained through the analysis of system error logs.
Figure 4: Error Hazard Rate of a VAX 8600 (actual data and Weibull approximation; hazard rate plotted against time t in minutes)
Most large computers maintain an error log, in which time-stamped entries describing abnormal system events are recorded. Entries are usually made in the log whenever an error is detected by any system component, including the processor, memory, I/O devices, and software. Although it is possible to compute TBE distributions for individual components, most published TBE distributions utilize all the data in a log. It is therefore assumed that the TBE distribution for a processor resembles the aggregate TBE distribution for all the system components. Figure 4 graphs the error hazard rate, h(t), computed by Tang and Iyer from the TBE distribution of raw data in a VAX 8600 error log [35]. The hazard rate of a distribution with probability density function f(t) and CDF F(t) is:

h(t) = f(t) / (1 - F(t))    (5)
Error hazard rate can be interpreted as the conditional probability that an error will occur within the next unit of time, given that no error has occurred since the last error was seen. This curve was approximated (by the author) with the Weibull(λ, α) hazard function:

h_Weibull(t) = α λ (λ t)^(α-1)    (6)

having parameter values λ_active = 0.022222 minute⁻¹ and α_active = 0.5. The mean of a Weibull(λ, α) distribution is:

m_Weibull = (1/λ) Γ(1 + 1/α)    (7)

where the gamma function Γ(x) is defined for positive integers x as Γ(x) = (x - 1)!.
Equation 7 yields a mean for the fitted Weibull(λ_active, α_active) distribution of 90 minutes, which compares favorably to the 88.2 minute mean of the empirical data. Similar TBE distributions have also been observed for the VAX-11/780 [34] and Tandem Cyclone and VLX systems [25]. Distribution parameters needed to model other processors can be obtained using the approach described here.

TBE distributions have been shown to exhibit two error modes, around which many samples accrue [25]. The first mode has a peak at the origin and represents a burst of errors from one source. It is modeled by the Weibull(λ_active, α_active) fault activity distribution. The second mode occurs in the tail of the TBE distribution and represents single, isolated errors. It is modeled by the Weibull(λ_Ptran, α_Ptran) fault occurrence distribution. The termination probability designates where the tail of the fault activity distribution should be truncated to eliminate the second mode. The value of the TBE CDF at this point in time is equal to 1 - P_term.

Continuing with the discussion of Figure 3, when a processor reads an error from memory (through the fault-tolerance mechanisms), its state may become corrupted. All input errors will be latched internally, so they are assumed to have an immediate effect on the processor. Based on the probability P_upset, an input error will put the processor in either the Control Upset or Active state. The input error is assumed to behave like a transient fault once the processor has entered either of these states.

Processor errors can affect both the addresses and (write) data generated during an instruction cycle. The probabilities of an error affecting only an address (P_addr error) or only data (P_data error) are specified by the user. Addresses and data are simultaneously affected by an error with probability (1 - P_addr error - P_data error). An erroneous address is assumed to access a random memory location. Erroneous data take on a random value.
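The transient portion of this state machine can be summarized compactly; the sketch below follows the transitions described above (latching, control upset, and the Active/Dormant cycle), with the holding-time samplers left as assumed helper functions rather than REACT's own routines.

```python
# A compact sketch of the transient-fault life cycle described for Figure 3
# (Fault Free -> Latent -> Control Upset, or Active/Dormant cycling). It is an
# illustration of the state logic only; the sampling helpers are assumptions.
import random

def transient_fault_lifecycle(p_latch, p_upset, p_term,
                              sample_latent, sample_dormant, rng=random):
    """Yield (state, holding_time) pairs for one transient fault."""
    if rng.random() >= p_latch:
        yield ("Fault Free", 0.0)                 # upset never latched
        return
    yield ("Latent", sample_latent())
    if rng.random() < p_upset:
        yield ("Control Upset", float("inf"))     # errors until resynchronized
        return
    while True:
        yield ("Active", 1.0)                     # one erroneous instruction cycle
        if rng.random() < p_term:
            yield ("Fault Free", 0.0)
            return
        yield ("Dormant", sample_dormant())

random.seed(2)
trace = list(transient_fault_lifecycle(0.4, 0.6, 0.5,
                                       sample_latent=lambda: 5.0,
                                       sample_dormant=lambda: 90.0))
print(trace)
```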
3.4.2 Memory Fault Model

Memory faults are divided among the bit-array and addressing-logic regions of a module. The fraction of faults which fall into each of these regions may be approximated by their relative chip areas. Typical values for the fraction of memory chip faults that affect the bit-array (f_bit-array) are listed in Table 5. Bit-array faults are assumed to affect a single random bit in a word at a random address. It is assumed that references to memory access words at random addresses during an addressing-logic fault. An access to a random address reads or writes a value with randomly corrupted bits, corresponding to the difference in bit values between the word that was accessed and the word that should have been accessed. Similar fault models for memory have been employed in other experimental dependability studies [10, 26].

The stochastic model seen in Figure 5 illustrates how faults affect the operation of a memory module. All memories initially start in the Fault Free state.
Table 5: Typical Values for the Bit-Array Fraction f_bit-array

f_bit-array    Memory Chip
0.39           16K RAM
0.85           16K DRAM
0.89           64K DRAM
0.89           4K RAM

(References: Ayache [3], Koo [23], and Marston [27].)
Permanent faults occur at Weibull(λ_Mperm, α_Mperm) distributed inter-arrival times, and affect the Addressing Logic with probability 1 - f_bit-array. A memory is considered to be failed if it is in this state, and will always return random words when accessed. It can only leave the permanent Addressing Logic fault state and return to the Fault Free state if it is repaired and resynchronized. The permanent fault could have corrupted a bit in memory instead of the addressing logic, with probability f_bit-array. It would have become Latent or Active based on the value of the bit latency probability P_bit-latent. Through fault-injection experiments on a VAX 11/780, Chillarege and Iyer found that the probabilities that a s-a-0 and a s-a-1 fault are initially latent are 0.7 and 0.3, respectively [11]. Assuming both faults are equally likely, the probability a bit-array fault will be latent is P_bit-latent = 0.5. A Latent permanent fault in the bit-array generates no errors. It will become Active with probability 1 - P_bit-latent when it is next written, at a time determined by the workload model. Once Active, the permanent bit fault cannot be overwritten and is assumed to remain in that state in order to simplify the fault model.

Transient faults occur at Weibull(λ_Mtran, α_Mtran) distributed inter-arrival times in memory. Their effects are similar to, but shorter than, those of permanent faults. Random words will be accessed if a transient affects the Addressing Logic. However, the fault is assumed to last only one instruction cycle, after which the memory returns to the Fault Free state. A transient fault that strikes a memory bit has the probability 1 - P_bit-latent of becoming Active. Its duration is assumed to be so short that it cannot become latent, like a permanent fault can. As a result, a transient fault in the bit-array will immediately return to the Fault Free state with probability P_bit-latent. An Active transient will produce errors whenever it is read, until overwritten at a time specified by the workload model.
3.4.3 Fault-Tolerance Mechanism Fault Model

Permanent and transient faults occur at Weibull(λ_FTMperm, α_FTMperm) and Weibull(λ_FTMtran, α_FTMtran) distributed inter-arrival times, respectively, within the components that make up the block of fault-tolerance mechanisms. Because most fault-tolerance mechanisms are relatively simple, having little associated state in comparison to the processors and memories, all faults are assumed to affect single random bits with no latency. A fault-tolerance mechanism component afflicted with one or more permanent faults is considered to be failed.
Figure 5: Memory Fault Model
3.5 Implementation

3.5.1 Basic Operation
REACT utilizes the system, workload, and fault/error models developed in the last three sections to perform life testing through simulated fault injection. Monte Carlo, event-driven simulation is employed to operate the system, inject faults, and make repairs, based on these models. During a single simulation run, experiments or trials are conducted in which an initially fault-free system is operated until it fails or reaches a user-specified censoring time (T_censor). The number of trials (N_trials) is determined by the desired confidence intervals about the system dependability metric being evaluated. The censoring time dictates the maximum operational lifetime relevant to the given system. Those trials in which the system remains operational beyond the censoring time are terminated, thereby shortening the run-time of the simulation without affecting the measurements of interest.
3.5.2 Dependability Measures

Extensive instrumentation has been incorporated into REACT to collect data from each trial, which is later aggregated over the entire simulation run in order to compute the system dependability metrics. A plot of reliability or availability as a function of time, various statistical measurements such as MTTF, MTBF, mean time to repair (MTTR), and steady-state availability (A_steady-state), plus a comprehensive system failure mode report are provided as outputs. Reliability is calculated from the set of failure times:

FT = {ft_1, ft_2, ..., ft_N_trials}    (8)
where ft_i is the time at which the system failed or was censored during trial i. Define a sample period of length:

T_sample = max(ft_i) / N_points    (9)

where N_points is the number of sample points in the reliability curve. The failure frequency n_fail(x) in sample period x (1 ≤ x ≤ N_points) is computed as the number of failure times in FT for which:

(x - 1) T_sample < ft_i ≤ x T_sample    (1 ≤ i ≤ N_trials)    (10)

The probability mass function (pmf) of failure times can be estimated by:

f_fail(x) = n_fail(x) / N_trials    (1 ≤ x ≤ N_points)    (11)

For a sufficiently large number of trials, this estimate converges to the true pmf, and the corresponding failure time CDF may be obtained by integrating the pmf:

F_fail(x) = Σ_{i=0}^{x} f_fail(i)    (0 ≤ x ≤ N_points)    (12)
This distribution represents the unreliability of the system in sample period x. Reliability is computed as the probabilistic inverse of F_fail(x) when continuous time t is quantized to a sample period:

R(t) = 1 - F_fail(⌊t / T_sample⌋)    (0 ≤ t ≤ T_censor)    (13)

Availability is calculated from the set of down-times:

DT = {dt_1, dt_2, ..., dt_N_points}    (14)

where dt_x is the total system down-time accumulated over all trials in sample period x. Sample periods are now defined to have length:

T_sample = T_censor / N_points    (15)

where N_points specifies the number of points in the availability curve. Let U(x) represent the unavailability of the system in sample period x. It can be estimated by the ratio of average down-time to the length of a sample period:

U(x) = dt_x / (N_trials T_sample)    (1 ≤ x ≤ N_points)    (16)

Like the failure time pmf, this estimate converges to the true U(x) when the number of trials is sufficiently large. Assume an initial unavailability of U(0) = 0. Availability is the probabilistic inverse of U(x) when continuous time t is quantized to a sample period:

A(t) = 1 - U(⌊t / T_sample⌋)    (0 ≤ t ≤ T_censor)    (17)
Estimates for the MTTF, MTBF, and MTTR of a system are computed as:

MTTF = ( Σ_{i=1}^{N_trials} ft_i ) / M_fail    (18)

MTBF = ( N_trials T_censor ) / M_fail    (19)

MTTR = ( Σ_{x=1}^{N_points} dt_x ) / M_fail    (20)

where M_fail is the total number of system failures that occurred before censoring. Using the relation MTBF = MTTF + MTTR, steady-state availability can be obtained in terms of these estimates:

A_steady-state = MTTF / (MTTF + MTTR) = (MTBF - MTTR) / MTBF    (21)
Confidence intervals about point estimates for reliability, MTTF, and MTBF are readily calculated. Assuming the number of trials is large, the normal approximation to the binomial distribution may be applied. This yields the following (1 - α) × 100% confidence interval around a reliability point estimate R:

R ± z_{α/2} √( R (1 - R) / N_trials )    (22)

where z_θ represents the value at which the standard normal distribution is equal to θ. If the failure time distribution is approximately exponential and repair times are short, the (1 - α) × 100% confidence interval around T̂, a point estimate of MTTF or MTBF, is given by:

2 M_fail T̂ / χ²_{2 M_fail, 1-α/2}  ≤  T  ≤  2 M_fail T̂ / χ²_{2 M_fail, α/2}    (23)

where χ²_{N,θ} represents the value at which the chi-square distribution with N degrees of freedom is equal to θ. The failure mode report gives the percentage of system failures that can be attributed to each of the 26 possible causes listed in Table 6. The MTTF or MTBF associated with each failure mode is also provided.
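The estimates of Equations (18) through (21) are straightforward to compute from per-trial results; the sketch below is illustrative only, and its sample inputs are invented.

```python
# Sketch of the dependability estimates of Equations (18)-(21), computed from
# per-trial failure times and accumulated down-times. Illustrative only; it is
# not REACT's implementation, and the sample inputs are made up.

def dependability_estimates(failure_times, downtimes, t_censor):
    n_trials = len(failure_times)
    m_fail = sum(1 for ft in failure_times if ft < t_censor)   # uncensored trials
    mttf = sum(failure_times) / m_fail                         # Equation (18)
    mtbf = n_trials * t_censor / m_fail                        # Equation (19)
    mttr = sum(downtimes) / m_fail                             # Equation (20)
    a_ss = mttf / (mttf + mttr)                                # Equation (21)
    return mttf, mtbf, mttr, a_ss

# Three trials censored at 10,000 hours: two failures plus one censored trial.
print(dependability_estimates([4000.0, 10000.0, 7000.0], [3.0, 0.0, 5.0], 10000.0))
```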
3.6 Verification and Validation

Verification is the process of ensuring a model has been correctly implemented. REACT was verified primarily through program tracing. A debugging flag may be set when compiling the software to insert code that will dynamically print the state of the simulator as it executes. This feature was used to collect numerous traces of the program for a variety of system architectures and input parameter combinations. Each trace was then analyzed by hand to check for proper operation. REACT was also traced from certain pre-determined states, in order to observe the simulation of rare events that do not normally appear in a randomly collected trace. Implementation errors were minimized through the use of abstract data types and standardized routines to generate random variates [6].

Validation is the process of checking if a correctly implemented model approximates reality with sufficient accuracy. REACT was partially validated by comparing its dependability predictions with those of combinatorial models. Many system attributes simulated by REACT cannot be easily modeled analytically. In order to facilitate the comparison, only permanent faults were considered and input parameter values were chosen to eliminate the effects of workload. This was accomplished by using the input parameter values listed in Table 7, which reduce fault latency and accelerate error propagation. Event distributions were assumed to be exponential, fixing all Weibull shape parameters at α = 1. Failure rates and censoring times were varied over a range of values, while the remaining input parameters were set to zero.
Table 6: System Failure Modes

Processor Modules
1. Permanent faults only
2. Transient faults only
3. Permanent and transient faults

Memory Modules
4. Permanent addressing faults only
5. Transient addressing faults only
6. Permanent bit faults only
7. Transient bit faults only
8. Permanent addressing and transient addressing faults
9. Permanent addressing and permanent bit faults
10. Permanent addressing and transient bit faults
11. Transient addressing and permanent bit faults
12. Transient addressing and transient bit faults
13. Permanent bit and transient bit faults

Fault-Tolerance Mechanisms
14. Permanent faults only
15. Transient faults only
16. Permanent and transient faults

Error Propagation
17. Errors propagating to the processors only
18. Permanent processor faults and propagating errors
19. Transient processor faults and propagating errors
20. All types of processor faults and propagating errors
21. Errors propagating to the memories only
22. Permanent memory addressing faults and propagating errors
23. Transient memory addressing faults and propagating errors
24. Permanent memory bit faults and propagating errors
25. Transient memory bit faults and propagating errors
26. All types of memory faults and propagating errors

Table 7: Input parameter values used for validation. The parameters listed are the instruction execution rate (R_instr), memory read access rate (R_Mread), memory write access rate (R_Mwrite), locality-of-reference access fraction (α_locality), locality-of-reference size fraction (β_locality), processor fault latency scale parameter (λ_latent), processor control flow upset probability (P_upset), and number of words per memory module (N_words).
Several different system architectures were used to validate REACT. One of these was a 2-of-3 module system, in which at least two of its three processor and memory modules must be operational. In order to simplify the analysis, the fault-tolerance mechanisms of this architecture were assumed to have perfect coverage and never fail. The reliabilities of the processor and memory modules were defined to be exponential:

R_P(t) = e^(-λ_P t),    R_M(t) = e^(-λ_M t)    (24)

with failure rates λ_P and λ_M, respectively. An expression for the reliability of a 2-of-3 system is obtained by summing the probabilities of no failed modules, one failed processor or memory module, and both a failed processor and memory module:

R_2-of-3(t) = R_P(t)^3 R_M(t)^3 + 3 R_P(t)^2 [1 - R_P(t)] R_M(t)^3 + 3 R_P(t)^3 R_M(t)^2 [1 - R_M(t)] + 9 R_P(t)^2 [1 - R_P(t)] R_M(t)^2 [1 - R_M(t)]    (25)

Coefficients on the product terms in this equation represent the number of possible combinations for module failures. The reliability of the 2-of-3 system was evaluated for three different memory failure rates and plotted in Figures 6 and 7 over mission times of 100 and 10000 hours, respectively. Processor failure rates were held constant at a fixed λ_P. Symmetry in Equation 25 yields the same reliability curves when the memory failure rate is instead fixed at that value and the processor failure rate is varied equivalently. Results from the corresponding REACT simulation are also graphed in both plots. The reliability measured by REACT matched that of the combinatorial model with little error.
Figure 6: Reliability of a 2-of-3 System over 100 Hours
Figure 7: Reliability of a 2-of-3 System over 10000 Hours (analytical curves for three memory failure rates, with REACT simulation results overlaid)
Table 8 gives the exact analytical and empirical reliabilities at 50 hours. The true reliability always fell within the 95% confidence interval around the estimate. For the final check, the MTTF of the 2-of-3 system was derived as

MTTF_2-of-3 = ∫_0^∞ R_2-of-3(t) dt

and evaluated for the various failure rates. Table 9 compares those values with the predictions of REACT. The 95% confidence interval again contained the exact MTTF in all cases. The validity of the portion of REACT's modeling not tested through this comparison would be very difficult to check analytically. However, because the modeling employed by REACT is based on measurements from actual systems, one would expect it to be a fairly accurate representation of real system behavior.

Table 8: Reliability of a 2-of-3 System at 50 Hours
Performance Limitations
The execution time of R E A C T is a function of the number of trials performed during a simulation run and the average number of events simulated i)er trial. The number of events simulated per trial is determined mainly by the censoring time, but depends heavily on other input parameters. R E A C T may sinnilate as few as O.l events per trial, on average, at low censoring times (Tcensor < 100 hours) or up to several thousand events per trial at higher censoring times ("Tconsor > lOOOO hours). As a "rule-of-thumb", R E A C T can sinudate the reliability of one million systems, over a mission time of 100 hours, in about one hour of real time, when running on a 25 MHz RISC workstation. R E A C T has successfully evaluated multiprocessor systems with cnd-of-mission reliabilities as high as'R{t) = (1 - 10"^) ± 1.5 x 10"*' at 90% confidence in less than eight hours of run-time. Availability generally requires more time to analyze than
188 reliability, since repairs must be simulated in addition to failures and every system is operated until the censoring time.
3.8
Conclusion
This chapter has introduced R E A C T , a simulated fault-injection testbed for fault-tolerant multiprocessor architectures. The next two chapters demonstrate the effectiveness of R E A C T through the analysis of reliability, performance and cost tradeoffs in several alternative triple-modular redundant systems.
189
References [1] Agarwal, A. and G u p t a , A., "Memory-reference characteristics of multiprocessor applications under MACH," in Proceedings of the 1988 Sigmetrtcs Conference on Measurement and Modeling of Computer Systems, pp. 215225, ACM, May 1988. [2] Avizienis, A., Gilley, G. C., Mathur, F. P., Rcnnels, D. A., Rolir. J. A., and Rubin, D. K., "The STAR (Self-Testing And Repairing) computer; An investigation of the theory and practice of fault-tolerant computer design," IEEE Transactions on Computers, vol. C-20, no. 11, pp. 1312-11521, Nov. 1971. [3] Ayache, J. M. and Diaz, M., "A reliability model for error correcting memory systems," IEEE Transactions on Reliability, vol. R-28, no. 4, pp. 310314, Oct. 1979. [4] Bernstein, P. A., "Sequoia: A fault-tolerant tightly coupled multiprocessor for transaction processing," IEEE Computer, vol. 21, no. 2, pp. 37-45, Feb. 1988. [5] Borg, A., Kessler, R. E., and Wall, D. W., "Generation and analysis of very long address traces," in Proceedings of the 17th Annual International Symposium on Computer Architecture, pp. 270 279, ACM, May 1990. [6] Bratley, P., Fox, B. L., and Schrage, L. E., A Guide to Simulation. York, NY: Springer-Verlag, 2nd ed., 1987.
New
[7] Bugge, H. O., Kristiansen, E. H., and Bakka, B. O., "Trace-driven simulations for a two-level cache design in open bus systems," in Proceedings of the 17th Annual International Symposium on Computer Architecture, pp. 250-259, ACM, May 1990.
[8] Bunt, R. B. and Murphy, J. M., "The measurement of locality and the behaviour of programs," The Computer Journal, vol. 27, no. 3, pp. 238-245, Aug. 1984.
[9] Bunt, R. B., Murphy, J. M., and Majumdar, S., "A measure of program locality and its application," in Proceedings of the 1984 Sigmetrics Conference on Measurement and Modeling of Computer Systems, pp. 28-40, ACM, Aug. 1984.
[10] Chen, C. L. and Rutledge, R. A., "Fault-tolerant memory simulator," IBM Journal of Research and Development, vol. 28, no. 2, pp. 184-195, Mar. 1984.
[11] Chillarege, R. and Iyer, R. K., "An experimental study of memory fault latency," IEEE Transactions on Computers, vol. 38, no. 6, pp. 869-874, June 1989.
[12] Choi, G. S. and Iyer, R. K., "FOCUS: An experimental environment for fault sensitivity analysis," IEEE Transactions on Computers, vol. 41, no. 12, pp. 1515-1526, Dec. 1992.
[13] Clark, D. W., Bannon, P. J., and Keller, J. B., "Measuring VAX 8800 performance with a histogram hardware monitor," in Proceedings of the 15th Annual International Symposium on Computer Architecture, pp. 176-185, ACM, May 1988.
[14] Courtois, B., "Some results about the efficiency of simple mechanisms for the detection of microcomputer malfunctions," in Proceedings of the 9th International Symposium on Fault-Tolerant Computing, pp. 71-74, IEEE, June 1979.
[15] Czeck, E. W. and Siewiorek, D. P., "Effects of transient gate-level faults on program behavior," in Proceedings of the 20th International Symposium on Fault-Tolerant Computing, pp. 236-243, IEEE, June 1990.
[16] Emer, J. S. and Clark, D. W., "A characterization of processor performance in the VAX-11/780," in Proceedings of the 11th Annual International Symposium on Computer Architecture, pp. 301-310, ACM, June 1984.
[17] Heising, W. P., "Note on random addressing techniques," IBM Systems Journal, vol. 2, pp. 112-116, June 1963.
[18] Hennessy, J. L. and Patterson, D. A., Computer Architecture: A Quantitative Approach. San Mateo, CA: Morgan Kaufmann, 1990.
[19] Hopkins, A. L., Smith, T. B., and Lala, J. H., "FTMP - A highly reliable fault-tolerant multiprocessor for aircraft," Proceedings of the IEEE, vol. 66, no. 10, pp. 1221-1239, Oct. 1978.
[20] Jewett, D., "Integrity S2: A fault-tolerant Unix platform," in Proceedings of the 21st International Symposium on Fault-Tolerant Computing, pp. 512-519, IEEE, June 1991.
[21] Kanawati, G. A., Kanawati, N. A., and Abraham, J. A., "FERRARI: A tool for the validation of system dependability properties," in Proceedings of the 22nd International Symposium on Fault-Tolerant Computing, pp. 336-344, IEEE, July 1992.
[22] Kobayashi, M. and MacDougall, M. H., "The stack growth function: Cache line reference models," IEEE Transactions on Computers, vol. 38, no. 6, pp. 798-805, June 1989.
[23] Koo, D. Y. and Chenowith, H. B., "Choosing a practical MTTF model for ECC memory chip," in Proceedings of the 1984 Annual Reliability and Maintainability Symposium, pp. 255-261, IEEE, Jan. 1984.
[24] Krol, T., "(N,K) concept fault tolerance," IEEE Transactions on Computers, vol. C-35, no. 4, pp. 339-349, Apr. 1986.
[25] Lee, I., Iyer, R. K., and Tang, D., "Error/failure analysis using event logs from fault tolerant systems," in Proceedings of the 21st International Symposium on Fault-Tolerant Computing, pp. 10-17, IEEE, June 1991.
[26] Libson, M. R. and Harvey, H. E., "A general-purpose memory reliability simulator," IBM Journal of Research and Development, vol. 28, no. 2, pp. 196-205, Mar. 1984.
[27] Marston, D., Memory System Reliability with ECC. Intel Corporation, Application Note AP-73, 1980.
[28] Miremadi, G., Karlsson, J., Gunneflo, U., and Torin, J., "Two software techniques for on-line error detection," in Proceedings of the 22nd International Symposium on Fault-Tolerant Computing, pp. 328-335, IEEE, July 1992.
[29] Ohlsson, J., Rimen, M., and Gunneflo, U., "A study of the effects of transient fault injection into a 32-bit RISC with built-in watchdog," in Proceedings of the 22nd International Symposium on Fault-Tolerant Computing, pp. 316-325, IEEE, July 1992.
[30] Shin, K. G. and Lee, Y.-H., "Measurement and application of fault latency," IEEE Transactions on Computers, vol. C-35, no. 4, pp. 370-375, Apr. 1986.
[31] Siewiorek, D. P., Kini, V., Mashburn, H., McConnel, S., and Tsao, M., "A case study of C.mmp, Cm*, and C.vmp: Part I - Experiences with fault tolerance in multiprocessor systems," Proceedings of the IEEE, vol. 66, no. 10, pp. 1178-1199, Oct. 1978.
[32] Smith, A. J., "Cache evaluation and the impact of workload choice," in Proceedings of the 12th Annual International Symposium on Computer Architecture, pp. 64-73, ACM, June 1985.
[33] Spirn, J. R., Program Behavior: Models and Measurements. New York, NY: Elsevier, 1977.
[34] Tang, D. and Iyer, R. K., "Dependability measurement and modeling of a multicomputer system," IEEE Transactions on Computers, vol. 42, no. 1, pp. 62-75, Jan. 1993.
[35] Tang, D., Iyer, R. K., and Subramani, S. S., "Failure analysis and modeling of a VAXcluster system," in Proceedings of the 20th International Symposium on Fault-Tolerant Computing, pp. 244-251, IEEE, June 1990.
[36] Thiebaut, D. F., Wolf, J. L., and Stone, H. S., "Synthetic traces for trace-driven simulation of cache memories," IEEE Transactions on Computers, vol. 41, no. 4, pp. 388-410, Apr. 1992.
[37] Toy, W. N., "Fault-tolerant design of local ESS processors," Proceedings of the IEEE, vol. 66, no. 10, pp. 1126-1145, Oct. 1978.
[38] Webber, S. and Beirne, J., "The Stratus architecture," in Proceedings of the 21st International Symposium on Fault-Tolerant Computing, pp. 79-85, IEEE, June 1991.
[39] Wiecek, C. A., "A case study of VAX-11 instruction set usage for compiler execution," in Proceedings of the 1st International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 177-184, ACM, Mar. 1982.
SECTION 4
SYSTEM EVALUATION
SECTION 4.1
Measurement-Based Dependability Evaluation of Operational Computer Systems

Ravishankar K. Iyer and Dong Tang
Abstract

This paper discusses methodologies and advances in measurement-based dependability evaluation of operational computer systems. Research work in this area over the past 15 years is briefly reviewed. Methodologies are illustrated through a discussion of the authors' representative studies. Specifically, measurement and data processing techniques, basic error characterization, dependency analysis, Markov reward modeling, software dependability, and fault diagnosis are addressed. The discussion covers methods used in the area and several important issues previously studied, including workload/failure dependency, correlated failures, and software fault tolerance.
4.1.1 Introduction

When a computer system is operating, various types of errors can occur both in the hardware and in the software. There are many possible sources of errors, including untested manufacturing faults and software defects, wearing out of devices, transient errors induced by radiation, power surges, or other physical processes, operator errors, and environmental factors. The occurrence of errors is also highly dependent on the workloads running on the system. A distribution of operational outages from various error sources for several major commercial systems is reported in [45].
This research was supported in part by the Office of Naval Research under Grant N00014-91-J-1116 and in part by the National Aeronautics and Space Administration under NASA Grant NAG-1-613, in cooperation with the Illinois Computer Laboratory for Aerospace Systems and Software (ICLASS). Ravishankar K. Iyer is with the Coordinated Science Laboratory, University of Illinois, Urbana, IL 61801. Dong Tang is currently with SoHaR Incorporated, Beverly Hills, CA 90211.
There is no better way to understand the dependability characteristics of a complex computer system than to measure real systems and analyze the measured data. Here, measuring a real system means monitoring and recording naturally occurring errors and failures in the system while it is running under user workloads. The collected data contain a large amount of information about naturally occurring errors/failures. Analysis of these data can provide valuable information on actual error/failure behavior, identify system bottlenecks, quantify dependability measures, and verify assumptions made in analytical models.

Measurement-based dependability analysis of operational systems has evolved significantly over the past 15 years. This chapter reviews methodologies and advances in this area. The discussion focuses on the authors' own work. The results discussed are based on over 100 machine-years of data gathered from IBM, DEC, and Tandem systems. Table 4.1.1 gives a brief overview of representative studies in this area.

Table 4.1.1. Measurement-Based Studies of Computer System Dependability
[The table lists, for each category (Data Coalescing, Basic Error Characteristics, Software Dependability, and Fault Diagnosis), the issues addressed, such as analysis of time-based tuples, clustering based on type and time, heuristic trend analysis, statistical analysis of symptoms, and network fault signatures, together with the corresponding studies, including [24], [29], [31], [32], [33], and [56].]
Early studies in this field investigated transient errors in DEC computer systems and found that more than 95% of all detected errors are intermittent or transient errors [34], [44]. The studies also showed that the inter-arrival time of transient errors follows a Weibull distribution with a decreasing error rate. This distribution was also shown to fit the software failure data collected from an IBM operating system [22]. A recent study of failure data from three different operating systems showed that the time to error (TTE) can be represented by a multi-stage gamma distribution for the measured single-machine operating system and by hyperexponential distributions for the measured distributed operating systems [27].

Studies of the dependency between workload and failures in the early 1980s, based on measurements from IBM [5] and DEC [6] machines, revealed that the average system failure rate is strongly correlated with the average workload on the system. The effect of workload-imposed stress on software was investigated in [7] and [22]. Recent analyses of DEC [48], [58] and Tandem [25] multicomputer systems showed that correlated failures across processors are not negligible and that their impact on availability and reliability is significant [9], [49], [50].

In [18], analytical modeling and measurements were combined to develop measurement-based reliability/performability models using data collected from an IBM mainframe. The results showed that a semi-Markov process is better than a Markov process for modeling system behavior. Markov reward modeling techniques were further applied to distributed systems [52] and fault tolerant systems [26] to quantify the performance loss due to errors/failures for both hardware and software.

A census of Tandem system availability indicated that software faults are the major source of system outages in the measured fault tolerant systems [13]. Analyses of field data from different software systems investigated several dependability issues, including the effectiveness of error recovery [57], hardware-related software errors [21], correlated software errors in distributed systems [51], software fault tolerance [26], [28], and software defect classification [46], [47]. Measurement-based fault diagnosis and failure prediction issues were investigated in [24], [29], [31], [32], [56].

The following sections discuss issues, analysis methods, and representative studies in measurements, data processing, preliminary analysis of data, dependency analysis, modeling and evaluation, software dependability, and fault diagnosis. Although other studies will be reviewed in the discussion, the authors' work will be presented in detail to illustrate the methodologies.
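For reference, the decreasing error rate reported in the Weibull fits cited earlier in this section has a simple analytic form. In the standard two-parameter notation (shape \beta, scale \alpha; the fitted parameter values from the cited studies are not reproduced here), the distribution and hazard (instantaneous error) rate of the time t between errors are

    F(t) = 1 - \exp[-(t/\alpha)^{\beta}], \qquad h(t) = \frac{\beta}{\alpha}\left(\frac{t}{\alpha}\right)^{\beta - 1}, \qquad t \ge 0,

so a decreasing error rate corresponds to \beta < 1: an error is most likely soon after the previous one, and the rate falls as the error-free interval grows.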
4.1.2 Basic Methodology

Figure 4.1.1 outlines the procedure of a measurement-based analysis. Given field error data collected from a real system, a measurement-based study consists of four steps: 1) data processing, 2) model identification and measure acquisition, 3) model solution, if necessary, and 4) analysis of models and measures.

Step 1 consists of extracting the necessary information from the field data (the result can be a form of compressed data or flat data), classifying errors and failures, and coalescing repeated error reports. In a computer system, a single problem commonly results in many repeated error observations occurring in rapid succession. To ensure that the analysis is not biased by repeated observations of the same problem, all error entries which have the same error type and occur within a short time interval (e.g., 5 minutes) of each other should be coalesced in data processing. Thus, the output of this step is a form of coalesced data in which errors and failures are identified. This step is highly dependent on the measured system. Coalescing algorithms have been proposed in [14], [23], [56]. A minimal code sketch of this coalescing rule follows the step descriptions below.

Step 2 includes identifying appropriate models (such as Markov models) and acquiring measures (such as time-between-failures, TBF, distributions) from the coalesced data. Several models have been proposed and validated using real data. These include the workload-dependent cyclostationary model in [6], the workload hazard model in [19], and the failure correlation models in [50]. Statistical analysis packages such as SAS [SAS85] or measurement-based dependability analysis tools such as MEASURE+ [53] are useful at this stage.

Step 3 solves these models to obtain system-level dependability measures (such as availability, reliability, and expected reward rates). Dependability and performance modeling and evaluation tools such as SHARPE [43] can be used in this step.

Step 4, the most creative part of the study, involves a convincing interpretation of the models and measures obtained from the data; for example, the identification of reliability bottlenecks and the impact of design enhancements on availability.
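Returning to Step 1, the following is a minimal sketch of the coalescing rule described there. The five-minute window follows the example above; the record representation and the timestamps are assumptions made purely for illustration, not the data format used in the cited studies.

from datetime import datetime, timedelta

WINDOW = timedelta(minutes=5)   # coalescing interval from the example above

def coalesce(entries):
    """entries: (timestamp, error_type) pairs sorted by time.
    Successive entries of the same error type within WINDOW of the last
    entry in the current group are treated as one event (tuple)."""
    groups = []
    for ts, etype in entries:
        if groups:
            last_ts, last_type = groups[-1][-1]
            if etype == last_type and ts - last_ts <= WINDOW:
                groups[-1].append((ts, etype))   # same problem: extend group
                continue
        groups.append([(ts, etype)])             # new problem: start a group
    return groups

# Three raw reports collapse into two coalesced events.
log = [
    (datetime(1993, 1, 1, 10, 0), "Disk"),
    (datetime(1993, 1, 1, 10, 2), "Disk"),   # repeat of the same problem
    (datetime(1993, 1, 1, 11, 0), "Memory"),
]
print(len(coalesce(log)))   # -> 2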
Models & Measures Field Data
Data
>
Coalesced Data
Model Idenlifylng
Processing
Measure Acquiring
Step 1
Step 2
Model Mode s
Solution
Me asures
Step 3
Figure 4.1.1. Measurement-Based Analysis
Analysis of Models Measures Step 4
Results
The analysis methods can vary significantly from one study to another, depending on the project goals.
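As a small numerical illustration of Step 3 only, the sketch below solves the simplest possible dependability model, a two-state (up/down) Markov availability model. The failure and repair rates are hypothetical placeholders, not values measured from the systems discussed in this chapter.

import numpy as np

# Hypothetical parameters (per hour); placeholders, not measured values.
lam = 1.0 / 2000.0   # failure rate: one failure every 2000 hours on average
mu = 1.0 / 4.0       # repair rate: four-hour mean repair time

# Generator matrix of the two-state Markov chain (state 0 = up, state 1 = down).
Q = np.array([[-lam, lam],
              [mu,  -mu]])

# Steady-state probabilities: solve pi Q = 0 subject to pi summing to 1.
A = np.vstack([Q.T, np.ones(2)])
b = np.array([0.0, 0.0, 1.0])
pi, *_ = np.linalg.lstsq(A, b, rcond=None)

availability = pi[0]          # long-run fraction of time the system is up
print(f"steady-state availability = {availability:.6f}")
print(f"analytic check mu/(lam+mu) = {mu / (lam + mu):.6f}, MTBF = {1.0 / lam:.0f} h")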
4.1.3 Measurements

There are numerous theoretical and practical difficulties associated with making measurements. The question of what and how to measure is a difficult one. A combination of installed and custom instrumentation is typically used in most studies. From a statistical point of view, sound evaluations require a considerable amount of data. In modern computer systems, especially in fault tolerant systems, failures are infrequent; in order to obtain meaningful data, measurements must be made over a long period of time. Also, the measured system must be exposed to a wide range of usage conditions for the results to be representative.

There are normally two ways to make measurements: on-line automatic logging and manual logging by human operators. Many large computer systems, such as IBM and DEC mainframes, provide error-logging software in the operating system. This software records information on errors occurring in the various subsystems, such as the memory, disk, and network subsystems, as well as on other system events, such as reboots and shutdowns. The reports usually include information on the location, time, and type of the error, the system state at the time of the error, and sometimes error recovery (e.g., retry) information. The reports are stored chronologically in a permanent system file.

The main advantage of on-line automatic logging is its ability to record a large amount of information about transient errors and to provide details of automatic error recovery processes, which cannot be done manually. Disadvantages are that an on-line log does not usually include information about the cause and propagation of the error or about off-line diagnosis. Also, under some crash scenarios, the system may fail too quickly for any error messages to be recorded. Table 4.1.2 shows a sample of extracted error logs from a VAXcluster multicomputer system.

Often, the meaning of a record in a log can differ between versions of the operating system and between machine models. One reason is that error detection and recording routines are written and modified over time by different groups of people. For example, a careful study of VAX error logs and discussions with field engineers indicate that the operating systems on different VAX machine models might report the same type of error under different categories [51]. Thus, it is important to distinguish these errors in the subsequent error classification. Since the information provided by on-line error logs may not be complete, it is valuable to have operator logs compensate for the missing information.
Table 4.1.2. A Sample of Extracted Error Logs from a VAXcluster†
[The table lists, for each sampled entry (5815, 7005, 12979, 13005, 13734, 3260, 10939, 14209, 13941, 20937, 27958, and 37790), the system ID (the machines Earth, Europa, Mercury, Jupiter, Mars, and Saturn) and an interpretation, such as disk drive error, tape drive error, path #0 went from good to bad, error logging datagram received, virtual circuit timeout, corrected memory error, unexpected read data fault, machine check, bad memory deallocation request size or address, insufficient nonpaged pool to remaster locks, and unexpected system service exception.]
† The sample is intended to illustrate the different types of errors logged; entry numbers are therefore not consecutive.
An operator log should include information on system crashes, failure diagnosis, component replacement, and hardware and software updates.
4.1.4 Data Processing

Usually, on-line logs contain a large amount of redundant and irrelevant information in various formats. Thus, data processing must be performed to classify this information and to put it into a flat format that facilitates subsequent analyses.

The first step of data processing is error classification. This process classifies the errors in the measured system into types based on the subsystems and components in which they occur. There is no uniform or "best" error classification, because different systems have different hardware and software architectures. But some error types, such as CPU, memory, and disk errors, are seen in most systems. Table 4.1.3 lists an error classification (major error types) for VAXcluster systems [51], [52].

Table 4.1.3. Major Error Types in VAXclusters
System     Type      Description
Hardware   CPU       CPU or bus controller errors
           Memory    Memory ECC errors
           Disk      Disk, drive, and controller errors
           Network   Local network and controller errors
Software   Control   Problems involving program flow control or synchronization
           Memory    Problems referring to memory management or usage
           I/O       Inconsistent conditions detected by I/O management routines
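A minimal sketch of how such a classification step might be coded is given below. The keyword patterns are illustrative assumptions keyed to the hardware rows of Table 4.1.3, not the actual rules used in the cited VAXcluster studies.

# Map keywords that may appear in a raw log entry to the major hardware
# error types of Table 4.1.3 (illustrative patterns only).
HARDWARE_TYPES = {
    "CPU":     ("cpu", "bus controller", "machine check"),
    "Memory":  ("memory ecc", "corrected memory"),
    "Disk":    ("disk", "drive", "tape"),
    "Network": ("network", "path #", "virtual circuit", "datagram"),
}

def classify(description: str) -> str:
    """Return a major hardware error type for a raw log description,
    or 'Other' if no pattern matches."""
    text = description.lower()
    for err_type, keywords in HARDWARE_TYPES.items():
        if any(k in text for k in keywords):
            return err_type
    return "Other"

print(classify("Corrected memory error"))         # -> Memory
print(classify("Path #0 went from good to bad"))  # -> Network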
After error classification, the remaining data processing can be broadly divided into two steps: data extraction and data coalescing. Data extraction selects useful entries, such as error and reboot reports, from the log file (discarding uninteresting entries such as disk volume change reports) and transforms the data set into a flat format. The design of the flat format depends on the needs of the subsequent analyses. The following is a possible format:
entry number | logging time | error type | device id | error description fields
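One way to realize such a flat record in code is sketched below. The field names follow the format above; the tab-separated layout and the ISO timestamp are assumptions made for illustration, since the text does not fix a concrete file layout.

from dataclasses import dataclass
from datetime import datetime

@dataclass
class FlatRecord:
    """One entry of the flat-format data set described above."""
    entry: int
    logged_at: datetime
    error_type: str
    device_id: str
    description: str

def parse_line(line: str) -> FlatRecord:
    # Assume tab-separated fields in the order given above,
    # with the logging time in ISO format.
    entry, ts, etype, dev, desc = line.rstrip("\n").split("\t", 4)
    return FlatRecord(int(entry), datetime.fromisoformat(ts), etype, dev, desc)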
In on-line error logs, a single fault in the system can result in many repeated error reports in a short period of time. To ensure that the subsequent analyses will not be distorted by these repeated reports, entries that correspond to the same problem should be coalesced into a single event, or tuple [56]. A typical coalescing algorithm [23] merges into a tuple all error entries which have the same error type and occur within a ΔT interval of each other. The algorithm is as follows: IF <error type> = AND