THE KLUWER INTERNATIONAL SERIES IN ENGINEERING AND COMPUTER SCIENCE
OFFICE OF NAVAL RESEARCH Advanced Book Series
Consulting Editor: André M. van Tilborg
Other titles in the series:

FOUNDATIONS OF DEPENDABLE COMPUTING: Paradigms for Dependable Applications, edited by Gary M. Koob and Clifford G. Lau, ISBN: 0-7923-9485-2
FOUNDATIONS OF DEPENDABLE COMPUTING: System Implementation, edited by Gary M. Koob and Clifford G. Lau, ISBN: 0-7923-9486-0
PARALLEL ALGORITHM DERIVATION AND PROGRAM TRANSFORMATION, edited by Robert Paige, John Reif and Ralph Wachter, ISBN: 0-7923-9362-7
FOUNDATIONS OF KNOWLEDGE ACQUISITION: Cognitive Models of Complex Learning, edited by Susan Chipman and Alan L. Meyrowitz, ISBN: 0-7923-9277-9
FOUNDATIONS OF KNOWLEDGE ACQUISITION: Machine Learning, edited by Alan L. Meyrowitz and Susan Chipman, ISBN: 0-7923-9278-7
FOUNDATIONS OF REAL-TIME COMPUTING: Formal Specifications and Methods, edited by André M. van Tilborg and Gary M. Koob, ISBN: 0-7923-9167-5
FOUNDATIONS OF REAL-TIME COMPUTING: Scheduling and Resource Management, edited by André M. van Tilborg and Gary M. Koob, ISBN: 0-7923-9166-7
FOUNDATIONS OF DEPENDABLE COMPUTING
Models and Frameworks for Dependable Systems
edited by
Gary M. Koob
Clifford G. Lau
Office of Naval Research
KLUWER ACADEMIC PUBLISHERS Boston / Dordrecht / London
Distributors for North America: Kluwer Academic Publishers 101 Philip Drive Assinippi Park Norwell, Massachusetts 02061 USA Distributors for all other countries: Kluwer Academic Publishers Group Distribution Centre Post Office Box 322 3300 AH Dordrecht, THE NETHERLANDS
Library of Congress Cataloging-in-Publication Data

Foundations of dependable computing. Models and frameworks for dependable systems / [edited by Gary M. Koob, Clifford G. Lau].
p. cm. -- (The Kluwer international series in engineering and computer science ; SECS 0283)
Includes bibliographical references and index.
ISBN 0-7923-9484-4
1. Electronic digital computers--Reliability. 2. Real-time data processing. 3. Fault-tolerant computing. I. Koob, Gary M., 1958- . II. Lau, Clifford. III. Series: Kluwer international series in engineering and computer science ; SECS 0283.
QA76.5.F6238 1994
004.2'1--dc20   94-29126 CIP
1. FRAMEWORKS FOR DEPENDABLE SYSTEMS ................................................. 1

1.1 A Consensus-Based Framework and Model for the Design of Responsive Computing Systems ............................................................................... 3
M. Malek

1.2 A Methodology for Adapting to Patterns of Faults .......................................... 23
G. Agha and D. C. Sturman

2. REQUIREMENTS MODELS .............................................................................. 61

2.1 Military Fault-Tolerant Requirements ............................................................ 63
T. P. Monaghan

2.3 Derivation and Use of Deadline Information in Real-Time Control Systems ........ 77
K. G. Shin and H. Kim

3. SYSTEM VALIDATION .................................................................................. 111

3.1 Software-Implemented Fault Injection of Transient Hardware Errors ............... 113
C. R. Yount and D. P. Siewiorek

3.2 REACT: An Integrated Tool for the Design of Dependable Computer Systems ... 169
J. A. Clark and D. K. Pradhan

4. SYSTEM EVALUATION ................................................................................. 193

4.1 Measurement-Based Dependability Evaluation of Operational Computer Systems ... 195
R. K. Iyer and D. Tang

4.2 Modeling and Evaluation of Opto-Electronic Computing Systems .................... 235
T. Y. Lin
Dependability has long been a central concern in the design of space-based and military systems, where survivability for the prescribed mission duration is an essential requirement, and is becoming an increasingly important attribute of govemment and commercial systems where reduced availability may have severe financial consequences or even lead to loss of life. Historically, research in the field of dependable computing has focused on the theory and techniques for preventing hardware and environmentally induced faults through increasing the intrinsic reliability of components and systems (fault avoidance), or surviving such faults through massive redundancy at the hardware level (fault tolerance). Recent advances in hardware, software, and measurement technology coupled with new insights into the nature, scope, and fundamental principles of dependable computing, however, contributed to the creation of a challenging new research agenda in the late eighties aimed at dramatically increasing the power, effectiveness, and efficiency of approaches to ensuring dependability in critical systems At the core of this new agenda was a paradigm shift spurred by the recognition that dependability is fundamentally an attribute of applications and services--not platforms. Research should therefore focus on (1) developing a scientific understanding of the manifestations of faults at the application level in terms of their ultimate impact on the correctness and survivability of the application; (2) innovative, application-sensitive approaches to detecting and mitigating this impact; and (3) hierarchical system support for these new approaches. Such a paradigm shift necessarily entailed a concomitant shift in emphasis away from inefficient, inflexible, hardware-based approaches toward higher level, more efficient and flexible software-based solutions. Consequently, the role of hardware-based mechanisms was redefined to that of providing and implementing the abstractions required to support the higher level software-based mechanisms in an integrated, hierarchical approach to ultradependable system design. This shift was furthermore compatible with an expanded view of "dependability," which had evolved to mean "the ability of the system to deliver the specified (or expected) service." Such a definition encompasses not only survival of traditional single hardware faults and enviromnental disturbances but more complex and less-well understood phenomena, as well: Byzantine faults, correlated errors, timing faults, software design and process interaction errors, and--most significantly--the unique issues encountered in real-
time systems in which faults and transient overload conditions must be detected and handled under hard deadline and resource constraints. As sources of service disruption multiplied and focus shifted to their ultimate effects, traditional frameworks for reasoning about dependability had to be rethought. The classical fault/error/failure model, in which underlying anomalies (faults) give rise to incorrect values (errors), which may ultimately cause incorrect behavior at the output (failures), required extension to capture timing and performance issues. Graceful degradation, a long-standing principle codifying performance/dependability trade-offs, must be more carefully applied in real-time systems, where individual task requirements supersede general throughput optimization in any assessment. Indeed, embedded real-time systems--often characterized by interaction with physical sensors and actuators--may possess an inherent ability to tolerate brief periods of incorrect interaction, either in the values exchanged or the timing of those exchanges. Thus, a technical failure of the embedded computer does not necessarily imply a system failure. The challenge of capturing and modeling dependability for such potentially complex requirements is matched by the challenge of successfully exploiting them to devise more intelligent and efficient--as well as more complete--dependability mechanisms.

The evolution to a hierarchical, software-dominated approach would not have been possible without several enabling advances in hardware and software technology over the past decade:

(1) Advances in VLSI technology and RISC architectures have produced components with more chip real estate available for incorporation of efficient concurrent error detection mechanisms and more on-chip resources permitting software management of fine-grain redundancy;

(2) The emergence of practical parallel and distributed computing platforms possessing inherent coarse-grain redundancy of processing and communications resources--also amenable to efficient software-based management by either the system or the application;

(3) Advances in algorithms and languages for parallel and distributed computing leading to new insights in and paradigms for problem decomposition, module encapsulation, and module interaction, potentially exploitable in refining redundancy requirements and isolating faults;

(4) Advances in distributed operating systems allowing more efficient interprocess communication and more intelligent resource management;
(5) Advances in compiler technology that permit efficient, automatic instrumentation or restructuring of application code, program decomposition, and coarse and fine-grain resource management; and

(6) The emergence of fault-injection technology for conducting controlled experiments to determine the system- and application-level manifestations of faults and evaluating the effectiveness or performance of fault-tolerance methods.

In response to this challenging new vision for dependable computing research, the advent of the technological opportunities for realizing it, and its potential for addressing critical dependability needs of Naval, Defense, and commercial systems, the Office of Naval Research launched a five-year basic research initiative in 1990 in Ultradependable Multicomputers and Electronic Systems to accelerate and integrate progress in this important discipline. The objective of the initiative is to establish the fundamental principles as well as practical approaches for efficiently incorporating dependability into critical applications running on modern platforms. More specifically, the initiative sought increased effectiveness and efficiency through

(1) Intelligent exploitation of the inherent redundancy available in modern parallel and distributed computers and VLSI components;

(2) More precise characterization of the sources and manifestations of errors;

(3) Exploitation of application semantics at all levels--code, task, algorithm, and domain--to allow optimization of fault-tolerance mechanisms to both application requirements and resource limitations;

(4) Hierarchical, integrated software/hardware approaches; and

(5) Development of scientific methods for evaluating and comparing candidate approaches.

Implementation of this broad mandate as a coherent research program necessitated focusing on a small cross-section of promising application-sensitive paradigms (including language, algorithm, and coordination-based approaches), their required hardware, compiler, and system support, and a few selected modeling and evaluation projects. In scope, the initiative emphasizes dependability primarily with respect to an expanded class of hardware and environment (both physical and operational) faults. Many of the efforts furthermore explicitly address issues of dependability unique to the domain of embedded real-time systems.

The success of the initiative and the significance of the research is demonstrated by the ongoing associations that many of our principal investigators have forged with a variety of military, Government, and commercial projects whose critical needs are leading to the rapid assimilation of concepts, approaches, and expertise arising from this initiative. Activities influenced to date include the FAA's Advanced Automation System for air traffic control, the Navy's AX project and Next Generation Computing Resources standards program, the Air Force's Center for Dependable Systems, the OSF/1 project, the space station Freedom, the Strategic
Defense Initiative, and research projects at GE, DEC, Tandem, the Naval Surface Warfare Center, and the MITRE Corporation.

This book series is a compendium of papers summarizing the major results and accomplishments attained under the auspices of the ONR initiative in its first three years. Rather than providing a comprehensive text on dependable computing, the series is intended to capture the breadth, depth, and impact of recent advances in the field, as reflected through the specific research efforts represented, in the context of the vision articulated here. Each chapter does, however, incorporate appropriate background material and references. In view of the increasing importance and pervasiveness of real-time concerns in critical systems that impact our daily lives--ranging from multimedia communications to manufacturing to medical instrumentation--the real-time material is woven throughout the series rather than isolated in a single section or volume.

The series is partitioned into three volumes, corresponding to the three principal avenues of research identified at the beginning of this preface. While many of the chapters actually address issues at multiple levels, reflecting the comprehensive nature of the associated research project, they have been organized into these volumes on the basis of the primary conceptual contribution of the work. Agha and Sturman, for example, describe a framework (reflective architectures), a paradigm (replicated actors), and a prototype implementation (the Screed language and Broadway runtime system). But because the salient attribute of this work is the use of reflection to dynamically adapt an application to its environment, it is included in the Frameworks volume.

Volume I, Models and Frameworks for Dependable Systems, presents two comprehensive frameworks for reasoning about system dependability, thereby establishing a context for understanding the roles played by specific approaches presented throughout the series. This volume then explores the range of models and analysis methods necessary to design, validate, and analyze dependable systems.

Volume II, Paradigms for Dependable Applications, presents a variety of specific approaches to achieving dependability at the application level. Driven by the higher level fault models of Volume I and built on the lower level abstractions implemented in Volume III, these approaches demonstrate how dependability may be tuned to the requirements of an application, the fault environment, and the characteristics of the target platform. Three classes of paradigms are considered: protocol-based paradigms for distributed applications, algorithm-based paradigms for parallel applications, and approaches to exploiting application semantics in embedded real-time control systems.

Volume III, System Implementation, explores the system infrastructure needed to support the various paradigms of Volume II. Approaches to implementing
support mechanisms and to incorporating additional appropriate levels of fault detection and fault tolerance at the processor, network, and operating system level are presented. A primary concern at these levels is balancing cost and performance against coverage and overall dependability. As these chapters demonstrate, low overhead, practical solutions are attainable and not necessarily incompatible with performance considerations. The section on innovative compiler support, in particular, demonstrates how the benefits of application specificity may be obtained while reducing hardware cost and run-time overhead.

This first volume in the series covers system architectures or frameworks that serve as the foundation for dependable system design and the various models required at each layer of the system hierarchy and stage of its lifecycle to guide design decisions and evaluate their effectiveness. Section 1 presents two frameworks for the study and design of dependable systems. Malek emphasizes the layered view of dependability advocated throughout this series and presents the concept of universal consensus for realizing dependability at each level in the context of distributed real-time systems. Agha and Sturman introduce the concepts of reflection and encapsulation as vehicles for tuning dependability to a dynamically changing fault environment and application requirements while maintaining the transparency of dependability mechanisms to the application itself. The concepts are made concrete through the example of an actor-based language and run-time system, highlighting the importance of language hooks in granting users enhanced control over detection and recovery mechanisms.

Given these frameworks, Section 2 addresses the issue of mathematically characterizing dependability requirements in a manner exploitable by them. Monaghan introduces the section by outlining the real requirements demanded by typical military systems and the difficulty of precisely translating those requirements for systems designers and verifying the results. In real-time systems, dependability encompasses timeliness as well as correctness. Shin presents an approach to deriving precise deadline constraints from the application semantics to provide dependability mechanisms with maximum flexibility based on true requirements rather than specifications of undocumented origin. Shin also extends the layered view of dependability to the larger system level: an erroneous output from an embedded computer, while technically a failure of that computer, may still be recoverable at the system level if the controlled process is robust enough.

Once a system is designed using appropriate high-level abstractions and fault models, the problem remains to validate the design against the types of faults anticipated in actual operation. An emerging approach to this critical problem is fault injection, in which the response of the system to low-level injected errors is gauged. Efficiency of this process demands an intermediate model to guide injection that preserves coverage while simplifying and accelerating the testing. One such approach
and the issues involved in applying it are examined by Yount and Siewiorek in Section 3. Clark and Pradhan complete the picture by describing the REACT testbed for modeling and validating dependable system designs.

Whereas the models presented thus far capture the behavior of the system in response to particular fault scenarios, a global, quantitative analysis of system dependability in terms of the probabilistic measures of reliability, availability, and performability is necessary in order to judge whether the overall requirements have been met and to guide allocation of resources to the most critical system components. Iyer and Tang take an empirical approach, using data from operational systems to drive their models, identify trends, and capture the shifting focus of dependability concerns as hardware and software technology evolve. Lin takes an analytical approach to developing a quantitative method for evaluating design alternatives in the context of opto-electronic interconnection networks.

Gary M. Koob
Mathematical, Computer and Information Sciences Division
Office of Naval Research

Clifford G. Lau
Electronics Division
Office of Naval Research
ACKNOWLEDGEMENTS
The editors regret that, due to circumstances beyond their control, two planned contributions to this series could not be included in the final publications: "Compiler Generated Self-Monitoring Programs for Concurrent Detection of Run-Time Errors," by J.P. Shen and "The Hybrid Fault Effects Model for Dependable Systems," by C.J. Walter, M.M. Hugue, and N. Suri. Both represent significant, innovative contributions to the theory and practice of dependable computing and their omission diminishes the overall quality and completeness of these volumes.

The editors would also like to gratefully acknowledge the invaluable contributions of the following individuals to the success of the Office of Naval Research initiative in Ultradependable Multicomputers and Electronic Systems and this book series: Joe Chiara, George Gilley, Walt Heimerdinger, Robert Holland, Michelle Hugue, Miroslaw Malek, Tim Monaghan, Richard Scalzo, Jim Smith, André van Tilborg, and Chuck Weinstock.
ERRATA
1. Due to a late editing decision, Section 2.2 was removed from this volume. The somewhat anomalous numbering scheme employed in Section 2 reflects the original organization.

2. The following notes were inadvertently omitted from Section 3.2:

• This section is based on research sponsored, in part, by the Office of Naval Research under grants N00014-91-J-1404 and N00014-92-J-1366 and conducted at the University of Massachusetts at Amherst and Texas A&M University.

• Jeffrey A. Clark is with the MITRE Corporation, Bedford, Massachusetts.

• Dhiraj K. Pradhan is with Texas A&M University, College Station, Texas.
SECTION 1
FRAMEWORKS FOR DEPENDABLE SYSTEMS
SECTION 1.1
A Consensus-Based Framework and Model for the Design of Responsive Computing Systems

Miroslaw Malek
Abstract

The emerging discipline of responsive systems demands fault-tolerant and real-time performance in uniprocessor, parallel, and distributed computing environments. A new proposal for a measure of responsiveness is presented, followed by an introduction of a consensus-based framework and model for responsive computing. The model, called CONCORDS (CONsensus/COmputation model for Responsive Distributed Systems), is based on the integration of various forms of consensus and computation (management, then progress or recovery). The consensus tasks include clock synchronization, phase initialization and termination, diagnosis, checkpointing, resource allocation, and scheduling.

Index Terms: consensus, distributed computing, fault tolerance, operating system, parallel computing, real time, responsive systems.

This research was supported in part by the Office of Naval Research Contract No. N00014-88-K-0543, Grant No. N00014-91-J-1858 and the Texas Advanced Technology Grant 386.

Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, Texas 78712. Phone: 1-512-471-5704, Fax: 1-512-471-0954, Email: malek@ece.utexas.edu
1.1.1
Introduction
The integration of the traditionally separate disciplines of fault-tolerant computing, real-time systems, and parallel/distributed computing may provide solutions to one of the greatest challenges of this decade; namely, the design and implementation of responsive computer systems. A responsive computer system [7] is a fault-tolerant, real-time system whose objective is timely and correct execution of programs in uniprocessor, parallel, and distributed computing environments, even in the presence of faults. In a nutshell, responsive systems are expected to maximize the probability of correct, on-time program execution despite faults.

Society, business owners, and government officials worldwide have been sold on the promise that open, distributed computer systems would deliver improved reliability and timeliness. The reality seems to be to the contrary, and effective solutions for making these systems responsive are badly needed. Responsive computer systems are crucial in many applications. Communication, control, avionic systems, robotics, multimedia, decision support systems, banking, air traffic control, and even point-of-sale terminals are just a few examples of systems that require a high degree of responsiveness.

The objective of this chapter is to introduce a concrete model for optimizing both timeliness and classical measures of dependability, such as reliability and availability, in the multicomputing environment. Our philosophy is based on the observation that high responsiveness can only be achieved by developing an ultrareliable kernel, whose correctness is provable, and combining it with application-specific responsive system design techniques. This is due to the diversity of applications and environments that make highly responsive systems extremely difficult to design.

Although several fault-tolerant systems have been designed in the past, only two of them focused directly on both fault tolerance and real time: the Multicomputer Architecture for Fault Tolerance (MAFT) [5] and the Maintainable Real-Time System (MARS) [6]. While MAFT manages redundancy via Byzantine agreement protocols, MARS uses duplicated processors and relies on a fail-silent fault model. MAFT
separates scheduling, synchronization, and voting (SSV) from computing in space by using separate processors for SSV and computing. Both systems rely heavily on static scheduling, which limits their application domain.

Our approach, called CORE (Consensus for REsponsiveness), intends to support fault tolerance and real time in dynamic environments where fault and task arrival times, durations, and deadlines can be modeled by specific distributions. CORE is a software layer, ideally a kernel. Our approach focuses on the identification of the fundamental functions that must be incorporated in a responsive system kernel and the development of application-specific techniques for maximizing responsiveness. This approach is based on two further observations: the omnipresence of consensus (defined here as an agreement among computers) and the inevitability of probability. We view synchronization, initialization, termination, communication, data access and storage protocols, diagnosis, selected error detection and error correction codes, scheduling, replicated data management, checkpointing, and reconfiguration as consensus problems. Therefore, the consensus concept is fundamental in any multicomputer environment. The inherent complexity of the system, combined with the random occurrence of faults, implies that probability cannot be avoided, and, even in so-called "hard" real-time systems, only probabilistic guarantees can be given.

The chapter is divided into the following parts. In Section 2, we discuss a consensus-based framework for responsive computer systems. Next, a new proposal for a measure of responsiveness is presented. Section 4 introduces the model and some alternatives for responsive computing and outlines design options for implementing them in a distributed computing environment. Section 5 gives the specification of the universal consensus, while Section 6 presents the conclusions and summarizes our model.
1.1.2
The Consensus-Based Framework
The consensus-based framework [10] for responsive systems aims to define the core concepts and functions that lead to comprehensive design methods for responsive computer systems. We propose a Consensus for REsponsiveness (CORE) software layer that may be incorporated on top of, or as a kernel (or microkernel) in, systems such as Unix, Mach, and Chorus in order to improve their responsiveness.

Any successful design requires quantitative and/or qualitative goals that can be verified through measurement. The most successful designs are based on particular models that are accurate abstractions of reality. Of course, the ultimate model is a copy of the given system itself; however, with the high complexity of today's systems, such a model is frequently unattainable. Therefore, models for these systems tend to focus on a specific aspect of system behavior or a specific layer of system design.

We propose a definition of responsiveness which combines availability and timeliness (a specific proposal for this measure is outlined in the next section) and introduce a design framework in which characteristics such as synchronicity, message order or lack of it, and bounded or unbounded communication delay are well defined for a specific environment. This framework [10] is based on the consensus problem [2] and is, in our opinion, fundamental in the design of responsive multicomputer systems.

In multicomputer systems, the consensus problem is omnipresent. It is necessary for handling synchronization and reliable communication, and it appears in resource allocation, task scheduling, fault diagnosis, checkpointing, and reconfiguration. Consensus tasks take many forms in multicomputer systems. Figure 1.1.1 shows the necessary functions for fault and time management in a multicomputer environment, in which each function represents a separate consensus problem, with the exception of the application-specific methods at the top layer (although some of these techniques, such as coding, might also be viewed as consensus problems).

At the base of the model is the synchronization level. For a system to be responsive (i.e., fault tolerant and real time), there must be an agreement about time for task execution, fault detection, and checkpointing. The next layer represents the requirement for reliable communication.
Figure 1.1.1: A framework for responsive systems design. (The figure shows the layers, from top to bottom: Application-Specific Responsiveness; Responsiveness Coordinator (Scheduler and Resource Allocator); Fault Recovery (Reconfiguration/Repair); Checkpointing; Fault Diagnosis or Fault Masking; Reliable Broadcast; Synchronization. The layers below the application-specific methods constitute the kernel or responsiveness layer.)
Responsive computers must agree on how and when information is exchanged and how many messages can be considered delivered or lost. Also, timeliness of communication is crucial in achieving a high level of responsiveness. The third layer, checkpointing, ensures that a consistent system state can be saved after a set of operations. The fourth layer, diagnosis, is fundamental to both real time and fault tolerance, for agreements must be reached on task scheduling and on who is faulty and who is not. Finally, the fifth layer illustrates the need for agreement on resource allocation and reconfiguration for efficient task execution and recovery from potential faults. In addition, we add a scheduler that manages space (resources) and time, and an availability manager. The scheduler and availability manager form the responsiveness coordinator, which is responsible for maximizing responsiveness by ensuring that tasks are executed on time (scheduling and coordinating reconfiguration), even in the presence of faults. Application-specific responsive design methods are placed on top, as shown in Figure 1.1.1.

An example implementation of this framework is illustrated in Figure 1.1.2, in which the functions of CORE are incorporated in the kernel and support applications, armed with application-specific techniques for responsive system design. With the variety and complexity of the numerous applications in multicomputer systems today, we insist on this approach, as it is our belief that general techniques have some limitations and, when used alone, cannot assure a high level of responsiveness. We believe that, although the small generic kernel may be proved to be correct, the correctness of real-world applications, in most cases, cannot be proven. Hence, application-specific techniques are necessary.

In responsive system design, all of the consensus problems, as well as computations, should be accomplished in a timely and reliable manner. To design a responsive system, we need responsive synchronization, responsive communication, responsive task scheduling, responsive fault diagnosis, responsive resource allocation, and responsive reconfiguration. This means each layer should incorporate timeliness and dependability measures, as well as the relevant fault model. This requirement opens a number of research issues, and only a few of them have been studied from the perspective of responsiveness, which is discussed in the next section.
Figure 1.1.2: An example implementation of a responsive systems design framework.
1.1.3
The Responsiveness Measure
In principle, there are two approaches to dealing with responsiveness: one in which the optimization process takes a pair of parameters, such as timeliness and availability, and attempts to optimize them as a pair using, for example, an integer programming formulation of the problem; and another in which timeliness and availability are integrated into a single responsiveness measure. Optimizing a timeliness-availability (reliability) pair hypothetically gives a designer an opportunity for a separation of concerns but, in reality, there is usually a relation, often complicated, between timeliness and availability, and the designer might want to resort to an integrated responsiveness measure.

The responsiveness measure combines the traditional measures of fault tolerance and real time; namely, availability and the ability to complete a task set by some deadline, with the objective of co-optimizing these dependent system qualities. This is done by examining the system from a task and service perspective. Consider that a system consists of a set of tasks, each having a particular execution time and deadline that may or may not be met. Whether or not a task's deadline is met depends on the ability of the system to perform the task set, given that the system hardware and software have some probability of failing. By taking this approach, we acknowledge that responsiveness is the ability to correctly execute the system's required functions (i.e., its task set or a service) in a timely manner, even in the presence of faults.

Responsiveness, then, is dependent on the user's needs with respect to timeliness, a realistic fault model, and his perception of correctness--which should ultimately be the system's goals--and is the probability that these needs will be fulfilled. Unless these needs are very simple, responsiveness will be less than one. The user must balance how often the system can fail against how much the system will cost. This makes building a responsive system an iterative process of the user specifying what task set must be performed with a particular responsiveness, and the system designer creating a system at a certain cost that will meet the user's requirements. Thus, the goal for the responsiveness measure is that it be sufficiently intuitive so the user may easily use it to specify his requirements, and sufficiently rigorous so the designer may analytically evaluate the system's responsiveness.
Figure 1.1.3: Combining system attributes with service attributes.

Notice that reliability and other specific measures of dependability have traditionally been perceived as system attributes, while timeliness has been viewed as a service attribute (see Figure 1.1.3). The challenge is to optimize the two measures in tandem or to combine them to maximize the probability of correct execution on time. In general, one may assume that system responsiveness is a function of timeliness and some dependability measure such as availability, both of which are time-dependent, and can be defined as r(t) = f[a(t), p(t)]. Deriving a precise responsiveness criterion for a multicomputer environment, as well as methods for evaluating and measuring responsiveness, is still an open issue.

If a single-processor system is required to run a single task, its responsiveness r(t) is the probability a(t) that the processor performs the task without error (i.e., the instantaneous availability, or reliability, of the processor) multiplied by the probability p(t) that the processor can complete the task before its deadline (i.e., the task is schedulable at a given time t). In short, r(t) = a(t)p(t). If this system is required to perform n tasks, the probability that a particular task is completed is related to its governing scheduling policy and the other tasks in the system. In addition, if multiple tasks are executed on a single- or multiple-processor system, the responsiveness measure could be defined as
    r(t) = (1/n) Σ_{i=1}^{n} a_i(t) p_i(t),
where a_i is the lifetime availability of task i and p_i is the probability that task i will be completed before its deadline [9]. Availability, a_i, is calculated using the standard evaluation techniques for the configuration of equipment performing task i and the duration of task i. Timeliness, p_i, is the probability that task i will complete before its deadline, given that it is part of some task set scheduled on the system. Note that responsiveness r(t) varies between zero and one, with higher values of r(t) indicating a greater likelihood that the task set will be completed within its deadlines.

For a gracefully degradable system, such as a consensus-based system, a_i and p_i may be calculated to any desired precision by computing a weighted average of these probabilities over the various configurations. For example, a four-processor, single-fault-tolerant, consensus-based system would have a_i and p_i calculated for the two cases when there are four and three fault-free processors (two, one, or zero fault-free processors amounts to a system failure and a_i = 0), then weighted and averaged depending on the expected time the system will spend in these various configurations.

We view the proper definition of responsiveness as the key to any optimization that might be done in responsive systems design. Only experience will show whether r(t) = f[a(t), p(t)] is a measure that can accurately portray a system's ability to satisfy the user's needs, yet can still be useful to the system designer. It should be noted that responsiveness is a function of the current time t, a task set T (with its timing characteristics), the available time to deadline (d = t_d - t), and an execution strategy (replication in time or replication in space).

One of the principal goals in responsive computing is to accurately define responsiveness and make it useful, easily modeled, and easily measurable. Complete and accurate evaluation of responsiveness may require sophisticated models that can cross the bounds of the design hierarchy without the usual state space explosion. We are currently studying system and service models that focus on responsiveness. Since it seems that no existing modeling paradigms are suitable, we extend Petri Nets [3], limit the state explosion by using partitioning and hierarchy, and incorporate both exponential and delay functions.
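To make the weighted-average calculation concrete, the following C sketch evaluates the responsiveness of a four-processor, single-fault-tolerant configuration as described above. All numeric values (configuration weights, availabilities, and deadline-meeting probabilities) are illustrative assumptions, not data from this chapter.

```c
#include <stdio.h>

/* Illustrative responsiveness calculation: a weighted average of
 * availability x timeliness over the system configurations.  Two,
 * one, or zero fault-free processors count as system failure, so
 * those configurations contribute a = p = 0. */
int main(void) {
    const double weight[] = { 0.90, 0.08, 0.02 };   /* expected fraction of time: 4 ok, 3 ok, <=2 ok */
    const double avail[]  = { 0.9999, 0.9990, 0.0 };/* availability in each configuration            */
    const double timely[] = { 0.99, 0.95, 0.0 };    /* probability the task set meets its deadlines  */

    double r = 0.0;
    for (int i = 0; i < 3; i++)
        r += weight[i] * avail[i] * timely[i];      /* weighted a_i(t) * p_i(t) */

    printf("responsiveness r(t) = %.4f\n", r);
    return 0;
}
```

With these assumed numbers the result is close to one, reflecting a system that spends most of its time fully operational; the same loop extends directly to more configurations or tasks.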
1.1.4
The Model
The model for responsive computing should facilitate the goals of timely and correct execution of programs in parallel and distributed environments. The proposed model, called CONCORDS (CONsensus/COmputation model for Responsive Distributed Systems), can be described as follows. In CONCORDS, computing is viewed as a collection of tasks (units of computation) whose execution time and deadline are specified. Tasks may arrive periodically, or sporadically at random, at one or a number of processing elements (PEs) with an ability to send messages among themselves. In order to perform responsive computing, both logical and physical synchronization are necessary. Reliable communication, fault diagnosis, and checkpointing are a must. All of these processes can be executed as consensus. Therefore, consensus plus, of course, computation and recovery are fundamental in responsive systems design.

A parallel or distributed computation is simply a group of these PEs working towards a common goal. We extend Valiant's Bulk-Synchronous Parallel (BSP) model [11] and model such a computation as a sequence of two alternating phases; namely, local execution and scheduling/consensus. The local execution might represent progress (a computation towards achieving expected results) or a recovery from a fault. A computation involves each PE performing its portion of a number of tasks (including recovery procedures) until it must send or receive information from the other PEs involved. Regardless of the nature of this information, an agreement is needed, i.e., consensus is executed. The UNIversal CONsensus protocol, UNICON [1], is a recognition of this, as any nonlocal requirements of a PE are met by a particular instance of it. Therefore, we may model a distributed computation as an alternating sequence of local executions and calls to the UNICON protocol. These phases (consensus/computation) alternate indefinitely or until the application is completed.

As an illustration, consider Figure 1.1.4, which shows the execution profiles of four processors A, B, C, and D. The column below a processor label represents when consensus is being performed (marked in white) and when a local computation is being performed (marked in gray).
14
time
B
D
consensus type clock synchronization and diagnosis (computation) checkpointing
(computation) computation termination
Figure 1.1.4: A computation/consensus view of a distributed program.
A horizontal bar is used to collect related executions of consensus. For example, the first consensus, resulting in clock synchronization and diagnosis, involves all four processors, as marked by the horizontal bar connecting all four columns.

As Valiant argues that the BSP model eases the programmer's task in writing parallel computations, we feel this computation/consensus paradigm will ease the development of responsive operating systems and applications. A consensus-based approach towards the development of responsive operating systems was presented in [10]. The primary rationale is that we can never rely on a single resource; thus, it is necessary to progress via consensus. Therefore, the kernel of a responsive operating system consists of a number of consensus tasks, as illustrated in Figure 1.1.1.

We are currently working on an implementation of universal consensus [1] within the CORE framework, expecting to obtain some definite advantages. First, we can encompass many differing fault models in the system model, and second, UNICON, once a fault model is defined, can be based on a single algorithm. The result is a simplicity and functionality that allows for more reliable coding, low installation overhead, and ease of software maintenance. Moreover, we anticipate that our model will ease the estimation of timeliness of the operating system tasks and applications.

An open question is that of efficiency. While the UNICON protocols, by their very nature of combining numerous system functions in a single message or pass throughout the network, are bound to be efficient, the restrictive model of consensus and computation phases might not be, especially in a large system. To improve system efficiency, the following variations of the CONCORDS model may alleviate the problem:

1. Application-specific CONCORDS. In this model, processors involved in a specific application (computation) perform consensus/computation phases alternately, regardless of other processors in the system. The consensus, scheduling, and reconfiguration tasks might be more efficient, as they involve only part of the system.

2. Exception-driven CONCORDS. In this variation, consensus occurs only on demand, i.e., if one of the processes/processors raises an exception. Note that this model is mainly applicable to a fail-stop fault model where a faulty processor stops and notifies others about its
failure.

While the suggested alternative models may improve efficiency, the potential gains may disappear due to the potentially high overhead of concurrency control for some applications. Another interesting problem, from a scheduling perspective, is to analyze and carry out a comparative analysis of dynamic versus static scheduling and then compare them with respect to a single global versus multiple task queue(s). Also, the placement of software for support of responsive computing remains an open problem. Three possibilities exist: a kernel with an operating system on top of it, a microkernel cooperating with an operating system, and a responsiveness layer on top of an operating system. It remains an open question as to which approach is most cost effective but, obviously, all three of them trade off cost and effectiveness of maximizing responsiveness.
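The alternating consensus/computation structure described in this section can be summarized in a few lines of control flow. The following C sketch is only an illustration of the CONCORDS execution loop; the function names are hypothetical placeholders, not part of the CORE or UNICON interfaces.

```c
#include <stdbool.h>

/* Hypothetical stubs standing in for application work and for an
 * instance of the universal consensus protocol. */
static bool application_complete(void) { return true; }
static void local_execution(void)      { /* progress or recovery */ }
static void unicon_phase(void)         { /* synchronization, diagnosis,
                                             checkpointing, scheduling */ }

/* Each processing element alternates local execution with a
 * consensus phase until the application terminates. */
void concords_loop(void) {
    while (!application_complete()) {
        local_execution();   /* computation towards results, or recovery */
        unicon_phase();      /* agreement on time, faults, and resources */
    }
}
```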
1.1.5
Specification of Universal Consensus
We assume that all of the processors in the system are (logically) completely connected by an underlying communication medium. Moreover, there exists a distribution of message delays such that a processor may estimate the time that a message was sent from the time it was received and from which processor it was sent [4]. We assume that there exist mechanisms for scheduling and preempting tasks on a processor. We also assume that the number of faults tolerated is bounded according to the fault model employed. The processor clocks are assumed to be synchronized by UNICON's synchronization task.

For consensus to take place, at least five questions must be answered. First, who is to be involved in the consensus? The answer to this would be a list of processors. Second, what are the characteristics of those involved? That is, the fault models of the processors are needed. Third, what is to be agreed upon? The consensus task must know what information is to be collected. Fourth, what is the desired timeliness of the task? Fifth, what structure should the consensus take? In other words, do all the processors involved need all of the consensus
results, or will a subset of them be sufficient?

From the defining requirements for consensus, we envision the declaration of UNICON as

    unicon(con_id *id, PE_set members, con_top topology, con_type type, con_time priority)

where id is set to the identification number of this instance of consensus and members is the set of processors involved in the consensus. We assume that the set members refers to a specific set of physical processors, but it could refer to a logical set maintained by the operating system. The structure, topology, possesses two fields; namely, structure, which takes values from the set {single-level, partitioned, unspecified}, and restriction, which takes values from the set {global, application-specific}. The variable type is enumerated from a set including {synchronization, configuration, diagnosis, communication scheduling, checkpointing, termination}. This set can obviously be extended. We may also consider the case in which type is a union of the various members of this set, such as when multiple purposes may be served by a single instance of UNICON. For this chapter we assume that each consensus has a distinct goal but, in practice, various information items could be combined in the consensus messages in order to perform multiple functions at once. The parameter priority determines the timeliness requirements of the responsive consensus.

From this simple system call, it is possible to invoke any number of the consensus protocols. By examining the fault models of the processors in members, the system can choose a suitable consensus algorithm; by examining the number of processors in members, the system can decide whether a partitioning method is useful or, alternatively, the structure may be specified with topology. The consensus algorithm is abstracted across synchronization, configuration, diagnosis, communication scheduling, checkpointing, termination, and consensus information requirements with the type variable. The priority parameter defines the timeliness requirements of a particular instance of UNICON. It is a structure with members time, which is an absolute, real-time value as set by the synchronization
process; periodic, which is a boolean stating whether or not the consensus should be scheduled periodically; duration, which, in the case that the consensus should be scheduled periodically, is the length of that period; and sched, which is of the set {urgent, deadline, asap, lazy}. Note that specifying that the consensus should be completed in 10 seconds implies that time should be set to the current time plus 10 seconds. Also, if a consensus is periodic, then after each deadline the time variable should be increased by duration and the task rescheduled. The meanings of these elements are as follows:

urgent: The consensus is to be completed by time even if this implies that all other scheduled tasks must be preempted. If time has passed, then the consensus task should be the sole purpose of the processor until its completion.

deadline: The consensus should be scheduled during the spare capacity of the system to complete by time. If time passes before completion, then the incomplete results should be provided.

asap: The consensus should be begun and completed as soon after time as the processor's schedule allows. By analogy to an interrupt hierarchy, it could be called a polite consensus.

lazy: The consensus should be scheduled during the processor's spare capacity, but if it is not completed by time then it should be abandoned with the incomplete results reported.

As we assume that all processor clocks are synchronized to within some bound called synch_error, time has meaning, although somewhat inconsistent, throughout the consensus. When we initiate UNICON, therefore, it is necessary to let other processes know that a result is needed by time - synch_error in order that the result will be available at the initiating process by time.
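To show how these parameters fit together, the following C sketch reconstructs plausible type declarations from the description above and issues one call to unicon. The type layouts, enumerator names, and numeric values are assumptions made for illustration; they are not the actual CORE or UNICON headers.

```c
typedef int           con_id;   /* placeholder for the consensus identifier   */
typedef unsigned long PE_set;   /* placeholder bit set of processing elements */

typedef enum { SINGLE_LEVEL, PARTITIONED, UNSPECIFIED } con_structure;
typedef enum { GLOBAL, APPLICATION_SPECIFIC }           con_restriction;

typedef struct {                /* topology: structure and restriction fields */
    con_structure   structure;
    con_restriction restriction;
} con_top;

typedef enum {
    SYNCHRONIZATION, CONFIGURATION, DIAGNOSIS,
    COMMUNICATION_SCHEDULING, CHECKPOINTING, TERMINATION
} con_type;

typedef enum { URGENT, DEADLINE, ASAP, LAZY } con_sched;

typedef struct {                /* priority: timeliness requirements          */
    double    time;             /* absolute real-time deadline                 */
    int       periodic;         /* boolean: reschedule after each deadline     */
    double    duration;         /* period length when periodic                 */
    con_sched sched;            /* urgent, deadline, asap, or lazy             */
} con_time;

int unicon(con_id *id, PE_set members, con_top topology,
           con_type type, con_time priority);

/* Example: a periodic checkpointing consensus over all processors,
 * scheduled in spare capacity to complete every 10 seconds.  The
 * deadline is shifted by synch_error so the result is available at
 * the initiator by the requested time. */
void request_checkpoint(PE_set all_pes, double now, double synch_error) {
    con_id   id;
    con_top  top = { SINGLE_LEVEL, GLOBAL };
    con_time pri = { now + 10.0 - synch_error, 1, 10.0, DEADLINE };
    unicon(&id, all_pes, top, CHECKPOINTING, pri);
}
```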
A prototype for CORE runs as a set of library functions on top of AIX (the IBM version of the UNIX operating system) on a network of RS/6000 workstations. It provides the application developer with support for incorporating fault tolerance and real-time requirements. The earliest-deadline-first scheduler had to be adapted to coexist with the AIX system. UNICON's protocol is implemented as a set of functions that are called by the system and the application to perform communication and consensus. We anticipate that the prototype will enable us to perform experiments that will give us insights into an ultimate implementation of CORE as a UNIX microkernel.
1.1.6
Conclusion
There is an urgent need to provide the user community with responsive processing in parallel and distributed computing environments. The thesis of this chapter is that the proposed framework and model will facilitate the development of systems which would improve the feasibility of achieving two crucial qualities: reliability and timeliness. The concepts of consensus and computation progress are fundamental. The universal consensus algorithm for synchronization, reliable communication, diagnosis, scheduling, checkpointing, and reconfiguration, combined with a consensus/computation model, illustrates a promising approach to responsive computing. In addition, a responsive systems layer such as CORE should be integrated with application-specific methods such as those presented in [7, 8], because high responsiveness can only be achieved by combining an ultrareliable layer (ideally a kernel) with application-specific techniques. As computer systems proliferate and our dependence on them increases, responsiveness will become the most sought-after quality in computer and communication systems.
Acknowledgment I would like to acknowledge my student Mike Barborak for contributing to the specification of universal consensus.
References

[1] M. Barborak and M. Malek. UNICON -- a UNIversal CONsensus for responsive computer systems. Technical Report TR-92-36, Department of Computer Sciences, The University of Texas at Austin, October 1992.

[2] M. Barborak, M. Malek, and A.T. Dahbura. Consensus problem in fault-tolerant computing. ACM Computing Surveys, 25(2):171-220, June 1993.

[3] G. Brat and M. Malek. Incorporating delays and exponential distributions in Petri Nets for responsive systems. Technical report, The University of Texas at Austin, March 1993.

[4] F. Cristian. Probabilistic clock synchronization. Distributed Computing, 3:146-158, 1989.

[5] R. Kieckhafer et al. The MAFT architecture for distributed fault tolerance. IEEE Transactions on Computers, 37(4):398-405, April 1988.

[6] H. Kopetz et al. Distributed fault-tolerant real-time systems: The MARS approach. IEEE Micro, 9(1):25-40, February 1989.

[7] L. Laranjeira, M. Malek, and R. Jenevein. On tolerating faults in naturally redundant algorithms. In the 10th Symposium on Reliable Distributed Systems, pages 118-127, September 1991. Pisa, Italy.

[8] L. Laranjeira, M. Malek, and R. Jenevein. Nest: A nested-predicate scheme for fault tolerance. IEEE Transactions on Computers, 42(11):1303-1324, November 1993.

[9] M. Malek. Responsive systems: A challenge for the nineties. In Euromicro 90, Sixteenth Symposium on Microprocessing and Microprogramming, pages 9-16, August 1990. Amsterdam, The Netherlands, Keynote Address.
[10] M. Malek. A consensus-based framework for responsive computer system design. In The NATO Advanced Study Institute on Real-Time Systems, October 5-18, 1992. Springer-Verlag, St. Martin, West Indies.

[11] L. Valiant. A bridging model for parallel computation. Communications of the ACM, 33(8):103-111, August 1990.
SECTION 1.2
A Methodology for Adapting to Patterns of Faults*

Gul Agha and Daniel C. Sturman†
Abstract

In this paper, we present a language framework for describing dependable systems. Our framework emphasizes modularity and composition. The dependability and functionality aspects of an application may be described independently, providing separation of design concerns. Furthermore, the dependability protocols of an application may be constructed bottom-up as simple protocols that are composed into more complex protocols. Composition makes it easier to reason about the behavior of complex protocols and supports the construction of generic reusable dependability schemes. A significant aspect of our language framework is that dependability protocols may be loaded into a running application and installed dynamically. Dynamic installation makes it possible to impose additional dependability protocols on a server as clients with new dependability demands are integrated into a system. Similarly, if a given dependability protocol is only necessary during some particular phase of execution, it may be installed during that period only.
1.2.1
Introduction
*The research described has been made possible by support from the Office of Naval Research (ONR contract numbers N00014-90-J-1899 and N00014-93-1-0273), by an Incentives for Excellence Award from the Digital Equipment Corporation Faculty Program, and by joint support from the Department of Defense Advanced Research Projects Agency and the National Science Foundation (NSF CCR 90-07195).

†Authors' address: Department of Computer Science, University of Illinois at Urbana-Champaign, 1304 W. Springfield Avenue, Urbana, Illinois 61801, USA. Email: {agha | sturman}@cs.uiuc.edu

A number of systems have been developed to support the development of dependable computing applications. Such support is given in terms of failure
semantics, which specify legal ways in which a component can fail [11]. Failure semantics are enforced through the use of dependability protocols which guarantee that the probability of a failure of a type not specified in the semantics is acceptably small. However, existing systems assume that the failure semantics of a service are static and, therefore, that the dependability protocols used may be fixed. In many computer systems, it is either unsatisfactory to adhere to a static failure semantics or impossible to adequately enforce the semantics with a fixed group of dependability protocols. We illustrate this concept with two example systems:

• Consider an embedded system which is required to function over a long duration, yet is fault-prone due to the uncertain environment in which it operates. If this system is physically isolated, such as in the control system of a satellite, physical modification of system components is often infeasible. In such a system, a change in the physical environment may result in protocols designed for the old environment failing to uphold the failure semantics in the new environment. A different group of dependability protocols may then be required to enforce the desired failure semantics of the system.

• Consider an open system. Open systems allow interactions with the external environment; in particular, new services may be added or deleted dynamically from the system in response to external events. Consequently, it may not be possible to statically determine the system configuration. Without knowing the system configuration, it may not be possible to determine what failure semantics a process must have, or what protocols are necessary to enforce these semantics, until after the process actually joins the system. Furthermore, the addition of new services may require a change in the failure semantics of existing components. For example, a file server may initially address safety only by check-pointing the files to stable storage. New clients that are added to the system, however, may require the server to also provide persistence, and a protocol to support replication may need to be added.

In this paper, we describe a methodology for the modular specification of systems that adapt to patterns of faults. We call the resulting systems adaptively dependable. We present a methodology which allows the transparent installation and reuse of dependability protocols as well as their dynamic installation in a system. Our methodology, when combined with a suitably structured exception handling mechanism and fault detection, allows for the development of
fault handlers which can maintain consistent failure semantics within a changing environment and can alter failure semantics as system needs change. We have provided programmer support for our methodology in the language Screed which is implemented on our run-time system Broadway.

We employ reflection as the enabling technology for dynamic installation of dependability protocols. Reflection means that an application can reason about and manipulate a representation of its own behavior. This representation is called the application's meta-level. The components of an object that may be customized at the meta-level are referred to as the meta-architecture. In our case, the meta-level contains a description which implements the failure semantics of an executing application; reflection thus allows dynamic changes in the execution of an application with respect to dependability.

Besides supporting dynamic installation, our meta-architecture supports transparency and reuse of dependability protocols. For example, the meta-architecture allows protocols to be expressed as abstract operations on messages. Since the individual fields of a particular message are never examined, the same protocol may be used with different applications.

Given this technique for dynamic modification of the dependability protocols used in a system, we describe how fault-detection and exception handling may be used in conjunction with our meta-architecture to support adaptive dependability. We model both failures and exceptions as objects. Each type of fault which may be detected is described as a specific system exception. We construct managers — exception handlers with meta-level capabilities — to address system exceptions. Managers serve three purposes:

• A manager may correct for recoverable faults. The corrections allow the system to continue to function despite a fault. This role is generally referred to as performing forward error recovery.

• Managers provide failure prevention. When a manager discovers a pattern of component failures, it dynamically installs protocols which mask future failures or facilitate future fault-correction by expanding the set of recoverable faults. In this manner, we have taken forward error recovery one step further; rather than simply adjusting the system state, the actual dependability characteristics of the system may be modified.

• Managers support reconfiguration of the dependability protocols in a system. This may be done either to alter the system's failure semantics or to correctly enforce these semantics once the environment changes. Thus,
we can develop dependable long duration systems whose fault patterns are not known at start-up time.

A prototype implementation of a run-time system which tests these ideas is described: the system Broadway supports our meta-architecture as well as failure detection and a set of system exceptions. On top of Broadway, we have implemented the language Screed. Screed is a prototype concurrent actor language which provides complementary constructs for both fault-detection through exception handling and dynamic installation of protocols through a meta-architecture. Screed is presented as a demonstration of how such constructs may be added to existing languages.

This paper is organized as follows. Section 1.2.2 discusses related research in the areas of reflection, exception handling, and languages for fault-tolerance. Section 1.2.3 provides a brief description of the concepts of reflection, the Actor model, and object-orientation. Section 1.2.4 provides a guide to the syntax of Screed to assist in understanding our examples. Section 1.2.5 discusses our meta-level architecture and how it may be used to construct dependability protocols. We also discuss the effect of our meta-level architecture on protocol performance. Section 1.2.6 describes exception handling in Screed and how exception handling may be used in conjunction with our meta-level architecture to implement adaptively dependable systems. We then illustrate this technique with an example of a system adapting to a change in environment. Section 1.2.7 describes how these constructs together support adaptive dependability.
1.2.2
Related Work
A number of languages and systems offer support for constructing fault-tolerant systems. In Argus [23], Avalon [15] and Arjuna [31], the concept of nested transactions is used to structure distributed systems. Consistency and resilience are ensured by atomic actions whose effects are check-pointed at commit time. The focus in [27], [9] and [7] is to provide a set of protocols that represent common communication patterns found in fault-tolerant systems. None of the above systems support the factorization of fault tolerance characteristics from the application-specific code. In [38] and [28], replication can be described separately from the service being replicated. Our approach is more flexible: fault tolerance schemes may not only be described separately, they may be attached and detached dynamically. Another unique aspect of our approach is that different fault tolerance schemes may be composed in a modular fashion. For example, check-pointing may be composed with replication without requiring that the representation of either protocol know about the other.
Non-reflective systems which support customization do so only on a system-wide basis. For example, customization in a micro-kernel based system [1] affects all the objects collectively. In an object-oriented system such as Choices [8], frameworks may be customized for a particular application. However, once customized, the characteristics may not change dynamically. Reflection in an object-based system allows customization of the underlying system independently for each object. Because different protocols are generally required for very specific subsets of the objects in a system, this flexibility is required for implementing dependability protocols.

Reflection has been used to address a number of issues in concurrent systems. For example, the scheduling problem of the Time Warp algorithm for parallel discrete event simulation is modeled by means of reflection in [40]. A reflective implementation of object migration is reported in [37]. Reflection has been used in the Muse Operating System [39] for dynamically modifying the system behavior. Reflective frameworks for the Actor languages MERING IV and Rosette have been proposed in [16] and [35], respectively. In MERING IV, programs may access meta-instances to modify an object or meta-classes to change a class definition. In Rosette, the meta-level is described in terms of three components: a container, which represents the acquaintances and script; a processor, which acts as the scheduler for the actor; and a mailbox, which handles message reception.

The concept of unifying exception handling and fault detection was originally proposed in [30] and then refined in [29]. In these papers, detected failures are considered asynchronous events much as exceptional conditions are treated in distributed programming languages. Therefore, exception handling constructs provide a natural way to incorporate failure-response code into an application.

Goodenough introduced the idea of exceptions and exception handling in [19]. Since then, many different exception handling mechanisms have been proposed. Exception handling constructs have been developed for object-based languages such as CLU [24] and Ada [12]. Dony [13] describes an approach for object-oriented languages and its implementation in Smalltalk. In this approach, exceptions are implemented as objects much as we do. Exception handling for C++ is discussed in [33]. A good overview of techniques proposed for other object-oriented languages can be found in [14]. A critical difference between object-oriented approaches to exception handling and non-object-oriented approaches such as CLU [24] or Ada [12] is that, in the latter, the exception object is represented by a set of parameters to a function. Therefore, on generating the signal, a parameter list must provide all possible information used by the handler.
For concurrent systems, another technique has been proposed for languages which use RPC communication [10]; the technique is based on synchronized components which allows the exception handling constructs to be closer to that of a sequential system than an asynchronous system. Exception handling mechanisms have been proposed for other Actor languages. An exception handling mechanism was proposed for ABCL/1 and for Acore [21, 26]; the mechanism uses complaint addresses to support exception handling. A complaint address is a specific location, specified with each message, to which all signals are dispatched.
1.2.3
Background
Before discussing our meta-architecture and how we use it to support adaptive dependability, we first discuss in greater detail some concepts that are important in our framework. The organization of this section is as follows. First, we briefly discuss some of the advantages of object-oriented programming and how they are useful with our methodology. Secondly, we describe the Actor model of concurrent computation. We chose the Actor model as the basis of our work due to the ease with which it may be extended. Finally, we give a more in-depth discussion of reflection and how it relates to a programming language.
Object Orientation

In an object-oriented language, a program is organized as a collection of objects. Each object is an encapsulated entity, representing an instance of an abstract data type. The local data comprising each object may only be accessed through an interface specified as a set of methods. The operations carried out by a method are not visible outside the object. Objects communicate with messages which invoke a method in the receiving object. The local data of another object cannot otherwise be accessed or modified.

Objects are instantiated from classes. A class is a user-defined abstraction. Classes may be thought of as types and objects as elements of that type. Instantiation is the creation of an object of a particular class. Classes contain the description (code) of the methods and of the instance variables for objects instantiated from that class. Classes may inherit from other classes. Inheritance provides the inheriting class with the properties (the methods and instance variables) of the ancestor class. The inheriting class can then utilize these properties as well as augment them with new instance variables or methods. Methods may be inherited directly or redefined, facilitating code reuse.
Object-oriented languages allow for modular development of systems. The implementation of each component is hidden from other components: only the interface is known. In this way, a component's implementation may change without affecting other components. Code may also be reused efficiently since components may share code by inheriting from a common ancestor class. Note that our use of classes and inheritance differs from that in sequential object-oriented languages in that we do not support class variables.
The Actor Model

We illustrate our approach using the Actor model [2, 3]. Actors can be thought of as an abstract representation for multicomputer architectures. An actor is an encapsulated object that communicates with other actors through asynchronous point-to-point message passing. Specifically, an actor language supports three primitive operators:

send Actors communicate through asynchronous, point-to-point message passing. The send operator is used to communicate a message asynchronously to another actor. Each message invokes a method (or procedure) at the destination. Upon reception of the message at the destination, the message will be buffered in a mail queue. Each actor has a unique mail address which is used to specify a target for communication. Mail addresses may also be communicated in a message, allowing for a dynamic communication topology.

new Actors may dynamically create other actors. The new operator takes an actor behavior (class name) as a parameter, creates a new actor with the correct behavior and returns its mail address. The mail address is initially known only by the creating actor. However, the creator may subsequently include this new mail address in future messages.

become The become operator marks the end of state modifications in the execution of a method. Once a become has executed in a method, the actor may continue to modify state local to the method. However, such state changes do not affect the way in which the actor may process the next message. Therefore, once this operator is executed, the actor may begin processing its next pending message. Judicious use of the become operator may improve performance by allowing internal concurrency; i.e., multiple threads of execution within a single actor.

It is important to note that the idea of using reflection to describe dependability is not tied to any specific programming language. Our methodology
assumes only that these three operators are in some way incorporated into the language; we require that new actors may be created dynamically and that the communication topology of a system is reconfigurable. In fact, the actor operators may be used to extend almost any standard sequential language to provide coordination and communication in a distributed environment; local computations may still be expressed in terms of the sequential language. The level at which the sequential and actor constructs are integrated determines the amount of concurrency available in the system.

An actor language may be used to "wrap" existing sequential programs, serving as an interconnection language. With this approach, each method in an actor class invokes a subroutine, or set of routines, written in a sequential language and dispatches messages based on the values returned. Such an approach was taken by the Carnot project at MCC [34]. In Carnot, the actor language Rosette "glues" sequential components together to facilitate heterogeneous distributed computing.

A complementary approach is to actually integrate the actor operators into an existing language. Broadway, the run-time platform we use to implement the ideas in this paper, supports C++ calls for both send and new; the become operator is implicit at the end of each method. Using Broadway, developers of distributed programs may use a well-known language, C++, to develop distributed programs. Actor operators have also been combined with functional languages. Specifically, actor operators have been added to the call-by-value λ-calculus [5]. In this case, the local computation is modeled as a sequential functional computation. An operational semantics is developed for the resulting language. The semantics supports operational reasoning. In [36], the semantics is extended to support formal reasoning about meta-architectures such as the one we describe here.

If necessary, the actor operators may also be extended to support more complex functionality. In particular, the communication model may be modified to support more complex message passing constructs. The asynchronous point-to-point communication model for actors has been extended to include pattern-based multicasts using ActorSpaces [6]. Furthermore, remote procedure calls may be transformed into a set of asynchronous messages using a concurrent analog of the continuation passing style [22].

Synchronization constraints [17] and multi-object constraints [18] are two other extensions of the actor operators which greatly simplify distributed programming. Constraints allow the programmer to specify "when" asynchronous
events may occur based on the state of a single object or the occurrence of other events in the system. Using these techniques, the non-determinism of asynchronous communication may be constrained to maintain a consistent system state without requiring an overly restrictive communication model.
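As a minimal illustration, the following sketch shows the new and send operators using the Screed-style syntax introduced in Section 1.2.4; the Coordinator and Worker classes and the process method are hypothetical, and in Screed the become operator is implicit at the end of each method.

class Coordinator {
  actor worker;

  start() {
    worker = new Worker;    /* new: create an actor and obtain its mail address */
    worker.process(42);     /* send: asynchronous, point-to-point message */
  }                         /* become is implicit here: the actor may accept its next message */
}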
Reflection
Figure 1.2.1: Through reflection, an application may modify the system by modifying its meta-objects. Meta-objects are a system-level description of the base-level application objects.
Reflection means that a system can manipulate a causally connected description of itself [32, 25]. Causal connection implies that changes to the description have an immediate effect on the described object. In a reflective system, a change in these descriptions or meta-objects results in a change in how objects are implemented. The object for which a meta-object represents certain aspects of the implementation is called the base object. This relationship is shown in Figure 1.2.1.

Meta-objects may be thought of as objects which logically belong in the underlying run-time system. For example, a meta-object might control the message lookup scheme that maps incoming messages to operations in the base object. Another meta-object may modify how values are read from memory. Using reflection, such implementation-level objects can be accessed and examined,
and user-defined meta-objects may be installed, yielding a potentially customizable run-time system within a single language framework.

The reflective capabilities which are provided by a language are referred to as the meta-level architecture of the language. The meta-level architecture may provide variable levels of sophistication, depending on the desired level of customization. The most general meta-level architecture is comprised of complete interpreters, thus allowing customization of all aspects of the implementation of objects. In practice, this generality is not always needed and, furthermore, defining a more restrictive meta-level architecture may allow reflection to be realized in a compiled language. The choice of a meta-level architecture is part of the language design. Customizability of a language implementation must be anticipated when designing the run-time structure. Although a restrictive meta-level architecture limits flexibility, it provides greater safety and structure. If all aspects of the implementation were mutable, an entirely new semantics for the language could be defined at run-time; in this case, reasoning about the behavior of a program would be difficult.

We limit our meta-level to contain only the aspects that are relevant to dependability. Application-specific functionality is described in the form of base objects and dependability protocols are described in terms of meta-objects. Thus, dependability is modeled as a special way of implementing the application in question. Our methodology gives modularity since functionality and dependability are described in separate objects. Since meta-objects can be defined and installed dynamically, the objects in a system can dynamically change the protocols enforcing their failure semantics as system needs change. Furthermore, new dependability protocols may be defined while a system is running and put into effect without stopping and recompiling the system. For example, if a communication line within a system shows potential for unacceptable error rates, more dependable communication protocols may be installed without stopping and recompiling the entire system.

Since meta-objects are themselves objects, they can also have meta-objects associated with them, giving customizable implementation of meta-objects. In this way, meta-objects realizing a given dependability protocol may again be subject to another dependability protocol. This scenario implies a hierarchy of meta-objects where each meta-object contributes a part of the dependability characteristics for the application in question. Each meta-object may be defined separately and composed with other meta-objects in a layered structure supporting reuse and incremental construction of dependability protocols.

Because installation of a malfunctioning meta-level may compromise the dependability of a system, precautions must be taken to protect against erroneous
or malicious meta-objects. To provide the needed protection of the meta-level, we introduce the concept of privileged objects called managers. Only managers may install meta-objects. Using operating system terminology, a manager should be thought of as a privileged process which can dynamically load new modules (meta-objects) into the kernel (meta-level). It should be observed that, because of the close resemblance to the operating system world, many of the operating system protection strategies can be reused in our design. We will not discuss particular mechanisms for enforcing the protection provided by the managers in greater detail here. Because only managers may install meta-objects, special requirements can be enforced by the managers on the structure of objects which may be installed as meta-objects. For example, managers may only allow installation of meta-objects instantiated from special verified and trusted libraries. Greater or fewer restrictions may be imposed on the meta-level, depending on the dependability and security requirements that a given application must meet.
1.2.4
Screed
Screed is an object-oriented actor language that compiles applications for Broadway. Screed will be used to illustrate examples in this paper. Screed is an object-oriented language: programs are written in terms of class definitions. Each class defines a single actor behavior and consists of a set of variable declarations and a set of method definitions. Screed supports inheritance. A class for which a parent class is not specified will, by default, inherit from the system-defined Object class. At any point, a parent method may be referenced by sending a message to the "object" parent. Inheritance is specified when the class is defined:

class MyMailQueue : MailQueue {
  ... instance variables ...

  get() { ... method body ... }
  put() { ... method body ... }
}

In this example, the class MyMailQueue with the methods get and put is defined. It inherits from the class MailQueue.
Classes may be instantiated using the new command which returns a new actor address. For example:

foo = new MyMailQueue;

This statement creates a new actor with the behavior MyMailQueue and returns the address of this actor, which is assigned to foo.

There are five primitive types in Screed. The types int, real, and string are self-explanatory. The type actor holds the address of any actor, regardless of class. The type method can have the value of any legal method name. In addition, one-dimensional arrays of any of these types may be specified. Arrays are defined and used as in C++:

actor myReplicas[5];
myReplicas[2] = new ...;

Actors communicate through asynchronous message passing. In the current implementation of Broadway message ordering (from a given source to the same destination) is guaranteed, although actor semantics enforce no such requirement. Messages are sent by specifying a method and an actor address with the appropriate parameters:

foo.get();

In this case, the method get is invoked on the actor foo without any parameters. Since methods are first-class values, it would be possible to specify a variable instead of the name of a particular method. Parameters are dynamically type-checked upon reception at the message destination. Note that since we are using asynchronous message passing, this method invocation does not block.

Although asynchronous message passing provides improved performance and concurrency, a drawback is the difficulty in providing return values: since the method does not block upon sending a message, it is necessary to specify a return address and method in the message itself. Therefore, method invocations may return a value, thereby acting as a remote procedure call (rpc). For example:

A
x = foo.get();
B

With rpc-communication, the current method invocation will block. The instructions in A will execute, followed by a message send to foo. B will not execute until a return value arrives and the value is assigned to x.
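The following fragment pulls these constructs together; it is only an illustrative sketch, and the Calculator class, the complex class and its divide and scale methods are hypothetical (divide also appears in the example of Figure 1.2.10).

class Calculator {
  actor result;

  compute() {
    actor a;
    actor b;

    a = new complex(2,3);     /* create two new actors */
    b = new complex(4,5);
    result = a.divide(b);     /* rpc-style invocation: blocks until a value is returned */
    b.scale(2);               /* plain asynchronous send: does not block */
  }
}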
In an asynchronous system, the programmer may want to prevent certain methods from executing based on an actor's state. Therefore, we support synchronization constraints [17] in Screed. Using synchronization constraints, the programmer will be able to specify exactly which methods may not be invoked. Maximal concurrency is then preserved since only the minimal synchronization — as specified by the programmer, not the language — will be enforced. The other constructs which comprise expressions in Screed (if, while, etc.) are similar to those in C; we do not describe them further.
1.2.5
Meta-level Architecture for Ultra-dependability
In this section we introduce MAUD (Meta-level Architecture for Ultra Dependability) [4]. MAUD supports the development of reusable dependability protocols. These protocols may then be installed during the execution of an application. MAUD has been implemented on Broadway, our run-time environment for actors. We begin with a discussion of MAUD's structure. We then discuss how transparency and reusability of protocols are supported by MAUD and provide an example to illustrate the concepts. We finish this section by demonstrating how MAUD also allows the composition of protocols and give an example of composition.
A Meta-Level Architecture

As previously mentioned, MAUD is designed to support the structures that are necessary to implement dependability. In MAUD, there are three meta-objects for each actor: dispatcher, mail queue and acquaintances. In the next three paragraphs we describe the structure of meta-objects in MAUD. Note that MAUD is a particular system developed for use with actors. It may be possible, however, to develop similar systems for other models.

The dispatcher and mail queue meta-objects customize the communication primitives of objects so that their interaction can be modified for a variety of dependability characteristics. The dispatcher meta-object is a representation of the implementation of the message-send action. Whenever the base object issues a message send, the run-time system calls the transmit method on the installed dispatcher. The dispatcher performs whatever actions are needed to send the given message. Installing dispatchers to modify the send behavior makes it possible to implement customized message delivery patterns.
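As an illustration only, a dispatcher that tags every outgoing message (in the spirit of the Tagger used later in Example 1) might be sketched as follows; the Dispatcher parent class and the message operations add_arg and send are assumptions for this sketch, not part of the system as described.

class Tagger : Dispatcher {
  int next_tag;

  transmit(msg m) {            /* called by the run-time system for every send */
    m.add_arg(next_tag);       /* attach a sequence tag to the outgoing message */
    next_tag = next_tag + 1;
    m.send();                  /* pass the tagged message on */
  }
}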
A mail queue meta-object represents the mail queue holding the incoming messages sent to an actor. A mail queue is an object with get and put operations. After installation of a mail queue meta-object, its get operation is called by the run-time system whenever the base object is ready to process a message. The put operation on a mail queue is called by the run-time system whenever a message for the base object arrives. By installing a mail queue at the meta-level, it is possible to customize the way messages flow into the base object.

The acquaintances meta-object is a list representing the acquaintances of a base object. In an actor system, all entities are actors. Although they may be implemented as local state, even primitive data objects, such as integers or strings, are considered acquaintances in an actor system. Therefore, in an actor language the acquaintances and the mail queue comprise the complete state of an actor. The acquaintances meta-object allows for check-pointing of actors.

Meta-objects are examined and installed by means of meta-operations. Meta-operations are defined in the class called Object which is the root of the inheritance hierarchy. All classes in the system inherit from Object implying that meta-operations can be called on each actor in the system. The meta-operations change_mailQueue and change_dispatcher install mail queues and dispatchers for the object on which they are called. Similarly, the meta-operations get_mailQueue, get_dispatcher and get_acquaintances return the meta-objects of a given actor. If no meta-objects have been previously installed, an object representing the built-in, default, implementation is returned. Such default meta-objects are created in a lazy fashion when a meta-operation is actually called.
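For illustration, a manager holding the address of some actor a might install and inspect meta-objects roughly as follows; this is only a sketch, and the variable names are hypothetical.

actor q;
actor d;

q = new MyMailQueue;        /* the mail queue class defined in Section 1.2.4 */
a.change_mailQueue(q);      /* install a custom mail queue for actor a */
d = a.get_dispatcher();     /* inspect the current dispatcher; the built-in default is returned if none was installed */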
Transparency and Reuse

By describing our dependability protocols in terms of meta-level dispatchers and mail queues, we are able to construct protocols in terms of operations on messages where we treat each message as an integral entity. There are several advantages to developing dependability protocols in this manner. The first advantage is the natural way in which protocols may now be expressed. When dependability protocols are described in the literature, they are described in terms of abstract operations on messages, i.e. the contents of the messages are not used in determining the nature of the manipulation to be performed. Therefore, it is logical to code protocols in a manner more closely resembling their natural language description.

Secondly, because the protocols are expressed in terms of abstract messages
and because every object may have a meta-level mail queue and dispatcher, a library of protocols may be developed which may be used with any object in the system. Such a library would consist of protocols expressed in terms of a mail queue and dispatcher pair. The meta-objects may then be installed on any object in the system. Since the protocols deal only with entire messages, the actual data of such messages is irrelevant to the operation of the protocol. Only fields common to every message, such as source, destination, time sent, etc., need be inspected.

The libraries could also be used with other systems, allowing the reuse of dependability protocols. One set of developers could be responsible for the dependability of multiple software systems and develop a protocol library for use with all of them. Since protocols implemented with MAUD are transparent to the application, other development teams, who are responsible for development of the application programs, need not be concerned with dependability. In the final system, protocols from the library may be installed on objects in the application, providing dependability in the composed system.

Example 1: A Replicated Server

In this section, we provide an example of how a protocol may be described using MAUD. In a distributed system, an important service may be replicated to maintain availability despite processor faults. In this section, we will give an example of how MAUD can be used in an actor domain to develop a modular and application-independent implementation of a protocol which uses replication to protect against crash failures.

The protocol we describe is quite simple: each message sent to the server is forwarded to a backup copy of the server. In this manner, there is an alternate copy of the server in case of a crash. Reply messages from both the original and backup servers are then tagged and the client eliminates duplicate messages. Figure 1.2.2 shows the resulting actions occurring when a message is sent to the replicated service. The original server is actor S1. When a message is received by the Forwarder, the message is forwarded to the backup S2. S2 is initialized with the same behavior and state as S1. Since they will receive the same messages in the same order, their state will remain consistent. Therefore, any replies will be identical and in the same order. The replies are tagged by the dispatchers of class Tagger and only the first copy of each message is passed on to the client by Eliminator. Forwarding messages to the backup server is implemented using a meta-level mail queue. The Screed code for this mail queue is presented in Figure 1.2.3.
Figure 1.2.2: When a message is sent by the clients A or B to the replicated service S1, the message is received by the Forwarder and a copy is forwarded to the backup S2. When the servers reply, the Tagger dispatchers tag each message so that the Eliminator mail queues may remove duplicate results. If S1 crashes, manager actors will install S2 as the new server.
Using a dispatcher, each reply message of the server is tagged to allow the elimination of duplicate replies by the client. A mail queue at the client performs this duplicate elimination. The code for this mail queue is shown in Figure 1.2.4.

We assume that managers themselves install appropriate meta-objects realizing a given dependability protocol. Therefore, we specify the relevant dependability protocols by describing the behavior of the initiating manager as well as the installed mail queues and dispatchers. A manager in charge of replicating a service takes the following actions to achieve the state shown in Figure 1.2.2:

1. The specified server is replicated by a manager by creating an actor with the same behavior and state.

2. A mail queue is installed for the original server to make it act as the Forwarder described above.

3. The mail queues of the original clients are modified to act as the Eliminator described above.
class Forwarder : MailQueue {
  actor backup;
  actor server;

  put(msg m) {
    m.base_send();
    m.set_dest(backup);
    m.send();
  }
}

Figure 1.2.3: Code for the server-end mail queue which implements replication. The mail queue Forwarder sends a copy of each message to a backup copy of the server.
4. The dispatchers of the servers are changed to tag all messages so that the Eliminator may remove copies of the same message.

5. Upon detection of a crash of S1, the manager takes appropriate action to ensure all further requests to the server are directed to S2. The manager may also create another backup at this time.

Although this example is simple, it does illustrate some of the benefits of our approach. The manager initiating the replication protocol needs no advance knowledge of the service to be replicated nor does the replicated service need to know that it is being replicated. Because the clients using the replicated service are not modified in any way, this gives us the flexibility to dynamically replicate and unreplicate services while the system is running.
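Purely as an illustration of how a manager might carry out steps 1-4 with the meta-operations described earlier, consider the following sketch. The ReplicationManager class, the replicate method, the clients array, the constant NUMCLIENTS, the Server and Tagger classes are all hypothetical, and the creation of a backup with the same state as the original server is abbreviated to a simple new.

class ReplicationManager {
  actor clients[NUMCLIENTS];

  replicate(actor server) {
    actor backup;
    actor fwd;
    actor elim;
    actor tag1;
    actor tag2;
    int i;

    backup = new Server;                    /* step 1: create the backup (state transfer omitted) */

    fwd = new Forwarder;                    /* step 2: forward the server's incoming messages */
    server.change_mailQueue(fwd);

    for (i = 0; i < NUMCLIENTS; i = i + 1) {
      elim = new Eliminator;                /* step 3: clients discard duplicate replies */
      clients[i].change_mailQueue(elim);
    }

    tag1 = new Tagger;                      /* step 4: tag replies from both servers */
    tag2 = new Tagger;
    server.change_dispatcher(tag1);
    backup.change_dispatcher(tag2);
  }
}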
Composition of Dependability Characteristics

In some cases, dependability can only be guaranteed by using several different protocols. For example, a system employing replication to avoid possible processor faults may also need to guarantee consensus on multi-party transactions through the use of three-phase commit or some similar mechanism. Unfortunately, writing one protocol which has the functionality of multiple protocols can lead to very complex code. In addition, the number of possible permutations of protocols grows exponentially — making it necessary to predict all possibly needed combinations in a system. Therefore, it is desirable to be able to compose two protocols written independently.
class Eliminator : Mailq {
  int tag;
  actor members[NUMREP];
  actor client;

  /* No get method is required since we use
   * the default behavior inherited from Mailq. */

  put(msg m) {
    int i;

    for (i = 0; i < NUMREP; i = i + 1)
      if (m.get_src() == members[i])
        /* Since the message was from a replica,
         * we know that the first argument is a tag and
         * the second is the original message. */
        if (m.arg[0] < tag)
          /* Discard message */
          return;
        else if (m.arg[0] == tag) {
          self.enqueue(m.arg[1]);
          tag = tag + 1;
        }
  }
}

Figure 1.2.4: Code for the client-end mail queue which implements replication. The mail queue Eliminator removes tags (which have been added to all server replies by some other dispatcher) and takes the first message labeled by a new tag.
add_mailq(actor aMailq) {
  if (mailq == nil)
    self.change_mailq(aMailq);
  else
    mailq.add_mailq(aMailq);
}

add_dispatcher(actor aDispatcher) {
  if (dispatcher == nil)
    self.change_dispatcher(aDispatcher);
  else
    dispatcher.add_dispatcher(aDispatcher);
}

Figure 1.2.5: The additional methods which must be inherited to allow for protocol composition.
In some cases this may not be possible due to a conflict in the semantics of the two protocols. In other cases, performance may depend greatly on the way in which two protocols are composed. For many common protocols such as replication, checksum error detection, message encryption, or check-pointing, composition is possible.

Because the meta-components of an object are themselves objects in a reflective system, there is a general solution for composing two protocols using MAUD. A simple change to the meta-operations inherited from the Object class, along with a few restrictions on the construction of mail queues and dispatchers, allows us to layer protocols in a general way. Figure 1.2.5 shows how an add_mailq method could be expressed in terms of the other meta-operations to allow layering. Because the mail queue and the dispatcher are objects, we can send a message to install meta-objects customizing their mail queue or dispatcher.

By adding protocols in the above manner, the outer mail queue functionality will be performed on incoming messages before they are passed on to the "inner" mail queues. For the send behaviors, the process is reversed with the innermost send behavior being performed first and the outermost behavior last, thereby creating an onion-like model with the newest layer closest to the outside world.

To preserve the model, however, several restrictions must be applied to the behavior of dispatchers and mail queues. We define the partner of a mail queue as being the dispatcher which handles the output of a protocol and the partner of a dispatcher as being the mail queue which receives input for the protocol. In Figure 1.2.6, B and C are partners as well as E and D. Each pair implements one protocol. It is possible for a meta-object to have a null partner.
Figure 1.2.6: Partners and owner relationships. A is the owner of all other actors in the figure. Dispatcher B and mail queue C are partners, as are dispatcher D and mail queue E.
The owner application of a meta-object is inductively defined as either its base object, if its base object is not a meta-object, or the owner application of its base object. For example, in Figure 1.2.6, A is the owner application of meta-objects B, C, D, and E. With the above definition we can restrict the communication behavior of the actors so that:

• A mail queue or dispatcher may send or receive messages from its partner or an object created by itself or its partner.

• A dispatcher may send messages to the outside world, i.e. to an object which is not a mail queue or dispatcher of the owner application (although the message might be sent through the dispatcher's dispatcher). A dispatcher may receive transmit messages from its base object and otherwise may only receive messages from its mail queue partner. Therefore, a dispatcher with a null mail queue partner may only receive transmit messages from its base object or communicate with actors it created.

• A mail queue may receive messages from the outside world (through its own mail queue), and send put messages when responding to get messages from its base object. Mail queues may otherwise only send messages
to its dispatcher partner or actors it created. Therefore, a mail queue with a null dispatcher partner may only send put messages to its base object or communicate with actors it created.

• Objects created by a mail queue or dispatcher may communicate with each other, their creator, or their creator's partner.

Because of the above restrictions, regardless of the number of protocols added to an object there is exactly one path which incoming messages follow — starting with the outermost mail queue — and exactly one path for outgoing messages in each object — ending with the outermost dispatcher. Therefore, when a new dispatcher is added to an object, all outgoing messages from the object must pass through the new dispatcher. When a new mail queue is installed, it will handle all incoming messages before passing them down to the next layer. Thus, a model of objects resembling the layers of an onion is created; each addition of a protocol adds a new layer in the same way regardless of how many layers currently exist. With the above rules, protocols can be composed without any previous knowledge that the composition was going to occur, and protocols can now be added and removed as needed without regard either to the actor itself or to the protocols already installed.

In Figure 1.2.6, actors B and C are initially installed as one "layer." Messages come into the layer only through C and leave through B. Therefore, D and E may be installed with the add_mailq and add_dispatcher messages as if they were being added to a single actor. Now messages coming into the composite object through E are then received by C. Messages sent are first processed by B and then by D.

Example 2: Composing Two Protocols

Figure 1.2.7 shows the result of imposing the protocol described in Example 1 on a set of actors already using a checksum routine to guarantee message correctness. Originally, each actor had a corresponding Check-In mail queue and a Check-Out dispatcher. When server S1 is replicated, its meta-level objects are also replicated. The Forwarder mail queue is installed as the meta-level mail queue of S1's mail queue. It will forward all messages to S2. A Tagger dispatcher is installed for each of the two servers and the Eliminator mail queue removes duplicate messages at the client. Although this protocol would be difficult to write as one entity, composition allows their modular, and therefore simpler, development.

In terms of our onion-layer model, each Check-In/Check-Out pair forms a layer.
Figure 1.2.7: System resulting from the composition of a replication protocol and message checksum protocol. When a message is sent by the client A (1), the Check-Out dispatcher adds the checksum information to the message. The message is then forwarded to the replica as described in Example 1 (2-3). The checksum information is removed by the Check-In mail queue (4) and the messages are processed, resulting in a reply (5). The reply messages both have the checksum information added (6) before they are tagged and sent to the client (7). At the client, duplicate messages are removed, the checksum information is checked, and the message is delivered.
For example, the innermost layer for server S1 consists of a Check-Out dispatcher and a Check-In mail queue. The outermost layer at S1 is comprised of a Tagger dispatcher and a Forwarder mail queue. The client A also has two layers. However, its outer layer consists solely of the Eliminator; this mail queue has a null dispatcher partner. Similarly, at server S2, the outermost layer consists only of a Tagger dispatcher with a null mail queue partner.

As can be seen in the above example, the onion-layer model only provides consistency for mail queue and dispatcher installation at a single node: a manager that follows the above rules may still install protocols incorrectly. Such an error may occur if the protocols are installed in one order at one node and in a different order at another node. For example, if the manager installed the Eliminator mail queue at client A as the innermost layer rather than the outermost, the system would not operate correctly. An area of current research is developing methods for specifying managers which simplify protocol installation and guarantee global consistency.
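As a rough sketch of how the composition in Example 2 might be initiated, a manager could layer the replication meta-objects on top of the existing checksum layer using the add_mailq and add_dispatcher methods of Figure 1.2.5. The fragment below is illustrative only; initialization of the meta-objects' acquaintances is omitted and the variable names are hypothetical.

actor fwd;
actor tag;
actor elim;

fwd = new Forwarder;
tag = new Tagger;
elim = new Eliminator;

server.add_mailq(fwd);        /* new outermost layer for messages arriving at S1 */
server.add_dispatcher(tag);   /* new outermost layer for replies leaving S1 */
client.add_mailq(elim);       /* outer layer at the client: discard duplicate replies */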
Performance

In implementing MAUD in Broadway, we have found that, in most cases, the additional message overhead caused by reflection is small compared to the actual cost in messages accrued by the protocols themselves. Using MAUD, there are additional messages, proportional to n, upon message reception, where n is the number of protocols composed together to provide dependability for a single object. Upon reception of a message, the message is routed up the chain of meta-level mail queues (n messages) and then works its way down through a series of get and put messages. For message transmission, there are n additional transmit messages. Since each object is usually protected by only a small number (1 or 2) of protocols, this cost is not great. Since meta-level objects are most likely to be local to the base actor, most messages to meta-objects will be local and inexpensive.

Furthermore, we use caching of the highest level mail queue to eliminate n of the messages: the system records the address of the top level mail queue and directs all messages intended for the base object to this mail queue. To preserve correctness with caching, meta-object installation is made atomic. Once a protocol installation begins, no messages are processed until the protocol is installed. This optimization is especially critical if some meta-objects need to be on a separate node. Placement of meta-objects on a different node from the base object is only done when physical separation is necessary for dependability: in this case, the inter-node communication from meta-mail queue to base-object
or base-object to meta-dispatcher would normally be required by the protocol, regardless of implementation technique. On the other hand, the communication cost from base-object to meta-mail queue is only due to the nature of using reflection. Therefore, caching eliminates this additional expense.
1.2.6
Exception Handling
Given a meta-level such as MAUD, it is still necessary for a programming language to provide flexible constructs supporting adaptive dependability. In particular, it is important to convey information to the correct entities when system failures occur. We have chosen exception handling as the medium through which managers are informed of problems in the system. This technique has been used extensively with forward error recovery; we simply extend the notion by having our managers prevent future failures through dynamic protocol installation. In this section, we describe the exception handling mechanism in Screed, our prototype actor language. To support adaptive dependability, faults and exceptions have been unified as one concept and exception handlers may be shared between objects. Broadway provides a set of system exceptions, some of which are notifications of failures. For example, when an actor attempts to communicate with an unreachable node, a crash exception is generated. We begin with a discussion of the general structure of exception handling in Screed followed by a specific illustration of the syntax used. We then show how this structure may be used with the meta-architecture to design adaptively dependable systems.
Exception Handling Components

Exceptions are signaled whenever an unexpected condition is encountered. An exception may be signaled either by the run-time system or by the application. The former are referred to as system exceptions and the latter as user-defined exceptions.

Exceptions in Screed are represented as objects, as proposed in [13] for sequential object-oriented languages. Although none of the other concurrent languages discussed above have taken this approach, we feel representing exceptions as objects allows for more flexible and efficient exception handling; all the information needed by a handler is contained in one object. All system exceptions are derived, through inheritance, from the class exception. User-defined exceptions may inherit from the exception class or from any other node
on the system exception inheritance tree. Below, we discuss the parties involved in the generation of an exception and then the structure of system exceptions.

There are four roles involved in handling any exceptional condition: invoker, signaler, exception, and handler (see Figure 1.2.8). Each role is represented as an object in the system. The invoker initiates the method of the signaler which results in an exception. The occurrence of an exception generates a signal. When a signal occurs, a new exception object is created. The signaler notifies the appropriate handler object of the exception's mail address. The handler must then diagnose the exception and perform any appropriate actions for handling the exception.

Exception handlers are constructed by the programmer as Screed actor classes. For each exception a handler accepts, a method must exist with the same name as the exception and which takes an instance of the exception class as a parameter. In all other ways, handlers are identical to other actor classes: they may have a set of instance variables, inherit from other classes, and may communicate with any of their acquaintances. They may also have other, non-exception methods.
Figure 1.2.8: The four roles involved with an exceptional condition. The invoker initiated the method in the signaler which caused the exception. An object of the appropriate exception class is created and the handler is notified of the exception's mail address. The handler may then interact with the invoker and/or the signaler as well as the exception object to resolve the exception.
All exceptions must inherit from the class exception. When an exception is signaled, an object of the appropriate exception class is instantiated and initialized with any information needed by the handler to process the exception. Some of the initialization fields are supplied by the run-time system. These
fields are contained in the exception class from which all exception objects inherit, and are utilized through the methods inherited from the exception class. Additional arguments for the initialization of an exception may be specified by the objects raising a signal. For example, an arithmetic exception which is initiated by an application could be initialized when signaled with the values of the operands in the arithmetic operation. This exception object would still have default values specified by the system. Methods defined in the exception class make use of the system-supplied values. These methods are:

name returns the name of the exception as a method value. Since method names are first-class values in Screed, this method enables the automatic calling of the correct method to handle it.

invoker returns the mail address of the actor which invoked the method resulting in the generation of the signal.

signaler returns the mail address of the signal generator.

source returns the name of the method in which the signal was generated.

arguments returns a list of the arguments that were passed to the method in which the signal was generated.

request returns TRUE if the invoker is waiting on a reply, FALSE otherwise.

reply allows a handler to reply to a request that was interrupted by the signal. The reply method can be used to supply an acceptable value to the invoker, thereby allowing the continuation of the computation.

Each exception handler may utilize only a few of these fields. However, since our environment is asynchronous, we want to preserve all available information. There are no guarantees that this information will be retained by either the invoker or the signaler. Use of exception objects provides us with the flexibility to include a large amount of information without creating complicated function calls or messages: all the information is packed into an object and is referenced through a standard interface. In a procedural approach, long parameter lists would be necessary to achieve the same effect.

Broadway currently supports three different system exceptions. All three inherit directly from the class exception. A bad-method exception is instantiated when an actor receives a message it cannot process. The bad-method
exception class provides the behavior of the destination actor. In general, there is very little the run-time system can do to correct such an error, but this information allows a handler to provide meaningful error messages to the user. An arithmetic exception is generated whenever Broadway traps an arithmetic error. Currently, this exception provides the state under which the exception occurred. We plan to expand this exception to include a string representing the expression being evaluated.

Broadway also provides some failure detection capabilities. Each node on Broadway has a failure detector which uses a watch-dog timer approach to detect the failure of, or inability to communicate with, other nodes. A crash exception is generated whenever an actor attempts to communicate with an actor on a non-existent or unreachable node. A crash exception consists of the original message and the identity of the node which cannot be reached. Notice that, although Broadway has detected a component failure, it is treated similarly to any other system exception. It is also possible for an object to subscribe to a failure detector. In this case, the subscriber's handler will automatically receive an exception whenever a failure is detected, even if the object did not try to communicate with the failed node.

Besides detecting node crashes, Broadway will also handle the failure of individual actors. If an actor crashes due to an error that is trapped by Broadway, that actor address will be marked as crashed. Currently, only arithmetic errors are trapped by Broadway and, therefore, this is the only manner in which a single actor may crash. If the defunct actor receives a message, a dead-actor exception will be generated. The dead-actor exception inherits from the crash exception. It also contains a reference to the exception generated when the actor crashed. (Currently, this is always an arithmetic exception.)
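For illustration, a handler for crash exceptions could be written as an ordinary Screed actor class along the following lines. This is only a sketch: the CrashHandler class, the manager acquaintance and its record_failure method are hypothetical, while crash and the methods request, reply and signaler are those described above.

class CrashHandler {
  actor manager;

  crash(crash e) {
    manager.record_failure(e.signaler());   /* let a manager track the emerging fault pattern */
    if (e.request())
      e.reply(nil);                         /* unblock an invoker that is waiting on a reply */
  }
}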
Exception Handling in Screed

In this section, we describe our two syntactic additions to Screed which enable exception handling: the handle statement which associates exceptions with handlers, and the signal statement which generates an exception.

In Screed, handlers can be associated with exceptions for either entire actor classes or for arbitrary code segments within a method. Figure 1.2.9 gives the syntax for a handle statement. The statement defines a scope over which specific exceptions are associated with a particular handler. If any method invocation contained within the code block of the handle statement results in an exception, the signal is routed to the correct handler as specified by the with
handle (exception1 with handler1,
        exception2 with handler2)
{
  /* Any block of code goes here */
} Figure 1.2.9: The structure of a handle block in Screed. exceptionl, exception^ are actor class names, handler is the name of an object.
bindings. As explained above, the exceptions are specified as class names and the handlers are addresses of objects. Handler statements may be nested. In this case, when an exception is generated, the innermost scope is searched first for an appropriate handler. If a handler for the exception does not exist then higher level scopes are checked.

handle (arithmetic with arithhandler, bad-method with aborthandler) {
  actor A;
  actor B;
  actor E;

  A = new complex(2,3);
  B = A.divide(C);
  handle (arithmetic with myhandler) {
    E = B.divide(D);
    myNum = res;
  }
}

Figure 1.2.10: An example of handler scopes and their effect. The outermost handle statement provides handlers for arithmetic and bad-method exceptions. The inner statement overrides the outer scope in that all arithmetic exceptions will be handled by myhandler.

Figure 1.2.10 demonstrates the scoping rules. In the scope of the outer handle statement, if in computing B (by dividing A by C), an arithmetic exception is generated (possibly by dividing by zero), the signal will be passed
to arithhandler. The computation of E through the division of B by D, however, is in the scope of the second handle statement. Therefore, any arithmetic signals generated by this action are sent to myhandler. Conversely, if our complex objects do not have a divide method, our actions will generate a bad-method signal which will be handled by aborthandler.

Unlike the complaint address based schemes [21, 26], our syntactic mechanisms do not require explicit specification of a handler's address with each message. For any given scope, including a single message send, handlers — our equivalent of complaint addresses — may be specified for each individual exception or for any group of exceptions. One handler need not be specified for all exceptions. Additionally, our method takes greater advantage of the available inheritance mechanisms as well as the general structure of object-oriented languages: both exceptions and handlers are expressed as objects in our system.

The above constructs work well within methods. However, there are two levels of scoping above the method level in Screed: the global and class levels. Exception handling at the class level is specified through the use of a handler statement which encloses several method definitions. In this manner, exception handling may be specified for an entire class by enclosing all methods in one handler statement. Such a construction does not prohibit handler statements inside the methods. A handle statement may not be defined across class boundaries as that would require the use of shared variables between class instances.

However, to provide exception handling at the global level, Screed supports the system-defined handler class Default-Handler. An instance of this class handles all signals which are not caught by another handler. Default system behavior is for a signal to be simply reported to the terminal. Default-Handler may be overwritten by a programmer defining a custom class of the same name. In this way, a final level of exception handling may be defined by the programmer. This type of facility is especially useful for writing debuggers. Any exception not defined in a custom Default-Handler class is handled by the system-default. Note that the system creates only one instance of the Default-Handler class: all otherwise unhandled signals are delivered to this instance.

As mentioned previously, exceptions may be generated as user-defined signals. A signal is generated by a signal statement.
exception-class-name(_args . . . ) ;
The signal statement generates a message to the appropriate exception handler. The arguments are used for initialization of the exception as defined by
the interface of the particular exception class. The signal does not interrupt the flow of control in the code, although a Screed return statement could follow the signal to end the method. In many cases, it is necessary for the signaler of the exception to await a response from the handler before proceeding. Signal statements are treated, syntactically, as message sends to a handler. Therefore, signal statements may act as a remote procedure call in the same manner as Screed message-sends. Thus, the handler may return a value to be used by the signaler. Such a case would be:
res = signal div-zero();

For this example, the exception handler would return an actor address as the value res. Then, the rest of the signalling method may compute.
In other systems, a special construct exists for generating signals within the current context, i.e., generating a signal which is caught by the handle statement in whose scope the statement occurs. An example of such a construct is the exit statement in CLU [24]. In Screed, such a construct is not necessary: the actor can explicitly send a message to the appropriate exception handler.
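Screed itself is not reproduced here in runnable form. As a rough, language-neutral sketch of the pattern just described, namely a signal that behaves like a message send to a third-party handler and may return a value to the signaler, the following Python fragment may help; the class and method names are invented for illustration and are not taken from the chapter.

class DivZeroException:
    """Exception object initialized with the arguments of the signal statement."""
    def __init__(self, numerator):
        self.numerator = numerator

class ArithHandler:
    """Third-party handler: receives exception objects and may return a value
    that the signaler uses to continue computing (forward error recovery)."""
    def handle(self, exc):
        # A hypothetical recovery policy: substitute a safe default result.
        return 0.0

def divide(a, b, handler):
    if b == 0:
        # The "signal" is just a message send to the handler; its reply is the
        # value the signalling method continues with (like res = signal ...).
        return handler.handle(DivZeroException(a))
    return a / b

print(divide(2.0, 0.0, ArithHandler()))   # prints the handler-supplied value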
1.2.7
Supporting Adaptive Dependability
A significant difference between exception handling in Screed and other languages is the use of third-party exception handlers. In languages such as CLU [24], SR [20], and Ada [12], exception handling routines are defined within the scope of the invoking objects. We refer to this approach as two-party exception handling (the invoker and the signaler) and our approach as three-party exception handling (the invoker, the signaler and an independent handler). We have found that two-party exception handling is unsatisfactory for modeling the complex relationships between objects such as those required for adaptively dependable systems. The key difference between two- and three-party systems is in the potential knowledge of a given object. With two-party exception handling, the exception handler, which is also the invoker, may know only of exceptions its invocations generated. Therefore, in such a system it is very difficult to develop a global picture of the fault pattern in the system. In a three-party system, such monitoring may be naturally expressed in terms of the exception handler since it may handle exceptions for a large group of objects or even the entire system.
Figure 1.2.11: A prototypical adaptively dependable system. In this case, the manager M receives input in the form of exceptions from application objects and notices from the failure detector. Upon determination that the Primary Node is unstable, M allocates the Backup Node and creates the appropriate objects to replace the Primary Node. Note that, in actuality, M is probably a replicated entity to ensure its resilience and availability.

Furthermore, an autonomous exception handler may subscribe to any available failure detectors, thereby augmenting the knowledge received through exception signals. A third-party exception handler may also be designated as an object with special rights. In this manner the system may be safely modified in response to exceptions and failures. Since it is dangerous to allow the arbitrary modification of one actor by another, most two-party systems can express reconfiguration of the system only by mimicking a three-party system, i.e., they must notify a third object with special rights of the exceptions they encounter, and this object may then reconfigure the system.
Thus, in adaptively dependable systems, the resulting system architectures will look quite similar to Figure 1.2.11. Such a system may allow the dynamic installation of dependability protocols or may simply support the reconfiguration of several objects in response to exceptions. In either case, the system will
have a manager with special rights to modify other objects. The manager will act as the exception handler for some group of objects being monitored. Furthermore, the manager may also subscribe to failure-detection services. Upon receiving enough input to determine that an unacceptable condition exists, the manager reconfigures the objects over which it has authority.

Example 3: Reconfiguration to Preserve Failure Semantics
Figure 1.2.12: This figure shows the topology of the network on a satellite. The shaded circles represent actors and the manager for memory errors is denoted by the replicated actor M. The memory uses checksums to avoid returning incorrect values. Instead, if an incorrect value is found, a special exception value is returned which may then trigger an exception in the application. The manager is replicated to guarantee a high level of fault tolerance.

To illustrate the concepts we described above, consider a distributed system which is operating in a hostile environment. A good example of such a system is a satellite with multiple processors, each with its own memory. Assume that the memory of these processors was developed to never return incorrect data to a read. Instead, the memory will detect the error through some checksum algorithm and return an error condition value. This reserved value can then be used to signal an exception and initiate forward error recovery. The specifications for this system state that such memory errors should occur only with a very small specified probability per read. Such a system is shown in Figure 1.2.12.
Once the system is launched, these memory components seem to operate correctly and the manager responsible for memory errors occasionally performs forward error recovery on the objects. However, the rate at which these memory errors occur is unacceptably high, exceeding the specified tolerance, and system performance degrades significantly due to repeated memory faults and subsequent error recovery. Therefore, the manager installs the replication protocol described in Example 1. The resulting system is shown in Figure 1.2.13.
Figure 1.2.13: The applications on each node have now been replicated using the meta-level protocol described in Example 1. Since we are protecting only against the corruption of individual memory locations, the replica may reside on the same node as the original actor, thereby improving the performance of the protocol.

Since both the original actor and the replica will be reading values from different memory locations, the probability that a memory error affects both copies is now the square of the single-copy probability, well within the specified tolerance. When an exception occurs, the manager will still have to perform some corrections, but the system can keep computing during this time and the error recovery will be simplified due to the existence of the replica. Considering the nature of the faults, the replica may be placed on the same node as the original actor. However, if instead of signalling an exception, nodes crashed when they could not read memory, the replicas would have been placed on different nodes.
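The following Python sketch is only a rough illustration of the manager pattern used in Example 3; it is not the Broadway/Screed implementation, and the class names, threshold, and install_replication stub are all hypothetical.

class MemoryErrorManager:
    """Third-party handler shared by many application objects.  It aggregates
    memory-error exceptions to build a global picture of the fault pattern and
    reconfigures the system when the observed error rate becomes unacceptable."""

    def __init__(self, threshold_rate, install_replication):
        self.threshold_rate = threshold_rate          # errors per read considered acceptable
        self.install_replication = install_replication
        self.reads = 0
        self.errors = 0
        self.replicated = False

    def report_read(self):
        self.reads += 1

    def report_memory_error(self, node_id):
        """Called when an application object signals a memory-error exception."""
        self.errors += 1
        self.recover(node_id)                         # forward error recovery (stubbed)
        rate = self.errors / max(self.reads, 1)
        if not self.replicated and rate > self.threshold_rate:
            self.install_replication()                # install the Example 1 protocol
            self.replicated = True

    def recover(self, node_id):
        pass  # placeholder for forward error recovery on the affected objects

A manager instance would be shared by all monitored objects, e.g. MemoryErrorManager(1e-5, install_replication=some_protocol_installer), so that its counters reflect the system-wide fault pattern rather than a single invoker's view.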
1.2.8
Conclusion
In this paper, we have described a methodology for the development of adaptively dependable systems. Adaptively dependable systems may function over a long duration despite a changing execution environment. Whether the changes are due to a variance in the components comprising the system or to a change in the physical environment in which the component operates, the use of dynamic protocol installation combined with exception handling allows fault tolerance to be guaranteed by the system.
Dynamic protocol installation is enabled through the use of a meta-level architecture. Our meta-level architecture, MAUD, allows the customization of an object's communication behavior on a per-object basis. By describing protocols in terms of modifications to the communication behavior, protocols may be dynamically installed on objects as necessary. Furthermore, if a protocol is no longer
required, it may be removed. Through the use of caching and atomic protocol installation, meta-level descriptions of protocols may be implemented with a minimal cost in performance.
We also support composition of protocols. Provided there are no inherent semantic conflicts between two protocols and both protocols are implemented using MAUD, these two protocols may then be composed without foreknowledge that a composition may occur. In this manner, protocols may be constructed in a modular fashion and later combined to provide the desired level of fault tolerance.
To provide adaptive dependability, we combine dynamic protocol installation with exception handling. We make extensive use of third-party exception handlers which are shared between multiple objects. Since these handlers have privileges to modify meta-level objects, they are termed managers. A single manager will be informed of all exceptions related to a particular problem. This knowledge may be augmented through subscription to failure-detection services. In this manner, the manager will have all information necessary for a correct diagnosis of fault patterns.
The concepts described in this chapter have been implemented on Broadway — our actor run-time system — and are accessed through the language Screed. Screed provides exception handling constructs which support managers and provides access to the meta-level architecture implemented in Broadway.
One problem not solved by our framework is guaranteeing the consistent installation of protocols on multiple actors. We are currently developing a Protocol Description Language which will allow a protocol to be expressed as a single entity. A protocol compiler can convert these protocols into MAUD mail queues and dispatchers. To address the installation problem, the compiler creates objects that guarantee correct installation of the protocol. Managers use these objects to install protocols on all actors involved in the protocol.
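As a very loose illustration of the flavor of per-object protocol installation and removal (not MAUD itself, whose actual interfaces are described earlier in the chapter and are not reproduced here), the following Python sketch wraps an object's outgoing-communication behavior with installable filters; all names are invented.

class Actor:
    """Base object whose communication behavior can be customized per object
    by installing dispatcher wrappers around send()."""
    def __init__(self, name):
        self.name = name
        self._dispatchers = []           # installable meta-level filters

    def install(self, dispatcher):
        self._dispatchers.append(dispatcher)

    def remove(self, dispatcher):
        self._dispatchers.remove(dispatcher)

    def send(self, target, message):
        for dispatch in self._dispatchers:   # each installed protocol sees the send
            message = dispatch(target, message)
        print(f"{self.name} -> {target}: {message}")

def logging_protocol(target, message):
    """A trivially composable 'protocol': tag every outgoing message."""
    return {"payload": message, "logged": True}

a = Actor("A")
a.install(logging_protocol)
a.send("B", "hello")        # the message passes through the installed protocol
a.remove(logging_protocol)  # the protocol is no longer required, so remove it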
Acknowledgments
The research described in this paper has benefitted from earlier collaboration with Svend Frølund and Rajendra Panwar. The authors would also like to acknowledge helpful comments from Christian Callsen, Svend Frølund, WooYoung Kim, Rajendra Panwar, Anna Patterson, Shangping Ren, Carolyn Talcott, Nalini Venkatasubramaniam, and Takuo Watanabe, among others.
References
[1] M. Acceta, R. Baron, W. Bolosky, D. Golub, R. Rashid, A. Tevanian, and M. Young. Mach: A New Kernel Foundation for UNIX Development. In USENIX 1986 Summer Conference Proceedings, June 1986.
[2] G. Agha. Actors: A Model of Concurrent Computation in Distributed Systems. MIT Press, 1986.
[3] G. Agha. Concurrent Object-Oriented Programming. Communications of the ACM, 33(9):125-141, September 1990.
[4] G. Agha, S. Frølund, R. Panwar, and D. Sturman. A Linguistic Framework for the Dynamic Composition of Dependability Protocols. In C. E. Landwehr, B. Randell, and L. Simoncini, editors, Dependable Computing for Critical Applications 3, volume VIII of Dependable Computing and Fault-Tolerant Systems, pages 345-363. IFIP Transactions, Springer-Verlag, 1993.
[5] G. Agha, I. Mason, S. Smith, and C. Talcott. Towards a Theory of Actor Computation. In R. Cleaveland, editor, The Third International Conference on Concurrency Theory (CONCUR '92). Springer-Verlag, 1992. LNCS (forthcoming).
[6] Gul Agha and Christian Callsen. ActorSpace: An Open Distributed Programming Paradigm. In Fourth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, San Diego, California, May 1993. Also published as a special issue of SIGPLAN Notices, vol. 28, no. 7, pages 23-32, July 1993.
[7] K. P. Birman and T. A. Joseph. Communication Support for Reliable Distributed Computing. In Fault-Tolerant Distributed Computing. Springer-Verlag, 1987.
[8] Roy Campbell, Nayeem Islam, David Raila, and Peter Madany. Designing and Implementing Choices: An Object-Oriented System in C++. Communications of the ACM, pages 117-126, September 1993.
[9] E. Cooper. Programming Language Support for Multicast Communication in Distributed Systems. In Tenth International Conference on Distributed Computer Systems, 1990.
[10] Antonio Corradi, Paola Mello, and Antonio Natali. Error Recovery Mechanisms for Remote Procedure Call-Based Systems. In 8th Annual International Phoenix Conference on Computers and Communication Conference Proceedings, pages 502-507, Phoenix, Arizona, March 1989. IEEE Computer Society Press.
[11] Flaviu Cristian. Understanding Fault-Tolerant Distributed Systems. Communications of the ACM, 34(2):56-78, 1991.
[12] Qian Cui and John Gannon. Data-Oriented Exception Handling in Ada. IEEE Transactions on Software Engineering, 18:98-106, May 1992.
[13] Christophe Dony. Improving Exception Handling with Object-Oriented Programming. In Proceedings of the 14th Annual International Computer Software and Applications Conference, pages 36-42, Chicago, 1990. IEEE Computer Society, IEEE.
[14] Christophe Dony, Jan Purchase, and Russel Winder. Exception Handling in Object-Oriented Systems. OOPS Messenger, 3(2):17-29, April 1992.
[15] Jeffrey L. Eppinger, Lily B. Mummert, and Alfred Z. Spector, editors. CAMELOT AND AVALON: A Distributed Transaction Facility. Morgan Kaufmann Publishers, Inc., 1991.
[16] Jacques Ferber and Jean-Pierre Briot. Design of a Concurrent Language for Distributed Artificial Intelligence. In Proceedings of the International Conference on Fifth Generation Computer Systems, volume 2, pages 755-762. Institute for New Generation Computer Technology, 1988.
[17] S. Frølund. Inheritance of Synchronization Constraints in Concurrent Object-Oriented Programming Languages. In Proceedings of ECOOP 1992. Springer-Verlag, 1992. LNCS 615.
[18] S. Frølund and G. Agha. A Language Framework for Multi-Object Coordination. In Proceedings of ECOOP 1993. Springer-Verlag, 1993. LNCS 707.
[19] John B. Goodenough. Exception Handling: Issues and a Proposed Notation. Communications of the ACM, 18(12):683-696, December 1975.
[20] Daniel T. Huang and Ronald A. Olsson. An Exception Handling Mechanism for SR. Computer Languages, 15(3):163-176, 1990.
[21] Yuuji Ichisugi and Akinori Yonezawa. Exception Handling and Real Time Features in an Object-Oriented Concurrent Language. In A. Yonezawa and T. Ito, editors, Concurrency: Theory, Language, and Architecture, pages 92-109. Springer-Verlag, Oxford, UK, September 1989. LNCS 491.
[22] Wooyoung Kim and Gul Agha. Compilation of a Highly Parallel Actor-Based Language. In U. Banerjee, D. Gelernter, A. Nicolau, and D. Padua, editors, Proceedings of the 5th International Workshop on Languages and Compilers for Parallel Computing, volume 757 of Lecture Notes in Computer Science, pages 1-15. Springer-Verlag, 1992.
[23] Barbara Liskov. Distributed Programming in Argus. Communications of the ACM, 31(3):300-312, March 1988.
[24] Barbara Liskov and Alan Snyder. Exception Handling in CLU. IEEE Transactions on Software Engineering, 5(6):546-558, November 1979.
[25] P. Maes. Computational Reflection. Technical Report 87-2, Artificial Intelligence Laboratory, Vrije University, 1987.
[26] Carl Manning. ACORE: The Design of a Core Actor Language and its Compiler. Master's thesis, MIT, Artificial Intelligence Laboratory, August 1987.
[27] S. Mishra, L. L. Peterson, and R. D. Schlichting. Consul: A Communication Substrate for Fault-Tolerant Distributed Programs. Technical report, University of Arizona, Tucson, 1991.
[28] M. H. Olsen, E. Oskiewicz, and J. P. Warne. A Model for Interface Groups. In Tenth Symposium on Reliable Distributed Systems, Pisa, Italy, 1991.
[29] Richard D. Schlichting, Flaviu Cristian, and Titus D. M. Purdin. A Linguistic Approach to Failure Handling in Distributed Systems. In A. Avizienis and J. C. Laprie, editors, Dependable Computing for Critical Applications, pages 387-409. IFIP, Springer-Verlag, 1991.
[30] Richard D. Schlichting and Titus D. M. Purdin. Failure Handling in Distributed Programming Languages. In Proceedings: Fifth Symposium on Reliability in Distributed Software and Database Systems, pages 59-66, Los Angeles, CA, January 1986. IEEE Computer Society Press.
[31] Santosh Shrivastava, Graeme Dixon, and Graham Parrington. An Overview of the Arjuna Distributed Programming System. IEEE Software, pages 66-73, January 1991.
[32] B. C. Smith. Reflection and Semantics in a Procedural Language. Technical Report 272, Massachusetts Institute of Technology, Laboratory for Computer Science, 1982.
[33] Bjarne Stroustrup. The C++ Programming Language. Addison-Wesley, second edition, 1991.
[34] C. Tomlinson, P. Cannata, G. Meredith, and D. Woelk. The Extensible Services Switch in Carnot. IEEE Parallel and Distributed Technology: Systems and Applications, 1(2), May 1993.
[35] C. Tomlinson and V. Singh. Inheritance and Synchronization with Enabled-Sets. In OOPSLA Proceedings, 1989.
[36] Nalini Venkatasubramanian and Carolyn Talcott. A Meta-Architecture for Distributed Resource Management. In Proceedings of the Hawaii International Conference on System Sciences. IEEE Computer Society Press, January 1993.
[37] T. Watanabe and A. Yonezawa. An Actor-Based Metalevel Architecture for Group-Wide Reflection. In J. W. de Bakker, W. P. de Roever, and G. Rozenberg, editors, Foundations of Object-Oriented Languages, pages 405-425. Springer-Verlag, 1990. LNCS 489.
[38] C. T. Wilkes and R. J. LeBlanc. Distributed Locking: A Mechanism for Constructing Highly Available Objects. In Seventh Symposium on Reliable Distributed Systems, Ohio State University, Columbus, Ohio, 1988.
[39] Y. Yokote, A. Mitsuzawa, N. Fujinami, and M. Tokoro. The Muse Object Architecture: A New Operating System Structuring Concept. Technical Report SCSL-TR-91-002, Sony Computer Science Laboratory Inc., February 1991.
[40] A. Yonezawa, editor. ABCL: An Object-Oriented Concurrent System, chapter Reflection in an Object-Oriented Concurrent Language, pages 45-70. MIT Press, Cambridge, Mass., 1990.
SECTION 2
REQUIREMENTS MODELS
SECTION 2.1
Military Fault Tolerant Requirements
Timothy P. Monaghan*

* Naval Air Warfare Center, Aircraft Division, Warminster, Pa.

2.1.1
Trends in Modern Weapon System Procurement
The DoD is entering a radically new era of weapon systems procurement. The Cold War is over and new scenarios and new force structures are changing the requirements of the next generation of weapon systems. The major drivers shaping this new era of weapon system procurement are: decreasing military budgets, stretched-out development cycles, an emphasis on prototypes and demonstrations, open systems standards, and intensive modeling and simulation before any hardware is built. The next generation of computer-based weapon systems will be designed using a high percentage of commercial off-the-shelf parts. A key focus in this next generation of computer-based weapon systems will be the fault tolerant features of the system's design. Studies have shown that a system can actually be more dependable, even though it uses less reliable parts, if adequate attention is paid to the fault handling features of the system's design [1]. A key issue in this new era of defense procurement is: how can the DoD manage the procurement of a dependable and cost-effective, computer-based weapon system?
The most critical task a program office faces in the development of a dependable computer-based weapon system is the Specification and Statement of Work (SOW) package. The government does not design the weapon system or give direct supervision to the ongoing work. The government's role is to specify the features of the system it requires and to validate that the evolving system and the final delivered system meet those requirements. The specifications of the system describe the requirements and features of the system to be built (what the system shall do). The SOW describes the deliverables that the contractor shall deliver to the government at each critical milestone of the evolving system's design and development (what the contractor shall do
at each milestone to ensure that the system will meet the government's specifications). The SOW also calls for specific validation information to be delivered in reports and for the holding of periodic design review meetings.
Design review meetings occur at the system's critical milestones. Their focus is to provide the government with the assurance that the contractors have considered all relevant trade-offs for dependability and fault tolerance and that adequate attention is being paid to dependability and fault tolerance in the system's design. For the government to ensure that the final system is a dependable computing system, it is essential to have accurate and clear fault tolerant specification language and proven methods for their validation. The courts have generally held any ambiguities in the specifications and SOW against the government, and the burden is on the government to come up with "reasonably precise" contract design language [2].
The topic of what constitutes an adequate description of fault tolerant requirements is currently in transition within the fault tolerant community. The rest of this chapter will describe this trend and give examples of specific fault tolerant language in a current, progressive, mission avionics Specification and SOW.

2.1.2
Specifications
What legal language should be in the Specification to clarify to the contractor the exact fault tolerant and dependability requirements that the government expects to see in the future system? Past avionics fault tolerant specifications have described vague requirements such as: "The contractor shall identify and discuss high risk areas..." and "single point failure modes that significantly degrade mission performance or prevent successful mission accomplishment shall be addressed...". The specifications also include descriptions of reconfiguration and degraded modes as well as the traditional reliability, availability, and maintainability numbers. But very few of these requirements translate into a directly testable quality of the design like a performance benchmark.
In the extreme time pressures of modern complex weapon system design, vague language describing broad fault tolerance and dependability requirements is too easily put on the back burner and will not steer a contractor to produce a dependable, cost-effective system for the government to own and operate. Dependability requirements are often demonstrated after the design is complete, when
non-compliance is easier to waive than to fix. The fault tolerant community is seeking to make the fault tolerance and dependability features of a system more visible, quantifiable, and verifiable throughout the development cycle. Legal, formal specifications are the only way the government can define the computer performance and dependability requirements of the system it is procuring. Examples of more precise fault tolerant specification requirements are:

(a) Fault Set
The fault tolerant system design shall define the functional and physical partitioning of the system's fault containment regions (FCR). Each FCR shall handle hard faults, transient faults (including massive power supply transient faults), intermittent faults, coincident faults (a second fault occurring while recovering from the first fault), timing faults, minor battle-induced physical faults and operator faults. All faults will be recorded in the error history log.
The frequency of hard faults is decreasing for both military and commercial digital parts but transient fault rates are increasing. The next generation of systems needs to be designed to automatically handle transient faults, notify the operator only when reconfiguring around hard faults, yet record all fault activity to ensure quick diagnosis of failing parts.

(b) Fault Characterization
Each fault from the fault set shall be characterized by: (1) timing - the probability density of its expected frequency, (2) extent - the extent of damage it causes to information or hardware in the system, (3) duration - the length of time that the fault persists, and (4) error model - the incorrect logical behavior expected when the fault produces an error. (An illustrative way to record such a characterization is sketched after item (d) below.) The redundancy management of the system shall be characterized by:
(c) Automatic Reconfiguration
The system shall automatically reconfigure to circumvent a failed component and select a backup component or data path to redistribute the processing. This reconfiguration shall be transparent to the operator and shall be accomplished within the real-time constraints of the system (typically 20 ms to 50 ms for an avionics system). The operator shall have the ability to be notified of recovery from hard faults. The system shall also be capable of automatic and transparent reconfiguration when a failed component is restored to operation if the failure was due to a transient fault. This process shall continue as long as the processing and component resources continue to meet the demands of the reconfigured system. When system resources are inadequate to meet the real-time system demands, the system shall enter the degraded mode state.
(d) Degraded Mode
Multiple processors and distributed processing architectures shall permit the automatic, controlled and graceful degradation of system capabilities when errors occur and the real-time processing demands exceed system resources. Functional capabilities shall be automatically redistributed where possible to the rest of the system. Impact on mission operations shall be kept to a minimum for as long as possible when operating in the degraded mode.
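One possible way to record the fault set and the per-fault characterization called for in items (a) and (b) is sketched below in Python; the field names and enumeration values are illustrative only and are not taken from any actual specification.

from dataclasses import dataclass, field
from enum import Enum, auto

class FaultClass(Enum):
    HARD = auto()
    TRANSIENT = auto()
    INTERMITTENT = auto()
    COINCIDENT = auto()
    TIMING = auto()
    BATTLE_INDUCED = auto()
    OPERATOR = auto()

@dataclass
class FaultCharacterization:
    fault_class: FaultClass
    expected_frequency_pdf: str      # (1) timing: description of the probability density
    extent: str                      # (2) extent of damage to information or hardware
    duration_seconds: float          # (3) how long the fault persists
    error_model: str                 # (4) incorrect logical behavior when an error results
    containment_region: str = "FCR-unassigned"
    log_entries: list = field(default_factory=list)   # all fault activity is recorded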
2.1.3
SOW
The SOW specifies what the contractor shall do at each stage of the system's design to assure that the system being built will be a dependable and cost effective system for the government to own and operate. The SOW really describes what reports and interactions with the contractor the government expects to see as it oversees the development of the system. The government's oversight of the development of a complex weapons system is divided up into five basic development phases:
the Concept Exploration Phase, the Demonstration and Validation Phase, Full Scale Development, Procurement, and Deployment [3,4].
The first major design review is the System Requirements Review (SRR). At this review the requirements of the system are carefully gone over to ensure that both the government and the contractor have a mutual understanding of the system requirements and that adequate attention has been paid to software requirements, the system's test requirements, and the applicable architectural and fault tolerant approaches.
In the next review, the System Design Review (SDR), the contractor presents the results of the trade-off studies and the rationale for the architectural approach and fault tolerant strategy that has been selected. This stage of design validation and evaluation by the government is critical. The contractor has selected an architectural approach for the system and, after this design review, will be committing the resources of the project to that approach. Now is the time to carefully consider the rationale of that architectural and fault tolerant approach before major funding is committed to it.
At the next major review, the Preliminary Design Review (PDR), the top level design efforts have been completed and the detailed design is about to begin. At PDR, the contractor presents the preliminary hardware and software design of the system and the initial performance and fault tolerance evaluations. At the Critical Design Review (CDR), the contractor will provide a completed hardware and executive software design, a refined analysis of performance and fault tolerance, and plans to demonstrate and test the brassboard.
The final stages of the system design are the actual building of a brassboard, a Test Readiness Review (TRR), and the demonstration and formal testing of the system's performance and dependability features.
Throughout each of the stages of the system's design, from SRR to the final testing, the government can influence the system's design. But the later the stage, the weaker that influence becomes due to the increasing cost to change major design features once the program has heavily
invested in that particular design. The government can raise the probability that the final system will be a dependable system by carefully describing the dependability exit criteria to be satisfied at each stage of the system's design and by requiring clear proofs of that testing. These proofs shall be delivered at each of the design reviews.
Figure 2.1.1: The major program phases (Concept Exploration, Demonstration and Validation, Full Scale Development) and the design review milestones (SRR, SDR, SSR, PDR, CDR, TRR) from program initiation through go-ahead.

2.1.4
Analytic Modeling and Fault Injection
Current SOW dependability and fault tolerant requirements stress reliability prediction and Failure Modes, Effects, and Criticality Analysis (FMECA) as described in MIL-STD-785, MIL-STD-756B and MIL-STD-1629. The parameters used in reliability prediction or analytic models are failure rates, coverage estimates, repair rates, and costs of repair and parts. Reliability prediction modeling can be combinatorial or non-combinatorial. The combinatorial elements are reliability graphs, which aid in predicting the hard failure rate of the current system from the statistical past failure rates of similar components, and fault trees, which are trees of conditions that lead to certain system failures. The non-combinatorial modeling elements are Markov chains, Petri nets, and Monte Carlo simulations [5].
While analytical studies such as these are useful as the system design is evolving, these mathematical models have numerous sources of error and need to be validated with fault injection [6]. The initial Markov, Petri net, or combinatorial models of the system describe how the designer assumes the system will handle faults, but the current functional model, and the fault injections into it, describe how the evolving system actually handles faults.
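As a toy illustration of the combinatorial elements mentioned above (not a substitute for the MIL-STD analyses), the following Python fragment computes the reliability of a series system in which one unit is duplicated in parallel, from assumed constant failure rates; all numbers are hypothetical.

import math

def component_reliability(failure_rate_per_hour, hours):
    """R(t) = exp(-lambda * t) for a constant failure rate."""
    return math.exp(-failure_rate_per_hour * hours)

def series(*reliabilities):
    r = 1.0
    for x in reliabilities:
        r *= x
    return r

def parallel(*reliabilities):
    q = 1.0
    for x in reliabilities:
        q *= (1.0 - x)          # all redundant copies must fail
    return 1.0 - q

# Hypothetical 10-hour mission with three units; the processor is duplicated.
r_proc = component_reliability(2e-4, 10.0)
r_bus = component_reliability(5e-5, 10.0)
r_io = component_reliability(1e-4, 10.0)
print(series(parallel(r_proc, r_proc), r_bus, r_io))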
Fault injection experiments are performed on the functional simulation of the system that is developed from the high level conceptual model of the system. The functional system model is gradually transformed from behavioral descriptions to RTL descriptions to structural models to the final brassboard implementation. As the design is further refined, faults are injected into the current functional representation of the system and the results of the fault injection experiments are used to validate and refine the initial gross assumptions of the Markov, Petri net, or combinatorial models. The functional system model can never be exhaustively tested, but a further refined analytic model can reveal which system parameters are most sensitive to change and need closer functional fault injection experiments. Thus, the analytic modeling can enable the test engineer to wisely inject faults in the short amount of time allowed for the fault injection testing.
Fault injection techniques provide a measure of confidence that the system's fault tolerance will perform according to specifications, but fault injection is a probabilistic means of validation and cannot provide a guaranteed deterministic validation. Thus, a variety of fault injection techniques (random and functional) are used along with analytic techniques to provide a complete system fault tolerant validation. Random fault injections should also be carried out to allow for any fault handling assumptions that have been overlooked in the analytic model. The overriding goal should be an adequate test set at each stage of a system's design, given the amount of time allocated for testing, to ensure that the evolving system will be dependable and continue to deliver its services in the presence of transient and hard faults.
Figure 2.1.2 shows these two complementary streams of analytic modeling and fault injection that should be carried out through the evolving system's design and development. Analytic modeling provides the insight for the fault injection experiments, and the results of the fault injection experiments are used to calibrate the parameters of the analytic modeling. Thus, the results of both streams are constantly being reconciled and the estimates of dependability become more accurate as the design approaches a brassboard implementation. Carefully planned fault injection experiments in the brassboard are the final proof of the system's dependability. The SOW should specify the specific fault injection experiments and analytic modeling results the government requires to be delivered at each of the major milestones.
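As a rough illustration of how fault-injection results can calibrate an analytic model (this is not taken from any SOW), the following Python sketch estimates a coverage parameter from a simulated injection campaign and plugs it into a simple reliability model in which only uncovered faults cause system failure; all numbers and names are hypothetical.

import math
import random

def estimate_coverage(num_injections, true_recovery_prob, seed=0):
    """Simulated fault-injection campaign: inject faults into a functional
    model and count how many are detected and recovered from."""
    rng = random.Random(seed)
    recovered = sum(rng.random() < true_recovery_prob for _ in range(num_injections))
    return recovered / num_injections

def mission_reliability(failure_rate_per_hour, coverage, mission_hours):
    """Simple analytic model: uncovered faults, arriving at rate
    lambda * (1 - c), lead directly to system failure."""
    effective_rate = failure_rate_per_hour * (1.0 - coverage)
    return math.exp(-effective_rate * mission_hours)

if __name__ == "__main__":
    c_hat = estimate_coverage(num_injections=10_000, true_recovery_prob=0.98)
    r = mission_reliability(failure_rate_per_hour=1e-4, coverage=c_hat, mission_hours=10.0)
    print(f"estimated coverage = {c_hat:.4f}, 10-hour mission reliability = {r:.6f}")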
Figure 2.1.2: Two complementary validation streams across the design phases (investigating candidate architectures, architecture selection, implementation; SRR, SDR, PDR, CDR): functional simulation and fault injection (architectural-level simulation, VHDL behavioral/gate models, test prototype) and reliability prediction/analytic modeling (failure rates, repair rates, coverage estimates, and costs extrapolated from field fault data of similar systems; combinatorial elements such as reliability graphs and fault trees; non-combinatorial elements such as Markov models and Petri nets; tools such as CARE III, HARP, and SURE), with fault-injection results feeding model validation.
2.1.5
Design Reviews
Each design review requires specific exit criteria for that design stage. As well as the standard reliability prediction and FMECA studies, the next generation of computer based weapon systems will also specify fault injection. Examples of some milestone requirements from a current, progressive mission avionics SOW that stress fault injection studies throughout the system's development are given below:

(a) Functional Fault Analysis
The fault tolerance of the design shall be validated at each stage of the system's evolution. This evaluation shall be done by mapping the specified fault set onto each subsystem partition and then identifying the fault detection, isolation, removal and recovery mechanisms of the system that will re-establish system operation. This functional fault analysis (FFA) shall be done at each stage of the system design and development, and any single points of failure that degrade mission performance or prevent successful mission accomplishment shall be rectified [7].

(b) Exit Criteria for SDR
- Preliminary architectural description, architectural partitioning and functional allocation
- A system development plan including:
1. Modeling tools for the evaluation of system operational performance and system fault tolerance;
2. Identification of a baseline set of application based benchmarks;
3. Identification of critical state information which must be protected under all fault conditions for transparent fault recovery.

(c) Exit Criteria for SSR
- Software architecture description and functional allocation
- Justification for architectural approach including:
1. Preliminary operational performance estimates using the benchmarks defined at the SDR;
2. System fault tolerance paradigms, approaches, maximum acceptable delay estimates and redundancy required for transparent fault recovery.

(d) Exit Criteria for PDR
For development items the following:
- A complete VHDL behavioral description of the architecture which provides accurate timing estimates;
- A description of the planned logic layout which provides accurate prediction of circuit complexities, failure modes, error detection methods, and detection coverage and latency.

For non-development and development items:
- A description and simulation of the executive software including inter-process and inter-node communication mechanisms;
- An integrated functional and fault tolerant description and analysis of the system demonstrating how the system mechanisms work together to achieve the desired performance and automatic recovery in the presence of faults;
- A functional fault analysis of the system which shall include the mapping of representative faults from the specified fault set to each hardware and software component within each Error Containment Region (ECR) and analyzing the effectiveness of the ECRs to contain errors. The system's error handling procedures shall be characterized in terms of coverage, delay times and system-wide effects.

(e) Exit Criteria for CDR
For development items:
- A complete description of the final design including IC and logic layout, packaging, executive software and representative application programs.

For non-development and development items:
- A refined prediction of the design's performance and fault tolerance including detailed circuit analysis and fault injection simulations to determine fault coverage and recovery times;
- The time required and probability of recovery from each fault type in the specified fault class;
- A plan for the detailed brassboard construction and evaluation shall be presented. The detailed design shall include provisions to allow experimental fault injection testing of the brassboard to verify that the physical system will perform in the fashion assumed in the previous simulations and reliability models.

(f) Exit Criteria for TRR
- A full-up demonstration of the brassboard with the executive software and a representative application.
- Timing measurements from an agreed set of benchmarks shall be taken and reconciled with previous timing and throughput estimates.
- Representative faults shall be injected into the brassboard to verify that the previous fault tolerant predictions accurately reflect the behavior of the implemented system. Representative faults from the specified fault set shall be selected by the contractor and the Navy. These faults will be inserted randomly and in selected locations.
- The laboratory fault injection experiments shall be reconciled with the simulation and analytic results to provide assurance that the previous dependability and availability predictions are correct. Any discrepancies found at this stage shall be analyzed and corrected.
2.1.6
Conclusion
The DoD is giving increased attention to what precise legal language is needed for its future Specification and SOW requirements of complex computer based weapon systems. The DoD is seeking to specify
clear and quantifiable validation techniques at each stage of the system's design. This will allow the DoD to be an informed customer able to quantify a design's dependability and fault tolerant metrics and reasonably ensure at each stage of the design that the evolving system will be a dependable system to own and operate.
During the Gulf War, Naval aircraft sustained high Full Mission Capable (FMC) rates (in the high 90% range), but these high FMC rates were sustained at the cost of high Maintenance Man Hours per Flight Hour (MMH/FH) (30 to 65 MMH/FH), high spares usage and high false alarm rates (consistently in the 30% to 35% range). Future complex weapon systems will increasingly rely on digital sub-systems, and the dependability of these digital designs will play a critical role in the effectiveness of those systems in the field. In future combat situations, the rate of spares usage, the time to isolate and repair the equipment, and the false alarm rates will play a critical role in the effectiveness of those systems. As budgets decrease, fewer systems are purchased, and commercial parts are used more widely, the dependability of those systems will be a critical factor in the effectiveness of the next generation of computer based combat systems.

References
[1] Phillip Babcock. Advanced Launch System Life-Cycle Cost Optimization Studies. The Charles Stark Draper Laboratory, Inc., Cambridge, Massachusetts.
[2] Boyle v. United Technologies Corp., 487 U.S. 500 (1988).
[3] Defense Systems Management College. Systems Engineering Management Guide, January 1990, U.S. Printing Office.
[4] SDIO BM/C3 Processor and Algorithm Working Group, D. A. Rennels, ed. "Volume Two, Management Issues: Contractor Milestones and Evaluation," Application of Fault Tolerance Technology, October 20, 1987.
[5] A. Johnson, M. Malek. "Survey of Software Tools for Evaluating Reliability, Availability, and Serviceability," ACM Computing Surveys, Vol. 20, No. 4, Dec. 1988.
[6] D. Lee, G. Gilley. "Minimizing Error In Predicting The Reliability Of On-Board Satellite Systems," The Aerospace Corporation, April 1991.
[7] G. Gilley. "Validation of Fault Tolerant Designs," IEEE/AIAA 10th Digital Avionics Systems Conference, October 1991.
SECTION 2.3
Derivation and Use of Deadline Information in Real-Time Control Systems¹
Kang G. Shin and Hagbae Kim²

¹The work reported was supported in part by the Office of Naval Research under Grant N00014-91-J-1115 and NASA under Grant NAG-1220. Any opinions, findings, and conclusions or recommendations expressed in this section are those of the authors and do not necessarily reflect the view of the funding agencies.
²The authors are with the Department of Electrical Engineering and Computer Science, The University of Michigan, Ann Arbor, MI 48109-2122.

Abstract
This section presents a method for deriving the hard deadline of a real-time control system, which captures the interplay between a controlled process and its controller computer by formally specifying the need of the controlled process in a form understandable to the controller computer. Several examples of deriving and applying the hard-deadline information are also presented to demonstrate the importance of the deadline information.
2.3.1
Introduction
The use of digital computers for such applications as aircraft, automated factories and nuclear reactors has become commonplace due to the availability of inexpensive, powerful computers and the increasing need to control more sophisticated processes. A traditional control system consists of simple components, such as an electrical, mechanical, or hydraulic hard-wired controller, in the feedback loop of the controlled process. These controllers have the virtue of simplicity in designing, implementing, and evaluating their performances. By contrast, recent and
future developments — which are likely to be intrinsically unstable to achieve other purposes — such as a highly fuel-efficient "fly-by-wire" aircraft demand the controllers to react fast enough to maintain stability. Digital computers are thus playing a more important role in maintaining safety-critical functions while improving control system performance significantly.
Computers for controlling life-threatening processes must be highly reliable to avoid any catastrophe. For example, the computer used in commercial aircraft requires a failure probability lower than 10^-10 per hour. To guarantee the safe operation of these processes in the presence of computer-component failure(s), some form of fault tolerance should be provided with the controller computer. Fault-tolerant controller computers are far more complex and difficult to design than non-fault-tolerant counterparts due to the stringent reliability required and their interaction with the operating environment as well as the controlled processes. When designing these fault-tolerant controller computers, one must resolve such issues as control-task scheduling and allocation, optimal queue control, and optimal redundancy management, and also must specify and evaluate the controller's performance in the context of the controlled processes.
A real-time control system may fail not only because of the slow execution of critical tasks but also because of massive hardware or software component failures. The design and evaluation of highly-reliable controller computers gets complicated due to the difficulty in expressing the needs of the controlled processes and due to the difficulty in predicting the execution time of control programs in the presence of component failure(s).
A controller computer calculates the control input by executing a sequence of instructions, thereby introducing an unavoidable delay — called the computation-time delay — to the controlled process. This is an extra delay in addition to the system delay commonly seen in the control literature. Computation time is an important component of the delay in the feedback loop, which also includes contributions related to measurement or sensing, A/D and D/A conversion, and actuation. However, these other delay contributions are usually constant, and thus easy to model. Due to data-dependent loops and conditional branches, and contention in resource sharing during the execution of programs that implement
control algorithms, the computation-time delay is a continuous random variable which is usually much smaller than one sampling period, Ts, if no failure occurs to the controller computer. When a component failure or environmental disruption such as an electromagnetic interference (EMI) occurs, the time taken for error detection, fault location, and recovery must be added to the execution time of a control program, thus increasing the computation-time delay significantly. This increase in task execution time seriously degrades the system performance and may even lead to a dynamic failure by either violating the necessary conditions for system stability or causing the system to leave the region of the state space defined by the specification (i.e., the allowed state space) if the delay exceeds a certain limit called the hard deadline [18]. The hard deadline can formally specify the need of the controlled process in a form that is understandable to the controller computer. The notion of hard deadline thus captures the interplay between a controlled process and its controller computer.
This section presents a method for deriving the hard deadline of a real-time control system, which is defined as the maximum tolerable duration of a controller-computer failure, by using the dynamic equation of the controlled process, information about failure occurrence rates, environment characteristics, fault-tolerance features, and properties of the control algorithms employed. The dynamic equations are modified to account for the presence of some controller-computer failures that result in computation-time delays. Using these modified equations, we compute the hard deadline iteratively by examining the necessary conditions for (asymptotic) system stability or the residence of system states in the allowed state space.
We also extend this method to cover the case of nonzero error latency [19] owing to an imperfect error detection coverage. There may be control input disturbances due to erroneous computation during the error latency, in addition to the computation-time delay incurred due to the occurrence of component failure or environmental interference.
To demonstrate the application of the derived deadline information, we include (i) a design example that optimizes time-redundancy recovery methods such as retry or rollback, and (ii) an evaluation example that assesses the system reliability by using a Markov chain model augmented with new states representing system inertia.
The rest of this section is organized as follows. Section 2.3.2 surveys the previous work related to the reliability or safety of control systems. In Section 2.3.3, we address the generic problem for qualitative analysis of computation-time delay and/or input disturbance effects and review the basic definition of a hard deadline in real-time digital control systems. Section 2.3.4 presents a method for modifying the state difference equation in the presence of the computation-time delay only, and then analyzes system stability by examining the pole positions of the modified state difference equation. In Section 2.3.5, we extend the method proposed in Section 2.3.3 to the case covering the effects of input disturbances. Hard deadlines are derived both deterministically and stochastically. In Section 2.3.6, the hard deadline associated with the one-shot event model is derived, when even one occurrence of a computer failure for a long period may drive the controlled process out of its allowed state space, i.e., a dynamic failure occurs. In Section 2.3.7, several simple linear systems are examined to demonstrate our approach. Section 2.3.8 deals with the application of the hard-deadline information by presenting not only a design example that optimizes time redundancy such as retry or rollback, but also an evaluation example that assesses the system reliability by using the derived deadline information.
2.3.2
Related Work
Failures in actuators, sensors, mechanical parts, A/D and D/A converters may also induce a system failure. In [14], a mathematical framework was presented to describe the interactions between the detection/isolation devices for component³ failures and the reconfiguration of control algorithms. The authors of [22] focused on the design of fault-tolerant control systems to enhance system reliability. In contrast to the work cited above, the main intent of this chapter is

³That is, actuator, sensor, or computers.
to (a) analyze the interplay between a controlled process and its fault-tolerant controller computer, and (b) derive the deadline information of the controlled process that is useful for the design and evaluation of the controller computer.
Several researchers have attempted to analyze the effects of computation-time delay on the performance or stability of a control system. The authors of [24] considered the qualitative effect of feedback delay on a multivariable, computer-controlled linear system by proposing an algorithm to compensate for the delay effect. The sufficient (necessary) conditions of stability with a feedback delay and the delay effects on quadratic performance indices were presented for a linear control system [3] (for a nonlinear robot control system [16]). In [15], a more detailed analysis of the stability of a digital control system with a feedback delay was carried out by modifying the state difference equation. However, all of these analyses are based on the assumption that the feedback delay is fixed or constant. Although the stability problem with a variable feedback delay was investigated in [4], it was still based on a regular and periodic (thus deterministic) pattern of delays. In [2], a control system with a random time-varying delay in the feedback loop was modeled with a stochastic-delay differential equation, and sufficient conditions were derived for the almost-sure sample stability under which almost every possible differential equation for an ensemble of stochastic systems has a stable solution. However, it did not show any explicit relation between the performance (or stability) and the magnitude of delay but, instead, gave a condition on the coefficients of the state equations and the average rate of change of delay for sample stability. Furthermore, this work assumed a delay to be bounded by the 'worst-case' inter-sample period. In [18], the hard deadline in controlling the elevator deflection of the aircraft landing problem was obtained numerically by using the concept of allowed state space.
The problem associated with disturbances to the control input was treated by analyzing the observability of a linear system with a dynamic feedback controller under unknown disturbances in the control input [5] and by experimentally testing the functional error modes of computer-based control systems in a "harsh" operating environment [1].
Figure 2.3.1: A computer-controlled system. (The controlled process, with input u(t) and output y(t), is sampled through an A/D converter; the controller computer uses x(kTs) and the reference r(kTs) to compute u(kTs), which is applied to the process through a D/A converter and latch circuit.)

2.3.3
The Generic Problem of Controller-Computer Failures
As shown in Fig. 2.3.1, a linear time-invariant controlled process is generally represented by the vector difference equation as shown in Eq. (2.3.1) and is equipped with a well-designed controller that stabilizes the overall control system and optimizes the given control objective:
x(k+1) = A x(k) + B u(k),   (2.3.1)

where k is the time index, one unit of time represents the sampling interval Ts, and x ∈ R^n and u ∈ R^r are the state and input vectors, respectively. The coefficient matrices, A ∈ R^{n×n} and B ∈ R^{n×r}, are obtained from those of the corresponding continuous-time model:

A = e^{A_c T_s},   B = \int_0^{T_s} e^{A_c (T_s - s)} B_c \, ds,   (2.3.2)

where A_c and B_c are the corresponding coefficient matrices of the continuous-time model. The (digital) controller computer reads sensor values, compares them with desired values, and calculates the control input once every Ts seconds
according to a programmed control strategy. The control input, which is held constant within each sampling interval by a latch circuit, is applied to the continuous-time controlled process.
Eq. (2.3.1) must be modified to account for the delays associated with the measurement or sensing, A/D and D/A conversion, and the execution of control algorithms. The sum of these delays is usually much smaller than one sampling period, Ts, in the absence of computer failure(s) or external interferences like an EMI (called the delay problem in [16]). For the delay problem, where the magnitude of delay, Δ, is smaller than Ts, one can change the state equation (2.3.1) to:

x(k+1) = A x(k) + B_1 u(k) + B_2 u(k-1),   (2.3.3)

where B_1 = \int_Δ^{T_s} e^{A_c (T_s - s)} B_c \, ds and B_2 = \int_0^{Δ} e^{A_c (T_s - s)} B_c \, ds.
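If one wants to evaluate Eqs. (2.3.2)-(2.3.3) numerically, the following Python sketch (using NumPy/SciPy, and assuming the B1/B2 split reconstructed above rather than anything taken verbatim from the chapter) approximates the integrals by a simple Riemann sum.

import numpy as np
from scipy.linalg import expm

def discretize_with_delay(Ac, Bc, Ts, delta, steps=1000):
    """Zero-order-hold discretization of dx/dt = Ac x + Bc u with a
    computation-time delay delta (0 <= delta < Ts), following the split
    x(k+1) = A x(k) + B1 u(k) + B2 u(k-1)."""
    A = expm(Ac * Ts)
    ds = Ts / steps
    B1 = np.zeros(Bc.shape)
    B2 = np.zeros(Bc.shape)
    for i in range(steps):
        s = (i + 0.5) * ds                      # midpoint of the sub-interval
        term = expm(Ac * (Ts - s)) @ Bc * ds    # e^{Ac (Ts - s)} Bc ds
        if s < delta:
            B2 += term                          # old input u(k-1) is still latched
        else:
            B1 += term                          # new input u(k) has taken effect
    return A, B1, B2

Note that B1 + B2 equals the delay-free input matrix B of Eq. (2.3.2), which is a quick consistency check for any implementation of this kind.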
As mentioned earlier, digital computers are highly susceptible to transient EMI such as lightning, high intensity radio frequency fields (HIRF), and nuclear electromagnetic pulses (NEMP). The main problem caused by EMI is functional error modes — or computer failures — often without component damage. When a computer failure occurs and it is detected upon its occurrence (i.e., with a zero error latency measured from the occurrence to the detection of an error [19]), a certain recovery process is invoked.⁴ Then, the computation-time delay resulting from the corresponding recovery actions may be large relative to Ts. This was termed the loss problem in [16].
Note that detection schemes are, in reality, imperfect (i.e., the detection probability is smaller than one), resulting in a non-zero error latency. During this error latency, the control input may be updated erroneously until the fault inducing the computer failure disappears, or the computer failure is detected and handled properly. Suppose a computer failure occurs at time k0, it is detected n1 sampling intervals after its occurrence, and the subsequent recovery takes n2 sampling intervals, during which the control input will be held constant at u(k0 + n1 + 1) by the D/A converter and latch circuits.

⁴A fault-tolerance mechanism consists of error detection, fault location, system reconfiguration, and recovery. General recovery processes are retry, rollback, and reconfiguration, each of which takes a finite time. See [21] for a more detailed account of fault-tolerance mechanisms.
The control inputs during this period are formally represented as:

u(k0+1) I_Δ, u(k0+2) I_Δ, ..., u(k0+n1) I_Δ, u(k0+n1+1), u(k0+n1+1), ..., u(k0+n1+1), u(k0+n1+n2+1), ...,

where I_Δ is a diagonal matrix with Diag[I_Δ]_ii = 1 + Δu_i, and Δu_i is a random sequence which is modeled by the output of a dynamic system with a white-noise input.
Since faults/interferences occur randomly during the mission lifetime, their occurrences are considered stochastic perturbations to the controlled process, which can be modeled depending on the fault characteristics. When the environment is assumed to be stochastically stationary, the occurrence and duration of computer failure(s), and the magnitude of disturbances in the control input, can be represented by several probability density functions. The relative frequency of disturbance and delay due to such computer failure(s) depends upon the coverage (the probability of detecting an error induced by an arbitrary fault), which is determined by failure-detection mechanisms [19].
Stationary occurrences of controller-computer failures/interferences not only degrade the performance of the controlled process but may also lead to loss of system stability if their active duration exceeds the hard deadline. Even one occurrence of this abnormality with a long active duration in the one-shot event model may make the system leave its allowed state space [17]. The hard deadline of the stationary model is defined as the maximum allowed duration of the failure-handling process without losing system stability [17]. More formally, we have the following definition.

Definition 1: Let x_e denote an equilibrium state of the system represented by Eq. (2.3.1). Then, x_e is said to be stable at time k0 if for each ε > 0 there exists a δ(ε, k0) > 0 such that

||x(k0) − x_e|| < δ  ⟹  ||x(k) − x_e|| < ε,  ∀ k ≥ k0.   (2.3.4)

The equilibrium state x_e is said to be asymptotically stable at time k0 if it is stable at time k0 and there exists a δ1(k0) > 0 such that

||x(k0) − x_e|| < δ1(k0)  ⟹  ||x(k) − x_e|| → 0  as  k → ∞.   (2.3.5)
In linear time-invariant systems, stability can be checked simply by using the pole positions of the controlled process in the presence of random computer failures. Using this information one can derive hard deadlines stochastically or deterministically with the sample(s) and the ensemble average of the controlled process:

D(N) = \inf_{C_{env}} \sup \{ N : \| \lambda(N) \| < 1 \},   (2.3.6)
where \lambda(N) is the eigenvalue of the controlled process in the presence of computer failures of the maximum duration N T_s, and C_{env} represents all the environmental characteristics that cause controller-computer failures.

Consider a state trajectory evolved from time k_0 to k_f. Let X_A(k) and U_A be the allowed state space at time k and the admissible input space, respectively. If a computer failure, which occurred at k_1 (k_0 < k_1 < k_f) and was detected N_1 T_s later, is recovered within N_2 T_s, then the control input during these N = N_1 + N_2 sampling periods is:

u^A(k) = u(k) I_\Delta \, \Pi_{k_1}(N_1) + u(k_1 + N_1) \, \Pi_{k_1+N_1}(N_2), \quad k_1 \le k \le k_1 + N,

where \Pi_m(n) is a rectangular function on the interval [m, m+n], i.e., \Pi_m(n) = \mu(k - m) - \mu(k - m - n), where \mu is the unit step function. Then, the system hard deadline during [k_0 T_s, k_f T_s] is also defined as:

D(N, x(k_0)) = \inf_{C_{env}} \sup \{ N : \phi(k, k_0, x(k_0), u^A(k)) \in X_A(k), \; k_0 \le k \le k_f \},   (2.3.7)

where the state trajectory is assumed to evolve as:

x(k) = \phi(k, k_0, x(k_0), u^A(k)).   (2.3.8)

2.3.4 Deadlines of Stationary Model Due to Feedback Delays
In our model, the controlled processes represented by Eq. (2.3.1) are usually unstable in the absence of feedback control, i.e., the magnitude of at least one eigenvalue of A is greater than one.
For example, the aircraft designer may push his design to the edge of instability to improve the fuel efficiency of a future aircraft, where fast, accurate, and consistent control is required [18]. The state-feedback control input that stabilizes such unstable systems can be calculated by using the observed (or estimated) states according to their own control objectives, such as time-optimal control with an energy constraint, optimal state-tracking, and the optimal linear regulator [12]. Suppose the feedback control input is computed by u(k) = -F x(k), where the feedback matrix F depends upon the control objective used. We want to derive necessary conditions under which the system will remain (asymptotically) stable even in the presence of random failures of the controller computer. These conditions require the knowledge of the system dynamic equations, the control algorithm to be used, and the environmental characteristics. Consider a simple controlled process represented by a linear time-invariant differential equation, which can be converted to a discrete-time problem by using Eq. (2.3.2), and a quadratic performance index:
J = \frac{1}{2} \left\{ \sum_{k=k_0}^{k_f - 1} \left[ x^T(k) Q x(k) + u^T(k) R u(k) \right] + x^T(k_f) Q x(k_f) \right\},   (2.3.9)

where the matrices Q \in \mathcal{R}^{n \times n} and R \in \mathcal{R}^{m \times m} are positive semidefinite and positive definite, respectively, and are determined by the control objective of interest. The optimal control input is calculated by minimizing J while maintaining system stability, that is, u(k) = -F x(k), where the state feedback gain matrix F is obtained by solving a discrete Riccati equation [12].

Let the control input have been updated at t = m N T_s. If the control input had not been updated for i (0 < i \le N) sampling periods due to a long computation-time delay, the corresponding state equations for the group of intervals during which the system failed to update the control input become:

x(mN+1) = A x(mN) + B u(mN)
x(mN+2) = A^2 x(mN) + (A + I) B u(mN)
Figure 2.3.2  Time index for augmented vectors representing the stationary occurrence of computer failures (delays) when N = 3.
...
x(mN+i) = A^i x(mN) + \sum_{j=0}^{i-1} A^j B u(mN)
x(mN+i+1) = A^{i+1} x(mN) + \sum_{j=1}^{i} A^j B u(mN) + B u(mN+i)
...
x((m+1)N) = A^N x(mN) + \sum_{j=N-i}^{N-1} A^j B u(mN) + \sum_{j=0}^{N-i-1} A^j B u(mN + N - j - i),   (2.3.10)
where m is the time index for the groups of N sampling intervals each. Let X(m) = [x_1, x_2, \ldots, x_N]^T = [x(mN+1), x(mN+2), \ldots, x((m+1)N)]^T and U(m) = [u_1, u_2, \ldots, u_N]^T = [u(mN+1), u(mN+2), \ldots, u((m+1)N)]^T. That is, X(m) and U(m) are respectively the augmented state and control vectors at the group of sampling intervals during which the system failed to update the control inputs (see Fig. 2.3.2). When the delay is equal to i sampling periods, the following augmented state equations result:

X(m+1) = A_D X(m) + B^1_{D_i} U(m) + B^2_{D_i} U(m+1),   (2.3.11)
U(m) = -F_D X(m),   (2.3.12)
where A_D, F_D, B^1_{D_i}, and B^2_{D_i} are the augmented block matrices obtained by stacking Eq. (2.3.10) over one group of N sampling intervals: A_D contains the powers A, A^2, \ldots, A^N in its last block column, F_D is built from the feedback gain F, and B^1_{D_i} and B^2_{D_i} collect the input terms B, AB + B, \ldots, \sum_j A^j B of Eq. (2.3.10).   (2.3.13)
Suppose the occurrence of computer failure(s) is binomially distributed with parameter P at each sampling time. Let q_0, q_1, \ldots, q_N be the probabilities of delays 0, T_s, \ldots, N T_s, respectively, such that \sum_{i=1}^{N} q_i = P, where the maximum delay is assumed to be N T_s. This delay distribution can be derived in practice from the knowledge of the failure-handling policy and environmental characteristics. Since the delay is equal to the time interval from failure occurrence/detection to its recovery, called the Fault-Tolerance Latency (FTL), it can be measured via fault-injection experiments or can be evaluated analytically by a certain pdf (probability density function) considering all fault-tolerance features [9]. From the experimental samples or the fitted pdf of the FTL, one can estimate the parameters of the multinomial distribution (the q_i's). By combining the state equations (2.3.11) and (2.3.12) with the delay distribution, the state equation including the effects of delays no greater than N T_s becomes:

X(m+1) = A_D X(m) + \sum_{i=0}^{N} \xi_i \left( B^1_{D_i} U(m) + B^2_{D_i} U(m+1) \right),   (2.3.14)
where \xi_i \in \{0, 1\} is a binomially distributed random variable with parameter q_i, i.e., \Pr[\xi_i = 1] = q_i. Then, the first moment of Eq. (2.3.14) is:

X(m+1) = A_D X(m) + \sum_{i=0}^{N} q_i \left( B^1_{D_i} U(m) + B^2_{D_i} U(m+1) \right).   (2.3.15)
Since the period of the index m is N, the complex variable z in the z-transform of Eq. (2.3.15) corresponds to z_k^N, where z_k is the complex variable in the z-transforms of equations with index k; that is, the period of Eq. (2.3.15) is \omega/N in the frequency domain, where \omega indicates the period of equations with index k in the frequency domain (sub-sampling). Using Eqs. (2.3.12) and (2.3.15), the characteristic equation of the control system in the presence of stationary occurrences of feedback delay is represented by:

\det \left[ \left( I + \sum_{i=0}^{N} q_i B^2_{D_i} F_D \right) z^N - A_D + \sum_{i=0}^{N} q_i B^1_{D_i} F_D \right] = 0.   (2.3.16)
The asymptotic stability can be tested with the pole positions of Eq. (2.3.16), whose characteristic equation reduces to a simple form due to the simple structure of A_D despite its augmented dimension. The characteristic equation of the zero-delay case (i.e., q_0 = 1 and thus P = \sum_{i=1}^{N} q_i = 0) is:

\det \left[ z^N I - (A - BF)^N \right] = 0.   (2.3.17)
Further, one can get the following characteristic equation for the worst case in which q_N = 1, or the control input is updated only once every N sampling intervals due to a periodic delay of active duration N T_s:

\det \left[ z^N I - A^N + \sum_{i=0}^{N-1} A^i B F \right] = \det \left[ z^N I - Q \left( D^N - \sum_{i=0}^{N-1} D^i Q^{-1} B F Q \right) Q^{-1} \right] = 0,   (2.3.18)

where A = Q D Q^{-1} by the similarity transformation, D = diag[d_1, \ldots, d_n], and the d_i's are the eigenvalues of A. The conditions for asymptotic stability are used to derive the hard deadline. Let N T_s and D T_s be the assumed maximum and the actual maximum delays, respectively. Then, the hard deadline can be obtained by iteratively testing the necessary conditions for system stability in the modified characteristic equation (2.3.16) while changing N from 1 to D.
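A minimal sketch of this iterative test, restricted to the worst-case form of Eq. (2.3.18): with a periodic delay of N sampling periods, the control input u(mN) = -F x(mN) is held over the whole group, so the group-to-group transition matrix is M_N = A^N - \sum_{i=0}^{N-1} A^i B F, and stability holds while the spectral radius of M_N stays below one. The matrices A, B, and F are assumed given; this is an illustration, not the chapter's implementation.

```python
# Sketch (assumed A, B, F): largest delay N (in sampling periods) for which the
# worst-case characteristic equation (2.3.18) still has all roots inside the
# unit circle, obtained by checking the spectral radius of
# M_N = A^N - sum_{i=0}^{N-1} A^i B F for N = 1, 2, ..., D_max.
import numpy as np

def max_stable_delay(A, B, F, D_max):
    """Return the largest N in 1..D_max with rho(M_N) < 1, or 0 if none."""
    last_stable = 0
    for N in range(1, D_max + 1):
        S = sum(np.linalg.matrix_power(A, i) @ B @ F for i in range(N))
        M = np.linalg.matrix_power(A, N) - S
        if np.max(np.abs(np.linalg.eigvals(M))) < 1.0:
            last_stable = N
        else:
            break
    return last_stable
```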
2.3.5 Deadlines in the Presence of Delays/Disturbances
The hard-deadline information is also derived for a linear time-invariant controlled process in the presence of disturbances as well as delays in the control input. We use the same method as in Section 2.3.4, i.e., iteratively testing for the stability of the modified (stochastic) dynamic equation while changing N from 1 to D, where N T_s and D T_s are redefined as the assumed maximum and the actual maximum duration of delays and/or disturbances, respectively. To modify the given state equation (2.3.1) by incorporating all stochastic behaviors of controller-computer failures, we need the following definitions and assumptions:

•  d is the conditional probability of successful detection given that a computer failure has occurred.

•  q_i^s is the conditional probability of an input disturbance lasting i sampling intervals (\sum_{i=1}^{N} q_i^s = 1) if a computer failure occurs and is not detected until its disappearance. It can be estimated by using experimental data or an analytic model of the error latency.

•  g_{\Delta u} is the probability density function (pdf) of \Delta u, the magnitude of a control input disturbance at time k T_s, i.e., u_{actual}(k) = u_{desired}(k) I_\Delta. The mean and variance of \Delta u are given as \mu_{\Delta u} and \sigma^2_{\Delta u}, respectively.

•  The probability that two transient failures occur sequentially within a small number of sampling intervals, (N - i) T_s, where the delay (recovery duration) or the duration of the erroneous control input (active duration of a transient failure) is i sampling intervals and N T_s is the assumed maximum value of such intervals, is so small as to be ignored. Thus, we consider only one computer failure possible during a group of N intervals.

•  Any random sequence is assumed to be independent and identically distributed (i.i.d.) over the time index k.
The control input at (mN+i) T_s is updated to be u(mN+i) I_\Delta for the disturbance case or u(mN) for the delay case. We then obtain the augmented state equations by using the augmented state and control vectors for a group of N sampling intervals:

X(m+1) = A_D X(m) + B^1_{D_i} U(m) + B^2_{D_i} U(m+1),   (2.3.19)
U(m) = -F_D X(m),   (2.3.20)
where [B^1_{D_i}, B^2_{D_i}] becomes [B^1_{D_0}, B^2_{D_0}] for normal behavior, [B^{1d}_{D_i}, B^{2d}_{D_i}] for the delay case, and [B^{1s}_{D_i}, B^{2s}_{D_i}] for the disturbance case, respectively. (In [7], these augmented matrices are described in full.) Eqs. (2.3.19) and (2.3.20) are, in turn, modified by combining the random sequences representing the behavior of computer failures:

X(m+1) = A_D X(m) + \left( (1-\psi) B^1_{D_0} + \psi(1-\phi) \sum_{i=1}^{N} \zeta_i^s B^{1s}_{D_i} + \psi \phi \sum_{i=1}^{N} \zeta_i^d B^{1d}_{D_i} \right) U(m) + \left( (1-\psi) B^2_{D_0} + \psi(1-\phi) \sum_{i=1}^{N} \zeta_i^s B^{2s}_{D_i} + \psi \phi \sum_{i=1}^{N} \zeta_i^d B^{2d}_{D_i} \right) U(m+1),   (2.3.21)

where \psi, \phi \in \{0, 1\} are binomially distributed random sequences with probabilities P and d, and \zeta_i^s, \zeta_i^d \in \{0, 1\} are multinomially distributed random sequences with probabilities q_i^s and q_i^d, i.e., \Pr[\zeta_i^s = 1] = q_i^s and \Pr[\zeta_i^d = 1] = q_i^d. The asymptotic stability of Eq. (2.3.21) can be examined deterministically or stochastically.

A. Deterministic Approach

Similar to the method used in Section 2.3.4, the deterministic value of the hard deadline is obtained by examining the pole positions of the first moment (ensemble average) of Eq. (2.3.21). Although the resulting hard deadline has little practical meaning, it can show the trend of the ensemble system behavior with an uncertainty (in the state and output) which can be characterized by the second moment of Eq. (2.3.21). The first moment of Eq. (2.3.21) is:

X(m+1) = A_D X(m) + \left( (1-P) B^1_{D_0} + P(1-d) \sum_{i=1}^{N} q_i^s B^{1s}_{D_i} + P d \sum_{i=1}^{N} q_i^d B^{1d}_{D_i} \right) U(m) + \left( (1-P) B^2_{D_0} + P(1-d) \sum_{i=1}^{N} q_i^s B^{2s}_{D_i} + P d \sum_{i=1}^{N} q_i^d B^{2d}_{D_i} \right) U(m+1).   (2.3.22)
The characteristic equation of the zero-delay case (i.e., no computer failure, or p = 0) is; det [;•''/ - (.4 - BFf] = 0, (2.3.24) where the magnitudes of eigenvalues are equal to those obtained from: dei[zI-{A-
BF)] = Q
(2.3.25)
which is the characteristic equation of the original state equation (Eq. (2.3.1)) in the absence of computer failures. B. Stochastic Approach The effectiveness of the deterministic approach decreases as the variance of q^iu gets large. In such a case, we can derive the probability mass function (prnf) of the hard deadline with respect to q^u rather than the deterministic value of the hard deadline based on the mean of q/^u- Now, the mapping between the hard deadlines and the magnitudes of disturbances (Au's) is not one-to-one, and the hard deadlines can be derived numerically for each sample value of Aw's. In all but simplest cases it is impossible to derive a closed-form expression for the prnf of the hard deadline or the exact relation between the hard deadline and AM. The method we use is therefore to quantize uniformly the ^AU continuum in the interval [a, 6], where / q^^^dAu = 7. Let this quantization result in M equal-length subintervals (cells). There is a tradeoff between the accuracy and the amount of computation in determining 7 and M, and a and b are determined
93 appropriately according to 7. Then, points are allocated to the quantized cells, and let the point of the i-ih cell ([a + (i— 1 ) ; ^ , a + ? j ^ ] ) be Au, which corresponds to the midpoint of the cell, i.e., Aui — a + ~',\_l\ • then the probability a + i-M-
of the point is calculated as 5^^ = J ,, ^T.JUL QAuis)!^^.
A hard deadline is
derived for each Aui, and let it be Di whose probability is equal to that of q'^^. F'inally, the pmf
of the hard deadline is derived numerically by multiplying
Di and q'^^^, 1 < « < M . The accuracy of the resulting pmf
depends on o . t ,
and M, which must be determined by considering the controlled-process state equations, the environment, and the amount of computation. Although the above method uses a stochastic approach in deriving the
pmf
of the hard deadline, it is still based on the mean values of binomially- and multinomially- distributed random sequences since the hard deadlines of all possible samples cannot be derived due to the excessive number of possible samples. However, the stability of each individual system (i.e.. samples) has more practical meaning than the stability of the average system or the ensemble of all possible systems which might be built. Thus, in addition to the deterministic analysis (or combined with the stochastic analysis) of the averaged system stability, we will a t t e m p t to stochastically analyze system stability by using the almost-sure
sample stability concept (which is actually almost deterministic).
D e f i n i t i o n 2:
•
Probabilistic
Stability
: For every pair of positive numbers a and 6, there
exists a positive number d(a,b.to) P[sup 11^(11 > a]
such that
for xt„ such that ||i-,J| <
rf,
(2.3.26)
«>«o
where Xf = {x{s) : to < s < t} is a. segment of the past history. •
Almost-Sure
Stability:
For every pair of positive numbers a and 6, there
exists a positive number rf(a,6,
sup ||i;t|| ) > a] < b. \t>to
(2.3.27)
94 No control u p d a t e during this period
y
failure(s) occur ^'
time t
l - ^ - « k :
0
F i g u r e 2.3.3
1
successful recovery
^
—I ko ko + I
^
)(-
•
. , — H
ko + N
Time index for the one-shot event/delay model.
In fact, the almost-sure sample stability means that almost every possible difference equation for a given ensemble of such systems has a state which is stable in the Lyapunov sense.
2.3.6
One-Shot Event Model
T h e pole locations do not change in case of only one failure with a relatively long ( > T,) active period (Fig. 2.3.3). The (asymptotic or global) stability condition discussed thus far is therefore no longer applicable.
Instead, the
terminal state constraints can be used to test whether or not the system leaves its allowed state space. Note that every critical process must operate within the state space circumscribed by given constraints, i.e., the allowed state space. When the control input is not updated correctly for a period exceeding the hard deadline, the system may leave the allowed state space, thus causing a dynamic failure. T h e allowed state space consists of two sets of states A'^ and X\
defined
as follows:
•
X\:
the set of states in which the system must stay to avoid an
immediate
dynamic failure, e.g., a civilian aircraft flying upside down is viewed as an immediate dynamic failure. This set can usually be derived o priori
from
the physical constraints. •
X\:
the set of states that can lead to meeting the terminal constraints
with appropriate control inputs. This set is determined by the terminal constraints, the dynamic equation, and the control algorithm used.
95 The system must not leave X\ nor X]^ in order to prevent catastrophic failure. Assuming that some computer failure may not be detected upon its occurrence but every detected failure can always be recovered successfully, we can consider three cases for the analysis of the effects of computer failures during a finite time: (i) delay: when a computer failure is detected upon its occurrence, (ii) disturbance: when a computer failure is not detected until its disappearance, and (iii) disturbance and delay: when a computer failure is detected at some time after its occurrence but before its disappearance. Let ko,kf, Ni, and A'^2 denote the indices for the failure occurrence time, the mission completion time, and the period of disturbance, the period of delay measured in sampling periods, respectively, where N = N1 + N2, 0 < Ni,N2 < N. The dynamic equation for a one-shot event model is; - /)nte+N,(^2)] (2.3.28) where TIkgiN) is the rectangular function as defined in Section 2, and A'^i and A'^2 are random variables and determined by the conditional probability of successful detection (d) \i N is given; :E(^+1)
= Axik)+B
PT[NI
[u{k) + (uiko) - uik))Uko(Ni) +
- i] =
Fi[N2 = i] = Pr[7Vi=7V]
=
d(l -dy
«(A:)(/A
0
d{l - d)^-'
N
-I
l
Pr[7Vi = 0 ] = l - r f ( l - r f ) ^ - ^
(2.3.29)
Thus, the first moment of Eq. (2.3.28) is; /
x{k+i)
=
Ax{k) + V
N
Bluik)+Y^[qi{u(ko)-u{k))nk,{Ni) »=0
+ qr^mi^ - mko+N,{N2)]).
(2.3.30)
Using Eq. (2.3.30), one can derive the states at time ko + N and kj, and examine whether or not the state trajectory satisfies the immediate and terminal constraints, iteratively for each N {I < N < D). ko+N-l
x{ko + N)
=
A^x{ko)+
Yl i=ko
A''°+^-'-^B[qiu{ko)
+
qTuii)I^],
96 fco + N - l « = A;o
ki-\ i = ko+N kf-1
=
A'''-'"'-^x{ko
+ N)+
Y,
In addition to this deterministic approach, the pmf
A''^-'-'^Bu{i).
(2.3.31)
of the hard deadline t h a t
depends on q^^u is also derived by using the same stochastic approach employed in the stationary model. Without using the first moment of the state equation (2.3.30) the probability of the hard deadline being N (i.e., Pr[A'^ = D]) is thus obtained as the sum of the probabilities^ of the sample equations (2.3.28) in which a dynamic failure occurs at time N. One must continue this process until a dynamic failure occurs for all samples (i.e., Pi[N = D] = l ) . T h e following pseudo-code is used to iteratively derive the hard deadline for the system with the initial state XQ at time ko, where Td is the duration of abnormality due to a computer failure.
\+ Recursive testing the allowed state space while increasing Td *\ m = I \* sub-samphng period, mA *\ D = integer.time{Nttm) \* integer times of T, *\ if (D < Num) t h e n while (m ^ Ts/A or test^constraintsO ?iTRUE) d o Td := {D- l)Td + mA it test.constatnts{Td) then return Td \* hard deadhne is Tj *\ else m := m -|- 1 \* test larger Td by increasing m *\ end_whlle else return no hard deadline
integer-time(N(,m) N =1 while {N # Nt„n or test.constaints() Td := NT, ' N^ testings are required for each N.
?^TRUE) d o
97 if test.consiaints{Td)
then return A^ \* D = N max is obtained *\ \* test larger N *\
else A^ := JV + 1 end_while test.constraints{Td) compute x({ko + N)Ts) from xo if (xiiko + N)Ts) e X i ) then begin compute x[kjTs) from either x{[ko + N)Tg) or xo if(x{kfTs) 6 X{) then return FAILURE else return TRUE end else return TRUE Xf € X^ can be tested indirectly by the following relation: x{kf)
eX{<^:^
x{ko + N) e Xl,
(2.3.32)
kj-l where X^
^ { x \ [A'''-''"~^X
+
^
A''^-"'Bu{i)]
G A'^ }.
i = ko+N
In practice, it is difficult to obtain X ^ .
Although there may be a one-to-
one mapping between x(ko + N) and x{kj),
X'^ is usually a continuum, which
requires an excessive amount of computation due to the curse of dimensionality. The size of X\ size of X\
will decrease as either A' increases or k approaches kj, but the
is usually larger than that of Xj^ due to the (asymptotic) stability
of a controlled process.
2.3.7
Examples
To demonstrate the concept of hard deadline, we derive the deadlines for two simple, yet practical, example control systems. E x a m p l e 7.1: To show the hard deadline of a linear time-invariant control system in the presence of stationary occurrences of input delays/disturbances due t o computer failures, we consider a simple controlled process: xi{k+l)
where the coefficient matrices of a quadratic performance index are given as: ^xr
—
10 0
0 10
•>
^Miu
5
0
0
5
One can derive the optimal feedback control gain matrix F that stabilizes the controlled process by solving a discrete Riccati equation as: 3.1251 -1.0791
0.3090 0.5512
This feedback control changes the system eigenvalues from {0.95,11.02} to {0.0777,0.2101}. We then derived deterministically the change of poles as a result of iterativeiy incrementing N for the occurrence of the largest delay possible [p = q^ = 0.045). The results are given in Table 2.3.1, where the first case represents the perfect coverage (d = 1) and the second case represents the presence of an input disturbance {d = 0.9 and fi^-u = —5). The deterministic value of the hard deadline is D = 6r, in the absence of input disturbances under instant failure detection, whereas it decreases to D = 5T, with some (infrequent) input disturbances. The pmf of hard deadline is plotted in Fig. 2.3.4 along with the pmf of the magnitude of input disturbances. Example 7.2: The physical meaning of hard deadline based on the one-shot delay model (with the assumption of perfect detection coverage) can be explained with the real-time controller of a robotic manipulator, where the obstacles in the robot's workplace are translated into state constraints. That is, the system states (robot's positions) must be constrained to avoid collision with the obstacles. In addition to these state constraints, there are usually control input constraints due to the bounds on joint motor torques and energy. In [6], a point
99
-9
0
Input disturbance (Au) -6 -5 -4 -3
-8
-7
1
1
1
1
1
1
-z 1
-1
0.30
0
:
0
-
-
-
-
0.27 0.S4 0.21
0
So go
0.18; 0.15 I
I 1°
Au a
/
\
0.12 "
b
i
0.09
n
•
D
0,
0.06
I"
_j
0.03
0, 1
0,
3
0,
I
1
i /
1
0.00
4 5 6 7 Hard deadline [ft] (D)
F i g u r e 2.3.4 P r o b a b i l i t y m a s s f u n c t i o n s of A u , w h i c h is g i v e n a priori, D, w h i c h is d e r i v e d .
and
robot of Cartesian-coordinate class weis modeled by a set of decoupled double integrators: i V
' 0 /' 0
u
X V
+
' 0
0 ' ' 0 '
1 0
u
where x, v, and u are 2-dimensional position, velocity, and control vectors, respectively, and I £ "R.2 x 2 . The robot's end-effector is guided by the control input determined by (2.3.34) mm Vd-w| 2 subject to the control constraints |w,| < u^"^ for i — 1,2, and the state constraints specified by the obstacles. The Obstacle Avoidance Strategy (OVS) used in [6] considers the dynamic environment and real-time control needs, vs^here the obstacle-related state constraints are transformed into state-dependent control constraints by mapping each end-effector's position relative to an obstacle into the P-funciionals:
where the obstacle is assumed to be centered at the origin and covered by a circle of radius r.
100 / y
4 obstacles (circles)
o
Xi
delay for 3 -
.-^
s f
^ o/& 1,2,3/^
/
delay for 3
1
1
1
3
3 xl(k)
F i g u r e 2,3.5 State trajectories for a point robot with four obstacles, 1) in the absence of delay and 2,3) in the presence of delay equal to DTs.
The k in the P-functional must be chosen such that the set of permissible controls remains nonempty, and the system state and each obstacle's position determine x — \i:\ — a, X2 — b] , wliere (a, 6) is the coordinate of a 2D obstacle. According to Theorem I in [6], k is set to 0.5 so that the control input constraint set may remain nonempty over the duration from the detection of a potential collision to the disappearance of this collision danger. A collision-free trajectory is guaranteed if P{t) > 0 ^t. A potential collision was detected by checking P(i) = P ( 0 ) + tPiO) + -P„,ar
> 0 V< ^
Pmax > \p,Q.
,
(2.3.35)
where i — —P{0)/Pmax is the time at which the minimum of P occurs. If the condition (2.3.35) is violated, the instantaneous value of Pmax is calculated, and an additional constraint (Eq. (2.3.36)) is included until P becomes positive, i.e., the danger of collision disappears. ^(X, V) = {u : Pju
where P^ = (x' x)
-1/2
J-,
Px.
+ li^PxJv
l ^ - ( X X^
X
V
> Pmax},
x) -1/2 ,
(2.3.36)
and m{x) = 2k(x'^x)''
^.
101
1
J u I y
1
V
1
1
0.5
1
1
1.5
2
'
3.5
xl
Figure 2.3.6 Hard deadlines of state trajectory 1 of Fig. 2.3.5 in the absence of delay without terminal constraints.
The intersection of these constraints with the admissible control set Q results in a polygonal control space: A= The Optimal
Decision
Strategies
^(x,v)r]Q. (ODS) in [13] can be used to solve such a
constrained minimization problem, similarly to a class of pointwise-optima] control laws constrained by hard control bounds. The hard deadline is derived using a pointwise-optimal control law with the OVS. Since the computationtime delay causes the failure to update the control input, a collision (or a dynamic failure) occurs if the computation-time delay is longer than a certain threshold, which is a function of the system state and time. In this example, the state constraints change with system state (time), i.e., the state-dependent control constraints. Thus, the control input must be updated on the basis of new information to avoid any collision. The trajectories derived in both the absence and presence of delay DT^, and the hard deadlines on these trajectories in the absence of delay are plotted in Figs. 2.3.5 and 2.3.6.
102 Controlled Process: System Dynamics
"
Controller Computer:
Environment: Fault behaviors
Hard Deadline
Control Algorithms
H/W Design
S/W Design 4 Evaluation (Reliability)
Figure 2.3.7 The source and application of hard-deadline information in a real-time control system.
2.3.8
Application of Deadline Information
T h e hard-deadline information allows us to deduce the timing constraints of the controlled process. It can be used to evaluate the fault-tolerance requirements of a real-time control system in terms of its timing constraints. Fig. 2.3.7 shows the source and application of hard-deadline information. As we shall see, this deadline information about the controlled process is quite useful for the design and evaluation of a controller computer. When designing a controller computer, one has to make many design decisions in the context of controlled processes that are characterized by their hard deadlines and cost functions [18], including:
hardware design issues dealing with the number of processors and the type of interconnection network to be used, and how to synchronize the processors.
103 •
software design issues related to the implementation of control algorithms, task assignment and scheduling, redundancy management, error detection and recovery.
Since the timing constraints of the controlled processes are manifested as hard deadlines, the deadline information is also essential to evaluate the system reliability, an important yardstick to measure the goodness of the controller computer. To illustrate the general idea of applying the knowledge of the deadline information (i.e., system inertia), we consider two specific examples; (i) a design example that optimizes time-redundancy recovery methods such as retry or rollback, and (ii) an evaluation example that assesses the system reliability. Example 8.1: When an error is detected the simplest recovery method is to re-execute the previous instruction, called simply retry, which is effective in case of immediate error detection [10, 11]. When retrying an instruction, one must determine a retry period, which is long enough for the present fault(s) to die away. If the retry does not succeed in recovering from the error, we have to use an alternative recovery method like rollback or restart. So, the retry period must also be short enough not to miss the deadline by considering the amount of time to be taken by the subsequent recovery method in case of an unsuccessful retry. Let Tt, TQ, and tr be the "nominal" task execution time in the absence of error, the actual task execution time, and the retry period, respectively. Then, one can obtain a set of samples ofTa'. Ta E {Tt,Tt +
where T,, T, and j - are the resetting time, the mean occurrence time of an error, and the mean active duration of a fault. Since Ta has discrete values, the probability mass function (pmf) of Ta is: /^^
=
PT[Ta = k{f + Z+tr)
+ Tt+6^l
=
p',+'iTt)il-p,(U))''a-Pe{Tt)y-'ps{irY,
0
Se{0,l} (2.3.37)
104 where Pe{Tt) and Ps{tr) are the probability of occurrence of an error during Tt (i.e., after restart) and the probability of a successful retry with a retry period tr-^ Then, the probability of missing a hard deadline is:
Pmh(Tt,D)= where foiy)
Yl
(2.3.38)
is the probability density function of the hard deadline. When the
hard deadline is deterministic, foiu) PmhiTi,D)
fxM)lD{y)dy.
is a delta function and the corresponding
becomes simpler. Consequently, the optimal retry period can be
determined by minimizing ^ ^ ^ ( r t , D) with respect to tr, using the derived hard-deadline information
foiy)-
Similarly, the hard-deadline information is also useful to rollback recovery, where checkpoints must be placed optimally.
The checkpoints are usually
placed so as to minimize the mean execution cost [23]. However, the mean cost must be minimized while keeping the probability
of dynamic
failure
—
the probability of missing a deadline [18] — below a prespecified level in a real-time control system [20]. The hard-deadline information is necessary to compute the probability of dynamic failure, which can, in turn, be used for the optimal placement of checkpoints. E x a m p l e 8.2: We consider a simple example of a triple modular redundant ( T M R ) controller computer in which three identical processors execute the same set of cyclic tasks. The T M R controller computer updates, once every T, seconds or every sampling period, the control input to the controlled process (plant). T h a t is, the period of each cyclic task is equal to T,. T h e input of the cyclic task is a discretized output of the plant and the o u t p u t of the cyclic task will be used to control the plant during the next sampling interval. T h e o u t p u t of the T M R controller is correct for each task only if at least two of the three processors in the T M R controller produce correct outputs. A TMR failure
is
said to occur if more than one processor in the T M R controller fail during T,. T h u s , the o u t p u t of the T M R controller would not be correctly updated in case of a T M R failure. T h e condition for a system (dynamic) failure resulting from 'See [8] for the derivation of Pe(Tt) and pj((r) in a TMR system.
105 Pl3
Si:
No failure
52; O n e processor with failure(s) 53: T w o or three processors with failure(s) = O n e T M R failure 54: T w o sequential T M R failures Ss'. Dynamic failure
F i g u r e 2.3.8 tia.
A Markov reliability model with knowledge of the system iner-
controller-computer (TMR) failures^ is derived from the hard deadline, which is the allowable maximum duration of TMR failure(s). In other words, this condition gives us the knowledge about the controlled system's inertia against controller-computer failures. More than 90% of computer failures have been reported to be transient, especially with short active durations [21]. Thus, the controller computer may recover from most failures in a few sampling intervals, and it can correctly update the control input without causing any dynamic failure, if the active duration of controller-computer failure is smaller than the hard deadline. ^ As mentioned earlier, the other sources of system failure(s), such as failures in actuators or sensors or mechanical parts and failures of A/D and D/A converters, are not considered in this chapter, because the main intent is to analyze the coupling between a controlled process and its (fault-tolerant) controller computer.
106 Suppose the hard deadline derived from the controlled system is three sampling periods and a TMR controller computer is used. That is, no dynamic failure occurs if the faults inducing computer failures disappear (or are recovered by a fault-tolerance mechanism) within three sampling periods. Then, the reliability model for this controller computer (Fig. 2.3.8) is built by extending a Markov chain model whose parameters are to be estimated at a given level of confidence from empirical data. The additional states account for the system inertia, i.e., a dynamic failure results from only three consecutive erroneous (missing the update of) outputs of the controller computer or for a period of 3T,, not immediately from one or two erroneous outputs. Without the information of hard deadline, one can over-estimate the probability of a system failure under the assumption that the system has no delay-tolerance, i.e., one incorrect output can lead to a dynamic failure.
2.3.9
Conclusion
The hard deadline for a critical-control task is usually cissumed to be given a prion. This presupposes the existence of a precise definition of the hard deadline and a method to derive it, which, however, have not been addressed in detail. The knowledge of hard deadlines, which must be derived from real control applications, is very important for task assignment and scheduling, specification and evaluation of fault-tolerant controller computers. We derived hard deadlines for linear time-invariant control systems in the presence of the computation-time delay and/or the control input disturbances due to a controller-computer failure. First, when the occurrence of computer failures is stationary or represented by an appropriate probability density function, the hard deadline is obtained as an integer multiple of the sampling period by using the (asymptotic) stability condition for a modified state equation. Second, a heuristic algorithm is proposed to iteratively compute the hard deadline as a function of the system state and time by testing for the given (or derived) constraints of control inputs and states.
107 As an extension of this work, it would be interesting to derive the deadlines of nonlinear time-invariant control systems in the presence of control input disturbances. The derivation of hard deadlines for time-varying systems will also be a challenging task.
108
REFERENCES
[1] C. M. Belcastro, "Laboratory test methodology for evaluating the effects of electromagnetic disturbances on fault-tolerant control systems," NASA TM-101665, November 1989. [2] A. F. Belleisle, "Stability of systems with nonlinear feedback through randomly time-varying delays," IEEE Trans, on Automat. Conir., vol. A C 20, no. 1, pp. 67-75, February 1975. [3] A. Gosiewski and A. W. Olbrot, "The effect of feedback delays on the performance of multivariable linear control systems," IEEE Trans, on Automat. Contr., vol. AC-25, no. 4, pp. 729-734, August 1980. [4] K. Hirai and Y. Satoh, "Stability of a system with variable time delay," IEEE Trans, on Automai. Conir., vol. AC-25, no. 3, pp. 552-554, J u n e 1980. [5] G. Hostetter and J. S. Meditch, "Observing systems with unmeasurable inputs," IEEE Trans, on Automat. Contr., vol. AC- 18, pp. 306-307, June 1973. [6] S. Kheradpir and J. S. Thorp, "Real-time control of robot manipulators in the presence of obstacles," IEEE journal of Robotics and Automation, vol. 4, no. 6, pp. 687-698, December 1988. [7] H. Kim and K. G. Shin, "On the m a x i m u m feedback delay in a linear/nonlinear control system with input disturbances caused by controllercomputer failures," IEEE Trans, on Control Systems Technology (in press). [8] H. Kim and K. G. Shin, "Design and analysis of an optimal instructionretry policy for T M R controller computer," submitted for publication, 1993. [9] H. Kim and K. G. Shin, "Evaluation of fault-tolerance latency from realtime application's perspectives," submitted for publication, 1993. [10] I. Koren and Z. Koren, "Analysis of a class of recovery procedures," Trans, on Computers, vol. C-35, no. 8, pp. 703-712, August 1986.
IEEE
109 11] Y. H. Lee and K. G. Shin, "Optimal design and use of retry in faulttolerant computing systems," Journal of the ACM, vol. 35, pp. 45-69, J a n u a r y 1988. 12] G. Leitmann, An Introduction McG ray-Hill, 1969. 13] D. G. Luenberger, Optimization 1969.
to Optimal
Control,
New York, NY:
by vector space methods, New York, Wiley,
14] M. Mariton, "Detection delays, false alarm rates and the reconfiguration of control systems," Int. J. Control, vol. 49, no. 3, pp. 981-992, 1989. 15] Z. V. Rekasius, "Stability of digital control with computer interruption," IEEE Trans, on Automat. Contr., vol. A C - 3 1 , no. 4, pp. 3 5 6 3 5 9 , April 1986. 16] K. G. Shin and X. Cui, "Effects of computing time delay on real-time control systems," in Proc. of 1988 American Control Conf., pp. 1071-1076, 1988. 17] K. G. Shin and H. Kim, "Derivation and application of hard deadlines for real-time control systems," IEEE Trans, on Systems, Man, and Cybernetics, vol. 22, no. 6, pp. 1403-1413, Nov./Dec. 1992. 18] K. G. Shin, C. M. Krishna, and Y.-H. Lee, "A unified method for evaliating real-time computer controller and its application," IEEE Trans, on Automat. Contr., vol, AC-30, no. 4, pp. 3 5 7 3 6 6 , April 1985. 19] K. G. Shin and Y.-H. Lee, "Error detection process — model, design, and its impact on computer performance," IEEE Trans, on Computers, vol. C - 3 3 , no. 6, pp. 529-539, June 1984. 20] K. G. Shin, T.-H. Lin, and Y.-H. Lee, "Optimal checkpointing of realtime tasks," IEEE Trans, on Computers, vol. C-36, no. 11, pp. 1328-1341, November 1987. 21] D. P. Siewiorek and R. S. Swarz, The Theory and Practice of Reliable System Design, Digital Equipment Corporation, Bedford, MA, 1982. 22] D. D. Siljak, "Reliable control using multiple control systems," Int. Control, vol. 31, no. 2, pp. 303-329, 1980.
J.
23] A. Tantawi and M. Ruschitzka, "Performance analysis of checkpinting strategies," ACM Trans. Computer Systems, vol. 2, pp. 123-1441, 1984.
110 [24] K. Zahr and C. Slivinsky, "Delay in multivariable computer controlled linear systems," IEEE Trans, on Automat. Contr., vol. AC-19, no. 8. pp. 442-443, August 1974.
SECTION 3
SYSTEM VALIDATION
SECTION 3.1
Software-Implemented Fault Injection of TVansient Hardware Errors^ Charles R. Yount^ and Daniel P. Siewiorek^ ABSTRACT As computer applications extend to areas which require extreme dependability, their designs mai\date the ability to operate in the presence of faults. The problem of assuring that the design goals are achieved requires the observation and measurement of fault behavior parameters under various input conditions. One means to characterize systems is fault injection, but injection of internal faults is difficult due to the complexity and level of integration of contemporary VLSI implementations. This chapter explores the effects of gate-level faults on system operation as a basis for fault models at the program level. A new fault model for processors based on a register-transfer-level (RTL) description is presented. This model addresses time, cost, and accuracy limitations imposed by current fault-injection techniques. It is designed to be used with existing software-implemented fault-injection (SWIFT) tools, but the error patterns it generates are designed to be more representative of actual transient hardware faults than the ad-hoc patterns currently injected via most SWIFI experiments. 3.1.1 INTRODUCTION As persons become more and more dependent on computers in everyday life, and, in many cases, for life itself, the understanding of how those computers react to faults becomes more important [13], Today's systems are much too complex to use purely formal techniques to prove their correctness, much less to prove that recovery from faults is performed correctly. Also, present technology produces parts with mean times to failure (MTTF) of several years, so designers cannot
'The research described in this chapter was funded in part by the Office of Naval Research under grant N00014-91-J-4139. ^Inter-National Research Institute. ^Carnegie Mellon University Department of Electrical and Computer Engineering.
114 simply wait for faults to occur in the finished product to determine their effects. Consequently, engineers and computer scientists have turned to fault injection to artificially induce premature faults into a system. The research described in this chapter seeks to overcome the limitations of current fault injection techniques by allowing high-speed injection experiments without sacrificing accuracy. To achieve this goal, a fault model is developed which may be applied to a register-transfer-level (RTL) description of the processor under study. Experimental results show that such a model of the IBM ROMP is capable of reproducing 97% of the errors which are generated by a detailed gate-level model with a time reduction of over 500 to 1. 3.1.2 FAULT-D«JJECTION TECHNIQUES This section lays out a background for fault-injection studies. Some reliability terms are presented, and a fault injection taxonomy is given. Then, some recent research is described and categorized in that taxonomy. 3.1.2.1 Reliability Terms This chapter uses the following definition of fault pathology—the relationship between faults, errors, and failures [23]: A system failure occurs when the delivered service deviates from the specified service... The failure occurred because the system was erroneous: an error is that part of the system state which is liable to lead to failure... The cause—in its phenomenological sense—of an error is a fault. In addition, faults may be classified by their duration [34]: Permanent faults are continuous and stable; in hardware, permanent failures reflect irreversible physical change. Intermittent faults are only occasionally present due to unstable hardware or varying hardware or software states. These faults are potentially repairable by replacement or redesign. Transient faults result from temporary environmental conditions. Since the hardware is undamaged, these faults are not repairable. Figure 3.1.1 relates several sources of errors with possible durations and manifestations to errors and failures. Note that some of these errors are physical faults (physical defect, unstable or marginal hardware, and unstable environment) and some are human faults (incorrect design and operator mistake).
!15
Physical defect
Incorrect design
Unstable or marginal hardware
Unstable environment
Permanent fault
^
X
Error
Service failure
Intermittent fault
Transient fault
Operator mistake
Figure 3.1.1: Sources of errors and failures (Taken from [34].)
3.1.2.2 Responding to Faults Computer systems can respond to faults at a variety of levels. Non-redundant systems can only take advantage of fault-intolerant techniques such as increasing the quality of the design process or a circuit's constituent parts. Redundant systems can use fault detection or fault tolerance. Fault detection provides a level of integrity to the result of computation, and fault tolerance uses static or dynamic techniques to mask faulty components or reconfigure the system to operate without them. For example, if one or more processors in a multi-computer fails, the connectivity may be reconfigured to operate in a mode of degraded performance [30]. Table 3.1.1 shows a list of common reliability techniques from these categories. Grading fault-avoidance measures and validating complex fault-detection and fault-tolerant techniques are important but almost intractable tasks. To stimulate these mechanisms, engineers and computer scientists have used fault-injection techniques, the artificial introduction of faults or their manifestations into a sys-
Table 3.1.1: Classification of reliability techniques (Taken from [34].) "Single-error correction/double-error detection N-modular redundancy tern or a model of a system under study. The next section provides a taxonomy of techniques used in fault-injection studies. 3.1.2.3 A Fault-Injection Taxonomy Methods have been developed to inject faults into both software and hardware. Even though this chapter describes a model for hardware faults, it borrows ideas from software methods and combines them with existing hardware methods. Thus, this section provides a brief overview of software fault injection techniques
117 and then provides a two-way classification of hardware fault injection: by method of injection and by the level at which the fault is injected. Software Fault Injection. Fault injection has been used to test software by making modifications to the source code the programmers have written. Injection of faults into software falls into two major categories: error seeding and mutation analysis [1]. Error seeding is based on the assumption that artificially injected errors will be found at approximately the same rate as actual programmers' mistakes. Errors are injected during the debug phase, and when an acceptable number of them are found, the debug phase ends with the assumption that an acceptable number of actual errors were also found. Mutation analysis is used to determine the coverage of software test programs. Errors are injected, and the test program is used to attempt to detect them, providing coverage statistics. Hardware Fault Injection. A hardware fault injection experiment can be classified by both method of injection and the level of abstraction at which the faults are injected. First, consider a taxonomy of hardware fault injection methods containing four categories [9]: Software Simulation: The computer system under test (SUT) is simulated on another computer system. Faults are induced by altering logical values during simulation. Unfortunately, even the fastest simulators run several orders of magnitude slower than the actual SUT. Also, a device or gate-level simulation requires a full internal hardware description of the processor, not usually available to external testers because the designs are considered proprietary. Hardware Emulation: An engineering prototype, e.g., a wire-wrapped model, of the SUT is used, and faults are introduced into this model by using additional hardware to alter voltage values during execution. Again, a full hardware description is necessary, even more so than for the simulator model. In addition, the models are expensive to construct and are not exact replicas of the final system. Fault Emulation: The actual hardware of the SUT is used, and faults are emulated through special capabilities built into the hardware or through software (so-called software-implemented fault injection, or SWIFI). Often, though, with this method, there is no theoretical correspondence between the emulated faults and actual faults which might occur in the hardware.
118 Physical Fault Injection: Again, the actual hardware is used. FauUs may be injected by forcing voltage levels at the chip boundaries, but internal components of the processor will not be accessible. Alternatively, one may bombard the SUT with radioactive particles which cause internal faults, but this method is very expensive and difficult to control. As with fault emulation, there is often no theoretical correspondence between the injected faults and actual faults which might occur in the hardware. Table 3.1.2 summarizes the advantages and disadvantages of each of these methods. Fault injection methods
Advantages
Disadvantages
Software simulation
Hardware emulation
Fault emulation
Physical fault injection
Access to system at any level of detail. Fault types and control are unlimited.
Representative hardware with favorable access and monitoring.
True hardware and software in use.
True hardware and software in use.
Simulation time explosion.
Implementation and other parameters change with deployed system.
Fault types are limited and unknown.
VLSI limits access and monitoring points. Task is difficult.
Table 3.1.2: Trade-offs of fault-injection methods (Taken from [9].)
In addition to the method by which faults are injected, fault-injection techniques may also be categorized by the level of abstraction at which the faults are injected. High-level fault-detection or fault-tolerant mechanisms may be verified by injecting faults at any lower level of abstraction and allowing the errors to propagate to higher levels. In general, researchers prefer to inject faults at the lowest levels to assure accuracy, the degree to which the injected faults reflect actual faults. In practice, however, they are forced to inject faults at higher levels to attain a procedure that is tractable in terms of development cost and injection time. The following taxonomy presents a list of increasing levels of abstraction from the device level to the network level. In addition, they fall into two broad categories:
119 circuit-level abstractions at the lower end and functional-level abstractions at the higher end. Circuit: In this category, the physical makeup of the processor is considered. Device: The focus is on the actual transistor and how it operates. For a simulated injection, an analog simulator is used for each device. For a physical injection, the operation of the devices is affected with radiation or other physical stress. Gate: Simple logic gates (ANDs, ORs, multiplexors, and so forth) are faulted. Usually, a simple stuck-at (for permanent faults) or fixed-at (for non-permanent faults) model is assumed, although some experimenters use more accurate models including other effects such as coupling between lines or storage cells. Basic Block: The circuit is subdivided into functional groups such as adders, decoders, and registers. Here, the fault model may be ad-hoc or based on lower-level constituents. Chip: Faults are injected at the boundary of the integrated circuit. For microprocessors, this would include faults on the external address, data, and control lines. Functional: In this category, the circuit itself is no longer considered, but a functional description of the goal of the circuit is used. Micro-operation: The operation of the processor during each micro-instruction is faulted based on a fault model containing erroneous data transfers and micro-sequencing. Macro-operation: The operation of the processor's instructions are faulted by flipping bits in the instruction word or by using a fault model based on lower-level activity. System: Memory and I/O writes of the proces.sor are faulted because these are the activities which ultimately have an effect at the system level. Network: Messages and other accesses to shared resources are faulted. 3.1.2.4 Recent Research in Fault Injection Research has been performed in many of the areas formed by the cross-product of the two taxonomies of method and abstraction. Much of the fault-effect analysis work has actually been performed to gain insight to facilitate test generation. In this case, the faults themselves may not actually be injected; rather, the effects are studied to produce the appropriate tests to detect the faults.
120 Device. Within the device-level software-simulation area, a mixed-mode transient-fault simulator based on SPLICE has been used [6, 7, 12] to inject current transients corresponding to charge levels between 0.5 and 8.0 picoCoulombs. Also, simulation routines were developed [19] to accurately model single event upsets (SEUs) in CMOS SRAMS. Physical injection at the device level has included bombarding dynamic memories with alpha particles [24], a Zilog Z-80 with heavy ions and protons from a cyclotron [8], and a Motorola MC6809E with heavy ions from a ^^Californium source [16, 25]. A useful summary of ion bombardment studies is found in [26]. Also, performance at the device level has been affected indirectly through power supply disturbances [25]. Gate. Most gate-level injection is done by software simulation. For example, a recent fault-injection study of a 32-bit RISC used a register-level model in VHDL and injected 2,779 flip-to-1 and flip-to-0 faults via the simulator [29]. Another study [9, 10] injected transient faults into the IBM ROMP to determine how fault manifestations are related to workload. In addition, software fault simulation at the gate level has been the target of much optimization because of its popularity in testing. Exact solutions for permanent faults include concurrent, deductive, differential, and parallel fault simulation, and approximations include fast fault grading, critical path tracing, and statistical fault analysis. See [27] for a survey of these techniques. Basic-Block. Crossing the line between circuit and functional-level simulation is work done with basic circuit blocks [22] used to develop an approach for test generation which retains the accuracy of the gate-level while reducing computation time by using high-level primitives for fault propagation. Chip. At the chip level, most work has focused on physical fault injection such as altering the voltages on the output pins [2]. Micro-Operation. At the micro-operation level, researchers [4, 35, 36, 37] have developed test generation techniques which verify the functional behavior of microprocessors. Their work uses functional fault models on a graph-theory representation of a processor's data paths. Other techniques use a hardware description language (HDL) description of the processor to generate tests [28] or simulate faults [38]. However, the HDL models are often largely ad-hoc, reflecting HDL-level faults rather than hardware faults. Higher Levels. At the macro-operation, system, and network levels, fault emulation is the predominate method.
121 FIAT [3, 31] and FERRARI [20] are SWIFI tools which have been used to inject bit faults into a processor during execution of a workload. FIAT flips bits in an instruction word before its execution or flips bits in memory to simulating erroneous writes. FERRARI emulates line errors in the address or data lines when a data operand or opcode is fetched by modifying code sequences. It also injects errors into the condition codes. These errors can be transient or permanent. In [39], SWIFI is augmented with a hardware monitor for higher accuracy of evaluation. As stated earlier, these models are ad-hoc, chosen mainly because of their simplicity. The most recent attempt to increase the validity of these models is through current research into development of the EMAX process [21]. EMAX uses a Zycad hardware simulator to inject faults into a gate/transistor level description of a circuit. The error patterns are recorded and used to generate high-level models for injection by FERRARI. At the highest levels, studies have largely focused on the emulation of software faults (bugs). However, since the level of abstraction is so high, they also emulate many transient hardware faults which propagate through several levels of abstraction to manifest as errors in software structures. For example, in testing the FAA's Advanced Automation System [II], faults included corrupting virtual memory; corrupting, dropping, delaying or duplicating messages and timers; and terminating or delaying processes. A study using the DEPEND simulation environment [15] corrupted messages across a communications link. Similarly, a study to develop a deterministic fault injection-strategy injected message-passing faults by skipping transfers or sending extra, out-of order, out-of-time, or wrong-value messages [14]. Finally, as a means lo failure acceleration, whole pages of memory have been corrupted to simulate catastrophic errors in an IBM mainframe system [5]. Comparison. Table 3.1.3 shows a summary of these research projects, categorizing them by level of abstraction and the fault sources they attempt to model. 3.1.2.5 Section Summary Faults in hardware or software originate from many sources and may eventually manifest to failures in the encompassing computer system. The designers of a system may elect to incorporate fault-detection or fault-tolerant mechanisms to respond to these faults. To verify these mechanisms, the designers often use faultinjection techniques. Fault-injection techniques may be categorized by method and the level of abstraction at which the faults are injected. In general, lower levels of abstraction provide
Table 3.1.3: Summary of recent research
The table categorizes the studies cited in this section by level of abstraction (device through network) and by the fault sources from Figure 3.1.1 that each attempts to model; the SWIFI entries at the macro-instruction, system, and network levels include this research. No recent research in the area of operator mistakes was found.
3.1.2.5 Section Summary

Faults in hardware or software originate from many sources and may eventually manifest as failures in the encompassing computer system. The designers of a system may elect to incorporate fault-detection or fault-tolerant mechanisms to respond to these faults. To verify these mechanisms, the designers often use fault-injection techniques.

Fault-injection techniques may be categorized by method and the level of abstraction at which the faults are injected. In general, lower levels of abstraction provide high accuracy at high cost, and higher levels of abstraction provide various levels of accuracy at lower costs. The need for higher accuracy is based on two assumptions: (1) that a more accurate set of faults will propagate to a more accurate set of failures and (2) that a more accurate set of failures will produce a more accurate validation of fault-detection or fault-tolerant mechanisms.

The following section lays the framework for a low-cost SWIFI fault model for intermittent and transient hardware faults which provides a higher level of accuracy than is currently available.
3.1.3 DEVELOPING A NEW FAULT INJECTION METHODOLOGY FOR FAULT EMULATION

In order to determine system-level errors or to verify a fault-tolerant configuration with redundancy at the processor level, an engineer must be able to determine the effects of hardware failures on actual processor workloads. Therefore, he or she must inject faults which will manifest to errors at the processor level. To maintain confidence in the validity of the errors, the nature of these injections should be based on actual hardware faults. Such studies are hindered, however, by the limitations discussed in the previous section. To address these limitations, this research provides a new fault model which allows SWIFI tools to inject faults at the instruction level which will produce errors that are representative of those produced by actual hardware faults. An overview of this process is illustrated in Figure 3.1.2.

Since studies have shown that eighty [18] to eighty-five [32] or even ninety [34] percent of all computer failures are due to transient or intermittent faults, development of this new fault model focuses solely on modeling non-permanent faults. As such, it models faults caused by unstable or marginal hardware or an unstable environment.

3.1.3.1 Hybrid Fault Emulation

Any sequential network can be reduced to the simple model shown in Figure 3.1.3. This model consists only of (1) combinational logic which performs a boolean transformation on its input, (2) a collection of storage elements known as its "state," (3) an input vector, and (4) an output vector. Also, there exists one or more clocks which determine when the state can change. For a processor, the combinational network comprises the ALU, decoding logic, etc.; the state comprises visible and non-visible registers; and the inputs and outputs are signals coming into and generated by the processor on its I/O pins.

When injecting faults, it is important to note two things about any given interval between state changes. First, the only factors determining what happens during the interval are the current state and the input vector during the interval. Second, the only direct effects of these factors are the new state and the output vector during the interval. Subsequent states and outputs are only affected indirectly through the new state. Consequently, if a fault occurs and is contained completely within an interval, it can only directly alter the new state and the output vector. If a fault affects neither, it is called overwritten for this interval; an example of this might be a transient fault in the ALU at some time when it is not being used. It follows that if a fault is overwritten for any given interval, it will have no effect on subsequent intervals.
Figure 3.1.2: An overview of the new fault-injection process
The user inputs an instruction-set description of the processor under study. An automatic procedure generates instruction-level error manifestations of hardware faults in that processor. These manifestations are then used by fault-injection machinery to emulate faults during the execution of actual workloads.
Conversely, if the fault does affect the state or output, it is called non-overwritten for this interval, although it could become overwritten at some interval in the future. (A transient or intermittent fault, or multiple such faults, and certainly a permanent fault, may affect a circuit across more than one time interval. This research focuses only on single faults that are contained within a specified time interval.)

Figure 3.1.3: General sequential network
The network consists of combinational logic driven by an input vector and the current state, producing an output vector and the new state; a clock determines when the state can change.

Using software simulation, hardware emulation, or physical injection at the device, gate, or basic-block level, one may inject faults into all portions of a processor: the input, output, state, and combinational portions. When using SWIFI to avoid the time and accuracy limitations of these methods, however, one can only access the CPU through its instruction set. How, then, can transient faults in the combinational network be emulated?

Since the only factors that determine what happens during an interval T = [t1, t2] are the current state and input vector, one approach would be to set the current state at t1 and provide input during T in such a way that the combinational network will produce a new state and output to match the effects of the fault. This is referred to as forward fault emulation, FFE. Note, however, that FFE will only be possible in those cases where the combinational network is capable of producing those effects. A combinational network is a transformation from an input vector to an output vector, and in general, not all values of the output vector are producible. Also, if the fault is not directly associated with the processing of the current state or input, computing the required input values will require one to solve the inverse of the combinational network.

On the other hand, since faults affect only the new state and output vector, one can emulate a fault occurring during some interval T = [t1, t2] by simply generating the output during T and setting the new state at t2 to match the effects of the desired fault. This is referred to as reverse fault emulation, RFE. To perform RFE, one must choose the proper patterns to apply to the new state and output. This can be done in an ad-hoc manner by capturing the fault-free patterns and arbitrarily flipping bits to derive injection patterns. A more accurate method involves a hybrid fault-injection approach like the one described in this chapter.
With the hybrid approach, one allows the network to run until t1 and captures the state at that point. Then, another fault-injection technique is used to determine the proper patterns based on the captured state and an appropriate fault model. For example, one could initialize a gate-level simulation of the network with the captured state and simulate for the duration of T while injecting the desired gate-level faults. The results of the simulation can then be used to artificially stimulate the outputs and set the new state of the actual network. Thus, the hybrid approach uses the actual circuit to perform the fault injection, thereby gaining a speed advantage. However, it uses a lower-level model to determine the injection patterns for accuracy. Since the lower-level model is used over a limited interval, the time penalty is reduced. The natural interval T to consider for a processor when emulating faults via SWIFI is the time required to execute one instruction. Emulating faults using hybrid fault injection is illustrated in Figure 3.1.4.

Figure 3.1.4: A fault in an instruction stream
The processor is run at full speed up to the instruction in which the errors should be injected. The fault is emulated by altering only the new state at the end of that instruction and the outputs within the instruction (RFE), after which the processor again runs at full speed. Errors in outputs and states after the affected instruction are caused indirectly by the state at the end of that instruction.
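The interval structure can be made concrete with a toy example. In the self-contained C sketch below, a trivial accumulator stands in for the processor: its state is updated once per interval by a fast behavioral model, except for the one faulted interval, where the new state is taken from a slower bit-slice model with a transient carry fault injected. That division of labor is exactly the hybrid, reverse-fault-emulation idea described above. Everything about the toy machine (its state, its update rule, the fault location) is an assumption of the sketch and is not taken from the chapter.

#include <stdint.h>
#include <stdio.h>

/* Fast, behavioral model: stands in for the "actual circuit". */
static uint8_t step_fast(uint8_t acc, uint8_t in)
{
    return (uint8_t)(acc + in);
}

/* Slow, bit-slice model used only for the faulted interval: a ripple-carry
 * adder in which the carry between slice fault_slice and the next slice
 * can be flipped (fault_slice < 0 means no fault). */
static uint8_t step_bitslice(uint8_t acc, uint8_t in, int fault_slice)
{
    uint8_t result = 0;
    unsigned carry = 0;
    for (int i = 0; i < 8; i++) {
        unsigned a = (acc >> i) & 1u, b = (in >> i) & 1u;
        unsigned sum = a + b + carry;
        result |= (uint8_t)((sum & 1u) << i);
        carry = sum >> 1;
        if (i == fault_slice)
            carry ^= 1u;               /* transient fault on one internal carry */
    }
    return result;
}

int main(void)
{
    const uint8_t inputs[] = { 3, 5, 7, 11, 13 };
    const int faulted_interval = 2;    /* interval T in which the fault occurs */
    const int fault_slice = 1;

    uint8_t golden = 0, faulty = 0;
    for (int t = 0; t < 5; t++) {
        golden = step_fast(golden, inputs[t]);
        if (t == faulted_interval)
            /* RFE: for this one interval, the new state comes from the
             * lower-level model with the fault injected ... */
            faulty = step_bitslice(faulty, inputs[t], fault_slice);
        else
            /* ... while every other interval runs on the fast model. */
            faulty = step_fast(faulty, inputs[t]);
        printf("t=%d  golden=%3u  faulty=%3u\n", t, golden, faulty);
    }
    return 0;
}

Running the sketch shows the faulty state diverging from the golden run at the faulted interval and the resulting error persisting indirectly through the state in later intervals, as described above; the speed advantage of the hybrid approach comes from invoking the lower-level model for only that single interval.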
3.1.3.2 Developing a Fault Model

FIAT and FERRARI, mentioned earlier, are SWIFI platforms which provide environments in which many error coverage statistics may be obtained. The abstract
fault models (bit flips) often used on the platforms, though, are generated in an ad-hoc manner. There is no theoretical mapping between actual faults in the computer system and these fault models. One may obtain many statistics such as detection coverage (percentage of injected faults that were detected) and detection time, but there is no way to interpret these results with respect to the detection of actual faults.

To produce higher accuracy, one could invoke a complete device- or gate-level simulation based on the state at t1 to determine what outputs and state changes to make. Recall, however, the difficulties with using device- or gate-level simulations: (1) accurate descriptions are usually proprietary information not available to the user, (2) developing such models is costly, and (3) such simulations are extremely time-consuming, even given the time reduction of simulating only a limited time interval.

EMAX [21] seeks to alleviate the third limitation by using a Zycad MACH-1500, a hardware simulation engine, which is capable of executing a concurrent fault simulation algorithm on three models of the processor under test. The error patterns are recorded and injected via FERRARI. One current limitation of EMAX is that the MACH-1500 can only simulate permanent faults. Given this limitation, however, EMAX can be considered both accurate and fast. Of course, this technique also requires a gate-level model of the processor.

In order to provide even higher error-generation speed and lower cost, the experiments described in this chapter model the system at a higher level of abstraction, the register-transfer level (RTL). RTL can be considered slightly higher in abstraction than the micro-instruction level since a cycle-by-cycle description of the operation of the processor is not required. On the other hand, it is lower in abstraction than the bit-flipping models used previously for SWIFI since the functionality of the instruction is considered. The following section describes a tool built to support hybrid fault injection at the RTL level: ASPHALT.

3.1.3.3 An RTL Language

To support the fault-generation process, the user provides ASPHALT with a description of the processor from an architectural viewpoint. This includes a listing
(ASPHALT is a loose acronym for ASembly-language level FAULT simulator. It also punningly suggests that the tool can be used as an underlying model for fault-injection test-beds, some of which are named after automobiles, such as FIAT and FERRARI.)
of its architectural registers and an RTL description of the data transfers that take place for each instruction. The ASPHALT language is similar to the C programming language or the Verilog simulation language.

The basic data type in ASPHALT is the register, which must be declared with a certain bit-length. Registers may be accessed wholly or with a bit-range specification. Registers of equal size may be grouped into a register file and accessed with an array subscript. The language contains all the expected control constructs such as if/then/else, while, for, do/while, and a specialized switch-type construct known as decode/case. The decode statement takes an expression similar to C's switch, but the case statement is much more powerful: it allows matching to the result of the decode expression using a bit-field description containing fixed fields and wild cards, and the wild-card fields can automatically be assigned to variables. This feature makes opcode decoding straightforward and intuitive, as the sketch following the feature list below illustrates. Some other features which tailor the language to processor modeling are
• Built-in routines exist for memory and I/O read and write.
• Built-in routines exist to generate various types of processor traps: address, bus, program, privilege, and opcode.
• The add and subtract math operators allow an optional carry bit argument.
• All arithmetic and logical operators automatically generate negative, zero, carry, and overflow flags as appropriate.
• Bit lengths of operands are checked for compatibility; concatenation and extraction of bit fields is supported.
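The decode/case construct automates the familiar mask-and-match style of opcode decoding while extracting the wild-card fields for the user. The ASPHALT syntax itself is not reproduced here; the following C sketch only illustrates the underlying idea for a hypothetical 16-bit instruction format (it is not the ROMP encoding).

#include <stdint.h>
#include <stdio.h>

/* Mask/match decoding of a hypothetical 16-bit instruction word. */
static void decode(uint16_t ir)
{
    /* A pattern such as "0001nnnnmmmm----" becomes a fixed-bit mask,
     * a match value, and field extractions for the wild cards. */
    if ((ir & 0xF000u) == 0x1000u) {             /* case 0001nnnnmmmmiiii */
        unsigned n   = (ir >> 8) & 0xFu;         /* wild-card field n     */
        unsigned m   = (ir >> 4) & 0xFu;         /* wild-card field m     */
        unsigned imm =  ir       & 0xFu;         /* immediate field       */
        printf("op1  r%u, r%u, #%u\n", n, m, imm);
    } else if ((ir & 0xF000u) == 0x2000u) {      /* case 0010nnnniiiiiiii */
        unsigned n   = (ir >> 8) & 0xFu;
        unsigned imm =  ir       & 0xFFu;
        printf("op2  r%u, #%u\n", n, imm);
    } else {
        printf("opcode trap\n");                 /* no case matched       */
    }
}

int main(void)
{
    decode(0x1234);     /* op1 r2, r3, #4 */
    decode(0x2A55);     /* op2 r10, #85   */
    decode(0xF000);     /* opcode trap    */
    return 0;
}

In ASPHALT, the mask, the match value, and the wild-card field assignments are all derived automatically from the bit-field pattern written in the case statement, so the user never writes the shifting and masking by hand.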
The user may also write subroutines which use the call-by-name protocol, and recursion is allowed. Figure 3.1.5 shows an extract from the code used to model the ROMP, IBM's RISC-Oriented Micro-Processor. The complete description requires only about 400 lines of code.
/* RTL code for the RISC-Oriented Micro-Processor (ROMP) */

/* visible state variables */
reg [32] iar;        /* the instruction-address reg  */
reg [8]  cs;         /* the condition-status reg     */
reg [32] gpr[16];    /* the general-purpose reg file */

/* define bits of the condition-status reg */
#define lt cs{6}     /* less than    */
#define eq cs{5}     /* equal        */
#define gt cs{4}     /* greater than */
#define c0 cs{3}     /* carry        */
#define ov cs{1}     /* overflow     */
#define tb cs{0}     /* test bit     */

/* set condition status flags */
procedure set_cs_arith()
{
    ...

Figure 3.1.5: An extract from the RTL code used to model the ROMP
3.1.3.4 An RTL Fault Model

Once an RTL description of the processor is created, a fault model is applied to it to emulate the desired faults. This section develops a fault model from a taxonomy of actual hardware faults. The taxonomy of the hardware fault model and some of the hardware faults themselves are taken from [36]. For completeness, these fault model elements will be included in this section along with the new elements.

The processor is divided into two major sections: data processing and control. The data processing is subdivided into storage, transfer, and processing functional units. This section presents a fault model which can be applied at the RTL level for elements within each category. The basic goals of the fault model are to
• Produce faults which are representative of actual hardware faults, and
• Be implementation independent.
Obviously, to meet these goals, many assumptions and compromises must be made. This introduces the most crucial trade-off of the fault model: coverage versus overhead. Coverage is defined as the percentage of actual hardware faults that are modeled. Overhead is defined as the percentage of faults which were modeled but not produced by the actual hardware. A successful model will have a high coverage with low overhead. Unfortunately, an increase in one will produce an increase in the other. If assumptions are made to cover almost any implementation, the coverage will be high, but the increased overhead will make the model unusable. On the other hand, if very narrow assumptions are made about the implementation, the model will not cover any but the simplest processors. The experimental results will be evaluated against coverage and overhead.

For each section of the processor, a list of possible hardware faults is given. A discussion explains why a subset of these faults was chosen for emulation through the RTL model. Then, a list of the RTL faults for that section is presented. Although the fault model from [36] was intended to test for permanent hardware faults only, most of the faults from that work can describe permanent or transient faults. Furthermore, since 80 to 90% of hardware faults are transient or intermittent, only non-permanent faults are considered in the following model. At the RTL level, a transient or intermittent fault is modeled as being active during a single execution of an RTL statement. Only one fault per simulation will be injected. After the basic model is presented, this section explains how the coverage of certain invisible registers is affected. The experiments in the next section are intended to show how well this model meets the preceding goals when applied to a contemporary processor.

Data Storage and Transfer. The data storage units are the individual flip-flops, registers, and register files: anything declared in the RTL input file as a reg. (Addressing a register file is considered in following sections.) Possible hardware faults for these units are
132 •
One or more cells are stuck at 0 or 1 [36].
•
One or more cells fail to make a 0-1 or 1-0 transition [36].
•
One or more pairs of cells are coupled [36].
•
All or part of a register is loaded with the value on its input lines when it should not be (extraneous load).
•
All or part of a register is not loaded when it should be (missed load). The last two faults cover all faults related to generating and acting on the loadenable signal.
To avoid a combinatorial explosion, only a subset of "one or more" is chosen. Specifically, the faults considered in this model are single-bit faults and whole-register clear and set faults. This gives a fault space of n + 2 faults instead of 2^n − 1 faults for each n-bit register. The missed loads are easily modeled. The extraneous loads, however, are not. Since the RTL description does not specify an internal connectivity structure, there is no way to know what the input value to a register might be. If all possible or even probable inputs are chosen, the number of faults grows combinatorially. For example, the input to a register might be the ALU output. Since, in general, no default ALU function or input values can be assumed, the number of possible combinations is enormous. For this reason, extraneous loads are not modeled. Furthermore, missed loads are limited to the full register. Therefore, the faults injected by the RTL simulator are as follows:
• XOR one bit in storage; this is repeated for each bit in the register before each access.
• Set all bits to one.
• Set all bits to zero.
• Do not perform an assignment statement (except for any side-effects).
Temporally, for a given register, the first three of these faults are injected before each access, i.e., any place where its value is needed to resolve an expression, such as on the right-hand side of an assignment. Also, for each architectural variable (one which is declared as a global variable in the RTL input), the first three faults are injected at the end of the main routine to cover all remaining faults. Recall, of course, that only one fault is injected per faulty simulation. The example code in Figure 3.1.6 shows where and when register faults are injected.
Figure 3.1.6: Example injection of registers
The figure annotates a short RTL fragment involving two 16-bit global registers, A and B. The annotated injection sites are: register A not loaded (the initial value is zero); register B not loaded; register A not loaded at a second assignment; clear all bits, set all bits, and XOR each of the 16 bits of A before its access; clear all bits, set all bits, and XOR each of the 16 bits of B before its access; and clear all bits, set all bits, and XOR each of the 16 bits of all global registers (A and B) at the end of function main. For this code, there will be 75 faulty runs due to data storage faults: 3 missed loads, 4 clears, 4 sets, and 64 bit flips. There will be only one fault injected per run; for example, during the first run, A will not be loaded with 0xabcd.
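A minimal sketch of how such data-storage faults could be enumerated and applied for one access to a 16-bit register is shown below. It is written in C as an illustration only; the fault descriptor it uses is an assumption of the sketch rather than a structure taken from ASPHALT.

#include <stdint.h>
#include <stdio.h>

/* Hypothetical descriptor for data-storage faults on one register access. */
enum reg_fault_kind { BIT_XOR, SET_ALL, CLEAR_ALL, MISSED_LOAD };

struct reg_fault {
    enum reg_fault_kind kind;
    unsigned            bit;     /* valid only for BIT_XOR */
};

/* Apply one fault to a 16-bit register value read during an access. */
static uint16_t apply_read_fault(uint16_t value, const struct reg_fault *f)
{
    switch (f->kind) {
    case BIT_XOR:   return (uint16_t)(value ^ (1u << f->bit));  /* flip one bit */
    case SET_ALL:   return 0xFFFFu;                              /* all ones     */
    case CLEAR_ALL: return 0x0000u;                              /* all zeros    */
    default:        return value;  /* MISSED_LOAD is handled at the assignment  */
    }
}

int main(void)
{
    /* Enumerate the n + 2 faults injected before one access to a 16-bit register. */
    struct reg_fault faults[16 + 2];
    unsigned n = 0;

    for (unsigned bit = 0; bit < 16; bit++)
        faults[n++] = (struct reg_fault){ BIT_XOR, bit };
    faults[n++] = (struct reg_fault){ SET_ALL,   0 };
    faults[n++] = (struct reg_fault){ CLEAR_ALL, 0 };

    uint16_t golden = 0xABCD;
    for (unsigned i = 0; i < n; i++)
        printf("fault %2u: value read as 0x%04X\n",
               i, (unsigned)apply_read_fault(golden, &faults[i]));
    return 0;
}

With n = 16, each injection point (each access and each global register at the end of main) contributes the n + 2 = 18 faults enumerated here, and each assignment contributes one missed load; four injection points and three assignments give exactly the 75 faulty runs counted in Figure 3.1.6.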
For any multiplexer (mux), the following faults may occur for a given source:

• No source is selected [36].
• The wrong source is selected [36].
• More than one source is selected, and the mux output is either a wired-AND or a wired-OR function of the sources, depending on the technology [36].
Note that these faults cover muxes built from standard logic gates as well as those implemented with tri-state drivers onto a bus, as shown in Figure 3.1.7. There may be many muxes in a processor: selecting a register from a register file, selecting an input for the ALU, selecting a data value to be written to memory, and so forth. Only the fault of the wrong source selected in the context of a register file is emulated with the RTL fault model. Selecting the wrong source in other contexts leads to combinational explosion for the same reason as the extraneous load problem. Also, there is no way to predict what the value would be if no source were selected. Finally, selecting more than one source would result in an intractable number of combinations, even if confined to the register files. Therefore, the only fault injected is
• During a register array access, select an alternate register; repeat for each alternate register, giving n − 1 faulty simulations for an array of size n.

Figure 3.1.7: Two implementations of a logical mux
The first represents a logic-gate implementation, and the second represents gating onto a bus with tri-state drivers. In both, control lines select among registers R0 through Rn, and the output is the value of the selected register.
Decoder faults are associated with addressing into a register file. The possible faults are

• No destination is selected [36].
• Instead of, or in addition to, the correct destination, one or more other destinations are selected [36].

The "no destination" fault is covered by the missed-load fault discussed earlier. To avoid numerous combinations, the only other fault to be covered is that of one other destination selected. Therefore, the only additional fault injected is
• During an assignment to a register array, make the assignment to an alternate register; repeat for each alternate register, giving n − 1 faulty simulations for an array of size n.
Possible hardware faults in data buses are

• One or more lines are stuck at 0 or 1 [36].
• One or more lines form a wired-OR or wired-AND function due to shorts or spurious coupling [36].
Faults in internal buses will only manifest when data is being transferred internally. Therefore, all single-line bus faults will be covered under register bit faults and single-bit input and output faults in the functional units. Multiple-bit faults are not emulated. For the external buses, single-bit faults are injected on the address and data lines. (Bus control lines are covered later.) The RTL faults, then, are as follows:
• XOR one bit of the address before a memory or I/O read or write; repeat for each bit.
• XOR one bit of the data before a memory or I/O write or after a read; repeat for each bit.
Data Processing Functional Units. Data processing functional units include ALUs, shifters, and special-purpose computational units such as dedicated incrementers. Due to the large number of possible units, a general fault model is presented.

• One or more input, output, or internal data lines are stuck at 0 or 1 or are coupled.
• One or more gates produce an incorrect output.
• One or more condition codes are calculated incorrectly.
• The wrong function is performed.
• A pre-processing step is eliminated.
• An extraneous pre-processing step is performed.
In order to develop an RTL fault model, generic processing units must be developed. First, a generic ALU is presented. Figure 3.1.8 shows the external interface and a bit slice of such a unit.
Figure 3.1.8: Generic ALU model
(a) External interface; (b) bit slice. The external interface consists of two n-bit inputs, A and B; an n-bit result, R; control lines to select a function; and four condition codes (zero, overflow, negative, and carry). The bit slice calculates a result for each bit of input; some functions require a carry input and generate a carry output.

Externally, there may be a single-bit fault on any of the data inputs or outputs. Control and pre-processing faults are limited to simple permutations listed at the end of this section. Faults not modeled include the substitution of a binary operation for a unary one. In that case, a second input would have to be chosen, of which there are numerous possibilities. Note that single-bit errors in the condition codes are covered by status line faults in the control section.

Within a bit slice, no assumption is made about the independence of the generation of the carry and result bits. In other words, a single fault in a bit slice may cause errors in both the carry and the result. Therefore, the fault model includes single-bit faults in the carry, the result, and both simultaneously. It is easy to show that these faults cover all single-bit faults in the data inputs, except the carry input to bit 0 (C0), which is handled as a special case. For performance reasons, of course, most ALUs do not use a simple bit-slice implementation. Emulating a fault in the carry bit of each stage of a bit-slice model, however, covers all faults in the
generation of corresponding carry bits computed via one or more carry look-ahead circuits. The RTL faults for ALU operations are as follows:

• For all unary operations:
  - XOR a bit of the result; repeat for each bit.
  - Perform an alternate unary operation; repeat for each of unary minus and logical negation.
  - Pass the input directly to the output.
• For all binary operations:
  - XOR a bit of the result; repeat for each bit.
  - Perform an alternate binary operation; repeat for each of add, subtract, OR, NOR, XOR, XNOR, AND, and NAND.
  - Pass the A or B input directly.
  - Replicate the A input on the B input or vice-versa. For example, A+B becomes A+A or B+B.
  - Logically negate A or B before the operation. This covers missing negations as well as extraneous negations because ¬(¬A) = A.
  - Clear A or B before the operation.
  - For addition, subtraction, and unary minus only, XOR the carry input.
  - XOR a carry bit; repeat for each bit.
  - XOR both a carry bit and the corresponding result bit; repeat for each bit.
• For subtraction only, switch operands A and B. All other ALU operations are commutative, so this transformation would have no effect.
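A few of these ALU fault variants can be expressed directly as perturbations of the golden computation. The C sketch below shows a result-bit XOR, operand replication, operand negation, and a carry-bit fault between bit slices for a hypothetical 16-bit add; the adder is written out as a bit-slice chain only so that the carry fault can be placed between slices, in the spirit of the modeling approach described in the notes to Table 3.1.4 later in this section. The operand values and faulted positions are arbitrary choices for the example.

#include <stdint.h>
#include <stdio.h>

/* Ripple-carry add with an optional fault on the carry between bit slices.
 * fault_slice < 0 means no fault. */
static uint16_t add16_carry_fault(uint16_t a, uint16_t b, int fault_slice)
{
    uint16_t result = 0;
    unsigned carry = 0;
    for (int i = 0; i < 16; i++) {
        unsigned ai = (a >> i) & 1u, bi = (b >> i) & 1u;
        unsigned sum = ai + bi + carry;
        result |= (uint16_t)((sum & 1u) << i);
        carry = sum >> 1;
        if (i == fault_slice)
            carry ^= 1u;              /* flip the carry passed to slice i+1 */
    }
    return result;
}

int main(void)
{
    uint16_t a = 0x00FF, b = 0x0001;
    uint16_t golden = add16_carry_fault(a, b, -1);           /* 0x0100 */

    printf("golden A+B            = 0x%04X\n", (unsigned)golden);
    printf("result bit 3 XORed    = 0x%04X\n", (unsigned)(golden ^ (1u << 3)));
    printf("A replicated on B     = 0x%04X\n", (unsigned)(uint16_t)(a + a));
    printf("B negated before add  = 0x%04X\n", (unsigned)add16_carry_fault(a, (uint16_t)~b, -1));
    printf("carry fault, slice 7  = 0x%04X\n", (unsigned)add16_carry_fault(a, b, 7));
    return 0;
}

In this example, flipping the carry out of slice 7 changes the result from 0x0100 to 0x0000, i.e., by the weight of the bit that the faulted carry feeds, which is the behavior the carry-bit faults are intended to capture.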
Next, consider a generic shifter. For performance reasons, a unit capable of shifting a word an arbitrary number of bits is usually separate from the ALU, often in parallel with the ALU data paths. A typical interface to a shifter and an example of the internal circuitry are shown in Figure 3.1.9. Possible hardware faults in the shifter include the following:
• One or more bits of the input or output are stuck at 0 or 1 or coupled.
• One or more pass gates in the shift network are stuck open or closed.
• The wrong function is performed (left/right or logical/algebraic).
• Incorrect condition codes are generated.
• The amount value is decoded improperly, producing one or more active shifting signals in addition to, or instead of, the correct signal.

Figure 3.1.9: Generic shifter module
(a) External interface; (b) an example shift network. The shifter takes an input value X and a shift amount of log2 n bits and produces an output Z, along with two condition codes (zero and negative). The control lines indicate direction (left or right) and mode (logical or algebraic). The amount value is decoded into n lines, one of which is active. These lines are combined with the direction lines to select the proper bit. The example circuit shows the generation of bit 1 of Z with a word size of 4 bits, logical shifting only.
To reduce the number of RTL faults, the only bit faults modeled are single bit flips on the output, which cover single bit faults on the input (except for the sign bit during a right algebraic shift, which is handled as a special case) and single gate faults in the shift network. Also, faults in the decoder are limited to one active output instead of the correct signal. The resulting RTL faults, then, are as follows:
• XOR a bit of Z; repeat for each bit.
• For shift algebraic right only, XOR the sign bit prior to shifting.
• Shift the wrong amount; repeat for each incorrect amount between zero and the length of X.
• Perform an alternate operation; repeat for each of shift left, shift right, and shift right algebraic (shift left algebraic is equivalent to shift left).
Special-purpose units such as multipliers and dividers have not yet been modeled due to the large number of possible implementations. Their modeling can be approximated with iterative shift-and-add algorithms.

Control. There is a large variety of ways to build the control section of a processor: "hard-wired," micro-programmed, micro- and nano-programmed, or a combination of more than one of these methods. Also, the implementation of the control section is responsible (by exploiting the capabilities of the data section) for differences between various realizations of the same architecture. These are usually seen by the user as differences in execution time measured in nanoseconds or clock cycles and are achieved by differing amounts of parallelism, including pipelining. Because of these factors, only a high-level fault model is presented.
• For any instruction, any other instruction may be executed due to incorrect decoding [36].
• A control command c(i) is faulty:
  - c(i) is missing when it should have been issued [36].
  - c(i) is always present [36].
  - c(i) is issued extraneously in addition to, or instead of, some command c(j) whenever c(j) should be issued [36].
• Status signals could be faulty, in a fashion similar to control commands [36].
To model the first fault, all the alternative choices in a decode statement are executed in lieu of the correct one. In order to simulate all possibilities, then, each instruction must be coded as a separate case statement within a single decode statement. To model missing control statements, certain statements are not executed. A missing load was covered under register faults. Other missing statements include the test instruction (set the condition codes based on an expression), expressions executed only for their side-effects (i.e., expressions with no left hand side), and memory and I/O writes.
Faults where control statements are always present are only applicable for permanent fault models, so they are not modeled. Extraneous commands will be limited to exactly one command being executed instead of the correct one; other combinations would lead to combinational explosion. Most of these have already been covered: the wrong register selected for accesses into and out of a register file and incorrect control in the ALU or shifter. Additional faults modeled are the substitution of a memory read for an I/O read or vice-versa, the substitution of a memory write for an I/O write or vice-versa, and the substitution of a signed bit-extension for an unsigned one and vice-versa.

Obviously, there are many more conceivable extraneous control faults, for example, extraneous writes to memory or I/O and extraneous register loads. Also, faults such as erroneous branching within a micro-instruction sequence could cause extreme deviation from the expected behavior. As before, without implementation details, a high coverage of these faults would lead to unacceptable overhead. It is the purpose of the experiments described later to determine the impact of these restrictions.

Faults in the status signals are modeled by negating the results of logical values. This covers errors in branching decisions and condition-code calculations. The model for control faults which were not covered under previous sections is as follows:
• Do not perform a test instruction.
• Do not evaluate an evaluate-only expression.
• Do not perform a memory write.
• Do not perform an I/O write.
• Perform an I/O read instead of a memory read, and vice-versa.
• Perform an I/O write instead of a memory write, and vice-versa.
• Perform a signed bit extension instead of an unsigned one, and vice-versa.
• Invert the results of the following logical operations or values: logical NOT, OR, and AND; comparison tests (equal, not equal, less than, less than or equal, greater than, and greater than or equal); and the condition codes (negative, zero, carry, and overflow).
Implementation-Dependent Registers. An implementation-dependent register is one not specified in the architectural description of a processor. Therefore, no fault model can explicitly exist for one or for data paths or control sections associated
with it. Assumptions in the way these registers are used, however, make coverage of most such faults by the preceding fault model quite likely.

Fault Model Summary. The fault model is summarized in Table 3.1.4. This table reviews the list of transient hardware faults for each section of a processor; a list of RTL faults is given that corresponds to each hardware fault, with each entry mapping an original RTL construct to its faulted form (original RTL ⇒ faulted RTL). Representative entries include the following:

Register, missed load:               R ← expr  ⇒  no-op
Register, extraneous load:           none
Register, level change in storage:   R (on r.h.s.)  ⇒  R ← R ⊕ 2^i before reading R, for all i
Level change on input or output:     covered by bit faults on the output of the source or the input of the destination
Control command substitutions:       many are covered previously; others are write(addr, data) ⇒ output(addr, data), output(addr, data) ⇒ write(addr, data), read(addr) ⇒ input(addr), input(addr) ⇒ read(addr), sign extend expr to length i ⇒ extend expr to length i, and extend expr to length i ⇒ sign extend expr to length i; many others are not covered
Level change on status signal:       logical operations, comparisons, and condition codes ⇒ invert result
Shifter sign bit (shift right algebraic):  A op B  ⇒  (A ⊕ 2^n) op B, where n is the position of the sign bit in A, op is shift right algebraic, A is the value to shift, and B is the amount

Table 3.1.4: Summary of RTL fault model
Notes to the table: Only single-bit faults are considered; this is true for all "XOR" faults. Subtraction and unary minus are covered by the ALU carry faults because A − B is implemented as A + ¬B + 1, and −A is implemented as 0 + ¬A + 1. A fault in carry bit i is actually modeled by iteratively simulating each slice of the bit-slice chain and XORing the carry bit between slices i and i+1; the "± 2^i" term indicates that 2^i is added if the carry bit was 0 and became 1 due to the fault, and 2^i is subtracted if the bit was 1 and became 0.
3.1.3.5 Operation

The operation of ASPHALT is outlined in Figure 3.1.10 and broadly consists of four phases: reading the RTL code and the initial machine state, the golden run, the faulty runs, and post-processing.

begin
    read RTL code and initial machine state
    simulate golden run:
    begin
        initialize registers
        simulate instruction and keep RTL execution trace
        save final state and outputs
    end
    simulate faulty runs:
    for each RTL operation i in execution trace
        for each possible fault j on node i
        begin
            initialize registers
            simulate instruction with fault j on i
            save final state and outputs
        end
    post-processing:
    begin
        collapse error states
        report statistics
    end
end
Figure 3.1.10: Pseudo-code for ASPHALT operation

Reading the RTL Code and Initial State. ASPHALT starts by reading a description of the processor like the fragments shown in Figure 3.1.5. Then it inputs the values for all the visible registers captured from the actual processor just before the instruction to be faulted. It is also given access to the processor's memory image to obtain instruction and data values.

The Golden Run. To obtain a basis for comparison, ASPHALT first simulates the instruction without any faults. Such a fault-free standard is called a "golden" run. ASPHALT initializes itself by loading its internal copies of the visible registers
with the initial machine state. All other registers are set to zero. It then executes main and any routines called by main. During this execution, ASPHALT keeps an RTL execution trace, which is a list of all the RTL operations evaluated during the golden run. The final machine state and a list of all memory and I/O writes are saved. The actual processor memory or I/O space is not written to during the simulation.

The Faulty Runs. Now ASPHALT is ready to inject the RTL faults. It begins this process by sequentially stepping through the execution trace from the golden run. ASPHALT makes one faulty run for each possible fault for any given RTL operation. For example, if the current operation in the trace is the access of a 16-bit register from a 4-register file, 21 faulty runs are made: one for the XORing of each of the 16 bits, one for setting the register to all ones, one for setting it to all zeros, and one to access each of the three incorrect registers in the file. When each run completes, the final machine state and a list of memory and I/O writes are saved for that run. Alternatively, if the injected error causes one of the built-in fatal-error routines to be called, the type of error is saved: address, bus, program, privilege, or opcode trap.

Post-Processing (Error Collapsing). When all faulty runs are completed, there exists a list of machine states and outputs, one for each injected fault. It is quite probable that some of these states will be identical. When performing fault-injection studies, it would be pointless to inject the same error more than once, so the list is collapsed into a set of unique errors. There are two ways in which errors are considered identical: equivalent fatal errors or equivalent states.

If two injected faults cause the same fatal-error routine to be called, they are considered equivalent. For example, if two bit-flips in the instruction register cause the opcode trap to be invoked, the error states would be considered equivalent and would be identified only as an opcode-type error. It is also quite probable that two faults cause the exact same final machine state and outputs to be generated, i.e., all the bits in all the registers are the same, and the same data are written to the same memory and I/O addresses. In this case, these error states are considered equivalent and are identified by the state and outputs. A special case of this is the unique state which is identical to the state at the end of the golden run. The faults that generated this state are considered overwritten because they can never manifest into errors. A diagram showing the error collapsing process is shown in Figure 3.1.11.
Figure 3.1.11: Error collapsing
The diagram maps the injected RTL faults to their resultant states, which are then partitioned into overwritten faults, equivalent fatal errors, and equivalent error states. In this example, there are ten injected faults, three of which are overwritten, leaving seven non-overwritten faults. Of these seven, there are only five unique states because two faults produced the same fatal error, and two produced equivalent non-fatal error states. Only the equivalent fatal errors and equivalent error states need to be used in a fault-injection experiment.
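The collapsing step amounts to grouping per-fault outcomes by a key: the trap type for fatal errors, or the final state plus the output list otherwise, with outcomes identical to the golden run marked as overwritten. The C sketch below illustrates that grouping; the fixed-size outcome record and the linear search are simplifications assumed for the example and are not ASPHALT's actual data structures.

#include <stdio.h>
#include <string.h>

#define MAX_RUNS 64

/* Hypothetical outcome record for one faulty run. */
struct outcome {
    int  fatal;            /* nonzero: which trap (address, bus, ...)  */
    char state[32];        /* final machine state, serialized          */
    char outputs[32];      /* memory and I/O writes, serialized        */
};

static int same_outcome(const struct outcome *x, const struct outcome *y)
{
    if (x->fatal || y->fatal)
        return x->fatal == y->fatal;                 /* equivalent fatal errors */
    return strcmp(x->state, y->state) == 0 &&
           strcmp(x->outputs, y->outputs) == 0;      /* equivalent error states */
}

/* Collapse faulty-run outcomes against the golden run; returns the number of
 * unique errors and counts the overwritten faults. */
static int collapse(const struct outcome *golden,
                    const struct outcome runs[], int n_runs,
                    struct outcome unique[], int *overwritten)
{
    int n_unique = 0;
    *overwritten = 0;
    for (int i = 0; i < n_runs; i++) {
        if (same_outcome(&runs[i], golden)) {        /* identical to golden run */
            (*overwritten)++;
            continue;
        }
        int found = 0;
        for (int j = 0; j < n_unique && !found; j++)
            found = same_outcome(&runs[i], &unique[j]);
        if (!found)
            unique[n_unique++] = runs[i];
    }
    return n_unique;
}

int main(void)
{
    struct outcome golden = { 0, "r1=5 cs=0", "" };
    struct outcome runs[] = {
        { 0, "r1=5 cs=0", "" },       /* overwritten                */
        { 0, "r1=7 cs=0", "" },       /* unique error state         */
        { 0, "r1=7 cs=0", "" },       /* duplicate of the above     */
        { 5, "",          "" },       /* opcode trap                */
        { 5, "",          "" },       /* same trap: collapses       */
    };
    struct outcome unique[MAX_RUNS];
    int overwritten;
    int n = collapse(&golden, runs, 5, unique, &overwritten);
    printf("unique errors: %d, overwritten faults: %d\n", n, overwritten);  /* 2, 1 */
    return 0;
}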
The following section will describe experiments used to verify the preceding RTL fault model and injection methodology.

3.1.4 EXPERIMENTATION

The main goal of the experiments presented in this chapter is to determine the extent to which the RTL fault model covers actual hardware faults. To estimate the error manifestations of actual hardware faults, a gate-level simulation model was used. Any of the injection methods surveyed earlier could have been used for fault injection at the gate level, but the advantage of full control over fault types and injections made software simulation the preferred choice.

Figure 3.1.12 shows an overview of the experimentation process. First, both the gate-level and RTL models are initialized with the same instruction to be faulted and the same machine state. Second, both models are exhaustively injected with
transient faults: gate-level faults into the gate-level model and RTL faults into the RTL model. This produces a set of machine states for each model. Finally, these two sets can be compared.
Figure 3.1.12: Overview of experiments
The instruction to be faulted and the initial machine state (registers and memory) are supplied to both a gate-level simulation and an RTL simulation. Each produces a set of resultant machine states, and the two sets are then compared.
Figure 3.1.13 shows a Venn diagram of this comparison. The intersection between the two sets is the covered faults. Removing the covered faults from the gate-level set leaves the non-covered faults. Removing them from the RTL set leaves the overhead, those faults generated by the RTL model but not produced by the hardware. The goal, of course, is to maximize the coverage while minimizing non-coverage and overhead.
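Stated in set notation (a restatement of the definitions above, not a formula taken from the chapter): if G is the set of equivalent error states produced by the gate-level model and R is the set produced by the RTL model, then

    coverage = |G ∩ R| / |G|        overhead = |R − G| / |R|

As an illustration using the "a"-instruction data reported later in Table 3.1.6, |G| = 711 and |G ∩ R| = 692, giving a coverage of 692 / 711 ≈ 0.973; if the overhead is computed against the 834 RTL error states for that instruction, it is (834 − 692) / 834 ≈ 0.17.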
3.1.4.1 ROMP Overview

The IBM RISC-Oriented Micro-Processor (ROMP) was chosen as a basis for the experiments for a number of reasons:

• The gate-level model had already been developed by another researcher [9]. Therefore, since it was developed by a disinterested party, the model was not intentionally or unintentionally written to conform to the implementation assumptions of the RTL fault model.
• There exists a fault-injection test-bed based on the IBM RT [3]. The IBM RT employs the ROMP as its CPU.
• The ROMP contains a number of interesting features: RISC technology, pipelining, both hard-wired and micro-coded control logic, and an asynchronous instruction pre-fetch buffer.
Salient architectural features of the ROMP include sixteen 32-bit general-purpose registers and a four-gigabyte virtual address space broken into 16 segments of 256 megabytes each. The following description of the processor is taken directly from [17]:

The ROMP chip is a pipelined processor capable of executing a new instruction every cycle. While one instruction is being executed, the next instruction can be decoded, and following instructions can be fetched from memory and buffered. A one-cycle execution rate is prevented when either an instruction requiring multiple execution cycles is executed, or a hold-off occurs [i.e., when the processor waits for an instruction or data from memory]. Most instructions take only one cycle, but Store and I/O Write instructions each take two cycles...
The ROMP processor is partially microprogrammed. It uses ROM for control during the execution cycles, but hardwired control logic is used for instruction prefetching and for memory data requests, since those operations are usually overlapped with the execution of other instructions. [Figure 3.1.14] shows a block diagram of the ROMP processor data flow.
Figure 3.1.14: ROMP data flow
Taken from [9], as adapted from the original [17]. Major sections include the instruction pre-fetch buffer (IPB), the micro-instruction fetch (MIF), data fetch and storage (DFS), the ALU and shifter (ALU), and the ROMP Storage Channel interface (RSCI).
Instruction Fetching

The instruction fetch area includes the Instruction Prefetch Buffers (IPBs), the IPB Multiplexer (MUX), and the Instruction Address Register (IAR) and its incrementers. Four IPBs are provided to keep the processor supplied with instructions. Instructions are prefetched whenever an IPB is available and there is no other request for use of the RSC [ROMP Storage Channel]. Every cycle, each IPB that is waiting for an instruction ingates the data from the RSC. During the following cycle, the tag associated with that data is examined to determine if it was addressed to any of the IPBs. If so, then that IPB will hold its contents until that instruction word is gated by the MUX to the decode circuits...

Execution Unit

The execution unit includes the register file, the AI and BI latches, the ALU and shifter, and the ALU output latch. It also includes the MQ [Multiplier Quotient] register and the Condition Status register, which are both SCRs [System Control Registers]. To support a one-cycle execution rate, a 4-port register file is used... Two of the ports of the register file are used to read two operands simultaneously into the AI and BI latches. Another port is used to write the result back from the ALU output latch, and the fourth port is used to write data from memory or I/O... The ALU is used for the execution of all arithmetic and logical instructions, and for address calculations when executing Load, Store, I/O and Branch instructions...

RSC Interface

The RSC interface consists of the request area and the receive area. The request area arbitrates for use of the RSC, and it receives the acknowledgment signals after requests are sent. [Read requests are sent with a 5-bit tag which uniquely identifies the data when they return from the memory or I/O.] The RSC receive area contains buffer registers for one incoming data word and tag. Each cycle, they capture whatever data word and tag is on the RSC... During the following cycle, the tag is examined to determine if the word is addressed to the ROMP, and if so, whether it is an instruction or data. If it is an instruction, the tag will also identify which IPB has been assigned to it. If it is data, the tag will point to one of two descriptors which will control the alignment by the formatter and store it into the proper location in the register file...
Control Unit

The control unit includes the microcode ROM, the Control Register (C Reg), the instruction decoders, and the ROM and register file address latches and control circuits. It also includes circuits for detecting and handling interrupts, machine checks, and program checks, as well as the SCRs associated with these events. The ROM contains 256 control words of 34 bits each... The control words are needed only for the execution cycles of the instructions, since instruction prefetching is controlled by hardwired circuits, and the last control word of each instruction controls the decoding of the next instruction... The control words are fetched from the ROM into the C Reg during the cycle prior to the one when they are executed. During the last execution cycle of any instruction, the next instruction is selected from one of the IPBs by the MUX and decoded. This decode cycle is used to simultaneously fetch a control word from the ROM and fetch the two operands from the register file into the AI and BI latches. The operation code is taken directly from the output of the MUX and used as the ROM address... Also during the decode cycle, the register address for the result, called the destination address, is put into a two-stage pipeline to be used two cycles later for storing the result from the ALU output latch into the register file.

The ROMP provides 118 instructions in ten classes, as shown in Table 3.1.5. The branch instructions include "Branch with Execute" versions which allow overlap of the fetch of the branch target instruction with the instruction following the branch instruction (called the subject instruction). This eliminates dead cycles that would normally occur when flushing the pipeline.

    Instruction class        Number of instructions
    Memory Access                    17
    Address Computation               8
    Branch and Jump                  16
    Traps                             3
    Moves and Inserts                13
    Arithmetic                       21
    Logical                          16
    Shift                            15
    System Control                    7
    Input and Output                  2
    Total                           118

Table 3.1.5: ROMP instruction classes

3.1.4.2 The ROMP Hardware Model

The gate-level model of the ROMP was written in Verilog, a hardware description language and simulator, in approximately 2600 lines of code. To generate error manifestations from the gate-level model, there must be a gate-level fault model and a methodology of fault injection.

Fault Model. There are two portions of the fault model: gate-level faults and behavioral faults for the registers. The gate-level faults are single fixed-at-0 and fixed-at-1 faults. Fixed-at faults are distinguished from stuck-at faults by being transient as opposed to permanent. The lines are fixed to a logical zero or one value for one clock cycle using a forcing feature of the Verilog simulator (a "feature" actually intended to be used as a debugging tool). Faults are injected into only one line at a time; no multiple faults are simulated. These faults are injected exhaustively into all nets in the model (spatial) and during all clock cycles of the instruction at the fault-injection location (temporal).

Since the registers are modeled at a behavioral level, a behavioral fault model is necessary. The model is based on the list of transient hardware faults from the "Register" section of Table 3.1.4. The following faults are injected exhaustively:

Missed load: The register is not loaded when it should be.
Extraneous load: The register is loaded when it should not be. (Recall that these faults are not directly emulated in the RTL fault model.)
Level change in storage: A bit in the register is flipped.

It is not necessary to model level changes in the input or output lines, because they are accessible at the gate level. The procedure in Figure 3.1.15 is used to inject faults into a certain net or register. The process is repeated for all nets and registers in the Verilog model.
begin
    initialize machine state
    simulate instruction without faults (golden run)
    record state and outputs at each cycle
    for i = each cycle during instruction
    begin
        for j = each fault for the current net or reg
        begin
            initialize machine state
            simulate instruction with fault j in cycle i
            record state and outputs
        end
    end
end

Figure 3.1.15: Pseudo-code for gate-level fault injection
3.1.4.3 Results

Faults were injected into the ROMP "a" instruction (add) following the procedure outlined above. The results are shown in Table 3.1.6. First, by looking at the row labeled "Total gate level," note that the gate-level injection into the Verilog model produced 711 error states. Of these, 692 were successfully reproduced by the simpler RTL model.

Recall from the development of the RTL fault model that the faults for the control section were more general than those for the data section. The effects of this are evident in the results. Almost all (0.9970) of the data errors were reproduced, but only about two thirds (0.6885) of the control errors were covered. Fortunately, there are only 61 control errors compared to 669 data errors, so the average (0.9733) is weighted heavily by the data coverage.

An error coverage of over 0.97 is encouraging for this simple instruction, but how do these results compare to those for different opcodes? Similar experiments were done using 91 of the ROMP's 118 instructions by injecting over 1.3 million gate-level faults and 176 thousand RTL faults. The results are shown in Figure 3.1.16. The mean coverage is 0.970 with a standard deviation of 0.0138. Also note that the mean overhead is 0.205, indicating that about 20% of the errors generated by ASPHALT were not generated by the Verilog model. Detailed data for each instruction may be found in [40].

A Study of the Outlying Data. This section looks at the instructions which produced the lowest coverage values. The sources of repeating non-covered faults are discussed. To improve the coverage values, the RTL model is modified based on the observations, and the effects of the new model are presented.
    Source                         Total faults    Faults causing    Equivalent      Error states covered
                                   injected        fatal errors      error states    by RTL injection
    Gate level, control section     4452            60                 61            42 (68.85%)
    Gate level, data section        7686             4                669           667 (99.70%)
    Total gate level               12138            64                711           692 (97.33%)
    RTL level                       1718            40                834           N.A.

Table 3.1.6: Injection into "a" instruction
The first three rows show data from injection into the gate-level model: the first row shows the results of injection into the control section only; the second, the data section; and the third, both. The fourth row shows data from injection with ASPHALT. The first column of data shows the total number of faults injected over all cycles of the "a" instruction. The second column shows the number of those faults which caused fatal errors such as addressing or opcode traps. The third column shows the number of equivalent error states, as illustrated in Figure 3.1.11. The fourth column shows how many of the gate-level equivalent error states were also generated by ASPHALT.
From the graphs of the data, the four leftmost points are immediately identified as the most extreme outliers. These values are all between 0.90 and 0.94, whereas the other 87 are between 0.94 and 0.99. These values were produced by the four branch-and-link instructions: branch and link immediate (bali), branch and link (balr), and the two versions which execute the subject instruction (balix and balrx). Is there a property of the RTL fault model or the RTL ROMP model which accounts for these relatively low coverage values?

The branch-and-link instructions are so called because they copy the value of the updated IAR into a link register before branching, thereby providing a return address to be used at the end of a subroutine. An analysis of the resultant machine states from the gate-level simulation which were not covered by the RTL fault model reveals two separate effects: one for bali and balix and another for balr and balrx. These are discussed in turn below.
(Actually, there are six branch-and-link instructions in the ROMP. The two branch and link absolute instructions, bala and balax, were not implemented in the Verilog model.)
Figure 3.1.16: Dot chart of coverage data for 91 instructions
The opcodes are listed along the left-hand side, with coverage plotted along the horizontal axis.
First, in the set of non-covered states from both bali and balix, there were 32 single-bit faults in both IAR and the link register. These faults were caused by single-bit faults in IAR and fixed-at faults in the ALU inputs. If these 32 errors were covered in bali, for example, the coverage would rise from 0.915 to 0.962. Figure 3.1.17 shows the key RTL transfers that are executed by the RTL model during bali. From this sequence, it is evident that no single RTL fault can affect both IAR and r15, the link register, due to the temporary variable instruction_address used in the RTL code. By analyzing the microcode from the Verilog model, one can see that no such temporary variable is used. Instead, the following two transfers are made in parallel:

    r15 ← IAR + 4                             (3.1.1)
    IAR ← IAR + immediate_value               (3.1.2)
Thus, a single-bit fault can affect both r15 and IAR. To compensate for this discrepancy, a change is introduced in the RTL model as shown in Figure 3.1.18 to force a single-bit fault in the temporary register to affect both r15 and IAR.

    1. instruction_address ← IAR
    2. IR ← readhw(IAR)
    3. IAR ← IAR + 2
    4. word2 ← readhw(IAR)
    5. IAR ← IAR + 2
    6. r15 ← IAR
    7. IAR ← instruction_address + immediate_value
       (where immediate_value is a simple function of word2 and 5 bits of the IR)

Figure 3.1.17: Transfers due to RTL model for bali instruction
Line 1 saves IAR to be used to compute the branch address later in line 7. Lines 2 and 3 fetch the opcode and update IAR (readhw(x) is a function to read 16 bits from address x). Lines 4 and 5 fetch the remainder of the immediate operand because bali is a 32-bit instruction. Line 6 saves the link address in r15, and line 7 updates IAR with the branch address. Any single-bit failure that would affect both IAR and r15 would have to occur before IAR is saved in line 1, but that would cause a different instruction to be fetched and executed altogether.
    1. instruction_address ← IAR
    2. IR ← readhw(IAR)
    3. IAR ← IAR + 2
    4. word2 ← readhw(IAR)
    5. IAR ← IAR + 2
    6. r15 ← instruction_address + 4
    7. IAR ← instruction_address + immediate_value
       (where immediate_value is a simple function of word2 and 5 bits of the IR)

Figure 3.1.18: Transfers due to modified RTL model for bali instruction
Line 6 is modified to use the saved IAR value. Now, a single-bit failure in the instruction_address register immediately prior to line 6 affects both IAR and r15.

Now, the balr and balrx instructions are considered. In these, the predominant non-covered gate-level error is multiple-bit faults in IAR due to single-bit faults in the ALU inputs. A closer inspection reveals that these multiple faults are due to an erroneous carry being generated which propagates through several bits.
The balr and balrx instructions use the contents of a register as the branch address. An analysis of the microcode in the gate-level model reveals that the ALU is being used to transfer the value from the register to IAR by adding an immediate zero to the value. Since an add is being executed, faults which insert ones into the immediate zero may cause a carry to propagate along a sequence of ones in the other input. It is also observed that the branch-on-bit instructions where a register provides the branch address (bbr, bbrx, bnbr, and bnbrx) perform the same transfers through the ALU and exhibit similar non-covered error states. The RTL code, of course, does not use an addition of zero to transfer the register value to IAR. To mimic the gate-level behavior, however, a constant zero is added to the register during the transfer to IAR.

The next value that stands out in Figure 3.1.16 is that for the instruction onec, with a coverage proportion of only 0.950. All other arithmetic instructions have coverage values in excess of 0.970. Onec is a unary operation to take the one's complement (bit-wise negation) of a register. Recall the fault model for the ALU derived earlier. To reduce overhead, the fault model for unary operations was a special case. For example, the only alternative operation for negation is a unary minus or a direct passage of the argument. In the gate-level model, however, onec is implemented as subtraction from an immediate zero without the carry input being set. In other words, A − B is implemented as A + ¬B + 1; ¬B is implemented as 0 + ¬B + 0. To mimic this behavior in the RTL description, an immediate zero is added to the negated operand. Similarly, the other two unary instructions, twoc (two's complement) and abs (absolute value), are modified by adding or subtracting an immediate zero as required.
If the new data for the 11 modified instructions are substituted, the mean coverage for all 91 instructions is raised slightly from 0.970 to 0.972, and the standard deviation is reduced from 0.0142 to 0.00889. The more salient effect is that the extreme nature of the outlying data has been reduced to a large extent. Compare the resulting dot chart in Figure 3.1.19 to the previous data. The following section examines the time required for simulation at the RTL and gate levels.

3.1.4.4 Time Requirements

The experiments were performed on a DECstation 5000. The average CPU time required to simulate one instruction, and thus one fault, with the Verilog simulator is approximately 1.65 seconds. There were 1,327,004 gate-level faults injected across the 91 implemented instructions, which produced 61,286 unique errors, for an injection-to-error ratio of almost 21.7:1. Thus, the average CPU time devoted to producing one error is approximately 35.8 seconds.

In contrast, the average CPU time required to simulate one fault with ASPHALT is 24.2 milliseconds. This represents a per-fault gate-level to RTL simulation time ratio of 68.2:1. There were 176,245 RTL faults injected across the 91 implemented instructions, which produced 75,052 unique errors, for an injection-to-error ratio of 2.35:1. Thus, the average CPU time devoted to producing one error is approximately 56.9 milliseconds. Injecting the 1.3 million faults via Verilog consumed approximately 609 CPU hours. Injecting the 176 thousand faults via ASPHALT consumed approximately 1.19 CPU hours. Thus, the overall average ratio of gate-level to RTL simulation time is 512:1.

3.1.5 CHAPTER SUMMARY

Modern computer applications often require high levels of dependability. To meet these requirements, computer architects often use fault-detection and fault-tolerance mechanisms. In order to validate these mechanisms, these engineers often use fault-injection techniques. Major sources of faults are transient and intermittent hardware faults. Current methods to inject, simulate, or emulate these faults have significant time, cost, or accuracy limitations.
DECstation is a trademark of Digital Equipment Corporation.
Figure 3.1.19: Dot chart of coverage with new RTL code. Compare to Figure 3.1.16.
This chapter has discussed a new method to emulate faults using software-implemented fault injection (SWIFI) which addresses these concerns. By studying how faults propagate in general sequential networks, it was determined that reverse fault emulation via SWIFI could emulate a large majority of the faults that occur in processors. In order to determine how to emulate those faults, a fault model based on a register-transfer-level (RTL) description of the processor was proposed. This fault model was developed based on a proposed list of possible hardware faults in all sections of a processor: data storage and transfer, data processing units, and control. The goals of this model were to (1) obtain high coverage, the proportion of errors generated by gate-level faults that are successfully reproduced by the RTL fault model, and (2) minimize overhead, the proportion of errors that are produced by the RTL fault model but not by the gate-level model.

A prototype RTL simulator and fault injector, ASPHALT, was developed to obtain experimental data. As a paradigm for future fault-injection studies, a study of the IBM RISC-Oriented Micro-Processor (ROMP) was proposed. An RTL model was developed, and a gate-level model was used for comparison purposes. Faults in all implemented instructions were injected exhaustively. Coverage ranged from 0.90 to 0.99 with a mean of 0.97. Overhead ranged from 0.10 to 0.40 with a mean of 0.21.

The coverage value of 0.97 was derived from a naive RTL processor model which was coded with no knowledge of the implementation. Considering some simple implementation details allows a modest increase in coverage. To accomplish this, the instructions which exhibited the lowest coverage values were examined. A study of the microcode for those instructions revealed that certain RTL statements could be slightly revised to better emulate the actual data transfers. Minor changes for eight branch instructions and three unary arithmetic instructions increased the minimum coverage from 0.90 to 0.94.

An analysis of the time requirements found that it took an average of 1.65 CPU seconds to simulate a fault at the gate level using the Verilog simulator, but only 24.2 milliseconds when using ASPHALT, the RTL simulator, a ratio of 68:1. Exaggerating this effect is the fact that the injection-to-error ratio at the gate level is almost 22:1 but only 2.4:1 with RTL. This leads to an overall reduction of effort of 512:1.

3.1.5.1 Recommendations

In general, based on the above results, the following procedure is recommended for a fault-injection study:
1. Develop an RTL model of the processor.

2. If only preliminary results are requested or the specifications for coverage are not stringent, use only the RTL fault simulator with a SWIFI tool such as FIAT or FERRARI to perform the studies. Random sampling of instructions in the workloads might also be used to decrease injection time.

3. If the specifications are more demanding, obtain a gate-level description. Follow the example experiments performed on the ROMP to determine average coverage and overhead statistics. Based on these results, a partial gate-level simulation may be used, or the RTL fault model may be enhanced based on the gate-level knowledge. In any case, a substantial savings over exclusive gate-level modeling is to be expected.

3.1.5.2 Contributions

The work described in this chapter has contributed the following to the field of fault injection:

• A new fault model which operates on an RTL description of a processor and is based on lower-level fault models.

• An automated tool which injects these faults, producing error patterns over 500 times faster than a gate-level model of the same processor.

• Experimental results which show that this abstraction is capable of reproducing over 97% of the error patterns generated by a gate-level model of the chosen processor.

• A paradigm for developing future fault-injection studies on future processors.
3.1.5.3 Future Work

Future research can be carried out in at least three broad areas: additional verification, enhancements, and applications. First, there are several experimental parameters that were not varied due to time constraints. In order to more fully verify and quantify the RTL fault model, some additional experiments may be performed:

• The procedure can be repeated on other processors to show that the fault model has generality across processors. These results will be especially interesting when the experiments are performed on higher-performance CPUs with the latest performance-enhancing devices.

• The procedure can be repeated with various RTL models coded by independent persons. This will show any sensitivity to coding style.

Second, there are several alternatives for enhancing the RTL fault model:

• A broader spectrum of architectural knowledge, such as that applied to the branch and unary operations, could be applied to the RTL model for additional data.

• Similarly, if architectural knowledge is available, the RTL fault model itself may be tailored for a particular processor. This might include a richer set of alternatives for ALU operations based on microcode analysis, for example.

• Computer synthesis techniques might be applied to determine reasonable data and control paths for the processor being simulated. This could perhaps provide more intelligent control faults, increasing coverage and decreasing overhead.

Third, of course, there is a plethora of possible applications for this technology:

• The speed and accuracy advantages offered by this new fault model make fault-injection studies much more feasible. ASPHALT could be integrated into FIAT, FERRARI, or a similar system to perform these studies. The data gathered can not only facilitate verification of existing fault-detection and fault-tolerance techniques, but can also be used in the design cycle to help develop the next generation of highly dependable computing systems.

• Similarly, ASPHALT could be integrated into a robustness benchmark [33] to help derive reliability figures-of-merit for computer systems.
REFERENCES

[1] W. Richards Adrion, Martha A. Branstad, and John C. Cherniavsky. Validation, Verification, and Testing of Computer Software. ACM Computing Surveys, 14(2):159-192, June 1982.

[2] J. Arlat, Y. Crouzet, and Jean-Claude Laprie. Fault-Injection for Dependability Validation of Fault-Tolerant Computing Systems. 19th International Symposium on Fault-Tolerant Computing (FTCS-19), pages 348-355, June 1989.

[3] James H. Barton, Edward W. Czeck, Zary Segall, and Daniel P. Siewiorek. Fault Injection Experiments Using FIAT. IEEE Transactions on Computers, 39(4):576-582, April 1990.

[4] Dhananjay Brahme and Jacob A. Abraham. Functional Testing of Microprocessors. IEEE Transactions on Computers, C-33(6):475-485, June 1984.

[5] Ram Chillarege and Nicholas S. Bowen. Understanding Large System Failures—A Fault Injection Experiment. Technical Report RC 14233, IBM Corporation Research Division, IBM T.J. Watson Research Center, Yorktown Heights, NY, 1989.

[6] Gwan S. Choi and Ravishankar K. Iyer. FOCUS: An Experimental Environment for Fault Sensitivity Analysis. IEEE Transactions on Computers, 41(5), May 1992.

[7] Gwan S. Choi, Ravishankar K. Iyer, and Victor A. Carreno. Simulated Fault Injection: A Methodology to Evaluate Fault Tolerant Microprocessor Architectures. IEEE Transactions on Computers, 39(4):486-490, April 1990.

[8] J. Cusick. SEU Vulnerability of the ZILOG Z-80 and NSC-800 Microprocessors. IEEE Transactions on Nuclear Science, NS-32(6):4206-4211, December 1985.

[9] Edward Czeck. On the Prediction of Fault Tolerant Behavior Based on Workload. Ph.D. thesis, Carnegie Mellon University, December 1990.

[10] Edward Czeck and Daniel Siewiorek. Observations on the Effects of Fault Manifestation as a Function of Workload. IEEE Transactions on Computers, 41(5):559-564, May 1992.

[11] Thomas R. Dilenno, David A. Yaskin, and James H. Barton. Fault Tolerance Testing in the Advanced Automation System. 21st International Symposium on Fault-Tolerant Computing (FTCS-21), pages 18-25, June 1991.

[12] R. Duba and R.K. Iyer. Transient Fault Behavior in a Microprocessor: A Case Study. Proceedings of the 1988 IEEE International Conference on Computer Design: VLSI in Computers and Processors, October 1988.
[13] Joanne Bechta Dugan. On Measurement and Modeling of Computer Systems Dependability: A Dialog Among Experts. IEEE Transactions on Computers, 39(4):506-509, April 1990.

[14] Klaus Echtle and Yinong Chen. Evaluation of Deterministic Fault Injection for Fault-Tolerant Protocol Testing. 21st International Symposium on Fault-Tolerant Computing (FTCS-21), pages 418-425, June 1991.

[15] Kumar Goswami and Ravishankar K. Iyer. DEPEND: A Simulation-Based Environment for System Level Dependability Analysis. Technical Report UILU-ENG-92-2217/CRHC-92-11, Coordinated Science Laboratory, College of Engineering, University of Illinois at Urbana-Champaign, June 1992.

[16] Ulf Gunneflo, Johan Karlsson, and Jan Torin. Evaluation of Error Detection Schemes Using Fault Injection by Heavy-ion Radiation. 19th International Symposium on Fault-Tolerant Computing (FTCS-19), pages 340-347, June 1989.

[17] IBM RT Personal Computer Technology. Technical Report SA23-1057, IBM, 1986.

[18] Ravi K. Iyer and D.J. Rossetti. A Measurement-Based Model for Workload Dependence of CPU Errors. IEEE Transactions on Computers, C-35:511-519, June 1986.

[19] Richard L. Johnson Jr., Sherra E. Diehl-Nagle, and John R. Hauser. Simulation Approach for Modeling Single Event Upsets on Advanced CMOS SRAMS. IEEE Transactions on Nuclear Science, NS-32(6):4122-4127, December 1985.

[20] Ghani A. Kanawati, Nasser A. Kanawati, and Jacob A. Abraham. FERRARI: A Tool for the Validation of System Dependability Properties. 22nd International Symposium on Fault-Tolerant Computing (FTCS-22), pages 336-344, July 1992.

[21] Ghani A. Kanawati, Nasser A. Kanawati, and Jacob A. Abraham. A High-level Error Model Automatic Extractor. Technical Report UTCERC-TR-JAA93-01, Computer Engineering Research Center, University of Texas at Austin, January 1993.

[22] Ramachandra P. Kunda, Prakash Narain, Jacob A. Abraham, and Bharat Deep Rathi. Speed Up of Test Generation using High-Level Primitives. 27th ACM/IEEE Design Automation Conference, pages 594-599, 1990.

[23] Jean-Claude Laprie. Dependable Computing and Fault Tolerance: Concepts and Terminology. 15th International Symposium on Fault-Tolerant Computing (FTCS-15), pages 2-11, June 1985.

[24] T. May and M. Woods. Alpha-Particle-Induced Soft Errors in Dynamic Memories. IEEE Transactions on Electron Devices, ED-26:2-9, January 1979.

[25] Ghassem Miremadi, Johan Karlsson, Ulf Gunneflo, and Jan Torin. Two Software Techniques for On-line Error Detection. 22nd International Symposium on Fault-Tolerant Computing (FTCS-22), pages 328-335, July 1992.

[26] Donald K. Nichols et al. Trends in Parts Susceptibility to Single Event Upset from Heavy Ions. IEEE Transactions on Nuclear Science, NS-32(6):4189-4194, December 1985.

[27] Thomas M. Niermann, Wu-Tung Cheng, and Janak H. Patel. PROOFS: A Fast, Memory Efficient Sequential Circuit Fault Simulator. 27th ACM/IEEE Design Automation Conference, pages 535-540, 1990.

[28] Forrest E. Norrod. An Automatic Test Generation Algorithm for Hardware Description Languages. 26th ACM/IEEE Design Automation Conference, pages 429-434, 1989.
[29] Joakim Ohlsson, Marcus Rimen, and Ulf Gunneflo. A Study of the Effects of Transient Fault Injection into a 32-bit RISC with Built-in Watchdog. 22nd International Symposium on Fault-Tolerant Computing (FTCS-22), pages 316-325, July 1992.

[30] Michael Peercy and Prithviraj Banerjee. Design and Analysis of Software Reconfiguration Strategies for Hypercube Multicomputers under Multiple Faults. 22nd International Symposium on Fault-Tolerant Computing (FTCS-22), pages 448-455, July 1992.

[31] Z. Segall, D. Vrsalovic, D. Siewiorek, D. Yaskin, J. Kownacki, J. Barton, R. Dancey, A. Robinson, and T. Lin. FIAT—Fault Injection Based Automated Testing Environment. Proceedings of the 18th International Symposium on Fault-Tolerant Computing (FTCS-18), pages 102-107, The Computer Society of the IEEE, Washington, DC, June 1988.

[32] Daniel P. Siewiorek and Robert S. Swarz. The Theory and Practice of Reliable System Design. Digital Press, DEC, Bedford, MA, 1982.

[33] Daniel Siewiorek, John Hudak, Byung-Hoon Suh, and Zary Segall. Development of a Benchmark to Measure System Robustness: Experiences and Lessons Learned. 23rd International Symposium on Fault-Tolerant Computing (FTCS-23), July 1993.

[34] Daniel P. Siewiorek and Robert S. Swarz. Reliable Computer Systems: Design and Evaluation. Digital Press, DEC, Bedford, MA, second edition, 1992.

[35] Satish M. Thatte and Jacob A. Abraham. User Testing of Microprocessors. Spring '79 Compcon, 18th IEEE Computer Society International Conference, pages 108-114, 1979.

[36] Satish M. Thatte and Jacob A. Abraham. A Methodology for Functional Level Testing of Microprocessors. 8th International Symposium on Fault-Tolerant Computing (FTCS-8), pages 90-95, June 1978.

[37] Satish M. Thatte and Jacob A. Abraham. Test Generation for General Microprocessor Architectures. 9th International Symposium on Fault-Tolerant Computing (FTCS-9), pages 203-210, June 1979.

[38] P.C. Ward. Behavioral Fault Simulation in VHDL. 27th ACM/IEEE Design Automation Conference, pages 587-593, 1990.

[39] Luke T. Young and Ravishankar K. Iyer. A Hybrid Monitor Assisted Fault Injection Environment. Technical Report UILU-ENG-92-2207/CRHC-92-04, Coordinated Science Laboratory, College of Engineering, University of Illinois at Urbana-Champaign, February 1992.

[40] Charles R. Yount. The Automatic Generation of Instruction-Level Error Manifestations of Hardware Faults: A New Fault-Injection Model. Ph.D. thesis, Carnegie Mellon University, May 1993.
SECTION 3.2
REACT: An Integrated Tool for the Design of Dependable Computing Systems

Jeffrey A. Clark and Dhiraj K. Pradhan

3.1 Introduction

The REliable Architecture Characterization Tool (REACT) is a generalized software testbed for analyzing a variety of fault-tolerant multiprocessor systems. REACT abstracts a system at the architectural level and performs automated life testing through simulated fault injection to accurately and efficiently measure dependability. It integrates detailed system, workload, and fault/error simulation models that have been derived, in part, from the published results of computer performance studies and low-level fault-injection experiments. This chapter introduces REACT and discusses the theory behind its approach toward dependability modeling. The system, workload, and fault/error models it employs are developed in the next three sections. Section 4 describes the implementation of the testbed, including its basic operation and the derivation of its dependability measures. Section 5 explains the approach taken to verify and validate REACT. Performance limitations of the tool are considered in Section 6. Appendix 1 contains an overview of the REACT software. Sample input and output files appear in Appendix 2.
3.2 System Model

REACT will analyze most systems that can be represented by the model pictured in Figure 1. This class of architectures consists of one or more processor modules (P) interconnected via buses (B) to one or more memory modules (M) through a block of fault-tolerance mechanisms. Any number of processors and memories may be specified, and each can be designated as initially active, or as a hot or cold standby spare. Homogeneous groups of processors or memories can be defined in which all modules operate redundantly. A group of processors executes the same workload in lock-step synchronization. Memories in a group have identical contents and are accessed simultaneously. Multiple, non-overlapping groups may coexist within the system.

Figure 1: REACT System Model (processor modules P connected via buses B, through a block of fault-tolerance mechanisms and an interconnect, to memory modules M)

Processor and memory modules are abstracted at the architectural level. The operation of each processor is governed by a stochastic model discussed later in this chapter. The state of a processor determines what values are driven onto its address and data buses during memory accesses. Processor state changes when it experiences a fault or when an erroneous value is read from memory (through the fault-tolerance mechanisms). The state of a memory is defined by the contents of its bit-array. The functionality of its addressing logic is simulated on every memory access. Memory state changes when it experiences a fault or when an erroneous value is written into it by the processor (through the fault-tolerance mechanisms). In order to reduce the complexity of the system model, logical "0" and "1" are not differentiated. Individual bits may take either an error-free or erroneous value. Memory depth (N_words) is variable up to 256 Mwords per module, and a 32-bit word width has been assumed for memory and all data paths. Although caches are not explicitly represented within the REACT system model, the workload model it employs does account for the influence of caches on memory error latency and propagation. More detailed modeling of the processor and memory modules is possible, but may result in longer simulation run-times.

The fault-tolerance mechanisms provide the hardware necessary to detect, correct, or mask errors during memory accesses, and to reconfigure the system when modules fail. Hardware components typically found in this block of mechanisms include majority voters, comparators, switches, and error control codes. The operation of these mechanisms is simulated with a set of six software routines that delineate how addresses and data are transformed by the hardware when passing between the processors and memories during an access. Users of REACT have complete freedom in specifying the number, functionality, and interconnection of these mechanisms. This gives the system model the flexibility needed to represent a variety of multiprocessor architectures that utilize passive, active, or hybrid redundancy to achieve fault tolerance. Some examples of the systems that can be analyzed with REACT are:

• N-modular redundancy: C.vmp [31], FTMP [19], Integrity S2 [20]
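As a concrete illustration of the kind of transformation such mechanisms apply, the sketch below shows a bit-wise majority voter of the sort a triplicated memory group might use on the data path back to a processor. It is only an example; REACT's actual six-routine interface is not reproduced here.

```python
# Illustrative bit-wise majority voter for three redundant 32-bit memory words,
# the sort of fault-tolerance mechanism REACT can model on the data path.
# This is a stand-alone example, not REACT's actual routine interface.

def majority_vote(word_a, word_b, word_c):
    """Return the bit-wise 2-of-3 majority of three words."""
    return (word_a & word_b) | (word_a & word_c) | (word_b & word_c)

# A single erroneous copy is masked; two matching erroneous copies are not.
print(hex(majority_vote(0xDEADBEEF, 0xDEADBEEF, 0x00000000)))  # 0xdeadbeef
print(hex(majority_vote(0xDEADBEEF, 0x00000000, 0x00000000)))  # 0x0
```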
3.3 Workload Model

A synthetic workload is assumed for which the processors continually perform instruction cycles consisting of up to three memory references plus the simulated execution of an instruction. Real code and data are not used by REACT, but errors will propagate throughout the system as if an application program were actually being executed. Only those instruction cycles that change the state of the system, by propagating or overwriting errors, need to be simulated. It would be necessary to simulate every instruction, regardless of its impact, if real programs were to be used. The computational expense of simulating possibly trillions of instructions over the operational life of each system is obviously unacceptable.

Behavior of the application workload is specified by the mean instruction execution rate (R_instr), the mean rates at which main memory is read (R_Mread) and written (R_Mwrite) per instruction, and a locality-of-reference model. The following memory references are assumed to take place during every instruction cycle:

• One access to read the instruction
• One access to read data with probability P_data read
• One access to write data with probability P_data write

Memory references in real systems can access words in either the cache, located within each processor, or the (main) memory module. Because caches are not explicitly modeled, references to them are not simulated. However, the effect of accessing erroneous words in the cache is represented at a higher level of abstraction by the processor fault model, which is described in the next section. Main memory is accessed only when the cache does not contain the referenced word. The miss probability (P_miss) defines the likelihood that a referenced word does not reside in the cache.

Words in main memory are usually accessed in blocks, rather than individually. However, block accesses do not have to be simulated to correctly model the net flow of words between the processor and its associated memory. The workload model is specified in terms of time-averaged rates and probabilities that are independent of block size. All main memory references can therefore be assumed to read or write only one word, allowing the mean access rates to be computed as:

R_Mread = P_miss (1 + P_data read)    (1)

R_Mwrite = P_miss P_data write    (2)

Values for P_data read, P_data write, and P_miss may be obtained directly from memory reference traces of the uniprocessor that will be used in the realization of the fault-tolerant multiprocessor system being evaluated.
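As a worked example of Equations (1) and (2), the following snippet computes the mean access rates from illustrative probabilities chosen within the ranges quoted in this section; the numbers are not measurements from any particular machine.

```python
# Worked example of Equations (1) and (2); the numeric inputs are illustrative
# values only, not measurements from any particular machine.

def mean_access_rates(p_miss, p_data_read, p_data_write):
    r_mread = p_miss * (1.0 + p_data_read)   # Equation (1)
    r_mwrite = p_miss * p_data_write         # Equation (2)
    return r_mread, r_mwrite

r_read, r_write = mean_access_rates(p_miss=0.05, p_data_read=0.4, p_data_write=0.2)
print(r_read, r_write)   # 0.07 main-memory reads and 0.01 writes per instruction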
Table 1 lists typical values for the probabilities of accessing memory data on several different machines.

Table 1: Typical Values for the Data Access Probabilities (P_data read and P_data write), for machines including the VAX 8350, DEC Titan, MC88000, VAX 8800, VAX-11/780, Z8000, VAX, VAX (Lisp), IBM 370, IBM 360/91, CDC 6400, and VAX-11, with data from Agarwal [1], Borg [5], Bugge [7], Clark [13], Emer [16], Smith [32], and Wiecek [39].
The probability of performing a data read during an instruction ranges from approximately 0.2 to 0.8. The probability of performing a data write is usually about half that of reads. Cache miss probabilities have been observed to vary from 0.01 to 0.2 [18]. The miss probability for writes can be set independently of that for reads, to simulate different main memory update policies.

The locality-of-reference model determines which memory locations are accessed during an instruction cycle. REACT implements a first-order approximation of a model based on Bradford-Zipf distributions [8, 9]. The locality model assumes that α_locality × 100% of all references access β_locality × 100% of the memory, under the condition α_locality + β_locality = 1. Values for α_locality and β_locality can be extracted from a curve such as the one shown in Figure 2, which plots the cumulative frequency of word references against the logarithmic rank of the words. Heising first reported what is now commonly referred to as an "80/20 Rule", when parameter values α_locality = 0.8 and β_locality = 0.2 were observed for many commercial applications [17]. Reference addresses are assumed to be uniformly distributed inside and outside of the locality, and no attempt is made to separate code from data in memory with this model.

This locality-of-reference model was adopted for a specific reason. Unlike most models intended for trace-driven simulation [22, 33, 36], the Bradford-Zipf model does not require complicated state updates when the system clock is advanced over large periods of time. The low computational overhead of this model makes it very attractive for use in a tool that may simulate millions of systems over several years of operation.
Figure 2: Typical Bradford-Zipf Memory Reference Curve (cumulative frequency of word references plotted against the logarithmic rank of the words)
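A minimal sketch of the 80/20-style locality assumption follows; it is one simple way to realize the α_locality/β_locality model described above and is not REACT's actual implementation.

```python
# A minimal sketch (not REACT's code) of the 80/20-style locality assumption:
# a fraction alpha of references fall uniformly inside the first beta fraction
# of the address space, and the rest fall uniformly outside it.
import random

def sample_reference(n_words, alpha=0.8, beta=0.2, rng=random):
    locality_size = max(1, int(beta * n_words))
    if rng.random() < alpha:
        return rng.randrange(0, locality_size)          # inside the locality
    return rng.randrange(locality_size, n_words)        # outside the locality

refs = [sample_reference(1 << 20) for _ in range(100000)]
inside = sum(r < int(0.2 * (1 << 20)) for r in refs) / len(refs)
print(round(inside, 2))   # ~0.8 of references land in 20% of memory
```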
The last phase of the instruction cycle to consider is the simulated execution of an instruction. The purpose of the workload model is to propagate errors between the processor and memory modules during an access, as they would in a real system. The particular computation performed during an instruction cycle is irrelevant, and therefore does not require explicit simulation. An instruction is assumed to be successfully executed if no faults are active in the processor and no errors are read from memory during the instruction and data fetches. No errors will be written into memory when an instruction is successfully executed. If a fault is active in the processor or an error is read from memory, the instruction will not execute successfully, and any write into memory that takes place will be erroneous.
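The per-cycle rules just described can be summarized in a short sketch; the class and function names below are hypothetical stand-ins, not REACT's API.

```python
# Sketch of the per-instruction-cycle rules described above. All class and
# method names here are hypothetical stand-ins, not REACT's actual interface.
import random

class Processor:
    def __init__(self):
        self.fault_active = False

class Memory:
    def __init__(self):
        self.erroneous = set()            # addresses holding erroneous words

    def read_is_erroneous(self, addr):
        return addr in self.erroneous

    def write(self, addr, erroneous):
        if erroneous:
            self.erroneous.add(addr)
        else:
            self.erroneous.discard(addr)

def instruction_cycle(proc, mem, p_data_read, p_data_write, p_miss, n_words=1024):
    error_seen = proc.fault_active
    # Instruction fetch plus a probabilistic data read; main memory is touched
    # only on a cache miss.
    for p_access in (1.0, p_data_read):
        if random.random() < p_access and random.random() < p_miss:
            error_seen |= mem.read_is_erroneous(random.randrange(n_words))
    # A data write that reaches main memory is erroneous if the processor was
    # faulty or an erroneous word was read earlier in the cycle.
    if random.random() < p_data_write and random.random() < p_miss:
        mem.write(random.randrange(n_words), erroneous=error_seen)
    return not error_seen   # True when the instruction executed successfully

print(instruction_cycle(Processor(), Memory(), 0.4, 0.2, 0.05))
```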
3.4 Fault and Error Models

REACT will automatically inject both permanent and transient faults into the processors, memories, and fault-tolerance mechanisms of a system. Faults are assumed to occur independently within each component. Their inter-arrival times are randomly sampled from a Weibull distribution, Weibull(λ, α), with the following cumulative distribution function (CDF):

F_Weibull(t) = 1 - e^(-(λ t)^α)    (3)
The parameters λ and α are respectively known as the scale and shape parameters. Exponentially distributed inter-arrival times, which are assumed by many failure rate models, can be obtained simply by setting α = 1. Fault injections take place at the beginning of the instruction cycles specified by the CDF samples. When two or more faults simultaneously affect the same bit within a component, the fault with the longer duration takes precedence.

Table 2: Typical Values for the Latency Distribution Parameters (λ_latent, α_latent)

Repair times for failed components are randomly sampled from a log-normal distribution, LogNorm(μ, σ), with the following CDF:

F_log-norm(t) = ∫_0^t [1 / (x σ √(2π))] e^(-(ln x - μ)^2 / (2σ^2)) dx    (4)

The parameters μ and σ are the mean and standard deviation of the distribution, respectively. Repairs are initiated after a constant, user-specified logistics delay (T_logistics). Multiple failed components may be repaired either sequentially or concurrently. All faults in a component are removed during repair. The time required to resynchronize a repaired component (T_resynch) and reboot the system after a critical failure (T_reboot) are also constant and user specified.
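The event times used by the simulator can be drawn from these distributions by standard sampling techniques; the sketch below uses inverse-transform sampling for the Weibull CDF of Equation (3) and Python's built-in log-normal variate for repair times, and is illustrative rather than REACT's implementation.

```python
# Sketch of sampling event times consistent with Equations (3) and (4).
# Illustrative only; not REACT's implementation.
import math
import random

def weibull_interarrival(lam, alpha, rng=random):
    # Inverting F(t) = 1 - exp(-(lam*t)**alpha) gives t = (-ln(1-U))^(1/alpha) / lam.
    u = rng.random()
    return (-math.log(1.0 - u)) ** (1.0 / alpha) / lam

def lognormal_repair_time(mu, sigma, rng=random):
    return rng.lognormvariate(mu, sigma)

random.seed(1)
print(weibull_interarrival(lam=1e-4, alpha=1.0))   # exponential case (alpha = 1)
print(lognormal_repair_time(mu=1.0, sigma=0.5))
```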
3.4.1 Processor Fault Model

Processor operation in the presence of faults is governed by the stochastic model shown in Figure 3. Ovals in this diagram represent valid processor states, while arcs indicate what state changes may take place. Dashed boxes are temporary nodes that are passed through when moving into a valid state. They have been included only to clarify the state transitions of the model.

All processors initially start in the Fault Free state. Permanent faults occur at Weibull(λ_Pperm, α_Pperm) distributed inter-arrival times and become Latent. No errors are generated by Latent processor faults. A permanent fault will remain Latent for a period of time specified by a Weibull(λ_latent, α_latent) distribution, and then become Active. Table 2 lists typical values for the latency distribution parameters, measured through fault-injection experiments on several different processors. A processor is considered to be failed and will produce errors during every instruction cycle when it is in the Active state due to a permanent fault. It can only leave this state and return to the Fault Free state if it is repaired and resynchronized.

Transient processor faults occur at Weibull(λ_Ptran, α_Ptran) distributed inter-arrival times. A transient is assumed to generate a single logic upset which must be latched to become Latent.
Figure 3: Processor Fault Model (processor states Fault Free, Latent, Active, Control Upset, and Dormant, with Weibull-distributed holding times and the transition probabilities P_latch, P_upset, and P_term)
Table 3: Typical Values for the Transient Latching Probability

P_latch        Processor      Reference
0.54           HS1602         Choi [12]
0.29 - 0.36    IBM RT PC      Czeck [15]
0.23 - 0.44    Custom RISC    Ohlsson [29]
Table 4: Typical Values for the Control Upset Probability

P_upset        Processor      Reference
0.62           IBM RT PC      Czeck [15]
0.60 - 0.82    MC6809E        Miremadi [28]
0.32 - 0.35    Custom RISC    Ohlsson [29]
If not latched with probability P_latch, the processor will immediately return to the Fault Free state. Table 3 gives typical values for the transient latching probability of several different processors, again measured through fault-injection experiments. A Latent transient fault will upset the control flow of an executing program with probability P_upset, after a Weibull(λ_latent, α_latent) distributed period of time. This will cause the processor to diverge from the correct instruction stream and enter the Control Upset state. Typical values for the control upset probability, also measured through fault-injection experiments on several different processors, appear in Table 4. It is assumed that the processor will remain in the Control Upset state, producing errors during every instruction cycle, until it is resynchronized.

If the Latent transient does not result in a control flow upset, it will become Active after a Weibull(λ_latent, α_latent) distributed length of time. An Active transient fault is assumed to produce errors during one instruction cycle and then move into the Dormant state. With termination probability P_term, the processor will immediately return to the Fault Free state. Otherwise, it will hold in the Dormant state for a period of time that is Weibull(λ_active, α_active) distributed. After the holding time in the Dormant state expires, the transient will again become Active and repeat the same sequence of state changes. This cyclic behavior is intended to represent multiple errors being sourced from a single fault that eventually disappears. Register and cache faults may exhibit such behavior when they are read several times before being overwritten.

Parameters for the distribution of transient fault activity are not readily available in the research literature. However, several papers have published distributions of the time between errors (TBEs) in computer systems, from which λ_active and α_active can be estimated [25, 34, 35]. The TBE distributions were obtained through the analysis of system error logs.
Figure 4: Error Hazard Rate of a VAX 8600 (actual data and Weibull approximation; hazard rate plotted against time t in minutes)
Most large computers maintain an error log, in which time-stamped entries describing abnormal system events are recorded. Entries are usually made in the log whenever an error is detected by any system component, including the processor, memory, I/O devices, and software. Although it is possible to compute TBE distributions for individual components, most published TBE distributions utilize all the data in a log. It is therefore assumed that the TBE distribution for a processor resembles the aggregate TBE distribution for all the system components. Figure 4 graphs the error hazard rate, h(t), computed by Tang and Iyer from the TBE distribution of raw data in a VAX 8600 error log [35]. The hazard rate of a distribution with probability density function f(t) and CDF F(t) is:

h(t) = f(t) / (1 - F(t))    (5)
Error hazard rate can be interpreted as the conditional probability that an error will occur within the next unit of time, given that no error has occurred since the last error was seen. This curve was approximated (by the author) with the Weibull(λ, α) hazard function:

h_Weibull(t) = α λ (λ t)^(α-1)    (6)

having parameter values λ_active = 0.022222 minute⁻¹ and α_active = 0.5. The mean of a Weibull(λ, α) distribution is:

m_Weibull = (1/λ) Γ(1 + 1/α)    (7)

where the gamma function Γ(x) is defined for positive integers x as Γ(x) = (x - 1)!.
Equation 7 yields a mean for the fitted Weibull(λ_active, α_active) distribution of 90 minutes, which compares favorably to the 88.2 minute mean of the empirical data. Similar TBE distributions have also been observed for the VAX-11/780 [34] and Tandem Cyclone and VLX systems [25]. Distribution parameters needed to model other processors can be obtained using the approach described here.

TBE distributions have been shown to exhibit two error modes, around which many samples accrue [25]. The first mode has a peak at the origin and represents a burst of errors from one source. It is modeled by the Weibull(λ_active, α_active) fault activity distribution. The second mode occurs in the tail of the TBE distribution and represents single, isolated errors. It is modeled by the Weibull(λ_Ptran, α_Ptran) fault occurrence distribution. The termination probability designates where the tail of the fault activity distribution should be truncated to eliminate the second mode. The value of the TBE CDF at this point in time is equal to 1 - P_term.

Continuing with the discussion of Figure 3, when a processor reads an error from memory (through the fault-tolerance mechanisms), its state may become corrupted. All input errors will be latched internally, so they are assumed to have an immediate effect on the processor. Based on the probability P_upset, an input error will put the processor in either the Control Upset or Active state. The input error is assumed to behave like a transient fault once the processor has entered either of these states.

Processor errors can affect both the addresses and (write) data generated during an instruction cycle. The probabilities of an error affecting only an address (P_addr error) or only data (P_data error) are specified by the user. Addresses and data are simultaneously affected by an error with probability (1 - P_addr error - P_data error). An erroneous address is assumed to access a random memory location. Erroneous data take on a random value.
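The transient portion of this state machine can be summarized compactly; the sketch below follows the transitions described above (latching, control upset, and the Active/Dormant cycle), with the holding-time samplers left as assumed helper functions rather than REACT's own routines.

```python
# A compact sketch of the transient-fault life cycle described for Figure 3
# (Fault Free -> Latent -> Control Upset, or Active/Dormant cycling). It is an
# illustration of the state logic only; the sampling helpers are assumptions.
import random

def transient_fault_lifecycle(p_latch, p_upset, p_term,
                              sample_latent, sample_dormant, rng=random):
    """Yield (state, holding_time) pairs for one transient fault."""
    if rng.random() >= p_latch:
        yield ("Fault Free", 0.0)                 # upset never latched
        return
    yield ("Latent", sample_latent())
    if rng.random() < p_upset:
        yield ("Control Upset", float("inf"))     # errors until resynchronized
        return
    while True:
        yield ("Active", 1.0)                     # one erroneous instruction cycle
        if rng.random() < p_term:
            yield ("Fault Free", 0.0)
            return
        yield ("Dormant", sample_dormant())

random.seed(2)
trace = list(transient_fault_lifecycle(0.4, 0.6, 0.5,
                                       sample_latent=lambda: 5.0,
                                       sample_dormant=lambda: 90.0))
print(trace)
```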
3.4.2 Memory Fault Model

Memory faults are divided among the bit-array and addressing-logic regions of a module. The fraction of faults which fall into each of these regions may be approximated by their relative chip areas. Typical values for the fraction of memory chip faults that affect the bit-array (f_bit-array) are listed in Table 5. Bit-array faults are assumed to affect a single random bit in a word at a random address. It is assumed that references to memory access words at random addresses during an addressing-logic fault. An access to a random address reads or writes a value with randomly corrupted bits, corresponding to the difference in bit values between the word that was accessed and the word that should have been accessed. Similar fault models for memory have been employed in other experimental dependability studies [10, 26].

The stochastic model seen in Figure 5 illustrates how faults affect the operation of a memory module. All memories initially start in the Fault Free state.
Table 5: Typical Values for the Bit-Array Fraction f_bit-array

f_bit-array    Memory Chip
0.39           16K RAM
0.85           16K DRAM
0.89           64K DRAM
0.89           4K RAM

(References: Ayache [3], Koo [23], and Marston [27].)
Permanent faults occur at Weibull(λ_Mperm, α_Mperm) distributed inter-arrival times, and affect the Addressing Logic with probability 1 - f_bit-array. A memory is considered to be failed if it is in this state, and will always return random words when accessed. It can only leave the permanent Addressing Logic fault state and return to the Fault Free state if it is repaired and resynchronized. The permanent fault could have corrupted a bit in memory instead of the addressing logic, with probability f_bit-array. It would have become Latent or Active based on the value of the bit latency probability P_bit-latent. Through fault-injection experiments on a VAX 11/780, Chillarege and Iyer found that the probabilities that a s-a-0 and a s-a-1 fault are initially latent are 0.7 and 0.3, respectively [11]. Assuming both faults are equally likely, the probability a bit-array fault will be latent is P_bit-latent = 0.5. A Latent permanent fault in the bit-array generates no errors. It will become Active with probability 1 - P_bit-latent when it is next written, at a time determined by the workload model. Once Active, the permanent bit fault cannot be overwritten and is assumed to remain in that state in order to simplify the fault model.

Transient faults occur at Weibull(λ_Mtran, α_Mtran) distributed inter-arrival times in memory. Their effects are similar to, but shorter than, those of permanent faults. Random words will be accessed if a transient affects the Addressing Logic. However, the fault is assumed to last only one instruction cycle, after which the memory returns to the Fault Free state. A transient fault that strikes a memory bit has the probability 1 - P_bit-latent of becoming Active. Its duration is assumed to be so short that it cannot become latent, like a permanent fault can. As a result, a transient fault in the bit-array will immediately return to the Fault Free state with probability P_bit-latent. An Active transient will produce errors whenever it is read, until overwritten at a time specified by the workload model.
3.4.3 Fault-Tolerance Mechanism Fault Model

Permanent and transient faults occur at Weibull(λ_FTMperm, α_FTMperm) and Weibull(λ_FTMtran, α_FTMtran) distributed inter-arrival times, respectively, within the components that make up the block of fault-tolerance mechanisms. Because most fault-tolerance mechanisms are relatively simple, having little associated state in comparison to the processors and memories, all faults are assumed to affect single random bits with no latency. A fault-tolerance mechanism component afflicted with one or more permanent faults is considered to be failed.
Figure 5: Memory Fault Model
3.5 Implementation

3.5.1 Basic Operation
REACT utilizes the system, workload, and fault/error models developed in the last three sections to perform life testing through simulated fault injection. Monte Carlo, event-driven simulation is employed to operate the system, inject faults, and make repairs, based on these models. During a single simulation run, experiments or trials are conducted in which an initially fault-free system is operated until it fails or reaches a user-specified censoring time (T_censor). The number of trials (N_trials) is determined by the desired confidence intervals about the system dependability metric being evaluated. The censoring time dictates the maximum operational lifetime relevant to the given system. Those trials in which the system remains operational beyond the censoring time are terminated, thereby shortening the run-time of the simulation without affecting the measurements of interest.
3.5.2 Dependability Measures

Extensive instrumentation has been incorporated into REACT to collect data from each trial, which is later aggregated over the entire simulation run in order to compute the system dependability metrics. A plot of reliability or availability as a function of time, various statistical measurements such as MTTF, MTBF, mean time to repair (MTTR), and steady-state availability (A_steady-state), plus a comprehensive system failure mode report are provided as outputs. Reliability is calculated from the set of failure times:

FT = {ft_1, ft_2, ..., ft_N_trials}    (8)
where ft_i is the time at which the system failed or was censored during trial i. Define a sample period of length:

T_sample = max(ft_i) / N_points    (9)

where N_points is the number of sample points in the reliability curve. The failure frequency n_fail(x) in sample period x (1 ≤ x ≤ N_points) is computed as the number of failure times in FT for which:

(x - 1) T_sample < ft_i ≤ x T_sample    (1 ≤ i ≤ N_trials)    (10)

The probability mass function (pmf) of failure times can be estimated by:

f_fail(x) = n_fail(x) / N_trials    (1 ≤ x ≤ N_points)    (11)

For a sufficiently large number of trials, this estimate converges to the true pmf, and the corresponding failure time CDF may be obtained by integrating the pmf:

F_fail(x) = Σ_{i=0}^{x} f_fail(i)    (0 ≤ x ≤ N_points)    (12)
This distribution represents the unreliability of the system in sample period x. Reliability is computed as the probabilistic inverse of F_fail(x) when continuous time t is quantized to a sample period:

R(t) = 1 - F_fail(⌊t / T_sample⌋)    (0 ≤ t ≤ T_censor)    (13)

Availability is calculated from the set of down-times:

DT = {dt_1, dt_2, ..., dt_N_points}    (14)

where dt_x is the total system down-time accumulated over all trials in sample period x. Sample periods are now defined to have length:

T_sample = T_censor / N_points    (15)

where N_points specifies the number of points in the availability curve. Let U(x) represent the unavailability of the system in sample period x. It can be estimated by the ratio of average down-time to the length of a sample period:

U(x) = dt_x / (N_trials T_sample)    (1 ≤ x ≤ N_points)    (16)

Like the failure time pmf, this estimate converges to the true U(x) when the number of trials is sufficiently large. Assume an initial unavailability of U(0) = 0. Availability is the probabilistic inverse of U(x) when continuous time t is quantized to a sample period:

A(t) = 1 - U(⌊t / T_sample⌋)    (0 ≤ t ≤ T_censor)    (17)
Estimates for the MTTF, MTBF, and MTTR of a system are computed as:

MTTF = ( Σ_{i=1}^{N_trials} ft_i ) / M_fail    (18)

MTBF = ( N_trials T_censor ) / M_fail    (19)

MTTR = ( Σ_{x=1}^{N_points} dt_x ) / M_fail    (20)

where M_fail is the total number of system failures that occurred before censoring. Using the relation MTBF = MTTF + MTTR, steady-state availability can be obtained in terms of these estimates:

A_steady-state = MTTF / (MTTF + MTTR) = (MTBF - MTTR) / MTBF    (21)
Confidence intervals about point estimates for reliability, MTTF, and MTBF are readily calculated. Assuming the number of trials is large, the normal approximation to the binomial distribution may be applied. This yields the following (1 - α) × 100% confidence interval around a reliability point estimate R:

R ± z_{α/2} √( R (1 - R) / N_trials )    (22)

where z_θ represents the value at which the standard normal distribution is equal to θ. If the failure time distribution is approximately exponential and repair times are short, the (1 - α) × 100% confidence interval around T̂, a point estimate of MTTF or MTBF, is given by:

2 M_fail T̂ / χ²_{2 M_fail, 1-α/2}  ≤  T  ≤  2 M_fail T̂ / χ²_{2 M_fail, α/2}    (23)

where χ²_{N,θ} represents the value at which the chi-square distribution with N degrees of freedom is equal to θ. The failure mode report gives the percentage of system failures that can be attributed to each of the 26 possible causes listed in Table 6. The MTTF or MTBF associated with each failure mode is also provided.
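The estimates of Equations (18) through (21) are straightforward to compute from per-trial results; the sketch below is illustrative only, and its sample inputs are invented.

```python
# Sketch of the dependability estimates of Equations (18)-(21), computed from
# per-trial failure times and accumulated down-times. Illustrative only; it is
# not REACT's implementation, and the sample inputs are made up.

def dependability_estimates(failure_times, downtimes, t_censor):
    n_trials = len(failure_times)
    m_fail = sum(1 for ft in failure_times if ft < t_censor)   # uncensored trials
    mttf = sum(failure_times) / m_fail                         # Equation (18)
    mtbf = n_trials * t_censor / m_fail                        # Equation (19)
    mttr = sum(downtimes) / m_fail                             # Equation (20)
    a_ss = mttf / (mttf + mttr)                                # Equation (21)
    return mttf, mtbf, mttr, a_ss

# Three trials censored at 10,000 hours: two failures plus one censored trial.
print(dependability_estimates([4000.0, 10000.0, 7000.0], [3.0, 0.0, 5.0], 10000.0))
```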
3.6 Verification and Validation

Verification is the process of ensuring a model has been correctly implemented. REACT was verified primarily through program tracing. A debugging flag may be set when compiling the software to insert code that will dynamically print the state of the simulator as it executes. This feature was used to collect numerous traces of the program for a variety of system architectures and input parameter combinations. Each trace was then analyzed by hand to check for proper operation. REACT was also traced from certain pre-determined states, in order to observe the simulation of rare events that do not normally appear in a randomly collected trace. Implementation errors were minimized through the use of abstract data types and standardized routines to generate random variates [6].

Validation is the process of checking if a correctly implemented model approximates reality with sufficient accuracy. REACT was partially validated by comparing its dependability predictions with those of combinatorial models. Many system attributes simulated by REACT cannot be easily modeled analytically. In order to facilitate the comparison, only permanent faults were considered and input parameter values were chosen to eliminate the effects of workload. This was accomplished by using the input parameter values listed in Table 7, which reduce fault latency and accelerate error propagation. Event distributions were assumed to be exponential, fixing all Weibull shape parameters at α = 1. Failure rates and censoring times were varied over a range of values, while the remaining input parameters were set to zero.
Table 6: System Failure Modes

Processor Modules
1. Permanent faults only
2. Transient faults only
3. Permanent and transient faults

Memory Modules
4. Permanent addressing faults only
5. Transient addressing faults only
6. Permanent bit faults only
7. Transient bit faults only
8. Permanent addressing and transient addressing faults
9. Permanent addressing and permanent bit faults
10. Permanent addressing and transient bit faults
11. Transient addressing and permanent bit faults
12. Transient addressing and transient bit faults
13. Permanent bit and transient bit faults

Fault-Tolerance Mechanisms
14. Permanent faults only
15. Transient faults only
16. Permanent and transient faults

Error Propagation
17. Errors propagating to the processors only
18. Permanent processor faults and propagating errors
19. Transient processor faults and propagating errors
20. All types of processor faults and propagating errors
21. Errors propagating to the memories only
22. Permanent memory addressing faults and propagating errors
23. Transient memory addressing faults and propagating errors
24. Permanent memory bit faults and propagating errors
25. Transient memory bit faults and propagating errors
26. All types of memory faults and propagating errors

Table 7: Input parameter values used for validation. The parameters listed are the instruction execution rate (R_instr), memory read access rate (R_Mread), memory write access rate (R_Mwrite), locality-of-reference access fraction (α_locality), locality-of-reference size fraction (β_locality), processor fault latency scale parameter (λ_latent), processor control flow upset probability (P_upset), and number of words per memory module (N_words).
Several different system architectures were used to validate REACT. One of these was a 2-of-3 module system, in which at least two of its three processor and memory modules must be operational. In order to simplify the analysis, the fault-tolerance mechanisms of this architecture were assumed to have perfect coverage and never fail. The reliabilities of the processor and memory modules were defined to be exponential:

R_P(t) = e^(-λ_P t),    R_M(t) = e^(-λ_M t)    (24)

with failure rates λ_P and λ_M, respectively. An expression for the reliability of a 2-of-3 system is obtained by summing the probabilities of no failed modules, one failed processor or memory module, and both a failed processor and memory module:

R_2-of-3(t) = R_P(t)^3 R_M(t)^3 + 3 R_P(t)^2 [1 - R_P(t)] R_M(t)^3 + 3 R_P(t)^3 R_M(t)^2 [1 - R_M(t)] + 9 R_P(t)^2 [1 - R_P(t)] R_M(t)^2 [1 - R_M(t)]    (25)

Coefficients on the product terms in this equation represent the number of possible combinations for module failures. The reliability of the 2-of-3 system was evaluated for three different memory failure rates and plotted in Figures 6 and 7 over mission times of 100 and 10000 hours, respectively. Processor failure rates were held constant at a fixed λ_P. Symmetry in Equation 25 yields the same reliability curves when the memory failure rate is instead fixed at that value and the processor failure rate is varied equivalently. Results from the corresponding REACT simulation are also graphed in both plots. The reliability measured by REACT matched that of the combinatorial model with little error.
Figure 6: Reliability of a 2-of-3 System over 100 Hours
Figure 7: Reliability of a 2-of-3 System over 10000 Hours (analytical curves for three memory failure rates, with REACT simulation results overlaid)
Table 8 gives the exact analytical and empirical reliabilities at 50 hours. The true reliability always fell within the 95% confidence interval around the estimate. For the final check, the MTTF of the 2-of-3 system was derived as

MTTF_2-of-3 = ∫_0^∞ R_2-of-3(t) dt

and evaluated for the various failure rates. Table 9 compares those values with the predictions of REACT. The 95% confidence interval again contained the exact MTTF in all cases. The validity of the portion of REACT's modeling not tested through this comparison would be very difficult to check analytically. However, because the modeling employed by REACT is based on measurements from actual systems, one would expect it to be a fairly accurate representation of real system behavior.

Table 8: Reliability of a 2-of-3 System at 50 Hours
Performance Limitations
The execution time of R E A C T is a function of the number of trials performed during a simulation run and the average number of events simulated i)er trial. The number of events simulated per trial is determined mainly by the censoring time, but depends heavily on other input parameters. R E A C T may sinnilate as few as O.l events per trial, on average, at low censoring times (Tcensor < 100 hours) or up to several thousand events per trial at higher censoring times ("Tconsor > lOOOO hours). As a "rule-of-thumb", R E A C T can sinudate the reliability of one million systems, over a mission time of 100 hours, in about one hour of real time, when running on a 25 MHz RISC workstation. R E A C T has successfully evaluated multiprocessor systems with cnd-of-mission reliabilities as high as'R{t) = (1 - 10"^) ± 1.5 x 10"*' at 90% confidence in less than eight hours of run-time. Availability generally requires more time to analyze than
188 reliability, since repairs must be simulated in addition to failures and every system is operated until the censoring time.
3.8
Conclusion
This chapter has introduced R E A C T , a simulated fault-injection testbed for fault-tolerant multiprocessor architectures. The next two chapters demonstrate the effectiveness of R E A C T through the analysis of reliability, performance and cost tradeoffs in several alternative triple-modular redundant systems.
189
References [1] Agarwal, A. and G u p t a , A., "Memory-reference characteristics of multiprocessor applications under MACH," in Proceedings of the 1988 Sigmetrtcs Conference on Measurement and Modeling of Computer Systems, pp. 215225, ACM, May 1988. [2] Avizienis, A., Gilley, G. C., Mathur, F. P., Rcnnels, D. A., Rolir. J. A., and Rubin, D. K., "The STAR (Self-Testing And Repairing) computer; An investigation of the theory and practice of fault-tolerant computer design," IEEE Transactions on Computers, vol. C-20, no. 11, pp. 1312-11521, Nov. 1971. [3] Ayache, J. M. and Diaz, M., "A reliability model for error correcting memory systems," IEEE Transactions on Reliability, vol. R-28, no. 4, pp. 310314, Oct. 1979. [4] Bernstein, P. A., "Sequoia: A fault-tolerant tightly coupled multiprocessor for transaction processing," IEEE Computer, vol. 21, no. 2, pp. 37-45, Feb. 1988. [5] Borg, A., Kessler, R. E., and Wall, D. W., "Generation and analysis of very long address traces," in Proceedings of the 17th Annual International Symposium on Computer Architecture, pp. 270 279, ACM, May 1990. [6] Bratley, P., Fox, B. L., and Schrage, L. E., A Guide to Simulation. York, NY: Springer-Verlag, 2nd ed., 1987.
New
[7] Bugge, H. O., Kristiansen, E. H., and Bakka, B. O., "Trace-driven simulations for a two-level cache design in open bus systems," in Proceedings of the 17th Annual International Symposium on Computer Architecture, pp. 250-259, ACM, May 1990.
[8] Bunt, R. B. and Murphy, J. M., "The measurement of locality and the behaviour of programs," The Computer Journal, vol. 27, no. 3, pp. 238-245, Aug. 1984.
[9] Bunt, R. B., Murphy, J. M., and Majumdar, S., "A measure of program locality and its application," in Proceedings of the 1984 Sigmetrics Conference on Measurement and Modeling of Computer Systems, pp. 28-40, ACM, Aug. 1984.
[10] Chen, C. L. and Rutledge, R. A., "Fault-tolerant memory simulator," IBM Journal of Research and Development, vol. 28, no. 2, pp. 184-195, Mar. 1984.
[11] Chillarege, R. and Iyer, R. K., "An experimental study of memory fault latency," IEEE Transactions on Computers, vol. 38, no. 6, pp. 869-874, June 1989.
[12] Choi, G. S. and Iyer, R. K., "FOCUS: An experimental environment for fault sensitivity analysis," IEEE Transactions on Computers, vol. 41, no. 12, pp. 1515-1526, Dec. 1992.
[13] Clark, D. W., Bannon, P. J., and Keller, J. B., "Measuring VAX 8800 performance with a histogram hardware monitor," in Proceedings of the 15th Annual International Symposium on Computer Architecture, pp. 176-185, ACM, May 1988.
[14] Courtois, B., "Some results about the efficiency of simple mechanisms for the detection of microcomputer malfunctions," in Proceedings of the 9th International Symposium on Fault-Tolerant Computing, pp. 71-74, IEEE, June 1979.
[15] Czeck, E. W. and Siewiorek, D. P., "Effects of transient gate-level faults on program behavior," in Proceedings of the 20th International Symposium on Fault-Tolerant Computing, pp. 236-243, IEEE, June 1990.
[16] Emer, J. S. and Clark, D. W., "A characterization of processor performance in the VAX-11/780," in Proceedings of the 11th Annual International Symposium on Computer Architecture, pp. 301-310, ACM, June 1984.
[17] Heising, W. P., "Note on random addressing techniques," IBM Systems Journal, vol. 2, pp. 112-116, June 1963.
[18] Hennessy, J. L. and Patterson, D. A., Computer Architecture: A Quantitative Approach. San Mateo, CA: Morgan Kaufmann, 1990.
[19] Hopkins, A. L., Smith, T. B., and Lala, J. H., "FTMP - A highly reliable fault-tolerant multiprocessor for aircraft," Proceedings of the IEEE, vol. 66, no. 10, pp. 1221-1239, Oct. 1978.
[20] Jewett, D., "Integrity S2: A fault-tolerant Unix platform," in Proceedings of the 21st International Symposium on Fault-Tolerant Computing, pp. 512-519, IEEE, June 1991.
[21] Kanawati, G. A., Kanawati, N. A., and Abraham, J. A., "FERRARI: A tool for the validation of system dependability properties," in Proceedings of the 22nd International Symposium on Fault-Tolerant Computing, pp. 336-344, IEEE, July 1992.
[22] Kobayashi, M. and MacDougall, M. H., "The stack growth function: Cache line reference models," IEEE Transactions on Computers, vol. 38, no. 6, pp. 798-805, June 1989.
[23] Koo, D. Y. and Chenowith, H. B., "Choosing a practical MTTF model for ECC memory chip," in Proceedings of the 1984 Annual Reliability and Maintainability Symposium, pp. 255-261, IEEE, Jan. 1984.
[24] Krol, T., "(N,K) concept fault tolerance," IEEE Transactions on Computers, vol. C-35, no. 4, pp. 339-349, Apr. 1986.
[25] Lee, I., Iyer, R. K., and Tang, D., "Error/failure analysis using event logs from fault tolerant systems," in Proceedings of the 21st International Symposium on Fault-Tolerant Computing, pp. 10-17, IEEE, June 1991.
[26] Libson, M. R. and Harvey, H. E., "A general-purpose memory reliability simulator," IBM Journal of Research and Development, vol. 28, no. 2, pp. 196-205, Mar. 1984.
[27] Marston, D., Memory System Reliability with ECC. Intel Corporation, Application Note AP-73, 1980.
[28] Miremadi, G., Karlsson, J., Gunneflo, U., and Torin, J., "Two software techniques for on-line error detection," in Proceedings of the 22nd International Symposium on Fault-Tolerant Computing, pp. 328-335, IEEE, July 1992.
[29] Ohlsson, J., Rimen, M., and Gunneflo, U., "A study of the effects of transient fault injection into a 32-bit RISC with built-in watchdog," in Proceedings of the 22nd International Symposium on Fault-Tolerant Computing, pp. 316-325, IEEE, July 1992.
[30] Shin, K. G. and Lee, Y.-H., "Measurement and application of fault latency," IEEE Transactions on Computers, vol. C-35, no. 4, pp. 370-375, Apr. 1986.
[31] Siewiorek, D. P., Kini, V., Mashburn, H., McConnel, S., and Tsao, M., "A case study of C.mmp, Cm*, and C.vmp: Part I - Experiences with fault tolerance in multiprocessor systems," Proceedings of the IEEE, vol. 66, no. 10, pp. 1178-1199, Oct. 1978.
[32] Smith, A. J., "Cache evaluation and the impact of workload choice," in Proceedings of the 12th Annual International Symposium on Computer Architecture, pp. 64-73, ACM, June 1985.
[33] Spirn, J. R., Program Behavior: Models and Measurements. New York, NY: Elsevier, 1977.
[34] Tang, D. and Iyer, R. K., "Dependability measurement and modeling of a multicomputer system," IEEE Transactions on Computers, vol. 42, no. 1, pp. 62-75, Jan. 1993.
[35] Tang, D., Iyer, R. K., and Subramani, S. S., "Failure analysis and modeling of a VAXcluster system," in Proceedings of the 20th International Symposium on Fault-Tolerant Computing, pp. 244-251, IEEE, June 1990.
[36] Thiebaut, D. F., Wolf, J. L., and Stone, H. S., "Synthetic traces for trace-driven simulation of cache memories," IEEE Transactions on Computers, vol. 41, no. 4, pp. 388-410, Apr. 1992.
[37] Toy, W. N., "Fault-tolerant design of local ESS processors," Proceedings of the IEEE, vol. 66, no. 10, pp. 1126-1145, Oct. 1978.
[38] Webber, S. and Beirne, J., "The Stratus architecture," in Proceedings of the 21st International Symposium on Fault-Tolerant Computing, pp. 79-85, IEEE, June 1991.
[39] Wiecek, C. A., "A case study of VAX-11 instruction set usage for compiler execution," in Proceedings of the 1st International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 177-184, ACM, Mar. 1982.
SECTION 4
SYSTEM EVALUATION
SECTION 4.1
Measurement-Based Dependability Evaluation of Operational Computer Systems

Ravishankar K. Iyer and Dong Tang
Abstract

This paper discusses methodologies and advances in measurement-based dependability evaluation of operational computer systems. Research work in this area over the past 15 years is briefly reviewed. Methodologies are illustrated through a discussion of the authors' representative studies. Specifically, measurement and data processing techniques, basic error characterization, dependency analysis, Markov reward modeling, software dependability, and fault diagnosis are addressed. The discussion covers methods used in the area and several important issues previously studied, including workload/failure dependency, correlated failures, and software fault tolerance.
4.1.1 Introduction

When a computer system is operating, various types of errors can occur both in the hardware and in the software. There are many possible sources of errors, including untested manufacturing faults and software defects, wearing out of devices, transient errors induced by radiation, power surges, or other physical processes, operator errors, and environmental factors. The occurrence of errors is also highly dependent on the workloads running on the system. A distribution of operational outages from various error sources for several major commercial systems is reported in [45].
This research was supported in part by the Office of Naval Research under Grant N00014-91-J-1116 and in part by the National Aeronautics and Space Administration under NASA Grant NAG-1-613, in cooperation with the Illinois Computer Laboratory for Aerospace Systems and Software (ICLASS). Ravishankar K. Iyer is with the Coordinated Science Laboratory, University of Illinois, Urbana, IL 61801. Dong Tang is currently with SoHaR Incorporated, Beverly Hills, CA 90211.
There is no better way to understand the dependability characteristics of a complex computer system than to measure real systems and analyze the measured data. Here, measuring a real system means monitoring and recording naturally occurring errors and failures in the system while it is running under user workloads. The collected data contain a large amount of information about naturally occurring errors/failures. Analysis of these data can provide valuable information on actual error/failure behavior, identify system bottlenecks, quantify dependability measures, and verify assumptions made in analytical models.

Measurement-based dependability analysis of operational systems has evolved significantly over the past 15 years. This chapter reviews methodologies and advances in this area. The discussion focuses on the authors' own work. The results discussed are based on over 100 machine-years of data gathered from IBM, DEC, and Tandem systems. Table 4.1.1 gives a brief overview of representative studies in this area.

Table 4.1.1. Measurement-Based Studies of Computer System Dependability
[The table lists, for each category (Data Coalescing, Basic Error Characteristics, Software Dependability, and Fault Diagnosis), the issues addressed, such as analysis of time-based tuples, clustering based on type and time, heuristic trend analysis, statistical analysis of symptoms, and network fault signatures, together with the corresponding studies, including [24], [29], [31], [32], [33], and [56].]
Early studies in this field investigated transient errors in DEC computer systems and found that more than 95% of all detected errors are intermittent or transient errors [34], [44]. The studies also showed that the inter-arrival time of transient errors follows a Weibull distribution with a decreasing error rate. This distribution was also shown to fit the software failure data collected from an IBM operating system [22]. A recent study of failure data from three different operating systems showed that the time to error (TTE) can be represented by a multi-stage gamma distribution for the measured single-machine operating system and by hyperexponential distributions for the measured distributed operating systems [27].

Studies of the dependency between workload and failures in the early 1980s, based on measurements from IBM [5] and DEC [6] machines, revealed that the average system failure rate is strongly correlated with the average workload on the system. The effect of workload-imposed stress on software was investigated in [7] and [22]. Recent analyses of DEC [48], [58] and Tandem [25] multicomputer systems showed that correlated failures across processors are not negligible and that their impact on availability and reliability is significant [9], [49], [50].

In [18], analytical modeling and measurements were combined to develop measurement-based reliability/performability models using data collected from an IBM mainframe. The results showed that a semi-Markov process is better than a Markov process for modeling system behavior. Markov reward modeling techniques were further applied to distributed systems [52] and fault tolerant systems [26] to quantify the performance loss due to errors/failures for both hardware and software.

A census of Tandem system availability indicated that software faults are the major source of system outages in the measured fault tolerant systems [13]. Analyses of field data from different software systems investigated several dependability issues, including the effectiveness of error recovery [57], hardware-related software errors [21], correlated software errors in distributed systems [51], software fault tolerance [26], [28], and software defect classification [46], [47]. Measurement-based fault diagnosis and failure prediction issues were investigated in [24], [29], [31], [32], [56].

The following sections discuss issues, analysis methods, and representative studies in measurements, data processing, preliminary analysis of data, dependency analysis, modeling and evaluation, software dependability, and fault diagnosis. Although other studies will be reviewed in the discussion, the authors' work will be presented in detail to illustrate the methodologies.
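For reference, the decreasing error rate reported in the Weibull fits cited earlier in this section has a simple analytic form. In the standard two-parameter notation (shape \beta, scale \alpha; the fitted parameter values from the cited studies are not reproduced here), the distribution and hazard (instantaneous error) rate of the time t between errors are

    F(t) = 1 - \exp[-(t/\alpha)^{\beta}], \qquad h(t) = \frac{\beta}{\alpha}\left(\frac{t}{\alpha}\right)^{\beta - 1}, \qquad t \ge 0,

so a decreasing error rate corresponds to \beta < 1: an error is most likely soon after the previous one, and the rate falls as the error-free interval grows.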
4.1.2 Basic Methodology

Figure 4.1.1 outlines the procedure of a measurement-based analysis. Given field error data collected from a real system, a measurement-based study consists of four steps: 1) data processing, 2) model identification and measure acquisition, 3) model solution, if necessary, and 4) analysis of models and measures.

Step 1 consists of extracting the necessary information from the field data (the result can be a form of compressed data or flat data), classifying errors and failures, and coalescing repeated error reports. In a computer system, a single problem commonly results in many repeated error observations occurring in rapid succession. To ensure that the analysis is not biased by repeated observations of the same problem, all error entries which have the same error type and occur within a short time interval (e.g., 5 minutes) of each other should be coalesced in data processing. Thus, the output of this step is a form of coalesced data in which errors and failures are identified. This step is highly dependent on the measured system. Coalescing algorithms have been proposed in [14], [23], [56]. A minimal code sketch of this coalescing rule follows the step descriptions below.

Step 2 includes identifying appropriate models (such as Markov models) and acquiring measures (such as time-between-failures, TBF, distributions) from the coalesced data. Several models have been proposed and validated using real data. These include the workload-dependent cyclostationary model in [6], the workload hazard model in [19], and the failure correlation models in [50]. Statistical analysis packages such as SAS [SAS85] or measurement-based dependability analysis tools such as MEASURE+ [53] are useful at this stage.

Step 3 solves these models to obtain system-level dependability measures (such as availability, reliability, and expected reward rates). Dependability and performance modeling and evaluation tools such as SHARPE [43] can be used in this step.

Step 4, the most creative part of the study, involves a convincing interpretation of the models and measures obtained from the data; for example, the identification of reliability bottlenecks and the impact of design enhancements on availability.
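Returning to Step 1, the following is a minimal sketch of the coalescing rule described there. The five-minute window follows the example above; the record representation and the timestamps are assumptions made purely for illustration, not the data format used in the cited studies.

from datetime import datetime, timedelta

WINDOW = timedelta(minutes=5)   # coalescing interval from the example above

def coalesce(entries):
    """entries: (timestamp, error_type) pairs sorted by time.
    Successive entries of the same error type within WINDOW of the last
    entry in the current group are treated as one event (tuple)."""
    groups = []
    for ts, etype in entries:
        if groups:
            last_ts, last_type = groups[-1][-1]
            if etype == last_type and ts - last_ts <= WINDOW:
                groups[-1].append((ts, etype))   # same problem: extend group
                continue
        groups.append([(ts, etype)])             # new problem: start a group
    return groups

# Three raw reports collapse into two coalesced events.
log = [
    (datetime(1993, 1, 1, 10, 0), "Disk"),
    (datetime(1993, 1, 1, 10, 2), "Disk"),   # repeat of the same problem
    (datetime(1993, 1, 1, 11, 0), "Memory"),
]
print(len(coalesce(log)))   # -> 2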
Models & Measures Field Data
Data
>
Coalesced Data
Model Idenlifylng
Processing
Measure Acquiring
Step 1
Step 2
Model Mode s
Solution
Me asures
Step 3
Figure 4.1.1. Measurement-Based Analysis
Analysis of Models Measures Step 4
Results
The analysis methods can vary significantly from one study to another, depending on the project goals.
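As a small numerical illustration of Step 3 only, the sketch below solves the simplest possible dependability model, a two-state (up/down) Markov availability model. The failure and repair rates are hypothetical placeholders, not values measured from the systems discussed in this chapter.

import numpy as np

# Hypothetical parameters (per hour); placeholders, not measured values.
lam = 1.0 / 2000.0   # failure rate: one failure every 2000 hours on average
mu = 1.0 / 4.0       # repair rate: four-hour mean repair time

# Generator matrix of the two-state Markov chain (state 0 = up, state 1 = down).
Q = np.array([[-lam, lam],
              [mu,  -mu]])

# Steady-state probabilities: solve pi Q = 0 subject to pi summing to 1.
A = np.vstack([Q.T, np.ones(2)])
b = np.array([0.0, 0.0, 1.0])
pi, *_ = np.linalg.lstsq(A, b, rcond=None)

availability = pi[0]          # long-run fraction of time the system is up
print(f"steady-state availability = {availability:.6f}")
print(f"analytic check mu/(lam+mu) = {mu / (lam + mu):.6f}, MTBF = {1.0 / lam:.0f} h")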
4.1.3 Measurements

There are numerous theoretical and practical difficulties associated with making measurements. The question of what and how to measure is a difficult one. A combination of installed and custom instrumentation is typically used in most studies. From a statistical point of view, sound evaluations require a considerable amount of data. In modern computer systems, especially in fault tolerant systems, failures are infrequent; in order to obtain meaningful data, measurements must be made over a long period of time. Also, the measured system must be exposed to a wide range of usage conditions for the results to be representative.

There are normally two ways to make measurements: on-line automatic logging and manual logging by human operators. Many large computer systems, such as IBM and DEC mainframes, provide error-logging software in the operating system. This software records information on errors occurring in the various subsystems, such as the memory, disk, and network subsystems, as well as on other system events, such as reboots and shutdowns. The reports usually include information on the location, time, and type of the error, the system state at the time of the error, and sometimes error recovery (e.g., retry) information. The reports are stored chronologically in a permanent system file.

The main advantage of on-line automatic logging is its ability to record a large amount of information about transient errors and to provide details of automatic error recovery processes, which cannot be done manually. Disadvantages are that an on-line log does not usually include information about the cause and propagation of the error or about off-line diagnosis. Also, under some crash scenarios, the system may fail too quickly for any error messages to be recorded. Table 4.1.2 shows a sample of extracted error logs from a VAXcluster multicomputer system.

Often, the meaning of a record in a log can differ between versions of the operating system and between machine models. One reason is that error detection and recording routines are written and modified over time by different groups of people. For example, a careful study of VAX error logs and discussions with field engineers indicate that the operating systems on different VAX machine models might report the same type of error under different categories [51]. Thus, it is important to distinguish these errors in the subsequent error classification. Since the information provided by on-line error logs may not be complete, it is valuable to have operator logs compensate for the missing information.
Table 4.1.2. A Sample of Extracted Error Logs from a VAXcluster†
[The table lists, for each sampled entry (5815, 7005, 12979, 13005, 13734, 3260, 10939, 14209, 13941, 20937, 27958, and 37790), the system ID (the machines Earth, Europa, Mercury, Jupiter, Mars, and Saturn) and an interpretation, such as disk drive error, tape drive error, path #0 went from good to bad, error logging datagram received, virtual circuit timeout, corrected memory error, unexpected read data fault, machine check, bad memory deallocation request size or address, insufficient nonpaged pool to remaster locks, and unexpected system service exception.]
† The sample is intended to illustrate the different types of errors logged; entry numbers are therefore not consecutive.
An operator log should include information on system crashes, failure diagnosis, component replacement, and hardware and software updates.
4.1.4 Data Processing

Usually, on-line logs contain a large amount of redundant and irrelevant information in various formats. Thus, data processing must be performed to classify this information and to put it into a flat format that facilitates subsequent analyses.

The first step of data processing is error classification. This process classifies the errors in the measured system into types based on the subsystems and components in which they occur. There is no uniform or "best" error classification, because different systems have different hardware and software architectures. But some error types, such as CPU, memory, and disk errors, are seen in most systems. Table 4.1.3 lists an error classification (major error types) for VAXcluster systems [51], [52].

Table 4.1.3. Major Error Types in VAXclusters
System     Type      Description
Hardware   CPU       CPU or bus controller errors
           Memory    Memory ECC errors
           Disk      Disk, drive, and controller errors
           Network   Local network and controller errors
Software   Control   Problems involving program flow control or synchronization
           Memory    Problems referring to memory management or usage
           I/O       Inconsistent conditions detected by I/O management routines
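A minimal sketch of how such a classification step might be coded is given below. The keyword patterns are illustrative assumptions keyed to the hardware rows of Table 4.1.3, not the actual rules used in the cited VAXcluster studies.

# Map keywords that may appear in a raw log entry to the major hardware
# error types of Table 4.1.3 (illustrative patterns only).
HARDWARE_TYPES = {
    "CPU":     ("cpu", "bus controller", "machine check"),
    "Memory":  ("memory ecc", "corrected memory"),
    "Disk":    ("disk", "drive", "tape"),
    "Network": ("network", "path #", "virtual circuit", "datagram"),
}

def classify(description: str) -> str:
    """Return a major hardware error type for a raw log description,
    or 'Other' if no pattern matches."""
    text = description.lower()
    for err_type, keywords in HARDWARE_TYPES.items():
        if any(k in text for k in keywords):
            return err_type
    return "Other"

print(classify("Corrected memory error"))         # -> Memory
print(classify("Path #0 went from good to bad"))  # -> Network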
After error classification, the remaining data processing can be broadly divided into two steps: data extraction and data coalescing. Data extraction selects useful entries, such as error and reboot reports, from the log file (discarding uninteresting entries such as disk volume change reports) and transforms the data set into a flat format. The design of the flat format depends on the needs of the subsequent analyses. The following is a possible format:
entry number | logging time | error type | device id | error description fields
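One way to realize such a flat record in code is sketched below. The field names follow the format above; the tab-separated layout and the ISO timestamp are assumptions made for illustration, since the text does not fix a concrete file layout.

from dataclasses import dataclass
from datetime import datetime

@dataclass
class FlatRecord:
    """One entry of the flat-format data set described above."""
    entry: int
    logged_at: datetime
    error_type: str
    device_id: str
    description: str

def parse_line(line: str) -> FlatRecord:
    # Assume tab-separated fields in the order given above,
    # with the logging time in ISO format.
    entry, ts, etype, dev, desc = line.rstrip("\n").split("\t", 4)
    return FlatRecord(int(entry), datetime.fromisoformat(ts), etype, dev, desc)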
In on-line error logs, a single fault in the system can result in many repeated error reports in a short period of time. To ensure that the subsequent analyses will not be distorted by these repeated reports, entries that correspond to the same problem should be coalesced into a single event, or tuple [56]. A typical coalescing algorithm [23] merges into a tuple all error entries which have the same error type and occur within a ΔT interval of each other. The algorithm is as follows: IF <error type> = AND