FOUNDATIONS OF DEPENDABLE COMPUTING Paradigms for Dependable Applications
THE KLUWER INTERNATIONAL SERIES IN ENGINEERING AND COMPUTER SCIENCE OFFICE OF NAVAL RESEARCH Advanced Book Series Consulting Editor Andre M. van Tilborg Other titles in the series: FOUNDATIONS OF DEPENDABLE COMPUTING: Models and Frameworks for Dependable Systems, edited by Gary M. Koob and Clifford G. Lau ISBN: 0-7923-9484-4 FOUNDATIONS OF DEPENDABLE COMPUTING: System Implementation, edited by Gary M. Koob and Clifford G. Lau ISBN: 0-7923-9486-0
PARALLEL ALGORITHM DERIVATION AND PROGRAM TRANSFORMATION, edited by Robert Paige, John Reif and Ralph Wachter ISBN: 0-7923-9362-7 FOUNDATIONS OF KNOWLEDGE ACQUISITION: Cognitive Models of Complex Learning, edited by Susan Chipman and Alan L. Meyrowitz ISBN: 0-7923-9277-9 FOUNDATIONS OF KNOWLEDGE ACQUISITION: Machine Learning, edited by Alan L. Meyrowitz and Susan Chipman ISBN: 0-7923-9278-7 FOUNDATIONS OF REAL-TIME COMPUTING: Formal Specifications and Methods, edited by Andre M. van Tilborg and Gary M. Koob ISBN: 0-7923-9167-5 FOUNDATIONS OF REAL-TIME COMPUTING: Scheduling and Resource Management, edited by Andre M. van Tilborg and Gary M. Koob ISBN: 0-7923-9166-7
FOUNDATIONS OF DEPENDABLE COMPUTING Paradigms for Dependable Applications
edited by Gary M. Koob Clifford G. Lau Office of Naval Research
KLUWER ACADEMIC PUBLISHERS Boston / Dordrecht / London
Distributors for North America: Kluwer Academic Publishers 101 Philip Drive Assinippi Park Norwell, Massachusetts 02061 USA Distributors for all other countries: Kluwer Academic Publishers Group Distribution Centre Post Office Box 322 3300 AH Dordrecht, THE NETHERLANDS
Library of Congress Cataloging-in-Publication Data A C.I.P. Catalogue record for this book is available from the Library of Congress.
Copyright © 1994 by Kluwer Academic Publishers All rights reserved. No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, mechanical, photocopying, recording, or otherwise, without the prior written permission of the publisher, Kluwer Academic Publishers, 101 Philip Drive, Assinippi Park, Norwell, Massachusetts 02061 Printed on acid-free paper. Printed in the United States of America
CONTENTS
Preface    vii
Acknowledgements    xiii

1. PROTOCOL-BASED PARADIGMS FOR DISTRIBUTED APPLICATIONS    1

1.1  Adaptive System-Level Diagnosis in Real-Time    3
     R.P. Bianchini, Jr. and M. Stahl

1.2  Refinement for Fault-Tolerance: An Aircraft Handoff Protocol    39
     K. Marzullo, F.B. Schneider, and J. Dehn

1.3  Language Support for Fault-Tolerant Parallel and Distributed Programming    55
     R.D. Schlichting, D.E. Bakken, and V.T. Thomas

2. ALGORITHM-BASED PARADIGMS FOR PARALLEL APPLICATIONS    79

2.1  Design and Analysis of Algorithm-Based Fault-Tolerant Multiprocessor Systems    81
     S. Yajnik and N.K. Jha

2.2  Fault-Tolerance and Efficiency in Massively Parallel Algorithms    125
     P.C. Kanellakis and A.A. Shvartsman

3. DOMAIN-SPECIFIC PARADIGMS FOR REAL-TIME SYSTEMS    155

3.1  Use of Imprecise Computation to Enhance Dependability of Real-Time Systems    157
     J.W.S. Liu, K-J. Lin, R. Bettati, D. Hull, and A. Yu

3.2  Analytic Redundancy for Software Fault-Tolerance in Hard Real-Time Systems    183
     M. Bodson, J.P. Lehoczky, R. Rajkumar, L. Sha, and J. Stephan

Index    213
PREFACE
Dependability has long been a central concern in the design of space-based and military systems, where survivability for the prescribed mission duration is an essential requirement, and is becoming an increasingly important attribute of government and commercial systems, where reduced availability may have severe financial consequences or even lead to loss of life. Historically, research in the field of dependable computing has focused on the theory and techniques for preventing hardware and environmentally induced faults through increasing the intrinsic reliability of components and systems (fault avoidance), or surviving such faults through massive redundancy at the hardware level (fault tolerance). Recent advances in hardware, software, and measurement technology, coupled with new insights into the nature, scope, and fundamental principles of dependable computing, however, contributed to the creation of a challenging new research agenda in the late eighties aimed at dramatically increasing the power, effectiveness, and efficiency of approaches to ensuring dependability in critical systems.

At the core of this new agenda was a paradigm shift spurred by the recognition that dependability is fundamentally an attribute of applications and services—not platforms. Research should therefore focus on (1) developing a scientific understanding of the manifestations of faults at the application level in terms of their ultimate impact on the correctness and survivability of the application; (2) innovative, application-sensitive approaches to detecting and mitigating this impact; and (3) hierarchical system support for these new approaches.

Such a paradigm shift necessarily entailed a concomitant shift in emphasis away from inefficient, inflexible, hardware-based approaches toward higher level, more efficient and flexible software-based solutions. Consequently, the role of hardware-based mechanisms was redefined to that of providing and implementing the abstractions required to support the higher level software-based mechanisms in an integrated, hierarchical approach to ultradependable system design.

This shift was furthermore compatible with an expanded view of "dependability," which had evolved to mean "the ability of the system to deliver the specified (or expected) service." Such a definition encompasses not only survival of traditional single hardware faults and environmental disturbances but more complex and less well understood phenomena as well: Byzantine faults, correlated errors, timing faults, software design and process interaction errors, and—most significantly—the unique issues encountered in
real-time systems in which faults and transient overload conditions must be detected and handled under hard deadline and resource constraints.

As sources of service disruption multiplied and focus shifted to their ultimate effects, traditional frameworks for reasoning about dependability had to be rethought. The classical fault/error/failure model, in which underlying anomalies (faults) give rise to incorrect values (errors), which may ultimately cause incorrect behavior at the output (failures), required extension to capture timing and performance issues. Graceful degradation, a long-standing principle codifying performance/dependability trade-offs, must be more carefully applied in real-time systems, where individual task requirements supersede general throughput optimization in any assessment. Indeed, embedded real-time systems—often characterized by interaction with physical sensors and actuators—may possess an inherent ability to tolerate brief periods of incorrect interaction, either in the values exchanged or the timing of those exchanges. Thus, a technical failure of the embedded computer does not necessarily imply a system failure. The challenge of capturing and modeling dependability for such potentially complex requirements is matched by the challenge of successfully exploiting them to devise more intelligent and efficient—as well as more complete—dependability mechanisms.

The evolution to a hierarchical, software-dominated approach would not have been possible without several enabling advances in hardware and software technology over the past decade:

(1) Advances in VLSI technology and RISC architectures have produced components with more chip real estate available for incorporation of efficient concurrent error detection mechanisms and more on-chip resources permitting software management of fine-grain redundancy;

(2) The emergence of practical parallel and distributed computing platforms possessing inherent coarse-grain redundancy of processing and communications resources—also amenable to efficient software-based management by either the system or the application;

(3) Advances in algorithms and languages for parallel and distributed computing leading to new insights in and paradigms for problem decomposition, module encapsulation, and module interaction, potentially exploitable in refining redundancy requirements and isolating faults;

(4) Advances in distributed operating systems allowing more efficient interprocess communication and more intelligent resource management;
(5) Advances in compiler technology that permit efficient, automatic instrumentation or restructuring of application code, program decomposition, and coarse and fine-grain resource management; and

(6) The emergence of fault-injection technology for conducting controlled experiments to determine the system- and application-level manifestations of faults and evaluating the effectiveness or performance of fault-tolerance methods.

In response to this challenging new vision for dependable computing research, the advent of the technological opportunities for realizing it, and its potential for addressing critical dependability needs of Naval, Defense, and commercial systems, the Office of Naval Research launched a five-year basic research initiative in 1990 in Ultradependable Multicomputers and Electronic Systems to accelerate and integrate progress in this important discipline. The objective of the initiative is to establish the fundamental principles as well as practical approaches for efficiently incorporating dependability into critical applications running on modern platforms. More specifically, the initiative sought increased effectiveness and efficiency through

(1) Intelligent exploitation of the inherent redundancy available in modern parallel and distributed computers and VLSI components;

(2) More precise characterization of the sources and manifestations of errors;

(3) Exploitation of application semantics at all levels—code, task, algorithm, and domain—to allow optimization of fault-tolerance mechanisms to both application requirements and resource limitations;

(4) Hierarchical, integrated software/hardware approaches; and

(5) Development of scientific methods for evaluating and comparing candidate approaches.

Implementation of this broad mandate as a coherent research program necessitated focusing on a small cross-section of promising application-sensitive paradigms (including language, algorithm, and coordination-based approaches), their required hardware, compiler, and system support, and a few selected modeling and evaluation projects. In scope, the initiative emphasizes dependability primarily with respect to an expanded class of hardware and environment (both physical and operational) faults. Many of the efforts furthermore explicitly address issues of dependability unique to the domain of embedded real-time systems.

The success of the initiative and the significance of the research is demonstrated by the ongoing associations that many of our principal investigators have forged with a variety of military, Government, and commercial projects whose critical needs are leading to the rapid assimilation of concepts, approaches, and expertise arising from this initiative. Activities influenced to date include the FAA's Advanced Automation System for air traffic control, the Navy's AX project and Next Generation Computing Resources standards program, the Air Force's Center for Dependable Systems, the OSF/1 project, the space station Freedom, the
Strategic Defense Initiative, and research projects at GE, DEC, Tandem, the Naval Surface Warfare Center, and MITRE Corporation.

This book series is a compendium of papers summarizing the major results and accomplishments attained under the auspices of the ONR initiative in its first three years. Rather than providing a comprehensive text on dependable computing, the series is intended to capture the breadth, depth, and impact of recent advances in the field, as reflected through the specific research efforts represented, in the context of the vision articulated here. Each chapter does, however, incorporate appropriate background material and references.

In view of the increasing importance and pervasiveness of real-time concerns in critical systems that impact our daily lives—ranging from multimedia communications to manufacturing to medical instrumentation—the real-time material is woven throughout the series rather than isolated in a single section or volume.

The series is partitioned into three volumes, corresponding to the three principal avenues of research identified at the beginning of this preface. While many of the chapters actually address issues at multiple levels, reflecting the comprehensive nature of the associated research project, they have been organized into these volumes on the basis of the primary conceptual contribution of the work. Agha and Sturman, for example, describe a framework (reflective architectures), a paradigm (replicated actors), and a prototype implementation (the Screed language and Broadway runtime system). But because the salient attribute of this work is the use of reflection to dynamically adapt an application to its environment, it is included in the Frameworks volume.

Volume I, Models and Frameworks for Dependable Systems, presents two comprehensive frameworks for reasoning about system dependability, thereby establishing a context for understanding the roles played by specific approaches presented throughout the series. This volume then explores the range of models and analysis methods necessary to design, validate, and analyze dependable systems.

Volume II, Paradigms for Dependable Applications, presents a variety of specific approaches to achieving dependability at the application level. Driven by the higher level fault models of Volume I and built on the lower level abstractions implemented in Volume III, these approaches demonstrate how dependability may be tuned to the requirements of an application, the fault environment, and the characteristics of the target platform. Three classes of paradigms are considered: protocol-based paradigms for distributed applications, algorithm-based paradigms for parallel applications, and approaches to exploiting application semantics in embedded real-time control systems.

Volume III, System Implementation, explores the system infrastructure needed to support the various paradigms of Volume II.
Approaches to implementing support mechanisms and to incorporating additional appropriate levels of fault detection and fault tolerance at the processor, network, and operating system level are presented. A primary concern at these levels is balancing cost and performance against coverage and overall dependability. As these chapters demonstrate, low overhead, practical solutions are attainable and not necessarily incompatible with performance considerations. The section on innovative compiler support, in particular, demonstrates how the benefits of application specificity may be obtained while reducing hardware cost and run-time overhead.

This second volume of the series builds on the modeling foundation established in Volume I by exploring specific paradigms for managing redundancy and faults at the application level through specialized algorithms or protocols. Consistent with the layered view of dependability that characterizes this series, these software-oriented approaches rely not only on the underlying models of Volume I for their soundness, but on the abstractions of Volume III for their practicality.

In distributed systems, general-purpose dependability is often achieved through process replication managed through protocols. The three approaches described in Section 1 vary in purpose and degree of insulation from the application. Bianchini and Stahl explore the nuances of adapting distributed diagnosis algorithms to a real-time environment. Whereas the diagnosis paradigm is largely independent of the application, the authors demonstrate how consideration of the fault environment and scheduling constraints can lead to unanticipated modes of interaction. Marzullo, et al., present the refinement mapping approach for deriving customized dependable protocols for specific applications. The approach is illustrated through an air traffic control example. Finally, in an instantiation of Agha's concept of reflection (Vol. I), Schlichting, et al., consider two classes of language extensions to support enhanced application-specific control over redundancy and recovery management.

Parallel systems are characterized by larger degrees and finer granularity of concurrency than distributed systems. In such large-scale systems with frequent interprocess communication, conventional replication approaches are too costly, inefficient, and potentially detrimental to performance. Fortunately, unlike distributed applications, which are typically decomposed by function, parallel scientific algorithms often employ data decomposition to assign each processor (running substantially the same program) a sub-domain corresponding, e.g., to a distinct region of physical space. The regular structure of these computations may be exploited through algorithmic transformations to provide low overhead error detection and recovery. Two such approaches are described in Section 2. Yajnik and Jha focus on the data by presenting a graph-theoretic methodology for generating check operations used to detect and locate faults. Kanellakis and Shvartsman exploit the homogeneity of typical parallel tasks by allowing work to be dynamically redistributed in the event of failures.
Although real-time issues are addressed throughout this series, the tight coupling of embedded real-time systems to applications such as process control and the semantics of those applications—characterized by continuously changing physical variables—suggest an opportunity to explore highly effective and efficient dependability mechanisms that recognize potentially relaxed constraints derived from the additional latitude in error sensitivity typical of these applications. In Section 3, Liu, et al., present one such approach for managing redundancy and supporting rapid recovery under hard real-time constraints by trading off result quality for computation time. Bodson, et al., present a paradigm for software fault tolerance based on the concept of analytical redundancy, in which the behavior of a complex control algorithm of uncertain integrity is monitored by a simpler, robust algorithm of similar but less refined functionality.

Gary M. Koob
Mathematical, Computer and Information Sciences Division
Office of Naval Research

Clifford G. Lau
Electronics Division
Office of Naval Research
ACKNOWLEDGEMENTS
The editors regret that, due to circumstances beyond their control, two planned contributions to this series could not be included in the final publications: "Compiler Generated Self-Monitoring Programs for Concurrent Detection of Run-Time Errors," by J.P. Shen and "The Hybrid Fault Effects Model for Dependable Systems," by C.J. Walter, M.M. Hugue, and N. Suri. Both represent significant, innovative contributions to the theory and practice of dependable computing and their omission diminishes the overall quality and completeness of these volumes.

The editors would also like to gratefully acknowledge the invaluable contributions of the following individuals to the success of the Office of Naval Research initiative in Ultradependable Multicomputers and Electronic Systems and this book series: Joe Chiara, George Gilley, Walt Heimerdinger, Robert Holland, Michelle Hugue, Miroslaw Malek, Tim Monaghan, Richard Scalzo, Jim Smith, André van Tilborg, and Chuck Weinstock.
SECTION 1
PROTOCOL-BASED PARADIGMS FOR DISTRIBUTED APPLICATIONS
SECTION 1.1
Adaptive System-Level Diagnosis in Real-Time¹

Mark E. Stahl²
Ronald P. Bianchini, Jr.³

Distributed real-time systems are subject to stricter fault-tolerance requirements than non-real-time systems. This work presents an application of system-level diagnosis to a real-time distributed system as a first step in providing fault-tolerance. An existing algorithm for distributed system-level diagnosis, Adaptive_DSD, is converted to a real-time framework, establishing a deadline for the end-to-end diagnosis latency. Rate monotonic analysis is chosen as the framework for achieving real-time performance. The ADSD algorithm is converted into a set of independent periodic tasks running at each node, and a systematic procedure is used to assign priorities and deadlines to minimize the hard deadline of the diagnosis function. The resulting algorithm, Real-Time Adaptive Distributed System-Level Diagnosis (RT-ADSD), is fully compatible with a real-time environment, where both the processors and the network support fixed-priority scheduling. The RT-ADSD algorithm provides a useful first step in adding fault-tolerance to distributed real-time systems by quickly and reliably diagnosing node failures. The key results presented here include a framework for specifying real-time distributed algorithms and a scheduling model for analyzing them that accounts for many requirements of distributed systems, including network I/O, task jitter, and critical sections caused by shared resources.
1. This research is supported in part by the Office of Naval Research under Grant N00014-91-J-1304 and under a National Science Foundation Graduate Research Fellowship. Any opinions, findings, conclusions or recommendations expressed in this publication are those of the authors and do not necessarily reflect the views of the Office of Naval Research or the National Science Foundation.
2. Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA 15213.
3. Associate Professor, Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA 15213.
1.1.1 Introduction

As distributed systems proliferate, they increasingly become a platform for the implementation of real-time systems. Distributed real-time systems are subject to stricter fault-tolerance requirements than non-real-time systems, since a fault can be either a failed resource or a missed deadline. Our approach to distributed fault tolerance is to have the system perform self-diagnosis at the system level. In system-level diagnosis, a network is modeled as a collection of nodes and edges, such that nodes can be faulty or fault-free. Carnegie Mellon University has recently developed the Adaptive_DSD (ADSD) algorithm for performing on-line distributed system-level diagnosis in fully connected networks. The ADSD algorithm allows every fault-free node to diagnose the fault state of every other node in the system. This is the foundation of providing fault tolerance in a distributed system, by quickly and reliably diagnosing node failures [1].

For the diagnosis algorithm to operate in real-time, a deadline is established by which all fault-free nodes in the system are guaranteed to achieve correct diagnosis. This work describes the specification of the Real-Time Adaptive Distributed System-Level Diagnosis (RT-ADSD) algorithm, an implementation of the ADSD algorithm suitable for execution in a real-time environment. The RT-ADSD algorithm provides a hard deadline for the diagnosis latency - the time from a fault event, either a node failure or recovery, until all nodes are aware of the event. The algorithm is fully distributed, executing at all fault-free nodes, and utilizes adaptive testing to minimize testing overhead.

The diagnosis latency deadline is expressed as an end-to-end deadline of a process that is distributed among multiple nodes in a network. Delay is introduced due to computation time at each node and due to transmission time for each message sent between nodes. The approach used in this work is to establish intermediate deadlines for the portion of work performed on each node and link in the network. The end-to-end deadline is the sum of the deadlines achieved at each node and link.

Other work has addressed the issue of real-time distributed fault tolerance using fault-tolerant group membership protocols. Ezhilchelvan and Lemos [2] give a membership protocol for distributed systems that utilizes "cycles" in a synchronous network. During a cycle, each processor is given the opportunity to broadcast a message and processes any incoming messages. Practical considerations, such as the length of a cycle or the amount of work performed during each cycle, are not explored. The algorithm's real-time bound, expressed in terms of cycles, is a measure of the time complexity of the algorithm's communication; i.e., the length of the longest sequence of messages that are broadcast before diagnosis is achieved.

This work utilizes a different approach to achieving real-time behavior by utilizing rate monotonic analysis (RMA) [3, 4, 5] to schedule the tasks. In rate monotonic analysis, there is no overriding cycle that regulates when work is performed, nor is a synchronous network required to bound communication time. Rather, each task has its own period and occurs independently of other tasks in the system.
Using RMA, the schedulability of each resource (i.e., network link or processor) is independently verified, and the delay of each resource is independently quantified. RMA provides the framework for the specification of RT-ADSD. By utilizing RMA, RT-ADSD can be integrated with other tasks being performed by the distributed system provided those tasks are also specified using RMA.

The design of RT-ADSD differs from common approaches to designing real-time systems in that, in many real-time systems, the program being designed has some fixed periodic requirements and/or deadlines that must be met. For example, [6] describes a robotic system where a sensor collecting data must be serviced during every collection interval. In contrast, RT-ADSD is formed by coercing the ADSD algorithm into the real-time framework for the purpose of establishing a hard deadline. No a priori period or deadline is given. The techniques used to create and model RT-ADSD are applicable to other distributed algorithms that operate in real-time environments. The models used here encompass many aspects of distributed programs that must be addressed for their real-time behavior to be analyzed, including idle time for server response, communication delay over a real-time network, and task arrival jitter caused by sequentially dependent tasks.

This chapter is organized as follows. Section 2 describes the ADSD algorithm. Section 3 develops the programming and scheduling models used by RT-ADSD. It reviews rate monotonic scheduling theory, along with extensions necessary for RT-ADSD. Section 4 describes how to specify a task set for RT-ADSD to arrive at a minimal deadline for diagnosis latency. An example specification is given in Section 5. Conclusions are presented in Section 6.
1.1.2 System-Level Diagnosis

Consider a system of interconnected components, or units, such that each unit can be either faulty or fault-free. The diagnosis problem is that of determining the set of faulty units by utilizing the results of tests performed among the units themselves. Preparata, Metze, and Chien initiated the study of system-level diagnosis in 1967 by presenting necessary conditions for system-level diagnosability [7]. Testing assignments for diagnosable systems were later characterized by Hakimi and Amin [8]. Since those pioneering works, a large body of literature in system-level diagnosis has been generated [9]. Adaptive testing assignments are used to eliminate redundant tests and reduce overhead [10]. In distributed diagnosis [11], each unit in the system forms its own independent diagnosis of the state of every other unit. In 1991, Bianchini and Buskens presented and implemented the first adaptive distributed diagnosis algorithm, Adaptive_DSD [1].

The Adaptive_DSD algorithm is used as the foundation for this work in real-time distributed systems. The Adaptive_DSD algorithm is summarized below. The ADSD algorithm has many features that are desirable for a real-time implementation, including a provably minimum number of tests and a highly periodic structure that can be adapted to rate monotonic scheduling theory.
1.1.2.1 Terminology

A graph-theoretical model of a distributed computer network is utilized [12]. Nodes represent processors; edges represent communication links. Each node has an associated fault state, either faulty or fault-free, and is assumed to be capable of testing its neighbors. The result of a test is either faulty or fault-free. A fault model is used to relate test results to the fault state of the nodes involved in the test. In the PMC (or permanent) fault model [7], tests performed by fault-free nodes are accurate and tests performed by faulty nodes produce arbitrary results. The testing assignment is the set of all tests performed between nodes. Diagnosis is the process of mapping the set of test results to the set of faulty nodes. In this work, communication links are assumed to be fault-free.

1.1.2.2 The Adaptive_DSD Algorithm

The Adaptive_DSD algorithm performs distributed system-level diagnosis in fully connected networks. Adaptive_DSD utilizes an adaptive testing strategy to minimize the testing overhead and achieve distributed diagnosis. Tests are assumed to conform to the PMC fault model.

/* ADAPTIVE_DSD */
/* Executed at each node n_x, 0 <= x < N */
1.  y = x;
2.  repeat {
        y = (y + 1) mod N;
        test n_y and request n_y to forward Tested_Up_y to n_x;
    } until (n_x tests n_y as "fault-free");
3.  Tested_Up_x[x] = y;
4.  for i = 0 to N-1
    4.1.    if (i != x)
    4.1.1.      Tested_Up_x[i] = Tested_Up_y[i];

Figure 1.1.1. The Adaptive_DSD Algorithm.

The Adaptive_DSD algorithm is summarized by the procedure given in Figure 1. Adaptive_DSD modifies the testing assignment at each node to maintain a Hamiltonian cycle of fault-free nodes. To achieve this, all N nodes in the system are ordered in a cycle. Periodically, each node tests consecutive nodes in the cycle until a single fault-free node is found. When a node n_i finds another node n_j fault-free, it records that result in a local data structure, the Tested_Up_i array, setting Tested_Up_i[i] = j. For example, in Figure 2, nodes n_1, n_4 and n_5 are faulty; all others are fault-free. Tested_Up_i[0] = 2 indicates that n_0 tests n_2 as fault-free. Node n_1 is implied faulty by omission. The resulting testing assignment contains a Hamiltonian cycle of all fault-free nodes.
Figure 1.1.2. Testing Graph and Associated Tested_Up Array.

A local copy of the Tested_Up array is maintained at each node to allow independent diagnosis. To correctly diagnose the system, each node must receive the test results from all other fault-free nodes. Whenever a node is tested, it returns a copy of its Tested_Up array along with the test result. In this way, diagnosis information flows in the reverse direction of tests. The test procedure also provides validation of diagnosis information, ensuring that any information received from a faulty node is discarded. After a finite number of test periods, every fault-free node receives the diagnosis information from all other fault-free nodes. To diagnose the state of the network, a node follows the Tested_Up pointers from itself to all other fault-free nodes.

Diagnosis latency is the time from the occurrence of a fault event until all nodes are aware of the event, and is a function of both the number of nodes in the cycle and the message forwarding protocol used to distribute information. Since nodes are not synchronized with respect to periodic tests, information about a new fault event can be held for a full test period at each node before it is requested by previous nodes in the cycle. The resulting end-to-end delay is O(Tp*N), where Tp is the test period and N is the number of nodes in the system.
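To make the pointer-following diagnosis step concrete, the sketch below shows one way it could be coded. It is a minimal sketch, not the RT-ADSD implementation: the names (diagnose, NODES) are invented for illustration, and an entry of -1 is assumed to mark a node for which no fault-free successor is recorded.

#include <stdbool.h>

#define NODES 8                 /* illustrative network size */

/* tested_up[x] == y means node n_x has tested n_y fault-free;
 * -1 means no result is recorded for n_x (assumed convention). */
void diagnose(int self, const int tested_up[NODES], bool fault_free[NODES])
{
    for (int i = 0; i < NODES; i++)
        fault_free[i] = false;       /* faulty until proven otherwise */

    fault_free[self] = true;         /* a fault-free node trusts itself */
    int node = self;
    while (tested_up[node] != -1 && !fault_free[tested_up[node]]) {
        node = tested_up[node];      /* follow the cycle of fault-free tests */
        fault_free[node] = true;
    }
}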
1.1.2.3 Target Platform

The real-time scheduling model used throughout this work is developed to be consistent with the target platform, consisting of four nodes connected by a shared medium network. The CPU at each node is a Motorola 68020 microprocessor. The programming model is based on the Ada programming language, which provides support for multiple tasks executing at fixed priorities. The programming model also assumes several protocols necessary for algorithm scheduling, including the priority ceiling protocol [13] and sporadic service [14]. These services are explained more fully in Section 3, and can be emulated as shown in [15] if they are not provided by the run-time environment.

Communication networks are a major source of unpredictability in real-time distributed systems. To alleviate this problem, the target platform consists of an experimental network, the Real-Time Communication Network (RTCN), designed to support preemptive, fixed-priority scheduling [16]. RTCN provides two hundred and fifty-six priority levels and several communication services to the applications programmer. Services include connection-oriented and connectionless services, with either synchronous (blocking) or asynchronous (non-blocking) I/O. Because RTCN supports preemptive priority-based scheduling, it can be analyzed as a single processor resource using rate monotonic theory. However, since RTCN is a shared medium network, traffic from all nodes must be scheduled concurrently.

1.1.2.4 Approach

To create a real-time implementation, the ADSD algorithm must be represented in a form suitable for applying rate monotonic analysis. First the ADSD algorithm is decomposed into a set of processor and network tasks. Processor tasks are identified as threads, and network tasks are called messages. Each task is assigned a period, a priority, and a deadline. The schedulability of the task set at each node is verified using rate monotonic analysis [3, 4, 5]. Similarly, the schedulability of the network tasks is verified. The diagnosis latency is then expressed as the sum of the deadlines achieved for the sequence of threads and messages required to achieve diagnosis. Techniques are also given for integrating RT-ADSD with other task sets that adhere to rate monotonic assumptions.
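Stated schematically (this is a restatement of the latency decomposition just described, not a formula from the chapter), the end-to-end bound is the sum of the intermediate deadlines met along the chain of threads and messages that propagate a fault event:

\[
D_{\text{latency}} \;\le\; \sum_{i \,\in\, \text{threads on the diagnosis path}} D_i \;+\; \sum_{j \,\in\, \text{messages on the path}} D_j
\]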
1.1.3 Scheduling Model for Distributed Systems

A scheduling model is used to analyze the timing behavior of a set of tasks. If all tasks in the set can meet their deadlines under all conditions, the task set is schedulable. To verify schedulability, the execution behavior of each task must be modeled. Rate monotonic analysis assumes that each task arrives periodically and has a hard deadline. Each task is assumed to execute for a fixed (or worst case) time at one priority, and is independent of all other tasks. Since real world tasks often do not meet all these criteria, the scheduling model is extended to account for irregularities. The periodic tasks that implement the RT-ADSD algorithm are constructed from several "building blocks" that are necessary for a distributed system, and that must be accommodated by the scheduling model. The following building blocks are used:
•  Normal sections: Code segments that execute at one priority. These segments are assumed to be perfectly preemptable; no overhead is incurred when switching from tasks at one priority to another.

•  Critical sections: A critical section is a segment of code that locks a resource needed by other threads. Critical sections are used for task synchronization and can create priority inversion; a lower priority task can block a higher priority task. For example, when using a semaphore, a lower priority task can lock the semaphore, preventing a higher priority task from executing. Critical sections can also occur with other forms of task synchronization, e.g., the Ada Rendezvous [17].

•  Interprocess Communication (IPC): Threads executing on the same processor are permitted to send messages to each other. IPC is assumed to be implemented such that no critical section occurs when sending or receiving a message.

•  I/O points: An I/O point occurs whenever a thread transmits a message over the network or receives a message from the network. The point at which a processor task outputs a message becomes the arrival point for a network task. For RTCN, messages are transmitted asynchronously; the task does not wait for a reply or acknowledgment.

•  Idle time: Threads are suspended while waiting for a message to arrive, either from another thread via IPC or from another processor via the network. Idle time can destroy the periodicity assumptions used in rate monotonic analysis.

•  Loops: Some threads are required to repeat certain code segments that are dependent on the fault state of other nodes in the system, and cannot be predicted or bounded a priori. Repeated segments can include I/O points and critical sections.
Using these building blocks, RT-ADSD can be modeled as a set of threads and messages. Threads are composed of some or all of these building blocks. Message tasks are modeled as consisting of only "normal" segments. As an example, the TestNodes thread implements the portion of RT-ADSD that issues tests and processes the result (see Figure 1). A graphical representation of the TestNodes thread is shown in Figure 3. This thread includes many of the above elements, including normal execution time, critical sections, I/O points, loops, and idle time. This thread is used as an example in later sections.
Figure 1.1.3. Graphical representation of the TestNodes thread
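As a rough illustration of the structure sketched in Figure 1.1.3, the following code sketches one period of such a thread. It is only a sketch under stated assumptions: the platform primitives (send_test_request, await_reply, lock_tested_up, unlock_tested_up) are invented names, not part of the RT-ADSD implementation or of any particular RTOS API.

#include <stdbool.h>

#define N 8                          /* illustrative number of nodes */

/* Assumed platform primitives -- names are illustrative only. */
extern void send_test_request(int node);            /* I/O point: emit test request */
extern bool await_reply(int node, int result[N]);   /* idle time: suspend for reply */
extern void lock_tested_up(void);                   /* enter critical section       */
extern void unlock_tested_up(void);                 /* leave critical section       */

static int tested_up[N];             /* shared Tested_Up array */

/* One period of a TestNodes-style thread: test successive nodes in the
 * cycle until a fault-free one replies (Loop_P), then record the result
 * and merge the forwarded diagnosis information under a lock. */
void test_nodes_period(int self)
{
    int forwarded[N];
    int y = self;
    bool fault_free;

    do {
        y = (y + 1) % N;
        send_test_request(y);                    /* output test request message    */
        fault_free = await_reply(y, forwarded);  /* suspend until reply or timeout */
    } while (!fault_free);                       /* Loop_P: skip faulty nodes      */

    lock_tested_up();                            /* critical section: array access */
    tested_up[self] = y;
    for (int i = 0; i < N; i++)
        if (i != self)
            tested_up[i] = forwarded[i];
    unlock_tested_up();
}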
Each of the "building blocks", other than normal execution time, introduces variance into the schedulability of the processors. Either the real-time system must be designed to eliminate this variance, or a scheduling model must be developed to account for them. Both approaches are used with RT-ADSD. The next section presents the scheduling model for fully periodic tasks that consist only of normal execution time. Later sections address the exceptions required by the RT-ADSD algorithm.

1.1.3.1 Fixed-Priority Scheduling

Rate monotonic analysis, as originally proposed by Liu and Layland, assumes task sets are composed of periodic, independent, fixed-priority tasks [3]. Each task requires a fixed amount of processor time which must be completed before its next period. Priorities are assigned rate monotonically; tasks with shorter periods receive higher priority. Assuming a perfectly preemptive system, Liu and Layland showed that there is an upper bound on processor utilization, below which all tasks in a set are guaranteed to meet their deadline. Furthermore, they showed that a rate monotonic priority assignment is the optimal fixed-priority assignment. Lehoczky, Sha and Ding later provided an exact-case schedulability analysis, showing that the schedulable utilization is frequently higher than the Liu and Layland bound [4]. Later work by Leung and Whitehead showed that a deadline monotonic priority assignment, i.e., shorter deadlines have higher priority, is optimal when the deadlines are less than or equal to their periods [18].

The following schedulability analysis is adapted from [5] and assumes deadlines are always shorter than or equal to the period of the tasks. Given a task set, S = {τ_1, τ_2, ..., τ_n}, each task, τ_i, can be defined by the tuple (C_i, T_i, D_i, P_i), where

C_i = the worst case execution time of τ_i. For network messages, C_i represents the transmission time.
T_i = the period of τ_i.
D_i = the deadline of τ_i, D_i <= T_i.
P_i = the priority of τ_i.

For a deadline monotonic priority assignment, the tasks in S are ordered with monotonically increasing deadlines, such that D_i <= D_j for i < j. Priorities are chosen such that P_i > P_j for all i < j. Ties are broken arbitrarily.

The response time of a task is the interval between its request and completion. A critical instant is the time when a task has its longest response time. For fixed-priority tasks where each task arrives independent of all others, a critical instant occurs for task τ_m in S when it arrives simultaneously with all higher priority tasks. If τ_m can meet its deadline during the critical instant, it will always meet its deadline [3].
To check the schedulability of τ_m, define:

\[
W_m \;=\; \min_{0 < t \le D_m} \frac{1}{t}\left[\,\sum_{j=1}^{m-1} C_j \left\lceil \frac{t}{T_j} \right\rceil \;+\; C_m \;+\; B_m \right]
\tag{EQ 1}
\]
Given that a critical instant occurs at time t = 0, the quantity \(\sum_{j=1}^{m-1} C_j \lceil t/T_j \rceil\) represents the cumulative processor demand from all jobs of higher priority than τ_m, and hence the amount of time τ_m can be preempted by higher priority tasks prior to time t. The term B_m represents the longest amount of time that τ_m can be blocked beyond the quantity above, e.g., due to priority inversion. For independent tasks, B_m = 0 for all m. Thus W_m is the lowest processor utilization that occurs between the start of the critical instant and the deadline of τ_m. For a task τ_m to complete by its deadline, there must exist some time, t, in 0 < t <= D_m where the cumulative demand from all higher priority processes and the first job of τ_m is less than t. Equivalently, W_m must be less than or equal to 1. When a non-zero blocking term is present, a task can be delayed for an additional B_m units during its execution. If W_m is still less than or equal to one with blocking present, the task is schedulable. The entire task set is schedulable if

\[
\forall i,\; 1 \le i \le n:\quad W_i \le 1
\tag{EQ 2}
\]
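The test in Equations 1 and 2 is straightforward to mechanize. The sketch below is one possible coding, assuming the task array is already sorted by increasing deadline (i.e., decreasing priority); the struct and function names are illustrative, not part of the chapter's toolset. It exploits the fact that the demand term only changes at multiples of the higher priority periods, so only those instants (and D_m itself) need to be checked.

#include <math.h>
#include <stdbool.h>

/* One task: worst-case execution time C, period T, deadline D (D <= T),
 * and blocking term B, as defined above. */
struct task { double C, T, D, B; };

/* Demand placed on the processor in [0, t] by task m, its blocking term,
 * and all higher priority tasks (indices 0..m-1), per Equation 1. */
static bool demand_fits(const struct task *s, int m, double t)
{
    double demand = s[m].C + s[m].B;
    for (int h = 0; h < m; h++)
        demand += s[h].C * ceil(t / s[h].T);
    return demand <= t;              /* equivalent to W_m <= 1 at this t */
}

/* Task m meets its deadline if the demand fits at some 0 < t <= D_m. */
static bool task_schedulable(const struct task *s, int m)
{
    if (demand_fits(s, m, s[m].D))
        return true;
    for (int j = 0; j < m; j++)                            /* scheduling points:   */
        for (double t = s[j].T; t <= s[m].D; t += s[j].T)  /* multiples of T_j     */
            if (demand_fits(s, m, t))
                return true;
    return false;
}

/* Equation 2: the whole set is schedulable if every task passes. */
bool task_set_schedulable(const struct task *s, int n)
{
    for (int m = 0; m < n; m++)
        if (!task_schedulable(s, m))
            return false;
    return true;
}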
The tasks of RT-ADSD violate several of the assumptions utilized in basic RMA scheduling. For example, several nodes may fail in sequence, placing an undue testing burden on one particular node. Also, when tasks are executed sequentially on multiple resources, their arrivals are no longer periodic [15]. These issues and others are addressed below.

1.1.3.2 Dynamic Execution Behavior

The loop construct in tasks allows the execution time to vary greatly in each period. For example, if there are no faulty nodes in the system, the TestNodes thread will execute as shown in Figure 4a, without taking the branch at the loop. However, if there are faulty nodes present, the TestNodes thread will repeatedly issue a test request to the next node in sequence until a fault-free node is found. This makes the execution time of this thread, as well as the number of I/O points and the amount of idle time, dependent on the fault situation, which is unpredictable. Figure 4 shows some of the ways the TestNodes thread can execute for different fault situations.

The number of times Loop_P is taken depends on the number of faulty nodes in the system, which is bounded by the number of nodes in the system. One approach to determining the execution behavior of this thread is to assume, in the worst case, N-1 iterations of Loop_P, and take the total execution time as the worst case behavior. This design is highly pessimistic, resulting in an overdesigned system with significant excess processor capacity at each node. Instead, a bound, F_s, is assumed for the number of faulty nodes, bounding the number of iterations of Loop_P. CPU and network resources are then reserved to accommodate the worst case task set for the assumed F_s. This design limit is required to achieve real-time behavior: if the number of faults does not exceed F_s, the system's diagnosis latency will meet its deadline.

Figure 1.1.4. Execution behavior of TestNodes when (a) the next node in the cycle is fault-free, (b) the next node is faulty, and (c) the next two nodes in sequence are faulty.

1.1.3.3 Sporadic Service

In a real environment, the number of faults can exceed the chosen bound F_s, increasing the execution time of TestNodes and causing lower priority tasks in the system to miss their deadlines. This situation is referred to as a transient fault overload. Since it is likely that RT-ADSD is not the primary task executing in the system, other tasks must remain schedulable under transient fault overloads. One approach is to "hard code" the limit F_s into the program, which limits Loop_P to F_s+1 iterations. When |F| > F_s, diagnosis is lost for the system. Instead, we rely on sporadic service to limit the rate of diagnosis [14]. Diagnosis will always be correct, but diagnosis latency may increase beyond its bound under transient fault overload.

A sporadic server is defined by an execution capacity, C_s, and a refresh period, T_s, and executes at the priority of the task it is servicing. As capacity is used by the serviced task, it is refreshed one period after the task became eligible for execution. The capacity is chosen to equal the permitted worst case execution time of the task being serviced. At run-time, the task executes without any delay beyond that caused by higher priority tasks provided the task does not request more than C_s units of time. If the task requests more than C_s units of time, as in the case of a transient fault overload, the sporadic server pauses the task such that the task executes as a periodic task with execution time C_s and period T_s [14]. The sporadic server ensures that other lower priority tasks remain schedulable under transient overload. By inserting delay, the sporadic server can prevent correct diagnosis from being achieved by the deadline. However, the system eventually achieves correct diagnosis for any number of faults.

A sporadic service can be provided as a run-time service by the operating system. When run-time support is not available, Klein, et al., show how a sporadic server can be emulated for an aperiodic task where the execution time is bounded. Their pseudocode is given in Figure 5 [15]. For tasks where the execution time is not bounded, as in the TestNodes task above, a run-time supported server is required.
The overhead from the sporadic server is ignored in this analysis. For a more exact analysis, overhead can be treated as blocking time on higher priority tasks.

/* Sporadic Server Emulation */
Begin Task()
    Set_Priority (highest system priority);
    loop
        wait for "start task" signal;
        Init_Time = Read_Clock();
        Set_Priority (normal task priority);
        Do_Task();
        Replenish_Time = Init_Time + Refresh_Period;
        Set_Priority (highest system priority);
        Sleep_Until (Replenish_Time);
    end loop;
end;

Figure 1.1.5. Sporadic Server Emulation.

Sporadic service is also utilized to solve problems that arise when a task experiences idle time. If a task relinquishes the processor, deferring a portion of its execution, it can create a window of time during which a lower priority task is preempted more than is assumed by the scheduling model in Equation 1. The negative impact on schedulability is known as the deferred execution effect [2, 20], and is illustrated by the following example. Consider two tasks, τ_1 with C_1=6, T_1=15, D_1=14, and τ_2 with C_2=11, T_2=24, D_2=16. Assume neither task experiences idle time. The system is schedulable according to Equation 1. One possible execution instance is illustrated in Figure 6a. Now assume τ_1 goes idle for a variable length of time. Figure 6b illustrates how τ_2 is delayed during its second execution, causing it to miss its deadline. The system is no longer schedulable due to the deferred execution effect. In Figure 6c, the deferred execution effect is eliminated by utilizing a sporadic server for τ_1, restoring schedulability.
Figure 6a. Periodic tasks without idle time.
Figure 6b. Task τ_1 with idle time. τ_2 missed its deadline in the second period.
Figure 6c. Using a sporadic server for task τ_1 restores schedulability.

The sporadic server emulated in Figure 5 is not sufficient for tasks with idle time. Either a more complicated emulation or a run-time server must be utilized. Additional considerations arise when the idle time in a sporadically serviced task occurs at different points within the task. Different idle points are possible given the looping construct provided in the programming model. Such a task may miss its deadline due to interactions with the sporadic server. Consider the example in Figure 7. Task τ_1 has C_1 = 4, T_1 = 8, D_1 = 7, and can suspend itself for up to 2 units of time. The suspension occurs in one of two ways, as shown. Assume τ_1 is serviced by a sporadic server with capacity 4, and an execution of type A is followed by an execution of type B. In this example, τ_1 misses its deadline during the second period due to interaction with the sporadic server.
Figure 1.1.7. Task suspends itself in one of two ways. Deadline missed due to sporadic service.
To prevent this form of blocking, τ_1 is transformed into a logical task where the execution time before and after any idle point is the maximum possible for all possible execution behaviors. The execution time of the logical task is used as the capacity of the sporadic server. In the previous example, transform τ_1 into a logical task with C_1=5, as shown in Figure 8. Using the larger server, an execution of type A followed by type B meets its deadline in both periods.
Figure 1.1.8. Deadline met using sporadic server with capacity of 5.

Note that the target platform can only support sporadic service for processor tasks. There is no equivalent for network tasks. A distributed form of sporadic server is developed for the network in a later section.

1.1.3.4 Critical Sections: the Priority Ceiling Protocol

Tasks in the RT-ADSD task set share a common data structure, the Tested_Up array. For proper synchronization, the data structure must be protected by semaphores, creating critical sections of code. Priority inversion can be caused by two tasks that share a semaphore; a lower priority task may block a higher priority task by locking the shared resource. The priority ceiling protocol (PCP) limits the time a higher priority task can be blocked waiting for a semaphore to at most one critical section of any lower priority task [13].

The priority ceiling protocol is a form of a priority inheritance protocol in which each semaphore is assigned a priority ceiling based on the highest priority task that uses it. A task can lock a semaphore only if the task's current priority is higher than the priority ceiling of all semaphores currently locked. If a lower priority task is blocking the execution of a higher priority task, the priority of the lower task is elevated to the level of the blocked task until the semaphore is released. When the priority ceiling protocol is used, the worst-case blocking time experienced by a task τ_i is:

\[
B_i \;=\; \max_{i < j \le n} S_j
\tag{EQ 3}
\]
where S_j is the length of the longest critical section in task τ_j. Equation 3 gives a simple model for priority ceiling protocol blocking by assuming that a task can be blocked by the critical section of any lower priority task. Klein and Ralya give a more accurate derivation of PCP blocking by considering resource interaction [2]. However, the model given in Equation 3 is sufficient for our analysis, and can be replaced by the Klein and Ralya model without loss of generality. In RT-ADSD, the priority ceiling protocol is emulated by raising the priority of a task to its priority ceiling whenever it enters a critical section [17].

1.1.3.5 Task Suspension and Idle Time

The thread illustrated in Figure 3 suspends itself until a reply is received from a neighboring processor. This introduces two problems into the scheduling model. The first problem, the deferred execution effect, is eliminated by sporadic service. The second problem is how to account for the idle time in the scheduling model. One way to incorporate idle time is to treat it as blocking time, adding it to the value for B_i derived above. This approach is pessimistic since the scheduling model treats blocking time as equivalent to execution time. Under this assumption, when a task τ_m is suspended, higher priority tasks that execute during the suspension delay τ_m's completion.

In this work, task suspension is handled by treating each task segment before and after the suspension as a separate task, or subtask. Each subtask has a deadline and blocking term independent of the other subtasks. When checking a compound task for schedulability, check the schedulability of each subtask instead using Equation 1. This method is also pessimistic in that, for a given task set, it may not be possible for all the subtasks to experience the worst case delay from higher priority tasks that is assumed by Equation 1. However, this approach is beneficial when developing the expression for end-to-end diagnosis latency. The diagnosis latency is expressed as the sum of several tasks executing sequentially on several resources. By dividing the task into subtasks, the deadline for each subtask can be included in the expression for end-to-end latency independent of other subtasks from the same compound task. Idle time is also eliminated from the expression for latency by replacing it with a sequence of (sub)tasks executing on other resources.

Subtasks are incorporated into the scheduling model in Section 3.1 by replacing task τ_i containing idle time by m separate logical tasks τ_i,1, τ_i,2, ..., τ_i,m. The newly created tasks have the same period and priority as τ_i but have individual execution times, C_i,j, such that C_i,1 + C_i,2 + ... + C_i,m = C_i. Each subtask is assigned its own deadline, D_i,j. Under the independence assumption used in the scheduling model, all m subtasks can arrive simultaneously. Schedulability can be improved by noting that subtasks from the same task are mutually exclusive. Therefore, when using Equation 1 to check the schedulability of subtask τ_i,j, assume the other equal priority subtasks τ_i,k, k != j, do not execute. When checking the schedulability of a task with priority lower than τ_i, treat τ_i as a single task with execution time C_i.

An example of subtasks is the TestNodes thread from Figure 4. Consider the execution given in Figure 4a. Two logical threads are identified: TestNodesBegin and TestNodesEnd. For the situation where one node detects a fault, Figure 4b, an additional logical thread occurs, TestNodesMiddle, each time the TestNodes task executes.
When multiple faults are present, the TestNodesMiddle task occurs once for each fault, and is replicated in the logical task set.

1.1.3.6 Task Jitter

The scheduling model presented so far is based on tasks that arrive on a periodic basis. In the RT-ADSD task set, interprocess communication allows tasks to arrive at the completion of another task. The completion time of the first task is variable and depends on interaction with higher priority processes. The arrival time of the second (dependent) task is no longer periodic. The variability of the arrival time due to sequential dependency is referred to as task jitter.

Consider the following example of task jitter. Assume a processor schedules three tasks, τ_1, τ_2, and τ_3, using a deadline monotonic priority assignment. Tasks τ_1 and τ_2 arrive in a periodic fashion. Task τ_3 depends on the output of τ_2, and arrives immediately upon τ_2's completion, as shown in Figure 9. Even though τ_2 and τ_3 both share a period of 20, τ_3 can arrive early, as quickly as 17 units after its initial starting time. This variance in the arrival time can negatively impact the schedulability of tasks with priority lower than τ_3, similar to the way idle time causes a deferred execution effect.
Figure 1.1.9. Task jitter. τ_3 arrives upon the completion of τ_2.
The impact of task jitter is analyzed by treating it as a deferred execution effect. Task τ_3 is effectively deferred up to 6 units of time, which is the difference between the best-case and worst-case completion times of τ_2. The deferred execution effect adds to the blocking term the minimum of the amount of idle time and the amount of execution time deferred, or min(6,4) = 4 units [20].

In RT-ADSD, jitter is compounded by the fact that many tasks execute as part of a sequence. For example, a periodic test produces a test request message, which causes a test routine to execute at the next processor, which produces a test reply message, etc. Jitter is introduced at each task arrival, creating successively larger differences between the best and worst case completion times. When this interval plus the execution time of a task exceeds the task's period, execution periods overlap, causing additional blocking to lower priority tasks.
The model for the deferred execution effect given by [20] is no longer valid. When this interval plus execution time exceeds the task's deadline, the task set is no longer schedulable. A more viable solution for distributed systems is to restore periodicity at each step by inserting delay before each dependent task. In RT-ADSD, the sporadic server automatically provides the necessary delay for tasks running on the processors. For example, if τ_3 is now serviced by a sporadic server with capacity of 4 and period 20, periodicity is restored (Figure 10). Note that τ_3 is delayed at most until the deadline of τ_2, its worst case start time.
Figure 1.1.10. Jitter problem solved by using sporadic service for τ3.

Network messages are produced as the output of threads and also suffer task jitter. While some early research focuses on networks that enforce periodicity [21], the target platform does not. The user must ensure that messages are released to the network in a periodic fashion. Sha and Goodenough describe a technique whereby network tasks are buffered and released only at periodic intervals [17]. This principle is adapted for RT-ADSD by creating a SporadicThrottle task at each processor for outgoing network messages.

A sporadic throttle operates as follows. When a task wishes to send a message to the network, it sends the message to the output throttle. The throttle begins with an integer number of tokens, C_t. Each message sent over the network uses one token, which is then scheduled to be replenished one period after the message became valid for transmission. If a packet arrives while the throttle capacity is exhausted, the packet is queued for later transmission. Pseudocode for a sporadic throttle is shown in Figure 11.

    /* Sporadic Throttle */
    Begin Task()
        Replenish_Time[0..(Capacity-1)] = Read_Clock();
        Next_Token = 0;
        loop
            wait for message to arrive in Queue;
            Init_Time = MAX(Message_Queued_Time, Replenish_Time[Next_Token]);
            Transmit_Message();
            Replenish_Time[Next_Token] = Init_Time + Refresh_Period;
            Next_Token = (Next_Token + 1) mod Capacity;
            Sleep_Until(Replenish_Time[Next_Token]);
        end loop;
    end;

Figure 1.1.11. Sporadic throttle for controlling output jitter.

Assuming network messages have a fixed (or maximum) size, the sporadic throttle behaves exactly like a sporadic server, allowing the network schedulability to be verified with rate monotonic analysis. Each node in the system operates an independent sporadic throttle task, making the total network traffic N·C_t messages per period. To eliminate jitter on the network, the throttle must transmit a message whenever both a token and a message are available; hence the sporadic throttle must always execute as the highest priority task in the system. This priority must be maintained even after RT-ADSD is integrated with other task sets. The throttle task is written to be "lightweight" to limit its scheduling impact on lower priority tasks.

The sporadic throttle does not eliminate jitter at the receiver, since messages can arrive at their destinations in a variable amount of time. Jitter at the receiver is eliminated by the sporadic server on the input task.

In addition to eliminating output jitter, the throttle prevents network schedulability from being compromised during transient fault overload situations. The throttle is given sufficient capacity to support all packets that are generated at one node for the desired performance. As long as these limits are not exceeded, all messages meet their deadlines.

When a node generates messages at more than one priority, each priority must be serviced by a separate throttle. The processor's highest priority is reserved for the throttle servicing the highest priority messages. The next highest priority is used for the throttle servicing the next highest priority messages, and so on. Again, these priorities must remain unchanged as more tasks are added. As more message types are added, the system scales easily by adding more throttle tasks at each node. Note that lower priority messages can experience delay due to higher priority throttles. However, since lower priority messages are blocked by higher priority messages at the network controller, the jitter introduced by multiple throttles is assumed to be negligible.
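For concreteness, the replenishment logic of Figure 1.1.11 can also be written as an ordinary function that, given the time a message was queued, returns the earliest time the throttle may transmit it and schedules the token refill. This is a minimal, single-threaded C sketch with names of our own choosing; the real throttle task would additionally block on the message queue and sleep until the next replenishment time.

    #include <stdio.h>

    #define CAPACITY 2   /* F_s + 1 tokens in the RT-ADSD configuration */

    struct throttle {
        double replenish[CAPACITY]; /* replenishment time of each token */
        int    next_token;          /* index of the token to use next   */
        double refresh_period;      /* one test period, T_p             */
    };

    /* Earliest time the queued message may be sent; consumes one token and
     * schedules its replenishment one period after the message became valid. */
    static double throttle_release_time(struct throttle *t, double queued_at)
    {
        double init = queued_at > t->replenish[t->next_token]
                    ? queued_at : t->replenish[t->next_token];
        t->replenish[t->next_token] = init + t->refresh_period;
        t->next_token = (t->next_token + 1) % CAPACITY;
        return init;   /* transmit no earlier than this */
    }

    int main(void)
    {
        struct throttle th = { .replenish = {0.0, 0.0},
                               .next_token = 0, .refresh_period = 1389.4 };
        /* Three back-to-back messages: the third must wait for a token. */
        printf("%.1f\n", throttle_release_time(&th, 0.0));
        printf("%.1f\n", throttle_release_time(&th, 0.0));
        printf("%.1f\n", throttle_release_time(&th, 0.0));
        return 0;
    }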
1.1.3.7 Failure Semantics

The behavior of a faulty node is determined by its failure semantics [22]. The traditional PMC fault model assumes that faulty nodes respond to test requests with a faulty test result. This assumption does not consider the myriad ways a node could fail in a real distributed system. For example, a node could fail in such a way that it repeatedly transmits a large number of unnecessary high priority messages. This is referred to as fail-broadcast semantics. Such a node would compromise the schedulability of all messages on the network, further jeopardizing the schedulability of tasks at other fault-free nodes. This form of failure is not tolerated by our scheduling model. A more restricted failure model must be assumed.

For RT-ADSD to be implementable, several requirements must be met by faulty nodes. As demonstrated above, a node cannot fail-broadcast. Since RT-ADSD relies upon a shared medium network, it is necessary that all nodes respect the communication limits imposed by the sporadic throttles in order for the network to remain schedulable. If the network were constructed using point-to-point links, this requirement would be eliminated. Furthermore, should a node continue to issue test requests after it has failed, it is important that any extra requests are ignored by other fault-free nodes. In a point-to-point network, a node could limit the number of test requests that are accepted from any one source. We assume that node failures in RT-ADSD do not violate these constraints. Finally, a faulty node cannot respond to a test request as fault-free. The assumption of 100% test coverage is required by the PMC fault model. This does not exclude the possibility of a node failing to respond to a test, in which case the fault is detected when the response timeout is exceeded.

One technique for meeting these requirements is for the hardware to be constructed with fail-stop semantics: a faulty node stops responding to network requests and sending messages [22]. A test becomes a simple "I'm alive" request-response message pair. Any further test performed by the testing node is redundant. For RT-ADSD, nodes are assumed to have fail-test/stop semantics, defined as follows: a faulty node either responds to test requests with a faulty test result, or does not respond at all; a faulty node does not transmit any messages beyond those allowed by the sporadic throttle; and a faulty node does not test any other node more than once per test period. These requirements are less stringent than requiring nodes with fail-stop semantics, resulting in a more robust implementation.

1.1.4 RT-ADSD Implementation

The implementation of RT-ADSD is specified in several steps. First, the algorithm is decomposed into a set of parallel threads and messages, each composed of the building blocks presented in Section 3. The execution behavior of each thread is examined to construct the elements of the RT-ADSD task set. Separate task sets are constructed for the network and for each processor. Tasks are assigned priorities and deadlines to
minimize diagnosis latency. Finally, a procedure is given to allow the RT-ADSD task set to be integrated with other tasks running on the same system.

1.1.4.1 Terminology

The RT-ADSD implementation is characterized by its design parameters and operational variables. Each network and node task is specified by a set of variables. This section details the parameters that must be characterized by the design process.

Design parameters are those values that must be chosen by the system designer prior to implementation and define the final system domain. There are two design parameters in RT-ADSD: the number of nodes in the system, N, and the maximum number of simultaneously faulty nodes, F_s, also referred to as the fault limit. The value of F_s reflects the desired level of fault-tolerance. The real-time deadline achieved for RT-ADSD is guaranteed to be met provided the number of faulty nodes does not exceed F_s. If F_s is exceeded, correct diagnosis is still provided but may not meet the expected deadline.

Operational variables describe an aspect of RT-ADSD's operation, are constrained during the specification process, and are traditionally minimized or maximized. The three operational variables for RT-ADSD are:

• T_p : The test period, or time between periodic tests issued by one node.
• T_out : The maximum time from a test request to a reply, beyond which a node is considered faulty.
• Diagnosis Latency : The maximum time allowed from a fault event until all nodes are aware of the event.

The network task set represents a set of messages that are periodically presented to the network for transmission. A task τ_i is characterized by (C_i, D_i, T_i, P_i), as defined in Section 3.1. The transmission time, C_i, is fixed by the message length and the network bandwidth, both of which are known for a particular implementation. Each node generates two message tasks, test requests and test replies. The collection of network tasks generated at one node i is referred to as S^i_COMM. The task set S_COMM is the collection of all tasks that must be supported by the communication network. The target platform, RTCN, assumes all nodes share a common channel. Hence, S_COMM = ∪_i S^i_COMM. Note that S_COMM contains multiple tasks with the same priority; when considering the schedulability of an individual task in S_COMM, the other N−1 tasks of equal priority are assumed to block the given task. This pessimistic treatment of equal priority messages produces the worst case delay for any one message.

The parallel threads running at each node comprise the CPU task set. Each task τ_i is characterized by (C_i, D_i, T_i, P_i, S_i, B_i). The first four variables are defined in Section 3.1. The remaining variables are:

• S_i : The length of τ_i's longest critical section.
• B_i : The worst case blocking time of τ_i, determined by Equation 3.
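The per-task characterization just listed can be summarized in a single record. The C sketch below is purely illustrative; the type and field names are ours, and times are assumed to be in microseconds as in the example of Section 1.1.5.

    /* Characterization of one RT-ADSD task, following Section 1.1.4.1. */
    struct rt_task {
        double C;   /* worst-case execution (or transmission) time          */
        double D;   /* deadline                                             */
        double T;   /* period (the test period T_p for every RT-ADSD task)  */
        int    P;   /* fixed priority (1 = highest)                         */
        double S;   /* longest critical section (processor tasks only)      */
        double B;   /* worst-case blocking time, from Equation 3            */
    };

    /* Design parameters fixed before the task set is built. */
    struct rt_adsd_design {
        int N;      /* number of nodes                                      */
        int F_s;    /* fault limit: max simultaneously faulty nodes         */
    };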
The execution time, C_i, and critical section, S_i, are fixed by the processor speed of the target platform. The blocking term B_i is dependent on the priority ordering of the tasks [Equation 3]. The remaining variables are derived during the specification process. A sporadic server is created using C_i, T_i, and P_i to service each task. The set of tasks that execute at any node n_i is referred to as S^i_CPU.

Tasks that experience idle time are treated as multiple logical tasks, as explained in Section 3.5. For a task τ_i consisting of m subtasks, each subtask τ_{i,j} has its own execution time C_{i,j} such that C_{i,1} + C_{i,2} + ... + C_{i,m} = C_i. Each subtask also has its own deadline, D_{i,1}, D_{i,2}, ..., D_{i,m}. A deadline is not assigned to the task as a whole.

1.1.4.2 Task Definition

The following set of tasks fully implements the ADSD algorithm, in conformance with the above scheduling model. Each thread is modeled as an Ada task and network tasks are represented as RTCN messages. The processor task set executing at each node n_i, S^i_CPU, consists of four concurrent threads:

TestNodes: Periodically test subsequent nodes until a fault-free node is found. Process and store the Tested_Up array returned with the test result.

ReceiveRequest:
Process incoming test requests received from the Comm thread. Send reply, along with a copy of the local Tested_Up array.
Diagnose:
Diagnose the state of all other nodes.
OutputThrottle:
Implement the sporadic throttle. Two copies of this thread run simultaneously, one for test requests and another for test replies.
The network task set, S^i_COMM, consists of two types of messages: TestRequests and TestResults. TestRequest messages are generated by the TestNodes thread and convey the identity of the requesting node. TestResult messages contain the result of the test along with a copy of the tested node's Tested_Up array. Acknowledgment messages are not considered in this specification. Node failures are detected by either a faulty test result or a reply timeout.

The Tested_Up array data structure is accessed by the TestNodes, ReceiveRequest and Diagnose threads, and requires a mutual exclusion protocol. The priority ceiling protocol provides the necessary mutual exclusion and is emulated in Ada by the use of a monitor operating at the priority ceiling of the critical section.

The resource requirements of the four processor tasks are summarized graphically in Figure 12. Execution times are not to scale. The figure gives a description of the physical implementation of RT-ADSD. However, in the logical implementation analyzed by the scheduling model, tasks such as TestNodes are further decomposed into subtasks TestNodesBegin, TestNodesMiddle and TestNodesEnd, as described in Section 3.5.
Figure 1.1.12. Thread execution behavior and resource requirements for the basic RT-ADSD algorithm.

The interaction between tasks at one node is diagrammed by a Petri net in Figure 13. Flow of control between tasks on the same processor occurs via interprocess communication. Messages pass the flow of control between separate processors. Since the three TestNodes subtasks are part of the same physical thread, the subtask that executes is determined by the result of the test.

1.1.4.3 Task Execution Behavior

To generate the elements of the CPU and Comm task sets, the arrival pattern and periods of each task must be determined. Using Figure 13, timelines are generated to visualize the execution behavior of RT-ADSD for a sample execution. Figure 14 illustrates the tasks executing at two fault-free nodes, A and B, where A tests B. At node A, four tasks execute each period to test node B: TestNodesBegin, RequestOutputThrottle, TestNodesEnd, and Diagnose. At node B, two threads execute to process the incoming test: ReceiveRequest and ReplyOutputThrottle. In a fault-free environment, every node issues one test and is tested by one other node. Therefore, the total task set at each node is the combination of the tasks executing at nodes A and B. The CPU task set for this fault situation consists of six tasks, each having period T_p.

Figure 15 illustrates a timeline for the case of a single faulty node. Node A tests node B as faulty and node C as fault-free. Note that the CPU task set contains an additional task, TestNodesMiddle, and the RequestOutputThrottle executes twice. By the failure semantics developed in Section 3.7, the faulty node can also issue a test request to node C once per period, forcing node C to execute an additional occurrence of the ReceiveRequest and ReplyOutputThrottle threads. In general, the TestNodesMiddle and RequestOutputThrottle threads execute once for every faulty node in sequence following the testing node.
Figure 1.1.13. Petri net showing thread and packet interaction at one node.

The ReceiveRequest and ReplyOutputThrottle threads may execute once per period for each faulty node in the system. The worst case CPU task set for an individual node is:

S^i_CPU = { τ_TestNodesBegin, F_s·τ_TestNodesMiddle, τ_TestNodesEnd, (F_s+1)·τ_RequestOutputThrottle, (F_s+1)·τ_ReceiveRequest, (F_s+1)·τ_ReplyOutputThrottle }

(EQ 4)
Similar analysis yields the combined set of tasks presented to the network each period. If no faults are present, each node issues one TestRequest and one TestResult message. The combined Comm task set for N nodes contains N TestRequest tasks and N TestResult tasks, each with period T_p. When faulty nodes are present, an additional TestRequest is issued per period for each faulty node. Furthermore, by the failure semantics in Section 3.7, faulty nodes can issue test requests and replies up to the capacity of their sporadic throttles. The capacities of the two throttles are evident from Equation 4.
Figure 1.1.14. Timeline for RT-ADSD: node A tests node B as fault-free.
Figure 1.1.15. Timeline for RT-ADSD: node A tests node B as faulty, node C as fault-free.
For the RequestOutputThrottle to execute F_s+1 times it must begin with F_s+1 tokens. Network capacity is reserved for the worst possible set of RT-ADSD messages, thereby ensuring non-ADSD messages that their timing will not be compromised. The pessimistic set of communication tasks is:

S_COMM = { (N+2·F_s+1)·τ_TestRequest, (N+2·F_s+1)·τ_TestResult }

(EQ 5)
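A quick way to see how the worst-case task sets grow with the design parameters is to evaluate Equations 4 and 5 directly. The helper below is only a sketch with names of our own choosing; it prints the task multiplicities for a given N and F_s.

    #include <stdio.h>

    /* Worst-case multiplicities from Equations 4 and 5. */
    static void worst_case_task_counts(int N, int F_s)
    {
        /* Processor task set S^i_CPU (Equation 4) */
        printf("TestNodesBegin          x %d\n", 1);
        printf("TestNodesMiddle         x %d\n", F_s);
        printf("TestNodesEnd            x %d\n", 1);
        printf("RequestOutputThrottle   x %d\n", F_s + 1);
        printf("ReceiveRequest          x %d\n", F_s + 1);
        printf("ReplyOutputThrottle     x %d\n", F_s + 1);

        /* Network task set S_COMM (Equation 5) */
        printf("TestRequest messages    x %d\n", N + 2 * F_s + 1);
        printf("TestResult messages     x %d\n", N + 2 * F_s + 1);
    }

    int main(void)
    {
        worst_case_task_counts(4, 1);   /* the example of Section 1.1.5 */
        return 0;
    }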
1.1.4.4 Bounding Parameters

The contents of the S^i_CPU and S_COMM task sets depend on the values chosen for the design parameters N and F_s. For the following sections we assume that the system must achieve a hard deadline for diagnosis latency when the number of faults is less than or equal to one; thus F_s = 1.

Necessary restrictions on the test period and test timeout are evident from the timelines. The test timeout must allow sufficient time for the test result to arrive. All tasks share the same period, T_p. Since the three TestNodes tasks are implemented as one physical thread, the TestNodesEnd thread must complete before the TestNodesBegin thread can execute again. Therefore, T_p must be greater than or equal to the sum of the deadlines of all tasks that occur between the start of TestNodesBegin and the completion of TestNodesEnd when one (in general, F_s) node is faulty. These two constraints can be expressed as follows:

T_out ≥ D_RequestOutputThrottle + D_TestRequest + D_ReceiveRequest + D_ReplyOutputThrottle + D_TestResult

(EQ 6)

T_p ≥ D_TestNodesBegin + T_out + D_TestNodesMiddle + T_out + D_TestNodesEnd

(EQ 7)
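Equations 6 and 7 translate directly into two sums over deadlines. The sketch below (hypothetical names) computes the smallest admissible T_out and T_p; with the deadlines later derived in the example of Section 1.1.5 (Table 1.1.2) it reproduces the reported test timeout of 340.5 microseconds, and a test period of 1390, close to the reported 1389.4, with the small difference presumably due to rounding of the tabulated deadlines.

    #include <stdio.h>

    /* Smallest T_out and T_p allowed by Equations 6 and 7. */
    struct bounds { double T_out, T_p; };

    static struct bounds bounding_parameters(
        double D_ReqThrottle, double D_TestRequest, double D_ReceiveRequest,
        double D_ReplyThrottle, double D_TestResult,
        double D_TNBegin, double D_TNMiddle, double D_TNEnd)
    {
        struct bounds b;
        /* Equation 6: one request/reply round trip must fit in the timeout. */
        b.T_out = D_ReqThrottle + D_TestRequest + D_ReceiveRequest
                + D_ReplyThrottle + D_TestResult;
        /* Equation 7: TestNodes must finish before it is released again. */
        b.T_p = D_TNBegin + b.T_out + D_TNMiddle + b.T_out + D_TNEnd;
        return b;
    }

    int main(void)
    {
        /* Deadlines from Table 1.1.2 (N = 4, F_s = 1). */
        struct bounds b = bounding_parameters(45, 10.5, 185, 65, 35,
                                              303, 193, 213);
        printf("T_out >= %.1f us, T_p >= %.1f us\n", b.T_out, b.T_p);
        return 0;
    }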
The diagnosis latency is expressed as the sum of three components: 1) the time for the fault or recovery to be detected; 2) the time for the Tested_Up information to circulate to the remaining N−1 nodes; and 3) the time for the last node to run Diagnose and correctly determine the state of the network. Each of these terms is detailed for the case of a single fault event.

1) Upon a node failure, up to one full test period, T_p, can elapse before a test request arrives at the failed node. The testing node detects the failure by a faulty test result or a message timeout, whichever occurs first. The testing node then tests the next node in sequence (successfully) and stores the new information to its local Tested_Up array by the completion of the TestNodesEnd task. Assuming that each task completes by its deadline, the worst case delay is [Figure 15]:

T_p + 2·T_out + D_TestNodesEnd

(EQ 8)
When a node recovers, this time is shorter since the node responds within the timeout. However, Equation 8 is used as an upper bound.

2) Since nodes are not synchronized, a node with information about a fault event may not be tested for a full test period, incurring propagation delay. The propagation delay can reoccur for the remaining N−1 nodes in the system. The worst case delay from the first node to the last node is:

(N−1)·(T_p + D_ReceiveRequest + D_ReplyOutputThrottle + D_TestResult + D_TestNodesEnd)

(EQ 9)
3) The last node achieves diagnosis after running Diagnose on the updated Tested_Up information, requiring time:

D_Diagnose

(EQ 10)
The sum of these three equations is the worst case diagnosis latency for RT-ADSD:

Diagnosis Latency = N·(T_p + D_TestNodesEnd) + (N−1)·(D_ReceiveRequest + D_ReplyOutputThrottle + D_TestResult) + 2·T_out + D_Diagnose

(EQ 11)
To minimize this expression, the shortest values of T_p and T_out are used, as determined by Equations 6 and 7. The final expression for diagnosis latency is:

Diagnosis Latency = (3N+1)·D_ReceiveRequest + (3N+1)·D_ReplyOutputThrottle + (3N+1)·D_TestResult + (2N+2)·D_RequestOutputThrottle + (2N+2)·D_TestRequest + 2N·D_TestNodesEnd + N·D_TestNodesBegin + N·D_TestNodesMiddle + D_Diagnose

(EQ 12)

1.1.4.5 Assigning Priorities and Deadlines
The diagnosis latency given in Equation 12 is a weighted function of task deadlines. The elements contained in the CPU and Comm task sets must be assigned priorities and deadlines such that the system is schedulable and the minimum diagnosis latency is achieved. The tasks have no a priori deadlines or priorities. Rate monotonic analysis assigns priorities as the inverse of the arrival rate of each individual task. However, all the tasks in RT-ADSD share the same period, T_p, so a method must be determined to assign priorities. The priorities should be assigned so as to minimize diagnosis latency. Once priorities are assigned, a deadline can be assigned to each task based on its worst case completion time.

The method used in RT-ADSD to assign priorities is derived from the following example. Assume a set of jobs, J1, J2, ..., Jn, arrive simultaneously at a processor. A job is a single instance of a task, and is defined by an execution time C_i and priority P_i. If the jobs are ordered such that P_1 > P_2 > ... > P_n, then the shortest deadline that can be met for each job (assuming no other jobs are added later) is:

D_1 = C_1

(EQ 13)

D_2 = C_1 + C_2

(EQ 14)

D_n = Σ_{j=1..n} C_j

(EQ 15)
Assume a cost function that is a weighted sum of individual task deadlines. Diagnosis latency is one such function. Theorem 1: Given a cost function of the form Σ_i w_i·D_i, the minimum cost is obtained by arranging priorities such that C_i/w_i is monotonically increasing [23].
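Theorem 1 is easy to check numerically: sort the jobs by C_i/w_i, compute the deadlines of Equations 13 through 15 as running sums of execution times, and compare the weighted cost against another ordering. The C sketch below does exactly that for a small job set; the values are demo values loosely based on Table 1.1.1, blocking is ignored as in the theorem, and all names are ours.

    #include <stdio.h>
    #include <stdlib.h>

    struct job { double C, w; };

    /* Weighted cost sum(w_i * D_i) with D_i = C_1 + ... + C_i (Eq. 13-15). */
    static double cost(const struct job *j, int n)
    {
        double D = 0.0, total = 0.0;
        for (int i = 0; i < n; i++) {
            D += j[i].C;
            total += j[i].w * D;
        }
        return total;
    }

    static int by_c_over_w(const void *a, const void *b)
    {
        const struct job *x = a, *y = b;
        double d = x->C / x->w - y->C / y->w;
        return (d > 0) - (d < 0);
    }

    int main(void)
    {
        struct job jobs[] = { {175, 16}, {32, 1}, {60, 13} };
        int n = sizeof jobs / sizeof jobs[0];
        printf("cost before sorting: %.1f\n", cost(jobs, n));
        qsort(jobs, n, sizeof jobs[0], by_c_over_w);
        printf("cost with C/w order: %.1f\n", cost(jobs, n));
        return 0;
    }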
4.5.1 Network Tasks: Priorities and Deadlines
The following analysis assumes RT-ADSD tasks are the only tasks executing on the network. Both TestRequest and TestResult packets influence the end-to-end diagnosis latency in Equation 12. The contribution to the diagnosis latency can be expressed as:

Δ_Latency = (2N+2)·D_TestRequest + (3N+1)·D_TestResult

(EQ 16)
where D_i is the deadline met by a message from task τ_i. To assign optimal priorities, S_COMM is simplified by combining identical messages into one message. The transformed task set, S'_COMM, consists of two tasks: a TestRequest message, τ_Q, of length (N+2·F_s+1)·C_TestRequest, and a TestResult message, τ_R, of length (N+2·F_s+1)·C_TestResult, each having period T_p. The transformed task set is equivalent to the original task set in that it presents the same amount of work to the network at a given priority in any one period, and the deadline achieved for τ_Q (τ_R) is the same as that achieved by any individual TestRequest (TestResult) packet. From Theorem 1, the optimal priority assignment to minimize Equation 16 is found by sorting on C_i/w_i. For the two-task set, the values of C_i/w_i are:
C_Q/w_Q = (N+2·F_s+1)·C_TestRequest / (2N+2)

(EQ 17)

C_R/w_R = (N+2·F_s+1)·C_TestResult / (3N+1)

(EQ 18)
The transmission times C_TestRequest and C_TestResult are fixed for a particular implementation. If C_Q/w_Q < C_R/w_R, the optimal priority assignment places TestRequests higher than TestResults. The deadline for a single message is then:

D_TestRequest = (N+2·F_s+1)·C_TestRequest

(EQ 19)

D_TestResult = (N+2·F_s+1)·C_TestRequest + (N+2·F_s+1)·C_TestResult

(EQ 20)
When C_R/w_R < C_Q/w_Q, the priority ordering is reversed and new deadlines are computed in a similar fashion.

4.5.2 Processor Tasks: Priorities and Deadlines
The following analysis assumes RT-ADSD tasks are the only tasks executing on the processor. All tasks in the CPU set share the same period and are assumed to have deadlines shorter than their periods. The execution time of each task, C_i, is fixed by the source code and the target platform. Once a priority assignment is given, the deadline for each task can be determined as follows. Assuming the tasks are ordered such that P_1 > P_2 > ... > P_n, the shortest deadline that can be met for each task is:

D_1 = C_1 + B_1

(EQ 21)

D_2 = C_1 + C_2 + B_2

(EQ 22)

D_n = Σ_{j=1..n} C_j + B_n

(EQ 23)
For a task composed of logical subtasks, the deadline for each subtask τ_{j,k} is found by:

D_{j,k} = Σ_{i=1..j−1} C_i + C_{j,k} + B_{j,k}

(EQ 24)
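Equations 21 through 24 are running sums over the execution demand of higher priority tasks, plus a blocking term for the task (or subtask) whose deadline is being computed. A small C sketch, with names of our own choosing and per-priority demand values drawn from Table 1.1.2 of the example:

    #include <stdio.h>

    /* D_n = C_1 + ... + C_n + B_n (Equations 21-23). 'C' holds execution
     * demand in priority order (index 0 = highest priority). */
    static double deadline(const double *C, const double *B, int n)
    {
        double sum = 0.0;
        for (int i = 0; i <= n; i++)
            sum += C[i];
        return sum + B[n];
    }

    /* Equation 24: a subtask of task j sees the execution of all higher
     * priority tasks, its own segment C_jk, and its own blocking term, but
     * not its sibling subtasks (they are mutually exclusive). */
    static double subtask_deadline(const double *C, int j,
                                   double C_jk, double B_jk)
    {
        double sum = 0.0;
        for (int i = 0; i < j; i++)
            sum += C[i];
        return sum + C_jk + B_jk;
    }

    int main(void)
    {
        /* Per-priority demand n_i * C_i from Table 1.1.2 (N = 4, F_s = 1). */
        double C[] = { 20, 20, 120, 175, 32 };
        double B[] = { 25, 25, 25, 18, 0 };
        printf("D_ReceiveRequest = %.0f\n", deadline(C, B, 2));               /* 185 */
        printf("D_TestNodesBegin = %.0f\n", subtask_deadline(C, 3, 125, 18)); /* 303 */
        return 0;
    }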
The deadlines for tasks with priorities lower than j remain as given by Equation 23. Using the expression for diagnosis latency and Theorem 1, a priority assignment can be found for each task in the system. For a task τ_i that is composed of several subtasks, each subtask has a separate weighting factor w_{i,j}. Include τ_i in the priority order by sorting on the term

(Σ_{j=1..m} C_{i,j}) / (Σ_{j=1..m} w_{i,j})

(EQ 25)
rather than C_i/w_i. The priority assignment given by Theorem 1 produces the optimum diagnosis latency when all blocking terms are zero [23]. This result can be extended to show optimality when B_i is constant. However, the blocking quantity varies with the relative ordering of tasks (Equation 3), making the assignment given by Theorem 1 no longer optimum. Instead, Theorem 1 is used to assign an initial priority ordering, and then heuristic improvements are made. No claims are made as to its optimality.

The heuristic relies on swapping tasks with adjacent priorities and computing the effective change in diagnosis latency. The change in diagnosis latency when swapping tasks is illustrated with the following example. Assume an initial set of tasks, τ_1, τ_2, ..., τ_n, where each task is composed of a fixed execution time C_i and an order-dependent blocking term, B_i. Assume the tasks are arranged such that P_1 > P_2 > ... > P_n. Create a second ordering, τ'_1, τ'_2, ..., τ'_n, formed by exchanging the priorities of τ_i with τ_{i+1}. When the two tasks are exchanged, their blocking terms are recomputed as B'_i. The change in diagnosis latency in moving τ_i to τ'_{i+1}, as contributed by τ_i, is:

Δ_i = w_i·[C_{i+1} + (B'_i − B_i)]

(EQ 26)

and the change contributed by τ_{i+1} is:

Δ_{i+1} = w_{i+1}·[−C_i + (B'_{i+1} − B_{i+1})]

(EQ 27)

When the task τ_i is composed of m subtasks, the effect of swapping priorities is more complicated, since there are m blocking terms that can potentially be modified by the swap. If τ_i is moved to τ'_{i+1}, the contributed effect on diagnosis latency from τ_i is:

Δ_i = Σ_{j=1..m} w_{i,j}·[C_{i+1} + (B'_{i,j} − B_{i,j})]

(EQ 28)

If the τ_{i+1} task is made up of m subtasks, a similar equation for the contributed change in diagnosis latency is derived:

Δ_{i+1} = Σ_{j=1..m} w_{(i+1),j}·[−C_i + (B'_{(i+1),j} − B_{(i+1),j})]

(EQ 29)
If Δ_i + Δ_{i+1} < 0, then exchanging the tasks reduces the overall diagnosis latency. Using this result, the procedure for assigning priorities to the CPU task set, S^i_CPU, is:

1) For the tasks in S^i_CPU that execute more than once per period (TestNodesMiddle, RequestOutputThrottle, ReceiveRequest and ReplyOutputThrottle), combine identical executions of the same task into one task; i.e., assume one ReceiveRequest task of length (F_s+1)·C_ReceiveRequest.

2) Assign the RequestOutputThrottle task the highest priority in the system. Assign the ReplyOutputThrottle task the next highest priority in the system.

3) Of the remaining tasks in S^i_CPU without an assigned priority, use the initial priority order provided by Theorem 1 with the values for w_i defined by Equation 12; i.e., C_i/w_i (or [Σ_j C_{i,j}]/[Σ_j w_{i,j}] for tasks with subtasks) is monotonically increasing with decreasing priority.

6) Use a variant on bubble sort to improve diagnosis latency for the non-throttle tasks:

    For i = n−1 to 1
    begin
        For j = 1 to i
        begin
            Assume priorities of τ_j and τ_{j+1} are exchanged.
            Compute new blocking terms, B'_j and B'_{j+1}.
            Compute Δ_j and Δ_{j+1}.
            If Δ_j + Δ_{j+1} < 0
                Exchange priorities of τ_j and τ_{j+1}.
        end
    end
7) Use Equations 21 through 24 to assign deadlines for each task in S^i_CPU.

At this point, the CPU task set is fully specified. The operational parameters T_out and T_p are chosen as constrained by Equations 6 and 7, respectively. The final diagnosis latency is found from Equation 12. All fault-free nodes are guaranteed to achieve diagnosis within this deadline provided the chosen design parameters are not exceeded. If the design parameters are exceeded, the system still achieves diagnosis but the deadline is no longer guaranteed.

1.1.4.6 Integrating RT-ADSD

The procedures given above for assigning priorities and deadlines assume that RT-ADSD is the only process running on the system. For RT-ADSD to provide fault-tolerance in a distributed system that is already supporting other real-time tasks, a technique is needed to integrate the various task sets. Non-RT-ADSD tasks are referred to as system tasks. This section describes a technique to integrate RT-ADSD task sets with system task sets that use rate monotonic analysis, such that the real-time deadlines of the system tasks are maintained.

The following assumptions are made about the system task sets prior to integration. Each processor supports a (possibly unique) task set S^i_P-SYSTEM. The network supports a single task set, S_N-SYSTEM. Each processor and network task set is schedulable by rate monotonic analysis. The system task sets utilize less than 100% of any individual resource, processor or network (otherwise no other tasks can be added to the system). The S^i_P-SYSTEM task set may contain output throttle tasks, identical to those used by RT-ADSD, that occupy the highest task priorities in the system. The remaining S^i_P-SYSTEM tasks are assigned deadline monotonic priorities.

If the combination of system and RT-ADSD task sets is still schedulable by Equation 2, no further analysis is required. If not, then one method of achieving a schedulable system is to increase the diagnosis latency of RT-ADSD. The following procedure scales the diagnosis latency of RT-ADSD until a schedulable system is achieved. The schedulability of the network is ensured in Step 2. In Step 5, the output throttle tasks of RT-ADSD are demonstrated not to interfere with normal system tasks. The schedulability of each node is established in Step 7.
1) Combine the Comm task set S_COMM with S_N-SYSTEM. Assign priorities using a deadline monotonic algorithm.

2) Check the schedulability of the combined network task set. If the network is not schedulable:
   Multiply all S_COMM task deadlines by a scale factor α_COMM > 1.
   Reassign priorities in a deadline monotonic fashion, if necessary.
   Let T'_p = max(α_COMM·D_TestRequest, α_COMM·D_TestResult, T_p).
   Check the schedulability of the network using the scaled deadlines and period T'_p.
   Increase α_COMM and repeat until the network is schedulable.
   Let S'_N-SYSTEM be the final communication task set and T'_p be the period at which it is schedulable.
3) If T'_p > T_p, increase all S^i_CPU task periods to T'_p. Label the resulting task sets S^i_P-SYSTEM.

Steps 4 through 7 are repeated for each node task set S^i_P-SYSTEM.

4) Combine the system task set at each node, S^i_P-SYSTEM, with the two output throttle tasks from S^i_CPU.
   Assign the highest system priorities to the output throttle tasks such that higher priority throttles handle higher priority messages.
   Assign priorities to the remaining tasks in deadline monotonic order. Call the new task set S'_P-SYSTEM.
5) Assume that each RT-ADSD output throttle task occurs only once (i.e., T_p = ∞).
   Check if S'_P-SYSTEM is schedulable. If it is not schedulable, STOP: RT-ADSD cannot be integrated with the system task set.
   Find the smallest period, T''_p, for the RT-ADSD throttle tasks such that the system is schedulable.

6) If T''_p < T'_p, set T''_p = T'_p. Combine all remaining RT-ADSD tasks into S'_P-SYSTEM.
   Set the period of all RT-ADSD tasks (including the throttle tasks) to T''_p.
   Leaving the throttle task priorities unchanged, assign priorities to the remaining tasks in deadline monotonic order. For tasks with multiple subtasks, order on the subtask with the smallest deadline. Call the resulting task set S''_P-SYSTEM.

7) Check the schedulability of S''_P-SYSTEM. If the system is not schedulable:
   Multiply the RT-ADSD throttle task deadlines by a scale factor α_CPU > 1.
   Reassign priorities in a deadline monotonic fashion, if necessary.
   Set T''_p = max(smallest T_p allowed by Equation 7, T''_p).
   Check the schedulability of the node task set using the scaled deadlines and period T''_p.
   Increase α_CPU and repeat until the task set is schedulable. Call the final system task set S''_P-SYSTEM.

After each individual node is scheduled, the global test period, T''_p, is chosen as the largest test period required by any individual node in the system and is used by all nodes. Since T''_p ≥ T'_p, the network remains schedulable. The final value for T_out is given by Equation 6. The final diagnosis latency is given by Equation 12.

This procedure does not produce an optimal diagnosis latency. It only gives a systematic procedure for integrating RT-ADSD with other task sets. A better approach would include RT-ADSD early in the design of a particular application. However, this is beyond the scope of this chapter.
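Step 2 of the procedure above is essentially a loop that inflates the RT-ADSD message deadlines until the merged network task set passes a rate monotonic schedulability test. The C sketch below only illustrates that control flow; is_network_schedulable() is a placeholder for whatever exact test (e.g., Equation 2 on the merged task set) the designer uses, and the 10% scaling step is an arbitrary choice of ours.

    #include <stdio.h>

    /* Placeholder for the exact schedulability test used in Step 2. */
    static int is_network_schedulable(double alpha_comm, double T_p)
    {
        (void)T_p;
        return alpha_comm >= 1.5;   /* stand-in answer for the sketch */
    }

    int main(void)
    {
        double D_TestRequest = 10.5, D_TestResult = 35.0, T_p = 1389.4;
        double alpha_comm = 1.0;

        while (!is_network_schedulable(alpha_comm, T_p)) {
            alpha_comm *= 1.10;                      /* inflate RT-ADSD deadlines */
            double Tq = alpha_comm * D_TestRequest;  /* scaled deadlines          */
            double Tr = alpha_comm * D_TestResult;
            double candidate = Tq > Tr ? Tq : Tr;
            if (candidate > T_p)
                T_p = candidate;                     /* T'_p per Step 2           */
        }
        printf("alpha_COMM = %.2f, T'_p = %.1f\n", alpha_comm, T_p);
        return 0;
    }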
1.1.5 Example

An example of the specification procedure is performed with design parameters of N = 4 nodes and assuming a maximum of F_s = 1 faulty node. Table 1.1.1 lists the execution time and the length of the longest critical section, C_i and S_i respectively, for each task. The execution time for each task is estimated in microseconds, calculated from the processor speed and the number of lines of source code. Both C_i and S_i are functions of F_s and N. In the table, n_i represents the number of times task τ_i occurs in the processor or network task sets [Equations 4, 5]. Packet transmission times, C_i, are also given in microseconds and reflect the relative size of the packets, including network overhead. The weighting factors, w_i, are taken from Equations 12 and 16 for node and network tasks, respectively, when N = 4.
Task Set    Task                     n_i    C_i    S_i    w_i
S^i_CPU     TestNodes                 -     175     -     16
            TestNodesBegin            1     125     0      4
            TestNodesMiddle           1      15     0      4
            TestNodesEnd              1      35    25      8
            ReceiveRequest            2      60    14     13
            RequestOutputThrottle     2      10     0     10
            ReplyOutputThrottle       2      10     0     13
            Diagnose                  1      32    18      1
S_COMM      TestRequest (Q)           7     1.5    n/a    10
            TestResult (R)            7     3.5    n/a    13

Table 1.1.1: RT-ADSD task sets for N = 4, F_s = 1.
Network tasks are scheduled first. From Equations 17 and 18, C_Q/w_Q = 1.05 < C_R/w_R = 1.88. Hence, to minimize latency, P_TestRequest > P_TestResult, and the deadlines for the network tasks are D_TestRequest = 10.5 and D_TestResult = 35 [Equations 19, 20]. Next, the node task set is scheduled. The two highest processor priorities are reserved for the RequestOutputThrottle and ReplyOutputThrottle tasks, with the RequestOutputThrottle receiving the highest priority. The remaining priorities are assigned by sorting on (n_i·C_i)/w_i, producing the following priority assignment:
"ReceiveRequest ^ *^TestNodes ^ "Diagnose
The blocking terms are computed for each task given the above priority ordering [Equation 3] and deadlines are assigned [Equations 21, 23]. The node task set at this point is summarized in Table 1.1.2. The period of all tasks is equal to the test period, 1389.4 microseconds [Equation 7]. The timeout for test responses is 340.5 microseconds [Equation 6].
Task Set    Task and Subtasks         P_i    n_i    C_i     D_i      T_i      S_i    B_i
S^i_CPU     RequestOutputThrottle      1      2      10      45     1389.4     0     25
            ReplyOutputThrottle        2      2      10      65     1389.4     0     25
            ReceiveRequest             3      2      60     185     1389.4    14     25
            TestNodes                  4      -       -       -        -       -      -
              TestNodesBegin           -      1     125     303     1389.4     0     18
              TestNodesMiddle          -      1      15     193     1389.4     0     18
              TestNodesEnd             -      1      35     213     1389.4    25     18
            Diagnose                   5      1      32     367     1389.4    18      0
S_COMM      TestRequest (Q)            1      7     1.5    10.5     1389.4    n/a    n/a
            TestResult (R)             2      7     3.5    35.0     1389.4    n/a    n/a

Table 1.1.2: Fully specified RT-ADSD task sets for N = 4, F_s = 1.
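As a cross-check on Equation 12, the deadlines of Table 1.1.2 can be plugged in directly; the C sketch below does so for N = 4. It yields about 8,315, slightly below the 8,320 reported in the text, a difference we attribute to rounding in the tabulated deadlines.

    #include <stdio.h>

    /* Equation 12 evaluated with the deadlines of Table 1.1.2 (N = 4). */
    int main(void)
    {
        int N = 4;
        double D_ReceiveRequest = 185, D_ReplyOutputThrottle = 65,
               D_TestResult = 35, D_RequestOutputThrottle = 45,
               D_TestRequest = 10.5, D_TestNodesEnd = 213,
               D_TestNodesBegin = 303, D_TestNodesMiddle = 193,
               D_Diagnose = 367;

        double latency =
            (3*N + 1) * (D_ReceiveRequest + D_ReplyOutputThrottle + D_TestResult)
          + (2*N + 2) * (D_RequestOutputThrottle + D_TestRequest)
          + (2*N)     *  D_TestNodesEnd
          +  N        * (D_TestNodesBegin + D_TestNodesMiddle)
          +              D_Diagnose;

        printf("Diagnosis latency = %.1f microseconds\n", latency);
        return 0;
    }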
Finally, the reordering heuristic developed in Section 4.5.2 is employed to improve diagnosis latency. The priorities of adjacent (non-throttle) tasks are exchanged and the effect on the end-to-end delay is computed. Consider the exchange of P_TestNodes and P_Diagnose. With this exchange, the blocking term of all three TestNodes subtasks reduces to zero, while the blocking term for τ_Diagnose increases to 25. From Equations 26 and 27, Δ_TestNodes = 272 and Δ_Diagnose = −150.

Δ_Latency = Δ_TestNodes + Δ_Diagnose = +122
(EQ 31)
Hence, the exchange of τ_TestNodes and τ_Diagnose does not result in a lower diagnosis latency. Repeating the same calculation with all other adjacent task pairs shows that no exchange produces a lower latency. The values given in Table 1.1.2 represent the final CPU and Comm task sets. The final diagnosis latency achieved for this system is 8,320 [Equation 12].

1.1.6 Conclusion

This chapter demonstrates one approach to the difficult problem of real-time distributed system design. Rate monotonic analysis provides the basis for the scheduling model. A programming framework is developed within that model that allows the construction of complex distributed algorithms. The framework is expected to be general enough to implement a variety of distributed applications.

The application program presented here, RT-ADSD, solves the fault-diagnosis problem for nodes of fully connected networks. The techniques used for specifying RT-ADSD are expected to prove useful in specifying other non-real-time algorithms within a real-time environment. An important step in the conversion process is the application of bounds on the algorithm's execution and the use of sporadic service to enforce those bounds, rather than utilizing the worst-case execution time. In this way, the scheduled execution behavior more closely resembles the expected execution behavior. By an intelligent choice of bounds, in our case the fault limit, the algorithm's execution behavior was made tractable and still provided useful results.

Further research is needed in the area of distributed real-time systems. Improvements to the programming model could be obtained by the use of a hardware-based network sporadic server, thereby eliminating the overhead required for sporadic throttle tasks. Distributed algorithms would also benefit from the development of better scheduling models. The model presented here uses a pessimistic assumption that all tasks at a node or link (except those belonging to the same physical thread) arrive independently of each other. The timelines in Section 4.3 show that several tasks arrive upon the completion of other tasks. The independence assumption produces longer deadlines and reduced utilization by assuming tasks can compete for a resource when they may not be able to do so in the implementation.

Harbour, Klein and Lehoczky (HKL) presented a model for analyzing the schedulability of fixed priority tasks [6] that may prove useful in such cases. Each task is allowed to vary its priority in a deterministic way throughout its execution, allowing
sequences of RT-ADSD tasks that utilize the same resource (i.e., TestNodesBegin followed by RequestOutputThrottle) to be modeled as a single task. This more closely resembles the actual execution behavior of RT-ADSD, producing shorter deadlines for each compound task. However, the HKL model does not account for the effects of idle time, nor does it support sporadic service, both of which are required to schedule RT-ADSD. HKL captures sequential dependencies between tasks on the same node. A better model would incorporate tasks composed of subtasks that occur sequentially across several different resources; this is an open research issue.
References

[1] Bianchini, R. P., and Buskens, R. "An Adaptive Distributed System-Level Diagnosis Algorithm and its Implementation." Proceedings of the IEEE 23rd International Symposium on Fault-Tolerant Computing, June 1991, pp. 222-229.
[2] Ezhilchelvan, P. D., and de Lemos, R. "A Robust Group Membership Algorithm for Distributed Real-Time Systems." Proceedings of the IEEE Real-Time Systems Symposium, December 1990, pp. 173-179.
[3] Liu, C. L., and Layland, J. W. "Scheduling Algorithms for Multi-Programming in a Hard Real-Time Environment." Journal of the Association for Computing Machinery, 20 (1), January 1973, pp. 46-61.
[4] Lehoczky, J. P., Sha, L., and Ding, Y. "The Rate-Monotonic Scheduling Algorithm: Exact Characterization and Average Case Behavior." Proceedings of the IEEE Real-Time Systems Symposium, 1989, pp. 166-171.
[5] Lehoczky, J. P. "Fixed Priority Scheduling of Periodic Task Sets with Arbitrary Deadlines." Proceedings of the IEEE Real-Time Systems Symposium, 1990, pp. 201-209.
[6] Harbour, M. G., Klein, M. H., and Lehoczky, J. P. "Fixed Priority Scheduling of Periodic Tasks with Varying Execution Priority." Proceedings of the IEEE Real-Time Systems Symposium, 1991.
[7] Preparata, F. P., Metze, G., and Chien, R. T. "On the Connection Assignment Problem of Diagnosable Systems." IEEE Transactions on Electronic Computers, EC-16 (12), December 1967, pp. 848-854.
[8] Hakimi, S. L., and Amin, A. T. "Characterization of Connection Assignment of Diagnosable Systems." IEEE Transactions on Computers, C-23 (1), January 1974, pp. 86-88.
[9] Dahbura, A. T. "System-Level Diagnosis: A Perspective for the Third Decade." Concurrent Computation: Algorithms, Architectures, Technologies, Plenum Publishing Corp., 1988, pp. 411-434.
[10] Hakimi, S. L., and Schmeichel, E. F. "An Adaptive Algorithm for System Level Diagnosis." Journal of Algorithms, 5, June 1984, pp. 526-530.
[11] Hosseini, S. H., Kuhl, J. G., and Reddy, S. M. "A Diagnosis Algorithm for Distributed Computing Systems with Dynamic Failure and Repair." IEEE Transactions on Computers, C-33 (3), March 1984, pp. 223-233.
[12] Bondy, A., and Murty, U. S. R. Graph Theory with Applications, Elsevier North Holland, Inc., New York, N.Y., 1976.
[13] Sha, L., Rajkumar, R., and Lehoczky, J. P. "Priority Inheritance Protocols: An Approach to Real-Time Synchronization." IEEE Transactions on Computers, September 1990.
[14] Sprunt, B., Sha, L., and Lehoczky, J. P. "Aperiodic Task Scheduling for Hard Real-Time Systems." The Journal of Real-Time Systems, 1, 1989, pp. 27-60.
[15] Klein, M. H., et al. A Practitioner's Handbook for Real-Time Analysis: Guide to Rate Monotonic Analysis for Real-Time Systems. Kluwer Academic Publishers, Norwell, MA, 1993.
[16] "Real-Time Communications Network Operating System. RTCN-OS User's Guide." XXXX-PX2-RTCN edition, IBM Systems Integration Division, Manassas, VA, 1989.
[17] Sha, L., and Goodenough, J. B. "Real-Time Scheduling Theory and Ada." IEEE Computer, 23 (4), April 1990, pp. 53-62.
[18] Leung, J., and Whitehead, J. "On the Complexity of Fixed-Priority Scheduling of Periodic Real-Time Tasks." Performance Evaluation, 2, 1982, pp. 237-250.
[19] Klein, M. H., and Ralya, T. "An Analysis of Input/Output Paradigms for Real-Time Systems." Tech. Report CMU/SEI-90-TR-19, Software Engineering Institute, July 1990.
[20] Rajkumar, R., Sha, L., and Lehoczky, J. P. "Real-Time Synchronization Protocols for Multiprocessors." Proceedings of the IEEE Real-Time Systems Symposium, December 1988, pp. 259-269.
[21] Golestani, S. J. "Congestion-Free Transmission of Real-Time Traffic in Packet Networks." Proceedings of IEEE INFOCOM '90, June 1990, pp. 527-536.
[22] Cristian, F. "Understanding Fault-Tolerant Distributed Systems." Communications of the ACM, 34 (2), February 1991.
[23] Smith, W. E. "Various Optimizers for Single Stage Production." Naval Research Logistics Quarterly, 3, 1956, pp. 59-66.
SECTION 1.2
Refinement for Fault-Tolerance: An Aircraft Hand-off Protocol

Keith Marzullo, Fred B. Schneider, and Jon Dehn

Abstract

Part of the Advanced Automation System (AAS) for air-traffic control is a protocol to permit flight hand-off from one air-traffic controller to another. The protocol must be fault-tolerant and, therefore, is subtle; it is an ideal candidate for the application of formal methods. This paper describes a formal method for deriving fault-tolerant protocols that is based on refinement and proof outlines. The AAS hand-off protocol was actually derived using this method; that derivation is given.
1.2.1 Introduction The next-generation air traffic control system for the United States is currently being built under contract to the U.S. government by the IBM Federal Systems Company (recently acquired by Loral Corp.). Advanced Automation System (AAS) [1] is a large distributed system that must function correctly, even if hardware components fail. Design errors in AAS software are avoided and eliminated by a host of methods. This paper discusses one of them—the formal derivation of a protocol from its specification—and how it was applied in the AAS protocol for transferring authority to control a flight from one air-traffic controller to another. The flight hand-off protocol we describe is the one actually used in the production AAS system (although the protocol there is programmed in Ada). And, the derivation we give is a description of how the protocol actually was first obtained. The formal methods we use are not particularly esoteric nor sophisticated. The specification of the problem is simple, as is the characterization of hardware failures that it must tolerate. Because the hand-off protocol is short, computer-aided support was not necessary for the derivation. Deriving more complex protocols would certainly benefit from access to a theorem proven ^Department of Computer Science, University of California San Diego, La Jolla, CA 92093. This author is supported in part by the Defense Advanced Research Projects Agency under NASA Ames grant number NAG 2-593, Contract N00140-87-C-8904 and by AFOSR grant number F496209310242. The views, opinions, and findings contained in this report are those of the author and should not be construed as an official Department of Defense position, policy, or decision. ^Department of Computer Science, Cornell University, Ithaca, NY 14853. This author is supported in part by the Office of Naval Research under contract NOOO14-91 -J-1219, AFOSR under proposal 93NM312, the National Science Foundation under Grant CCR-8701103, and DARPA/NSF Grant CCR-9014363. Any opinions, findings, and conclusions or recommendations expressed in this publication are those of the author and do not reflect the views of these agencies. ^Loral Federal Systems, 9231 Corporate Blvd., Rockville, MD 20850.
40 We proceed as follows. The next section gives a specification of the problem and the assumptions being made about the system. Section 1.2.3 describes the formal method we used. Finally, Section 1.2.4 contains our derivation of the hand-off protocol.
1.2.2 Specification and System Model

The air-traffic controller in charge of a flight at any time is determined by the location of the flight at that time. However, the hand-off of the flight from one controller to another is not automatic: some controller must issue a command requesting that the ownership of a flight be transferred from its current owner to a new controller. This message is sent to a process that is executing on behalf of the new controller. It is this process that starts the execution of the hand-off itself. The hand-off protocol has the following requirements:

P1: No two controllers own the same flight at the same time.
P2: The interval during which no controller owns a flight is brief (approximately one second).
P3: A controller that does not own a flight knows which controller does own that flight.

The hand-off protocol is implemented on top of AAS system software that implements several strong properties about message delivery and execution time [1]. For our purposes, we simplify the system model somewhat and mention only those properties needed by our hand-off protocol. The system is structured as a set of processes running on a collection of processors interconnected with redundant networks. The services provided by AAS system software include a point-to-point FIFO interprocess communication facility and a name service that allows for location-independent interprocess communication.

AAS also supports the notion of a resilient process s comprising a primary process s.p and a backup process s.b. The primary sends messages to the backup so that the backup's state stays consistent with the primary's. This allows the backup to take over if the primary fails. A resilient process is used to implement the services needed by an air-traffic controller, including screen management, display of radar information, and processing of flight information. We denote the primary process for a controller C as C.p and its backup process as C.b. If C is the owner of a flight f, then C.p can execute commands and send messages that affect the status of flight f; C.b, like all backup processes in AAS, only receives and records information from C.p in order to take over if C.p fails.

AAS implements a simple failure model for processes [3]:

S1: Processes can fail by crashing. A crashed process simply stops executing without otherwise taking any erroneous action.
S2: If a primary process crashes, then its backup process detects this and begins executing a user-specified routine.

Property S2 is implemented by having a failure detector service. This service monitors each process and, upon detecting a failure, notifies any interested process. If the hand-off protocol runs only for a brief interval of time, then it is safe to assume that no more than a single failure will occur during execution. So, we assume:

S3: In any execution of the hand-off protocol, at most one of the participating processes can crash.

S4: Messages in transit can be lost if the sender or receiver of the message crashes. Otherwise, messages are reliably delivered, without corruption, and in a timely fashion. No spurious messages are generated.

We also can assume that messages are not lost due to failure of network components such as controllers and repeaters. This is a reasonable assumption because the processors of AAS are interconnected with redundant networks and it is assumed that no more than one of the networks will fail. In any long-running system in which processes can fail, there must be a mechanism for restarting processes and reintegrating them into the system. We ignore such issues here because that functionality is provided by AAS system software. Instead, we assume that at the beginning of a hand-off from A to B, all four processes A.p, A.b, B.p, B.b are operational.
1.2.3 Fault-tolerance and Refinement

A protocol is a program that runs on a collection of one or more processors. We indicate that S is executed on processor p by writing:

⟨S⟩ at p
(1.2.1)
Execution of (1.2.1) is the same as skip if p has failed and otherwise is the same as executing S as a single, indivisible action. This is exactly the behavior one would expect when trying to execute an atomic action S on a fail-stop processor. Sequential composition is indicated by juxtaposition.

⟨S1⟩ at p1
⟨S2⟩ at p2
(1.2.2)
This statement is executed by first executing ⟨S1⟩ at p1 and then executing ⟨S2⟩ at p2. Notice that execution of ⟨S2⟩ at p2 cannot assume that S1 has actually been performed. If p1 fails before execution of ⟨S1⟩ at p1 completes, then the execution of ⟨S1⟩ at p1 is equivalent to skip. Second, observe that an actual implementation of (1.2.2) when p1
and p2 are different will require some form of message-exchange in order to enforce the sequencing. Finally, parallel composition is specified by:

cobegin ⟨S1⟩ at p1 || ⟨S2⟩ at p2 || ... || ⟨Sn⟩ at pn coend
(1.2.3)
This statement completes when each component ⟨Si⟩ at pi has completed. Since some of these components may have been assigned to processors that fail, all that can be said when (1.2.3) completes is that a subset of the Si have been performed. If, however, we also know the maximum number t of failures that can occur while (1.2.3) executes, then at least n − t of the Si will be performed.
Proof Outlines

We use proof outlines to reason about execution of a protocol. A proof outline is a program that has been annotated with assertions, each of which is enclosed in braces. A precondition appears before each atomic action, and a postcondition appears after each atomic action. Assertions are Boolean formulas involving the program variables. Here is an example of a proof outline.

{x = 0 ∧ y = 0}
X1: x := x + 1
{x = 1 ∧ y = 0}
X2: y := y + 1
{x = 1 ∧ y = 1}

In this example, x = 0 ∧ y = 0, x = 1 ∧ y = 0, and x = 1 ∧ y = 1 are assertions. Assertion x = 0 ∧ y = 0 is the precondition of X1, denoted pre(X1), and assertion x = 1 ∧ y = 0 is the postcondition of X1, denoted post(X1). The postcondition of X1 is also the precondition of X2. A proof outline is valid if its assertions are an accurate characterization of the program state as execution proceeds. More precisely, a proof outline is valid if the proof outline invariant

∧_S ((at(S) ⇒ pre(S)) ∧ (after(S) ⇒ post(S)))
is not invalidated by execution of the program, where at(S) is a predicate that is true when the program counter is at statement S, and after(S) is a predicate that is true when the program counter is just after statement S. The proof outline above is valid. For example, execution starting in a state where x = 1 ∧ y = 0 ∧ after(X1) is true satisfies the proof outline invariant and, as execution proceeds, the invariant remains true. Notice, our definition of validity allows execution
to begin anywhere, even in the middle of the program. Changing post(X1) (and pre(X2)) to x = 1 destroys the validity of the above proof outline. (Start execution in state x = 1 ∧ y = 23 ∧ after(X1). The proof outline invariant will hold initially but is invalidated by execution of X2.) A simple set of (syntactic) rules can be used to derive valid proof outlines. The first such programming logic was proposed in [2]. The logic that we use is a variant of that one, extended for concurrent programs [4]. Additional extensions are needed to derive a proof outline involving statements like (1.2.1). Here is a rule for (1.2.1); it uses the predicate up(p) to assert that processor p has not failed.

Action at Processor:
    up(p) not free in A,  up(p) not free in B,  {A} S {B}
    -----------------------------------------------------
    {A} ⟨S⟩ at p {(A ∨ B) ∧ (up(p) ⇒ B)}
i^upip) ^ A)}
(1.2.4)
should be valid if {A} S {B} is. Proof outHne (1.2.4), however, is not valid. Consider an execution that starts in a state satisfying A and suppose p has not crashed. According to the rule's hypothesis, execution of 5 would produce a state satisfying B. If process p then crashed, the state would satisfy ->up{p) A B. Unless B implies A, the postcondition of (1.2.4) no longer holds. The problem with (1.2.4) is that the proof outline invariant is invalidated by a processor failure. The predicate up{p) changing value from true to false causes the proof outline invariant to be falsified. We define a proof outline to be fault-invariant with respect to a class of failures if the proof outline invariant is not falsified by the occurrence of any allowable subset of those failures. For the hand-off protocol, we are concerned with tolerating a single processor failure. We, therefore, are concerned with proof outlines whose proof outline invariants are not falsified when up{p) becomes false for a single processor (provided up{p) is initially true for all processors). Checking that a proof outline is fault-invariant for this class of failures is simple: Fault-Invariance: For each assertion A: (A A f\up{p)) => f\A[up{p')-false] p
p'
where L[x\= e] stands for L with every free occurrence of x replaced by e.
44
1.2A Derivation of the Hand-off Protocol Let CTR{f) be the set of controllers that own flight / . Property PI can then be restated as PV: \CTR{f)\ < 1. Desired is a protocol Xfer(A, B) satisfying {AeCTR{f)APV} Xfev{A,B) {BeCTR{f) APV} such that PV holds throughout the execution of Xfer(A, B). A simple implementation of this protocol would be to use a single variable ctr{f) that contains the identity of the controller of flight / and to change ctr{f) with an assignment statement: {Aectr{f) APV} ctrif):= (ctrif) - {A})U {B} {Bectr{f) API'} This implementation is problematic because the variable ctr{f) must reside at some site. Not only does this lead to a possible performance problem, but it makes determining the owner of / dependent on the availability of this site. Therefore, we represent CTR{f) with a Boolean variable C.ctr{f) at each site C, where CTRif) :
{C\C.ctrif)}.
By doing so, we now require at least two separate actions in order to implement Xfer(i4,B)—one action that changes A.ctr{f) and one action that changes B.ctr(f). Using the Action at Processor Rule, we get: XI :
{A G CTR{f) A PV} {A.ctr{f):= false)atA {{up{A) ^ {{A 0 CTR{f)) A (CTRif) = 0))) A PV}
X2:
{CTRif) = 0} {B.ctrif):= true) atB {iupiB)^iBeCTRif)))APV)}
Note ihaipreiXl) must assert that CTRif) = 0 holds, since otherwise execution of X2 invalidates PI'. The preconditions of XI and X2 are mutually inconsistent, so these statements cannot be executed in parallel. Moreover, X2 cannot be run first because pre{X2),
45 CTR(f) = 0, does not hold in the initial state. Thus, X2 must execute after XL Unfortunately, post{Xl) does not imply pre{X2); if up{A) does not hold, then we cannot assert that CTR{f) = 0 is true. This should not be surprising: if A fails, then it cannot relinquish ownership. One solution for this availability problem is to employ a resilient process. That is, each controller C will have a primary process C.p and a backup process C.b executed on processors that fail independently. Each process has its own copy of C.ctr{f), and these copies will be used to represent C.ctr{f) in a manner that tolerates a single processor failure: r rtr(f\ . / ^-P-^^^(f) o . c f r u ; . | c.b.ctrif)
ifwpCC'-P) if-^up{C.p)
Since we assume that there is at most one failure during execution of the protocol, the above definition never references the variable of a failed process. Replacing references to processor "A" in Statement X1 with "A.p" produces the following:

    X1a:  {A ∈ CTR(f) ∧ P1'}
          ⟨A.p.ctr(f) := false⟩ at A.p
          {(up(A.p) ⇒ ((A ∉ CTR(f)) ∧ (CTR(f) = ∅))) ∧ P1'}
This proof outline is not fault-invariant, however. If A.p were to fail when the precondition holds, then the precondition might not continue to hold. In particular, if A ∈ CTR(f) holds because A.p.ctr(f) is true and A.b.ctr(f) happens to be false, then when A.p fails, A ∈ CTR(f) would not hold. We need to assert that A.p.ctr(f) = A.b.ctr(f) also holds whenever pre(X1a) does. We express this condition using the following definition:

    Pr:  (up(A.p) ∧ up(A.b)) ⇒ (A.b.ctr(f) = A.p.ctr(f))

Note that if one of A.p or A.b has failed then A.p.ctr(f) and A.b.ctr(f) need not be equal. Adding Pr to pre(X1a) gives the following proof outline, which is fault-invariant for a single failure:

    X1a:  {A ∈ CTR(f) ∧ P1' ∧ Pr}
          ⟨A.p.ctr(f) := false⟩ at A.p
          {(up(A.p) ⇒ ((A ∉ CTR(f)) ∧ (CTR(f) = ∅))) ∧ P1'}
We need more than just X1a to implement X1, however. X1a does not re-establish Pr, which must hold for subsequent ownership transfers. This suggests that A.b.ctr(f) also be updated. Another problem with X1a is that post(X1a) still does not imply pre(X2): if up(A.p) does not hold, then CTR(f) = ∅ need not hold.
An action whose postcondition implies

    up(A.b) ⇒ (¬A.b.ctr(f) ∧ (¬up(A.p) ⇒ (CTR(f) = ∅)))

suffices. By our assumption, up(A.p) ∨ up(A.b) holds, so this postcondition and post(X1a) will together allow us to conclude that CTR(f) = ∅ holds, thereby establishing pre(X2). Here is an action that, when executed in a state satisfying pre(X1a), terminates with the above assertion holding:

    X1b:  {(A ∈ CTR(f)) ∧ P1' ∧ Pr}
          ⟨A.b.ctr(f) := false⟩ at A.b
          {up(A.b) ⇒ (¬A.b.ctr(f) ∧ (¬up(A.p) ⇒ (CTR(f) = ∅)))}
One might think that since X1a and X1b have the same preconditions they could be run in parallel, and the design of the first half of the protocol would be complete. Unfortunately, we are not done yet. The original protocol specification implicitly restricted permissible ownership transitions. Written as a regular expression, the allowed sequence of states is:

    (CTR(f) = {A})+  (CTR(f) = ∅)*  (CTR(f) = {B})+                   (1.2.5)

That is, first A owns the flight, then no controller owns the flight for zero or more states, and finally B owns the flight. The proof outline above does not tell us anything about transitions; it only tells that P1' holds throughout (because P1' is implied by all assertions). We must strengthen the proof outline to deduce that only correct transitions occur.

A regular expression (like the one above) can be represented by a finite state machine that accepts all sentences described by the regular expression. Furthermore, a finite state machine is characterized by a next-state transition function. The following next-state transition function δAB characterizes the finite state machine for (1.2.5):

    δAB =  { {A}, ∅, {B} }   if A ∈ CTR(f)
           { ∅, {B} }        if CTR(f) = ∅
           { {B} }           if B ∈ CTR(f)

The value of δAB is the set of values of CTR(f) that are next allowed for the protocol. For example, when CTR(f) = ∅ holds, δAB says that a transition to a state in which CTR(f) = ∅ holds or to a state in which CTR(f) = {B} holds are the only permissible transitions. Note that since P1' holds, we know that the three cases A ∈ CTR(f), CTR(f) = ∅, B ∈ CTR(f) are mutually exclusive, so δAB always has a unique value. We further define ⊖δAB to be the value of δAB in the previous state during the execution of the hand-off protocol, or { {A}, ∅, {B} } if there is no previous state. Our
hand-off protocol will make only permissible state transitions provided each assertion implies that CTR(f) ∈ ⊖δAB; that is, provided the current owner of f is one of the owners that was acceptable as the "next owner" in the previous state of the system. We therefore add the conjunct CTR(f) ∈ ⊖δAB to the assertions in the proof outline and check to see if the stronger proof outline is valid. If it is valid, then we can move on to implementing X2, the second part of Xfer(A, B); otherwise, we will have reason to make further modifications. Here is the (strengthened) proof outline with X1a and X1b running in parallel:

    {CTR(f) ∈ ⊖δAB ∧ A ∈ CTR(f) ∧ P1' ∧ Pr}
    cobegin
          {CTR(f) ∈ ⊖δAB ∧ A ∈ CTR(f) ∧ P1' ∧ Pr}
    X1a:  ⟨A.p.ctr(f) := false⟩ at A.p
          {CTR(f) ∈ ⊖δAB ∧ (up(A.p) ⇒ (A ∉ CTR(f) ∧ (CTR(f) = ∅))) ∧ P1'}
    ||
          {CTR(f) ∈ ⊖δAB ∧ (A ∈ CTR(f)) ∧ P1' ∧ Pr}
    X1b:  ⟨A.b.ctr(f) := false⟩ at A.b
          {CTR(f) ∈ ⊖δAB ∧ (up(A.b) ⇒ (¬A.b.ctr(f) ∧ (¬up(A.p) ⇒ (CTR(f) = ∅))))}
    coend
    {CTR(f) ∈ ⊖δAB ∧
     (up(A.p) ⇒ (A ∉ CTR(f) ∧ (CTR(f) = ∅))) ∧
     (up(A.b) ⇒ (¬A.b.ctr(f) ∧ (¬up(A.p) ⇒ (CTR(f) = ∅)))) ∧ P1' ∧ Pr}

Unfortunately, this proof outline is not fault-invariant. If A.p fails in a state satisfying after(X1a) ∧ at(X1b) then the following holds before the failure:

    after(X1a) ∧ at(X1b) ∧ up(A.p) ∧ up(A.b) ∧ (CTR(f) = ∅)
      ∧ ¬A.p.ctr(f) ∧ A.b.ctr(f) ∧ ⊖δAB = {∅, {B}}

After the failure, we have:

    after(X1a) ∧ at(X1b) ∧ ¬up(A.p) ∧ up(A.b) ∧ (A ∈ CTR(f))
      ∧ ¬A.p.ctr(f) ∧ A.b.ctr(f) ∧ ⊖δAB = {∅, {B}}

So, CTR(f) ∈ ⊖δAB does not hold after the failure, and the first conjunct of post(X1a) is invalidated. One simple solution is to preclude states where at(X1b) ∧ after(X1a)
holds. This can be done by running the two actions in sequence—first X1b and then X1a. The result is described by the following proof outline:

          {CTR(f) ∈ ⊖δAB ∧ P1' ∧ Pr ∧ A ∈ CTR(f)}
    X1b:  ⟨A.b.ctr(f) := false⟩ at A.b
          {CTR(f) ∈ ⊖δAB ∧ P1' ∧ A.p.ctr(f) ∧ (up(A.b) ⇒ ¬A.b.ctr(f))}
    X1a:  ⟨A.p.ctr(f) := false⟩ at A.p
          {CTR(f) ∈ ⊖δAB ∧ P1' ∧ (up(A.b) ⇒ ¬A.b.ctr(f)) ∧ (up(A.p) ⇒ ¬A.p.ctr(f)) ∧ Pr}
          therefore, according to the definitions of CTR(f) and A.ctr(f),
          {CTR(f) ∈ ⊖δAB ∧ A ∉ CTR(f) ∧ P1' ∧ Pr}

What we really want to conclude in post(X1a), however, is CTR(f) = ∅—not just A ∉ CTR(f). This is easily done by strengthening the above proof outline with the following:

    POnly(A):  For all controllers C, C ≠ A:  C ∉ CTR(f)

POnly(A) is initially true because A ∈ CTR(f) ∧ P1' holds. It is not invalidated by any assignment, because the only variables assigned to are those of A.p and A.b. So, POnly(A) remains true throughout the execution of X1. The derivation of a protocol for X2 is basically the same, except that A is replaced by B and false is interchanged with true. Doing so results in the proof outline shown in Figure 1.2.1.
1.2.4.1 Implementing P3

Property P3 of Section 1.2.2 is satisfied by the protocol in Figure 1.2.1 as long as there are exactly two controllers. When there are more than two controllers, a controller must query other controllers in order to determine which owns a flight. Doing so is inefficient, so we instead consider having each controller C maintain a variable C.ctrID(f) that names the owner of flight f. As with C.ctr(f), we represent the value of C.ctrID(f) in a manner that tolerates a single site failure:

    C.ctrID(f) =  C.p.ctrID(f)   if up(C.p)
                  C.b.ctrID(f)   if ¬up(C.p)

This variable can be used to implement the Boolean C.ctr(f) by defining:

    C.ctr(f) = (C.ctrID(f) = C)

Thus, the assignment "C.ctr(f) := true" would be replaced by "C.ctrID(f) := C", and "C.ctr(f) := false" would be replaced by "C.ctrID(f) := X" for any value X ≠ C. We can rewrite P3 as the following:
          {CTR(f) ∈ ⊖δAB ∧ P1' ∧ Pr ∧ POnly(A) ∧ A ∈ CTR(f)}
    X1b:  ⟨A.b.ctr(f) := false⟩ at A.b
          {CTR(f) ∈ ⊖δAB ∧ P1' ∧ POnly(A) ∧ A.p.ctr(f) ∧ B ∉ CTR(f) ∧ (up(A.b) ⇒ ¬A.b.ctr(f))}
    X1a:  ⟨A.p.ctr(f) := false⟩ at A.p
          {CTR(f) ∈ ⊖δAB ∧ P1' ∧ Pr ∧ POnly(A) ∧ B ∉ CTR(f)
            ∧ (up(A.b) ⇒ ¬A.b.ctr(f)) ∧ (up(A.p) ⇒ ¬A.p.ctr(f))}
          therefore, according to the definitions of CTR(f), A.ctr(f), and POnly(B),
          {CTR(f) ∈ ⊖δAB ∧ P1' ∧ Pr ∧ POnly(B) ∧ CTR(f) = ∅}
    X2b:  ⟨B.b.ctr(f) := true⟩ at B.b
          {CTR(f) ∈ ⊖δAB ∧ P1' ∧ POnly(B) ∧ ¬B.p.ctr(f) ∧ (up(B.b) ⇒ B.b.ctr(f))}
    X2a:  ⟨B.p.ctr(f) := true⟩ at B.p
          {CTR(f) ∈ ⊖δAB ∧ P1' ∧ Pr ∧ POnly(B) ∧ (up(B.b) ⇒ B.b.ctr(f)) ∧ (up(B.p) ⇒ B.p.ctr(f))}
          therefore, according to the definitions of CTR(f) and B.ctr(f),
          {{B} ∈ ⊖δAB ∧ P1' ∧ Pr ∧ POnly(B) ∧ B ∈ CTR(f)}

Figure 1.2.1: Hand-off Protocol for A and B
    Z(A, B):
        {true}
        cobegin
        Zp:  ||C : C ∉ {A, B}:
                 {true}
                 ⟨C.p.ctrID(f) := B⟩ at C.p
                 {up(C.p) ⇒ (C.p.ctrID(f) = B)}
        ||
        Zb:  ||C : C ∉ {A, B}:
                 {true}
                 ⟨C.b.ctrID(f) := B⟩ at C.b
                 {up(C.b) ⇒ (C.b.ctrID(f) = B)}
        coend
        {up(C.p) ⇒ (C.p.ctrID(f) = B) ∧ up(C.b) ⇒ (C.b.ctrID(f) = B)}
        therefore, according to the definition of C.ctrID(f),
        {(∀C ∉ {A, B} : C.ctrID(f) = B)}
Figure 1.2.2: Hand-off Protocol for Controllers other than A and B

    P3':  (∃C : C.ctrID(f) = C)  ⇒  (∃C : (C.ctrID(f) = C) ∧ (∀C' : C'.ctrID(f) = C)).

For the protocol of Figure 1.2.1, P3' holds when at(X2b) is true because the antecedent is false. Furthermore, if we explicitly use A.ctrID(f) := B as the assignments X1b and X1a, then P3' holds throughout the execution, provided C ranges over the set {A, B}. For the other controllers, additional statements are needed, shown in Figure 1.2.2. Since Z(A, B) in Figure 1.2.2 changes the values of C.ctrID(f), it should be executed when CTR(f) = ∅ holds, because otherwise its execution may violate P3'. Thus, Z(A, B) would have to be started no earlier than after(X1a) and terminate by at(X2a). Unfortunately, Z(A, B) may take a significant amount of time—even though its component statements can be executed in parallel, the time to execute Z(A, B) will include some communication and synchronization overhead. This extra time could make satisfying P2 hard or impossible.

Property P3' is perhaps a bit too strong. In fact, all that is really required is that a controller be able to communicate with the process that owns a flight. For example, C.ctrID(f) could be the start of a path of controllers, terminating with the current owner. The scheme where C.ctrID(f) indicates the current owner is equivalent to requiring that this path have a length of 1. But, longer paths are also acceptable. Let C → C' denote that C.ctrID(f) = C', and let C →* C' denote the transitive closure of →. Using this notation, P3' can be expressed as:
    P3':  (∃C : C → C)  ⇒  (∃C : C → C ∧ (∀C' : C' → C)).

We weaken P3' as follows:

    P3'':  (∃C : C → C)  ⇒  (∃C : C → C ∧ (∀C' : C' →* C)).

P3'' is left invariant by the protocol in Figure 1.2.1. P3'' is also an invariant of the protocol of Figure 1.2.2 provided B → B ∨ B → A initially holds. From post(Z(A, B)) and post(X2a), we conclude that as long as the execution of Z(A, B) completes before another hand-off starts, P3' will hold once Z(A, B) and the protocol in Figure 1.2.1 have both terminated. Since P3' implies B → B ∨ B → A, the system is once again in a state from which a hand-off can be performed. Hence, Z(A, B) can begin executing at any point during the hand-off from A to B—because its precondition, B → B ∨ B → A, holds throughout the protocol of Figure 1.2.1. And, Z(A, B) must complete before a subsequent hand-off has started.
1.2.4.2 Implementation using Messages

So far, the protocol we have derived consists of assignment statements to various variables that reside on separate processes. The protocol consists of the three processes, as follows:

    cobegin
        X1b: ⟨A.b.ctr(f) := false⟩ at A.b
        X1a: ⟨A.p.ctr(f) := false⟩ at A.p
        X2b: ⟨B.b.ctr(f) := true⟩ at B.b
        X2a: ⟨B.p.ctr(f) := true⟩ at B.p
    ||
        Zp:  ||C : C ∉ {A, B}: ⟨C.p.ctrID(f) := B⟩ at C.p
    ||
        Zb:  ||C : C ∉ {A, B}: ⟨C.b.ctrID(f) := B⟩ at C.b
    coend

An actual implementation would require that each assignment statement be executed by the processor whose variable is being set. Furthermore, the assignment statements of the first process must be sequenced. This sequencing will be accomplished in our implementation by processor B.p, since this processor starts the protocol. If B.p crashes, then B.b will take over the sequencing. Because all assignments are of constants to variables, when taking over, B.b can simply start at the beginning of the sequence—it need not know how far B.p got before failing.

B.b does need to know when B.p has finished executing the hand-off protocol. Otherwise, a crash of B.p might cause B.b to re-execute the hand-off from A to B after f has been later handed off to another controller, in which case B.b would undo that later hand-off. Hence, B.b must be notified of the completion of the hand-off before
any subsequent hand-offs are started. We represent the fact that a hand-off from A to B is in progress with a variable B.b.xfr, whose value is initially ⊥.

In order to continue the implementation using messages, some further details of the AAS system services must be given.

• Communication between resilient processes uses send and receive. If some process sends a message m to a resilient process C, then m is enqueued at C.p if C.p has not crashed and enqueued at C.b if C.p has crashed. Furthermore, send does not return control until the message has been enqueued at the remote process. The remote process may crash after enqueueing m but before delivering m, in which case m is lost.

• The primary of a resilient process communicates with its backup using log. Like send, log does not return control until the message is enqueued by the remote process. A log that is executed when there is no backup (for example, when the backup has crashed or when log is executed by the backup itself) does nothing and immediately returns control.

• Until the primary of a resilient process crashes, the backup delivers only messages sent by log.

• When primary C.p crashes, C.b takes over by first processing any enqueued messages sent by C.p using log. It then executes the user-defined recovery protocol. And, finally, it receives messages sent to C.

We also use a variable in each process to represent the value of variables C.p.ctrID(f) and C.b.ctrID(f). A simple approach would be to introduce C.p.owner(f) and C.b.owner(f), such that:

    C.p.ctrID(f) = C.p.owner(f)
    C.b.ctrID(f) = C.b.owner(f).
Doing so, however, is inefficient (as well as difficult given the AAS communication primitives). Consider X1b in the hand-off protocol. To implement X1b, B.p would send a message to A.b instructing it to execute A.b.owner(f) := B. Since X1b must complete before X1a starts, B.p cannot start X1a before A.b completes its assignment. The result is two end-to-end message delays.

A more efficient hand-off protocol can be implemented using the following definitions of C.p.ctrID(f) and C.b.ctrID(f). Let the predicate E_C(f, X) mean that C.b has enqueued but not yet processed a log from C.p that requests the execution of C.b.owner(f) := X, and let V_C(f) be the value of X in the most recent such log message. Then, we define:

    C.p.ctrID(f) = C.p.owner(f)

    C.b.ctrID(f) = C.b.owner(f)   if (∀X : ¬E_C(f, X))
                   V_C(f)         otherwise
B.p can cause the execution of X1b followed by X1a simply by sending a single message to A requesting execution of owner(f) := B. Upon delivery of this message, A.p first executes a log so A.b learns of the message. Since log does not return until E_A(f, B) holds, post(X1b) holds when log returns. A.p can then establish post(X1a) by executing A.p.owner(f) := B. The complete hand-off protocol is shown in Figure 1.2.3. The assertions in the code refer to Figures 1.2.1 and 1.2.2.

Acknowledgements

Scott Stoller provided helpful comments on a draft of this paper. We also would like to thank Mary Jodrie and Alan Moshel for bringing the problem to our attention and for helping us develop the requirements.
References

[1] F. Cristian, B. Dancey, and J. Dehn. Fault-tolerance in the Advanced Automation System. In Proceedings of the 20th International Symposium on Fault-Tolerant Computing (Newcastle upon Tyne, UK, 26-28 June 1990), pp. 6-17.

[2] C. A. R. Hoare. An Axiomatic Basis for Computer Programming. Communications of the ACM 12(10):576-580 (October 1969).

[3] Richard D. Schlichting and Fred B. Schneider. Fail-Stop Processors: An Approach to Designing Fault-Tolerant Computing Systems. ACM Transactions on Computer Systems 1(3):222-238 (August 1983).

[4] Fred B. Schneider. On Concurrent Programming. To appear.
    cobegin
        {pre(X1b)}
        ⟨log "xfr := (f, A)"⟩ at B.p
        ⟨send "owner(f) := B" to A⟩ at B.p
        ⟨wait for "ack" from A⟩ at B.p
        {post(X1a) ∧ pre(X2b)}
        ⟨log "owner(f) := B"⟩ at B.p
        {post(X2b) ∧ pre(X2a)}
        ⟨B.p.owner(f) := B⟩ at B.p
        {post(X2a)}
        ⟨∀C : C ∉ {A, B} : send "owner(f) := B" to C⟩ at B.p
        ⟨log "xfr := ⊥"⟩ at B.p
    ||C : C ≠ B :
        ⟨when deliver "owner(f) := X"⟩ at C.p
        ⟨log "owner(f) := X"⟩ at C.p
        {(C = A) ⇒ post(X1b) ∧ (C ≠ A) ⇒ post(Zb)}
        ⟨C.p.owner(f) := X⟩ at C.p
        {(C = A) ⇒ post(X1a) ∧ (C ≠ A) ⇒ (post(Zp) ∧ post(Zb))}
        ⟨send "ack" to X⟩ at C.p
    ||C :
        ⟨when deliver "x := v" from C.p do C.b.x := v⟩ at C.b
        ⟨when C.p fails do
             if xfr = (f, X) then start hand-off of f from X to C⟩ at C.b
    coend
Figure 1.2.3: Complete Hand-off Protocol
SECTION 1.3
Language Support for Fault-Tolerant Parallel and Distributed Programming

Richard D. Schlichting,† David E. Bakken,† and Vicraj T. Thomas††

† Dept. of Computer Science, Univ. of Arizona, Tucson, AZ 85721.
†† Honeywell Technology Center, 3660 Technology Drive, Minneapolis, MN 55418.
This work supported in part by the National Science Foundation under grant CCR-9003161 and the Office of Naval Research under grant N00014-91-J-1015.

Abstract

Most high-level programming languages contain little support for programming multicomputer programs that must continue to execute despite failures in the underlying computing platform. This paper describes two projects that address this problem by providing features specifically designed for fault-tolerance. The first is FT-Linda, a version of the Linda coordination language for writing fault-tolerant parallel programs. Major enhancements include stable tuple spaces whose contents survive failure and atomic execution of collections of tuple space operations. The second is FT-SR, a language based on the existing SR distributed programming language. Major features include support for transparent module replication, ordered group communication, automatic recovery and failure notification. Prototype versions of both languages have been implemented.
1.3.1. Introduction

Many multicomputer applications—that is, applications intended for execution on either a parallel machine or a distributed system—could benefit from increased support for fault-tolerance. For example, parallel scientific applications can be long-running, meaning that failures in the underlying computing platform (e.g., a node crash) can lead to significant wasted computation. Similarly, fault-tolerance provisions are mandatory in certain distributed applications such as air-traffic control since the software must continue to provide service even when failures occur. The
difficulty in writing such fault-tolerant software is widely recognized and has led to the development of many types of fault-tolerance techniques and software structuring paradigms [1]. This chapter describes two projects that attempt to simplify the construction of such fault-tolerant software by providing enhanced programming language support.

The first is FT-Linda, a version of the Linda coordination language [2] that includes provisions for fault-tolerance. Specifically, FT-Linda includes two major additions to standard Linda. One is the ability to define stable tuple spaces, which allow data to be stored in a distributed virtual memory that survives processor failures. The other is an atomic guarded statement, which allows collections of tuple space operations to be executed atomically despite concurrency and failures.

The second project is FT-SR, which has taken the high-level SR distributed programming language [3] and added fault-tolerance features. The novelty of FT-SR is that it presents an integrated package of mechanisms that can be used to construct programs that use a variety of approaches. For example, it is equally well-suited for building distributed fault-tolerant programs structured according to the object/action model [4], the restartable action paradigm [5], and the replicated state machine approach [6]. The specific mechanisms include provisions for transparent module replication, ordered group communication, automatic recovery, and failure notification. These mechanisms are also designed to mesh well with the features already present in SR, so that a coherent language package is presented to the programmer.

These two projects have a number of common elements. For example, although each language is intended for a different problem domain and is based on a different computing model, they are united in their orientation towards language-based mechanisms. Another common element is that both projects are pragmatic and implementation oriented. FT-Linda is currently being implemented using the Consul system [7] and the x-kernel [8], and will execute on top of the Mach microkernel. An initial implementation of FT-SR that runs standalone on a network of Sun workstations using the x-kernel has been completed. In both cases, prototype implementations of realistic applications are either in progress or planned.

In the following sections, we provide an overview of FT-Linda and FT-SR. In Sections 2 and 3, respectively, we describe each language's salient features and give an illustrative example of its use. Section 4 then describes how each language is implemented, with a focus on the run-time systems where most of the fault-tolerance features are realized. Finally, Section 5 compares our approaches to other
language-based approaches for fault tolerance, while Section 6 offers conclusions.
1.3.2. An Overview of FT-Linda

1.3.2.1. Linda

Linda is a system for constructing parallel programs based on a communication abstraction known as tuple space (TS) [9]. Tuple space is an associative (i.e., content-addressable) unordered bag of data elements called tuples. Processes are created in the context of a given TS, which they use as a means for communicating and synchronizing. A tuple consists of a logical name and zero or more values. An example of a tuple is (N, 100, true). Here, the tuple consists of the logical name N and the data values 100 and true. Tuples are immutable, i.e., they cannot be changed once placed in TS by a process.

The basic operations defined on a TS are the deposit and withdrawal of tuples. The out operation deposits a tuple into TS and is nonblocking. The in operation withdraws a tuple matching the specified parameters from TS; we call these parameters its pattern. To match such a tuple, the in must have the same number of parameters, and each parameter must match the corresponding value in the tuple. These parameters can either be actuals or formals. Actuals are literal values or variables. To match a value in a tuple, the value of that literal or variable actual must be of the same type and value as the corresponding value in the tuple. Formals are a question mark followed by the name of a variable or type. Formals automatically match any value of the same type. If the formal has a variable name, then in assigns to the variable the corresponding value from the tuple.

The Linda rd operator is like in except it does not withdraw the tuple. Similarly, operators rdp and inp are exactly like rd and in, respectively, except that they are nonblocking: they return a boolean result that indicates if an appropriate tuple was found.
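As a brief illustration (our own sketch, not taken from the chapter; subtotal and flag are hypothetical variable names), the tuple above could be deposited and then retrieved as follows:

    out("N", 100, true)         # deposit the tuple (N, 100, true); nonblocking
    rd("N", ?subtotal, ?flag)   # copy it: subtotal gets 100, flag gets true; the tuple stays in TS
    in("N", ?subtotal, true)    # withdraw it; the actual value true must match the third field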
1.3.2.2. FT-Linda Language Features

The effects of processor failures on the execution of Linda programs are not considered in standard definitions of the language or most implementations. In examining how Linda is currently used to program parallel applications and would likely be used for fault-tolerant applications, two fundamental deficiencies are apparent. The first is lack of tuple stability. That is, the language contains no provisions for
guaranteeing that tuples will remain available following a processor failure. The second can be characterized as lack of sufficient atomicity. Informally, a computation that modifies shared state is atomic if, from the perspective of other computations, all its modifications appear to take place instantaneously despite concurrent access and failures. In Linda, of course, the shared state is the TS and the computations in question are TS operations. The key here is that Linda provides only single-op atomicity, i.e., atomic execution for only a single TS operation, while multi-op atomicity is required to make many applications fault-tolerant. FT-Linda addresses these problems by providing stable tuple spaces and atomic guarded statements, respectively. We now describe each in turn.

Stable Tuple Spaces. Stable tuple spaces are part of a more general feature that allows users to define multiple tuple spaces, each of which can vary in their resilience and scope. The resilience attribute, either stable or volatile, specifies the TS behavior when failures occur. In particular, tuples in a stable TS will survive (some number of) processor failures, while those within a volatile TS have no such guarantee. The scope attribute can be either shared or private. A shared TS can be used by any process and is analogous to the single TS in current versions of Linda. A private TS, on the other hand, may be used only by a single specified process.

A single stable shared TS is created when the program is started and can be accessed using the handle TSmain. Other tuple spaces are created using the FT-Linda primitive ts_create. This function takes the resilience and scope attributes as required arguments, and returns a TS handle that is subsequently passed as the first argument to other TS primitives such as in and out. An optional third argument used in the case of private TSs is the identifier of the process that can access the TS.

Atomic Guarded Statements. An atomic guarded statement (AGS) provides all-or-nothing execution of multiple tuple operations despite the possibility of failures or concurrency. The simplest case of the statement is:

    < guard -> body >

where the angle brackets are used to denote atomic execution. The guard can be in, inp, rd, rdp, or true, while the body is a series of out, in, or rd operations or a null body denoted by skip. The process executing an AGS is blocked until the guard either succeeds or fails, as defined below. If it succeeds, the body is then executed in such a way that the guard and body are an atomic unit; if it fails, the body is not executed. In either case, execution proceeds with the next statement.
Informally, a guard succeeds if either a matching tuple is found or the value true is returned. Specifically, true succeeds immediately. A guard of in or rd succeeds once there is a matching tuple in the named TS, which may be immediately, at some time in the future, or never. A guard of inp or rdp succeeds if there is a matching tuple in TS when execution begins. Conversely, a guard fails if the guard is an inp or rdp and there is no matching tuple in TS when the AGS is executed. A boolean operation used as a guard may be preceded by not, which inverts the success semantics for the guard in the expected way.

Only one operation—the guard—is allowed to block in an AGS. Thus, if an in or rd in the body does not find a matching tuple in TS, an exceptional condition is declared and the program is aborted. The AGS also has a disjunctive case that allows more than one guard/tuple pair, as shown below:

    <    guard_1 -> body_1
     or  guard_2 -> body_2
         ...
     or  guard_n -> body_n  >
Intuitively, a process executing this statement blocks until at least one guard succeeds or all guards fail. A more thorough description of the semantics can be found in [10].

Other Features. FT-Linda has a number of other useful features. For example, two operations move and copy are provided to allow a collection of tuples to be moved atomically from one TS to another. The primitives support tuple patterns similar to those allowed in the in operation. In this case, only the matched tuples are moved or copied, respectively. In addition, inp and rdp operations provide absolute guarantees as to whether there is a matching tuple, a property that we call strong inp/rdp semantics. Of all other distributed Linda implementations of which we are aware, only [11] offers similar semantics. Other implementations either do not provide inp and rdp or provide only a "best effort" attempt to find a matching tuple. Semantics of the type provided by FT-Linda can be very useful since they make a strong statement about the global state of the TS and hence of the parallel application or system built using FT-Linda. Finally, FT-Linda also provides oldest matching semantics, meaning that in, inp, rd, and rdp always return the oldest matching tuple if one exists. These
semantics are exploited in the disjunctive version of an AGS as well to select the guard and body to be executed if more than one guard succeeds.
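To give a feel for how these features combine, here is a small sketch of our own (not from the chapter); the tuple names and the variable scratch are hypothetical, and only the primitives described above are used:

    # Create a volatile, private scratch TS for this process.
    scratch = ts_create(volatile, private)

    # Atomically take whichever tuple is available first: a shutdown request
    # (consumed with a null body) or a work request, which is recorded in the
    # scratch TS in the same atomic step.
    < in(TSmain, "shutdown") ->
          skip
      or in(TSmain, "request", ?req_args) ->
          out(scratch, "pending", req_args)
    >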
1.3.2.3. A Fault-Tolerant Bag-of-Tasks

Linda lends itself nicely to a method of parallel programming called the bag-of-tasks or replicated worker programming paradigm [2]. In this paradigm, the task to be solved is partitioned into independent subtasks. These subtasks are placed in a shared data structure called a bag, and each process in a pool of identical workers then repeatedly retrieves a subtask description from the bag, solves it, and outputs the solution. In solving it, the process may use only the subtask arguments and possibly non-varying global data, which means that the same answer will be computed regardless of which processor computes it and at what time. Among the advantages of this programming approach are transparent scalability and automatic load balancing.

Realizing this approach in Linda is done by having the TS function as the bag. The TS is seeded with subtask tuples, where each such tuple contains arguments that describe the given subtask to be solved. The collection of subtask tuples can thus be viewed as describing the entire problem. The actions taken by a generic (non-fault-tolerant) worker are shown in Figure 1.3.1. The initial step is to withdraw a tuple describing the subtask to be performed; the label "work" is used as a distinguishing mark to identify tuples containing
    process worker
        while true do
            in("work", ?subtask_args)
            calc(subtask_args, var result_args)
            out("result", result_args)
        end while
    end proc

    Figure 1.3.1 — Non-fault-tolerant worker process
subtask arguments. The worker then computes the results, which are subsequently output to TS with an identifying label.

Given this worker, a processor failure such as a "crash" at an inopportune time could result in a lost calculation. Specifically, if the processor fails after a worker has withdrawn the subtask tuple but before depositing the result tuple, that result will never be computed. To solve this problem, an "in_progress" tuple is deposited into TS atomically when the subtask tuple is withdrawn, and then removed atomically when the result tuple is deposited. This in_progress tuple completely describes the subtask tuple and can be used to regenerate the lost subtask tuple should a failure occur. The code for the worker demonstrating this technique is shown in Figure 1.3.2.

The final part of the example is regenerating a lost subtask tuple from the in_progress tuple after a failure. This job is performed by a monitor process that executes on each machine. The code for this process is shown in Figure 1.3.3. This process uses failure tuples that are deposited automatically by the FT-Linda runtime system upon detection of a failed processor to trigger recovery activity. The invocation failure_ts is a function in the runtime system that specifies into which TS the failure tuples are to be deposited.

    # TSmain is global, stable
    process worker()
        while true do
            < in(TSmain, "work", ?subtask_args) ->
                  out(TSmain, "in_progress", my_hostid, subtask_args) >
            calc(subtask_args, var result_args)
            < in(TSmain, "in_progress", my_hostid, subtask_args) ->
                  out(TSmain, "result", result_args) >
        end while
    end worker

    Figure 1.3.2 — Fault-tolerant worker
    process monitor()
        TSmonitor = ts_create(volatile, private)
        failure_ts(TSmonitor)
        while true do
            in(TSmonitor, "failure", ?host)
            # regenerate all in_progress tuples we find from the failed host
            while < inp(TSmain, "in_progress", host, ?subtask_args) ->
                        out(TSmain, "work", subtask_args) >
            do skip end while
        end while
    end monitor

    Figure 1.3.3 — Monitor process

1.3.3. An Overview of FT-SR

1.3.3.1. SR

SR is a high-level distributed programming language in which programs consist of one or more modules called resources [3]. Resource instances are created dynamically and may be distributed over multiple virtual machines, each of which is mapped to some physical machine in a network. An operation is an entry into a resource that has a name, and can have parameters and return a result. There are two different ways to implement an operation: as a proc or as an alternative in an input statement. A proc is a section of code whose format resembles that of a conventional procedure. Unlike a procedure, though, a new process is created, at least conceptually, each time the operation is invoked. An input statement is a type of multiway receive that has the following form:

    in opname_1(parameters) -> op_body_1
    [] opname_2(parameters) -> op_body_2
       ...
    [] opname_n(parameters) -> op_body_n
    ni

A process executing an input statement is delayed until there is at least one alternative opname_i for which there is a pending invocation. When this occurs, one such alternative is selected non-deterministically, the oldest pending invocation for the chosen alternative is selected, and the corresponding statement list is executed.

An operation is invoked explicitly using a call or send statement, or is implicitly called by its appearance in an expression. Execution of a call terminates once the operation has been executed and a result, if any, returned. Its execution is thus synchronous with respect to the operation execution. Execution of a send is, on the other hand, asynchronous: a send terminates when the target process has been created (if a proc), or when the arguments have been queued for the process implementing the operation (if an input statement).
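As a small illustration in the style of the SR figures later in this section (this fragment is ours and only a sketch; the resource and operation names are hypothetical), the following resource services two operations with a single input statement and shows both invocation forms:

    resource accumulator
      op add(n: int)
      op report(client: cap collector)   # collector is assumed to export op result(n: int)
    body accumulator()
      var total: int := 0
      process serve
        do true ->
          in add(n) -> total := total + n
          [] report(client) -> send client.result(total)
          ni
        od
      end serve
    end accumulator

    # Elsewhere, given acc: cap accumulator:
    send acc.add(5)     # asynchronous: returns once the invocation has been queued
    call acc.add(10)    # synchronous: returns after the chosen alternative has executed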
1.3.3.2. FT-SR Language Features

FT-SR is based on a programming model oriented around fail-stop atomic objects (FS atomic objects) [12]. Such objects behave as atomic objects, with operations that are atomic and serializable [5], except when their implementation assumptions are violated, in which case failure notification is provided. Such a notification would be generated, for example, when multiple failures exhaust the redundancy of a high-level FS atomic object comprised of replicated instances of a simple FS atomic object. This feature allows an object using another object to react to its catastrophic failure and initiate appropriate recovery actions.

The goal of FT-SR is to support the building of systems based on the FS atomic object model. However, given the need for flexibility, we do not provide these objects directly in the language, but rather include features that allow them to be easily implemented. To this end, the language has provisions for encapsulation based on SR resources, resource replication, recovery protocols, synchronous failure notification when performing interprocess communication, and a mechanism for asynchronous failure notification. Features used for simple FS atomic objects are outlined first, followed by those for composing such objects into higher-level FS atomic objects.

Simple FS Atomic Objects. Realizing much of the functionality of a simple FS atomic object—i.e., one not composed of other objects or using any other fault-
tolerance techniques—in SR is straightforward since a resource instance is essentially an object in its own right. The one aspect of simple FS atomic objects that SR does not support directly is generation of a failure notification. Two different kinds of failure notification and, consequently, two different ways of fielding a notification are provided. The first is synchronous with respect to a call; it is fielded by an optional backup operation specified in the calling statement. The second kind of notification is asynchronous; the programmer specifies a resource to be monitored and an operation to be invoked should the monitored resource fail.

Backup operations are typically implemented locally by the calling resource, but in general, the backup could be in any resource. The backup operation is called with the same arguments as the original operation and, hence, must be type compatible with the original operation. Backup operations can only be specified with call invocations; send invocations are non-blocking and no guarantees can be made about the success or failure of such an invocation if the resource implementing the operation fails. Execution is blocked if a call fails and there is no associated backup operation.

Asynchronous failure notification is realized by new monitor and monitorend statements. A process uses monitor to enable monitoring of a resource instance. This statement takes the identifier (i.e., the resource capability) of the instance to be monitored, and the name and arguments of an operation to be invoked by the runtime system should the instance fail. Monitoring is terminated by monitorend, which also takes a resource capability as its argument, or by another monitor statement that specifies the same resource. The ability to request asynchronous notification has proven to be convenient in a variety of contexts [13, 14, 15] and is in keeping with the inherently asynchronous nature of failures themselves.

Higher-Level FS Atomic Objects. FT-SR provides mechanisms for supporting the construction of more fault-tolerant, higher-level FS atomic objects using replication, and for increasing the resilience of objects to failures using recovery techniques. The replication facilities allow multiple copies of a resource to be created, with the language and runtime providing the illusion that the collection is a single resource instance exporting the same set of operations. The SR create statement has been generalized to allow for the creation of such replicated resources, which we call a resource group. When creating a resource group, the value returned from executing the create statement is a resource group capability that allows multicast invocation of any of the group's exported operations. In other words, using this capability in a call or a send statement causes the invocation to be multicast to each of the individual
resource instances that make up the group. All such invocations are guaranteed to be delivered to the runtime of each instance in a consistent total order, although the program may vary this if desired. A resource group can also be configured to work according to a primary-backup scheme; in this scenario, invocations to the group are delivered only to a replica designated as the primary by the language runtime, with the other replicas being passive.

FT-SR also provides the programmer with the ability to restart a failed resource instance on a functioning virtual machine. The recovery code to be executed upon restart is denoted by placing it between the keywords recovery and end in the resource text. A resource instance may be restarted either explicitly or implicitly. Explicitly, it is done using a restart statement that is syntactically similar to the create statement. Implicit restart is indicated by the presence of the keyword persistent in the resource declaration. In this case, if a virtual machine executing one of the instances of the resource group fails, the system will select another virtual machine and recreate the failed instance automatically. The arguments supplied during the recreation are the same as those used for the original creation. This facility is designed to allow a resource group to automatically regain its original level of redundancy following a failure.

Finally, we note that the failure notification facilities described above work with resource groups as one would expect. If a resource is not persistent, a notification is generated once all replicas have been destroyed, while for a persistent resource, a notification is generated once all replicas have been destroyed and the list of all backup machines has been exhausted. In either case, the way in which the notification is fielded is specified using backup operations or the monitor statement as described above.
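A brief sketch of our own, modeled on Figure 1.3.6 and the statements just described (the names sscap, vmcaps, and ssFailed are hypothetical):

    # Create a resource group of two stableStore replicas on the available
    # virtual machines; the value returned is a group capability.
    sscap := create(i := 1 to 2) stableStore() on vmcaps

    # Ask the runtime to invoke ssFailed() if the group fails catastrophically;
    # monitorend later cancels the request.
    monitor(sscap) send ssFailed()
    ...
    monitorend(sscap)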
1.3.3.3. A Fault-Tolerant Data Manager

In this section, we present an example program that illustrates how FT-SR can be used to construct a simple system consisting of a data manager and associated stable storage objects. The data manager implements a collection of operations that provide transactional access to data items located on a stable storage. The organization of the manager itself is based on the restartable action paradigm, with key items in the internal state being saved on stable storage for later recovery in the event of failure. Replication is used to build stable storage.

The data manager controls concurrency and provides atomic access to data items on stable storage. For simplicity, we assume that all data items are of the same
type and are referred to by a logical address. Stable storage is read by invoking its read operation, which takes as arguments the address of the block to be read, the number of bytes to be read, and a buffer in which the values read are to be returned. Data is written to stable storage by invoking an analogous write operation, which takes as arguments the starting address of the block being written, the number of bytes in the block, and a buffer containing the values to be written.

Figure 1.3.4 shows the specification and an outline of the body of such a data manager. As can be seen in its specification, the data manager imports stable storage and lock manager resources, and exports six operations: startTransaction, read, write, prepareToCommit, commit, and abort. The operation startTransaction is invoked by the transaction manager to access data held by the data manager; its arguments are a transaction identifier tid and a list of addresses of the data items used during the transaction. read and write are used to access and modify objects. The two operations prepareToCommit and commit are invoked in succession upon completion to, first, commit any modifications made to the data items by the transaction, and, second, terminate the transaction. abort is used to abandon any modifications and terminate the transaction; it can be invoked at any time up to the time commit is first invoked. All of these operations exported by the data manager are implemented as procs; thus, invocations result in the creation of a thread that executes concurrently with other threads. Finally, the data manager contains initial and recovery code, as well as a failure handler proc that deals with the failure of the lockManager and stableStore resources.

To implement the atomic update of the data items, the data manager uses the standard technique of maintaining two versions of each data item on stable storage together with an indicator of which is current [16]. To simplify our implementation, we maintain this indicator and the two versions in contiguous stable storage locations, with the indicator being an offset and the address of the indicator used as the logical address of the item. Thus, the actual address of the current copy of the item is calculated by taking the address of the item and adding to it the indicator offset.

The data manager keeps track of all in-progress transactions in a status table. This table contains for each active transaction the transaction identifier (tid), the status (transStatus), the stable storage addresses of the data items being accessed by the transaction (dataAddrs), the value of the indicator offset of each item (currentPointers), a pointer to an array in volatile memory containing a copy of the data items (memCopy), and the number of data items being used in the transaction (numItems). This table can be accessed concurrently by threads executing the procs
    resource dataManager
      imports globalDefs, lockManager, stableStore
      op startTransaction(tid: int; dataAddrs: addrList; numDataItems: int)
      op read(tid: int; dataAddrs: addrList; data: dataList; numDataItems: int)
      op write(tid: int; dataAddrs: addressList; data: dataList; numDataItems: int)
      op prepareToCommit(tid: int), commit(tid: int), abort(tid: int)
    body dataManager(dmId: int; lmcap: cap lockManager; ss: cap stableStore)
      type transInfoRec = rec(tid: int; transStatus: int; dataAddrs: addressList;
                              currentPointers: intArray; memCopy: ptr dataArray;
                              numItems: int)
      var statusTable[1:MAX_TRANS]: transInfoRec; statusTableMutex: semaphore

      initial
        # initialize statusTable
        monitor(ss) send failHandler()
        monitor(lmcap) send failHandler()
      end initial

      ...code for startTransaction, prepareToCommit, commit, abort, read/write...

      proc failHandler()
        destroy myresource()
      end failHandler

      recovery
        ss.read(statusTable, sizeof(statusTable), statusTable);
        transManager.dmUp(dmId);
      end recovery
    end dataManager

    Figure 1.3.4 — Outline of dataManager resource
in the body of the data manager, so the semaphore statusTableMutex is used to achieve mutual exclusion. New entries in this table also get saved on stable storage for recovery purposes. Reads and writes during execution of the transaction are actually performed by the data manager on versions of the items that it has cached in its local (volatile) storage.

The data manager depends on the stable storage and lock manager resources to implement its operations correctly. As a result, it needs to be informed when they fail catastrophically. The data manager does this by establishing an asynchronous failure handler failHandler for both of these events in the initial code using the monitor statement. When invoked, failHandler terminates the data manager resource, thereby causing the failure to be propagated to the transaction manager. The failure of the data manager itself is handled by recovery code that retrieves the current contents of the status table from stable storage. It is the responsibility of the transaction manager to deal with transactions that were in progress at the time of the failure; those for which commit had not yet been invoked are aborted, while commit is reissued for the others. To handle this, the recovery code sends a message to the transaction manager notifying it of the recovery. The procs implementing the other data manager operations do not use any of the FT-SR primitives specifically designed for fault-tolerant programming and are therefore not shown here. They can be found in reference [12].

We now turn to implementing stable storage. One way of realizing this abstraction is to replicate the storage resource to increase failure resilience. Figure 1.3.5 shows such a resource; for simplicity, we assume that storage is managed as an array of bytes. Replica failures are dealt with by restarting the resource on another machine; this is done automatically since stableStore is declared to be a persistent resource. The recovery code that gets executed in this scenario starts by requesting the current state of the store from the other group members. All replicas respond to this request by sending a copy of their storage state; the first response is received, while the other responses remain queued at the recvState operation until the replica is either destroyed or fails. The newly restarted replica begins processing queued messages when it is finished with recovery. Since messages are queued from the point that its sendState message was sent to the group, the replica can apply these subsequent messages to the state it receives to reestablish consistency with the state of the other replicas.
    persistent resource stableStore
      import globalDefs
      op read(address: int; numBytes: int; buffer: charArray)
      op write(address: int; numBytes: int; buffer: charArray)
      op sendState(sscap: cap stableStore)
      op recvState(objectStore: objList)
    body stableStore
      var store[MEMSIZE]: char
      process ss
        do true ->
          in read(address, numBytes, buffer) ->
               buffer[1:numBytes] := store[address:address+numBytes-1]
          [] write(address, numBytes, buffer) ->
               store[address:address+numBytes-1] := buffer[1:numBytes]
          [] sendState(rescap) ->
               send rescap.recvState(store)
          ni
        od
      end ss
      recovery
        send mygroup().sendState(myresource())
        receive recvState(store);
        send ss
      end recovery
    end stableStore

    Figure 1.3.5 — stableStore resource
Stable storage could also be implemented as a primary-backup group by adding a primary restriction to the read and write operations. The process ss would then send the updated state to the rest of the group at the end of each operation by invoking a recvState operation on the group. This operation would be implemented by extending the input statement in ss to include this operation as an additional alternative.
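A sketch of what the modified write alternative might look like (ours, not from the chapter; the syntax for designating the primary is not shown in this section, so only the state-propagation step is illustrated, reusing mygroup() from Figure 1.3.5):

    [] write(address, numBytes, buffer) ->
         store[address:address+numBytes-1] := buffer[1:numBytes]
         # propagate the updated state to the other replicas of the group
         send mygroup().recvState(store)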
    resource main
      imports transManager, dataManager, stableStore, lockManager
    body main
      var virtMachines[3]: cap vm                       # array of virtual machine capabilities
          dataSS[2], tmSS: cap stableStore              # capabilities to stable stores
          lm: cap lockManager; dm[2]: cap dataManager   # caps to lock and data managers

      virtMachines[1] := create vm() on "host1"
      virtMachines[2] := create vm() on "host2"
      virtMachines[3] := create vm() on "host3"         # backup machine

      # create stable storage for use by the data managers and the transaction manager
      dataSS[1] := create(i := 1 to 2) stableStore() on virtMachines
      dataSS[2] := create(i := 1 to 2) stableStore() on virtMachines
      tmSS := create(i := 1 to 2) stableStore() on virtMachines

      # create lock manager, data managers, and transaction manager
      lm := create lockManager() on virtMachines[2]
      fa i := 1 to 2 ->
        dm[i] := create dataManager(i, lm, dataSS[i]) on virtMachines[i]
      af
      tm := create transManager(dm[1], dm[2], tmSS) on virtMachines[1]
    end main

    Figure 1.3.6 — System startup in resource main
The main resource that starts up the entire system is shown in Figure 1.3.6. Resource main creates a virtual machine on each of the three physical machines available in the system. Three stable storage objects are then created, where each such object has two replicas and uses the virtual machine on "host3" as a backup machine. The two data managers are then created followed by the transaction manager. Notice how the system is created "bottom up," with the objects at the
bottom of the dependency graph being created before the objects on which they depend. This way, each object can be given capabilities to the objects on which it depends upon creation.
1.3.4. Implementing Language Fault-Tolerance Features

The implementation of the fault-tolerance features in both FT-Linda and FT-SR consists of two major components: a precompiler and a runtime system. In the case of FT-Linda, the precompiler catalogs the tuple patterns used in the program, transforms patterns into the appropriate tuple indices, and then generates C code. This precompiler is a derivative of the Ice precompiler for standard Linda [17]. For FT-SR, the precompiler is a modified version of the standard SR precompiler, which uses lex and yacc. It also generates C code. The second component of each language is a runtime system that is linked in with the object code of the user's program to form an executable. The runtime system contains the bulk of the functionality needed to implement the fault-tolerance features, so we focus here on these parts of the respective languages. As mentioned, both are based on the x-kernel, an operating system kernel that facilitates the implementation of the network protocols needed for these distributed implementations.

FT-Linda Runtime System. The general scheme used for implementing stable tuple spaces and atomic guarded statements within the runtime system is based on the replicated state machine approach [6], a formalization of replicated processing to achieve fault-tolerance. In this technique, an application is represented as a state machine that maintains state variables and makes modifications in response to commands from other state machines or the environment. To provide resilience to failures, the state machine is replicated on multiple independent processors, and an ordered atomic multicast is used to deliver commands to all replicas reliably and in the same total order.

Given this scheme, stable tuple spaces are implemented by treating them as the state machine application and replicating the contents across multiple machines. Updates are then propagated using atomic multicast and applied at each copy in the same order. The atomicity required for the multiple TS operations in an atomic guarded statement is realized by treating the entire sequence as, in essence, a single state machine command. This technique has the virtue of being simple to implement and requires less message passing than general transactions, while still supporting the level of atomicity needed to realize fault tolerance in many Linda applications.
[Figure 1.3.7 depicts N hosts; on each host the generated code sits above the FT-Linda library, the TS state machine, and Consul, which connects to the network.]
Figure 1.3.7 — FT-Linda Runtime Structure
The actual software components of the runtime system are the FT-Linda library, which manages the flow of control associated with processing FT-Linda requests and implements local TSs; the TS state machine, which manages the copies of stable TSs on this host; and Consul, which implements the basic functionality of atomic multicast, consistent total ordering, and membership. It also notifies the FT-Linda runtime of processor failures so that failure tuples can be deposited into the TS specified by the user application. The runtime structure of a system consisting of N host processors is shown in Figure 1.3.7; here, the edges represent the path taken by messages to and from the network.

FT-SR Runtime System. The FT-SR runtime system implements the functionality required to realize resource groups, recovery code, and both types of failure notification. The overall structure of the runtime on a given machine is shown in Figure 1.3.8. Here, some of the important modules in both the kernel and user resident parts of the runtime system are shown together with the communication paths between them.

Figure 1.3.8 — FT-SR Runtime Structure

The kernel resident modules shown are the Communication
Manager, the Virtual Machine (VM) Manager, and the Processor Failure Detector (PFD). The communication manager consists of multiple communication protocols that provide point-to-point and broadcast communication services between
processors. The VM Manager is responsible for creating and destroying virtual machines, and for providing communication services between virtual machines. The PFD is a failure detector protocol; it monitors processors and notifies the VM manager when a failure occurs. The user space resident modules shown in Figure 1.3.8 are the Resource Manager, the Group Manager, the Invocation Manager, and the Resource Failure Detector (RFD). The Resource Manager is responsible for the creation, destruction and restart of failed resources. The Group Manager is responsible for the creation, destruction and restart of groups, restart of failed group members, and all communication to and from groups. The RFD detects the failure of resources and group members.

Perhaps the most interesting part of the implementation is the way in which communication involving resource groups is managed. Recall that all messages sent to a group as a result of an invocation must be multicast and delivered to all replicas in a consistent total order. The technique we use is similar to [18, 19, 20], where one of the replicas is a primary through which all messages are funneled. Another max_sf ("maximum simultaneous failures") replicas are designated as primary-group members, with the remaining being considered ordinary members. Upon receiving a message, the primary adds a sequence number and multicasts it to all replicas of the group. Only the replicas that belong to the primary-group acknowledge receipt of the message. As soon as the primary gets these max_sf acknowledgements, it sends an acknowledgement to the original sender of the message; this action is appropriate since the receipt of max_sf acknowledgements guarantees that at least one replica will have the message even with max_sf failures.

The membership of these groups may be modified when a failure occurs. For example, if the primary fails, the first member of the primary-group is designated as the new primary. The designation of a new primary or the failure of a primary-group member will cause the size of the primary-group to fall below max_sf. When this happens, ordinary members are added to the primary-group to bring it up to max_sf members. No special action is needed when an ordinary member fails. If the resource from which the failed member was created is declared as being persistent and backup virtual machines were specified in the create statement, failed replicas are restarted on these backups. Restarted replicas join the group as ordinary members.
1.3.5. Related Work

A number of other efforts have investigated language-based methods for supporting fault-tolerance. For Linda specifically, these include [11, 21, 22, 23, 24]. Some do not add to or modify the existing Linda primitives, but rather focus on improving the failure semantics of existing Linda constructs. For example, [24] gives a design for implementing stable TSs based on replicating tuples, and then using locks and a general commit protocol to perform updates. Other efforts are more similar to our approach in that they also extend the language itself. For example, PLinda [11] allows the programmer to combine Linda tuple space operations in a transaction, and also provides combination commands (e.g., in-out) that allow multiple operations to be done atomically. This design is very general, but the implementation overhead appears to be significant.

Other high-level languages for fault-tolerance similar to FT-SR have been proposed as well. In general, however, they provide support for only a single way of structuring applications, restricting the flexibility of the programmer. For example, languages like Argus [25], Avalon [26], Plits [27], TABS [28] and Hops [29] support the object/action model, while languages like Fault-Tolerant Concurrent C [14] support the replicated state machine approach. In contrast, FT-SR, with its underlying model of fail-stop atomic objects, has been designed to support all these paradigms equally well.
1.3.6. Conclusions

Developing enhanced language support for fault tolerance is, in some sense, a neglected area compared with the numerous efforts to develop new system libraries or network protocols. However, our view is that research in this area has the potential to yield significant benefits. By offering a high-level realization of important fault-tolerance abstractions, programmers are freed from the need to learn implementation details or how a particular library can be used in a given context. The advantages of a single, coherent package for expressing the program should not be underestimated either. In fact, our efforts to incorporate fault-tolerance constructs in languages can be viewed as continuing the evolution of programming languages, which have generally progressed over the years from low-level languages to high-level languages with more functionality. The overriding challenge here, as in any language design effort, is finding the right balance between the expressiveness of the language constructs and the efficiency of the underlying implementation.
1.3.7. References

[1] S. Mishra and R. Schlichting, "Abstractions for constructing dependable distributed systems," Technical Report 92-19, Dept. of Computer Science, University of Arizona, 1992.
[2] S. Ahuja, N. Carriero, and D. Gelernter, "Linda and friends," IEEE Computer, vol. 19, pp. 26-34, August 1986.
[3] G. Andrews and R. Olsson, The SR Programming Language, Benjamin/Cummings, Redwood City, CA, 1993.
[4] J. Gray, "An approach to decentralized computer systems," IEEE Trans. on Software Engineering, vol. SE-12, pp. 684-692, June 1986.
[5] B. Lampson, "Atomic transactions," in Distributed Systems - Architecture and Implementation (B. Lampson, M. Paul, and H. Siegert, eds.), ch. 11, pp. 246-265, Springer-Verlag, Berlin, 1981.
[6] F. Schneider, "Implementing fault-tolerant services using the state machine approach: A tutorial," ACM Computing Surveys, vol. 22, pp. 299-319, Dec. 1990.
[7] S. Mishra, L. Peterson, and R. Schlichting, "Consul: A communication substrate for fault-tolerant distributed programs," Distributed Systems Engineering, vol. 1, pp. 87-103, 1993.
[8] N. Hutchinson and L. Peterson, "The x-kernel: An architecture for implementing network protocols," IEEE Trans. on Software Engineering, vol. SE-17, pp. 64-76, Jan. 1991.
[9] D. Gelernter, "Generative communication in Linda," ACM Trans. on Programming Languages and Systems, vol. 7, pp. 80-112, Jan. 1985.
[10] D. Bakken and R. Schlichting, "Supporting fault-tolerant parallel programming in Linda," IEEE Trans. on Parallel and Distributed Systems, to appear, 1994.
[11] B. Anderson and D. Shasha, "Persistent Linda: Linda + transactions + query processing," in Research Directions in High-Level Parallel Programming Languages, LNCS, Vol. 574, pp. 93-109, Springer-Verlag, Berlin, 1991.
[12] V. Thomas, FT-SR: A Programming Language for Constructing Fault-Tolerant Distributed Systems, Ph.D. Dissertation, Dept. of Computer Science, University of Arizona, 1993.
[13] P. Buhr, H. MacDonald, and C. Zarnke, "Synchronous and asynchronous handling of abnormal events in the μSystem," Software-Practice and Experience, vol. 22, pp. 735-776, Sept. 1992.
[14] R. Cmelik, N. Gehani, and W. Roome, "Fault Tolerant Concurrent C: A tool for writing fault tolerant distributed programs," in Proc. 18th Symp. on Fault-Tolerant Computing, Tokyo, pp. 55-61, June 1988.
[15] R. Schlichting, F. Cristian, and T. Purdin, "A linguistic approach to failure-handling in distributed systems," in Dependable Computing for Critical Applications, pp. 387-409, Springer-Verlag, Wien, 1991.
[16] P. Bernstein, V. Hadzilacos, and N. Goodman, Concurrency Control and Recovery in Database Systems, Addison-Wesley, Reading, MA, 1987.
[17] J. Leichter, Shared Tuple Memories, Shared Memories, Buses and LANs - Linda Implementation Across the Spectrum of Connectivity, Ph.D. Dissertation, Dept. of Computer Science, Yale University, 1989.
[18] J. Chang and N. Maxemchuk, "Reliable broadcast protocols," ACM Trans. on Computer Systems, vol. 2, pp. 251-273, Aug. 1984.
[19] M. Kaashoek, A. Tanenbaum, S. Hummel, and H. Bal, "An efficient reliable broadcast protocol," Operating Systems Review, vol. 23, pp. 5-19, Oct. 1989.
[20] H. Garcia-Molina and A. Spauster, "Ordered and reliable multicast communication," ACM Trans. on Computer Systems, vol. 9, pp. 242-271, Aug. 1991.
[21] S. Cannon and D. Dunn, "A high-level model for the development of fault-tolerant parallel and distributed systems," Technical Report A0192, Dept. of Computer Science, Utah State Univ., 1992.
[22] S. Kambhatla, Replication Issues for a Distributed and Highly Available Linda Tuple Space, M.S. Thesis, Dept. of Computer Science, Oregon Graduate Institute, 1991.
[23] L. Patterson, R. Turner, R. Hyatt, and K. Reilly, "Construction of a fault-tolerant distributed tuple-space," in Proc. 1993 ACM Symp. on Applied Computing, pp. 279-285, Feb. 1993.
[24] A. Xu and B. Liskov, "A design for a fault-tolerant distributed implementation of Linda," in Proc. 19th Fault-Tolerant Computing Symposium, Chicago, IL, pp. 199-206, June 1989.
[25] B. Liskov, "Distributed programming in Argus," Commun. ACM, vol. 31, pp. 300-312, March 1988.
[26] M. Herlihy and J. Wing, "Avalon: Language support for reliable distributed systems," in Proc. 17th Symp. on Fault-Tolerant Computing, Pittsburgh, PA, pp. 89-94, July 1987.
[27] C. Ellis, J. Feldman, and J. Heliotis, "Language constructs and support systems for distributed computing," in Proc. 1st ACM Symp. on Principles of Distributed Computing, Ottawa, Canada, pp. 1-9, Aug. 1982.
[28] A. Spector, D. Daniels, D. Duchamp, J. Eppinger, and R. Pausch, "Distributed transactions for reliable systems," in Proc. 10th ACM Symp. on Operating Systems Principles, Orcas Island, WA, pp. 127-146, Dec. 1985.
[29] H. Madduri, "Fault-tolerant distributed computing," Scientific Honeyweller, pp. 1-10, Winter 1986-87.
SECTION 2
ALGORITHM-BASED PARADIGMS FOR PARALLEL APPLICATIONS
SECTION 2.1
Design and Analysis of Algorithm-Based Fault Tolerant Multiprocessor Systems Shalini Yajnik and Niraj K. Jha*
Abstract
Algorithm-based fault tolerance (ABFT) is a cost-effective technique for improving the reliability of a multiprocessor system. It uses system-level codes to provide concurrent error detection and fault diagnosis capability to the system. This section gives an overview of the design and analysis techniques used in ABFT.
2.1.1 Algorithm-Based Fault Tolerance
Parallel processing architectures are often employed for applications requiring high speed of data operations for long processing times. These characteristics make such architectures vulnerable to failures during normal operation, leading to a decrease in reliability. Therefore, ensuring the correctness of the results of the computations performed by the parallel processing system is an important concern. This can be achieved by introducing an on-line method of checking for failures. The technique for checking should be such that it does not introduce a large overhead in hardware or time. Algorithm-based fault tolerance (ABFT) [1]-[35] is one such technique. It provides concurrent error detection and diagnosis capability to the parallel processing system. It encodes the input data at the system level, and modifies the algorithm to operate on this encoded input data to produce encoded output data. The redundancy in the encoded output data is used by the checks in the system to detect and locate the presence of faults. ABFT does not duplicate computations to provide fault tolerance, unlike many other fault tolerance schemes. Therefore, it is highly cost-effective for massively parallel systems. This section will give an overview of the design and analysis techniques used in ABFT.

There are two basic models which are used widely for representing and analyzing ABFT systems: the graph-theoretic model [13] and the matrix-based model [17]. Variations of these models have also been proposed by researchers in the course of evolution of this area [9, 14]. The design of ABFT systems can be done by both deterministic [6, 13, 20, 26, 27] and randomized [8, 10, 12] construction techniques. The deterministic techniques, however, are usually not as efficient as the randomized techniques. The latter are based on a simple procedure called RANDGEN which is fast and easily parallelizable. This procedure produces ABFT systems with a wide spectrum of properties, with asymptotically the least number of checks in many cases. Another approach, called synthesis for fault tolerance, introduces ABFT features at the architecture synthesis stage itself [5].

Once one or more faults have been detected in a multiprocessor system, a diagnosis algorithm uses the results of the checks to locate the fault(s). The complexity of the only previously known general diagnosis algorithm [4] increases drastically with the fault diagnosis capability of the system. However, a much more efficient new algorithm [12] has been proposed which exploits the fact that aliasing is an extremely rare phenomenon. Aliasing is said to occur when two or more erroneous data elements assume values that can fool the check that checks them. This can happen if the errors compensate each other such that the system-level code that is used is not able to catch them. If indeed aliasing has not occurred, this new algorithm guarantees correct diagnosis. If aliasing, rare though it may be, does occur, this algorithm either still performs correct diagnosis or indicates its inability to perform diagnosis. Hence, it cannot be fooled by aliasing. When it cannot perform diagnosis, we can switch to the general diagnosis algorithm [4]. This two-tier approach can drastically reduce the diagnosis time since the general diagnosis algorithm is much more complex than the new one.

Before going into the different concepts, let us take a look at an example ABFT system. The following definitions [1] are presented for the purpose of the example.

* Department of Electrical Engineering, Princeton University, Princeton, NJ 08544.
Acknowledgments: This work was supported by the Office of Naval Research under Contract no. N00014-91-J-1199.
Definition 1: A column checksum matrix A' of an n × m matrix A is an (n + 1) × m matrix which consists of matrix A in the first n rows, and whose i-th element (1 ≤ i ≤ m) of the (n + 1)-th row is the summation of the elements in the i-th column of matrix A.
Definition 2: A row checksum matrix A' of an n × m matrix A is an n × (m + 1) matrix which consists of matrix A in the first m columns, and whose i-th element (1 ≤ i ≤ n) of the (m + 1)-th column is the summation of the elements in the i-th row of matrix A.
Figure 2.1.1: A mesh array for matrix multiplication
Example 1: Figure 2.1.1 shows a 4 × 4 mesh-connected array of processors. We will perform a fault-tolerant matrix multiplication of two 3 × 3 matrices A = (aij) and B = (bij) to produce the product matrix C. Matrix multiplication has the property that it preserves checksums across the computation. Therefore, if the two input matrices are encoded using row or column checksums, the output matrix will also be in checksum-encoded form. Let us encode matrix A to A' using column checksums and matrix B to B' using row checksums, and let the product matrix be C'. Then matrix C' will be a column checksum as well as a row checksum matrix. The row and column checksum encoding in matrix C' is checked to ensure the correctness of the computation, thus providing fault tolerance to the system.
A' = | a11  a12  a13 |        B' = | b11  b12  b13  Σj b1j |
     | a21  a22  a23 |             | b21  b22  b23  Σj b2j |
     | a31  a32  a33 |             | b31  b32  b33  Σj b3j |
     | Σi ai1  Σi ai2  Σi ai3 |

C' = | d1   d2   d3   d10 |
     | d4   d5   d6   d11 |
     | d7   d8   d9   d12 |
     | d13  d14  d15  d16 |

where the checksum elements satisfy d10 = d1 + d2 + d3, d11 = d4 + d5 + d6, d12 = d7 + d8 + d9, d13 = d1 + d4 + d7, d14 = d2 + d5 + d8, d15 = d3 + d6 + d9, and d16 = d13 + d14 + d15 = d10 + d11 + d12. □
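Example 1 is easy to reproduce numerically. The following sketch (ours, not from the chapter) uses NumPy to build A', B', and C', injects a single error, and locates it from the failing row and column checksums.

    # Illustrative sketch of the checksum scheme in Example 1.
    import numpy as np

    def column_checksum(A):
        """Append a row of column sums (Definition 1)."""
        return np.vstack([A, A.sum(axis=0)])

    def row_checksum(B):
        """Append a column of row sums (Definition 2)."""
        return np.hstack([B, B.sum(axis=1, keepdims=True)])

    rng = np.random.default_rng(0)
    A = rng.integers(0, 10, (3, 3))
    B = rng.integers(0, 10, (3, 3))

    C_full = column_checksum(A) @ row_checksum(B)   # 4 x 4 full checksum matrix C'

    # Inject a single error to show how the encoding exposes and locates it.
    C_full[1, 2] += 5

    row_ok = np.isclose(C_full[:, :-1].sum(axis=1), C_full[:, -1])
    col_ok = np.isclose(C_full[:-1, :].sum(axis=0), C_full[-1, :])
    bad_row = int(np.flatnonzero(~row_ok)[0])       # row whose checksum fails
    bad_col = int(np.flatnonzero(~col_ok)[0])       # column whose checksum fails
    print("erroneous element located at", (bad_row, bad_col))   # -> (1, 2)

Because the checksum code here has error-detectability one per check, a single erroneous element is both detected and located by the intersection of the failing row and column checks.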
2.1.2 Models
Consider a multiprocessor system of m processors, p1, p2, ..., pm. Suppose an algorithm producing k data elements, d1, d2, ..., dk, is mapped to it. Each processor in the multiprocessor system computes one or more data elements. A processor can share the computation of a data element with other processors in the system. The set of data elements produced by a processor is called its data set. The traditional processor-level fault model proposed by Huang and Abraham in [1] is generally used. According to this model, a fault in a processor manifests itself as an error in one or more data elements produced by the processor. The set of faulty processors in the system is said to be a fault pattern. The set of erroneous data elements produced by a fault pattern is said to be an error pattern.

In an ABFT system, the input data elements are encoded using a property of the algorithm which remains invariant through the computation. Therefore, the output data elements are also in encoded form and have the original data part, e.g., d1, d2, ..., d9 in Example 1, and a redundant data part, e.g., d10, d11, ..., d16. The output data elements are checked by means of checks, c1, c2, ..., cq, which in Example 1 are checks on the checksum encodings. The set of data elements checked by a check is called its data set. Therefore, an ABFT system consists of processors, data elements and checks. The checks in the system can be bounded or unbounded. The maximum number of data elements a bounded check can check is fixed; there is no such restriction on unbounded checks. A (g, h) check is defined on g data elements
and the result of the checking operation is as follows [14].
• The check outputs a 0 if the check-computing processor is non-faulty and all the data elements in its data set are error-free.
• The check outputs a 1 if the check-computing processor is non-faulty and at least one data element in its data set is in error, but the number of data elements which are erroneous in its data set does not exceed h.
• The check is unpredictable if (1) the check-computing processor is faulty, or (2) the number of erroneous elements in its data set is greater than h.
Here, h is called the error-detectability of the check. The checksum encodings used in Example 1 have h = 1.

When the algorithm is run on the system and the checking operations are performed, each check in the system evaluates to a '1' or a '0'. The state of the checks can be represented by a q-bit binary vector, where q is the number of checks in the system. This vector is called the syndrome [7]. Bit i in the vector is set to '1' if check i is at state '1' in the system, else it is set to '0'. In a fault-free system all the bits in the syndrome should be '0'. A check which has more than h erroneous data elements in its data set is unpredictable. Therefore, a single error pattern may give rise to more than one syndrome.
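The check and syndrome terminology can be made concrete with a small sketch. Here each check is simply a data set (a list of element indices) with an equality test; the helper names are ours, and the checks shown are c1-c7 of Example 1 applied to a placeholder output vector.

    # Illustrative sketch of syndrome generation from (g, 1)-style checksum checks.
    def equality_check(indices):
        """The last listed element must equal the sum of the others."""
        def run(data):
            vals = [data[i] for i in indices]
            return 0 if abs(sum(vals[:-1]) - vals[-1]) < 1e-9 else 1
        return run

    def syndrome(checks, data):
        """Bit i of the syndrome is '1' iff check i signals an error."""
        return [chk(data) for chk in checks]

    d = {i: 0.0 for i in range(1, 17)}          # placeholder values; fill from a real run
    checks = [equality_check([1, 2, 3, 10]),    # c1: d10 = d1 + d2 + d3
              equality_check([4, 5, 6, 11]),    # c2
              equality_check([7, 8, 9, 12]),    # c3
              equality_check([13, 14, 15, 16]), # c4
              equality_check([1, 4, 7, 13]),    # c5
              equality_check([2, 5, 8, 14]),    # c6
              equality_check([3, 6, 9, 15])]    # c7
    print(syndrome(checks, d))                  # all zeros for a fault-free run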
Definition 3: A system is said to be s-fault (s-error) detecting if for every fault (error) pattern of size s or less, there is at least one check which evaluates to a '1'.

Definition 4: A system is said to be t-fault (t-error) diagnosing if for every fault (error) pattern of size t or less, the syndrome produced by the system is distinct from the syndromes produced by other fault (error) patterns of size smaller than or equal to t.
Systems which have a combination of fault detection and diagnosis properties have also been considered, e.g., a system which is both s-fault detecting and t-fault diagnosing at the same time.
Definition 5: A system is said to be s-fault detecting/t-fault diagnosing (s > t) if for every set of s or fewer faults the system can detect the faults but may not be able to diagnose the faulty processors, whereas in the presence of t or fewer faulty processors, the system has the capability to diagnose the set of faulty processors.
In the case of fault-diagnosing systems, if some processors share in the computation of a data element and only the shared data element is in error, it becomes impossible to attribute the fault to a particular processor. The set of processors which have non-disjoint data sets are said to form a processor class. The faults in the system can be identified to a processor class. Henceforth in this chapter, fault diagnosability refers to processor-class fault diagnosability.
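Processor classes can be computed directly from the data sets. The sketch below (illustrative names, simple union-find) merges any two processors whose data sets intersect.

    # Sketch: grouping processors into processor classes.
    def processor_classes(data_sets):
        """data_sets: dict processor -> set of data elements it helps compute."""
        parent = {p: p for p in data_sets}

        def find(p):
            while parent[p] != p:
                parent[p] = parent[parent[p]]        # path halving
                p = parent[p]
            return p

        owner = {}                                   # data element -> representative processor
        for p, ds in data_sets.items():
            for d in ds:
                if d in owner:
                    parent[find(p)] = find(owner[d]) # shared element: merge classes
                else:
                    owner[d] = p

        classes = {}
        for p in data_sets:
            classes.setdefault(find(p), set()).add(p)
        return list(classes.values())

    # p1 and p2 share d2 and therefore form one class; p3 is alone.
    print(processor_classes({'p1': {'d1', 'd2'}, 'p2': {'d2', 'd3'}, 'p3': {'d4'}}))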
2.1.2.1 The Graph-Theoretic Model
Banerjee and Abraham [13] proposed a graph-theoretic model for representing and analyzing ABFT systems. We call this model the original model. Later the model was refined by Yajnik and Jha [9] to include the faults in the processors performing the checking operations. This model is called the extended model. This section will give a brief overview of the two models. The Original Model: An ABFT system can be represented by a tripartite graph called the processor-data-check (PDC) graph consisting of sets of processor nodes (P), data nodes (D) and check nodes (C). Processor nodes are linked by edges to data nodes and the data nodes to check nodes. An edge between processor node pi and data node dj implies that pi is involved in the computation of dj. Similarly, an edge between data node dj and check node Ck implies that dj is checked by c^. Therefore, the processor-data part, called the PD graph, of the PDC graph represents the mapping of data elements of the algorithm to the processors. Given such a mapping, the design of an ABFT system under this model, involves constructing the data-check part, called the DC graph, of the PDC graph such that the system has the desired fault tolerance capabilities. The Extended Model: The original model assumes that checks are evaluated on processors which are either outside the system or are fault-free or self-checking. In actual systems this is not the case. The checks are usually
evaluated on processors in the multiprocessor system and are as prone to failures as any other processor in the system. Under the extended model used for representing ABFT systems, the processors computing the checks are considered to be a part of the system and any failures in the check-computing processors are also taken into consideration. The model uses two graphs to represent an ABFT system: an extended tripartite PDC graph and a bipartite check evaluation (CE) graph. The two graphs consist of six types of nodes. Info_processor nodes compute the original unencoded data elements, which are represented by info_data nodes. The redundant part of the output data is represented by check_data nodes. These are computed by code_processor nodes. The checking operations are represented by check nodes, which are evaluated by check-computing processor nodes. The check-computing processor nodes can be of the info_processor node type or the code_processor node type.

The extended PDC graph represents the processor-data-check relationship as in the original model, except that the info_processor nodes are connected only to the info_data nodes and the code_processor nodes are involved in the computation of only the check_data nodes. Although there is sharing of data elements among the info_processor nodes, the code_processor nodes are not allowed to share the computation of a data element with other processors in the system. Each check has at least one check_data element in its data set. A check computes some function of the data elements in its data set and compares the result with a particular check_data element in its data set. This check_data element is called the check_compare element of the set. The CE graph represents the mapping of check evaluations to the system processors. There is an edge between check node ci and processor node pj in the graph if processor pj evaluates check ci. We give an example based on the extended model next.
Example 2: Consider the fault-tolerant multiplication of two 3 × 3 matrices A and B given in Example 1. The matrix multiplication result is a 4 × 4 matrix C' whose info_data elements are d1, d2, ..., d9. The last column and the last row of C' are the redundant output data elements. These seven elements, d10, d11, ..., d16, are the check_data elements. Info_processors p1, p2, ..., p9 produce the info_data elements, whereas code_processors p10, p11, ..., p16 perform the redundant computations and produce the check_data elements. Checks c1, c2, ..., c7, given next, make the system 3-fault detecting [1].
Figure 2.1.2: (a) The original PD graph, (b) the 3-fault detecting PDC graph

Checks:
c1: d10 = d1 + d2 + d3?    c2: d11 = d4 + d5 + d6?
c3: d12 = d7 + d8 + d9?    c4: d16 = d13 + d14 + d15?
c5: d13 = d1 + d4 + d7?    c6: d14 = d2 + d5 + d8?
c7: d15 = d3 + d6 + d9?
In this example, d10 is the check_compare element of check c1, d11 is the check_compare element of check c2, and so on. The original non-fault-tolerant PD graph and the corresponding extended PDC graph are shown in Figure 2.1.2. For the extended model, the mapping of checks to processors needs to be performed before the system can truly be said to be 3-fault detecting. A general method for finding such a mapping is given in Section 2.1.3.3. □
2.1.2.2 The Matrix-Based Model
The tripartite PDC graph can be represented by two matrices, PD and DC [17]. The PD matrix has processors as rows and data elements as columns and represents the PD graph; there is a '1' entry in position (i, j) of the matrix if there is an edge from processor pi to data element dj in the PD graph. The DC matrix has data elements as rows and checks as columns and represents the DC graph; there is a '1' entry in position (i, j) of the DC matrix if there is an edge from data element di to check cj in the DC graph. A third matrix called the PC matrix, and the corresponding PC graph, are also used for representing an ABFT system. The PC matrix is obtained by taking the product of the PD and DC matrices. An entry in position (i, j) of the matrix indicates the number of paths from processor pi to check cj in the PDC graph. In the extended model the information in the CE graph is represented in the matrices in the form of true and invalid entries. An entry (i, j) in the PC matrix is true if check cj corresponding to the j-th column is not evaluated on processor pi corresponding to the i-th row; otherwise the entry is defined as invalid. The PD matrix for the PDC graph given in Figure 2.1.2 is an identity matrix of order 16. Therefore, the PC matrix is the same as the DC matrix, and is given below.
PC = PD × DC =

         c1  c2  c3  c4  c5  c6  c7
   p1     1   0   0   0   1   0   0
   p2     1   0   0   0   0   1   0
   p3     1   0   0   0   0   0   1
   p4     0   1   0   0   1   0   0
   p5     0   1   0   0   0   1   0
   p6     0   1   0   0   0   0   1
   p7     0   0   1   0   1   0   0
   p8     0   0   1   0   0   1   0
   p9     0   0   1   0   0   0   1
   p10    1   0   0   0   0   0   0
   p11    0   1   0   0   0   0   0
   p12    0   0   1   0   0   0   0
   p13    0   0   0   1   1   0   0
   p14    0   0   0   1   0   1   0
   p15    0   0   0   1   0   0   1
   p16    0   0   0   1   0   0   0
The matrices defined next are taken from [4, 17], and are useful in the analysis of ABFT systems.

Definition 6: rPD is defined as the matrix whose rows are formed by adding r different rows of matrix PD, for all possible combinations of r rows, and setting all non-zero entries in the resulting matrix to 1.

Definition 7: rPC is the matrix obtained by the product of the matrices rPD and DC.
A row in matrix rPD represents a unique set of r processors, and the set of '1' entries in the row represents the union of the data sets of these processors. The (i, j) entry in matrix rPC represents the number of data elements of the corresponding r processors that are checked by the j-th column check. For example, row (p1, p2) in the 2PC matrix derived from the PC matrix given earlier will be [2 0 0 0 1 1 0].
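The rPD and rPC constructions are mechanical. The following NumPy sketch (ours) rebuilds the 16 × 7 PC matrix above and reproduces the row for the processor pair (p1, p2).

    # Illustrative sketch of Definitions 6 and 7.
    import numpy as np
    from itertools import combinations

    def r_PD(PD, r):
        """Rows = OR of every combination of r rows of PD, with non-zero entries set to 1."""
        rows, tags = [], []
        for combo in combinations(range(PD.shape[0]), r):
            rows.append((PD[list(combo)].sum(axis=0) > 0).astype(int))
            tags.append(combo)          # remember which processors the row stands for
        return np.array(rows), tags

    def r_PC(PD, DC, r):
        rPD, tags = r_PD(PD, r)
        return rPD @ DC, tags           # entry (i, j): #data elements of processor set i
                                        # that check j covers

    PD = np.eye(16, dtype=int)          # each processor computes exactly one element here
    DC = np.zeros((16, 7), dtype=int)   # data-check edges of checks c1..c7 (0-based indices)
    for j, members in enumerate([[0, 1, 2, 9], [3, 4, 5, 10], [6, 7, 8, 11],
                                 [12, 13, 14, 15], [0, 3, 6, 12], [1, 4, 7, 13],
                                 [2, 5, 8, 14]]):
        DC[members, j] = 1

    PC2, tags = r_PC(PD, DC, 2)
    print(tags[0], PC2[0])              # (0, 1), i.e. {p1, p2} -> [2 0 0 0 1 1 0]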
2.1.3 Design of ABFT Systems
Each processor in the system may compute one or more data elements. When a processor is faulty, any one or more of its data elements can be in error. Therefore, a fault pattern can give rise to several error patterns. Nair and Abraham [20] proposed a hierarchical method of construction to take care of different error patterns. They defined a unit system in which each processor computes only one data element. Given a unit system of desired fault tolerance capability, they proposed the construction of a product system by connecting every data element affected by a processor in the actual system to every check with which the processor is connected in the unit system. The product system will have the desired fault tolerance capabilities provided there is no sharing of data elements among the processors. An alternative procedure proposed by Vinnakota and Jha [6] takes care of sharing of data elements. Their procedure of forming the final system, called the composite system, is discussed in Section 2.1.3.2.
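The product-system rule of [20] can be stated in a few lines of code. The sketch below is illustrative only; the dictionary layout is an assumption of ours, not a data structure from [20].

    # Sketch: every data element a processor affects in the real system inherits
    # every check attached to that processor in the unit system.
    def product_system(proc_data, unit_checks):
        """proc_data:   processor -> set of data elements it computes (real system)
           unit_checks: processor -> set of checks attached to it in the unit system
           returns      data element -> set of checks that must cover it"""
        data_checks = {}
        for p, elements in proc_data.items():
            for d in elements:
                data_checks.setdefault(d, set()).update(unit_checks.get(p, set()))
        return data_checks

    # p1 computes d1 and d2; in the unit system p1 was covered by checks c1 and c5.
    print(product_system({'p1': {'d1', 'd2'}, 'p2': {'d3'}},
                         {'p1': {'c1', 'c5'}, 'p2': {'c2'}}))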
2.1.3.1 Unit System Construction
In a unit system each processor computes only one data element. The system given in Example 1 is a unit system. The presence of a fault in a processor of the
unit system implies the presence of an error in the corresponding data element. Therefore, the cardinality of a fault pattern is the same as the cardinality of the resulting error pattern. In such a case, designing s-fault detecting/diagnosing systems is the same as designing s-error detecting/diagnosing systems. Therefore, in this section, we will concentrate on error detecting/diagnosing systems. Randomized [8], as well as deterministic [13, 26, 27], techniques have been used in the construction procedures for such systems, though no general deterministic techniques exist for designing t-error diagnosing systems, except for the simple case of t = 1 [11]. In this section, we discuss both types of techniques.
Randomized Construction

The randomized construction approach is based on a simple procedure called RANDGEN which was proposed by Sitaraman and Jha [8]. RANDGEN is very fast and easily parallelizable, and uses unbounded checks. It can produce efficient DC graphs with a wide spectrum of properties by just changing its input parameters. For example, it can produce s-error detecting DC graphs with an asymptotically optimal number of checks and t-error diagnosing DC graphs with an asymptotically nearly optimal number of checks. Let D, |D| = n, be the set of data elements and C, |C| = q, be the set of checks. The DC graph is constructed by adding edges between the set of data nodes and the set of check nodes. RANDGEN makes random decisions during the construction of edges, using probability p, which is an input parameter to the procedure, as follows.

Algorithm RANDGEN(q, p):
For every pair (u, v), where u ∈ D and v ∈ C, do the following:
• Add edge (u, v) to the DC graph with probability p. □
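A direct Python rendering of RANDGEN might look as follows; the function name and graph representation are ours.

    # Illustrative sketch of RANDGEN: each possible data-check edge is added
    # independently with probability p.
    import random

    def randgen(data_nodes, q, p, rng=None):
        """Return a DC graph as {check index: set of data nodes}."""
        rng = rng or random.Random(0)
        dc = {c: set() for c in range(q)}
        for c in range(q):
            for u in data_nodes:
                if rng.random() < p:
                    dc[c].add(u)
        return dc

    dc_graph = randgen(range(1000), q=40, p=0.25)
    print(sum(len(s) for s in dc_graph.values()), "data-check edges added")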
Algorithm RANDGEN considers an ABFT system under the original model. An extension of the algorithm, as given below, was proposed in [10] which gives design procedures for ABFT systems under the extended model. Let D1 be the set of m info_data elements d1, d2, ..., dm and D2 be the set of q check_data elements d'1, d'2, ..., d'q. Let C be the set of q checks, c1, c2, ..., cq.
Algorithm Mod_RANDGEN(q, p):
For the construction of the DC graph:
• For every pair (d'i, ci), 1 ≤ i ≤ q, add an edge to the DC graph.
• For every pair (u, ci), 1 ≤ i ≤ q, where u ∈ D1 ∪ (D2 − {d'i}), add edge (u, ci) to the DC graph with probability p.
For the construction of the CE graph:
• For every check ci, choose uniformly but randomly a processor pk from the set of all processors. Add edge (ci, pk) to the CE graph. □

The two algorithms consider each pair of nodes independently. Therefore, it is easy to parallelize their implementation. The following theorems show how RANDGEN and Mod_RANDGEN are used in designing error detecting and diagnosing systems. The proofs of the theorems are given in [8] for RANDGEN and in [10] for Mod_RANDGEN. All logarithms, unless specified otherwise, are to the base 2.

Error Detecting Systems

We first discuss methods for the design of error detecting systems.

Theorem 1: Algorithm RANDGEN, using q = 3.8 s log n checks and the edge probability p specified in [8], produces an s-error detecting DC graph, under the original model, with probability at least 1 − 1/n.

Example 3: Consider a multiprocessor system on which a data set consisting of a matrix of dimension 512 × 512 (= 262144 data elements) is mapped. Suppose we want to make the system 3-error detecting. We need 3.8 × 3 × 18 = 205.2 checks in the system. Since the number of checks must be an integer, we add 206 checks to the system. For each check we do the following: we consider each data element and add it to the check's data set with the probability p prescribed by RANDGEN. When we are done with all the checks in the system, we get a DC graph which is 3-error detecting, under the original model, with probability at least 0.999996. As a comparison, the traditional deterministic method requires 1023 checks; however, each check in that case checks far fewer data elements. □
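The arithmetic of Example 3 can be checked in a couple of lines (illustrative):

    # Reproducing the check count of Example 3.
    import math

    n = 512 * 512                      # 262144 data elements
    s = 3                              # desired error detectability
    q = 3.8 * s * math.log2(n)         # 3.8 * 3 * 18 = 205.2
    print(math.ceil(q))                # -> 206 checks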
Theorem 2: Algorithm Mod_RANDGEN(q, p), with the number of checks q (where n = m + q) and the edge probability p chosen as specified in [10], produces an s-error detecting system under the extended model with high probability; the exact parameter settings and probability bound are given in [10].
at least 1 — ;^33;.
Error Diagnosing Systems We discuss design of error diagnosing systems next. T h e o r e m 3 For every S C D, | 5 | = 2t — 1, and for every di E D, di ^ S, let there exist a check which is connected to di, but not to any data element of set S. Then the DC graph of the system is t-error diagnosing under the original model. For the extended model, the check has to satisfy an additional constraint that it is not evaluated on a processor computing any element of set S\^di. Using T h e o r e m 3 we can prove the next two theorems. T h e o r e m 4 Algorithm RANDGEN, p — ^, produces a t-error diagnosing probability at least 1 — ^.
using q = {7M^ + 3 . 8 t ) l o g n checks and DC graph, under the original m,odel, with
E x a m p l e 4 Consider the 512 x 512 m a t r i x , with 262144 d a t a elements, once again. Suppose we want to make the system 3-error diagnosing, under the original model. According to T h e o r e m 4, we need (7.8 x 9 + 3.8 x 3) x 18 = 1468.8 checks in t h e system. Therefore, for each of the 1469 checks we do the following. We consider each d a t a element in the system a n d a d d it to the check's d a t a set with probability | . T h e resulting system is 3-error diagnosing with probability at least 1 - 2 6 ^ = 0.999996. D
T h e o r e m 5 Algorithm
ModJlAN
DOE N{q,p),
using q ^
checks, where n = m-\-q, and p = (^^Vtt2\ P''^oduces a t-error under the extended
model, with probability
'^^^^^+1^^)^'^^'^ diagnosing
system,
at least 1 — - .
In each of the above cases, if one is not satisfied with the probability with which one can obtain an s-error detecting or t-error diagnosing system, one can make this probability even closer to 1 by adding a few more checks.
A number of researchers have considered systems which have a combination of error detection and diagnosis properties, e.g., a system which is s-error detecting/t-error diagnosing. One simple way of doing this would be to design two systems, one s-error detecting and the second t-error diagnosing, and then superimpose one system on the other to get the final system. The following theorem states that we only need to worry about the case where s > 2t [8].

Theorem 6: Any t-error diagnosing system is also 2t-error detecting.
Deterministic Construction

Several deterministic techniques exist for the construction of error detecting systems. Some of these techniques use bounded checks whereas others use unbounded checks. There exist no general deterministic techniques for the construction of error diagnosing systems, except for the case of 1-error diagnosing systems. The construction techniques given in this section use bounded checks and are implemented under the extended model. The design techniques under the original model are based on similar concepts [27]. The checks used in the unit system construction methods are (g, 1) checks. Banerjee and Abraham [13] have shown that even if h of the (g, 1) checks in an error detecting system are combined into a single (gh, h) check, the system has the same error detection capabilities. Also, most traditional coding schemes use (g, 1) checks even for error diagnosing systems. Therefore, all system designs can be confined to systems which have (g, 1) checks.

The info_data elements in the original non-fault-tolerant PD graph are distributed among the template unit system and its copies. The method of distributing the info_data among the unit systems is given later in Section 2.1.3.2. In this section, we assume that the distribution has already been done. Consider a template unit PD graph consisting of m processors and m info_data elements. Each info_processor in the unit system is connected to exactly one info_data element. In the extended model the system has to satisfy the constraint that each check should have at least one check_data element in its data set. In such unit systems, the number of info_data elements is equal to the number of info_processor nodes m and the number of check_data elements is equal to the number of checks q. Only one check_data element can be mapped to one code_processor. So the number of code_processors introduced in a unit
system is equal to the number of checks. The deterministic methods for the design of error detecting and 1-error diagnosing systems under the extended model are discussed next [9, 11]. Note that the systems constructed by these methods are all unit systems; therefore, error detection and diagnosis is the same as fault detection and diagnosis for these systems.

Error Detecting Systems

We discuss methods for the construction of s-error detecting systems next.

Theorem 7: The number of checks sufficient for constructing a 1-error detecting system under the extended model is ⌈m/(g − 1)⌉.
M e t h o d : Construct d a t a sets of each check by taking g — I elements from the info_data set and an unused element from the check_data set. Each d a t a element is present in exactly one check's d a t a set a n d therefore, any single error will be caught by the checks. D
T h e o r e m 8 For g > 2, m > [g — ly, constructing
a 2-error
detecting
system
the nuvnher of checks sufficient
under the extended
m,odel is
^^
for .
Method: 1. Let A = set of info.data elements in the unit system, where |A| = TTI >
(.9-1?. 2. Let B = set of check_data elements in the unit system, where \B\ — q —
3. Arrange set A of info_data elements in a grid of (^ — 1) columns. 4. Construct r o w c h e c k s by taking (^ — 1) info_data elements of a row in A a n d a single unused element from check_data set B, If the last row has less t h a n (^ — 1) elements in it, take elements starting from the t o p of the first column to complete the row check. 5. Construct c o l u m n c h e c k s by taking an unused element from B a n d {g — 1) elements of a column from A starting from where the last r o w / c o l u m n
96 check ended, going to the t o p of the next column if there are no elements left in t h e present column. •
E x a m p l e 5 Let m = 17, 5 r= 2, ^ = 5. T h e n \B\ = Ijzj]
= 9 a n d \A\ = 17.
Let A = [ai, 0 2 , . . . , 017] and B — [61, 62,..., 69]. A^ arranged in a grid with 4 columns, is shown below. cii
a2
CI3
04
dh
^6
0.7
^8
ag ai3 ai7
aio ai4
ciii ai5
ai2 aie
Checks: Row checks r i = [01,^2,03,04,61] ^2 — 7*3 = 7*4 = ^5 =
[05,06,07,08,62] [09, Oio, Oil, O12, 63] [0l3, Oi4, a i 5 , O16, 64] [017,01,05,09, 65]
C o l u m n checks ci = [013,017,02,06,65] C2 = [oiO, Oi4, O3, O7, 67] C3 = [011,015,04,08,63] C4 = [oi2,Oi6, 69]
Total n u m b e r of checks == 5 -|- 4 = 9. D
T h e o r e m 9 For g > 2, m> construction m+
[g — lY,
the number
of a Z-error detecting system
of checks sufficient
for the
under the extended model is -3^-
+
bzil
^-1
Method: 1. Let B be the set of
^-1
check_data elements.
2. Let A b e the set of all info_data elements in the unit system a n d the rest of
-3j-
check-data elements.
97 3. Construct a grid of d a t a elements of A with (gf — 1) columns. 4. Number the elements columnwise, i.e. go down the first column, then the second column and so on. Place the check_data elements present in A at positions ^, 2^, 3^,... in the grid and if the n u m b e r of elements in A is not a multiple of g then place a check_data element at the b o t t o m of the last column. 5. Construct r o w c h e c k s with a row of A a n d an unused element of B, 6. Construct c o l u m n c h e c k s with g elements by going down a column a n d going to the next column if the present column is used u p . •
E x a m p l e 6 Let m = 17, 5 = 3, ^ = 5. Hence, the n u m b e r of checks = 11.
-3^- -f
Let info_data elements — [^i, 02,..., 017] a n d check_data
elements = [61, 62, -.., &11]. T h e n \A\ - m-f- f ^ j
- 22 a n d \B\ -
.
^_^
6. Let A — [ai, 02,..., aiy, 67, h%^.., 611] and B — [61, 62, •••, ^e]- A^ a r r a n g e d in a grid with 4 columns, is shown below.
ax
ae
^2
a?
ail ai2
03
08
^>9
^10
a^
«>8
^13
ai7
h
ag
ai4
fell
as
^10
^15 ^16
Checks
Row Checks 7*1 = [ai, cte, a i l , ^15, fei] 7*2 = [^2, a?, ^12, ^16, ^2] 7*3 = [^3, ^8, ^9, feio, ^3] r4 — [a4,fegj^13, a i 7 , 64] rs == [67, ag, ai4, ^ n , 65]
Column Checks ci — [ai, 02, a s , a4, 67] C2 == [as, tte, a7, ag, 63] C3 = [ag, aio, a n , ai2, 6g] C4 = [ais, ai4, a i s , a i e , 610] C5 = [ai7, ^ n ]
7*6 = [asjCtio,
D
fee]
For the case of 771 < (^ — 1)^, the m e t h o d s for 5 = 2 and 5 = 3 are similar
98 and can be found in [9]. For 5 > 4 a hierarchical construction method is used which is also given in detail in [9]. Error D i a g n o s i n g S y s t e m s A procedure for designing 1-fault diagnosing systems has to consider the fact that a single fault can give rise to several error patterns and hence to several syndromes. For diagnosing single faults present in the system, it is necessary that each syndrome produced by a fault be distinct from the syndromes produced by other single faults in the system. If there is sharing of data elements among processors, the system cannot always locate faults in individual processors [17]. If only a shared data element is found to be in error, all the processors involved in the computation of that data element are declared to be faulty. However, if at least one of the erroneous data elements in the data set of a processor class is not a shared one, then the faulty processor can be pinpointed exactly. The system design uses (^,1) checks. Such a check may be invalidated if there is more than a single erroneous data element in its data set. In order to prevent invalidation of checks in the presence of a single fault in the system, no check's data set should have more than a single data element from any processor's data set. This means that there cannot be any entries greater than *1' in the PC matrix. Let a be the maximum cardinality of a processor class. No two processors in a processor class should be checked by the same check. If this is allowed, then some copy of the template unit system may have two data elements of the same processor being checked by the same check, which might result in check invalidation. Consider a square grid G of size n x n or a rectangular grid of size (n -f 1) x n. Number the rows in the grid as 0, 1,..., n — 1, n, and similarly the columns from 0 to n— 1. The j^^ diagonal is defined as containing the positions (i, (i -f j) mod n), 0 < i < n — 1 , of the grid. T h e o r e m 10 For a < of checks sufficient
T ^ Z W \ {g - I), g > 2, m > {g - 1)^, the
for constructing
tended model is Ifzi]
a 1-error
diagnosing
system
number
under the ex-
+ [ r ^ W ] id " ! ) •
Method: 1. Let A == set of info_data elements in the unit system, where |A| = m >
99
2. Let B = set of check-data elements in the unit system, where \B\ =: q = jzi
+
(^^1)3 {g - !)• Let m = x{g - 1)^ + y, where x and y are
integers, such that x =
/ ^^va
and y < {g — 1)^.
3. Arrange set A of info_data elements in grids Ai, ^2)---) ^x^ of size [g — 1) X (^ — 1) each. Arrange the rest of the y elements in a grid Ax^i of {g~l) columns. Place the info.data elements corresponding to a processor class either along the diagonals in one grid or in different grids. Once a processor class occupies some positions in a diagonal, the unused positions can be used by another processor class. 4. Construct row checks by taking all info_data elements of a row in Ai^ 1 < i < X -h 1, and a single unused element from check-data set B. 5. Construct column checks by taking an unused element from B and all elements of a column from Ai, 1 < i < x -\- 1. D In the above construction procedure, the data elements corresponding to a processor class are placed along the diagonals in a grid or in separate grids because no two diagonal entries, or entries from separate grids can be in the same check's data set.
Example 7 Let m == 14, g = A. Then \B\ = - ^
+ ^ - ^
(gf - 1) = 11 and
\A\ = 14. Let A = [ai,a2) •••)Cti4] and B = [^i, ^2) •••> ^ii]- Suppose processors Pi, p4 and ps computing ai, a^ and ag, respectively, are in one processor class Pi, and processors p2j Pe and p7 computing 02, ae and ar, respectively, are in another processor class P2. Arrange set A in two grids, Ai of size 3 x 3 and A2 of size 2 x 3 . The data elements corresponding to Pi are placed along diagonal 0, and the data elements corresponding to P2 are placed along diagonal 1 in grid Ai, as shown:
O'l ^
as as ar
0.2 ^
a^ a^ ag
^3
^ ae ae ag
A
,
^10
^11
ai3
ai4
A2 :
^12
100 T h e row a n d column checks can then be derived as follows: R o w checks 7*1 = [ai, CL2, a s , bi] ^2 — [^5, ^4, ae, 62] 7*3 = [a?, 09, ag, 63] r4 = [aio, a n , ai2, 64] 7*5 = [cti3, cti4, ^5]
C o l u m n checks ci = [ai, a s , ay, he] C2 = [^2, 04, ag, 67] C3 = [as, ae, ag, 6g] C4 = [aio, aia, 69] C5 = [ a n , ai4, 610] ce - [ai2,^ii] °
T h e m e t h o d of construction for the case when m < (^ — 1)^ is similar to the m e t h o d given in the previous theorem. 2.1.3.2
Composite System
Construction
It is assumed here, as in [6], t h a t the processors in the original non-faulttolerant system can share d a t a elements, b u t each processor produces at least one d a t a element which no other processor in the system affects. These d a t a elements are defined as the distinguishing data elements of the processor. For designing the complete A B F T system, the unit t e m p l a t e system has to be constructed first. From the unit t e m p l a t e system, copies are created and then these systems are combined to form the composite A B F T system using some rules. T h e m e t h o d of construction is an extension of the design process given in [6]. A d a t a element is defined as filled if it has been used in or added to the tripartite graph of the unit system, else it is called unfilled. Given the original non-fault-tolerant P D graph:
1. Construct the PD graph of the t e m p l a t e unit system, where each processor is connected to exactly one of its distinguishing d a t a elements. If more t h a n one distinguishing d a t a element exist, then one can be chosen at r a n d o m . 2. Construct copies of the t e m p l a t e unit system such t h a t all the unfilled d a t a elements in the system are filled. If there are no unfilled d a t a elements left for a processor, then reuse an already-filled element. For details of this p a r t the readers are referred to [6]. 3. Using the m e t h o d s given in Section 2.1.3.1, design the DC g r a p h of the t e m p l a t e unit system such t h a t it has the desired fault tolerance
101 capabilities. For the extended model, first add check_data nodes and code_processor nodes to the template unit PD graph and then design its DC graph. 4. In the copies of the template, add the check nodes with their edges to the data nodes connected in the same way as in the unit template system. However, the check nodes in the copies are numbered differently. For the extended model, add the code_processor nodes, check_data nodes and the check nodes as in the template unit system but rename the check-data nodes and the check nodes. •
Given the template system and its copies, the composite system is formed by superimposing the unit systems on one another, as follows.
1. The data sets of the processor nodes are the same as in the original non-fault-tolerant PD graph. In the extended model, the data set of a code_processor in the composite system is the union of its data sets in the unit systems. 2. While merging the DC graphs of the different unit systems, the set of checks to which a data element is connected is formed by taking the union of the set of checks to which it is connected in each of the unit systems. After this, the checks whose data sets are found to be identical are merged into a single check. In the extended model, checks which have the same data sets (except for the check_compare elements which may be different) are merged into a single check. The corresponding check_compare element is removed from the system if it is not connected to any other check. • The example given next is under the extended model. E x a m p l e 8 Consider the implementation of some algorithm on a parallel system with the PD graph as shown in Figure 2.1.3(a). Suppose we wish to design a 2-fault detecting version of this system. A template system for the PD graph, which is 2-fault detecting (with g — Z)/\s shown in Figure 2.1.3(b). The info_data set of the template system is {^1,^3,^4,^5}. Each member of this set is a distinguishing element. By Theorem 8 we require four checks to detect two faults in the template system. Therefore, we need to add four
102
Figure 2.1.3: (a) The original PD graph, (b) a 2-fault detecting template unit system, (c) a copy of the template system, (d) the composite system. check-data nodes {de, cZr, dsj dg^^ and consequently four code-processor nodes P5) P6) P7 s-nd ps to the graph. A single copy needs to be formed to fill the remaining info-data elements, as shown in Figure 2.1.3(c). The info.data set of the copy is {^2, ds^ ^4, ^5} and the check-data set is {dio, tin, ^^12, ^£13}. The code_processors remain the same. The template system and the copy are now composed to form the final system which is shown in Figure 2.1.3(d). In the final system, check ce is equivalent to check C2 and check cg is equivalent to check C4. So checks CQ and cg are removed. • The following theorem shows that the fault tolerance properties of the unit systems are maintained in the composite system [6]. T h e o r e m 11 If the template system is s-fault detecting/t-fault diagnosing, the composition of the template system and the copies is also s-fault detecting/tfault diagnosing.
103 The unit/composite system construction techniques only ensure that the fault tolerance properties are maintained if the checks are computed on faultfree processors. For the extended model, a third stage of design called check mapping is required. The randomized construction procedure Mod^RANDGEN takes care of the check mapping part during the unit system construction stage, whereas in deterministic construction this has to be done separately. The next section deals with the mapping of checks onto system processors. Since a general deterministic construction procedure is not available for fault diagnosing systems, we will concentrate on check mapping for fault detecting and 1-fault diagnosing systems only. 2.1.3.3
Check Mapping
Given a composite system which is designed to be 5-fault detecting or 1-fault diagnosing assuming that the checks are computed correctly, check evaluations have to be mapped to the processors in such a way as to maintain the fault tolerance capability of the system even when the checks may themselves fail. This section gives methods for performing the check mapping operation for 5-fault detecting systems and 1-fault diagnosing systems. Fault Detecting S y s t e m s If the check_computing processors in an ABFT system are allowed to be faulty, then a fault pattern is detectable if and only if, for every error pattern that the fault generates, there is at least one non-faulty processor evaluating a check which has a single erroneous data element in its data set. All the processors in the system have the capability to perform the checking operations. The following are some definitions and theorems from [17].
Definition 8 A row of^ PC is said to he completely detectable if and only if the fault pattern represented by the row is detectable for all possible error patterns produced by that fault.
T h e o r e m 12 A system is s-fault detecting if and only if the matrices ^PC, for i — 1, 2,...., s, are com,pletely detectable, i.e. all the rows of each of the m,atrices are completely detectable.
104 For a row of'"PC to be detectable, there should be at least one entry in the row which is less than or equal to the error-detectability (h) of the check used, and the corresponding check should be evaluated on a processor which is not in the set of faulty processors defined by the row. This should be true for all the rows of each of the matrices *PC, for i = 1,2,...., s. Since we want to take care of all the possible error patterns that a fault pattern can generate, we consider the PC matrices of each of the unit systems, instead of operating on the PC matrix of the final composite system. The processor set of a check is defined as the set of processors, any one of which can be used to evaluate the check. We need to find the processor sets of each of the checks such that the mapping of checks to processors maintains the fault detectability of the system. Before going into the details of the algorithm used for check mapping, consider the following example. *
E x a m p l e 9 Consider the PD graph of the system given in Figure 2.1.3(a). Suppose we want to make the system 1-fault detecting. The unit systems consist of 4 info_data elements. In order to detect a single fault in the system we add two checks with ^ = 3 (from Theorem 7). The unit systems and the final system are shown in Figure 2.1.4. Construct the PC matrices of the two unit systems, PCi and PC2 and concatenate them as shown in Figure 2.1.4(d) to form matrix PF. In these PC matrices, all the checks in the composite system, not just those present in the unit system, are included, because of the need to concatenate these matrices. Number the rows in PF from 0 to 11. We have to make sure that each row in PF has at least one true *1' entry. In row 0, there is a *1' entry for check ci. In order to make this entry true, check ci should not be evaluated on pi. In row 2, there is a '1' entry corresponding to check C2. If this entry is to be true, check C2 cannot be evaluated on processor ps. This also makes the *1' entry in row 8 true. Similarly, the other rows are dealt with. The following processor sets for the checks satisfy all the above constraints.
set.
Check
Processor Set
C3
{P3, P4, Pe} {Pi, P2, Ps] {P3, P4, Pe}
We can then map a check to any one of the processors in its processor One possibility is to map ci to p4, C2 to ps, and C3 to pe. In general.
105
PC matrix of , unit system 1 \
PC matrix of unit system 2
^^6
V 0 r 0 0 r 0 1* r 0 0 1* 0 0 0 0 0 r 0 r 0 0 0 r
0 0 0 0 0 0
r r 0 0
r 0
0
• Cj cannot be evaluted on p
I
• Cj cannot be evaluted on p
2 3
^ C2 cannot be evaluted on p ^ C cannot be evaluted on p.
4
^ C cannot be evaluted on p
5
^ C cannot be evaluted on p
6
• C cannot be evaluted on p
7 8
^ Cj cannot be evaluted on p already satisfied by 2
9 already satisfied by 3 10 » Cj cannot be evaluted on p II
already satisfied by 5
(d)
Figure 2.1.4: (a) Unit system 1, (b) unit system 2, (c) the composite system, (d) the PF matrix. because of this flexibility in mapping, we can consider the communication delay overhead and/or the processor load while assigning checks to check_computing processors. For each of the processor choices for a check, one can compute the communication delay to transfer the data elements, that the check checks, to the processor, and then choose the processor which will entail the least delay. At the same time, we can try to avoid assigning more than one check to a processor, whenever possible, in order to balance the processor load. • Algorithm for mapping of checks: 1. For each of the unit systems, construct a PC matrix. Tag each row of the
106 matrix with the corresponding processor. Let d be the number of unit systems. Term the PC matrix of the i^^ unit system as P Q , I < i < d. A check in the final system has a column in each of the PC matrices. Throw out the columns corresponding to the checks which are redundant and have been removed from the composite system. For each of the PCi matrices, construct ^PCi, 1 < r < s. Each row in ^PCi is a combination of r rows of PCi. Tag each row in ^PCi with a set of r processors corresponding to the r rows whose combination it is. Initialize the processor sets of each check to all the processors in the system. Columnwise concatenate all the matrices ^PCi^ 1 ^ '^ ^ ^j 1 < i < c2, to form one matrix PF. Arrange the rows in the final matrix PF in an increasing order of the number of *l's and number the rows from 0 onwards. Number the columns from 0 to 5 — 1, where q is the total number of checks in the system. In each row of PF, mark the *!' which appears first in or after the column i modulo g, where i is the row number. For all rows of PF starting from the first do i = 0; throw — 0; While ( throw = 0 & i < g - l ) { if ( the i^^ column entry is *!') and (the set of processors in the row tag p| the i^^ column check's processor set = (f)) then { throw the row out of the matrix; throw = 1;
} i = i + 1;
} if ( throw = 0) then do the following: (a) Let present-check = Marked *1' column. (b) If (processor set of present_check) —(set of processors in the row tag) = <j)^ label the marked *!' as had and mark a *!' in that row which is not labeled bad. Go back to Step 7(a). Else, remove those processors from the present_check's processor set which are also in the row tag of the present row. (c) Throw the present row out of the matrix.
107 8. Map each check to any of the processors in its processor set. Here, we can consider the communication overhead and/or the processor load while assigning the checks to the processors. D
Fault Diagnosing S y s t e m s For a system to be 1-fault diagnosing, no two faulty processors should produce the same syndrome. Nair and Abraham [17] introduced the concept of disagreements to check whether two faulty processors are distinguishable or not. A few definitions and theorems adapted from [17] are presented here.
Definition 9 Rows Ri and R2 of matrix PC are said to have a 0 — 1 disagreement if there is at least one true ^0' in either Ri or R2, such that the other row has a true *1' in the corresponding position, for all possible error combinations caused by the faults corresponding to the two rows. Definition 10 If all pairs of rows of matrix PC have a 0 — 1 disagreement, then PC is said to have a 0 — 1 disagreement with itself. If two rows in a matrix ^PC have a disagreement, it means that the two fault patterns are distinguishable if all the data elements corresponding to those fault patterns are in error. The distinguishability may not hold if only a subset of the data elements corresponding to the fault patterns are in error. Nair and Abraham [17] defined a stronger condition called complete disagreement to ensure full distinguishability of two fault patterns, as follows.
Definition 11 A disagreement between two rows is called a complete disagreement if the disagreement exists for all possible error patterns corresponding to the fault patterns associated with the two rows.
Theorem 13 A fault tolerant system is 1-fault diagnosing if and only if PC has a complete 0-1 disagreement with itself.

If the PC matrices of the unit systems are considered, the information regarding the sharing of data elements is lost. Instead of working with the PC
matrices of the unit systems directly, the PC matrices of the unit systems are modified to preserve this information. For the info_processor part of the system, take the PD matrix of the original non-fault-tolerant system, but consider only those columns whose corresponding data elements are present in the unit system. The DC matrix has rows corresponding to the data elements in the unit system and columns corresponding to all the checks in the composite system. The columns whose checks are not in the unit system have all '0' entries. In the DC matrix, a check in a unit system copy which is equivalent to a check in the template unit system or another copy is represented by its corresponding check in the composite system. The augmented PC matrix of the unit system is then obtained by multiplying the PD and the DC matrices. Hereafter, in this section, the PC matrix of a unit system means the augmented PC matrix. The PC matrices of the unit systems in Figure 2.1.3 are given in Figure 2.1.5.

Figure 2.1.5: (a) PC matrix of the template unit system, (b) PC matrix of the copy unit system.

Theorem 14 If the PC matrices of each of the unit systems have a 0-1 disagreement with themselves and no check in the composite system has more than a single data element from a processor's data set, then the PC matrix of the composite system has a complete 0-1 disagreement with itself.

The PC matrix of the composite system should have a complete 0-1 disagreement with itself to ensure 1-fault diagnosability in the system. By Theorem 14, the PC matrices of the unit systems can be handled directly, instead of the PC matrix of the composite system. A PC matrix having a 0-1 disagreement with itself means that the ^2PC matrix should have at least one '1' in each row. For the 0-1 disagreement between two rows R1 and R2 to hold even in the presence of faults in the check_computing processors in the system, at least one '1' in the row of ^2PC corresponding to the (R1, R2) pair should be true. The algorithm for mapping of checks is similar in flavor to the algorithm used in the case of fault detecting systems, except that the algorithm considers only the ^2PC matrices of the unit systems while performing the check mapping. The details of the algorithm can be found in [11].
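To make these conditions concrete, the sketch below (an illustration only; the PD and DC matrices are hypothetical, not the ones in Figure 2.1.5) forms an augmented PC matrix as the product PD·DC and tests whether it has a 0-1 disagreement with itself, i.e., whether every pair of rows disagrees in at least one position.

import numpy as np

def augmented_pc(PD, DC):
    """Augmented PC matrix of a unit system: PC(i, j) counts how many of
    processor i's data elements are in the data set of check j."""
    return PD @ DC

def has_01_disagreement(PC):
    """True if every pair of rows has a position where exactly one of the
    two rows is zero (a 0-1 disagreement)."""
    rows = PC.shape[0]
    for a in range(rows):
        for b in range(a + 1, rows):
            if not np.any((PC[a] == 0) != (PC[b] == 0)):
                return False
    return True

# Hypothetical 3-processor, 4-data-element, 3-check unit system.
PD = np.array([[1, 1, 0, 0],
               [0, 0, 1, 0],
               [0, 0, 0, 1]])
DC = np.array([[1, 0, 0],
               [0, 1, 0],
               [0, 1, 1],
               [1, 0, 1]])
PC = augmented_pc(PD, DC)
print(PC)
print(has_01_disagreement(PC))   # True for this toy system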
2.1.4
Analysis of ABFT Systems
An analysis of an ABFT system involves determination of the largest value of s or t such that the system is s-fault detecting or t-fault diagnosing. This section will present methods for analyzing an ABFT system under the original as well as the extended models.
As defined earlier, a row in ^iPD corresponds to a unique set of i processors and the entries in the row correspond to the data elements computed by this set of processors. Similarly, a row in ^iPC corresponds to a set of i processors. A fault pattern of size i can be represented by a row R in the ^iPC matrix. If all the data elements in the data set of the fault pattern are erroneous, then an entry ^iPC(R, j) represents the number of erroneous data elements in the data set of the check corresponding to the j-th column of the matrix. In the presence of fault pattern R:

1. If ^iPC(R, j) = 0, then the j-th bit in the syndrome is a '0'.
2. If 0 < ^iPC(R, j) ≤ h, then the j-th bit is a '1' and the entry ^iPC(R, j) is defined as valid.
3. If ^iPC(R, j) > h, the value of bit j is unpredictable.
4. In the extended model:
   - If the check corresponding to the j-th column is not evaluated on any processor corresponding to R, then ^iPC(R, j) is a true entry.
   - Else, it is an invalid entry and the value of bit j is unpredictable.
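A minimal sketch of this entry classification, assuming an entry of a ^iPC row and the threshold h are given (the function and message strings below are mine, for illustration only):

def classify_entry(count, h):
    """Classify an entry of a ^iPC row for a given fault pattern:
    count = number of erroneous data elements seen by the check,
    h     = maximum number of errors the check is guaranteed to catch."""
    if count == 0:
        return "syndrome bit is '0'"
    if count <= h:
        return "syndrome bit is '1' (valid entry)"
    return "bit unpredictable (possible aliasing)"

print(classify_entry(0, 1))   # '0'
print(classify_entry(1, 1))   # '1', valid
print(classify_entry(2, 1))   # unpredictable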
2.1.4.1
Analysis of Fault Detectability
The algorithm for analyzing the fault detectability of an ABFT system under the original model was given by Nair and Abraham [17]. For each fault pattern there may exist more than one error pattern. If the algorithm enumerates all possible error patterns it will be very complex. Nair and Abraham proposed an error collapsing technique so that their algorithm converges much faster and requires less storage. Their algorithm for analyzing detectability is as follows.

Algorithm for Fault Detectability:
1. i = 1, s = 0.
2. Construct ^iPC.
3. For every row R in ^iPC do the following:
   (a) If row R has no valid entry, set s = i - 1, STOP. Else, go to the next step.
   (b) Find all values of j such that 0 < ^iPC(R, j) ≤ h.
   (c) For all such j do the following:
       - For all rows k of the DC matrix, if DC(k, j) = 1, set ^iPD(R, k) = 0 and set the corresponding entries in the k-th column of the PD matrix to '0'.
   (d) If at least one row of the new PD matrix has all zeros, row R is completely detectable. Go to Step 3 for the next row. Else, continue.
   (e) Find the new ^iPC matrix by multiplying the new ^iPD matrix and the DC matrix, and go back to Step 3(a). □

The procedure is repeated for increasing values of i until, by Step 3(a), some row of ^iPC has no valid entry, at which point s = i - 1. When we analyze the fault detectability of a system under the extended model, the basic algorithm is the same except that instead of only valid entries, both valid and true entries have to be considered. A small example of analysis under the original model is given next.

Example 10 Consider a system whose PC and ^iPC matrices, 1 ≤ i ≤ 4, are given next. Matrices PC, ^2PC and ^3PC have '1' entries in all rows for all error patterns. Matrix ^4PC does not have any valid entries for the error pattern (d1, d2, d3, d4, d5). Therefore, the system is only 3-fault detecting.
PC =
              c1  c2  c3  c4
     p1    [   1   0   1   1 ]
     p2    [   1   0   0   1 ]
     p3    [   0   1   1   0 ]
     p4    [   0   1   0   0 ]

^2PC =
              c1  c2  c3  c4
  (p1,p2)  [   2   0   1   2 ]
  (p1,p3)  [   1   1   2   1 ]
  (p1,p4)  [   1   1   1   1 ]
  (p2,p3)  [   1   1   1   1 ]
  (p2,p4)  [   1   1   0   1 ]
  (p3,p4)  [   0   2   1   0 ]

^3PC =
                 c1  c2  c3  c4
  (p1,p2,p3)  [   2   1   2   2 ]
  (p1,p2,p4)  [   2   1   1   2 ]
  (p1,p3,p4)  [   1   2   2   1 ]
  (p2,p3,p4)  [   1   2   1   1 ]

^4PC =
                    c1  c2  c3  c4
  (p1,p2,p3,p4)  [   2   2   2   2 ]
□
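The following sketch applies only the Step 3(a) test of the detectability algorithm, i.e., every fault pattern must leave at least one valid entry, to a small hypothetical PD/DC system (the same toy system used in the earlier sketch); the error-collapsing refinement is omitted, so it is an illustration rather than the full algorithm.

import numpy as np
from itertools import combinations

def detectability_first_cut(PD, DC, h):
    """Largest s such that every fault pattern of up to s processors produces
    at least one valid ('1'-guaranteeing) syndrome entry when all of its data
    elements are in error.  This is the Step 3(a) test only."""
    P = PD.shape[0]
    for i in range(1, P + 1):
        for rows in combinations(range(P), i):
            iPD_row = PD[list(rows)].sum(axis=0)   # data set of the fault pattern
            iPC_row = iPD_row @ DC                 # errors seen by each check
            if not np.any((iPC_row > 0) & (iPC_row <= h)):
                return i - 1                       # some pattern of size i escapes
    return P

# Hypothetical system: 3 processors, 4 data elements, 3 checks, h = 1.
PD = np.array([[1, 1, 0, 0],
               [0, 0, 1, 0],
               [0, 0, 0, 1]])
DC = np.array([[1, 0, 0],
               [0, 1, 0],
               [0, 1, 1],
               [1, 0, 1]])
print(detectability_first_cut(PD, DC, h=1))   # 2 for this toy system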
2.1.4.2
Analysis of Fault Diagnosability
For a system to be t-fault diagnosing, no two fault patterns of size t or less should produce the same syndrome. Two fault patterns F1 and F2 are said to be fully distinguishable if there exist no two error patterns E1 and E2, corresponding respectively to F1 and F2, such that the syndromes produced by the two error patterns are the same. Next we give a few definitions from [17].
Definition 12 Rows R1 and R2 of a matrix ^iPC are said to have a 1-0 disagreement if there exists at least one valid entry in row R1 such that the corresponding entry in row R2 is '0'.

Definition 13 Rows R1 and R2 of a matrix ^iPC are said to have a 0-1 disagreement if there exists at least one valid entry in one row such that the corresponding entry in the other row is a '0'.

Definition 14 Matrix ^iPC has a 0-1 disagreement with itself if every row in the matrix has a 0-1 disagreement with every other row in the matrix.

Definition 15 Matrix PC has a 1-0 disagreement with matrix ^iPC if every row R of PC has a 1-0 disagreement with every row in ^iPC which does not contain R.

All the above definitions apply to the extended model if the valid entries and the '0' entries considered are all true entries. Henceforth, when we refer to a disagreement, we mean a complete disagreement. The following theorems from [4] form the basis for the fault diagnosability algorithm.

Theorem 15 A system is t-fault diagnosing iff ^tPC has a complete 0-1 disagreement with itself and PC has a complete 1-0 disagreement with ^(t-1)PC.

Theorem 16 Let k1 be the largest integer such that PC has a complete 1-0 disagreement with ^k1PC. Let k2 be the largest integer such that ^k2PC has a complete 0-1 disagreement with itself. The system is then t-fault diagnosable for t = min(k1 + 1, k2).
Theorem 17 If every check is connected to at least two processors in the PC graph, then t ≤ ta in a t-fault diagnosing system, where ta is the minimum number of checks checking a data element in the system.

Let COV be a minimal set of processors which is connected to all the checks in the system. Let tb = |COV| if COV is the only such set of minimal cardinality. If there is more than one such minimal set, tb = |COV| - 1.

Theorem 18 In a t-fault diagnosing system, t ≤ tb.

Algorithm for Fault Diagnosability:
1. Compute ta and tb, and let tmax = min(ta, tb).
2. k1' = min(k1, tmax - 1).
3. Conduct a binary search for k1' between the limits 0 and (tmax - 1).
4. k2' = min(k2, k1' + 1).
5. Conduct a binary search for k2' between the limits 0 and k1' + 1.
6. t = k2'. □

Example 11 Consider the system given in Example 10. ta = 1, |COV| = 2 and there are 3 covering sets of minimal cardinality. Therefore, tb = 1 and tmax = 1. k1' = min(k1, 0) = 0. Therefore, we need to search for k2' between 0 and 1. Matrix PC has a complete 0-1 disagreement with itself. Hence, k2' = 1 and the system is 1-fault diagnosing. □
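As an illustration of Theorem 16 (not of the binary-search procedure itself), the sketch below checks the two disagreement conditions exhaustively on the same hypothetical PD/DC system used in the earlier sketches and reports t = min(k1 + 1, k2).

import numpy as np
from itertools import combinations

def k_pc_rows(PD, DC, k):
    # Rows of ^kPC, each kept together with the processor set it represents.
    P = PD.shape[0]
    return [(set(r), PD[list(r)].sum(axis=0) @ DC) for r in combinations(range(P), k)]

def valid(row, h):
    return (row > 0) & (row <= h)

def zero_one_self(rows, h):
    # Definitions 13/14: every pair of rows disagrees (valid entry vs. '0').
    return all(np.any((valid(a, h) & (b == 0)) | (valid(b, h) & (a == 0)))
               for (_, a), (_, b) in combinations(rows, 2))

def one_zero_with(PC, rows, h):
    # Definitions 12/15: every PC row has a valid entry where each ^kPC row
    # not containing that processor has a '0'.
    return all(np.any(valid(PC[p], h) & (a == 0))
               for p in range(PC.shape[0]) for tag, a in rows if p not in tag)

def diagnosability(PD, DC, h):
    PC, P = PD @ DC, PD.shape[0]
    k1 = k2 = 0
    while k1 + 1 < P and one_zero_with(PC, k_pc_rows(PD, DC, k1 + 1), h):
        k1 += 1
    while k2 + 1 < P and zero_one_self(k_pc_rows(PD, DC, k2 + 1), h):
        k2 += 1
    return min(k1 + 1, k2)                      # Theorem 16

PD = np.array([[1, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]])
DC = np.array([[1, 0, 0], [0, 1, 0], [0, 1, 1], [1, 0, 1]])
print(diagnosability(PD, DC, h=1))              # 1 for this toy system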
2.1.4.3
Fault Diagnosis
Any fault diagnosis algorithm has to take into account the fact that each fault pattern can give rise to more than one syndrome. Formally, the diagnosis problem can be defined as follows. Given a t-fault diagnosing system
and a syndrome produced by it, determine which fault pattern of size t or less was responsible for giving rise to the syndrome. The following definitions and theorems from [4] form the basis for the diagnosis algorithm.
Definition 16 In a bipartite graph G(V1, V2, E), two vertices v1 ∈ V1 and v2 ∈ V2 are said to cover each other if they are linked by an edge.

Definition 17 A set of vertices V1' ⊆ V1 is said to cover V2' ⊆ V2, where V2' is the union of the sets of vertices covered by the members of V1'.

Definition 18 A set of vertices V1' ⊆ V1 is called a covering set if it covers V2.

Definition 19 A covering set of vertices V1' ⊆ V1 is said to be irredundant if there does not exist a v1 ∈ V1' such that (V1' - v1) also forms a covering set.

Theorem 19 In a t-fault diagnosing system, given a fault pattern F such that |F| ≤ t and a syndrome S produced by the fault pattern, any processor p ∈ F covers at least one check c ∈ C at state '1' in the PC graph which no other processor in F covers.

Theorem 20 Given syndrome S, fault pattern F forms an irredundant covering set of processors for the set of checks at state '1' in the PC graph.

The actual fault pattern may not be the only irredundant covering set of the set of checks at state '1'. However, determining the irredundant covering sets narrows down the list of fault patterns that we need to consider. Therefore, the diagnosis algorithm involves first finding these irredundant covering sets and then determining which one of these agrees with the given syndrome. Given matrix PC and syndrome S, define matrix PC_S as follows:
1. Remove the columns corresponding to every check at state '0' in the syndrome.
2. Remove any processor row with a '0' in every remaining column.
3. Set every nonzero entry in the new matrix to '1'.

Since each faulty processor is connected to at least one check at state '1', no faulty processor will be eliminated by the above procedure. The fault pattern still remains an irredundant covering set of processors in PC_S. Matrix PC_S can be further reduced by using the following method. If a check c_k covers only a single processor p_k in the PC_S matrix, then p_k belongs to the irredundant covering set and forms an essential row in the matrix. Processors corresponding to all essential rows in the matrix are a part of the fault pattern. These can be removed from the matrix. Then all checks covered by the essential rows can be removed, and the rows which become all '0's after removing these checks can be discarded too. In the new matrix, consider two checks c_i and c_j such that all processors covered by check c_i are also covered by check c_j. In this case the column corresponding to check c_j will have '1's in all the rows in which the column corresponding to check c_i has '1's. Here c_j is said to dominate c_i. Any processor which covers check c_i will also cover check c_j. Therefore, check c_j can be removed from the matrix. The matrix obtained after all such reductions is called PC_Sr [4]. The faulty processors remaining in PC_Sr form an irredundant covering set for the checks in PC_Sr.

Algorithm for Fault Diagnosis: Given the PC matrix of the system and syndrome S produced by it:
1. Compute PC_S.
2. Find the set of essential processors. If all the checks in matrix PC_S have been covered, the set of essential processors forms the fault pattern; STOP.
3. Compute PC_Sr.
4. Compute all the irredundant covering sets of matrix PC_Sr which are of cardinality less than or equal to t.
5. If only one such covering set exists, STOP. The covering set along with the set of essential processors forms the fault pattern.
6. Find the irredundant covering set which agrees with the syndrome. There can be only one such set. This set along with the set of essential processors forms the fault pattern. STOP. □
Example 12 Consider the 1-fault diagnosing system given in Example 10. Suppose a processor in the system becomes faulty and the system generates the syndrome S = [0 1 1 0]. We compute PC_S by throwing out the columns corresponding to checks c1 and c4 and the row corresponding to processor p2 from matrix PC:
PC =
              c1  c2  c3  c4
     p1    [   1   0   1   1 ]
     p2    [   1   0   0   1 ]
     p3    [   0   1   1   0 ]
     p4    [   0   1   0   0 ]

PC_S =
              c2  c3
     p1    [   0   1 ]
     p3    [   1   1 ]
     p4    [   1   0 ]
There are no essential processors and no dominating columns in PC_S. Therefore, PC_Sr = PC_S. The irredundant covering set of cardinality one is {p3}. Hence, p3 is the faulty processor in the system. □
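The reduction just carried out by hand can be written down directly. The sketch below is only an illustration: it builds PC_S from the PC matrix and syndrome of Example 12 and searches for a smallest cover of the checks at state '1', omitting the essential-row and dominating-column reductions that produce PC_Sr.

import numpy as np
from itertools import combinations

def diagnose(PC, syndrome, t):
    """Return a set of processors of smallest size (<= t) covering every check
    at state '1', after dropping state-'0' checks and all-zero rows (PC_S)."""
    ones = [j for j, s in enumerate(syndrome) if s == 1]
    PCs = PC[:, ones]
    alive = [p for p in range(PC.shape[0]) if PCs[p].any()]
    for size in range(1, t + 1):
        for cand in combinations(alive, size):
            if np.all(PCs[list(cand)].sum(axis=0) > 0):   # covers every '1' check
                return set(cand)
    return None

# PC matrix and syndrome of Example 12 (processors p1..p4, checks c1..c4).
PC = np.array([[1, 0, 1, 1],
               [1, 0, 0, 1],
               [0, 1, 1, 0],
               [0, 1, 0, 0]])
print(diagnose(PC, syndrome=[0, 1, 1, 0], t=1))   # {2}, i.e. processor p3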
The complexity of the diagnosis algorithm is O(n^t). When the value of t is not very large, the diagnosis algorithm is practical. If the value of t is relatively large, the algorithm becomes computationally very expensive. Up till now in this chapter we assumed that if there are more than h erroneous data elements in a check's data set, the erroneous data elements may invalidate the check due to aliasing. In such a case, the state of the check will be unpredictable. This assumption makes the already complicated task of diagnosis even more difficult. It has been shown by Srinivasan and Jha [12] that check invalidation due to aliasing of data elements is a very rare phenomenon. If each output data element is represented by an n-bit word and each erroneous bit pattern is equally likely, they show that the probability of aliasing is bounded by an exponentially small function of n (involving only a small constant factor c). For a 64-bit system, this probability is vanishingly small. They exploit this fact to propose a simple and fast O(t²n² log n) diagnosis algorithm. They introduce the concept of NO-error detectability/diagnosability to measure the fault tolerance capability of a system under the assumption that no check invalidation occurs due to aliasing. Their
algorithm DIAG takes as parameters the DC graph G, the syndrome S and another parameter t.

Algorithm DIAG(G, S, t):
1. Remove all the checks which are at state '0' from the DC graph to form the reduced check set C_r.
2. Remove the data elements connected to the checks removed in Step 1 to form the reduced data set D_r. Declare the removed data set to be error-free.
3. Find D_r' ⊆ D_r, |D_r'| ≤ t, which is compatible with the syndrome S.
4. If one such D_r' exists, return it as the set of erroneous data elements, else declare failure. □
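A brute-force rendering of DIAG, under the stated no-invalidation assumption that a check fires exactly when it sees at least one erroneous element; the DC graph, element names and helper names below are hypothetical.

from itertools import combinations

def diag(dc_edges, checks, syndrome, t):
    """dc_edges: dict data element -> set of checks covering it.
    syndrome:  dict check -> 0/1.  Returns a set of erroneous data elements of
    size <= t compatible with the syndrome, or None (failure)."""
    zero_checks = {c for c in checks if syndrome[c] == 0}
    # Steps 1-2: elements touching a '0' check are declared error-free.
    reduced = [d for d, cs in dc_edges.items() if not (cs & zero_checks)]
    one_checks = {c for c in checks if syndrome[c] == 1}
    for size in range(0, t + 1):
        for cand in combinations(reduced, size):
            fired = set().union(*(dc_edges[d] for d in cand)) if cand else set()
            if fired == one_checks:                 # compatible with the syndrome
                return set(cand)
    return None                                     # declare failure

# Hypothetical DC graph: 4 data elements, 3 checks.
dc = {"d1": {"c1"}, "d2": {"c2"}, "d3": {"c2", "c3"}, "d4": {"c1", "c3"}}
print(diag(dc, checks={"c1", "c2", "c3"},
           syndrome={"c1": 0, "c2": 1, "c3": 1}, t=2))   # {'d3'}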
Definition 20 A DC graph is called s-error NO-detecting if, given that the number of erroneous data elements is less than or equal to s, at least one check in the system goes to state '1', under the assumption that check invalidation does not occur.

Definition 21 A DC graph is called t-error NO-diagnosing if, given that the number of erroneous data elements in the system is less than or equal to t, the set of erroneous data elements can be uniquely determined from the syndrome, under the assumption that check invalidation does not occur.
Theorem 21 Provided the number of erroneous data elements is no more than t:
1. if t is the error diagnosability and if DIAG(G, S, t) returns a solution, then it is guaranteed to be the correct set of erroneous data elements regardless of whether check invalidation has occurred or not;
2. if check invalidation does not occur and t is the NO-error diagnosability of the DC graph, then DIAG is guaranteed to return the correct set of erroneous data elements.
An interesting point made by the above theorem is that aliasing cannot fool DIAG. Only in the extremely rare cases where aliasing has occurred and DIAG declares failure does one need to switch to the conventional diagnosis algorithm. If aliasing is assumed not to occur, then it can be shown that, using the same number of checks in RANDGEN, roughly double the number of errors can be located and any number of errors can be detected.
2.1.5
Synthesis for Algorithm-Based Fault Tolerance
The design approach used in Section 2.1.3 takes an existing non-fault-tolerant system and modifies it to introduce fault tolerance into it. Another approach would be to introduce fault tolerance at the system synthesis stage. In this approach the designer can get an optimal mix of all the desired features, such as hardware, performance and fault tolerance, in the final design. This approach is called synthesis for fault tolerance. Vinnakota and Jha [5] used this approach to develop a general method for synthesis of ABFT systems. One needs to expose the inherent parallelism of an algorithm before implementing it on a parallel architecture. Dependence graphs and single assignment form are two techniques which are helpful in bringing out the parallelism in an algorithm [36]. Consider the following loop in an algorithm:

c = 0;
for i = 1 to 4
    c = c + b(i)
end for
The value of variable c changes as the loop is executed. In a single assignment form, a variable can be assigned only one value throughout the execution of the program. Therefore, the value of c in different loops needs to be represented by different variables. The single assignment form of the above example will be the following:
c(0) = 0;
for i = 1 to 4
    c(i) = c(i - 1) + b(i)
end for
The dependence graph (DG) of an algorithm represents the data dependencies in the algorithm. Each node in the DG represents an operation and the directed arcs between the nodes represent the data dependencies between the operations. In the above example each operation c(i) = c(i - 1) + b(i) is represented by node i in the DG, and there is an incoming arc from node i - 1 and an outgoing arc to node i + 1 from node i. Each node in the DG operates only once during an algorithm execution. The DG is mapped to an architecture by means of projection mapping. A projection vector d is chosen and all the nodes along the direction of the projection vector are mapped to one processor. Since a number of dependent tasks have to be performed on each processor, a schedule vector s has to be defined such that the data dependencies are satisfied and no two tasks mapped to a processor need to be performed in the same computation cycle. For a feasible schedule, s·d > 0 and s·e > 0, where e represents the vectors corresponding to the edges in the DG [36]. The synthesis method from [5] is as follows. Synthesis Method:
1. Encode the algorithm using a property which remains invariant during the computation.
2. Write the encoded algorithm in single assignment form.
3. Construct a DG for the encoded algorithm from the single assignment form.
4. Project the DG in different directions to get various architectures.
5. Choose the architecture which is optimal in terms of some cost function based on fault detectability, diagnosability, hardware and performance. □
Figure 2.1.6: (a) The DG for matrix multiplication, (b) projection in the i direction, (c) projection in the k direction, (d) projection in the j direction.

Example 13 Consider the multiplication of two 3 × 3 matrices A and B given in Example 1. We encode the matrices using row and column checksums, which remain invariant during the multiplication. Matrix A is encoded as A' using column checksums and matrix B is encoded using row checksums as matrix B'. The resultant matrix C' = (c_ij) is a 4 × 4 matrix which has both the row and column checksum properties. The DG for the matrix multiplication is given in Figure 2.1.6(a). The DG is projected along the i, k and j dimensions and the arrays of processors obtained are given in Figures 2.1.6(b), (c) and (d), respectively. In the array obtained by projecting the DG in the k direction, each processor computes only one output data element. Therefore, the effect of a fault in a processor is limited to only one data element and it is easy to catch the error. The arrays obtained by projecting the DG in the i and j directions, though better in terms of hardware, have much worse fault tolerance capabilities than the array in Figure 2.1.6(c). Each processor computes more than one data element and, therefore, the chances of check invalidation increase even for the simple case of a single fault. □
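A small numerical sketch of the row/column checksum encoding that the example relies on (the classical scheme of [1]); the matrices and the injected error below are arbitrary and only illustrate how a single corrupted element of the checksum product is located.

import numpy as np

def column_checksum(A):                 # append a row of column sums
    return np.vstack([A, A.sum(axis=0)])

def row_checksum(B):                    # append a column of row sums
    return np.hstack([B, B.sum(axis=1, keepdims=True)])

A = np.arange(1, 10).reshape(3, 3).astype(float)
B = np.arange(2, 11).reshape(3, 3).astype(float)

C = column_checksum(A) @ row_checksum(B)     # 4 x 4 full-checksum product

C[1, 2] += 5.0                               # inject a single erroneous element

err_col = np.flatnonzero(~np.isclose(C[:3, :].sum(axis=0), C[3, :]))
err_row = np.flatnonzero(~np.isclose(C[:, :3].sum(axis=1), C[:, 3]))
print(err_row, err_col)                      # row 1, column 2 -> error located

The intersection of the failing row check and the failing column check pinpoints the corrupted element, which is why the projection that confines each output element to a single processor is so attractive for fault isolation.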
2.1.6
Conclusions
This chapter gave a brief review of several aspects of ABFT systems. It first presented an overview of the models used for representing ABFT systems. Various design techniques which use these models were discussed and compared. Methods were given to assess the fault tolerance capabilities of an ABFT system. Two simple diagnosis algorithms were presented. Finally, another approach to designing ABFT systems, called synthesis for fault tolerance, was briefly explained.
References
[1] K.H. Huang and J.A. Abraham, "Algorithm-based fault tolerance for matrix operations," IEEE Trans. Comput., vol. C-33, no. 6, pp. 518-528, June 1984.
[2] J.Y. Jou and J.A. Abraham, "Fault tolerant matrix arithmetic and signal processing on highly concurrent computing structures," Proc. IEEE, vol. 74, pp. 732-741, May 1986.
[3] J.A. Abraham et al., "Fault tolerance techniques for systolic arrays," IEEE Computer, pp. 65-74, July 1987.
[4] B. Vinnakota and N.K. Jha, "Diagnosability and diagnosis of algorithm-based fault tolerant systems," IEEE Trans. Comput., vol. 42, no. 8, pp. 924-937, Aug. 1993.
[5] B. Vinnakota and N.K. Jha, "A dependence graph-based approach to the design of algorithm-based fault tolerant systems," in Proc. Int. Symp. Fault-Tolerant Comput., Newcastle-upon-Tyne, pp. 122-129, June 1990.
[6] B. Vinnakota and N.K. Jha, "Design of multiprocessor systems for concurrent error detection and fault diagnosis," in Proc. Int. Symp. Fault-Tolerant Comput., Montreal, pp. 504-511, June 1991.
[7] B. Vinnakota, "Analysis, design and synthesis of algorithm-based fault tolerant systems," Ph.D. Thesis, Dept. of Electrical Engg., Princeton University, Oct. 1991.
[8] R. Sitaraman and N.K. Jha, "Optimal design of checks for error detection and location in fault tolerant multiprocessor systems," IEEE Trans. Comput., vol. 42, no. 7, pp. 780-793, July 1993.
[9] S. Yajnik and N.K. Jha, "Design of algorithm-based fault tolerant systems with in-system checks," in Proc. Int. Conf. Parallel Proc., vol. 1, St. Charles, IL, Aug. 1993.
[10] S. Yajnik and N.K. Jha, "Analysis and randomized design of algorithm-based fault tolerant multiprocessor systems under the extended graph-theoretic model," in Proc. ISCA Int. Conf. Parallel Dist. Systems, Louisville, KY, Oct. 1993.
[11] S. Yajnik and N.K. Jha, "Graceful degradation in algorithm-based fault tolerant systems," in Proc. Int. Symp. Circuits & Systems, London, UK, May 1994.
[12] S. Srinivasan and N.K. Jha, "Efficient diagnosis in algorithm-based fault tolerant multiprocessor systems," in Proc. Int. Conf. Computer Design, Boston, MA, pp. 592-595, Oct. 1993.
[13] P. Banerjee and J.A. Abraham, "Bounds on algorithm-based fault tolerance in multiple processor systems," IEEE Trans. Comput., vol. C-35, no. 4, pp. 296-306, Apr. 1986.
[14] P. Banerjee and J.A. Abraham, "Concurrent fault diagnosis in multiple processor systems," in Proc. Int. Symp. Fault-Tolerant Comput., Vienna, pp. 298-303, June 1986.
[15] P. Banerjee, "A theory for algorithm-based fault tolerance in array processor systems," Ph.D. Thesis, Coordinated Science Laboratory, Univ. of Illinois, Urbana, Dec. 1984.
[16] P. Banerjee et al., "Algorithm-based fault tolerance on a hypercube multiprocessor," IEEE Trans. Comput., vol. 39, pp. 1132-1145, Sept. 1990.
[17] V.S.S. Nair and J.A. Abraham, "A model for the analysis of fault tolerant signal processing architectures," in Proc. Int. Tech. Symp. SPIE, San Diego, pp. 246-257, Aug. 1988.
[18] V.S.S. Nair and J.A. Abraham, "General linear codes for fault-tolerant matrix operations on processor arrays," in Proc. Int. Symp. Fault-Tolerant Comput., Tokyo, pp. 180-185, June 1988.
[19] V.S.S. Nair and J.A. Abraham, "A model for the analysis, design and comparison of fault-tolerant WSI architectures," in Proc. Workshop Wafer Scale Integration, Como, Italy, June 1989.
[20] V.S.S. Nair and J.A. Abraham, "Hierarchical design and analysis of fault-tolerant multiprocessor systems using concurrent error detection," in Proc. Int. Symp. Fault-Tolerant Comput., Newcastle-upon-Tyne, pp. 130-137, June 1990.
[21] V.S.S. Nair, "Analysis and design of algorithm-based fault tolerant systems," Ph.D. Thesis, Coordinated Science Laboratory, Univ. of Illinois, Urbana, Aug. 1990.
[22] A.L.N. Reddy and P. Banerjee, "Algorithm-based fault tolerance for signal processing applications," IEEE Trans. Comput., vol. 39, pp. 1304-1308, Oct. 1990.
[23] V. Balasubramaniam and P. Banerjee, "Algorithm-based fault tolerance for signal processing applications on a hypercube multiprocessor," in Proc. 10th Real-Time Systems Symp., Santa Monica, CA, pp. 134-143, 1989.
[24] V. Balasubramaniam and P. Banerjee, "Trade-offs in design of efficient algorithm-based error detection schemes for hypercube multiprocessors," IEEE Trans. Software Engg., vol. 16, pp. 183-196, Feb. 1990.
[25] V. Balasubramaniam and P. Banerjee, "Compiler assisted synthesis of algorithm-based checking in multiprocessors," IEEE Trans. Comput., vol. 39, no. 4, pp. 436-446, Apr. 1990.
[26] D. Gu, D.J. Rosenkrantz and S.S. Ravi, "Design and analysis of test schemes for algorithm-based fault tolerance," in Proc. Int. Symp. Fault-Tolerant Comput., Newcastle-upon-Tyne, pp. 106-113, June 1990.
[27] D.J. Rosenkrantz and S.S. Ravi, "Improved bounds on algorithm-based fault tolerance," in Proc. Annual Allerton Conf. Comm., Cont. and Comput., Allerton, IL, pp. 388-397, Sept. 1988.
[28] D.M. Blough and A. Pelc, "Almost certain fault diagnosis through algorithm-based fault tolerance," Tech. Rep. ECE-92-09, Dept. of Electrical and Computer Engg., Univ. of California, Irvine.
[29] K.H. Huang, "Fault tolerant algorithms for multiple processor systems," Ph.D. Thesis, Coordinated Science Laboratory, Univ. of Illinois, Urbana, Nov. 1983.
[30] Y.H. Choi and M. Malek, "A fault tolerant FFT processor," IEEE Trans. Comput., vol. 37, no. 5, pp. 617-621, May 1988.
[31] J.Y. Jou and J.A. Abraham, "Fault-tolerant FFT networks," IEEE Trans. Comput., vol. 37, no. 5, pp. 548-561, May 1988.
[32] F.T. Luk and H. Park, "Fault-tolerant matrix triangularization on systolic arrays," IEEE Trans. Comput., vol. 37, no. 11, pp. 1434-1438, Nov. 1988.
[33] F.T. Luk and H. Park, "An analysis of algorithm-based fault tolerance techniques," in Proc. SPIE Adv. Alg. Arch. Signal Proc., vol. 696, pp. 222-228, Aug. 1986.
[34] Y.H. Choi and M. Malek, "A fault-tolerant systolic sorter," IEEE Trans. Comput., vol. 37, no. 5, pp. 621-624, May 1988.
[35] C.J. Anfinson and F.T. Luk, "A linear algebraic model of algorithm-based fault tolerance," IEEE Trans. Comput., vol. 37, no. 12, pp. 1599-1604, Dec. 1988.
[36] S.Y. Kung, VLSI Array Processors, Prentice-Hall, Englewood Cliffs, NJ, 1988.
SECTION 2.2
Fault-Tolerance and Efficiency in Massively Parallel Algorithms Paris C. Kanellakis^ and Alex A. Shvartsman^
Abstract
We present an overview of massively parallel deterministic algorithms which combine high fault-tolerance and efficiency. This desirable combination (called robustness here) is nontrivial, since increasing efficiency implies removing redundancy whereas increasing fault-tolerance requires adding redundancy to computations. We study a spectrum of algorithmic models for which significant robustness is achievable, from static fault, synchronous computation to dynamic fault, asynchronous computation. In addition to fail-stop processor models, we examine and deal with arbitrarily initialized memory and restricted memory access concurrency. We survey the deterministic upper bounds for the basic Write-All primitive, the lower bounds on its efficiency, and we identify some of the key open questions. We also generalize the robust computing of functions to relations; this new approach can model approximate computations. We show how to compute approximate Write-All optimally. Finally, we synthesize the state-of-the-art in a complexity classification, which extends with fault-tolerance the traditional classification of efficient parallel algorithms.
2.2.1
Introduction
A basic problem of massively parallel computing is that the unreliability of inexpensive processors and their interconnection may eliminate any potential efficiency advantage of parallelism. Our research is an investigation of fault models and parallel computation models under which it is possible to achieve algorithmic efficiency (i.e., speed-ups close to linear in the number of processors) despite the presence of faults. We would like to note that these models can also
[email protected]. This research was supported by ONR grant N00014-91-J-1613. ^Digital Equipm.ent Corporation, Digital Consulting Technology Office, 30 Porter Road, Littleton, MA OI46O, USA. Electronic mail:
[email protected].
126 be used to explore common properties of a broad spectrum of fault-free models, from synchronous parallel to asynchronous distributed computing. Here, our presentation focuses on deterministic algorithms and complexity, as opposed to algorithms that use randomization. There is an intuitive trade-off between reliability and efficiency because reliability usually requires introducing redundancy in the computation in order to detect errors and reassign resources, whereas gaining efficiency by massively parallel computing requires removing redundancy from the computation to fully utilize each processor. Thus, even allowing for some abstraction in the model of parallel computation, it is not obvious that there are any non-trivial fault models that allow near-linear speed-ups. So it was somewhat surprising when in [17] we demonstrated that it is possible to combine efficiency and fault-tolerance for many basic algorithms expressed as concurrent-read concurrent-write parallel (CRCW) random access machines (PRAMS [14]). The [17] fault model allows any pattern of dynamic fail-stop no restart processor errors, as long as one processor remains alive. The fault model was applied to all CRCW PRAMs in [23, 40]. It was extended in [18] to include processor restarts, and in [42] to include arbitrary static m,em,ory faults, i.e., arbitrary memory initialization, and in [16] to include restricted memory access patterns through controlled memory access. Concurrency of reads and writes is an essential feature that accounts for the necessary redundancy so it can be restricted but not eliminated - see [16, 17] for an in-depth discussion of this issue. Also, as shown in [17], it suffices to consider COMMON CRCW PRAMS (all concurrent writes are identical) in which the atomically written words need only contain a constant number of bits. The work we survey makes three key assumptions. Namely that: 1. Failure-inducing adversaries are worst-case for each model and algorithms for coping with them are deterministic. 2. Processors can read and write memory concurrently - except that initial faults can be handled without memory access concurrency. 3. Processor faults do not affect memory - except that initial memory can be contaminated. A central algorithmic primitive in our work is the Write-All operation [17]. Iterated Write-All forms the basis for the algorithm simulation techniques of [23, 40] and for the memory initialization of [42]. Therefore, improved WriteAll solutions lead to improved simulations and memory clearing techniques. The Write-All problem is: using P processors write Is into all locations of an array of size N, where P < N. When P = N this operation captures the
127 c o m p u t a t i o n a l progress t h a t can be naturally accomplished in one time unit by a PRAM. We say that Write-All completes at the global clock tick at which all the processors that have not fail-stopped share the knowledge that I's have been written into all N array locations. Requiring completion of a Write-All algorithm is critical if one wishes to iterate it, as pointed out in [23] which uses a certification bit to separate the various iterations of (Certified) Write-AIL Note t h a t t h e Write-All completes when all processors halt in all algorithms presented here. Under d y n a m i c failures, efficient deterministic solutions to Write-All, i.e., increasing the fault-free 0{N) work by small polylog(iV) factors, are nonobvious. T h e first such solution was algorithm W of [17] which has (to date) the best worst-case work b o u n d 0(N -f- P log^ iV/log log iV) for 1 < P < iV. This b o u n d was first shown in [22] for a different version of the algorithm a n d in [29] the basic argument was a d a p t e d to algorithm W. Let us now describe the contents of this survey, with some pointers to the literature, as well as our new contributions. In Section 2.2.2 we present a synthesis of parallel c o m p u t a t i o n a n d fault models. This synthesis is new and includes most of the models proposed to d a t e . It links the work on fail-stop norestart errors, to fail-stop errors with restarts (both detectable a n d undetectable restarts). T h e detectable restart case has been examined, using a slightly formalism in [8, 18]. T h e undetectable restart case is equivalent to general general model of asynchrony t h a t has received a fair a m o u n t of in the literature. An elegant deterministic solution for Write-All in appeared in [3]. T h e proof in [3] is existential, because it uses a argument. It has recently been m a d e constructive in [33].
different the most attention this case counting
For some i m p o r t a n t early work on asynchronous PRAMs we refer to [9, 10, 15, 22, 23, 30, 32, 34]. In the last three years, randomized asynchronous c o m p u t a tion has been examined in depth in [4, 5, 21]. These analyses involve randomness in a central way. T h e y are mostly a b o u t off-line or oblivious adversaries, which cause faults during the c o m p u t a t i o n b u t pick the times of these faults before the c o m p u t a t i o n . Although, we will not survey this interesting subject here we would like to point-out t h a t one very promising direction involves combining techniques of randomized asynchronous c o m p u t a t i o n with randomized inform a t i o n dispersal [36]. T h e work on fault-tolerant a n d efficient parallel shared memory models has also been applied to distributed message passing models; for example see [1, 11, 12]. In Section 2.2.3 we examine a n array of algorithms for the Write-All problem. These employ a variety of deterministic techniques a n d are extensible to
128 the c o m p u t a t i o n of other functions (see Section 2.2.4). In particular, in Section 2.2.4, we provide new b o u n d s for fault-tolerant a n d efficient c o m p u t a t i o n of parallel prefixes. In Section 2.2.5 we introduce the problem of a p p r o x i m a t e Write-All by computing relations instead of functions. One new contribution t h a t we make is to solve a p p r o x i m a t e Write-All optimally. In Section 2.2.6 we survey the state-of-the-art in lower b o u n d s . In Section 2.2.7 we present a new complexity classification for fault-tolerant algorithms. We close with a discussion of randomized vs deterministic techniques for fault-tolerant a n d efficient parallel c o m p u t a t i o n (see Section 2.2.8).
2.2.2
Fault-tolerant parallel computation models
In the first subsection we detail a hierarchy of fail-stop models of parallel comp u t a t i o n . We then explain the cost measures of available processor steps a n d overhead ratio, which we use to characterize robust algorithms. T h e final three subsections contain comments on variations of the processor, memory, a n d network interconnect parts of our models. 2.2.2.1
Fail-Stop P R A M s
T h e parallel r a n d o m access machine ( P R A M ) of Fortune and Wyllie [14] combines the simplicity of a RAM with the power of parallelism, a n d a wealth of efficient algorithms exist for it; see surveys [13, 20] for the rationale behind this model and the fundamental algorithms. We build our models of fail-stop P R A M S as extensions of the PRAM model. 1. There are Q shared memoTy cells, a n d the input of size iV < Q is stored in the first N cells. Except for the cells holding the input, all other m e m o r y is cleared, i.e., contains zeroes. Each memory cell can store 0 ( l o g i V ) bits. All processors can access shared memory. For convenience we assume they "know" the input size N, i.e., the \ogN bits describing it can be p a r t of their finite state control. For convenience we assume t h a t each processor also has a constant size private memory, t h a t only it can access. 2. There are P < N initial processors with unique identifiers (piDs) in the range 1 , . . . , P . Each processor "knows" its PID and the value of P , i.e., these can be p a r t of its finite s t a t e control. 3. T h e processors t h a t are d a r d PRAM model [14]. an observer outside the event, the processors do
active all execute synchronously as in the stanAlthough processors proceed in synchrony a n d PRAM can associate a "global t i m e " with every not have access to "global t i m e " , i.e., processors
129 can try to keep local clocks by counting their steps and communicating through shared memory but the PRAM does not provide a "global clock". 4. Processors stop without affecting memory. They may also restart, depending on the power of a fault-inducing adversary. In the study of fail-stop PRAMs, we consider four main types of failureinducing adversaries. These form a hierarchy, based on their power. Note that, each adversary is more powerful than the preceding ones and that the last case can be used to simulate fully asynchronous processors [3]. Initial faults: adversary causes processor failures only prior to the start of the computation. Fail-stop failures: adversary causes stop failures of the processors during the computation; there are no restarts. Fail-stop failures, detectable restarts: adversary causes stop failures; subsequently to a failure, the adversary might restart a processor and a restarted processor "knows" of the restart. Fail-stop failures, undetectable restarts: adversary causes stop failures and restarts; a restarted processor does not necessarily "know" of the restart. Except for the initial failures case, the adversaries are dynamic. A major characteristic of these adversary models is that they are worst-case. These have full information about the structure and the dynamic behavior of the algorithms whose execution they interfere with, while being completely unknown to the algorithms. Remark on (un)detectable restarts: One way of realizing detectable restarts is by modifying the finite state control of the PRAM. Each instruction can have two parts, a green and a red part. The green part gets executed under normal conditions. If a processor fails then all memory remains intact, but in the subsequent restart the next instruction red part is executed instead of the green part. For example, the model used in [8, 18] can be realized this way, instead of using "update cycles". The undetectable restarts adversary can also be realized in a similar way by making the algorithm weaker. For undetectable restarts algorithms have to have identical red and green parts. For example, the fully asynchronous model of [3] can be realized this way. • We formalize failures as follows. A failure pattern F is syntactically defined as a set of triples
where tag is either failure indicating processor failure, or restart indicating a processor restart, PID is the processor identifier, and t is the time indicating when the processor stops or restarts. This time
Figure 2.2.1: An architecture for a fail-stop multiprocessor. is a global time, that could be assigned by an observer (or adversary) outside the machine. The size of the failure pattern F is defined as the cardinality | F | , where \F\< M for some parameter M. The abstract model that we are studying can be realized in the architecture in Fig. 2.2.1. This architecture is more abstract than, e.g., an implementation in terms of hypercubes, but it is simpler to program in. Moreover, various faulttolerant technologies can contribute towards concrete realizations of its components. There are P fail-stop processors [38]. There are Q shared memory cells. These semiconductor memories can be manufactured with built-in fault tolerance using replication and coding techniques [37]. Processors and memory are interconnected via a synchronous network [39]). A combining interconnection network well suited for implementing synchronous concurrent reads and writes is in [24] and can be made more reliable by employing redundancy [2]. In this architecture, when the underlying hardware components are subject to failures within their design parameters, the algorithms we develop work correctly, and within the specified complexity bounds. 2.2.2.2
Measures of Efficiency
We use a generalization of the standard Parallel-time x Processors product to measure work of an algorithm when the number of processors performing work fluctuates due to failures or delays [17, 18]. In the measure we account for the available processor steps and we do not charge for time steps during which a processor was unavailable due to a failure. Definition 2.2.1 Consider a parallel computation with P initial processors that terminates in time r after completing its task on some input data / of size N and in the presence of the fail-stop error pattern F. If Pi{I^ F) < P is the number of processors completing an instruction at step z, then we define 5(7, F, P) as: 5(7, F, P) = E L , Pi{I, F). •
131 Definition 2.2.2 A P-processor PRAM algorithm on any input data 7 of size |/| = N and in the presence of any pattern F of failures of size \F\ < M uses available processor steps S = SN^M.P = niax/^ir{5(7, F, P ) } . • The available steps measure 5 is used in turn to define the notion of algorithm robustness that combines fault tolerance and efficiency: Definition 2.2.3 Let T(N) be the best sequential (RAM) time bound known for iV-size instances of a problem. We say that a parallel algorithm for this problem is a robust parallel algorithm if: for any input / of size N and for any number of initial processors P {1 < P < N) and for any failure pattern F of size at most M with at least one surviving processor (M < N for fail-stop model), the algorithm completes its task with S = SN^M.P < c T(N)\og^ iV, for fixed c, c'. • For arbitrary failures and restarts, the completed work measure 5 depends on the size N of the input / , the number of processors P , and the size of the failure pattern F. The ultimate performance goal is to perform the required computation at a work cost as close as possible to the work performed by the best sequential algorithm known. Unfortunately, this goal is not attainable when an adversary succeeds in causing too many processor failures during a computation. E x a m p l e : Consider a Write-All solution, where it takes a processor one instruction to recover from a failure. If an adversary has a failure pattern F with | P | - n(iV^+^) for e> 0, then work will be ll(iV^+^) regardless of how eflScient the algorithm is otherwise. This illustrates the need for a measure of efficiency that is sensitive to both the size of the input iV, and the size of the failure pattern | P | < M. We thus also introduce the overhead ratio a that amortizes work of the essential work and failures: Definition 2.2.4 A P-processor PRAM algorithm on any input data / of size |/| = iV and in the presence of any pattern F of failures and restarts of size | P | < M has overhead ratio a = (TN,M,P — maxj,F I \IMF\ f *
^
When M = 0{P) as in the case of the stop failures without restarts, S properly describes the algorithm efficiency, and a = 0( ^)^'^). When F can be large relative to N and P with restarts enabled, a better reflects the efficiency of fault-tolerant algorithms. We can generalize the definition of a in Def. 2.2.4 in terms of the ratio ^Ll •'• J , where T{I) is the time complexity of the best known sequential solution for a particular problem.
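A toy illustration of the available-processor-steps measure and the overhead ratio; the trace, the input size and the denominator N + |F| used for σ are assumptions made for this example only, in the spirit of Definitions 2.2.1 and 2.2.4.

def available_steps(pi):
    """S(I, F, P) = sum over steps of the number of processors that completed
    an instruction at that step (Definition 2.2.1)."""
    return sum(pi)

def overhead_ratio(S, input_size, failures):
    """sigma ~ S / (|I| + |F|): work amortized over the input size and the
    number of failure/restart events (assumed form of Definition 2.2.4)."""
    return S / (input_size + failures)

# A 4-processor run on an input of size 8: two processors fail-stop mid-way.
completed_per_step = [4, 4, 3, 3, 2, 2, 2]
S = available_steps(completed_per_step)
print(S)                                              # 20 available processor steps
print(overhead_ratio(S, input_size=8, failures=2))    # 2.0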
132 2.2.2.3
Processor issues: survivability
We have chosen to consider only the failure models where the processors do not write any erroneous or maliciously incorrect values to shared memory. While malicious processor behavior is often considered in conjunction with message passing systems, it makes less sense to consider malicious behavior in tightly coupled shared memory systems. This is because even a single faulty processor has the potential of invalidating the results of a computation in unit time, and because in a parallel system all processors are normally "trusted" agents, and so the issues of security are not applicable. The fail-stop model with undetectable restarts and dynamic adversaries is the most general fault model we deal with. It can be viewed as a model of parallel computation with arbitrary asynchrony. Remark on stronger survivability assumption: The default assumption we make is that throughout the computation one processor is fault-free. This assumption can be made stronger, i.e., a constant fraction of the processors are fault-free. We always list the stronger assumption explicitly when used (e.g., in the complexity classification). • Remark on weaker survivability assumption and restarts: For the models with restarts one can use the weaker survivability assumption that at each global clock tick one processor step executes. In [18] this was stated using "update cycles", but it can be stated using our green-red instruction implementation - remark on (un)detectable restarts. • 2.2.2.4
Memory issues: words vs bits and initialization
In our models we assume that logiV-bit word parallel writes are performed atomically in unit time. The algorithms in such models can be modified so that this restriction is relaxed. The sufficient definition of atomicity is: (1) logiV-size words are written using log N bit write cycles, and (2) the adversary can cause arbitrary fail-stop errors either before or after the single hit write cycle of the PRAM, but not during the bit write cycle. The algorithms that assume word atomicity can be mechanically compiled into algorithms that assume only the bit atomicity as stated above. A much more important assumption in many Write-All solutions was the initial state of additional auxiliary memory used (typically of ^{P) size). The basic assumption has been that: The n ( P ) auxiliary shared memory is cleared or initialized to som,e known value.
133 While this is consistent with definitions of PRAM such as [14], it is nevertheless a requirement t h a t fault-tolerant systems ought to be able to do without. Interestingly there is an efficient deterministic procedure t h a t solves the Write-All problem even when the shared memory is contaminated, i.e., contains a r b i t r a r y values. 2.2.2.5
Interconnect issues: concurrency vs redundancy
T h e choice of CRCW (concurrent read, concurrent write) model used here is justified because of a lower bound [17] t h a t shows t h a t the C R E W (concurrent read, exclusive write) model does not a d m i t fault-tolerant efficient algorithms. However we still would like control memory access concurrency. We define measures t h a t gauge the concurrent m e m o r y accesses of a c o m p u t a t i o n . D e f i n i t i o n 2 . 2 . 5 Consider a parallel c o m p u t a t i o n with P initial processors t h a t terminates in time r after completing its task on some input d a t a / of size N in the presence of fail-stop error p a t t e r n F. If at time ^ (1 < i < r ) , P/^ processors perform reads from N^ shared memory locations a n d P^^ processors perform writes to N^^ locations, then we define: (i) the read concurrency (ii) the wmte concurrency
ρ as: ρ = ρ_{I,F,P} = Σ_{i=1}^{τ} (P_i^R - N_i^R), and
u a,s: CJ = UJI^F.P = 531=1 (-^t^ ~ -^i^)-
'-'
For a single read from (write to) a particular m e m o r y location, the read (write) concurrency p (CJ) for t h a t location is simply the n u m b e r of readers (writers) minus one. For example, if only one processor reads from (writes to) a location, t h e n p (u) is 0, i.e., no concurrency is involved. Also note t h a t the concurrency measures p and u are cumulative over a c o m p u t a t i o n . For the algorithms in the EREW model, p = u = 0, while for the CREW model, cj = 0. T h u s our measures capture one of the key distinctions a m o n g the EREW, CREW and CRCW memory access disciplines.
2.2.3
Robust parallel assignment and Write-All
2.2.3.1
W r i t e - A l l a n d initial faults
We first consider the weak model of initial (static) faults in which failures can only occur prior to the start of an algorithm. We assume t h a t the size of the Write-All instances is N a n d t h a t we have P processors, P' < P oi which are alive at the beginning of the algorithm. Our EREW a l g o r i t h m E (Fig. 2.2.2) consists of phases E l a n d E2. In phase E l , processors enumerate themselves a n d
134 01 forall processors P I D = 1 . . P parbegin 02 Phase E l : Use non-oblivious parallel prefix to compute rankpjD and P' 03 Phase E2: Set x[{rankpiD — 1) * p - . . . {rankpin * p - ) — 1] to 1 04 parend Figure 2.2.2: A high level view of a l g o r i t h m E. compute the t o t a l number of live processors. T h e details of this non-oblivious counting are in [16]. In phase E2, the processors p a r t i t i o n the i n p u t array so t h a t each processor is responsible for setting to 1 all the entries in its p a r t i t i o n . T h e o r e m 2 . 2 . 1 T h e Write-All problem with initial processor a n d memory faults can be solved in place with S — 0{N + P' log P) on an E R E W P R A M , where 1 < P < iV a n d P — P ' is the number of initial faults. W i t h the result of [7] it can be shown t h a t this algorithm is optimal, without memory access concurrency. 2.2.3.2
D y n a m i c faults and algorithm W
A more sophisticated approach is necessary to obtain an efficient parallel algor i t h m when the failures are dynamically determined by an on-line adversary. Algorithm W of [17] is an efficient fail-stop Write-All solution (Fig. 2.2.3). It uses full binary trees for processor counting, processor allocation, a n d progress measurement. Active processors synchronously iterate t h r o u g h the following four phases: Wl:
Processor enumeration. All the processors traverse b o t t o m - u p the processor enumeration tree. A version of parallel prefix a l g o r i t h m is used resulting in an overestimate of the number of live processors.
W2:
Processor allocation. All the processors traverse the progress measurement tree top-down using a divide-and-conquer approach based on processor enumeration a n d are allocated to un-written input cells.
W3:
Work phase.
W4:
Progress measurement. All the processors traverse b o t t o m - u p the progress tree using a version of parallel prefix a n d c o m p u t e an u n d e r e s t i m a t e of the progress of the algorithm.
Processors work at the leaves reached in phase W 2 .
Algorithm W achieves optimality when parameterized using a progress tree with N/ log N leaves and log N input d a t a associated with each of its leaves. By optimality we m e a n t h a t for a range of processors the work is 0{N). A
135 01 forall processors PID=l..iV parbegin 02 Phase W3: Visit leaves based on FID to work on the input data 03 Phase W4: Traverse the progress tree bottom up to measure progress 04 while the root of the progress tree is not N do 05 Phase W l : Traverse counting tree bottom up to enumerate processors 06 Phase W2: Traverse the progress tree top down to reschedule work 07 Phase W3: Perform rescheduled work on the input data 08 Phase W4: Traverse the progress tree bottom up to measure progress 09 od 10 parend Figure 2.2.3: A high level view of algorithm W,
complete description of the algorithm can be found in [17]. Martel [29] gave a tight analysis of algorithm W, T h e o r e m 2 . 2 . 2 [17, 29] Algorithm W is a robust parallel Write-All algorithm with 5 = 0{N + ^ l o g ^ i V / l o g l o g i V ) , where N is the input array size a n d the initial number of processors P is between 1 and N.
Note t h a t the above b o u n d is tight for algorithm W . This upper b o u n d was first shown in [22] for a different algorithm. T h e d a t a structuring technique [22] might lead to even better b o u n d s for Write-All.
2.2.3.3
D y n a m i c faults, detected restarts, and algorithm V
Algorithm W has efficient work when subjected to a r b i t r a r y failure p a t t e r n s without restarts a n d it can be extended to handle restarts. However, since accurate processor enumeration is impossible if processors can be restarted at any time, the work of the algorithm becomes inefficient even for some simple adversaries. On the other hand, the second phase of a l g o r i t h m W does implement efficient top-down divide-and-conquer processor assignment in O(logiV) time when p e r m a n e n t processor F I D s are used. Therefore we produce a modified version of a l g o r i t h m W, t h a t we call V. To avoid a restatement of the details, the reader is referred to [18]. V uses the optimized algorithm W d a t a structures for progress estimation a n d processor allocation. T h e processors iterate t h r o u g h the following three phases based on the phases W 2 , W 3 a n d W 4 of a l g o r i t h m W:
V1: Processors are allocated as in phase W2, but using the permanent PIDs. This assures load balancing in O(log N) time.

V2: Processors perform work, as in phase W3, at the leaves they reached in phase V1 (there are log N array elements per leaf).

V3: Processors continue from the phase V2 progress tree leaves and update the progress tree bottom up, as in phase W4, in O(log N) time.

The model assumes re-synchronization at the instruction level, and a wraparound counter based on the PRAM clock implements synchronization with respect to the phases after detected failures [18]. The work and the overhead ratio of the algorithm are as follows:

Theorem 2.2.3 [18] Algorithm V using P ≤ N processors subject to an arbitrary failure and restart pattern F of size M has work S = O(N + P log²N + M log N), and its overhead ratio is σ = O(log²N).

Algorithm V achieves optimality for a non-trivial set of parameters:

Corollary 2.2.4 Algorithm V with P ≤ N/log²N processors subject to an arbitrary failure and restart pattern of size M ≤ N/log N has S = O(N).

One problem with the above approach is that there could be a large number of restarts and a large amount of work. Algorithm V can be combined with algorithm X of the next section, or with the asymptotically better algorithm of [3], to provide better bounds on work.
Dynamic faults, undetected restarts, and algorithm X
When the failures cannot be detected, it is still possible to achieve a sub-quadratic upper bound for any dynamic failure/restart pattern. We present the Write-All algorithm X with S = O(N · P^{log 3/2}) = O(N · P^{0.59}). This simple algorithm can be improved to S = O(N · P^ε) using the method in [3]. We present X for its simplicity, and in the next section a (possible) deterministic version of [3]. Algorithm X utilizes a progress tree of size N that is traversed by the processors independently, not in synchronized phases. This reflects the local nature of the processor assignment, as opposed to the global assignments used in algorithms V and W. Each processor searches for work in the smallest subtree that has work that needs to be done. It performs the work, and moves to the next subtree.
01 forall processors PID=0..P-1 parbegin
02   Perform initial processor assignment to the leaves of the progress tree
03   while there is still work left in the tree do
04     if subtree rooted at current node u is done then move one level up
05     elseif u is a leaf then perform the work at the leaf
06     elseif u is an interior tree node then
07       Let u_L and u_R be the left and right children of u respectively
08       if the subtrees rooted at u_L and u_R are done then update u
09       elseif only one is done then go to the one that is not done
10       else move to u_L or u_R according to PID bit values
11     fi fi
12   od
13 parend

Figure 2.2.4: A high level view of algorithm X.
The algorithm is given in Fig. 2.2.4. Initially the P processors are assigned to the leaves of the progress tree (line 02). The loop (lines 03-12) consists of a multi-way decision (lines 04-11). If the current node u is marked done, the processor moves up the tree (line 04). If the processor is at a leaf, it performs work (line 05). If the current node is an unmarked interior node and both of its subtrees are done, the interior node is marked by changing its value from 0 to 1 (line 08). If a single subtree is not done, the processor moves down appropriately (line 09). For the final case (line 10), the processors move down when neither child is done. Here the processor PID is used at depth h of the tree node: based on the value of the h-th most significant bit of the binary representation of the PID, bit 0 will send the processor to the left, and bit 1 to the right. The performance of algorithm X is characterized as follows:

Theorem 2.2.5 Algorithm X with P processors solves the Write-All problem of size N (P ≤ N) in the fail-stop restartable model with work S = O(N · P^{log 3/2}). In addition, there is an adversary that forces algorithm X to perform S = Ω(N · P^{log 3/2}) work.

The algorithm views undetected restarts as delays, and it can be used in the asynchronous model, where it has the same work [8]. Algorithm X could also be useful for the case without restarts, even though its worst-case performance without restarts is no better than that of algorithm W.

Open Problem: A major open problem for the model with undetectable restarts is whether there is a robust Write-All solution, i.e., one whose work is N · polylog(N). Also, whether there is a solution with σ = polylog(N).
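The multi-way decision of Fig. 2.2.4 can be written as a per-processor step function. The following sequential Python sketch is an illustration only (the function names, the heap-indexed tree array, and the single-processor driver are assumptions); the synchronization and failures of real PRAM processors are not modeled.

    def done(tree, u):
        return tree[u] == 1

    def step(tree, pos, pid, num_leaves, work):
        """One move of a simulated processor in algorithm X.

        tree: heap-indexed array; tree[u] == 1 once the subtree rooted at u is done.
        Leaves occupy indices num_leaves .. 2*num_leaves - 1 (num_leaves a power of two).
        Returns the processor's new position in the tree.
        """
        u = pos
        if u > 1 and done(tree, u):                 # subtree done: move one level up
            return u // 2
        if u >= num_leaves:                         # at a leaf: perform the work, mark it
            work(u - num_leaves)
            tree[u] = 1
            return u
        left, right = 2 * u, 2 * u + 1
        if done(tree, left) and done(tree, right):  # both subtrees done: update u
            tree[u] = 1
            return u
        if done(tree, left):                        # exactly one not done: go to it
            return right
        if done(tree, right):
            return left
        depth = u.bit_length() - 1                  # neither done: descend by a PID bit
        height = (num_leaves - 1).bit_length()
        bit = (pid >> (height - 1 - depth)) & 1
        return left if bit == 0 else right

    # toy run: 8 leaves, one surviving processor with PID 5 sweeps the whole tree
    leaves, pid = 8, 5
    tree, visited = [0] * (2 * leaves), []
    pos = leaves + pid                              # line 02: start at own leaf
    while not done(tree, 1):
        pos = step(tree, pos, pid, leaves, visited.append)
    print(sorted(visited))                          # [0, 1, 2, 3, 4, 5, 6, 7]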
01 forall processors PID=1..√N parbegin
02   Divide the N array elements into √N work groups of √N elements
03   Each processor obtains a private permutation π_PID of {1, 2, ..., √N}
04   for i = 1..√N do
05     if the π_PID[i]-th group is not finished then
06       perform sequential work on the √N elements of the group
07       and mark the group as finished
08     fi
09   od
10 parend

Figure 2.2.5: A high level view of algorithm Y.
2.2.3.5 Dynamic faults, undetected restarts, and algorithm Y
A family of randomized Write-All algorithms was presented by Anderson and Woll [3]. The main technique in these algorithms is abstracted in Fig. 2.2.5. The basic algorithm in [3] is obtained by randomly choosing the permutation in line 03. In this case the expected work of the algorithm is O(N log N), for P = √N (assume N is a square). We propose the following way of determinizing the algorithm (see [19]): Given P = √N, we choose the smallest prime m such that P < m. Primes are sufficiently dense, so that there is at least one prime between P and 2P, and the complexity of the algorithm is therefore not distorted when P is not a prime. We then construct the multiplication table for the numbers 1, 2, ..., m-1 modulo m. Each row of this table is a permutation, and this structure is a group. The processor with PID i uses the i-th permutation as its schedule. This table need not be pre-computed, as any item can be computed directly by any processor with the knowledge of its PID and the number of work elements w it has processed thus far, as (PID · w) mod m.

Conjecture: We conjecture that the worst case work of this deterministic algorithm is no worse than the expected work of the randomized algorithm. Experimental analysis supports the conjecture. Formal analysis can be reduced to the open problem below, which contains an interesting group-theoretic aspect of the multi-processor scheduling problem [41]. In order to show that the worst case work of Y is O(N log N), it is sufficient to show that:

Given a prime m, consider the group G = ({1, 2, ..., m-1}, · (mod m)). The multiplication table for G, when the rows of the table are interpreted as permutations of {1, ..., m-1}, is a group K of order m-1 (a subgroup of the group of all permutations). Show that, for each left coset of K (with respect to the group of all permutations), the sum of the number of left-to-right maxima of all elements of the coset is O(m log m).
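The deterministic schedule can indeed be generated on the fly: on its w-th attempt, processor i visits group (i · w) mod m, where m is the smallest prime exceeding P. The Python sketch below (helper names are illustrative) prints the resulting permutations for a toy instance.

    def smallest_prime_above(p):
        """Bertrand's postulate guarantees a prime between p and 2p."""
        n = p + 1
        while any(n % d == 0 for d in range(2, int(n ** 0.5) + 1)):
            n += 1
        return n

    def schedule_entry(pid, w, m):
        """Group visited by processor pid on its w-th attempt (row pid of the mod-m table)."""
        return (pid * w) % m

    def schedule(pid, m):
        """Full permutation of {1, ..., m-1} used by processor pid (1 <= pid <= m-1)."""
        return [schedule_entry(pid, w, m) for w in range(1, m)]

    m = smallest_prime_above(4)        # toy instance with P = 4 work groups, so m = 5
    for pid in range(1, m):
        print(pid, schedule(pid, m))   # each row is a permutation of {1, 2, 3, 4}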
01 forall processors PID=1..P parbegin          P processors to clear N locations
02   Clear the initial block of N_0 = G_0 elements sequentially using P processors
03   i := 0                                     Iteration counter
04   while N_i < N do
05     Use a Write-All solution with data structures of size N_i and G_{i+1} elements
       at the leaves to clear memory of size N_{i+1} = N_i · G_{i+1}
06     i := i + 1
07   od
08 parend

Figure 2.2.6: A high level view of algorithm Z.
2.2.3.6 Bootstrapping and algorithm Z
The Write-All algorithms and simulations (e.g., [17, 22, 23, 40]), and the algorithms that can serve as Write-All solutions (e.g., the algorithms in [9, 32]), invariably assume that a linear portion of shared memory is either cleared or is initialized to known values. Starting with a non-contaminated portion of memory, these algorithms perform their computation by "consuming" the clear memory, and concurrently or subsequently clearing segments of memory needed for future iterations. We define an efficient Write-All solution that requires no clear shared memory [42]. The solution uses a bootstrap approach: in the first stage, all P processors clear an initial segment of N_0 locations in the auxiliary memory. In each subsequent stage, the P processors clear N_{i+1} = N_i · G_{i+1} memory locations using the N_i memory locations that were cleared in the previous stage. Using algorithm W and tuning the parameters N_i and G_i we obtain a solution (algorithm Z, see Fig. 2.2.6) that for any failure pattern F (|F| < P) has work O(N + P log²N / log log N) without any initialization assumption. A similar algorithm that inverts the bootstrap procedure can be used to clear the contaminated shared memory if the output must contain only the results of the intended computation. The complexity of algorithm Z⁻¹ is identical to the complexity of algorithm Z. For algorithm simulation and for transformed algorithms, the complexity cost is additive in both cases.
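The bootstrap can be summarized by its size recurrence N_{i+1} = N_i · G_{i+1}. The sketch below only computes the stage sizes and stands in for the Write-All calls; the geometrically growing choice of G_i is an assumption made for illustration, not the tuning that yields the bound of [42].

    def bootstrap_stages(N, G0=2, growth=2):
        """Sizes N_0 < N_1 < ... of the memory blocks cleared by the bootstrap.

        The first stage clears N_0 = G_0 cells sequentially; each later stage clears
        N_{i+1} = N_i * G_{i+1} cells, using the N_i already-clear cells as the
        Write-All data structures, so no pre-initialized memory is ever assumed.
        """
        sizes = [G0]
        g = G0
        while sizes[-1] < N:
            g *= growth                  # illustrative choice of G_{i+1}
            sizes.append(sizes[-1] * g)
        return sizes

    print(bootstrap_stages(10 ** 6))     # [2, 8, 64, 1024, 32768, 2097152]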
2.2.3.7 Minimizing concurrency: processor priority trees
Among the key lower bound results is the fact that no efficient fault-tolerant CREW PRAM Write-All algorithms exist [17]: if the adversary is dynamic, then any P-processor solution for the Write-All problem of size N will have (deterministic) work Ω(N · P). Thus memory access concurrency is necessary to combine efficiency and fault-tolerance. However, while most known solutions
for the Write-All problem indeed make heavy use of concurrency, the goal of minimizing concurrent access to shared memory is attainable. We gave a Write-All algorithm in [16] in which we bound the total amount of concurrency used in terms of the number of dynamic processor faults in the actual run of the algorithm. When there are no faults, our algorithm executes as an EREW PRAM, and when there are faults the algorithm differs from EREW in an amount of concurrency proportional to the number of faults. The algorithm is based on a conservative policy: concurrent reads or writes occur only when the presence of failures can be inferred, and then concurrency is allowed in proportion to the failures detected. The robust CRCW algorithm W_CR/W of [16] is based on algorithm W, and it uses processor identifiers to construct mergeable processor priority trees (PPTs), which control concurrent access to memory. During the execution, the PPTs are compacted and merged to remove faulty processors and to determine when concurrent access to memory is warranted. By taking advantage of parallel slackness and by clustering the input data into groups of size log N log P, we obtain an algorithm that has a range of optimality and that controls its memory access concurrency:

Theorem 2.2.6 Algorithm W_CR/W of [16] with input clustering is a robust Write-All algorithm with S = O(N + P log²N log log N), write concurrency ω ≤ |F|, and read concurrency ρ ≤ 7|F| log N, where 1 ≤ P ≤ N.

The basic algorithm can be extended to handle arbitrary initial memory contents [16]. It is also possible to reduce the maximum per-step memory access concurrency by polylogarithmic factors by deploying a general pipelining technique. Finally, [16] shows that there is no robust algorithm whose total write concurrency is bounded by |F|^ε for 0 < ε < 1.
2.2.4 Computing functions robustly
In this section we will work our way from the simplest to the most complicated functions with robust solutions.
2.2.4.1 Constants, booleans and Write-All
Solving a Write-All problem of size N can be viewed as computing a constant vector function. Constant scalar functions are the simplest possible functions (e.g., simpler than boolean OR and AND). At the same time, it appears
that the Write-All problem is a more difficult (vector) task than computing scalar boolean functions such as multiple-input OR and AND. In the lower bounds discussion we consider a model with memory snapshots, i.e., processors can read and process the entire shared memory in unit time. For the snapshot model there is a sharp separation between Write-All and boolean functions. Clearly any boolean function can be computed in constant time in the snapshot model, while we have a lower bound result for any Write-All solution in the snapshot model requiring work Ω(N log N / log log N). Solving a Write-All problem is no more difficult than computing any other vector function, e.g., parallel prefix. In the next subsection we also show that the best (as of this writing) Write-All solution can be used to derive a robust parallel prefix algorithm that has the same work complexity.
2.2.4.2 Parallel prefix and Write-All
Solutions for the Write-All problem can be used as building blocks for custom transformations of efficient parallel algorithms into robust algorithms [17]. Transformations are of interest because in some cases it is possible to improve on the work of oblivious simulations such as [23, 32, 40]. These improvements are most significant for fast algorithms when a full range of processors is used, i.e., when N processors are used to simulate N processors, because in this case parallel slack cannot be taken advantage of. One immediate result that improves on the available general simulations follows from the fact that algorithms V, W and X, by their definition, implement an associative operation on N values.

Theorem 2.2.7 Given any associative operation ⊕ on integers, and an integer array x[1..N], it is possible to robustly compute x[1] ⊕ x[2] ⊕ ... ⊕ x[N] using P fail-stop processors at the cost of a single application of any of the algorithms V, W or X.

This saves a full log N factor for all simulations. The savings are also possible for the important prefix sums and pointer doubling algorithms. Efficient parallel algorithms and circuits for computing prefix sums were given by Ladner and Fischer in [26], where the prefix problem is defined as follows: Given an associative operation ⊕ on a domain D, and x1, ..., xn in D, compute, for each k (1 ≤ k ≤ n), the sum x1 ⊕ ... ⊕ xk. In order to compute the prefix sums of N values using N processors, at least log N/log log N parallel steps are required [6, 27], and the known algorithms require at least log N steps. Therefore an oblivious simulation of a known prefix algorithm will require simulating at least log N steps. When using P = N processors with algorithm W (the most efficient Write-All solution as of this writing), whose work is S_W = Θ(N log²N / log log N), the work of the simulation will be Θ(N log³N / log log N). We can extend Theorem 2.2.7 to show a robust prefix algorithm whose work is the same as that of algorithm W. In the fail-stop model we have the following result, which uses as its basis an iterative version of the recursive algorithm of [26]:

Theorem 2.2.8 Parallel prefix for N values can be computed using N fail-stop processors using O(N) clear memory with S = Θ(N log²N / log log N).

A similar approach was also taken by Martel et al. [30] to produce an efficient randomized transformation of the prefix algorithm.
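The iterative structure behind this result can be illustrated with a standard N-processor prefix scan: log N rounds, each round a full-array assignment that reads one generation of the array and writes the next, which is exactly the kind of general parallel assignment a Write-All algorithm such as W can execute robustly. The sequential Python below is illustrative only; it shows the data movement of the rounds, not the algorithm of [26] or its robust version.

    import operator

    def iterative_prefix(x, op=operator.add):
        """log N rounds; each round reads generation `cur` and writes generation `new`,
        mirroring the two-generation general parallel assignment used for robustness."""
        n = len(x)
        cur = list(x)
        d = 1
        while d < n:
            new = list(cur)                   # next generation of the array
            for i in range(d, n):             # one assignment per simulated processor
                new[i] = op(cur[i - d], cur[i])
            cur = new
            d *= 2
        return cur

    print(iterative_prefix([3, 1, 4, 1, 5, 9, 2, 6]))   # [3, 4, 8, 9, 14, 23, 25, 31]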
2.2.4.3 List ranking
Another important improvement for the fail-stop case is for the pointer doubling operation that is used in many parallel algorithms. The robust algorithm is implemented using a variation of algorithm W and the standard pointer doubling algorithm. We associate each list element with a progress tree leaf. In the work phase of algorithm W we double pointers and update distances. The log N pointer doubling operations in the work phase make log N/log log N overall iterations sufficient, with each iteration performing the same work S_W as algorithm W.

Theorem 2.2.9 There is a robust list ranking algorithm for the fail-stop model with S = Θ((log N / log log N) · S_W(N, P)), where N is the input list size and S_W(N, P) is the complexity of algorithm W for the initial number of processors P, 1 ≤ P ≤ N.
2.2.4.4 General Parallel Assignment
Consider computing and storing in an array x[1..N] the values of a vector function f that depend on the PIDs and the initial values of the array x. Assume each of the N scalar components of f can be computed in O(1) sequential time. This is the general parallel assignment problem.
forall processors PID = 1..N parbegin
  shared integer array x[1..N];
  x[PID] := f(PID, x[1..N])
parend

In [17] a general technique was shown for making this operation robust using the same work as required by Write-All. We modify the assignment so that it remains correct when processors fail and when multiple attempts are made to execute the assignment (assuming the surviving processors can be reassigned to the tasks of faulty processors). This is done using binary version numbers and two generations of the array:
forall processors PID = 1..N parbegin
  shared integer array x[0..1][1..N]; bit integer v;
  x[v+1][PID] := f(PID, x[v][1..N]); v := v + 1
parend
Here, bit v is the current version number or tag (mod 2), so that x[v][1..N] is the array of current values. Function f will use only these values of x as its input. The values of f are stored in x[v+1][1..N], creating the next generation of the array x. After all the assignments are performed, the binary version number is incremented (mod 2). At this point, a simple transformation of any Write-All algorithm, with the modified general parallel assignment replacing the trivial "x[i] := 1" assignment, will yield a robust N-processor algorithm:

Theorem 2.2.10 The asymptotic work complexities of solving the general parallel assignment problem and the Write-All problem are equal.
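A sequential sketch of the versioned assignment just described: every write goes into generation v+1 while every read comes from generation v, so re-executing the assignment for a failed processor (possibly several times, in any order) cannot corrupt the inputs. The function and variable names below are illustrative.

    def robust_parallel_assignment(x, f, attempts):
        """x: list of two generations x[0], x[1]; f(pid, cur) computes component pid.
        `attempts` is any sequence of PIDs, possibly with repetitions and in any order,
        standing in for re-executions after failures; every PID must appear at least once."""
        v = 0
        cur, nxt = x[v], x[(v + 1) % 2]
        for pid in attempts:                 # duplicated work is harmless (idempotent)
            nxt[pid] = f(pid, cur)
        v = (v + 1) % 2                      # all components written: advance the version
        return v

    # toy use: f reverses the array; PID 2's work is (harmlessly) done twice
    x = [[10, 20, 30, 40], [None] * 4]
    v = robust_parallel_assignment(x, lambda pid, cur: cur[len(cur) - 1 - pid],
                                   attempts=[0, 2, 1, 2, 3])
    print(x[v])    # [40, 30, 20, 10]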
2.2.4.5 Any PRAM steps
The original motivation for studying the Write-All problem was that it captured the essence of a single PRAM step computation. It was shown in [23, 40] how to use the Write-All paradigm in implementing general PRAM simulations. The generality of this result is somewhat surprising. Fail-stop faults: An approach to such simulations is given in Fig. 2.2.7. The simulations are implemented by robustly executing each of the cycles of the PRAM step: instruction fetch, read, compute, and write cycles, and next instruction address computation. This is done using two generations of shared
01 forall processors PID=1..P parbegin          Simulate N fault-prone processors
02   The PRAM program for N processors is in shared memory (read-only)
03   Shared memory has two generations: current and future
04   Initialize N simulated instruction counters to start at the first instruction
05   while there is a simulated processor that has not halted do
06     Tentative computation: Fetch instruction; Copy registers to scratchpad
07     Perform read cycle using current memory
08     Perform the compute cycle using scratchpad
09     Perform write cycle into future memory
10     Compute next instruction address
11     Reconcile memory and registers: Copy future locations to current
12   od
13 parend

Figure 2.2.7: Simulations using the Write-All primitive.
memory, "current" and "future", and by executing each of these cycles in the general parallel assignment style, e.g., using algorithm W. Using such techniques it was shown in [23, 40] that if S_W(N, P) is the efficiency of solving a Write-All instance of size N using P processors, and if a linear amount of clear memory is available, then any N-processor PRAM step can be deterministically simulated using P fail-stop processors and work S_W(N, P). If the Parallel-time × Processors product of an original N-processor algorithm is τ · N, then the work of the fault-tolerant simulation will be O(τ · S_W(N, P)). The simulation in the fail-stop model is optimal for a wide range of processors [40]. The following theorem might have some practical significance, given the constant overhead.

Theorem 2.2.11 Any N-processor PRAM algorithm can be optimally simulated (with constant overhead) on a fail-stop P-processor CRCW PRAM, when P ≤ N log log N / log²N. EREW, CREW, and WEAK and COMMON CRCW PRAM algorithms are simulated on fail-stop COMMON CRCW PRAMs; ARBITRARY, PRIORITY and STRONG CRCW PRAMs are simulated on fail-stop PRAMs of the same type.

When the full range of simulating processors is used (N = P), optimality is not achievable. In this case customized transformations of parallel algorithms (such as our prefix and list ranking algorithms) may improve on the oblivious simulations. Note that Theorem 2.2.11 also holds when the failed processors are restarted during the simulation between the individual Write-All steps.
Initial faults: Algorithm E can be used for simulations of EREW PRAM algorithms on fail-stop EREW PRAMs [16]. Simulations are much simpler for this case as compared to the dynamic failures case. The computational overhead of such simulations is additive. This simulation is optimal when P'· τ = Ω(P log P).

Theorem 2.2.12 Any P-processor, τ parallel time EREW PRAM algorithm can be robustly simulated on a fail-stop EREW PRAM that is subject to static initial processor and memory faults. The work of the simulation is P'· τ + O(P log P), where P' is the number of live processors.

Fail-stop faults with detectable restarts: There is a broad range of parameters for which the work performed in executing a parallel algorithm on a faulty PRAM is asymptotically equal to the Parallel-time × Processors product for that algorithm.

Theorem 2.2.13 Any N-processor PRAM algorithm can be executed on a fail-stop P-processor CRCW PRAM with detectable restarts, with P ≤ N. Each N-processor PRAM step is executed in the presence of any pattern F of failures and restarts of size M with S = O(min{N + P log²N + M log N, N · P^{log 3/2}}), and overhead ratio σ = O(log²N). EREW, CREW, and WEAK and COMMON CRCW PRAM algorithms are simulated on fail-stop COMMON CRCW PRAMs; ARBITRARY and STRONG CRCW PRAMs are simulated on fail-stop PRAMs of the same type.

Fail-stop faults with undetectable restarts: When the failures are undetectable, deterministic simulations become difficult, due to the possibility of processors delayed by failures writing stale values to shared memory. Fortunately, for fast polylogarithmic-time parallel algorithms we can solve this problem by using polylogarithmically more memory. We simply provide as many "future" generations of memory as there are PRAM steps to simulate. Processor registers are stored in shared memory along with each generation of shared memory. Prior to starting a parallel step simulation, a processor uses binary search to find the newest simulated step. When reading, a processor linearly searches past generations of memory to find the latest written value. In the result below we use the existential algorithm of [3].

Theorem 2.2.14 Any N-processor, log^{O(1)}N-time, M-memory PRAM algorithm can be deterministically executed on a fail-stop P-processor CRCW PRAM (P ≤ N) with undetectable restarts, using shared memory M · log^{O(1)}N. Each N-processor PRAM step is executed in the presence of any pattern F of failures and undetected restarts with S = O(N^{1+ε}).
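The memory discipline for undetectable restarts can be sketched as follows (a sequential illustration with assumed data structures, not the simulation of the theorem above): one write generation per simulated step, so a stale writer can only touch its own step's generation, and a read scans backwards for the most recent generation in which the location was actually written.

    class GenerationalMemory:
        """One write area per simulated PRAM step; stale writers cannot clobber newer values."""

        def __init__(self, size, steps):
            self.gen = [[None] * size for _ in range(steps + 1)]
            self.gen[0] = [0] * size          # generation 0 is the (clear) initial memory

        def write(self, step, addr, value):
            self.gen[step + 1][addr] = value  # step t writes only into generation t + 1

        def read(self, step, addr):
            # linear search through past generations for the latest written value
            for g in range(step, -1, -1):
                if self.gen[g][addr] is not None:
                    return self.gen[g][addr]
            return 0

    mem = GenerationalMemory(size=4, steps=3)
    mem.write(step=0, addr=2, value=7)        # written during step 0
    mem.write(step=2, addr=2, value=9)        # overwritten two steps later
    print(mem.read(step=1, addr=2), mem.read(step=3, addr=2))   # 7 9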
2.2.5 Computing relations and approximate Write-All
Here we show that computing some relations robustly is easier than computing functions robustly. Consider the majority relation M: given an array x[1..N], x is in M when |{x[i] : x[i] = 1}| > N/2. C. Dwork observed that the Ω(N log N) lower bound [22] on solving Write-All using N processors also applies to producing a member of M in the presence of failures. It turns out that O(N log N) work is also sufficient to compute a member of the majority relation. Let us parameterize the majority problem in terms of the approximate Write-All problem by using a quantity ε such that 0 < ε < 1/2; thus we would like to initialize at least (1 - ε)N array locations to 1. We call this problem AWA(ε). Surprisingly, algorithm W has the desired property:

Theorem 2.2.15 Given any constant ε such that 0 < ε < 1/2, algorithm W solves the AWA(ε) problem with S = O(N log N) using N processors.

If we choose ε = 1/2^k (k = const) and then iterate this Write-All algorithm log log N times, the number of unvisited leaves will be N · ε^{log log N} = N · (log N)^{log ε} = N · (log N)^{-k} = N / log^k N. Thus we can get even closer to solving the Write-All problem:

Theorem 2.2.16 For each k = const, there is a robust AWA(1/log^k N) algorithm that has work S = O(N log N log log N).
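A quick numerical check of the iteration argument for ε = 1/2^k: after about log log N iterations the unvisited count falls to roughly N/(log N)^k. The helper below is only a back-of-the-envelope calculation, not an implementation of algorithm W.

    import math

    def unvisited_bound(N, k):
        """Upper bound N * eps^(log log N) on unvisited leaves, for eps = 2^(-k)."""
        eps = 2.0 ** (-k)
        iters = math.log2(math.log2(N))
        return N * eps ** iters            # = N * (log N)^(-k) = N / (log N)^k

    N, k = 2 ** 20, 2
    print(unvisited_bound(N, k), N / math.log2(N) ** k)   # both are approximately 2621.44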
2.2.6 Lower bounds
The strongest known lower bound for Write-All was derived by Kedem, Palem, Raghunathan and Spirakis in [22].

Theorem 2.2.17 [22] Given any P-processor CRCW PRAM algorithm for the Write-All problem of size N, an adversary can force fail-stop (no restart) errors that result in N + Ω(P log N) (where P ≤ N) steps being performed.

Recently, Martel and Subramonian [31] have extended the Kedem et al. deterministic lower bound [22] to randomized algorithms against oblivious adversaries. It is open whether this lower bound applies to the static fault case. It was shown in [17] that no optimal solutions for the Write-All problem exist that use the range of processors 1 ≤ P ≤ N, even when the processors can take instant memory snapshots, i.e., processors can read and locally process the
entire shared memory at unit cost. The lower bound below applies to fail-stop, deterministic or randomized, PRAMs, and it is the strongest possible bound under the memory snapshots assumption, i.e., there is a matching upper bound.

Theorem 2.2.18 [17] Given any N-processor CRCW PRAM algorithm for the Write-All problem of size N, an adversary can force fail-stop errors that result in Ω(N log N / log log N) steps being performed, even if the processors can read and locally process all shared memory at unit cost.

When restarts are introduced, we show the following result, which is also the strongest possible result under the snapshot assumption [8]:

Theorem 2.2.19 Given any P-processor CRCW PRAM algorithm that solves the Write-All problem of size N (P ≤ N), an adversary (that can cause arbitrary processor failures and restarts) can force the algorithm to perform N + Ω(P log P) work steps.

The next result shows that CRCW is necessary to achieve efficient solutions to the Write-All problem. In the absence of failures, any P-processor CREW (concurrent read, exclusive write) or EREW (exclusive read, exclusive write) PRAM can simulate a P-processor CRCW PRAM with only a factor of O(log P) more parallel work [20]. However, a more severe difference exists between CRCW and CREW PRAMs (and thus also EREW PRAMs) when the processors are subject to failures.

Theorem 2.2.20 Given any deterministic or randomized N-processor CREW PRAM algorithm for the Write-All problem, the adversary can force fail-stop errors that result in Ω(N²) steps being performed, even if the processors can read and locally process all shared memory at unit cost.

For CREW PRAMs, Martel and Subramonian [31] show a randomized algorithm with expected work of only O(N log N) for P = N.
2.2.7 A complexity classification

2.2.7.1 Efficient parallel computation
Many efficient parallel algorithms can be used to show problem membership in the class NC (of polylog time and a polynomial number of processors [35]). The converse is not necessarily true. This is because the algorithms in NC allow for polynomial inefficiency in work [25] - the algorithms are fast (polylogarithmic
time), but the computational agent can be large (polynomial) relative to the size of the problem [35]. A characterization of parallel algorithm efficiency that takes into account both the parallel time and the size of the computational resource is defined by Vitter and Simmons [44] and expanded on by Kruskal et al. [25]. The complexity classes in [25] are defined with respect to the time complexity T(N) of the best sequential algorithm for a problem of size N - this is analogous to the definition of robustness. Each class is characterized in terms of parallel time τ(N) and parallel work τ(N) · P(N). We give these class definitions below, but instead of failure-free work, we use the overhead ratio σ, which for the failure-free case is simply τ(N) · P(N) / T(N):

Let A be a problem with sequential (RAM) time complexity T(N). A parallel algorithm that solves an N-size instance of A using P(N) processors in τ(N) time belongs to the class:

ENC: if τ(N) = log^{O(1)}(T(N)) and σ = O(1).
EP:  if τ(N) ≤ T(N)^ε (const ε < 1) and σ = O(1).
ANC: if τ(N) = log^{O(1)}(T(N)) and σ = log^{O(1)}(T(N)).
AP:  if τ(N) ≤ T(N)^ε (const ε < 1) and σ = log^{O(1)}(T(N)).
SNC: if τ(N) = log^{O(1)}(T(N)) and σ = T(N)^{O(1)}.
SP:  if τ(N) ≤ T(N)^ε (const ε < 1) and σ = T(N)^{O(1)}.

2.2.7.2 Closures under failures
We now define criteria for evaluating whether an algorithm transformation preserves the efficiency of the algorithms for each of the classes above. To use time complexity in comparisons, we need to introduce a measure of time for fault-tolerant algorithms. In a fault-prone environment, a time metric is meaningful provided that a significant number of processors are still active. Here we use the worst case time provided a linear number of processors are active during the computation. This is our weak survivability assumption. Without this assumption, all one can conclude about the running time is that it is no better than the time of the best sequential algorithm, since the number of active processors might become quite small. We assume P is a polynomial in N (note that until now we generally assumed P ≤ N); then log P = O(log N). We now state the definition:
Class | Time with at least cP processors:   | Overhead σ:         | Closed under φ?
      | O(τ(N) log²N / log log N)           | O(log^{O(1)} N)     |
ENC   | = log^{O(1)}(T(N))                  | > O(1)              | No
EP    | = O(T(N)^ε)                         | > O(1)              | No
ANC   | = log^{O(1)}(T(N))                  | = log^{O(1)}(T(N))  | Yes
AP    | = O(T(N)^ε)                         | = log^{O(1)}(T(N))  | Yes
SNC   | = log^{O(1)}(T(N))                  | = T(N)^{O(1)}       | Yes
SP    | = O(T(N)^ε)                         | = T(N)^{O(1)}       | Yes

Table 2.2.1: Closure under the fail-stop transformation φ.
Definition 2.2.6 Let C_{τ,w} be a class with parallel time in the complexity class τ and parallel work in the complexity class w. We say that C_{τ,w} is closed with respect to a fault-tolerant transformation φ if for any algorithm A in C_{τ,w}: (1) the overhead σ of φ(A) is such that σ · τ · P is in w, and (2) when the number of active processors at any point of the computation is at least cP for a constant c > 0, the running time t is in τ. □

In the fail-stop model without restarts, given any algorithm A, let φ(A) be the fault-tolerant algorithm that can be constructed as either a simulation or a transformation. Using, for example, algorithm W as the basis for transforming non-fault-tolerant algorithms, we have the following: (1) the multiplicative overhead in work is O(log²N / log log N), and so the worst case overhead σ is O(log²N / log log N) = log^{O(1)} N, and the worst case work of the fault-tolerant version φ(A) is σ · τ(N) · P; (2) algorithm W terminates in S_W/cP = O(log²N / log log N) time when at least cP processors are active, therefore if the parallel time of algorithm A is τ(N), then the parallel time of execution for φ(A) using at least cP active processors is O(τ(N) log²N / log log N). The resulting closure properties of the classes of [25] under our fail-stop transformation φ are summarized in Table 2.2.1.

In the fail-stop model with detectable restarts, for any algorithm A, let ρ(A) be the fault-tolerant algorithm constructed using any of our techniques. In this model we provide existential closure properties by taking advantage of the existential result of Anderson and Woll [3], who showed that for every ε > 0, there exists a deterministic algorithm for P processors that simulates P instructions with O(P^{1+ε}) work. Given the algorithm of [3], we interleave it with algorithm V, for example, so that the overhead σ of the combined algorithm
is O(log²N). Table 2.2.2 gives the closure properties under the restartable fail-stop transformation ρ. Note that, due to the lower bounds for the Write-All problem, the entries marked "No" mean non-closure, while an "Unknown" entry means that closure is not achieved with the known results.

Class | Time with at least cP processors:   | Overhead σ:         | Closed under ρ?
      | O(τ(N) · P^ε)                       | O(log²N)            |
ENC   | > log^{O(1)}(T(N))                  | > O(1)              | No
EP    | = O(T(N)^ε)                         | > O(1)              | No
ANC   | > log^{O(1)}(T(N))                  | = log^{O(1)}(T(N))  | Unknown
AP    | = O(T(N)^ε)                         | = log^{O(1)}(T(N))  | Yes
SNC   | > log^{O(1)}(T(N))                  | = T(N)^{O(1)}       | Unknown
SP    | = O(T(N)^ε)                         | = T(N)^{O(1)}       | Yes

Table 2.2.2: Closure under the restartable fail-stop transformation ρ.
2.2.8 Discussion: on randomization and approximation
We have presented an overview of the theory of efficient and fault-tolerant parallel algorithms. Our focus has been deterministic algorithms, partly because our work has concentrated on this topic, but also because many deterministic techniques exist for the problems of interest. We close our exposition with an observation (by D. Michailidis) that illustrates the power of randomization (vs. determinism). As we described above, deterministic Write-All solutions require logarithmic time. This is true even for approximate Write-All. However:

Theorem 2.2.21 The approximate Write-All problem (AWA) of size N, where the number of locations to be written is N' = αN and the number of surviving processors is at least βN, for some constants 0 < α, β < 1, can be solved probabilistically (the error is Monte Carlo) on a CRCW PRAM with O(N) expected work in O(1) parallel steps.

Randomization is an important algorithmic tool which has had extensive and fruitful application to fault-tolerance, e.g., [36]. Probabilistic techniques have played a key role in the analysis of asynchronous parallel computing - see, for example, [4, 5, 9, 10, 15, 21, 22, 23, 30, 32, 34]. Note, however, that it is often hard to compare the analytical bounds of deterministic vs. randomized algorithms, since much of the randomized analysis is done using an oblivious adversary assumption.
151 Randomized algorithms often achieve better practical performance than deterministic ones, even when their analytical bounds are similar. Future developments in asynchronous parallel computation will employ randomization as well as the array of deterministic techniques surveyed here.
Bibliography

[1] M. Ajtai, J. Aspnes, C. Dwork, O. Waarts, "The Competitive Analysis of Wait-Free Algorithms and its Application to the Cooperative Collect Problem", manuscript, 1993.

[2] G. B. Adams III, D. P. Agrawal, H. J. Siegel, "A Survey and Comparison of Fault-Tolerant Multistage Interconnection Networks", IEEE Computer, vol. 20, no. 6, pp. 14-29, 1987.

[3] R. Anderson, H. Woll, "Wait-Free Parallel Algorithms for the Union-Find Problem", in Proc. of the 23rd ACM Symp. on Theory of Computing, pp. 370-380, 1991.

[4] Y. Aumann and M. O. Rabin, "Clock Construction in Fully Asynchronous Parallel Systems and PRAM Simulation", in Proc. of the 33rd IEEE Symposium on Foundations of Computer Science, pp. 147-156, 1992.

[5] Y. Aumann, Z. M. Kedem, K. V. Palem, M. O. Rabin, "Highly Efficient Asynchronous Execution of Large-Grained Parallel Programs", in Proc. of the 34th IEEE Symposium on Foundations of Computer Science, pp. 271-280, 1993.

[6] P. Beame and J. Hastad, "Optimal Bounds for Decision Problems on the CRCW PRAM", Journal of the ACM, vol. 36, no. 3, pp. 643-670, 1989.

[7] P. Beame, M. Kik and M. Kutylowski, "Information Broadcasting by Exclusive Read PRAMs", manuscript, 1992.

[8] J. Buss, P. C. Kanellakis, P. Ragde, A. A. Shvartsman, "Parallel Algorithms with Processor Failures and Delays", Brown Univ. TR CS-91-54, August 1991.

[9] R. Cole and O. Zajicek, "The APRAM: Incorporating Asynchrony into the PRAM Model", in Proc. of the 1989 ACM Symp. on Parallel Algorithms and Architectures, pp. 170-178, 1989.

[10] R. Cole and O. Zajicek, "The Expected Advantage of Asynchrony", in Proc. of the 2nd ACM Symp. on Parallel Algorithms and Architectures, pp. 85-94, 1990.

[11] R. DePrisco, A. Mayer, M. Yung, "Time-Optimal Message-Optimal Work Performance in the Presence of Faults", manuscript, 1994.

[12] C. Dwork, J. Halpern, O. Waarts, "Accomplishing Work in the Presence of Failures", in Proc. of the 11th ACM Symposium on Principles of Distributed Computing, pp. 91-102, 1992.

[13] D. Eppstein and Z. Galil, "Parallel Techniques for Combinatorial Computation", Annual Computer Science Review, vol. 3, pp. 233-283, 1988.

[14] S. Fortune and J. Wyllie, "Parallelism in Random Access Machines", in Proc. of the 10th ACM Symposium on Theory of Computing, pp. 114-118, 1978.
[15] P. Gibbons, "A More Practical PRAM Model", in Proc. of the 1989 ACM Symposium on Parallel Algorithms and Architectures, pp. 158-168, 1989.

[16] P. C. Kanellakis, D. Michailidis, A. A. Shvartsman, "Controlling Memory Access Concurrency in Efficient Fault-Tolerant Parallel Algorithms", in Proc. of the 7th Int'l Workshop on Distributed Algorithms, pp. 99-114, 1993.

[17] P. C. Kanellakis and A. A. Shvartsman, "Efficient Parallel Algorithms Can Be Made Robust", Distributed Computing, vol. 5, no. 4, pp. 201-217, 1992; prelim. version in Proc. of the 8th ACM PODC, pp. 211-222, 1989.

[18] P. C. Kanellakis and A. A. Shvartsman, "Efficient Parallel Algorithms on Restartable Fail-Stop Processors", in Proc. of the 10th ACM Symposium on Principles of Distributed Computing, 1991.

[19] P. C. Kanellakis and A. A. Shvartsman, "Robust Computing with Fail-Stop Processors", in Proc. of the Second Annual Review and Workshop on Ultradependable Multicomputers, Office of Naval Research, pp. 55-60, 1991.

[20] R. M. Karp and V. Ramachandran, "A Survey of Parallel Algorithms for Shared-Memory Machines", in Handbook of Theoretical Computer Science (ed. J. van Leeuwen), vol. 1, North-Holland, 1990.

[21] Z. M. Kedem, K. V. Palem, M. O. Rabin, A. Raghunathan, "Efficient Program Transformations for Resilient Parallel Computation via Randomization", in Proc. of the 24th ACM Symp. on Theory of Computing, pp. 306-318, 1992.

[22] Z. M. Kedem, K. V. Palem, A. Raghunathan, and P. Spirakis, "Combining Tentative and Definite Executions for Dependable Parallel Computing", in Proc. of the 23rd ACM Symposium on Theory of Computing, pp. 381-390, 1991.

[23] Z. M. Kedem, K. V. Palem, and P. Spirakis, "Efficient Robust Parallel Computations", in Proc. of the 22nd ACM Symp. on Theory of Computing, pp. 138-148, 1990.

[24] C. P. Kruskal, L. Rudolph, M. Snir, "Efficient Synchronization on Multiprocessors with Shared Memory", ACM Trans. on Programming Languages and Systems, vol. 10, no. 4, pp. 579-601, 1988.

[25] C. P. Kruskal, L. Rudolph, M. Snir, "A Complexity Theory of Efficient Parallel Algorithms", Theoretical Computer Science, vol. 71, pp. 95-132, 1990.

[26] R. E. Ladner, M. J. Fischer, "Parallel Prefix Computation", Journal of the ACM, vol. 27, no. 4, pp. 831-838, 1980.

[27] M. Li and Y. Yesha, "New Lower Bounds for Parallel Computation", Journal of the ACM, vol. 36, no. 3, pp. 671-680, 1989.

[28] A. Lopez-Ortiz, "Algorithm X Takes Work Ω(n log²n / log log n) in a Synchronous Fail-Stop (No Restart) PRAM", unpublished manuscript, 1992.

[29] C. Martel, personal communication, March 1991.

[30] C. Martel, A. Park, and R. Subramonian, "Work-Optimal Asynchronous Algorithms for Shared Memory Parallel Computers", SIAM Journal on Computing, vol. 21, pp. 1070-1099, 1992.

[31] C. Martel and R. Subramonian, "On the Complexity of Certified Write-All Algorithms", to appear in Journal of Algorithms (a prelim. version in Proc. of the 12th Conference on Foundations of Software Technology and Theoretical Computer Science, New Delhi, India, December 1992).

[32] C. Martel, R. Subramonian, and A. Park, "Asynchronous PRAMs are (Almost) as Good as Synchronous PRAMs", in Proc. of the 32nd IEEE Symposium on Foundations of Computer Science, pp. 590-599, 1990.

[33] J. Naor, R. M. Roth, "Constructions of Permutation Arrays for Certain Scheduling Cost Measures", manuscript, 1993.

[34] N. Nishimura, "Asynchronous Shared Memory Parallel Computation", in Proc. of the 3rd ACM Symp. on Parallel Algorithms and Architectures, pp. 76-84, 1990.

[35] N. Pippenger, "On Simultaneous Resource Bounds", in Proc. of the 20th IEEE Symposium on Foundations of Computer Science, pp. 307-311, 1979.

[36] M. O. Rabin, "Efficient Dispersal of Information for Security, Load Balancing and Fault Tolerance", Journal of the ACM, vol. 36, no. 2, pp. 335-348, 1989.

[37] D. B. Sarrazin and M. Malek, "Fault-Tolerant Semiconductor Memories", IEEE Computer, vol. 17, no. 8, pp. 49-56, 1984.

[38] R. D. Schlichting and F. B. Schneider, "Fail-Stop Processors: An Approach to Designing Fault-Tolerant Computing Systems", ACM Transactions on Computer Systems, vol. 1, no. 3, pp. 222-238, 1983.

[39] J. T. Schwartz, "Ultracomputers", ACM Transactions on Programming Languages and Systems, vol. 2, no. 4, pp. 484-521, 1980.

[40] A. A. Shvartsman, "Achieving Optimal CRCW PRAM Fault-Tolerance", Information Processing Letters, vol. 39, no. 2, pp. 59-66, 1991.

[41] A. A. Shvartsman, Fault-Tolerant and Efficient Parallel Computation, Ph.D. dissertation, Brown University, Tech. Rep. CS-92-23, 1992.

[42] A. A. Shvartsman, "Efficient Write-All Algorithm for Fail-Stop PRAM Without Initialized Memory", Information Processing Letters, vol. 44, no. 6, pp. 223-231, 1992.

[43] R. E. Tarjan, U. Vishkin, "Finding Biconnected Components and Computing Tree Functions in Logarithmic Parallel Time", in Proc. of the 25th IEEE FOCS, pp. 12-22, 1984.

[44] J. S. Vitter, R. A. Simmons, "New Classes for Parallel Complexity: A Study of Unification and Other Complete Problems for P", IEEE Trans. Comput., vol. 35, no. 5, 1986.
SECTION 3
DOMAIN-SPECIFIC PARADIGMS FOR REAL-TIME SYSTEMS
SECTION 3.1
Use of Imprecise Computation to Enhance Dependability of Real-Time Systems

Jane W. S. Liu, Kwei-Jay Lin, Riccardo Bettati, David Hull and Albert Yu

Jane W. S. Liu, Riccardo Bettati, and David Hull are at the Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801. Kwei-Jay Lin is at the Department of Electrical and Computer Engineering, University of California, Irvine, Irvine, California. Albert Yu is at Hughes Aircraft, Radar Systems Group, Los Angeles, California 90009. This work was partially supported by the U.S. Navy ONR contracts No. NVY N00014-89-J-1181 and No. NVY N00014-89-J-1146.
Abstract

In a system based on the imprecise-computation technique, each time-critical task is designed in such a way that it can produce a usable, approximate result in time whenever a failure or overload prevents it from producing the desired, precise result. This section describes ways to use this technique together with traditional fault-tolerance methods to reduce the costs of providing fault tolerance and enhanced availability. Specifically, an imprecise mechanism for the generation and use of approximate results can be integrated in a natural way with traditional checkpointing and replication mechanisms. Algorithms and process structures for this integration and rules for determining when approximate results can be used in place of the desired results are discussed.
3.1.1 Introduction

The imprecise computation technique was proposed as a way to handle transient overloads in real-time systems [1-3]. Here, by real-time system, we mean a computing and communication system in which a significant portion of the tasks have deadlines. The term task refers to a unit of work to be scheduled and executed. A task may be the computation of a control law, the transmission of an operator command, the retrieval of a file, etc. As the result of its execution, each task delivers some data or service. The failure of a time-critical task to deliver its result by its deadline is a timing fault. A real-time system functions correctly only in the absence of timing faults. An example is the TCAS (Traffic Alert and Collision Avoidance System) used in commercial aircraft to alert pilots of potential collisions. The command telling the pilot of the conflict traffic and the necessary evasive action not only must be correct but also must be issued in time. Other examples of real-time systems include flight control and management, intelligent manufacturing, and various monitoring systems.

For many real-time applications, we may prefer to have approximate results of a poorer but acceptable quality on a timely basis to late results of the desired quality. For example, it is better for a collision avoidance system to issue a timely warning together with an estimated location of the conflict traffic than a late command specifying the exact evasive action. Other examples are video and voice transmissions. While poor quality images and voices may be tolerable, late frames and long silences are often not. The imprecise computation technique was motivated by this observation and the fact that good approximate results can often be produced with much less processor time than results of the desired quality. By trading the quality of the results for the amount of time and resource required to produce them, a system based on this technique tries to make approximate results of acceptable quality available whenever it cannot produce results of the desired quality in time. Hereafter, we will refer to such a system as an imprecise system and a result of the desired quality a precise result.

An imprecise computation is a generalization of an anytime computation, a term used in the AI literature. The focus of the AI community has been on anytime computational algorithms and reasoning about the results produced by them (e.g., [4-6]). On the other hand, the work by the real-time systems community has been concerned with how to structure the operating and run-time systems to take advantage of the scheduling flexibility provided by imprecise computations. In recent years, several models have been developed to characterize the behavior of imprecise computations and to quantify the costs and benefits of the tradeoff between result quality and required processing time for different classes of applications and implementation methods. Many efficient scheduling algorithms that achieve the optimal or suboptimal tradeoff are now available; examples are those described in [7-18]. These algorithms have made the imprecise computation approach to overload handling feasible.

The imprecise computation technique is also a natural means for enhancing fault tolerance and graceful degradation of real-time systems. We consider here only hardware faults and transient software faults. Permanent software faults are not considered. To see how availability and fault tolerance can be increased in an imprecise system, we consider tracking and control systems for example. A transient fault may cause a tracking computation to terminate prematurely and produce an approximate result. No recovery action is needed if the result allows the system to maintain track of the targets.
Similarly, as long as the approximate result produced by a control law computation is sufficiently accurate for the controlled system to remain stable, the fault that causes the computation to terminate prematurely can be tolerated. In embedded systems, this technique can be used together with the traditional replication and checkpointing techniques (e.g., [19,20]). The result is a reduction of the costs of providing fault tolerance and enhanced availability.

This section describes an architecture that integrates the imprecision mechanism for the storage and return of intermediate, approximate results of computations with the fault-tolerant mechanisms that support traditional checkpointing and replication. The Imprecise Computation Server (ICS) system is being built on the Mach operating system to implement this architecture. ICS will make it easy for us to implement imprecise computations and experiment with them. When it is completed, we plan to implement several representative applications using the ICS system in order to evaluate experimentally the effectiveness of the imprecise computation technique. The applications we plan to implement and experiment with include multimedia data transmission, direct digital control and optimal control, and database queries and updates. A reason for choosing these types of applications for in-depth examination is that they require different imprecision management rules. These rules govern whether approximate results produced by prematurely terminated tasks are acceptable and what error recovery operations are to be carried out when the results are not acceptable.

Following this introduction, Section 3.1.2 gives a brief overview of the different ways to implement imprecise computations, the workload models used to characterize them, and algorithms to schedule them. Section 3.1.3 describes an architectural framework for the integration of an imprecision mechanism and a traditional checkpointing mechanism, as well as the ICS system built on this framework. Section 3.1.4 describes algorithms for scheduling replicated imprecise computations and Section 3.1.5 a process structure that supports imprecision and replication-masking. Section 3.1.6 discusses the imprecision management rules for different types of applications. Future extensions are discussed at the end of each section. Section 3.1.7 is a summary.
3.1.2 Imprecise Computation

An imprecise system can be represented abstractly by a variation of the precedence graph in Figure 3.1.1. Specifically, this graph represents an imprecise system T as a set of related tasks. Each task is represented by a node. Nodes are depicted by boxes of all shapes in this figure. Tasks may be dependent; data and control dependencies between tasks impose precedence constraints on the order of their execution. There is an edge from a node Ti to a node Tj if the task Ti must be completed before Tj can begin execution.
Figure 3.1.1: General Imprecise Computation Model

Implementation Methods

In an imprecise system, some tasks are identified by the programmer as mandatory. Like tasks in traditional real-time systems, mandatory tasks must be executed and completed by their deadlines in order for the system to function correctly. These tasks are shown as solid boxes in Figure 3.1.1. The programmer identifies some less important tasks as optional, meaning that these tasks can be skipped (i.e., not executed) without causing intolerable degradation in the output of the system. Optional tasks are represented by dotted, shaded boxes in Figure 3.1.1. The system may skip optional tasks during overload so that mandatory tasks can complete in time.

Sieve Method and the 0/1 Constraint

For example, in radar signal processing the task that estimates the background noise level in the received signal can be skipped.
An old estimate of noise level can be used when a new estimate cannot be produced in time. Therefore this task is optional. Similarly, in a flight management system the step that updates the estimated time of arrival can be skipped when the system is busy dealing with local turbulence and nearby traffic and is, therefore, optional. In general, we call a task whose sole purpose is to produce a result that is at least as precise as the corresponding input or an old output a sieve function. A sieve can be skipped in order to save time and resource. When it is skipped, the result is less precise. Conversely, when a sieve completes in time, it improves the quality of the result. This method for producing approximate results is called the sieve method. There is no benefit gained by completing a sieve in part. Therefore, we want to either execute such an optional task to completion before its deadline or skip it entirely. In this case, we say that the execution of the optional task satisfies the 0/1 constraint. When tasks have the 0/1 constraint, the scheduler must decide before the execution of each task begins whether to schedule the task entirely or to skip it entirely. Some scheduling flexibility is thus lost.

Monotone Tasks and Milestone Method

A task, as well as the underlying computational algorithm, is said to be monotone if the quality of the intermediate result produced by it is non-decreasing as it executes longer. We can logically decompose each monotone task into a mandatory task and an optional task. A monotone task produces a precise result when the entire task completes. An approximate result can be made available by recording the intermediate results produced by the task at appropriate instants of its execution. If the task is terminated before it is completed, the approximate result produced by it at the time of its termination is the best among all the intermediate results produced before its termination. This result is usable as long as its mandatory task is completed. This method for returning approximate results is called the milestone method. Clearly, the milestone method relies on the use of monotone computational algorithms. Such algorithms exist in many application domains, including numerical computation, statistical estimation and prediction, sorting, facsimile transmission [21], video and voice transmission [22,23], and database query processing [24,25]. Monotone tasks can be easily implemented in most existing programming languages; an example illustrating how to implement monotone tasks in Ada 9x can be found in [26]. When tasks are monotone, the decision on which optional task and how much of the optional task to schedule at any time can be made dynamically. Because the scheduler can terminate a task any time after it has produced an acceptable result, scheduling monotone tasks can be done on-line or nearly on-line. Consequently, we have the maximum flexibility in scheduling when imprecise computations are implemented using this method.
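A minimal sketch of the milestone method, assuming a monotone iterative task (here, a running mean): each refinement is recorded in a shared milestone slot, so that if the scheduler terminates the task early, the most recent milestone serves as the approximate result. The slot, the termination test, and the mandatory-portion parameter are illustrative assumptions; they are not the ICS interfaces.

    def monotone_mean(samples, milestone, should_stop, mandatory=1):
        """Running mean over `samples`; each refinement is recorded as a milestone.
        The result becomes usable once the first `mandatory` samples have been folded in."""
        total = 0.0
        for k, s in enumerate(samples, start=1):
            total += s
            milestone["value"] = total / k          # record the intermediate result
            milestone["usable"] = k >= mandatory    # mandatory part completed?
            if k >= mandatory and should_stop():    # scheduler may terminate us early
                return milestone["value"]           # approximate result
        return milestone["value"]                   # precise result

    budget = iter([False, False, True])             # pretend the deadline hits after 3 steps
    slot = {"value": None, "usable": False}
    print(monotone_mean([4, 8, 6, 10, 2], slot, lambda: next(budget)))   # 6.0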
Multiple-Version Method

For some applications, neither the milestone method nor the sieve method can be used. Trading off result quality for processing time is nevertheless possible by using multiple versions of tasks. Using this method, we need to provide at least two versions of each task: the primary version and the alternate version(s). The primary version of each task produces a desired, precise result but has a longer processing time. An alternate version has a shorter processing time but produces an approximate result. During a transient overload, when it is not possible to complete the primary version of every task by its deadline, the system may choose to schedule alternate versions of some tasks. The feasibility and effectiveness of the multiple-version method has been demonstrated for both real-time computing and data communication; tools and environments have been developed to support this method [27-33]. A higher storage overhead is incurred because of the need to maintain multiple versions. Performance data shows that there is little advantage to be gained by having more than two versions [30,32].

For scheduling purposes, we model the alternate version of each two-version task as a mandatory task, and the primary version as a mandatory task followed by an optional task. The processing time of the mandatory task is the same as that of the alternate version, and the processing time of the optional task is equal to the difference between the processing times of the two versions. The optional task must be either scheduled and completed by its deadline, corresponding to the primary version being scheduled, or skipped entirely, corresponding to the alternate version being scheduled. Consequently, scheduling tasks that have two versions is the same as scheduling tasks with the 0/1 constraint.

Error Characteristics and Scheduling Algorithms

For the purpose of scheduling, we logically decompose each task Ti into two tasks, the mandatory task Mi followed by the optional task Oi, independent of the method used to implement it. Let τi, mi and oi denote the processing times of Ti, Mi and Oi, respectively. Clearly, mi + oi = τi. The classical deterministic model is a special case of this imprecise computation model where all tasks are mandatory, that is, oi = 0 for all i. Similarly, a sieve or an anytime computation is a task that is entirely optional. In general, the error of a task Ti is a function of the processing time of the portion that is executed, as well as the errors in the inputs of the task. Let σi denote the amount of processor time that is assigned to the optional task Oi. All existing scheduling algorithms assume that the inputs to every task have zero error. In other words, the error εi in the result produced by the task Ti (or simply the error of Ti) depends solely on σi and is given by the error function Ei(σi) of the task. Moreover, when σi is equal to oi, that is, when the scheduler allows Ti to execute until it terminates normally, its error εi is zero.
Figure 3.1.2: Error Characteristics. (a) Types of error functions; (b) Effects of input error.

The error in the result of a task typically decreases in discrete steps and is, therefore, discontinuous. Because it is difficult to work with discontinuous functions, we typically approximate them by continuous error functions. The continuous error functions were found to be good approximations except when the error decreases to zero in very few (e.g., two or three) steps [8]. Figure 3.1.2(a) shows three types of error functions that characterize the general behavior of different monotone computations. When the scheduler works correctly, it allows every mandatory task to complete. Therefore, the value of the error in the range where the amount of assigned processor time is less than the mandatory processing time is unimportant; for the sake of convenience, we let this value be 1. When the exact behavior of error functions is not known, a reasonable choice is εi = (oi - σi)/oi for all i; that is, the error of a task is equal to the (normalized) processing time of its skipped portion. For this reason, most studies on scheduling imprecise computations assume this linear error function. For a given schedule, the total error of the task set T is ε = Σi wi·εi, where the wi are the weights of the tasks. By choosing the weights, we can account for the different degrees to which the errors in the results produced by the individual tasks impact the overall quality of the result produced by the entire task set. Examples of optimal and suboptimal algorithms that minimize the total error can be found in [8,11,15].
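For the linear error function a schedule can be scored directly from the assigned optional times: εi = (oi - σi)/oi per task, the weighted sum for the total error, and the largest εi for the maximum normalized error. The Python sketch below (with illustrative names) computes all three for a toy schedule.

    def linear_error(o, sigma):
        """error = (o - sigma) / o; zero when the optional part runs to completion."""
        return 0.0 if o == 0 else max(0.0, o - sigma) / o

    def total_error(o_times, sigmas, weights):
        return sum(w * linear_error(o, s) for o, s, w in zip(o_times, sigmas, weights))

    def max_normalized_error(o_times, sigmas):
        return max(linear_error(o, s) for o, s in zip(o_times, sigmas))

    # three tasks: optional parts of length 4, 2, 6; the schedule grants 4, 1, 0 units
    o_times, sigmas, weights = [4, 2, 6], [4, 1, 0], [1.0, 1.0, 2.0]
    print(total_error(o_times, sigmas, weights))        # 0*1 + 0.5*1 + 1.0*2 = 2.5
    print(max_normalized_error(o_times, sigmas))        # 1.0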
other computations, such as statistical estimation, also have this type of error function. The average error of tasks with convex error functions can be kept small by making the maximum normalized error of all tasks as small as possible. Given a schedule of a set {T_i} of tasks with identical weights, the maximum normalized error of the task set is max_i {(o_i − σ_i)/o_i}. Polynomial-time algorithms for finding optimal schedules with the smallest maximum normalized error can be found in [13,17].

The error of a monotone task whose error function is concave decreases at a faster rate as it executes longer. The optional part of such a task should be scheduled as much as possible or not scheduled at all. In the limit, the error of an optional task with the 0/1 constraint stays at the maximum value of 1 until the optional task completes. In a schedule that satisfies the 0/1 constraint, the amount of processor time assigned to every optional task is equal to either o_i or 0. The general problem of scheduling to meet the 0/1 constraint and timing constraints, while minimizing the total error, is NP-complete when the optional tasks have arbitrary processing times. Approximate algorithms with reasonably good worst-case performance can be found in [14]. When the optional tasks have identical processing times and weights, tasks with the 0/1 constraint can be optimally scheduled in O(n log n) time or O(n²) time, depending on whether the tasks have identical or different release times. These algorithms can be found in [11].

Future Work in Scheduling

Again, all existing algorithms for scheduling imprecise computations assume that tasks have zero input errors. Moreover, the release times and deadlines of individual tasks are given. Both of these assumptions are often not valid. The timing constraints of a set of dependent tasks are often end-to-end in nature. As long as the last task(s) in the set completes before its end-to-end deadline, when the individual tasks in the set complete is unimportant. The possibility of postponing the deadlines of some tasks adds a dimension to the tradeoff between the quality and the timeliness of the overall result and makes the problem of scheduling imprecise computations more difficult.

The result produced by a task may be an input to its immediate successors. When this result is imprecise, the input is erroneous. We can account for the effect of errors in the inputs of a task T_i by using an error function that is also a function of the input errors. Specifically, let the errors in the inputs of a task T_i be denoted by the vector e_i. The error in the result produced by T_i whose optional task is assigned σ_i units of processor time is given by ε_i = E_i(σ_i, e_i), where E_i(σ_i, e_i) is a non-increasing function of σ_i and a non-decreasing function of e_i. Figure 3.1.2(b) illustrates the effect of input errors. A task may need to do some additional work to correct its input error, and a poorer input may slow down the rate at which its result converges to the precise one. Consequently, the processing times of both M_i and O_i may become larger as the input
error e_i increases. When the magnitude of e_i becomes larger than some threshold, the processing time o_i(e_i) of the optional task O_i may become infinite, and the task T_i can never produce a precise result no matter how long it executes. The need to keep the error in an input below this threshold determines when a result produced by each predecessor task is acceptable and how large the processing time of the mandatory part of the predecessor task should be. We are developing scheduling algorithms that take input errors into account and make use of the end-to-end nature of the deadlines when trying to minimize the error in the overall result.
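Assuming the linear error function discussed above, the error metrics used throughout this section can be computed directly from the amount of processor time assigned to each optional part. The sketch below is illustrative only; the structure and function names are not taken from the ICS code.

    #include <stddef.h>

    /* Per-task scheduling data (hypothetical layout). */
    struct imprecise_task {
        double o_i;      /* processing time of the optional part O_i            */
        double sigma_i;  /* processor time assigned to O_i, 0 <= sigma_i <= o_i */
        double w_i;      /* weight of the task                                  */
    };

    /* Linear error function: epsilon_i = (o_i - sigma_i) / o_i.
     * An entirely mandatory task (o_i == 0) has zero error. */
    double linear_error(const struct imprecise_task *t)
    {
        if (t->o_i == 0.0)
            return 0.0;
        return (t->o_i - t->sigma_i) / t->o_i;
    }

    /* Total error of the task set, epsilon = sum over i of w_i * epsilon_i,
     * together with the maximum normalized error, the other metric
     * discussed in the text. */
    double total_error(const struct imprecise_task tasks[], size_t n,
                       double *max_normalized_error)
    {
        double total = 0.0, max_err = 0.0;
        for (size_t i = 0; i < n; i++) {
            double e = linear_error(&tasks[i]);
            total += tasks[i].w_i * e;
            if (e > max_err)
                max_err = e;
        }
        if (max_normalized_error)
            *max_normalized_error = max_err;
        return total;
    }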
3.1.3 Integration of Imprecision with Checkpointing

As stated earlier, the mechanism for the storage and return of the intermediate, approximate results of computations can be easily integrated with the traditional checkpointing mechanism for fault tolerance. This section describes an architecture that allows this integration and the Imprecise Computation Server (ICS) system, which implements this architecture. Throughout this section, we confine our attention to transient faults and failures.

To make approximate results available in an imprecise system, the intermediate results produced by each task must be recorded as the task progresses toward completion. The application programmer defines the intermediate result variables and the accuracy measure variables to record, as well as the time instants at which their values are recorded. Each set of recorded values can be viewed as a checkpoint. Therefore, we call the mechanism for returning intermediate approximate results user-directed checkpointing. In contrast, the traditional checkpointing mechanism is system-directed.

Using user-directed checkpointing to supplement system-directed checkpointing has several obvious benefits. Because the application programmer can make use of the semantics of the individual computations, it is possible to keep the amount of state information recorded at each user-directed checkpoint small. For example, it suffices to record two consecutive intermediate roots produced by an iterative root-finding computation or the current sample mean and variance of a simulation run. In the absence of user-provided checkpointing routines, the operating system must save the entire address space of each task. Consequently, the cost of user-directed checkpointing is usually lower than that of system-directed checkpointing. Furthermore, since fault recovery can make use of user-directed checkpoints as well as system-directed checkpoints, system-directed checkpoints need not be taken as often, thus reducing the cost of providing fault tolerance. When a task terminates before it is completed, recovery action is not necessary if the last result returned was sufficiently good. Thus the processing time that would be spent repeating the work between the last checkpoint and the point of failure may be saved.
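As an illustration of how small a user-directed checkpoint can be, the sketch below records the two most recent iterates of a root-finding computation together with an accuracy measure. The record layout and routine name are hypothetical; in ICS this information would be handed to the IMIG-generated result-saving routine, which is not shown in the text.

    #include <math.h>

    /* State recorded at each user-directed checkpoint of an iterative
     * root-finding task: the two most recent iterates and an accuracy
     * measure (hypothetical layout, for illustration only). */
    struct rootfind_checkpoint {
        double x_prev;    /* previous iterate                          */
        double x_curr;    /* current iterate, the approximate result   */
        double accuracy;  /* e.g., |x_curr - x_prev|                   */
    };

    /* Fill in a checkpoint from the computation's current state. */
    void record_checkpoint(struct rootfind_checkpoint *cp,
                           double x_prev, double x_curr)
    {
        cp->x_prev   = x_prev;
        cp->x_curr   = x_curr;
        cp->accuracy = fabs(x_curr - x_prev);
    }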
Figure 3.1.3: ICS Process Structure — each client consists of a caller and a handler, each server of a callee and a supervisor; the supervisors interact with the scheduler.

An Imprecise System Architecture

The process structure that we use for the result-saving purpose is a simple variation of the client-server model shown in Figure 3.1.3. There is a server type for each service provided by an imprecise system. Each server consists of a callee and a supervisor, while each client consists of a caller and a handler. The caller and callee are application-specific and are written by the application programmer. The handler and supervisor are part of the underlying system. The supervisor is created at configuration time or at the instantiation time of the server. It may be dedicated to one server or shared by several servers. These details are unimportant for the purpose of this discussion.

As in traditional systems, the callee executes whenever it is called by the client. The client calls a server by sending an invocation request to the server's supervisor. Typically, only the client knows how accurate a result must be for it to be acceptable and how many faults each server invoked by it must tolerate. On the other hand, only the scheduler has the information on the demand and the availability of all resources. Both kinds of information are needed to support the decision on what quality of service to provide to each client and which tasks to terminate when an imprecise system recovers from a transient overload. To support this decision, the information regarding the result accuracy required by the client, the minimal level(s) of accuracy achievable by the server, and the required number of tolerated faults is exchanged between the client, the server, and the scheduler at invocation time. We will return shortly to describe the interactions between these system components for the purpose of this information exchange.

When the supervisor grants the request, it activates the callee. When the time allocated for the callee's execution expires, the supervisor terminates the callee. The client has the final, precise result when the callee terminates normally. If the callee terminates prematurely, the client has the best approximate result produced by the
callee before it terminates. Based on the latest recorded value of the accuracy measure variable, the handler can decide whether this approximate result is acceptable.

In addition to aiding the result-saving process, the supervisor also checkpoints the callee and carries out sanity checks on a periodic basis in order to enhance fault tolerance. Because the supervisor and the callee do not communicate during the callee's execution, a fault occurring during the callee's execution does not affect the supervisor. We further assume that the effect of any fault occurring during the callee's execution is confined to it. This simplifying assumption allows us to focus on the interaction between the imprecision and fault-tolerance mechanisms. When a failure is detected before the callee's mandatory portion completes, recovery is necessary. The supervisor restores the callee to its state at the time of the last checkpoint and resumes its execution from that state. We call such an action a system-directed recovery action.

An imprecise system must also support user-directed recovery. For real-time applications, forward recovery is typically more appropriate. Moreover, it can often be done effectively, especially for embedded applications. To carry out forward recovery, the client may want the sequence of approximate results produced by the callee during the course of its execution, not just the last result before its termination. This sequence of approximate results is often the consecutive discrete-time sample values of a continuous-time function. The future values of such a function can be estimated by extrapolating from the available past sample values. Depending on the semantics of the problem, there may be several ways to do this extrapolation to arrive at different acceptable estimates of the final result. For example, the application programmer may provide the client with some routines that implement filters with prediction or pure prediction filters. If the callee terminates prematurely, the client would take the approximate results and apply one of the given routines to them to generate a better approximation of the final result. In this way, the application system can better control the quality of approximate results.

Scheduling Checkpointed Imprecise Tasks

From the scheduler's point of view, an imprecise application system T is a set of n (callee/server) tasks. Suppose that the supervisor of the callee T_i takes a checkpoint every l_i units of time, and that it takes c_i units of processor time to generate a checkpoint. While a checkpoint is generated, a sanity check is made. We assume that sanity checks never fail, and that a failure occurring at any time instant is detected by the next sanity check after the instant. The callee task is allowed to fail k_i times. (k_i is specified by the client in its invocation request, in a manner described later in this section.) Some other recovery measure will be taken if it fails more than k_i times. Let p_i denote the amount of time the supervisor executes in order to roll back the callee task. A task T_i
can tolerate k_i faults only when a sufficient amount of processor time is allocated so that its mandatory portion can complete before its deadline even when it must be rolled back k_i times. In general, we say that a schedule is k_i-tolerant for task T_i if the task can recover from k_i failures without causing any task in the system T to miss its deadline. A schedule for the system T is k-tolerant, where k denotes the vector (k_1, k_2, ..., k_n), if it is k_1-tolerant for T_1, k_2-tolerant for T_2, and so on. The goal of scheduling checkpointed imprecise tasks in T is to find a k-tolerant schedule that minimizes the total error when all k_1 + k_2 + ... + k_n failures occur. (Here, the error of each callee task T_i is equal to the processing time of its skipped optional portion.) To accomplish this goal, the scheduler must allocate at least m_i' = m_i + ⌊m_i/l_i⌋ c_i + k_i(l_i + c_i + p_i) units of time to the task when the processing time of its mandatory portion is m_i.

When tasks are statically bound to processors and tasks on each processor are scheduled as in a uniprocessor system, the algorithm called Algorithm COL in [33] can be used for this purpose. When the tasks are off-line, this algorithm is optimal, in the sense that it can always find a k-tolerant schedule if such a schedule exists and the total error of all the tasks in T is minimized. (A task system is off-line when the release times, deadlines and processing times of all the tasks in it are known before the processor begins to execute any task.) The version described below assumes that the tasks are on-line; the algorithm is optimal only when the system satisfies the feasible mandatory constraint [15].

Similar to the NORA algorithm for scheduling on-line imprecise tasks that are ready for execution at their arrival times [15], Algorithm COL maintains a reservation list of time intervals for all tasks that have been released but are not yet completed. This list is updated each time a new task T_i is released; as a result of this update, an interval (or several intervals) of length equal to m_i' is reserved for it. The reservation list is derived from a feasible schedule of all the unfinished portions of the mandatory tasks. This schedule is generated by backward scheduling according to the latest-ready-time-first rule; a time interval is reserved if some task is scheduled in the interval. The scheduler uses the reservation list as a guide; it never schedules any optional task in a reserved interval. The reservation list is also updated as the execution of each mandatory task M_i progresses to completion. At an update when y units of the mandatory portion have already been completed and z of the k_i tolerable failures have occurred, the amount of time reserved for M_i is reduced to m_i − y + ⌊(m_i − y)/l_i⌋ c_i + (k_i − z)(l_i + c_i + p_i). The reservation for it is eventually deleted when the mandatory portion completes.

Algorithm COL differs from the NORA algorithm in the way it schedules mandatory and optional tasks. Algorithm COL maintains an EDF-ordered main task queue and an optional task queue. When a task is released, it is put into the main task queue after the reservation list is updated. The first task in this queue is scheduled for execution. The task executes for as long as there is time before the beginning of the earliest reserved
interval. When the task completes or is terminated at its deadline, it is removed from the main task queue and its reservation is deleted from the reservation list. If, before its completion or its deadline, the beginning of the earliest reservation interval is reached, the reservation of the task is deleted if it has any reserved time according to the current reservation list. Otherwise, the task is terminated and is removed from the main task queue. If a task is not completed when it is removed from the main task queue, it is placed in the optional task queue, which is also EDF-ordered. When the main task queue is empty, the processor executes the optional task at the head of the optional task queue until the completion or the deadline of that task; it is then removed from the optional task queue. The elegance of this algorithm in combination with a checkpointing scheme is that any unused recovery time is automatically made available to the other tasks, allowing their optional portions to execute longer. The performance of this algorithm for scheduling on-line transactions that have bounded-response-time and high-availability requirements has been evaluated, and performance data can be found in [33].

The ICS System

We are implementing the ICS (Imprecise Computation Server) system based on the architecture described above. ICS runs on top of the Mach operating system and is integrated with the Mach Interface Generator (MIG) [34]. MIG takes an interface description for a service and generates interface code for the client and the server. This interface hides the call to the remote server from the client, making it appear to be a local procedure call.

Interfaces

To develop an application system using ICS, the application programmer first defines the interface for each service and writes a MIG specification file that describes the interface. A modified version of MIG, called IMIG, is run to generate the interface code. IMIG adds an argument to each of the interfaces, which is used to exchange ICS-specific information between the caller and the callee. Specifically, it is a pointer to a data structure containing scheduling information; a portion of this structure is shown in Figure 3.1.4. The requiredAccuracy and outputAccuracy parameters measure the accuracy of the result. The deadline and mandatoryTime parameters allow the programmer to specify the timing requirements of the application system. The resources parameter is used for server selection. The toleratedFaults parameter specifies the number of faults that the server must be able to tolerate. This information, together with the value of mandatoryTime, is used by the scheduler to determine the amount of processor time to reserve for the mandatory portion.

Next, the programmer writes the code that implements the callers and callees in the application system and then compiles and links the code with the ICS library, which contains the supervisor and handler code.
    struct icsInfo {
        long   requiredAccuracy;        /* User inputs.   */
        struct timeval deadline;
        long   resources;
        int    serviceNumber;           /* System inputs. */
        struct timeval mandatoryTime;
        int    toleratedFaults;
        long   outputAccuracy;          /* Outputs.       */
    };

Figure 3.1.4: The icsInfo Structure
                  Client Side                 Server Side
    ICS-supplied  icsFindService,             icsRegisterService, icsService,
                  client interface routine    result-saving routine, server interface routine
    User-supplied —                           checkpoint routine, restart routine

Table 3.1.1: ICS Functions

Each client or server produces a separate application program. ICS's interface routines fall into two groups: client-side routines and server-side routines. They may be further divided into those that are provided by ICS to be called by the application and those that are written by the programmer to be called by ICS. The routines and their classifications are listed in Table 3.1.1. Details on these routines and their usage can be found in [35].

The client interface routine is the client's interface to a server and is called by the client when it wishes to make use of a service. The server interface routine is the corresponding interface on the server side. When a client interface routine is called, the server interface routine is eventually invoked by a supervisor on the selected server. The checkpoint routine provided by the application programmer is used for both
result-saving and checkpointing purposes. If no checkpoint routine is specified, ICS will checkpoint the callee in the traditional manner by saving the callee's entire address space. In this case, ICS will be unable to return an approximate result to the caller if the callee fails; the computation will have to be restarted from the last checkpoint.

Service Establishment

The programmer starts the application system by starting the application programs for each of the servers and clients in the system. Each server, after performing any necessary initialization, registers itself with its supervisor by calling icsRegisterService. This call informs the supervisor of the name of the service, the resources required by the callee, and the routines to call to perform checkpointing and to restart the callee from a checkpoint. Once the service has been registered, the server calls the icsService routine, which passes control to the supervisor. The supervisor advertises the services that it provides and waits for requests from clients.

To make use of a service, a client first calls the icsFindService routine with the name of the service it wishes to call. icsFindService returns a service number. The service number is used by the handler to keep track of all of the servers for that service. When a client calls a server, the caller fills in the icsInfo structure and calls the ICS-supplied client interface routine. This interface routine invokes the handler. The handler uses the scheduling information provided in this structure to select a server and sends a message to the chosen server's supervisor. The supervisor creates a new thread of control for the callee and informs the scheduler of the callee's deadline and processing time requirements. It then starts the callee when the scheduler grants its request to start the callee.

After starting the callee, the supervisor sets a timer. When the timer expires, the supervisor calls the callee's checkpoint routine, which instructs the callee to call its result-saving routine. When the callee wants to save an approximate result, it calls its result-saving routine itself. The result-saving routine, which is generated by IMIG, sends the result and the callee's current state to the handler that originated the request. The result-saving routine also resets the timer, thus delaying the next supervisor-invoked checkpoint. If the handler finds that the server has died or has missed its deadline, it can use the state information received at the last checkpoint to decide whether to return the result that has been computed so far to the caller, to carry out some user-directed recovery action, or to use the saved state to restart the request on another server.

Extensions to ICS

We have implemented a prototype version of ICS, and the design seems reasonable. Once we complete the version described here, there are several extensions we would like to make. One extension is to replicate servers on redundant processors. The following section discusses an approach that uses the imprecise computation technique to reduce the overhead of replication. Another approach is
to replicate only the mandatory portions. The approximate results produced by the mandatory portions are used to ensure the correctness of the refined results produced by the unreplicated optional portion. We will take this approach because it is more compatible with the ICS architecture.

In the near future, we will extend the ICS system to support imprecise service negotiation. When it is not possible for the scheduler to allocate sufficient amounts of processor time or other resources to grant a request, it informs the supervisor of the available amounts. Rather than simply denying the request, the supervisor and client may enter an optional negotiation phase, in which the willingness of the client to accept a poorer-quality result that can be produced with the available resources is determined. The callee is scheduled if the client can use such a result. In this way, the quality of each imprecise computation can be negotiated at invocation time.
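Pulling the invocation flow together, the sketch below shows how a caller might fill in the icsInfo structure of Figure 3.1.4 and invoke a service. Only icsFindService and the icsInfo fields come from the text; the service name, the IMIG-generated interface routine invert_matrix(), the accuracy scale, and the way the service number is carried in the structure are hypothetical and shown only to illustrate the flow.

    #include <string.h>
    #include <sys/time.h>

    struct icsInfo {
        long requiredAccuracy;
        struct timeval deadline;
        long resources;
        int serviceNumber;
        struct timeval mandatoryTime;
        int toleratedFaults;
        long outputAccuracy;
    };

    extern int icsFindService(const char *serviceName);                 /* ICS-supplied          */
    extern int invert_matrix(struct icsInfo *info, double *a, int n);   /* IMIG-generated (hypothetical) */

    int invoke_invert(double *a, int n)
    {
        struct icsInfo info;
        memset(&info, 0, sizeof info);

        info.serviceNumber         = icsFindService("invert_matrix");
        info.requiredAccuracy      = 90;      /* application-defined accuracy scale   */
        info.deadline.tv_sec       = 1;       /* result needed within one second      */
        info.mandatoryTime.tv_usec = 200000;  /* estimated mandatory processing time  */
        info.toleratedFaults       = 1;       /* server must tolerate one fault       */

        /* The client interface routine invokes the handler, which selects a
         * server and sends the request to that server's supervisor. */
        int status = invert_matrix(&info, a, n);

        /* outputAccuracy reports how good the (possibly approximate) result is. */
        return status == 0 && info.outputAccuracy >= info.requiredAccuracy;
    }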
3.1.4 Integration of Imprecision and Replication

For real-time applications, replication and masking are a more appropriate approach to providing fault tolerance whenever there is more than one processor. This section first describes algorithms for scheduling replicated periodic imprecise tasks and then a process structure that allows replication to be incorporated in an imprecise system.

Scheduling Replicated Imprecise Tasks

We consider here a system T of n independent, replicated periodic tasks. They are to be assigned to identical non-faulty processors and allocated sufficient processor time for them to produce acceptable results in time.

The Allocation Problem

We say that the system is in mode j when the system contains j non-faulty processors. Each task T_i is replicated. There are c_i(j) clones, or copies, of T_i when the system is in mode j; c_i(j) is chosen by the application programmer. The set {c_i(j)} is an input of the task assignment module. When invoked, this module allocates a fraction of processor time to (each clone of) each task and finds an assignment of the clones of all the tasks to the j processors. An assignment in mode j is a valid one if the c_i(j) clones of every task T_i are assigned on a one-clone-per-processor basis to c_i(j) non-faulty processors. Moreover, let x_i(j) be the amount of processor time per period that is allocated to each clone of task T_i. Because the mandatory portion of every task must always be completed, x_i(j) must be larger than the processing time m_i of the mandatory task M_i of T_i. Since the optional portion need not be completed, x_i(j) may be less than the processing time τ_i of the task as a whole. Let p_i be the period of T_i, and u_i(j) = x_i(j)/p_i be the fraction of processor time allocated to each clone of T_i. We refer to u_i(j) as the allocated
utilization, or simply the utilization, of T_i. For an assignment to be valid, the allocated utilization u_i(j) must be such that m_i/p_i ≤ u_i(j) ≤ τ_i/p_i for all i.
In the second phase, each algorithm generates a valid task assignment using the upper bounds derived in phase one as input. The algorithms used in this phase are based on well-known heuristics for bin packing; examples are least-utilization-best-fit, least-number-of-clones-first-fit, and multifit. Because the upper bounds of the task utilizations are used, it is likely that the total utilizations of some processors exceed the schedulability threshold. The third phase adjusts the allocated utilizations of the tasks so that the total utilization of each processor is less than the schedulability threshold and the total weighted utilization U(j) is maximized. The problem solved in the third phase is formulated as a linear-programming problem and is solved using known efficient linear-programming algorithms.

The difference between the Allocate algorithm and the Modify algorithm lies in the second phase, in which different routines are invoked to produce desirable task assignments. The desirable task assignments sought by the Allocate algorithm are those close to being optimal. On the other hand, the Modify algorithm seeks feasible assignments that maintain as much location continuity as possible with respect to the existing task assignment that was given as input. To achieve this objective, the routine called by the Modify algorithm in phase two does not move those clones already assigned to the non-faulty processors and tries to maximize U(j) under this constraint. Detailed descriptions of these algorithms and their performance can be found in [36,37].
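The utilization constraints on an allocation can be checked directly. The sketch below is illustrative only: the data layout is hypothetical, it checks only the utilization bounds (not the one-clone-per-processor requirement), and the per-processor schedulability threshold is supplied as a parameter because its value depends on the uniprocessor scheduling algorithm in use.

    #include <stddef.h>

    /* Per-task data for one mode j (hypothetical layout). */
    struct clone_alloc {
        double m_i;    /* mandatory processing time            */
        double tau_i;  /* total processing time m_i + o_i      */
        double p_i;    /* period                               */
        double u_ij;   /* allocated utilization of each clone  */
    };

    /* An allocation is acceptable only if every allocated utilization lies
     * between m_i/p_i and tau_i/p_i, and the total utilization of each
     * processor stays below the schedulability threshold. */
    int allocation_is_valid(const struct clone_alloc tasks[], size_t n,
                            const double processor_util[], size_t n_procs,
                            double threshold)
    {
        for (size_t i = 0; i < n; i++) {
            double lower = tasks[i].m_i   / tasks[i].p_i;
            double upper = tasks[i].tau_i / tasks[i].p_i;
            if (tasks[i].u_ij < lower || tasks[i].u_ij > upper)
                return 0;
        }
        for (size_t q = 0; q < n_procs; q++)
            if (processor_util[q] > threshold)
                return 0;
        return 1;
    }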
3.1.5 Process Structure for Replication

With replication, the system can tolerate both transient and permanent physical faults in the processors. We assume that the mean time between failures is much greater than the time needed for a failure recovery. During a failure recovery, the system will remain functional as long as there are enough existing clones to mask the failures.

We can also incorporate replication with imprecision using a server-client process structure. Specifically, in a system where both the client T_c and the server T_s are replicated, the result produced by each clone of a server is broadcast to all the clones of the client. The additional time needed for broadcasting is included in the processing time of the server task. Because an accuracy measure is returned with each result, a client clone may choose to use any one of the results that arrive in time, based on the quality of the results. Neither agreement nor synchronization among the client clones is necessary. It is easy to show that when both the client and the server are periodic, the maximum latency for detecting a failure of the server is equal to p_c + p_s, where p_c and p_s are the periods of the client and server, respectively.

The system provides two types of servers, diagnosis servers and repair servers, to detect failures and recover from them. Each diagnosis server is responsible for monitoring some number of processors. It polls the processors it services, either upon request by some task or periodically, to
determine their status. Each processor is monitored by more than one diagnosis server. In this way, the diagnosis servers are independently replicated to provide concurrent, independent services. When a faulty processor is detected by a diagnosis server, it calls a repair server.

Upon request by a diagnosis server, a repair server carries out a clone reconfiguration. This is done in two phases. Using the Modify algorithm, it first finds a new task assignment to relocate the clones assigned to the faulty processor. It then installs the new assignment on the non-faulty processors. During this installation, the amount of processor time allocated to each clone may be increased or decreased. Some new clones may need to be added to some non-faulty processors. This is essentially a mode change operation, and it can be done in an iterative manner. The repair server sends the information on the new clone allocated utilization iteratively to each non-faulty processor. The number of clones and the allocated utilizations of the clones are changed in such a way that the number of clones of every task T_i on non-faulty processors is never less than c_i(j) when the system is in mode j [38].

In particular, according to the clone reconfiguration scheme described in [38], the repair server is monotone. Therefore, if another processor fails while a repair server is carrying out clone reconfiguration, we can terminate the ongoing repair service. Because the Modify algorithm is used, the server monotonically reduces the number of excessive clones and the number of missing clones to zero. If a repair server is terminated during the first phase, before a new assignment is found, the incomplete assignment it has generated before its termination is treated as an approximation of the final assignment. Another clone reconfiguration can be started based on this approximate assignment. When a clone reconfiguration is interrupted during the second phase of the execution, some clones may have been partially reconfigured. Since the method used in the second phase is iterative, it can also be aborted at any time. The system is in a partially reliable state if some of the new clones are not installed at the time, but the new configuration is an improvement over the existing one, and each iteration by the server improves or preserves the reliability of the system.
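One way to view the invariant that the reconfiguration scheme maintains is as a simple check performed after each iteration: no task may ever drop below its required clone count on the non-faulty processors. The sketch below is purely illustrative; the data layout is hypothetical and is not part of the scheme in [38].

    #include <stddef.h>

    /* clone_count[i] : number of clones of task T_i currently installed on
     * non-faulty processors; required[i] : c_i(j) for the current mode j.
     * (Hypothetical arrays, for illustration only.) */
    int reconfiguration_invariant_holds(const int clone_count[],
                                        const int required[], size_t n_tasks)
    {
        for (size_t i = 0; i < n_tasks; i++)
            if (clone_count[i] < required[i])
                return 0;   /* invariant violated: too few clones of T_i */
        return 1;
    }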
3.1.6 User-Directed Recovery and Imprecision Management

In practice, it is often impossible to bound the time required to produce an acceptable result. As an example, we consider a task that iteratively finds a root of a polynomial. How long the task must execute to reduce the error to an acceptable level depends not only on the coefficients of the polynomial but also on the region where the root is expected to be found. The supervisor may underestimate the processing time of the mandatory part. Furthermore, the required accuracy of the result may depend on the value of the result. For example, how accurate the estimated positions of several cooperative robot arms must be depends on the positions of the arms. The closer they
    Application Domain Type    Decision Support Information or Recovery Actions

    Non-cumulative             Acceptance of x_i is based on x_i.
                               Recovery action depends on x_i.
                               Alternative recovery actions: refine x_i.

    Limitedly-cumulative       Acceptance of x_i is based on x_i and x_{i-1}.
                               Recovery action depends on x_i and x_{i-1}.
                               Alternative recovery actions: refine x_i or precompute x_{i+1}.

    Cumulative                 Acceptance of x_i is based on x_i, x_{i-1}, x_{i-2}, ...
                               Recovery action depends on x_i, x_{i-1}, x_{i-2}, ...
                               Alternative recovery actions: refine x_i, or precompute x_{i+1},
                               or recompute x_{i-1}, x_{i-2}, ...

Table 3.1.2: Imprecision Management Policies
are and the greater the likelihood of collision, the more accurate the estimates must be. Consequently, after examining the result, the client may find that it requires better accuracy than it requested at invocation.

The above examples illustrate that sometimes a client may find a result unacceptable, and recovery is necessary, even though the mandatory part of the callee has been completed. In this case, the appropriate recovery action to take is typically application dependent. Whether some recovery action is necessary and what recovery action should be taken are likely to depend on the semantics of data, operations and applications. We call a policy that defines the rules governing when and what application-directed recovery is done an imprecision management policy, and the mechanism that enforces a set of such rules an imprecision management mechanism. To support a wide range of embedded, real-time applications, an imprecise system must provide three types of imprecision management mechanisms for three types of applications: (error) non-cumulative, limitedly-cumulative, and (error) cumulative. The policies governing these mechanisms are summarized in Table 3.1.2. To explain the different application types, we consider a sequence of (server) tasks ..., T_{i-2}, T_{i-1}, T_i, T_{i+1}, ... whose results are ..., x_{i-2}, x_{i-1}, x_i, x_{i+1}, ..., respectively. The client invokes the task T_i and requests that the result x_i be produced by the time instant t_i in a sequence of time instants ..., t_{i-2} < t_{i-1} < t_i < t_{i+1} < ....
Non-cumulative Type

In the simplest case, the tasks are independent and their inputs are precise. No information on the results can be carried from one task to the other tasks. The decision about whether to accept an approximate result x_i produced by a prematurely terminated task T_i of this type can be made independently of previous results and future results. When the system recovers from a fault causing a premature termination and there is time, it is clear that recovery action should be carried out to improve the result as much as possible. The problem is simply that of deciding how much of the available processor time allocated to the client should be spent on improving this approximate result when there are other approximate results competing for time. This problem is essentially one of scheduling.

In addition to some numerical computations and monotone database query processing [24,25], image transmission is an example of this type. Suppose that a progressive build-up method is used for the transmission of still images [21]. The data encoding each image frame is divided into four blocks; each additional block gives a clearer image. The input data encoding the image presented to each transmission task contain all four blocks. If the transmission of any image terminates prematurely due to network partition or congestion, the approximate result is a fuzzy picture, as long as the transmission of one of the four blocks is completed. The receiver can decide whether the fuzzy picture is acceptable without considering the quality of other pictures. In this case, the recovery operation is relatively straightforward.

Transmission of compressed video (or voice) is a more complex application of the non-cumulative type. Typically, frames of images are transmitted periodically. When the transmissions of some frames are terminated prematurely, the average error in incompletely transmitted frames over several consecutive periods has an observable effect. It is not necessary to complete the transmission of any frame. However, a sufficient number of frames of good quality must be transmitted every second in order to maintain an overall acceptable average quality. The decision on whether to accept an approximate result, that is, an incomplete frame, must be made based on the average error over several periods. The fact that the timely delivery of each frame is more critical than the quality of each individual frame is also a critical factor in this decision.

Limitedly-Cumulative Type

In an application of the limitedly-cumulative type, the current task T_i to be completed by t_i is similar to the previous task T_{i-1}, with small changes due to small changes in the input data. The previous result x_{i-1} can be carried forward as an input to the current task. In particular, the result x_{i-1} can be used as an initial approximation of the result x_i to be produced by the current task, possibly even one accurate enough for use if there is insufficient time to execute T_i. The decision on whether to accept an approximate result x_i can no longer be made independently of the previous result. When the client receives x_i, it is no longer clear what to do: to carry out the recovery operation and improve the approximate result even though the
result is sufficiently accurate, or to accept the approximate result and use the available processor time to start the next task. Making the optimal choice in this case requires the use of a cost function that includes the estimated errors of results at several time instants.

Traditional linear or approximately piecewise-linear feedback control is an example of a limitedly-cumulative application. A control law is typically computed periodically. The answer produced in a period can be used as an initial approximation of the answer in the next period. Because the effects of inaccuracy accumulate from one period to the next, when deciding whether to accept an approximate result in the current period we must also consider the accuracy of the result produced in the previous period.

Cumulative Type

In an application of the (error) cumulative type, the current task T_i has as inputs the results of all previous tasks. An example of this type of application is the continuing solution of a differential equation in real-time simulation, where the problem to be solved at any time instant is defined by the solutions of the problems solved at earlier time instants. In other words, the results produced by previous tasks determine the parameters of the current task. They may also be used to provide an initial approximation of the current result x_i. Again, it is not obvious how best to use the available processor time after an acceptable approximate result has been returned. In some situations it might be preferable to go back and improve some previous results, since the error remaining there will affect the accuracy of the current result, as well as future results. Multiple-target tracking, non-linear control systems (such as bang-bang control) and hybrid systems (such as knowledge-based control and discrete-event/continuous control) are other examples of the cumulative type.

Imprecision management is considerably more complex for this type of application. The cost function used by the imprecision management mechanism must take into account the accuracy of results produced by many, or all, previous tasks. The dependency of the accuracy of future results on the accuracy of the current result must also be considered.
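As a toy illustration of the kind of decision an imprecision management mechanism must make, the sketch below caricatures the limitedly-cumulative policy of Table 3.1.2: the acceptance of the current result weighs its own error against the error carried over from the previous one. The cost function and its weights are placeholders; the text only says that both errors must be considered, not how to combine them.

    /* Decide whether to accept the approximate result x_i of a
     * limitedly-cumulative task, given the normalized errors of the current
     * and previous results and an application-chosen tolerance on their
     * combined cost (hypothetical policy, for illustration only). */
    int accept_limitedly_cumulative(double err_curr, double err_prev,
                                    double tolerance)
    {
        /* Placeholder cost: weight the current error more heavily, but let
         * accumulated error from the previous period tighten the decision. */
        double cost = 0.7 * err_curr + 0.3 * err_prev;
        return cost <= tolerance;
    }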
3.1.7 Summary

We have described the architecture of an imprecise, fault-tolerant system. In this system, the imprecision mechanism that is needed to record and return intermediate results of computations is integrated with the traditional checkpointing and recovery mechanism. They complement each other. The imprecision mechanism is application dependent and hence is applicable only in some application domains. Whenever it is applicable, it provides fault tolerance at a low cost. In contrast, the traditional mechanism is application independent and can be broadly applied and supported system-wide, but it has a relatively high overhead.
We are implementing the ICS system based on this architecture. ICS will make it easy for us to implement imprecise computations and experiment with them. Specifically, to develop an application system in ICS, we will only need to write the code that implements the application, leaving the ICS system to provide the code for the underlying support modules. Clearly, the result-saving process must be efficient so that it does not degrade normal system performance substantially. Similarly, the overhead incurred in imprecise service establishment must be kept low. We are evaluating several alternative ways to implement these protocols to determine how best to meet these performance objectives. Once this version of the ICS system is completed, we plan to extend ICS so that the application programmer can easily replicate the mandatory portions of some tasks if the replication approach is chosen.

The effectiveness and limitations of the imprecise computation technique can be demonstrated clearly only by examining application systems in a diverse spectrum of application domains. For this reason, we will focus on the experimental evaluation of this technique for several representative applications. Again, the applications we plan to implement using the ICS system and evaluate in depth include multimedia data transmission, direct digital control and optimal control, and database queries and updates. We note that it is relatively easy to judge the quality of voices and images; therefore, rules governing when a partially transmitted message carrying voice and images is acceptable are relatively simple. At the other extreme, the question of whether a partially completed data query or update is sufficiently good is much more difficult to answer.
References

[1] Liu, J. W. S., K. J. Lin, and C. L. Liu, "A position paper for the IEEE 1987 Workshop on Real-Time Operating Systems," Cambridge, MA, May 1987.

[2] Lin, K. J., S. Natarajan, and J. W. S. Liu, "Imprecise results: utilizing partial computations in real-time systems," Proceedings of the IEEE 8th Real-Time Systems Symposium, San Jose, CA, December 1987.

[3] Liu, J. W. S., S. Natarajan, and K. J. Lin, "Scheduling real-time, periodic jobs using imprecise results," Proceedings of the 8th Real-Time Systems Symposium, pp. 252-260, San Jose, CA, December 1987.

[4] Dean, T. and M. Boddy, "An analysis of time dependent planning," Proceedings of the Conference of the AAAI, 1988.

[5] del Val, A., "Approximate belief update," Proceedings of the Workshop on Imprecise and Approximate Computations, Phoenix, AZ, December 1993.
[6] Decker, K., V. Lesser, and R. Whitehair, "Extending a blackboard architecture for approximate processing," Real-Time Systems Journal, 2, 1990.

[7] Leung, J. Y-T., T. W. Tam, C. S. Wong, and G. H. Wong, "Minimizing mean flow time with error constraints," Proceedings of the 10th IEEE Real-Time Systems Symposium, December 1989.

[8] Chung, J. Y., J. W. S. Liu, and K. J. Lin, "Scheduling periodic jobs that allow imprecise results," IEEE Transactions on Computers, Vol. 39, No. 9, pp. 1156-1174, September 1990.

[9] Leung, J. Y-T. and C. S. Wong, "Minimizing the number of late tasks with error constraints," Proceedings of the 11th Real-Time Systems Symposium, Orlando, FL, December 1990.

[10] Liu, J. W. S., K. J. Lin, W. K. Shih, A. C. Yu, J. Y. Chung, and W. Zhao, "Algorithms for scheduling imprecise computations," IEEE Computer, pp. 58-68, May 1991.

[11] Shih, W. K., J. W. S. Liu, and J. Y. Chung, "Algorithms for scheduling tasks to minimize total error," SIAM Journal on Computing, Vol. 20, No. 3, pp. 537-552, June 1991.

[12] Zhao, W., S. Vrbsky, and J. W. S. Liu, "An analytical model for multi-server imprecise systems," Proceedings of the 5th International Conference on Parallel and Distributed Computing and Systems, Pittsburgh, PA, September 1992.

[13] Ho, K. I. J., J. Y. T. Leung, and W. D. Wei, "Minimizing maximum weighted error of imprecise computation tasks," Technical Report, Department of Computer Science and Engineering, University of Nebraska, 1992.

[14] Ho, K. I. J., J. Y. T. Leung, and W. D. Wei, "Scheduling imprecise computation tasks with 0/1 constraints," Technical Report, Department of Computer Science and Engineering, University of Nebraska, 1992.

[15] Shih, W. K. and J. W. S. Liu, "On-line scheduling of imprecise computations to minimize total error," Proceedings of the 13th IEEE Real-Time Systems Symposium, Phoenix, AZ, pp. 280-289, December 1992.

[16] Cheong, I., "Heuristic algorithms for scheduling error-cumulative, periodic jobs," Ph.D. thesis, Department of Computer Science, University of Illinois, January 1993.
[17] Shih, W. K. and J. W. S. Liu, "Minimization of the maximum error of imprecise computations," submitted.

[18] Ho, K. I. J., V. K. M. Yu, and W. D. Wei, "Minimizing the weighted number of tardy task units," to appear in Discrete Applied Mathematics.

[19] Obradovic, M. and P. Berman, "Voting as the optimal static pessimistic scheme for managing replicated data," Proceedings of the 9th IEEE Symposium on Reliable Distributed Systems, October 1990.

[20] Koo, R. and S. Toueg, "Checkpointing and rollback-recovery for distributed systems," IEEE Transactions on Software Engineering, January 1987.

[21] Wallace, G. K., "Overview of the JPEG (ISO/CCITT) still image compression standard," Visual Communication and Image Processing '89, SPIE, Philadelphia, November 1989.

[22] Woods, J. and S. O'Neil, "Sub-band coding of images," IEEE Transactions on Acoustics, Speech, and Signal Processing, 34, October 1986.

[23] Suzuki, J. and M. Taka, "Missing packet recovery techniques for low-bit-rate coded speech," IEEE Journal on Selected Areas in Communications, 9(7), September 1991.

[24] Buneman, P., S. Davidson, and A. Watters, "A semantics for complex objects and approximate queries," Proceedings of the Seventh Symposium on the Principles of Database Systems, pp. 305-314, March 1988.

[25] Vrbsky, S. and J. W. S. Liu, "Approximate: a monotone query processing," IEEE Transactions on Knowledge and Data Engineering, October 1993.

[26] Ada 9X Mapping, Version 3.1, Ada 9X Mapping/Revision Team, Intermetrics, Inc., Cambridge, MA, August 1991.

[27] Liestman, A. L. and R. H. Campbell, "A fault-tolerant scheduling problem," IEEE Transactions on Software Engineering, Vol. SE-12, No. 10, pp. 1089-1095, October 1986.

[28] Gopinath, P. and R. Gupta, "Applying compiler techniques to scheduling in real-time systems," Proceedings of the 11th IEEE Real-Time Systems Symposium, Orlando, FL, December 1990.

[29] Kim, B. and D. Towsley, "Dynamic flow control protocols for packet-switching multiplexers serving real-time multipacket messages," IEEE Transactions on Communications, Vol. COM-34, No. 4, April 1986.
[30] Yemini, Y., "A bang-bang principle for real-time transport protocols," Proceedings of the SIGCOMM '83 Symposium on Communications Architectures and Protocols, pp. 262-268, May 1983.

[31] Zhao, W. and E. K. P. Chong, "Performance evaluation of scheduling algorithms for dynamic imprecise soft real-time computer systems," Australian Computer Science Communications, Vol. 11, No. 1, pp. 329-340, 1989.

[32] Lopez-Millan, V., W. Feng, and J. W. S. Liu, "A congestion control scheme for a real-time traffic switching element using the imprecise computation technique," submitted to the 1994 International Conference on Distributed Computing Systems.

[33] Bettati, R., N. S. Bowen, and J. Y. Chung, "On-line scheduling for checkpointing imprecise computation," Proceedings of the Euromicro '93 Workshop on Real-Time Systems, Oulu, Finland, June 1993.

[34] Mach 3 Server Writer's Guide, edited by K. Loepere, Open Software Foundation and Carnegie Mellon University, 1990.

[35] Hull, D. and J. W. S. Liu, "ICS: A system for imprecise computations," Proceedings of the AIAA Conference, October 1993; also technical report in preparation.

[36] Yu, A., "Scheduling parallel real-time tasks that allow imprecise computations," Ph.D. thesis, Technical Report UIUCDCS-R-92-1738, University of Illinois, 1992.

[37] Yu, A. and K. J. Lin, "A scheduling algorithm for replicated real-time tasks," Proceedings of the Phoenix Conference on Computers and Communications, pp. 395-402, April 1992.

[38] Yu, A. and K. J. Lin, "Recovery manager for replicated real-time imprecise computations," Proceedings of the IEEE Workshop on Parallel and Distributed Real-Time Systems, Newport Beach, CA, April 1993.
SECTION 3.2

Analytic Redundancy for Software Fault-Tolerance in Hard Real-Time Systems

Marc Bodson, John Lehoczky, Ragunathan Rajkumar, Lui Sha, and Jennifer Stephan
ABSTRACT

This chapter develops a new methodology for the design of reliable control systems. The impressive capabilities of modern computers have enabled the implementation of highly sophisticated intelligent control methods even in relatively modest applications. However, the risk of software errors and the potential for failures due to unanticipated algorithmic behavior and modes of operation may increasingly exclude the use of such technologies in applications where timing or safety is critical. We present a new approach to software fault-tolerance that will ensure that high-performance intelligent control is achievable together with high reliability. The idea is based on redundancy of the controller software with a complementary reliable/high-performance structure that exploits a significant disparity between the two systems. We discuss various software error types and review current methods of software fault-tolerance. We present the new methodology, discuss issues that arise in its use, and present experimental results for a particular control system.

Research supported in part by the Office of Naval Research under Contract N00014-92-J-1524.
Author affiliations: Department of Electrical Engineering, University of Utah, Salt Lake City, UT 84112 (Bodson); Department of Statistics, Carnegie Mellon University, Pittsburgh, PA 15213 (Lehoczky); Software Engineering Institute, Carnegie Mellon University, Pittsburgh, PA 15213 (Rajkumar, Sha); Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA 15213 (Stephan).
3.2.1 The Software Fault-Tolerance Problem

3.2.1.1 The Evidence
There has been great progress in the area of hardware fault-tolerance. Unfortunately, the same cannot be said about software fault-tolerance. Computing system failures are increasingly caused by software faults. A commercial study by Tandem Computers Inc. [10] evaluated customer outage reports over a five-year period. Their data indicate that the mean time between faults (MTBF) for software remained about constant while the MTBF for hardware increased from less than 50 years to approximately 300 years, an increase of approximately 500%! More specifically,

• In 1985, MTBF(hardware)/MTBF(software) = 1.2
• In 1987, MTBF(hardware)/MTBF(software) ≈ 1.7
• In 1989, MTBF(hardware)/MTBF(software) = 9.4.

What are the reasons behind such a dramatic increase in the percentage of failures caused by software? Hardware has become more reliable, through better quality control and quality improvement. High hardware reliability can also be achieved through redundancy. On the other hand, methods to improve software reliability have not been as successful. In addition, user expectations for software's functionality are continually increasing, which leads to increased software complexity and size. Woodfield [24] concluded from an experiment that a 25% increase in problem complexity results in a 100% increase in program complexity. A 1991 study by Andrews [1] measures program size by lines of code for several applications. These measures ranged from 19,000 lines of code for an electronic 4-speed automobile transmission to 3.5 million lines of code for the B-2 bomber. Even seemingly simple (and ubiquitous) systems such as hand-held bar code scanners require 10,000 to 50,000 lines of code.

While the consequences of software errors are often benign, there is ample evidence that in some cases they can be extremely costly, and even fatal. The recent bestseller by L. Lee [15] surveys a series of tragic examples, including X-ray machines administering lethal doses to patients, the downing of an Iranian jetliner, and the shutdown of the AT&T network for several hours in January 1990 due to a minor software bug. A second software failure that severely affected the telephone industry is detailed by Watson [23]. The incident occurred
in the summer of 1991, when one incorrect line of code caused 20 million phone customers to lose service.

Software failures can be traced to a number of distinct causes, including specification errors, design errors, coding errors, operating system errors and user errors. There is a significant body of literature aimed at classifying software errors (for example, Avizienis and Kelly [4], Glass [9] and Kelly and Murphy [12]) and obtaining measures of reliability for software (for example, Musa [17], Pfleeger [18]). Ironically, fixing software bugs is itself a significant source of new errors (Levendel [16]). In fact, as Watson [23] details, the second extreme failure of the telephone industry mentioned above was due to a software patch that had recently been added. With current methods it appears to be nearly impossible to guarantee that any software of significant size will be free of errors.
3.2.1.2 Software Fault-Tolerance in Control Systems
In the case of control applications, software errors could also include failures of the control algorithms themselves, due to unanticipated operating conditions for which the software was never tested. For example, the closed-loop dynamics of neural network controllers are not fully understood. In the case of complex systems controlled by expert systems or fuzzy logic, it is usually impossible to guarantee that all the possible logical paths have been tested. For all practical purposes, the potential existence of latent errors, whether implementation-based or algorithmic, poses a serious problem for the implementation of intelligent control methods in applications where correct operation is critical.

Currently, there exists a tradeoff between the performance and the reliability of control systems. On the one hand are recently developed advanced control algorithms that have uncertain reliability while providing high performance. On the other hand, classical control systems, while being well understood and highly reliable, often cannot achieve the performance levels of more advanced control techniques. The objective of our research is to escape from the tradeoff between functionality and reliability. To achieve this goal, software errors and failures due to unanticipated algorithmic behavior and modes of operation of advanced controllers must be tolerated.

To make matters more difficult, an increasing number of control applications involve unstable systems. The X-29 aircraft, for example, is highly unstable. Since the instability is a result of choosing a configuration which increases per-
formance and/or reduces fuel consumption, future aircraft (even commercial aircraft) are expected to be dynamically unstable. Other applications involving unstable systems include magnetic levitation systems and certain chemical processes in which the highly efficient operating points correspond to an unstable equilibrium.

There is no indication that the technology for building provably correct real-time software will become available for complex systems in the foreseeable future. It is also unclear when modern methods such as neural networks, fuzzy logic and others will be fully understood. The inability to guarantee the total reliability of high-performance intelligent control methods will prevent their application if no method is developed to guarantee their reliability as well. System developers would then be forced to choose between a simple and reliable controller with less functionality and a high-performance but complex controller with less reliability.
3.2.1.3 Existing Solutions to Software Fault-Tolerance
The two most well-known approaches that have been proposed to tolerate software errors are the recovery block approach, introduced by Randell [19], and N-version programming, first proposed by Avizienis [3] and [4]. Both employ multiple versions of software to achieve (informational) redundancy.

Under the recovery block approach, the selection of the correct answer is performed by an acceptance test against which all alternatives are judged. All alternatives are required to satisfy the specifications. However, the acceptance test is not an implementation of the specification. Rather, it is a set of assertions that all the implementations must observe. Acceptance tests are statements regarding the invariants of the computation or the allowable range of the output values. They are not usually tight bounds on the solution. As a result, a faulty answer produced by the primary implementation may go undetected by the acceptance test. This approach is effective when the primary software fails or gives a wildly wrong answer; however, it is hard to write a sufficiently stringent acceptance test to detect more subtle software errors which may still crash the system.

N-version programming attempts to achieve software fault-tolerance by combining design diversity with redundancy. The idea is certainly valuable, but its utility seems to be limited to small applications. Unfortunately, creating multiple versions of a large software system is unrealistically expensive in many applications. Indeed, the total testing required for multiple versions of the software could often be more effectively applied to a single version. In addition,
the approach offers few guarantees in terms of reliability. It is difficult to assure that design bugs in multiple versions will be independent, since common design errors could result from subtle flaws in the modeling and specification process and from cases that are intrinsically hard to predict [13]. The current approaches to software fault-tolerance certainly help to overcome certain categories of faults. These techniques are able to tolerate some errors of design and coding, and the extent of their coverage depends upon the independence of the designs and of the code that is generated. Still, existing techniques are unable to tolerate errors related to the specifications, because all existing techniques use one specification as the basis for the design. Fortunately, the method of analytic redundancy described in this chapter is designed to address this category of errors as well as the other categories discussed above.
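As a point of reference for the comparison made later in section 3.2.2.5, the control structure of a recovery block can be sketched as follows. This is only a schematic in Python; the function names and the (deliberately loose) acceptance test are hypothetical, and the checkpoint/rollback step of the full scheme is omitted.

def recovery_block(inputs, alternates, acceptance_test):
    """Run the alternates in order until one passes the acceptance test.
    `alternates` are independently written routines for the same specification;
    `acceptance_test` is a set of assertions on the result, usually not a tight
    bound, so a subtly wrong answer from the primary may still be accepted.
    (State restoration between alternates is omitted in this sketch.)"""
    for implementation in alternates:
        result = implementation(inputs)
        if acceptance_test(inputs, result):
            return result
    raise RuntimeError("all alternates failed the acceptance test")

# Hypothetical use: a primary and a backup routine with a loose range check.
primary = lambda x: x ** 0.5
backup = lambda x: abs(x) ** 0.5
loose_test = lambda x, r: r >= 0 and abs(r * r - x) < 1e-3
value = recovery_block(4.0, [primary, backup], loose_test)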
3.2.1.4
Fundamental Observations
The following observations about the fundamental problems of software fault-tolerance will be central to the approach adopted in this chapter.
1. Complex means unreliable. The likelihood that a piece of software has residual software errors increases as the software becomes more complex and the program length increases. It is unreasonable to expect bug-free performance from large or complex software.
2. Common designs and specifications mean common errors. Clearly, the less correlated two pieces of software are, the better the chances that the residual software errors will occur under different sets of conditions. This is the basic premise of design diversity. However, independence should go beyond the current methods of implementing design diversity. Design diversity uses designs and methodologies based on one common specification. The fact that all designs are based on one specification seriously limits the range of solutions that are available to the designer. Therefore, independence is not assured solely by employing different teams of programmers, different specification and programming languages, etc., but also requires completely different designs and methodologies. This greater diversity in design and methodology may be afforded by different specifications, which will increase the likelihood of truly independent implementations.
3. Dissociated software and control designs limit reliability. Typically, system reliability is approached at each level by a different specialist. Computer engineers address computer hardware reliability issues, software engineers address software reliability issues, and control engineers address actuator and sensor reliability issues. However, there is great potential in an integrated, domain-specific approach. The area of control systems, for example, is an attractive candidate for such an approach, especially for applications for which accurate mathematical models of the system's behavior are available. In these cases, the correctness of the system's operation can be evaluated not only on the basis of the values it produces, but also through its performance in controlling the system.
3.2.2
A New Approach to Software Fault-Tolerance
3.2.2.1
The Simplex Architecture
The previous section discussed the apparent existence of a fundamental tradeoff between the performance (or functionality) of software and its reliability. Only a new approach to this problem will permit one to escape from this tradeoff and achieve both high performance and high reliability. To this end, Sha, Lehoczky and Bodson [20] introduced the concept of the Simplex architecture for reliable software design. Subsequent publications, [5], [6] and [21], detailed initial work and experimentation using the Simplex architecture. This new approach is based on the concept of analytic redundancy. The method combines two different software systems, one with high performance and another with high reliability in such a way that the composite software system exhibits the strengths of each subsystem. The two are combined using the Simplex architecture, a software engineering architecture which permits the high-performance software to run while it is performing properly but will switch control to the
reliable software when a failure occurs or the system is becoming unsafe. Moreover, the Simplex architecture will ensure the timing correctness of the system even if the high-performance software fails. The name Simplex architecture is chosen because it depends, for reliability, on a single highly reliable component. The concept of redundant hardware and software is already present in existing systems, but the Simplex architecture offers innovative features. The complementary reliable/high-performance structure that is proposed exploits a significant disparity between the two systems. Current systems are usually based on duplication, or on the combination of systems of comparable complexity designed from a common specification by independent teams. In contrast, the Simplex architecture exploits the relative advantages of simple and reliable versus complex high-performance systems. The two systems are designed with different needs in mind. Another innovative feature of the approach is the mechanism for deciding which piece of software to use. Clearly, an automatic procedure is desirable, and classical methods such as voting are inadequate. The proposed decision is based on measures of the performance and of the safety of the system. This approach requires the incorporation of a significant amount of knowledge about the behavior of the system. It also requires an interdisciplinary approach. For instance, when the applications are control systems, implementation of the failure detection for the Simplex architecture involves an understanding of control, real-time, and software issues. These innovative features provide the potential for an increase in reliability and performance. As proposed, the Simplex architecture provides a solution to the existing tradeoff between reliability and performance. Moreover, it has potential as a method for ensuring the safe insertion of new technology. In the case of critical applications, the insertion of new technologies carries significant risks, and these risks often prevent their adoption. For example, many advanced control algorithms have not been used in flight control due to the uncertainty that they represent to traditional flight control engineers. The risk of failures is significant, given the complexity of the algorithms and the lack of a full understanding of the possible responses of such systems. The Simplex architecture can potentially reduce the risk from the use of these advanced control algorithms to acceptable levels. In this case, the old technology plays the role of the simple controller, while the new technology becomes the complex controller. Essentially, the approach offers the promise of safe technology upgrades.
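The switching idea at the heart of the Simplex architecture can be pictured with a minimal control-loop sketch. This is only an illustration in Python: the controller objects and the safe() and performing_well() predicates are placeholders standing in for the mechanisms developed in the following sections, not the published design.

def simplex_step(state, complex_ctrl, simple_ctrl, safe, performing_well):
    """One period of a Simplex-style loop: prefer the high-performance
    controller, but fall back to the simple, reliable one whenever the
    complex controller misbehaves or the system looks unsafe."""
    try:
        u_complex = complex_ctrl(state)      # may be wrong, late, or crash
    except Exception:
        u_complex = None
    if u_complex is not None and safe(state) and performing_well(state, u_complex):
        return u_complex                     # high performance while trustworthy
    return simple_ctrl(state)                # reliable, lower-functionality fallback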
3.2.2.2 Application of the Simplex Architecture to Feedback Control
The theory of the Simplex architecture has been explored primarily through the application of the ideas to control systems. For this class of applications, the Simplex architecture uses a full-function, high-performance but complex control system complemented by an error-free implementation of a highly reliable control system of lower functionality. When the correctness of the high-performance controller is in doubt, the reliable control system takes over the execution of the task. The procedure allows the system to recover, as well as to maintain some acceptable, minimal level of performance. The high-performance controller may take charge of the task again at a later time, after the error has been cleared (for example, by reinitializing and restarting the high-performance software). Under the Simplex architecture, both the reliable and the high-performance software are executed in parallel. Part of the task of the reliable software is to decide which output to use to control the system. The reliable and high-performance systems are analytically redundant, in the sense that they provide different solutions to the same problem. The concept is similar to the analytic redundancy that has been studied to tolerate hardware failures, as by Chow and Willsky [7]. In this case, however, the system that may fail is the control system, instead of the controlled system. Further, the characteristics of the failed system are intrinsically more difficult (if not impossible) to predict or categorize. Therefore, detection is a completely different problem. The correct behavior of the high-performance control system is assessed using measures of safety and performance. In addition to the fault detection logic, the two control systems must be designed. The design of the high-performance control system is the subject of much of the current research in control engineering. On the other hand, the design of reliable controllers to perform the functions that are considered here is not conventional. Issues such as software complexity, reliability and recoverability are not among the objectives that are optimized by standard control methodologies.
3.2.2.3
Simple Controller Issues
One of the foremost issues in developing reliable software using the Simplex architecture is the design of the simple, reliable controller. This step is a crucial one since the simple controller is responsible for the reliability of the system. The simple controller has several requirements: it must monitor the behavior and performance of the complex controller and it must recover from failures of the complex controller while maintaining system performance at an acceptable level.
3.2.2.3.1
Failure Detection
A major problem in the implementation of the reliable controller is to find an adequate detection mechanism so that the reliable controller can take over the control of the system when the high-performance controller fails. It is difficult to check whether a control system is performing correctly based on the values of the control signals it generates. Further, two valid control systems could produce two totally different outputs. Therefore, validity checks and voting are of very limited use. To assess the validity of the high-performance control system, it is proposed to consider measures of performance and of safety. Indeed, the response of the system under control gives significant information about the validity of the controller. One may check, for example, whether the performance meets the expectations. Using such information means integrating system knowledge with the software design process, but it also opens up new possibilities that cannot be considered in a dissociated software/control design process. To explore this concept, we assume the existence of a state-space of the controlled system. This state-space does not have to be the whole state-space of the system. It may be a lower dimensional projection of the whole state-space. The reduced state-space consists of two non-overlapping regions: the recoverable region and the unrecoverable region. The control objective is to maintain the system at the set-point. The performance is measured by the distance from this set-point. The unrecoverable region is the set of states such that the system has failed or will fail, no matter what control inputs are used. This region must be avoided absolutely. Unrecoverable regions are always present in unstable systems with
bounded actuator inputs and state constraints. The decision to switch control from the high-performance controller to the reliable controller may be made in several ways. A first way is to switch control if the complex controller has driven the system to a state near the edge of the unrecoverable region. Here, "near" is defined to be inside a buffer zone of small width denoted by ε. The buffer zone is chosen to be the same shape as the unrecoverable region. When the system state enters this buffer zone, switching of control is made. This switching logic is essentially a worst case detection. In this case, the high-performance controller has (likely) failed, since the system state will (probably) soon be unrecoverable. Of course, it may be the case that the complex controller will not drive the system into an unrecoverable state, but rather that the system state will simply move near the edge of the unrecoverable region during a portion of the control. However, it is this segment of the system trajectory which threatens the safety of the system. The switching logic assumes that soon after this time the reliable controller can no longer be absolutely counted on for recovery, since the system state may be unrecoverable. Thus, a switch is made. This switch may be termed the safety switch in that it is based on ensuring reliability of the system. No performance considerations are included. A second way to make the decision of takeover by the reliable controller incorporates performance measures. In this case, if the performance of the complex controller is not acceptable, the reliable controller assumes control of the system. More precisely, the decision may be based on deviation from the expected behavior of the high-performance controller. This expected behavior can be deduced from a priori analyses or through on-line simulations. If the high-performance controller behaves much differently from the predicted behavior, the controller may be in the process of failing. Another performance measure to be considered is the high-performance controller's tracking error. An alternative definition incorporates more of the system variables by considering the (full or reduced) state-space and measuring the error as the distance of the system state from the set-point. A third approach is to switch control if the high-performance controller deviates significantly from the expected performance of the reliable controller. If the high-performance controller behaves more poorly than the reliable controller would under the circumstances, it is very likely that the complex controller has failed. Certainly, in this case, a switch to the reliable controller should be made, since it would be performing better than the complex controller. A fourth approach is to perform sanity checks on the actions of the high-performance controller. Sanity checks would ensure that the complex controller is behaving in a reasonable manner. For instance, if the error is large,
the controller should be acting in a way that will decrease the error, not increase it. Or, it may be known that certain variables and signals should be within specific ranges. When the high-performance controller violates these sanity checks, this suggests that the controller has failed, and the reliable controller is switched into active control of the system.
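The four switching criteria just described can be summarized as a single decision predicate. The sketch below is only a schematic in Python; the `checks` object bundling the application-specific measures, and all thresholds, are hypothetical.

def should_switch(state, checks, eps):
    """Return True if control should pass to the reliable controller."""
    # 1. Safety switch: the state has entered the buffer zone of width eps
    #    around the unrecoverable region.
    if checks.distance_to_unrecoverable(state) < eps:
        return True
    # 2. The complex controller deviates too much from its own predicted behavior.
    if checks.prediction_error(state) > checks.prediction_tolerance:
        return True
    # 3. The complex controller performs worse than the reliable controller would.
    if checks.tracking_error(state) > checks.reliable_expected_error(state):
        return True
    # 4. Sanity checks (reasonable actions, signals within known ranges) violated.
    if not checks.sanity_ok(state):
        return True
    return False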
3.2.2.3.2
Maximum Recoverability
The second major issue in the design of the simple controller involves the recovery from failures of the high-performance controller once they have been detected. Specifically, the reliable controller must be designed to return the system to a safe state as quickly as possible. The reliable controller must also be able to do this for a wide range of states. These two requirements are closely related. As detailed in section 3.2.2.3.1, the state-space of the system consists of two regions, the recoverable and unrecoverable regions. Since the unrecoverable region is the set of states from which it is impossible to recover, no matter what control inputs are used, the recoverable region contains all the states from which it is theoretically possible to recover given appropriate control. The recoverable region is thus the largest set of states from which any controller can possibly recover, and it determines the ideal recoverability region for any given controller. The goal in designing the simple controller is to achieve a region of recoverability which closely matches this ideal. At the same time, the simple controller must return the system to a safe state as quickly as possible. Generally, these conditions require the use of the maximum control available, and time-optimal control theory will likely be useful for the design of the reliable controller.
3.2.2.4
Software Architecture
An important aspect of the Simplex architecture is the software architecture that is used to ensure that no software errors in the complex controller can compromise the safety of the system. Such compromises could occur, for example, if the complex controller caused an error which crashed the system or if it went into an infinite loop and caused a timing error. The software architecture that we currently use to manage analytical redundancy is illustrated in figure 3.2.1. The architecture has been designed and implemented on LynxOS, a real-time
operating system that is compliant with POSIX draft standards for real-time support. There are three real-time processes:
1. Analytic Redundancy Manager and Simple Software Process: This is a periodic process which embeds the analytic redundancy management (ARM) functions and the simple controller software. The analytic redundancy management functions are application independent, and handle all interprocess communications, as well as the selective creation, management and killing of processes. For example, a decision to kill the complex process is immediately enacted logically by the manager, which ignores the complex process after the decision, and a request is sent to a lower priority real-time manager process to actually kill the complex process. The simple control software is application dependent but interfaces with the analytical redundancy manager via an application-independent interface. This interface is procedural and requires functions for system initialization, parameter settings based on user inputs, input data sensing, output generation, simple computation, system safety check, and system reset. The application-specific decision logic, which embeds the failure detection logic and the decision logic to kill the complex process, is also invoked through this interface.
2. Real-Time Manager Process: This process is solely responsible for creating and killing the complex software process at the request of the analytical redundancy management functions. This process runs at a lower priority than the manager process so that the complex process is started and killed without affecting the timing behavior of the simple software computations.
3. The Complex Software Process: This process contains an application-independent module which interfaces with the analytical redundancy manager and the complex software manager processes. This interface hides the details of interprocess communication, sensor data input, initialization and parameter settings from the ARM functions, and returns the results to the decision logic module for enforcement of the failure detection logic. An application-dependent module implements the complex control software. This process has the lowest priority and can be preempted by the other two processes.
When the application changes, only the shaded application-dependent portions of figure 3.2.1 need to be replaced. The simple software performs all I/O related to the control application, and passes the sensor data to the complex software, which
performs only computation. Results returned by the complex software, as well as sensor readings, are tested by the simple software by applying its failure detection logic. If the complex software is judged to be performing erroneously or does not return a result in time, it is killed and restarted by issuing a request to the Real-Time Manager Process. The following features are currently supported by this software architecture and the system environment.
• Online upgradability: This feature allows a new version of the complex process to be brought online while the device is being controlled by the simple software. Control is switched to the complex controller when it comes online, and this continues until a failure of the complex controller is detected.
• Simultaneous control of multiple devices: LynxOS supports a priority-based preemptive scheduler, and two or more devices can be controlled at the same time by assigning priorities based on rate-monotonic scheduling analysis.
• Flexible user interface: The management functions are defined in such a way as to facilitate the testing of new software and/or parameter settings. The user can run only the simple software, only the complex software, or both under analytical redundancy constraints, with a user-chosen failure detection logic to be applied.
• Multiprocessor/distributed system support: The software architecture uses network addresses for inter-process communication and thereby allows the complex software manager and complex software processes to run on a different processor. This is particularly useful in situations where additional processors can be used to run the complex process and improve the performance of a controller without compromising safety.
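To make the division of responsibilities concrete, the application-independent interface between the analytical redundancy manager and the application-dependent simple software might look roughly as sketched below. This is a Python rendering of the entry points listed above; the real system is a set of LynxOS processes, and the names here are illustrative only.

class SimpleSoftwareInterface:
    """Application-dependent entry points invoked each period by the
    application-independent analytic redundancy manager (hypothetical names)."""
    def initialize(self): ...                    # system initialization
    def set_parameters(self, user_inputs): ...   # parameter settings from user inputs
    def sense(self): ...                         # input data sensing (all I/O happens here)
    def compute(self, sensor_data): ...          # the simple control computation
    def safety_check(self, sensor_data, complex_result): ...  # failure detection logic
    def actuate(self, output): ...               # output generation
    def reset(self): ...                         # system reset

def arm_period(simple, complex_result):
    """One period of the manager: run the simple software, apply the failure
    detection logic to the complex software's result (None if it was late or
    killed), and select the output to actuate."""
    data = simple.sense()
    backup_output = simple.compute(data)
    use_complex = (complex_result is not None and
                   simple.safety_check(data, complex_result))
    simple.actuate(complex_result if use_complex else backup_output)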
3.2.2.5
Comparison with Existing Techniques
The Simplex architecture for achieving software fault-tolerance has some similarities with the existing techniques detailed in section 3.2.1.3 but its differences are fundamental. The Simplex architecture could be thought of as a variation of
[Figure 3.2.1: Software Architecture. Block diagram of the Analytic Redundancy Manager with the Reliable Controller and Logic, the Real-Time Process Manager, and the Complex Controller with its Manager Interface, connected by Create/Kill requests, Input/Output paths, and the Result returned to the decision logic.]
2-version programming with design diversity, for example. Nevertheless, there are several important differences. First, in both the N-version programming and recovery block methods, all of the different versions of the software are subjected to the same set of tests, either in the form of a majority rule or an acceptance test. In our approach, the simple software is the trusted software, and as such its outputs are not subjected to the failure detection logic. In fact, its outputs may be used to help decide if the complex software is correct, depending on the type of failure detection logic. The simple software is not just alternative software which could also be either correct or incorrect. Second, in both the N-version programming and the recovery block methods, it is assumed that it can be determined whether the output of a questionable piece of software is correct. The methods do not address the handling of leakage errors, i.e., what should be done if the questionable software's output is wrong but the error is undetected. Our approach allows undetected incorrect outputs to be sent to the actuator as long as the system remains inside the recoverable region. Only then does the questionable software undergo a performance evaluation. In other words, we work under the assumption that it is impossible to filter out all the incorrect software outputs. Instead, our approach is to deal with the effect of incorrect outputs. Third, the two versions of the software are not alternative designs to satisfy the same set of specifications, but are very different designs developed from different specifications. Indeed, the design of the simple and reliable controller must carefully address the effects that can be caused by incorrect output from the complex "high-performance" controller. Thus, the simple controller
can also be viewed as a part of an error detection and recovery mechanism. Finally, in the other two approaches, especially in N-version programming, great care is exercised to ensure the "independence" of the different versions so that there will be no common failures. In our approach, the notion of avoiding common mode failure does not exist, since we invest the effort necessary to ensure that the simple but reliable controller is correct. Consequently, we do not need to go to great lengths to isolate designers from implementors and teams from one another to help ensure "independence." Under the Simplex architecture, if any part of the simple controller, for example a filter, can be utilized by the complex software, we encourage its reuse to save the cost of development and testing. Of course, the simple software cannot use any portion of the complex software, since the latter is assumed to be unsafe.
3.2.3 Experimental Testbed - A Ball and Beam Experiment
In the cooperative research effort between Carnegie Mellon University (CMU) and the Software Engineering Institute (SEI), there are two experimental sets of hardware. At CMU, a high-performance ball and beam experimental system has been created. At the SEI, the ball and beam device and an inverted pendulum were purchased from laboratory equipment suppliers. The CMU experiments are directed toward reliable controller design and failure detection logic, while the SEI experiments focus on real-time and software architecture issues. Recently, the SEI prototype was demonstrated at the 1993 SEI Symposium, the 3rd International Workshop on Responsive Systems, and the Flight Data Division of NASA Johnson Space Center. We have demonstrated that it is possible to improve control performance on-line without either of the devices failing. Furthermore, to demonstrate the prototype's robustness, the audiences were invited to insert arbitrary bugs of their choice into the complex high-performance software.¹ Experts in real-time systems research and in fault-tolerance research participated in this testing. The faults injected included programming system bugs, such as pointer errors; timing errors, such as infinite loops; and control algorithm errors, such as incorrect or randomly changing gains. None of the injected faults led to system failures. Moreover, the software architecture has successfully survived extensive testing using the
commercially available program "crashme." In the following sections, we focus our attention on the control-theoretic testing that is being done at CMU.
¹The control software was mailed to NASA JSC three weeks prior to the demonstration to facilitate the testing.
3.2.3.1
A Ball and Beam Setup
The ball and beam process is a simple experiment often used to illustrate the use of feedback for the control of unstable systems [2]. The system consists of a ball rolling freely on a beam. An indentation or channel in the beam ensures that the motion of the ball is along a straight line. In the specific setup discussed here, two closely spaced cylindrical beams were used instead. The position of the ball is measured by placing a resistive material on the sides of the channel and using a ball made of conductive material. The control variable is the current applied to the motor that changes the angle of the beam. The objective of the experiment is to design a controller to position the ball at an arbitrary position along the beam, or to follow a desired trajectory of motion. The ball position is denoted x and the beam angle with respect to the horizontal is denoted by θ. Two further state variables are the derivatives of the variables x and θ: respectively v, the ball velocity, and ω, the angular rate. The torque delivered by the motor is denoted by τ. Assuming that a DC motor is used, this torque is proportional to the current applied to the motor and can be controlled directly with a current-controlled linear amplifier. A simple linearized model of the ball and beam system is given by
\frac{dx}{dt} = v \qquad (1)

\frac{dv}{dt} = -\frac{5}{7}\, G\theta \qquad (2)

\frac{d\theta}{dt} = \omega \qquad (3)

\frac{d\omega}{dt} = \frac{1}{J}\left(\tau - M_B G x\right) \qquad (4)

The units are x (m), v (m/s), θ (rad), ω (rad/s), and τ (N·m). The parameters of the model are the acceleration of gravity G (m/s²), the inertia of the motor and beam around the axis of rotation J (kg·m²), and the mass of the ball M_B (kg). The term 5/7 originates from the fact that, assuming that there is no slipping, the rotational inertia of the ball is added to its translational inertia.
The model assumes a linearization around the origin. Two more detailed, nonlinear models are available in the literature; see Hauser, Sastry and Kokotovic [11] and Sobhani et al. [22]. Examples of nonlinear effects include the centrifugal force acting on the ball when the beam rotates quickly, and the variation of the total inertia with the ball position. The complete analysis of Sobhani et al. also considers effects due to the distance between the centers of rotation of the ball and of the beam, and the distance between the centers of rotation of the beam and of the motor. Our experience so far indicates that, for our particular laboratory setup, the linearized model is adequate for control design purposes. The system has two main state variables: the position of the ball and the velocity of the ball. The additional states associated with the dynamics of the beam and of the motor can be neglected if a fast inner loop is designed to control the beam position. Even so, this system is unstable. If the beam angle is directly controlled, the ball position is proportional to the double integral of the beam angle. If, for example, the beam is tilted at a fixed angle, the ball position increases parabolically with time until the ball reaches the end of the beam. With the maximum angle, in our testbed, it takes only 1 second for the ball to reach the end of the beam, starting from the center. This makes the reliability of the controller critical: one cannot turn off the control law or apply incorrect control inputs for any but brief periods of time. There are also some unrecoverable states because of the boundedness of the state and control variables. Since the beam can only be tilted within a restricted range and the beam has finite length, it is impossible to keep the ball on the beam if the ball moves at high speed towards the edge.
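To make the instability concrete, the ball dynamics (1)-(2) can be integrated numerically for a fixed beam tilt. The sketch below assumes illustrative values for the maximum beam angle and the beam half-length (roughly 39 cm, as mentioned later in the experiments), since the exact parameters of the laboratory setup are not given in the text.

G = 9.81            # m/s^2, acceleration of gravity
THETA_MAX = 0.2     # rad, assumed maximum beam angle (illustrative)
X_MAX = 0.39        # m, assumed beam half-length (edge near 39 cm)

def ball_run_off_time(theta, x0=0.0, v0=0.0, dt=1e-3):
    """Integrate dx/dt = v, dv/dt = -(5/7) G theta for a constant beam angle
    and return the time at which the ball reaches the end of the beam."""
    x, v, t = x0, v0, 0.0
    while abs(x) < X_MAX:
        v += -(5.0 / 7.0) * G * theta * dt
        x += v * dt
        t += dt
    return t

print(round(ball_run_off_time(-THETA_MAX), 2))  # roughly 0.75 s: on the order of 1 second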
3.2.3.2 Steps in the Development of the Fault-Tolerant System
3.2.3.2.1
Reduced State-Space
The first step in developing the fault-tolerant system is determining an appropriate state-space for the ball and beam system. As explained in section 3.2.2.3.1, the state-space does not have to be the whole state-space of the system and may be a lower dimensional subset of the whole state-space. Such a simplification may be necessary when the full state-space is of high dimensionality. Additionally, some states may not be necessary for a complete analysis. In the case of the ball and beam system, the full state-space involves the four
states x, v, θ, and ω, whose dynamics are given by equations (1)-(4). If a fast θ-controller is designed, then θ = θ_d approximately, and the beam and motor dynamics may be neglected. By defining a new control input u = θ_d, the system model is reduced to (1) and (2) alone. The state-space for the reduced model is now two-dimensional, where the two states are the position and velocity of the ball.
[Figure 3.2.2: Ball and Beam: Unrecoverable Region. The reduced state-space, with ball velocity on one axis and ball position on the other; the unrecoverable region lies beyond the left and right edges of the beam.]
3.2.3.2.2
The Recoverable and Unrecoverable Regions
The next step is to determine the recoverable and the unrecoverable regions of the state-space. To this end, it is useful to determine the behavior of the system when full (maximum) control is asserted. When θ = θ_max (θ_max is the maximum beam angle achievable) the solutions of the differential equations (1) and (2) are:

x = x_0 + v_0 t - \frac{5}{14}\, G\theta_{max} t^2 \qquad (5)

v = v_0 - \frac{5}{7}\, G\theta_{max} t \qquad (6)
Then, equation (5) may be rewritten as

x = x_0 - \frac{7}{10\,G\theta_{max}}\left[-\frac{10}{7}\, G\theta_{max} v_0 t + \frac{25}{49}\, G^2\theta_{max}^2 t^2\right] \qquad (7)

or equivalently

(x - x_0) = -\frac{7}{10\,G\theta_{max}}\left(v^2 - v_0^2\right) \qquad (8)
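For the record, (8) follows from (5) and (6) by eliminating t; a reconstruction of the step, consistent with the equations above, is:

\[
t = \frac{7\,(v_0 - v)}{5\,G\theta_{max}}, \qquad
x - x_0 = v_0 t - \frac{5}{14}\, G\theta_{max} t^2
        = \frac{7\,(v_0^2 - v^2)}{10\,G\theta_{max}}
        = -\frac{7}{10\,G\theta_{max}}\left(v^2 - v_0^2\right).
\]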
Equation (8) describes the ball dynamics when full control (θ = θ_max) is asserted. A limiting case under these circumstances is when the ball reaches the end of the beam with zero velocity. Substituting x_0 = x_max (where x_max = beam length / 2) and v_0 = 0 into equation (8) and rearranging the terms results in

x = x_{max} - \frac{7}{10\,G\theta_{max}}\, v^2 \qquad (9)
This equation is a parabola in the (v, x) plane which, as it turns out, defines a boundary of the set of recoverable states. Repeating the above logic for the case where θ = -θ_max results in a second parabola:

x = -x_{max} + \frac{7}{10\,G\theta_{max}}\, v^2 \qquad (10)
This second parabola defines another boundary of the set of recoverable states. The two further conditions which define the final boundaries of the set of recoverable states occur when the ball is at the end of the beam (x = ±x_max). These four conditions may be combined to create one region in the state-space, see figure 3.2.2. It is clear from the shape of the region in figure 3.2.2 that the region is formed by the two parabolas and the two lines x = ±x_max. If the ball is outside the region, it is impossible to control the system dynamics, no matter what control is asserted. If the ball is inside the region, it is possible to control the system dynamics if the maximum angle is used by the controller. Such states are considered recoverable. Essentially, a state is unrecoverable in one of two ways. It is either off the end of the beam (outside the two lines x = ±x_max) or moving towards the end of the beam with sufficiently great velocity that it cannot be stopped with a maximum tilt of the beam (outside the two parabolas). The set of recoverable states for an arbitrary ball and beam system is also indicated in figure 3.2.2. In the figure, A = (0, x_{max}), B = (\sqrt{\tfrac{10}{7} G\theta_{max} x_{max}},\, 0), and C = (\sqrt{\tfrac{20}{7} G\theta_{max} x_{max}},\, -x_{max}).
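The recoverable region bounded by (9), (10) and the beam edges translates directly into a membership test. The sketch below reuses the illustrative parameter values assumed earlier; it is not the project's code.

def is_recoverable(x, v, g=9.81, theta_max=0.2, x_max=0.39):
    """True if the state (x, v) lies inside the region bounded by the two
    parabolas (9)-(10) and the lines x = +/- x_max (illustrative parameters)."""
    if abs(x) > x_max:                      # off the end of the beam
        return False
    k = 7.0 / (10.0 * g * theta_max)        # stopping distance is k * v**2
    if v > 0 and x > x_max - k * v * v:     # moving right too fast to stop in time
        return False
    if v < 0 and x < -x_max + k * v * v:    # moving left too fast to stop in time
        return False
    return True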
3.2.3.2.3
Design of the Reliable Controller
The controller for the ball and beam system consists of two distinct parts, which are shown in figure 3.2.3: the θ-controller, which controls the beam dynamics, and the x-controller, which controls the ball position. If a fast θ-controller is designed, then the beam and motor dynamics may be neglected when considering the design of the x-controller. To this end, a proportional-integral-derivative controller with high bandwidth is chosen as the θ-controller. The control law is
\tau = M_B G x + J\left(-k_p(\theta - \theta_d) - k_v\frac{d}{dt}(\theta - \theta_d) - k_i\int(\theta - \theta_d)\,dt\right) \qquad (11)
where k_p, k_v and k_i are chosen to be large. The PID control law is chosen since it is simple, reliable and reasonably effective in performance. The additional variables can be calculated using differentiation and integration or an observer. The remaining part of the control task is to design the x-controller. The design of the x-controller varies for the simple and complex control designs, while the θ-controller is the same as described earlier. From this point forward, the terms reliable and complex controller refer to the choice of design for the x-controller. As detailed in section 3.2.2.3, the design of the reliable controller must address issues concerning recoverability and performance. Foremost, since a decision may have been made that the safety of the system is in question (the ball is moving fast towards the edge), a first priority is to return the system to a safe position as quickly as possible. In the case of the ball and beam system, this requires the use of the maximum torque on the motor, or the maximum angle of the beam. The safe state is chosen to be (0, 0), that is, returning the ball to the center of the beam. The theory of time-optimal control naturally leads to a bang-bang controller using full authority. Assuming that the beam angle is controlled as described above, the bang-bang solution for the return of the ball to its central position consists in using the maximum beam deflection in one direction followed by the maximum deflection in the other direction. The decision to switch the direction of beam deflection is made through the parabolic switching curves x = \pm\frac{7}{10\,G\theta_{max}} v^2 shown in figure 3.2.4. The figure shows the state-space of the
[Figure 3.2.3: Controller Structure. The x-controller converts the desired ball position x_d into a desired beam angle θ_d; the θ-controller drives the beam, which in turn moves the ball.]
system which has been divided into two regions through the switching curve. If the system state is in region 1, then θ_d = θ_max. Likewise, if the system state is in region 2, θ_d = -θ_max. It should be noted that this same curve, with x replaced by x ± beam length/2, defines the parabolas which limit the recoverability region shown in figure 3.2.2. Once the system's safety has been ensured through use of the bang-bang controller, control is switched to a simple PID controller. The control law is given by:
\theta_d = \frac{7}{5G}\left[k_{vp}(v - \dot{x}_d) + k_{pp}(x - x_d) + k_{ip}\int(x - x_d)\,dt\right] \qquad (12)
The additional variables may be calculated through differentiation and integration or the use of an observer. As mentioned in section 3.2.2.3, acceptable performance is the second requirement of the reliable controller. This simple PID controller can be quite effective in achieving reasonable performance in the tracking of the ball position on some reference trajectory. The integration of the bang-bang and PID controllers is done in a straightforward manner. The approach is shown in figure 3.2.5 where the shaded regions indicate those portions of the state-space that the bang-bang controller would control. The unshaded regions are portions of the state-space which are controlled by the PID controller. Thus, the bang-bang controller is used only during those times that full authority is essential and at other times the PID controller controls the system.
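Putting the pieces together, a sketch of the reliable x-controller combines the bang-bang law driven by the parabolic switching curve with the PID law (12). The gains, the parameters, and the flag that selects full-authority (bang-bang) operation are illustrative; they are not the values used in the experiments.

def reliable_x_controller(x, v, x_d, v_d, integ, dt, full_authority,
                          g=9.81, theta_max=0.2,
                          kpp=4.0, kvp=3.0, kip=0.5):
    """Return the desired beam angle theta_d and the updated integral term.
    In full-authority mode (safety recovery) the bang-bang law is used;
    otherwise the PID law (12). All numerical values are illustrative."""
    if full_authority:
        # Bang-bang: maximum deflection chosen according to which side of the
        # parabolic switching curve x = -(7/(10 G theta_max)) * v * |v| the state is on.
        k = 7.0 / (10.0 * g * theta_max)
        theta_d = theta_max if (x + k * v * abs(v)) > 0.0 else -theta_max
        return theta_d, integ
    # PID law (12), driving the ball position x towards the reference x_d.
    integ += (x - x_d) * dt
    theta_d = (7.0 / (5.0 * g)) * (kvp * (v - v_d) + kpp * (x - x_d) + kip * integ)
    return theta_d, integ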
[Figure 3.2.4: Bang-Bang Switching Curves]
3.2.3.2.4
Design of the Complex Controller
For the complex controller, recoverability capabilities are not important. Though reliability remains a concern, the primary focus of the complex control design is on performance measures. A range of possibilities is available for the high-performance control of the ball position, including nonlinear controllers (such as one using approximate input-output linearization, developed by Hauser, Sastry and Kokotovic [11]), controllers based on fuzzy logic (see Laukonen and Yurkovich [14]), or adaptive controllers. The purpose of this example is illustrative, so that the choice of the high-performance controller is not a major concern. The experimental results presented here use a complex controller which is a fuzzy logic controller derived from the work of Laukonen and Yurkovich [14]. Laukonen and Yurkovich [14] defined membership functions for x_e (the position error), -v, and Δθ_d, and developed logical rules to determine the membership of Δθ_d. In the implementation of their work for our specific experimental setup, it was necessary to change the membership functions slightly, while the logical rules were maintained. An additional modification was made in the step which determines a specific value of θ_d based on its membership (i.e., defuzzification). Since Laukonen and Yurkovich [14] did not fully document their method of defuzzification, it was not possible to reproduce their work for this step. Rather,
[Figure 3.2.5: Switching Between Bang-Bang and PID Control]
defuzzification was accomplished through standard methods. Specifically, a technique which calculated the degree to which the variables belonged to each fuzzy set (as weights) was implemented.
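The weighted defuzzification step mentioned above can be illustrated generically. The sketch below is a standard weighted-average defuzzifier with made-up numbers; it does not reproduce the rule base or membership functions of [14].

def defuzzify(weights, representative_values):
    """Weighted-average defuzzification: combine the degrees to which the
    inputs belong to each fuzzy output set (the weights) with a representative
    value for each set to obtain one crisp output."""
    total = sum(weights)
    if total == 0.0:
        return 0.0
    return sum(w * c for w, c in zip(weights, representative_values)) / total

# Hypothetical example: three fuzzy sets for the beam-angle increment (rad).
print(round(defuzzify([0.2, 0.7, 0.1], [-0.1, 0.0, 0.1]), 3))   # -0.01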
3.2.3.2.5
Switching Decision Logic
Section 3.2.2.3.1 presented some general ideas about ways to detect failures of the high-performance controller. The two main measures of the high-performance controller that will be used are safety and performance. Reliable control takeover based on safety measurements has been implemented on the hardware and extensively tested. The idea is based on the definitions of the recoverable and unrecoverable regions that were given in section 3.2.3.2.2. Recall that the unrecoverable region consists of all the states from which it is absolutely impossible to recover, no matter what control is asserted. Thus, to ensure the safety of the system this region must be avoided. This may be accomplished by defining a third subset of states in the state-space, the buffer region. It is chosen to be all the states which are within a small distance, ε, of the edge of the unrecoverable region. This region is the shaded portion of the state-space shown in figure 3.2.6. The logic is straightforward. If the system's state enters this buffer region, or zone, then the reliable controller takes control of the system. The logic is essentially a worst case detection. The reliable
[Figure 3.2.6: Safety Buffer Zone. The shaded buffer region of width ε along the boundary of the unrecoverable region in the (velocity, position) state-space.]
controller only takes control of the system at the point where it is about to become unrecoverable. Beyond this, the performance of the complex controller is not considered. This logic allows the reliable controller to take control of the system when the complex controller has sent the system state near the edge of the recoverable region. Admittedly, it may be that the complex controller will not drive the system state into an unrecoverable state, but that the system will simply approach the edge of the recoverable region during a portion of control. Under these circumstances, the reliable controller would take over from the complex controller unnecessarily (in terms of safety considerations only). Unnecessary and incorrect takeovers are false alarms. To minimize the number of false alarms, the buffer zone should be of narrow width (ε small).
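Using the recoverability test sketched after section 3.2.3.2.2, the safety-based takeover can be realized by shrinking the recoverable region by ε and switching whenever the state leaves the shrunken set. This is one simple way to obtain a buffer zone with the same shape as the unrecoverable region; the value of ε is illustrative.

def safety_takeover(x, v, eps=0.02, g=9.81, theta_max=0.2, x_max=0.39):
    """True if the state has entered the buffer zone around the unrecoverable
    region, approximated here by testing recoverability for a beam that is
    effectively shorter by eps (illustrative parameters)."""
    return not is_recoverable(x, v, g=g, theta_max=theta_max, x_max=x_max - eps)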
3.2.3.3
Results
Figure 3.2.7 shows the results of an experiment where the operator played the role of a high-performance controller. The ball position is plotted as a function of time. The task of controlling the system is quite challenging for a human controller, who can be seen to behave more like faulty software than like a high-performance controller. In this experiment, the θ-controller was provided, so that the operator had control over the beam angle rather than the motor torque. It was found that controlling the system manually from the torque input was an intractable task. For the experiment, the detection logic was based on safety.

[Figure 3.2.7: Ball and Beam - Recovery at Low Speed. Ball position x (m) versus time t (sec); the point of switch is marked.]

[Figure 3.2.8: Ball and Beam - Recovery at High Speed. Ball position x (m) versus time t (sec).]

In figure 3.2.7, switch-over happened at t = 17 s, when the ball was approximately 35 cm from the center, to the left of the beam, very close to the edge (located at about 39 cm). The precise position where the controller takes over depends on the ball velocity. Figure 3.2.8 shows an experiment where the recovery is triggered far from the edge, because of the high speed of the ball. Despite the use of the maximum reference angle of the beam, the ball reaches a position very close to the edge during recovery.

In a separate set of experiments, the fuzzy logic controller was implemented as the high-performance controller. The large number of if-then rules made it easy to deliberately insert an error into one of the rules. The error was such that the assignment of the velocity to a membership function was done incorrectly. Such an error could be the result of a coding, design, or specification error. As is common with software failures, the error did not lead to failure immediately.

[Figure 3.2.9: Ball and Beam - Recovery from Fuzzy Logic Controller. Ball position x (m) versus time t (sec).]
[Figure 3.2.10: Ball and Beam - Recovery from Fuzzy Logic Controller. Ball position x (m) versus time t (sec).]
This is shown in figure 3.2.9. A second such experiment is shown in figure 3.2.10. By comparing the two figures, it is clear that the error manifests itself in an unpredictable manner. This is another common characteristic of software failures.
3.2.4
Conclusion
This chapter has introduced a new approach to the problem of software fault-tolerance. The method is based on the concept of analytic redundancy, in which two different software systems are combined, one with high performance but uncertain reliability and the other with high reliability but satisfying only minimal performance specifications. The two are combined using the Simplex architecture, and the result is a composite software system which exhibits the strengths of each. The Simplex architecture permits the high-performance software to run when it is performing properly but will switch to the highly reliable software when a failure occurs or the system is becoming unsafe. The Simplex
architecture also ensures the timing correctness of the system even if the high-performance software fails. These ideas were implemented and tested in a simple control system experiment. The results have been very promising. The system has been consistently able to maintain control even when software errors from a very wide range of fault classes were injected into the complex software. In spite of the great promise of the Simplex architecture for overcoming the current tradeoff between high performance in software and high reliability, many research problems remain. One of the major research challenges involves defining the performance measures to be used for switching between the simple and complex controllers. Up to this point, the focus has been on the safety of the system. Switching decisions based on simple performance checks have also been experimented with. More sophisticated failure detection logic which considers the performance of the system is needed. Moreover, methods must also be developed to differentiate between software failures and external disturbances. Switching from the high-performance to the reliable software is necessary in the former case, but may not be in the latter. A further research issue is the design of the reliable controller. The reliable controller design incorporates safety and recoverability considerations. Determining safety regions and optimal control for the purpose of recovery are not straightforward tasks. It is also necessary to develop criteria for switching control from the simple controller back to the complex controller. This methodology will have to be formalized, and the coverage it provides for various fault classes will need to be determined. Finally, the Simplex software architecture needs to be further developed. Firstly, it needs to be enhanced to support distributed real-time fault-tolerant applications, by supporting changes in both hardware and software in spite of design or implementation errors in new components. Secondly, it should be able to support not only the evolution of the application software but also on-line changes in the architecture itself. Given the great promise of this method, these and related research questions will be actively investigated in the coming years.
3.2 References
[1] E. L. Andrews, "The Precarious Growth of the Software Empire," New York Times, July 14, 1991.
[2] K. J. Astrom and B. Wittenmark, Computer Controlled Systems, Prentice-Hall, Englewood Cliffs, NJ, 1984.
[3] A. Avizienis, "The N-Version Approach to Fault Tolerant Software," IEEE Trans. on Software Engineering, vol. 11, pp. 1491-1501, 1985.
[4] A. Avizienis and J. Kelly, "Fault Tolerance by Design Diversity: Concepts and Experiments," Computer, vol. 17, no. 8, pp. 67-80, 1984.
[5] M. Bodson, J. Lehoczky, R. Rajkumar, L. Sha, M. Smith and J. Stephan, "Software Fault-Tolerance for Control of Responsive Systems," Proc. of the Third International Workshop of Responsive Computer Systems, October 1993.
[6] M. Bodson, J. Lehoczky, R. Rajkumar, L. Sha, J. Stephan and M. Smith, "Control Reconfiguration in the Presence of Software Failures," to appear in the Proceedings of the IEEE Conference on Decision and Control, San Antonio, TX, 1993.
[7] E. Y. Chow and A. S. Willsky, "Analytical Redundancy and the Design of Robust Failure Detection Systems," IEEE Trans. on Automatic Control, vol. 29, no. 7, pp. 603-614, 1984.
[8] J. R. Dunham, "Experiments in Software Reliability: Life Critical Applications," IEEE Trans. on Software Engineering, vol. SE-12, no. 1, pp. 110-123, January 1986.
[9] R. L. Glass, "Persistent Software Errors," IEEE Trans. on Software Engineering, vol. 7, no. 2, pp. 162-168, 1981.
[10] J. Gray, "A Census of Tandem System Availability Between 1985 and 1990," IEEE Transactions on Reliability, vol. 39, no. 4, pp. 409-418, 1990.
[11] J. Hauser, S. Sastry and P. Kokotovic, "Nonlinear Control Via Approximate Input-Output Linearization: The Ball and Beam Example," IEEE Transactions on Automatic Control, vol. 37, no. 3, pp. 392-398, March 1992.
[12] J. Kelly and S. Murphy, "Achieving Dependability Throughout the Development Process: A Distributed Software Experiment," IEEE Trans. on Software Engineering, vol. 16, no. 2, pp. 153-165, February 1990.
[13] J. C. Knight and P. E. Ammann, "Design Fault Tolerance," Reliability Engineering and System Safety, vol. 32, pp. 25-49, 1991.
[14] E. Laukonen and S. Yurkovich, "A Ball and Beam Testbed for Fuzzy Identification and Control Design," Proc. of the 1993 American Control Conference, San Francisco, CA, June 1993.
[15] L. Lee, The Day the Phones Stopped, Donald I. Fine, New York, 1991.
[16] F. Levendel, "Defects and Reliability Analysis of Large Software Systems," 19th Symposium on Fault Tolerant Computing, pp. 238-244, 1989.
[17] J. D. Musa, "A Theory of Software Reliability and its Application," IEEE Trans. on Software Engineering, vol. SE-1, no. 3, pp. 312-327, September 1975.
[18] S. L. Pfleeger, "Measuring Software Reliability," IEEE Spectrum, pp. 56-60, August 1992.
[19] B. Randell, "System Structure for Software Fault Tolerance," IEEE Trans. on Software Engineering, vol. 1, pp. 220-232, 1975.
[20] L. Sha, J. Lehoczky, and M. Bodson, "The Simplex Architecture: Analytic Redundancy for Software Fault Tolerance," Proc. of the First International Workshop of Responsive Computer Systems, Nice, France, 1991.
[21] L. Sha, J. Lehoczky, M. Bodson, P. Krupp and C. Nowacki, "Position Paper: Responsive Airborne Radar Systems," Proc. of the Second International Workshop of Responsive Computer Systems, October 1992.
[22] M. Sobhani, B. Neisius, S. Jayasuriya, E. Rumler and M. Rabins, "Some New Insights On the Classical Beam and Ball Balancing Experiment," Proc. of the American Control Conference, pp. 450-454, 1992.
[23] G. Watson, "Three Little Bits Breed a Big, Bad Bug," IEEE Spectrum, p. 52, May 1992.
[24] S. Woodfield, "An Experiment on Unit Increase in Problem Complexity," IEEE Trans. on Software Engineering, vol. SE-5, no. 2, pp. 76-79, March 1979.
INDEX

A
Ada 8, 161
Advanced Automation System (AAS) 39
algorithm-based fault tolerance 81
analytic redundancy 188
anytime computation 158
assertion 42
atomic guarded statements 58
atomic multicast 71
atomic objects 63

C
check 84
checkpointing 165
checksum 83
control systems 157, 185

D
deadline 4, 157
dependence graph 119
diagnosis 174
diagnosis latency 6
diagnosis, error 93, 98
diagnosis, fault 107, 112
diagnosis, system-level 6
distributed system 4, 195

E
efficiency 130
error characteristic 162
error detectability 85
error pattern 84, 133

F
fail-stop 20, 63, 129
failure detector service 41
failure semantics 20
false alarm 206
fault pattern 84
fault-invariant 43
feedback control 198
fuzzy logic 204

G
group communication 74

I
imprecise computation 159
interprocess communication 9, 40
invariant 42

L
Linda 57

M
monotone tasks 161
multiple versions 162, 186

N
N-version programming 186, 196

O
operation, backup 64
overload, transient 12

P
parallel algorithm 84, 125
parallel program 57
PID control law 202
postcondition 42
PRAM 128
PRAM, CRCW 133
precondition 42
primary 40, 74
process, backup 40
process, monitor 61
process, resilient 40
proof outline 42
protocol, group membership 4, 74
protocol, hand-off 40
protocol, priority ceiling 15

R
randomized algorithm 91, 138, 150
rate monotonic analysis 10
Real-Time Communication Network 8
real-time systems 4, 157, 189
recoverable region 191, 200
recovery 175
recovery block 186, 196
refinement 41
replication 64, 172
restart 129
robust algorithm 131

S
schedulable 8
scheduling 10, 167, 172
sieve method 160
software failures 184
sporadic service 12
survivability 132
syndrome 85

T
testing assignment 5
threads 8
throttle 18
tuple space 57
tuple space, stable 58

U
unit system 90
unstable systems 185

V, W
virtual machines 62
Write-all 133