Lecture Notes in Computer Science Edited by G. Goos, J. Hartmanis, and J. van Leeuwen
2434
Berlin Heidelberg New York Barcelona Hong Kong London Milan Paris Tokyo
Stuart Anderson Sandro Bologna Massimo Felici (Eds.)
Computer Safety, Reliability and Security 21st International Conference, SAFECOMP 2002 Catania, Italy, September 10-13, 2002 Proceedings
Series Editors Gerhard Goos, Karlsruhe University, Germany Juris Hartmanis, Cornell University, NY, USA Jan van Leeuwen, Utrecht University, The Netherlands Volume Editors Stuart Anderson Massimo Felici The University of Edinburgh, LFCS, Division of Informatics Mayfield Road, Edinburgh EH9 3JZ, United Kingdom E-mail: {soa, mas}@dcs.ed.ac.uk Sandro Bologna ENEA CR Casaccia Via Anguillarese, 301, 00060 Rome, Italy E-mail:
[email protected] Cataloging-in-Publication Data applied for Die Deutsche Bibliothek - CIP-Einheitsaufnahme Computer safety, reliability and security : 21st international conference ; proceedings / SAFECOMP 2002, Catania, Italy, September 10 - 13, 2002. Stuart Anderson ... (ed.). - Berlin ; Heidelberg ; New York ; Hong Kong ; London ; Milan ; Paris ; Tokyo : Springer, 2002 (Lecture notes in computer science ; Vol. 2434) ISBN 3-540-44157-3
CR Subject Classification (1998):D.1-4, E.4, C.3, F.3, K.6.5 ISSN 0302-9743 ISBN 3-540-44157-3 Springer-Verlag Berlin Heidelberg New York This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are liable for prosecution under the German Copyright Law. Springer-Verlag Berlin Heidelberg New York a member of BertelsmannSpringer Science+Business Media GmbH http://www.springer.de © Springer-Verlag Berlin Heidelberg 2002 Printed in Germany Typesetting: Camera-ready by author, data conversion by DA-TeX Gerd Blumenstein Printed on acid-free paper SPIN 10870130 06/3142 543210
Preface
Welcome to SAFECOMP 2002, held in Catania, Italy. Since its establishment SAFECOMP, the series of conferences on Computer Safety, Reliability, and Security, has contributed to the progress of the state of the art in dependable applications of computer systems. SAFECOMP provides ample opportunity to exchange insights and experiences in emerging methods and practical experience across the borders of different disciplines. Previous SAFECOMPs have already registered the need for multidisciplinarity in order better to understand dependability of computer-based systems in which human factors still remain a major criticality. SAFECOMP 2002 further addresses multidisciplinarity by collaborating and coordinating its annual activities with the Eleventh European Conference on Cognitive Ergonomics (ECCE-11). This year, SAFECOMP 2002 and ECCE-11 jointly organized an industry panel on Human-Computer System Dependability. The cross-fertilization among different scientific communities and industry supports the achievement of long-term results contributing to the integration of multidisciplinary experience in order to improve the design and deployment of dependable computer-based systems. SAFECOMP 2002 addressed the need to broaden the scope of disciplines contributing to dependability. The SAFECOMP 2002 program consisted of 27 refereed papers chosen from 69 submissions from all over the world. The review process was possible thanks to the valuable work of the International Program Committee and the external reviewers. SAFECOMP 2002 also included three invited keynote talks, which enhanced the technical and scientific merit of the conference. We would like to thank the International Program Committee, the organizing committee, the external reviewers, the keynote speakers, the panelists, and the authors for their work and support for SAFECOMP 2002. We would also like to thank the ECCE-11 people, who collaborated with us in organizing this week of events. We really enjoyed the work and we hope you appreciate the care that we put into organizing an enjoyable and fruitful conference. Finally, we will be glad to welcome you again to SAFECOMP 2003 in Edinburgh, Scotland.
July 2002
Sandro Bologna Stuart Anderson Massimo Felici
General Chair Sandro Bologna, I
Program Co-chairs Stuart Anderson, UK Massimo Felici, UK
EWICS TC7 Chair Udo Voges, D
International Program Committee Stuart Anderson, UK Liam J. Bannon, IRL Antonia Bertolino, I Helmut Bezecny, D Robin Bloomfield, UK Andrea Bondavalli, I Helmut Breitwieser, D Peter Daniel, UK Bas de Mol, NL Istvan Erenyi, HU Hans R. Fankhauser, S Massimo Felici, UK Robert Garnier, F Robert Genser, A Chris Goring, UK Janusz Gorski, PL Erwin Grosspietsch, D Michael Harrison, UK Maritta Heisel, D Erik Hollnagel, S Chris Johnson, UK
Mohamed Kaâniche, F Karama Kanoun, F Floor Koornneef, NL Vic Maggioli, US Patrizia Marti, I Odd Nordland, NO Alberto Pasquini, I Gerd Rabe, D Felix Redmill, UK Antonio Rizzo, I Francesca Saglietti, D Erwin Schoitsch, A Meine van der Meulen, NL Udo Voges, D Marc Wilikens, I Rune Winther, NO Stefan Wittmann, D Eric Wong, US Janusz Zalewski, US Zdzislaw Zurakowski, P
Organizing Committee Stuart Anderson, UK Antonia Bertolino, I Domenico Cantone, I Massimo Felici, UK Eda Marchetti, I
Alberto Pasquini, I Elvinia Riccobene, I Mark-Alexander Sujan, D Lorenzo Vita, I
External Reviewers Claudia Betous Almeida, F Iain Bate, UK Giampaolo Bella, I Stefano Bistarelli, I Linda Brodo, I L. H. J. Goossens, NL Bjørn Axel Gran, NO Fabrizio Grandoni, I Silvano Chiaradonna, I Andrea Coccoli, I Felicita Di Giandomenico, I Juliana Küster Filipe, UK
Marc-Olivier Killijian, F Frank Koob, D Martin Lange, UK Eckhard Liebscher, D Eda Marchetti, I Marc Mersiol, F Stefano Porcarelli, I Andrey A. Povyakalo, UK Thomas Santen, D Mark-Alexander Sujan, D Konstantinos Tourlas, UK
Scientific Sponsor
in collaboration with the Scientific Co-sponsors
AICA – Associazione Italiana per l'Informatica ed il Calcolo Automatico
ARCS – Austrian Research Centers Seibersdorf
Interdisciplinary Research Collaboration in Dependability of Computer-Based Systems
EACE – European Association of Cognitive Ergonomics
ENCRESS – European Network of Clubs for Reliability and Safety of Software
GI – Gesellschaft für Informatik
IFAC – International Federation of Automatic Control
IFIP – WG10.4 on Dependable Computing and Fault Tolerance
IFIP – WG13.5 on Human Error, Safety and System Development
ISA-EUNET
OCG – Austrian Computer Society
SCSC – Safety-Critical Systems Club
SRMC – Software Reliability & Metrics Club
SAFECOMP 2002 Organization
SAFECOMP 2002 Management Tool
List of Contributors
K. Androutsopoulos Department of Computer Science King’s College London Strand, London WC2R 2LS United Kingdom
Sandro Bologna ENEA CR Casaccia Via Anguillarese, 301 00060 - Roma Italy
Christopher Bartlett BAE SYSTEMS United Kingdom
R.W. Born MBDA UK Ltd. Filton, Bristol, United Kingdom
Iain Bate Department of Computer Science University of York York YO10 5DD United Kingdom M. Benerecetti Dept. of Physics University of Naples "Federico II" Napoli Italy Helmut Bezecny Dow Germany Peter G. Bishop Adelard and Centre for Software Reliability, City University Northampton Square London EC1V 0HB United Kingdom Robin Bloomfield Adelard and Centre for Software Reliability, City University Northampton Square London EC1V 0HB United Kingdom A. Bobbio DISTA Università del Piemonte Orientale 15100 - Alessandria Italy
Jan Bredereke Universität Bremen FB 3 · P.O. Box 330 440 D-28334 Bremen Germany José Carlos Campelo Departamento de Informática de Sistemas y Computadoras, Universidad Politécnica de Valencia, 46022 - Valencia Spain Luping Chen Safety Systems Research Centre Department of Computer Science University of Bristol Bristol, BS8 1UB United Kingdom E. Ciancamerla ENEA CR Casaccia Via Anguillarese, 301 00060 - Roma Italy D. Clark Department of Computer Science King's College London Strand, London WC2R 2LS United Kingdom
Tim Clement Adelard Drysdale Building Northampton Square London EC1V 0HB United Kingdom
Thomas Droste Institute of Computer Science, Dept. of Electrical Engineering and Information Sciences, Ruhr Univ. Bochum, 44801 Bochum Germany
Paulo Sérgio Cugnasca Escola Politécnica da Universidade de São Paulo, Dept of Computer Engineering and Digital Systems, CEP 05508-900 - São Paulo Brazil
G. Franceschinis DISTA Università del Piemonte Orientale 15100 - Alessandria Italy
Ferdinand J. Dafelmair TÜV Süddeutschland Westendstrasse 199 80686 München Germany Dino De Luca NOKIA Italia S.p.A. Stradale Vincenzo Lancia 57 95121 Catania Italy Ítalo Romani de Oliveira Escola Politécnica da Universidade de São Paulo, Dept of Computer Engineering and Digital Systems, CEP 05508-900 - São Paulo Brazil S. D. Dhodapkar Reactor Control Division Bhabha Atomic Research Centre Mumbai 400085 India Theo Dimitrakos CLRC Rutherford Appleton Laboratory (RAL) Oxfordshire United Kingdom
Rune Fredriksen Institute For Energy Technology P.O. Box 173 1751 Halden Norway R. Gaeta Dipartimento di Informatica Università di Torino 10150 - Torino Italy Bjørn Axel Gran Institute For Energy Technology P.O. Box 173 1751 Halden Norway M. Gribaudo Dip. di Informatica Università di Torino 10149 - Torino Italy Sofia Guerra Adelard Drysdale Building Northampton Square London EC1V 0HB United Kingdom Mark Hartswood School of Informatics University of Edinburgh United Kingdom
Denis Hatebur TÜViT GmbH System- und Softwarequalität Am Technologiepark 1, 45032 Essen Germany Klaus Heidtmann Department of Computer Science Hamburg University Vogt-Kölln-Str. 30 D-22527 Hamburg Germany Monika Heiner Brandenburgische Technische Universität Cottbus Institut für Informatik 03013 Cottbus Germany Maritta Heisel Institut für Praktische Informatik und Medieninformatik Technische Universität Ilmenau 98693 Ilmenau Germany Bernhard Hering Siemens I&S ITS IEC OS D-81359 München Germany Erik Hollnagel CSELAB, Department of Computer and Information Science University of Linköping Sweden A. Horváth Dept. of Telecommunications Univ. of Technology and Economics Budapest Hungary
Gordon Hughes Safety Systems Research Centre Department of Computer Science University of Bristol Bristol, BS8 1UB United Kingdom Jef Jacobs Philips Semiconductors, Bld WAY-1, Prof. Holstlaan 4 5656 AA Eindhoven The Netherlands Tim Kelly Department of Computer Science University of York York YO10 5DD United Kingdom Tai-Yun Kim Department of Computer Science & Engineering, Korea University Anam-dong Seungbuk-gu Seoul Korea John C. Knight Department of Computer Science University of Virginia, 151, Engineer's Way, P.O. Box 400740 Charlottesville, VA22904-4740 USA Monica Kristiansen Institute For Energy Technology P.O. Box 173 1751 Halden Norway Axel Lankenau Universität Bremen FB 3 · P.O. Box 330 440 D-28334 Bremen Germany K. Lano Department of Computer Science King's College London Strand, London WC2R 2LS United Kingdom
Bev Littlewood Centre for Software Reliability City University, Northampton Square, London EC1V 0HB United Kingdom
Yiannis Papadopoulos Department of Computer Science University of Hull Hull, HU6 7RX United Kingdom
John May Safety Systems Research Centre Department of Computer Science University of Bristol Bristol, BS8 1UB United Kingdom
Bernard Pavard GRIC – IRIT Paul Sabatier University Toulouse France
M. Minichino ENEA CR Casaccia Via Anguillarese, 301 00060 - Roma Italy Ali Moeini University of Tehran n. 286, Keshavarz Blvd 14166 – Tehran Iran MahdiReza Mohajerani University of Tehran n. 286, Keshavarz Blvd 14166 – Tehran Iran Tom Arthur Opperud Telenor Communications AS R&D Fornebu Norway Frank Ortmeier Lehrstuhl für Softwaretechnik und Programmiersprachen Universität Augsburg D-86135 Augsburg Germany M. Panti Istituto di Informatica University of Ancona Ancona Italy
S.E. Paynter MBDA UK Ltd. Filton, Bristol, United Kingdom Peter Popov Centre for Software Reliability City University Northampton Square, London United Kingdom L. Portinale DISTA Università del Piemonte Orientale 15100 - Alessandria Italy Rob Procter School of Informatics University of Edinburgh United Kingdom S. Ramesh Centre for Formal Design and Verification of Software IIT Bombay, Mumbai 400076 India Wolfgang Reif Lehrstuhl für Softwaretechnik und Programmiersprachen Universität Augsburg D-86135 Augsburg Germany
Yoon-Jung Rhee Department of Computer Science & Engineering, Korea University Anam-dong Seungbuk-gu Seoul Korea Francisco Rodríguez Departamento de Informática de Sistemas y Computadoras, Universidad Politécnica de Valencia, 46022 - Valencia Spain Thomas Rottke TÜViT GmbH System- und Softwarequalität Am Technologiepark 1, 45032 Essen Germany Mark Rouncefield Department of Computing University of Lancaster United Kingdom Job Rutgers Philips Design The Netherlands Titos Saridakis NOKIA Research Center PO Box 407 FIN-00045 Finland Gerhard Schellhorn Lehrstuhl für Softwaretechnik und Programmiersprachen Universität Augsburg D-86135 Augsburg Germany Juan José Serrano Departamento de Informática de Sistemas y Computadoras, Universidad Politécnica de Valencia, 46022 - Valencia Spain
Andrea Servida European Commission DG Information Society C-4 B1049 Brussels Belgium Babita Sharma Reactor Control Division Bhabha Atomic Research Centre Mumbai 400085 India Roger Slack School of Informatics University of Edinburgh United Kingdom L. Spalazzi Istituto di Informatica University of Ancona Ancona Italy Ketil Stølen Sintef Telecom and Informatics, Oslo Norway S. Tacconi Istituto di Informatica University of Ancona Ancona Italy Andreas Thums Lehrstuhl für Softwaretechnik und Programmiersprachen Universität Augsburg D-86135 Augsburg Germany Helmut Trappschuh Siemens I&S ITS IEC OS D-81359 München Germany
Jos Trienekens Frits Philips Institute Eindhoven University of Technology Den Dolech 2 5600 MB Eindhoven The Netherlands E. Tronci Dip. di Informatica Università di Roma "La Sapienza" 00198 - Roma Italy Alexander Voß School of Informatics University of Edinburgh United Kingdom
Robin Williams Research Centre for Social Sciences University of Edinburgh United Kingdom Wenhui Zhang Laboratory of Computer Science Institute of Software Chinese Academy of Sciences P.O.Box 8718, 100080 Beijing China
Table of Contents
Human-Computer System Dependability (Joint ECCE-11 & SAFECOMP 2002) Human-Computer System Dependability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 Panel moderators: Sandro Bologna and Erik Hollnagel Dependability of Joint Human-Computer Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 Erik Hollnagel
Keynote Talk Dependability in the Information Society: Getting Ready for the FP6 . . . . . . 10 Andrea Servida
Human Factors A Rigorous View of Mode Confusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 Jan Bredereke and Axel Lankenau Dependability as Ordinary Action . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 Alexander Voß, Roger Slack, Rob Procter, Robin Williams, Mark Hartswood, and Mark Rouncefield
Security Practical Solutions to Key Recovery Based on PKI in IP Security . . . . . . . . . . 44 Yoon-Jung Rhee and Tai-Yun Kim Redundant Data Acquisition in a Distributed Security Compound . . . . . . . . . .53 Thomas Droste Survivability Strategy for a Security Critical Process . . . . . . . . . . . . . . . . . . . . . . . 61 Ferdinand J. Dafelmair
Dependability Assessment (Poster Session) Statistical Comparison of Two Sum-of-Disjoint-Product Algorithms for Reliability and Safety Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 Klaus Heidtmann
Safety and Security Analysis of Object-Oriented Models . . . . . . . . . . . . . . . . . . . .82 Kevin Lano, David Clark, and Kelly Androutsopoulos The CORAS Framework for a Model-Based Risk Management Process . . . . . 94 Rune Fredriksen, Monica Kristiansen, Bjørn Axel Gran, Ketil Stølen, Tom Arthur Opperud, and Theo Dimitrakos
Keynote Talk Software Challenges in Aviation Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .106 John C. Knight
Application of Formal Methods (Poster Session) A Strategy for Improving the Efficiency of Procedure Verification . . . . . . . . . 113 Wenhui Zhang Verification of the SSL/TLS Protocol Using a Model Checkable Logic of Belief and Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126 Massimo Benerecetti, Maurizio Panti, Luca Spalazzi, and Simone Tacconi Reliability Assessment of Legacy Safety-Critical Systems Upgraded with Off-the-Shelf Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139 Peter Popov
Reliability Assessment Assessment of the Benefit of Redundant Systems . . . . . . . . . . . . . . . . . . . . . . . . . .151 Luping Chen, John May, and Gordon Hughes Estimating Residual Faults from Code Coverage . . . . . . . . . . . . . . . . . . . . . . . . . . 163 Peter G. Bishop
Design for Dependability Towards a Metrics Based Verification and Validation Maturity Model . . . . . 175 Jef Jacobs and Jos Trienekens Analysing the Safety of a Software Development Process . . . . . . . . . . . . . . . . . . 186 Stephen E. Paynter and Bob W. Born Software Criticality Analysis of COTS/SOUP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198 Peter Bishop, Robin Bloomfield, Tim Clement, and Sofia Guerra
Safety Assessment Methods of Increasing Modelling Power for Safety Analysis, Applied to a Turbine Digital Control System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .212 Andrea Bobbio, Ester Ciancamerla, Giuliana Franceschinis, Rossano Gaeta, Michele Minichino, and Luigi Portinale Checking Safe Trajectories of Aircraft Using Hybrid Automata . . . . . . . . . . . . 224 Ítalo Romani de Oliveira and Paulo Sérgio Cugnasca Model-Based On-Line Monitoring Using a State Sensitive Fault Propagation Model . . . . . . . . . . . . . . . . . . . . . . . . 236 Yiannis Papadopoulos
Keynote Talk On Diversity, and the Elusiveness of Independence . . . . . . . . . . . . . . . . . . . . . . . . 249 Bev Littlewood
Design for Dependability (Poster Session) An Approach to a New Network Security Architecture for Academic Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 252 MahdiReza Mohajerani and Ali Moeini A Watchdog Processor Architecture with Minimal Performance Overhead . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 261 Francisco Rodr´ıguez, Jos´e Carlos Campelo, and Juan Jos´e Serrano
Application of Formal Methods Model-Checking Based on Fluid Petri Nets for the Temperature Control System of the ICARO Co-generative Plant . . .273 M. Gribaudo, A. Horváth, A. Bobbio, E. Tronci, E. Ciancamerla, and M. Minichino Assertion Checking Environment (ACE) for Formal Verification of C Programs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 284 B. Sharma, S. D. Dhodapkar, and S. Ramesh Safety Analysis of the Height Control System for the Elbtunnel . . . . . . . . . . . 296 Frank Ortmeier, Gerhard Schellhorn, Andreas Thums, Wolfgang Reif, Bernhard Hering, and Helmut Trappschuh
Design for Dependability Dependability and Configurability: Partners or Competitors in Pervasive Computing? . . . . . . . . . . . . . . . . . . . . . . . . 309 Titos Saridakis Architectural Considerations in the Certification of Modular Systems . . . . . 321 Iain Bate and Tim Kelly A Problem-Oriented Approach to Common Criteria Certification . . . . . . . . . .334 Thomas Rottke, Denis Hatebur, Maritta Heisel, and Monika Heiner Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .347
Human-Computer System Dependability Panel moderators: Sandro Bologna and Erik Hollnagel Panellists: Christopher Bartlett, Helmut Bezecny, Bjørn Axel Gran, Dino De Luca, Bernard Pavard and Job Rutgers
Abstract. The intention of this cross-conference session is to bring together specialists of cognitive technologies and ergonomics, with software developers, dependability specialists and, especially, end-users. The objective of the session is to provide a forum for the sharing of practical experiences and research results related to safety and dependability of human-computer systems.
1 Rationale
While a computing device from an isolated perspective is a technological artefact pure and simple, practice has taught us that humans always are involved at some time and in some way. Humans play a role as developers, designers, and programmers of systems. They may be end-users, or they may be the people who service, maintain, repair, and update the systems. Unless we restrict our perspective to a narrow slice of a system's life cycle, the central issue must be the dependability of human-computer systems. This issue cannot be reduced to a question of fallible humans affecting perfect computers. Indeed, we cannot consider the functioning of a computer without at the same time considering the functioning of humans. In that sense the issue is not one of either-or, but of both – of human-and-computer seen as a single system. The problem of dependability therefore cannot be reduced to the dependability of one part or the other, but must address the system as a whole. This requires a close collaboration of people with different kinds of expertise and a realisation that no single discipline can provide the whole answer. It is in that spirit that the cognitive ergonomics community and the safety & reliability community meet in this session – and hopefully will continue to meet afterwards.
2 Position Statements
Christopher Bartlett is Chief Technologist in the Capability Management Group, BAE Systems, UK. Abbreviated position statement: A major reason for retaining the man-in-the-loop is that we do not have totally integrated systems yet. Pilots still have a task to analyse what else is happening within and without the cockpit and it is this integration
attribute to which all their training is directed. We simply do not have sensors with sufficient resolution nor decision-making algorithms as powerful as the human brain. We may achieve such system integration eventually but the larger and more complex the system the more difficult it is to grasp the potential failure modes when we try to formalise it. Until then the best we can do is to recognise human fallibility alongside our ability to make innovative decisions and to make our systems robust enough to cope. Helmut Bezecny is "Global Process Automation Safety Technology Leader" with Dow Chemical in Stade, Germany. Abbreviated position statement: Human-computer issues are created by the diverse capabilities of both, if misapplied. Humans can prevent hazardous situations from turning into accidents by identifying and classifying safety relevant information that cannot be measured and engaging in activities that have not been planned but would help in the specific scenario. The strategy for human-computer systems should be to combine the best of both - let the computer do whatever can be automated and the human interact intelligently, creatively and with low stress. Bjørn Axel Gran is a member of the section on software verification and validation at the OECD Halden Reactor Project, Norway. Abbreviated position statement: In the operation of complex systems such as nuclear power plants, aircraft, etc., humans and advanced technology interact within a dynamic environment. The dependability of such systems requires that the interplay between humans and technology be addressed for every lifecycle phase of the system. This creates a need for research into formulating new methodologies that ensure a common and uniform platform for the experts from different fields. Dino De Luca is Chief Solution Engineer and Team Leader for MAG (Middleware & Applications Group) in the South Europe hub of Nokia. Abbreviated position statement: Although the possibility to pay by interacting with a mobile terminal opens new and interesting scenarios, it is very important that the customer and the service are able to fully trust each other, including all the communication services in-between. Firstly, mutual authentication, privacy and integrity of the communication must be provided for both parties. Secondly, in addition to the secured communications pipe, a higher level of assurance can be achieved by applying digital signatures to a business transaction. The proposed solution is to have the customer's mobile phone act as a Personal Trusted Device (PTD). Bernard Pavard is professor at the Cognitive Engineering Research Group (GRIC – IRIT) at the Paul Sabatier University, Toulouse, France. Abbreviated position statement: I want to consider how we can design dependability into complex work settings. The paradox of dependability is that the more complex the situation, the more it is necessary to involve human decisions, immersive and invisible technologies, and the less it is possible to carry out traditional safety &
reliability analysis. Designers need to tackle non-deterministic processes and answer questions such as: how can controlled (deterministic) processes and distributed non-controllable interactions be managed optimally to improve the robustness of a system, and how can the risks of such a system be assessed? Job Rutgers is a member of the Strategic Design Team at Philips Design in the Netherlands. Abbreviated position statement: Designers engage in the development process by means of scenarios, such as visual and textual descriptions of how future products and services will enable users to overcome certain problems or will address certain needs. Often, these scenarios are characterized by a simple narration but lack a realistic modelling of everyday life activities into 'real stories'. To achieve this, we need not only to integrate better the complex information collected by social scientists, but also to make use of a wider vocabulary of story telling than the 'happy ending' mainstream Hollywood offers. The detailed modelling of users' everyday life activities needs to result in 'real life' scenarios that incorporate a fuller and richer blend of people's behaviour.
Dependability of Joint Human-Computer Systems Erik Hollnagel CSELAB, Department of Computer and Information Science University of Linköping, Sweden
[email protected]
Abstract. Human-computer systems have traditionally been analysed and described in terms of their components – humans and computers – plus the interaction. Several contemporary schools of thought point out that decomposed descriptions are insufficient, and that a description on the level of the system as a whole is needed instead. The dependability of human-computer systems must therefore refer to the performance characteristics of the joint system, and specifically the variability of human performance, not just as stochastic variations but also as purposeful local optimisation.
1 Introduction
In any situation where humans use artefacts to accomplish something, the dependability of the artefact is essential. Quite simply, if we cannot rely or depend on the artefact, we cannot really use it. This goes for simple mechanical artefacts (a hammer, a bicycle), as well as complex technological artefacts (ranging from cars to large-scale industrial production systems), and sociotechnical artefacts (distribution systems, service organisations). In order for an artefact – or more generally, a system – to be dependable it is a necessary but not sufficient condition that the parts and subsystems are dependable. In addition, the interaction between the parts must also be dependable. In pure technological systems this does not constitute a serious problem, because the interaction between the parts usually is designed together with the parts. Take, for instance, an automobile engine. This consists of hundreds (or even thousands?) of parts, which are engineered to a high degree of precision. The same precision is achieved in specifying how the parts work together, either in a purely mechanical fashion or as in a modern car engine with extensive use of software. Compared to the engine in a Ford T, a modern car engine is a miracle of precision and dependability, as are indeed most other technological systems. Since the dependability of a system depends on the parts and their interaction, it makes sense that the design tries to cover both. While this can be achieved for technological systems which function independently of users, the situation is radically different for systems where humans and technology have to work together, i.e., where some of the parts are hardware – or hardware + software – and others are humans. I shall refer to the latter as joint systems, for instance joint human-computer systems or S. Anderson et al. (Eds.): SAFECOMP 2002, LNCS 2434, pp. 4-9, 2002. Springer-Verlag Berlin Heidelberg 2002
joint cognitive systems (Woods, 1986). Although an artefact may often be seen as representing technology pure and simple, practice has taught us that humans always are involved at some time and in some way. Humans play a role as developers, designers, and programmers of systems. They may be end-users, or they may be the people who service, maintain, repair, and update the systems. Since the dependability cannot be isolated to a point in time, but always must consider what went before (the system's history), it follows that the issue of system dependability always is an issue of human-system dependability as well. The crucial difference between pure and joint systems is that humans cannot be designed and engineered in the manner of technological components, and furthermore that the interaction cannot be specified in a rigorous manner. This is certainly not because of a lack of trying, as shown by the plethora of procedures and rules that are a part of many systems. There also appears to be a clear relation between the level of risk in a situation and the pressure to follow procedures strictly, where space missions and the military provide extreme examples. Less severe, but still with an appreciable risk, are complex public systems for commerce, communication and transportation, where aviation and ATM are good examples. At the other end of the scale are consumers trying to make use of various devices with more or less conspicuous computing capabilities, such as IT artefacts and everyday machines (household devices, kiosks, ATMs, ticket machines, etc.). Although the risks here may be less, the need for dependability is still very tangible, for instance for commercial reasons. It is therefore necessary seriously to consider the issue of dependability of joint human-computer systems at all levels and especially to pay attention to the dependability of the interaction between humans and machines.
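To make the earlier point about parts versus interaction concrete in quantitative terms, a minimal illustrative sketch may help (it is not part of the author's argument, and it rests on the standard assumption of independent, serially combined contributions – the very assumption the joint-systems view questions):

\[
R_{\mathrm{joint}} \approx R_{\mathrm{human}} \times R_{\mathrm{technology}} \times R_{\mathrm{interaction}}
\]

Even with highly dependable parts, say \(R_{\mathrm{human}} = R_{\mathrm{technology}} = 0.999\), an interaction term of only \(R_{\mathrm{interaction}} = 0.95\) caps the joint figure at roughly \(0.999 \times 0.999 \times 0.95 \approx 0.948\); and for joint systems the factors are not independent in the first place, so even this factorisation understates the problem.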
2 Human Reliability
One issue that quickly crops up in this endeavour is that of human reliability, and its not too distant cousin "human error". Much has been written on these topics, particularly in the aftermath of major accidents (nuclear, aviation, trains, trading, etc). In the 1970-80s it was comme il faut to consider humans as being unreliable and error prone by themselves, i.e., considered as systems in isolation. This led to several models of "human error" and human reliability, and to the concept of an inherent human error probability (Leplat & Rasmussen, 1987; Miller & Swain, 1987; Reason, 1990). From the 1990s and onwards this view has been replaced by the realisation that human performance and reliability depend on the context as much as on any psychological predispositions (Amalberti, 1996; Hollnagel, 1998). The context includes the working conditions, the current situation, resources and demands, the social climate, the organisational safety culture, etc. Altogether this has been expressed by introducing a distinction between sharp-end and blunt-end factors, where the former are factors at the local workplace and the latter are factors removed in space and time that create the overall conditions for work (Reason, 1997; Woods et al., 1994). The importance of the sharp-end, blunt-end framework of thinking is that system dependability – or rather, the lack thereof – is not seen as a localised phenomenon which can be explained in terms of specific or unique conditions, but rather as
something which is part and parcel of the system throughout its entire existence. Human-computer dependability therefore cannot be reduced to a question of human reliability and “human error”, but requires an appreciation of the overall conditions that govern how a system functions.
3 Automation
One of the leading approaches to automation has been to use it as a way of compensating for insufficient human performance. The insufficiency can be in terms of speed, precision, accuracy, endurance, or – last but not least – reliability and "error". Automation has been introduced either to compensate for or support inadequate human performance or outright to replace humans in specific functions. From a more formal point of view, automation can be seen as serving in one of the following roles (Hollnagel, 1999):
• Amplification, in the sense that the ability to perform a function is being improved. If done inappropriately, this may easily lead to a substitution of functions.
• Delegation, in which a function is transferred to the automation under the control of the user (Billings, 1991; Sheridan, 1992).
• Substitution or replacement. Here the person not only delegates the function to the automation but completely relinquishes control.
• Extension, where new functionality is being added. Pure cases of extension are, however, hard to find.
In relation to the dependability of joint human-computer systems, automation can be seen as an attempt to improve the interaction by replacing the human with technology, hence taking a specific slice of the interaction away from the human. Despite the often honourable intentions, the negative effects of automation across all industrial domains have, on the whole, been larger than the positive ones. Among the well-documented negative effects are that workload is not reduced but only shifted to other parts of the task, that "human errors" are displaced but not eliminated, that problems of safety and reliability remain and that the need for human involvement is not reduced, and that users are forced into a more passive role and therefore are less able to intervene when automation fails. The shortcomings of automation have been pointed out by many authors (e.g. Moray et al., 2000; Wiener & Curry, 1980) and have been elegantly formulated as the Ironies of Automation (Bainbridge, 1983). From a joint systems perspective, the main problem with automation is that it changes existing working practices. After a system has been in use for some time, a stable condition emerges as people learn how to accomplish their work with an acceptable level of efficiency and safety. When a change is made to a system – and automation certainly qualifies as one – the existing stable condition is disrupted. After a while a new stable condition emerges, but this may be so different from the previous one that the premises for the automation no longer exist. In other words, by making a change to the system, some of the rationale for the change may have become obsolete. Solutions that were introduced to increase
system dependability may therefore have unsuspected side-effects, and perhaps not even lead to any improvement at all.
4 Local Optimisation – Efficiency-Thoroughness Trade-Off
In designing complex human-machine systems – as in transportation, process control, and e-commerce – the aim is to achieve a high degree of efficiency and reliability so that we can depend on the system's functioning. In practice all human-machine systems are subject to demands that require a trade-off between efficiency and thoroughness on the part of the users, and a dependable joint system is one that has developed a safe trade-off. In other words, the people who are part of the system at either the sharp or the blunt end have learned which corners to cut and by how much. The trade-off of thoroughness for efficiency is, however, only viable as long as the conditions conform to the criteria implied by the trade-off. Sooner or later a situation will occur when this is not the case – if for no other reason than because the environment is made up of other joint systems that work according to the same principles. It will be impossible to make significant improvements to system dependability if we do not acknowledge the reality of system performance, i.e., that there must be an efficiency-thoroughness trade-off. The basis for design is inevitably a more or less well-defined set of assumptions about the people who are going to use the system – as operators, end-users, maintenance staff, etc. If, for the sake of the argument, we only consider the end-users, the design usually envisions some kind of exemplary end-user. By this I mean a user who has the required psychological and cognitive capacity, who is motivated to make use of the system, who is alert and attentive, who is "rational" in the sense that s/he responds the way the designer imagined, and who knows how the system works and is able to interpret outputs appropriately. While such end-users certainly may exist, the population of users will show a tremendous variety in every imaginable – and unimaginable – way (Marsden & Hollnagel, 1996). Apart from differences in knowledge and skills, there will also be differences in needs (reasons for using the system), in the context or working conditions, in demands, in resources, etc. If system design only considers the exemplary user, the result is unlikely to be a dependable joint system. In order for that to come about, it is necessary that designers (programmers, etc.) take into account how things may go wrong and devise ways in which such situations may be detected and mitigated. System designers and programmers are, however, subject to the same conditions as end-users, i.e. a diversity of needs and working conditions. They are therefore prone to use the same bag of tricks to bring about a local optimum, i.e., to trade off thoroughness for efficiency in order to achieve a result that they consider to be sufficiently safe and effective. This means that the dependability of a joint system must be considered from beginning to end of the system's life cycle. The formal methods that are used for requirement specification and programming are themselves artefacts being used by humans, and therefore subject to the same problems as the final systems. Thus, the diligence with which we scrutinise the dependability of the system-in-use should also be directed at the system as it is being built.
Fig. 1. User types and design diversity (figure labels: System designer; Requirements, tools, technology, …; Exemplary user; Artefact + interaction (HW+SW); Population of users; Diversity of needs; Diversity of contexts)
5 Conclusions
In this presentation I have tried to argue that the issue of joint human-computer dependability cannot be reduced to a question of fallible humans corrupting otherwise perfect computing artefacts. Indeed, we cannot consider the functioning of a computing system without at the same time considering the functioning of humans. In that sense the issue is not one of either-or, but of both – of human-and-computer seen as a single system. Although there is no easy way to solve the problems we see all around, one solution would be to build systems that are resilient in the sense that they are able to detect variations in overall performance and compensate for them at an early stage, either by a direct compensation-recovery capability or by defaulting to a safe state. To do this there is a need for better methods to analyse and assess the dependability of joint human-machine systems. The bottom line is that the problem of dependability cannot be reduced to the dependability of one part or the other, but must address the joint system. This requires a close collaboration of people with different kinds of expertise and a realisation that no single discipline can provide the whole answer. It is in that spirit that the cognitive ergonomics community and the safety & reliability community meet in this session – and hopefully will continue to meet afterwards.
References
1. Amalberti, R. (1996). La conduite des systèmes à risques. Paris: PUF.
2. Bainbridge, L. (1983). Ironies of automation. Automatica, 19(6), 775-779.
3. Billings, C. E. (1991). Human-centered aircraft automation: A concept and guidelines (NASA Technical Memorandum 103885). Moffett Field, CA: NASA Ames Research Center.
4. Hollnagel, E. (1998). Cognitive reliability and error analysis method – CREAM. Oxford: Elsevier Science.
5. Hollnagel, E. (1999). From function allocation to function congruence. In S. Dekker & E. Hollnagel (Eds.), Coping with computers in the cockpit. Aldershot, UK: Ashgate.
6. Leplat, J. & Rasmussen, J. (1987). Analysis of human errors in industrial incidents and accidents for improvement of work safety. In J. Rasmussen, K. Duncan & J. Leplat (Eds.), New technology and human error. London: Wiley.
7. Marsden, P. & Hollnagel, E. (1996). Human interaction with technology: The accidental user. Acta Psychologica, 91, 345-358.
8. Miller, D. P. & Swain, A. D. (1987). Human Error and Human Reliability. In G. Salvendy (Ed.), Handbook of Human Factors. New York: Wiley.
9. Moray, N., Inagaki, T. & Itoh, M. (2000). Adaptive automation, trust, and self-confidence in fault management of time-critical tasks. Journal of Experimental Psychology: Applied, 6(1), 44-58.
10. Reason, J. T. (1990). Human error. Cambridge, UK: Cambridge University Press.
11. Reason, J. T. (1997). Managing the risks of organizational accidents. Aldershot, UK: Ashgate.
12. Sheridan, T. B. (1992). Telerobotics, automation, and human supervisory control. Cambridge, MA: MIT Press.
13. Wiener, E. L. & Curry, R. E. (1980). Flight deck automation: Promises and problems. Ergonomics, 23(10), 995-1011.
14. Woods, D. D. (1986). Cognitive technologies: The design of joint human-machine systems. The AI Magazine, 6(4), 86-92.
15. Woods, D. D., Johannesen, L. J., Cook, R. I. & Sarter, N. B. (1994). Behind human error: Cognitive systems, computers and hindsight. Columbus, Ohio: CSERIAC.
Dependability in the Information Society: Getting Ready for the FP6 Andrea Servida European Commission, DG Information Society C-4 B1049 Brussels, Belgium
[email protected] http://deppy.jrc.it/
Abstract. The dependable behaviour of information infrastructures is critical to achieve trust & confidence in any meaningful realisations of the Information Society. The paper briefly discusses the aim and scope of the Dependability Initiative under the Information Society Technologies Programme and presents the activities that have recently been launched in this area to prepare the forthcoming 6th Framework Programme of the European Commission.
1 Introduction
The Information Society is increasingly dependent on largely distributed systems and infrastructures for life-critical and business-critical functions. The complexity of systems in the Information Society is rapidly increasing because of a number of factors like their size, unboundedness and interdependency, as well as the multiplicity of actors involved, the need to pursue more decentralised control and the growing sophistication in functionality. This trend, together with the increasing use of open information infrastructures for communications, freeware software and common application platforms, exposes our society to new vulnerabilities and threats that would need better understanding, assessment and control. The dependable and predictable behaviour of information infrastructures provides the basis for Trust & Confidence (T&C) in any meaningful realisations of the global Information Society and, in particular, in Electronic Commerce. However, the expectation and perception of T&C are dramatically changing under the pressure of new business, technological and societal drivers, among which are:
– the deregulation in telecommunications, which has led to the emergence of new players, actors and intermediaries inter-playing in new added value chains, multi-national consortiums, services and applications, but also to the blurring of sector and jurisdictional boundaries;
Disclaimer: The content of this paper is the sole responsibility of the author and in no way represents the view of the European Commission or its services
– the convergence of communications and media infrastructures together with the interoperability of systems and services, which has boosted the deployment of unbounded network computing and communication environments;
– the realisation of information as an asset, which has facilitated the transition of companies from a manufacturing-centred to an information/knowledge management centred model with quality met production at the lowest point of global cost;
– the globalisation of services, markets, reach-ability of consumers and companies with virtual integration of business processes;
– the emergence of new threats and vulnerabilities, which are mostly connected with the increased openness and reach-ability of the infrastructures;
– the realisation by a number of nations that 'information superiority' brings strategic gains;
– the increased sophistication and complexity of individual systems;
– the changes in the traditional chain of trust, which is affected by the blurring of geographic borders and boundaries.
The European Dependability Initiative, called in short DEPPY [1], is a major R&D initiative under the Information Society Technologies Programme [2] to develop technologies, systems and capability to tackle the emerging dependability challenges in the Information Society. The experience gained in DEPPY has shown that to attain these new challenging objectives there is a need to foster the integration of research efforts and resources coming from a number of areas such as security, fault tolerance, reliability, safety, survivability, but also network engineering, psychology, human factors, econometrics, etc. In the following we present how DEPPY has developed and discuss the new dependability challenges which could be tackled in the forthcoming 6th Framework Programme [3] of the European Commission (called in short FP6).
2 The European Dependability Initiative
DEPPY was launched in 1997/1998 as an initiative of the IST Programme with the primary objective of addressing dependability requirements in tightly connected systems and services, which are at the basis of the Information Society. The mission statement for DEPPY was: "to contribute towards raising and assuring trust and confidence in systems and services, by promoting dependability enabling technologies". This mission statement embraces two main goals, precisely: fostering the development of new dependability technologies, and making better use of the available dependability technologies.
2.1 The DEPPY Objectives
Five key objectives were identified as qualifying the success of DEPPY, precisely:
– fostering a dependability-aware culture, which would include promoting innovative approaches to dependability, disseminating industrial best practice and training to promote the ability to work in multi-disciplinary teams;
– providing a workable characterisation of affordable dependability, which would support the integration and layering of services, the assurance of quality of intangible assets and the certification of both new distributed architectures and massively deployed embedded systems;
– facilitating global interoperable trust frameworks, which would also consider mediation and negotiation along chains of trust, dependable business process integration and guidance on issues of liability that might arise from system failures in large-scale distributed and embedded settings;
– mastering heterogeneous technical environments, including the integration of COTS and legacy systems software into Internet based applications, rapid recovery strategies and mechanisms to preserve essential services and business continuity, systems composability, and dependability assurance and verification in dynamic environments;
– managing dependability and risk in largely distributed and open systems-of-systems environments, including dependability assurance and verification, united frameworks for modelling and validation, and flexible business driven models of dependability.
In the following, we will briefly discuss the main elements of the DEPPY research agenda as it developed through the years.
2.2 The DEPPY Research Agenda
The DEPPY research agenda was determined on a yearly basis, in line with the overall approach taken to define the Workprogramme for the IST Programme, in which DEPPY was present as a Cross-Programme Action [4]. In 1999, the research agenda for DEPPY focussed on dependability in services and technologies and, in particular, on: technologies, methods and tools to meet the emerging dependability requirements stemming from the ubiquity and volume of embedded and networked systems and services, and the global and complex nature of large-scale information and communication infrastructures; risk and incident management tools; as well as privacy enhancing technologies and self-monitoring, self-healing infrastructures and services.
Seven R&D projects were funded, covering technical areas like the intrusion tolerance paradigm in largely distributed systems, the dependable composition of systems-of-systems and advanced tools for embedded system design. In 2000, the technical focus was on promoting research and industrially oriented projects in areas like: large scale vulnerabilities in multi-jurisdictional and unbounded systems; information assurance; survivable systems relying on self organising and self-diagnostic capabilities; dependability of extensively deployed and tightly networked embedded systems; risk management of largely distributed and open systems-of-systems; and methods for workable characterisation of affordable dependability. Besides these technical objectives, we also tried to stimulate international collaboration, in particular with the US. Six projects were funded on areas like dependability benchmarks for COTS, security of global communication networks, methods and tools for assuring dependability and, last but not least, management and control systems for electrical supply and for telecommunications networks. The objectives set for the year 2001, which were logically built on the work of the previous years, were also closely related to the action on dependability of information infrastructures which was part of the "Secure networks and smart cards" objective of the eEurope 2002 Action Plan [3]. Such an action aimed to "stimulate public/private co-operation on dependability of information infrastructures (including the development of early warning systems) and improve co-operation amongst national 'computer emergency response teams'". In this respect, the technical objectives for 2001 focussed on developing:
– innovative and multidisciplinary approaches, methods and technologies to build and manage dependability properties of large-scale infrastructures composed of tightly networked embedded systems;
– methods and technologies to model and manage dependability and survivability of globally interdependent systems and highly interconnected critical infrastructures;
– technologies to measure, verify and monitor dependability properties and behaviours of large-scale unbounded systems.
Of the three projects funded, one, called the Dependability Development Support Initiative [6], contributes to raising awareness that making the information infrastructure dependable would mean protecting our industry wealth and investments in information and communication technologies as well as in other intangible assets.
3 The Future: Towards FP6
The experience gained with DEPPY shows that we are only just starting to understand the scope of the technological, economic and social implications and challenges connected with the increasing reliance of our economy and society on digital communication networks and systems. Such a reliance is developed through an unprecedented
scale of integration and interconnectedness of highly heterogeneous systems that are, individually and collectively, "emergent", that is, the result of the casual or intentional composition of smaller and more homogeneous components. These aspects are critical in the area of networked embedded systems and components, where the large volume of deployed networked devices brings to the surface novel and unique system challenges. Lastly, this scenario is made even more complex by the large variety of patterns of use, user profiles and deployment environments. In the following are some of the issues that we believe may characterise the context for future activities on dependability:
– In the area of open information infrastructures and unbounded networks there is a growing demand for "working and affordable dependability", which leads to the need to holistically address issues of safety, security, availability, survivability, etc. This could only be accomplished by both stimulating innovative multidisciplinary approaches and facilitating the convergence of diverse scientific cultures and technical communities.
– In the network security arena there is a clear shift from "resist to attack" to "survive and adapt". The target of "absolute security & zero risk" is unfeasible in domains where openness and interconnectivity are vital elements for successful operations. In this respect, the notion of an "adaptable environment" (which would have a level of "self awareness"), within which security performance, quality of services and risks should be managed, is becoming the key element of any operational strategy.
– There is no language to describe dependability of unbounded systems and infrastructures, nor are there global dependability standards. Hence, novel multidimensional models (which also cover behaviour, composition, physical elements, thermal properties, etc.) and approaches should be developed.
– In the area of survivability and dependability, the R&D often drives the Policy activity, but Policy must also drive R&D. There is a need to ensure dependability of critical infrastructures across Nations. In this respect, the meaning of "critical" varies because of trans-national dependencies. A common knowledge base for this purpose does not exist. Pooling R&D resources across nations can build such knowledge.
– We are just at the beginning of distributed computing and the pace of its change is dramatic. Very monolithic platforms would disappear, to be replaced by new computing platforms/fabric whose impact on dependability is to be ascertained. The next dependability challenge would be related to network bandwidth and latency.
– It is anticipated that both the global and the local (intimately related to emerging short-scale interaction/communication means and capability) dimensions and aspects of cyberspace deserve a fundamental paradigm shift in conceiving and realising a globally (including the time dimension) trustworthy and secure Information Society.
– Software is still the big problem. Achieving the automated (similarly to what is an automated banking process) production and evolution of software seems to be a good target, but we are still very far away from it. In the e-commerce environment software is becoming more and more a "utility" for which "scalability" is more important than "features".
– From a business perspective there is no difference between "intentional" (normally dealt with in the "security" context) and "unintentional" (normally dealt with in the safety context) disruptive events. From a business perspective there is no difference between a virus and a bug, or between a bomb and a quake.
– The human component is still very critical to the dependability of systems and organisations.
For the future, the overall goal of pursuing dependability and interdependencies in the Information Society would have to support innovative and multidisciplinary RTD to tackle the scale issues of dependability connected with new business and everyday-life application scenarios, such as (i) the increasing volatility and growing heterogeneity of products, applications, services, systems and processes in the digital environment, as well as (ii) the increasing interconnection and interdependency of the information and communication infrastructure with other vital services and systems for our society and our economy. This would lead to new areas for research on dependability aiming at building robust foundations for the Information Society through novel multidisciplinary and innovative system-model approaches, architectures and technologies to realise dependable, survivable and evolvable systems, platforms and information infrastructures, and at understanding, modelling and controlling the interdependencies among large-scale systems and infrastructures resulting from the pervasiveness and interconnectedness of information and communication technologies.
3.1 Towards FP6: The Roadmap Projects
In order to prepare the ground for research initiatives in FP6 [7], with particular attention to the new instruments of Integrated Projects (IP) and Networks of Excellence (NoE) [8], seven Roadmap projects on security and dependability have recently been launched with the goals: to identify the research challenges in the respective area, to assess Europe's competitive position and potential, and to derive strategic roadmaps for applied research driven by visionary scenarios; and to build constituencies and reach consensus by means of feedback loops with the stakeholders at all relevant levels. The projects address issues around securing infrastructures, securing mobile services, dependability, personal trusted devices, privacy and basic security technologies. Below is a short summary of the three Roadmap projects on dependability, precisely:
AMSD, which focuses on a global and holistic view of dependability; ACIP, which tackles the area of simulation and modelling for critical infrastructure protection; and WG-ALPINE, which looks at survivability and loss prevention aspects. These roadmaps would nicely complement and enrich the work of DDSI, which tackles the area of dependability from a policy support angle.
AMSD - IST-2001-37553: Accompanying Measure System Dependability This project addresses the need for a coherent major initiative in FP6 encompas sing a full range of dependability-related activities, e.g. RTD on the various aspects of dependability per se; (reliability, safety, security, survivability, etc.), education and training; and means for encouraging and enabling sector-specific IST RTD projects to use dependability best practice. It is aimed at initiating moves towards the creation of such an Initiative, via road- mapping and constituency and conse nsus building undertaken in co-operation with groups, working in various depen dability-related topic areas, who are already undertaking such activities for their domains. The results will be an overall dependability roadmap that considers d ependability in an adequately holistic way, and a detailed roadmap for dependable embedded systems. ACIP - IST-2001-37257: Analysis & Assessment for Critical Infrastructure Protection Developed societies have become increasingly dependent on ICT and services. Infrastructures such as IC, banking and finance, energy, transportation, and others are relying on ICT and are mutually dependent. The vulnerability of these infr astructures to attacks may result in unacceptable risks because of primary and cascading effects. The investigation of cascading and feedback effects in highly complex, networked systems requires massive support by computer-based tools. The aim of ACIP is to provide a roadmap for the development and application of modelling and simulation, gaming and further adequate methodologies for the fo llowing purposes: identification and evaluation of the state of the art of CIP; analysis of mutual dependencies of infrastructures and cascading effects; investigation of different scenarios in order to determine gaps, deficiencies, and robustness of CIS; identification of technological development and necessary protective measures for CIP. WG-ALPINE - IST-2001-38703 : Active Loss Prevention for ICT-enabled Enterprise Working Group The main objective of this project is the creation, operation and consolidation of an Active Loss Prevention Working Group to address the common ICT Security problems faced by users, achieve consensus on their solutions across multiple disciplines, and produce a favourable impact in the overall eBusiness market. The Working Group approaches the problems from an ICT user perspective, with spe-
special emphasis on the view of small/medium systems integrators (SMEs), while establishing liaisons with all players, including representatives from the key European professional communities that must collaborate to achieve a more effective approach to ICT security. These include legal, audit, insurance, accounting, commercial, government, standardisation bodies, technology vendors, and others.
DDSI - IST-2001-29202: Dependability Development Support Initiative
The goal of DDSI is to support the development of dependability policies across Europe. The overall aim of this project is to establish networks of interest, and to provide baseline data upon which a wide spectrum of policy-supporting activities can be undertaken both by European institutions and by public and private sector stakeholders across the EU and in partner nations. By convening workshops, bringing together key experts and stakeholders in critical infrastructure dependability, DDSI facilitates the emergence of a new culture of trans-national collaboration in this field, which is of global interest and global concern. In order to make rapid progress in the area, the outcomes of the workshops as well as the information gathered in order to prepare for the workshops will be actively disseminated to a wider, but still targeted, community of interest, including policy makers, business decision makers, researchers and other actors already actively contributing to this field today.
4 Conclusions
The construction of the Information Society and the fast-growing development of e-commerce are making our society and economy more and more dependent on computer-based information systems, electronic communication networks and information infrastructures that are becoming pervasive as well as an essential part of EU citizens' lives. Achieving the dependable behaviour of the Information Society means protecting our industry wealth and investments in IT as well as in other intangible assets. Furthermore, achieving the dependable behaviour of the infrastructure would mean ensuring flexible and co-operative management of the large-scale computing and networking resources and providing resources for effective prevention, detection, confinement and response to disruptions. The dependable behaviour of the information infrastructure depends, however, on the behaviour of a growing number of players, systems and networks, including the users and the user systems. The interdependency among critical infrastructures that are enabled and supported by the information infrastructure cannot be easily mastered by currently available technologies. The dependability approach, which privileges the understanding of the implication of our need to rely on systems and, consequently, the adoption of a risk management approach, appears to be instrumental in fostering a new culture of social and economic responsibility. However, more innovative and multidisciplinary research
on dependability is needed to make the Information Society more robust and resilient to technical vulnerabilities, failures and attacks.
References
1. DEPPY Forum http://deppy.jrc.it/
2. IST web site www.cordis.lu/ist
3. IST in FP6 http://www.cordis.lu/ist/fp6/fp6.htm
4. Cross Programme Action on dependability http://www.cordis.lu/ist/cpt/cpa4.htm
5. eEurope 2002 Action Plan http://europa.eu.int/information_society/eeurope/index_en.htm
6. DDSI web site http://www.ddsi.org/DDSI/index.htm
7. FP6 http://europa.eu.int/comm/research/fp6/index_en.html
8. FP6 Instruments http://europa.eu.int/comm/research/fp6/networks-ip.html
A Rigorous View of Mode Confusion
Jan Bredereke and Axel Lankenau
Universität Bremen, FB 3, P.O. box 330 440, D-28334 Bremen, Germany
{brederek,alone}@tzi.de, www.tzi.de/{~brederek,~alone}, Fax: +49-421-218-3054
Abstract. Mode confusion is recognised as a significant safety concern, not only in aviation psychology. The notion is used intuitively in the pertinent literature, but with surprisingly different meanings. We present a rigorous way of modelling the human and the machine in a shared-control system. This enables us to propose a precise definition of "mode" and "mode confusion". In our modelling approach, we extend the commonly used distinction between the machine and the user's mental model of it by explicitly separating these and their safety-relevant abstractions. Furthermore, we show that distinguishing three different interfaces during the design phase reduces the potential for mode confusion. A result is a new classification of mode confusions by cause, leading to a number of design recommendations for shared-control systems which help to avoid mode confusion problems. A further result is a foundation for detecting mode confusion problems by model checking.
1 Introduction and Motivation
Automation surprises are ubiquitous in today's highly engineered world. We are confronted with mode confusions in many everyday situations: When our cordless phone rings while it is located in its cradle, we establish the line by just lifting the handset — and inadvertently cut it when we press the "receiver button" as usual with the intention to start speaking. We get annoyed if we once again overwrite some text in the word processor because we had hit the "Ins" key before (and thereby left the insert mode!) without noticing. The American Federal Aviation Administration (FAA) considers mode confusion to be a significant safety concern in modern aircraft. So, it's all around — but what exactly is a mode, what defines a mode confusion situation and how can we detect and avoid automation surprises? As long as we have no rigorous definition, we should regard a mode confusion as one kind of automation surprise. It refers to a situation in which a technical system can behave differently from the user's expectation. Whereas mode confusions in typical human-computer interactions, such as the word processor example mentioned above, are "only" annoying, they become dangerous if we consider safety-critical systems.
Today, many safety-critical systems are so-called embedded shared-control systems. These are interdependently controlled by an automation component and a user. Examples are modern aircraft, automobiles, but also intelligent wheelchairs. We focus on such shared-control systems in this paper and call the entirety of technical components technical system and the human operator user. Note that we have to take a black-box stand, i. e. we can only work with the behaviour of the technical system observable at its interfaces: since we want to solve the user's problems, we have to take his or her point of view, which does not allow access to internal information of the system. As Rushby points out [1], in cognitive science it is generally agreed upon that humans use so-called mental models when they interact with (automated) technical systems. Since there are at least two completely different interpretations of the notion "mental model" in the pertinent literature, it is important to clarify that we refer to the one introduced by Norman [2]: A mental model represents the user's knowledge about a technical system; it consists of a naïve theory of the system's behaviour. According to Rushby [1], an explicit description of a mental model can be derived, e. g., in the form of a state machine representation, from training material, from user interviews, or by user observation. We briefly recapitulate the pertinent state of the art here. It remains surprisingly unclear what a mode as such is. While some relevant publications give no [3, 4] or only an implicit definition [5, 6] of the notions "mode" and "mode confusion", there are others that present an explicit informal definition [7, 8, 9, 10]. Doherty [11] presents a formal framework for interactive systems and also gives an informal definition of "mode error". Wright and colleagues give explicit but example-driven definitions of the notions "error of omission" and "error of commission" by using CSP to specify user tasks [12]. Interestingly, the way of modelling often seems to be influenced significantly by the tool that is meant to perform the final analysis. Degani and colleagues use State Charts to model separately the technical system and the user's mental model [13]. Then, they build the composition of both models and search for certain states (so-called "illegal" and "blocking" states) which indicate mode confusion potential. Butler et al. use the theorem prover PVS to examine the flight guidance system of a civil aircraft for mode confusion situations [4]. They do not consider the mental model of the pilot as an independent entity in their analysis. Leveson and her group specify the black-box behaviour of the system in the language SpecTRM-RL that is both well readable by humans and processible by computers [10, 14, 15]. In [10], they give a categorisation of different kinds of modes and a classification of mode confusion situations. Rushby and his colleagues employ the Murφ model-checking tool [16, 5, 3]. Technical system and mental model are coded together as a single set of so-called Murφ rules. Lüttgen and Carreño examine the three state-exploration tools Murφ, SMV, and Spin with respect to their suitability in the search for mode confusion potential [17]. Buth [9] and Lankenau [18] clearly separate the technical system and the user's mental model in their CSP specification of the well-known MD-88 "kill-the-capture" scenario and in a service-robotics example, respectively.
The support of this clear separation is one reason why Buth's comparison between the tool Murφ and the CSP tool FDR favours the latter [9, pages 209-211]. Almost all publications refer to aviation examples when examining a case study: an MD-88 [19, 7, 10, 16, 9], an Airbus A320 [3, 6], or a Boeing 737 [5]. Rushby proposes a procedure to develop automated systems which pays attention to the mode confusion problem [1]. The main part of his method is the integration and iteration of a model-checking based consistency check and the mental model reduction process introduced by [20, 3]. Hourizi and Johnson [6, 21] generally doubt that avoiding mode confusions alone helps to reduce the number of plane crashes caused by automation surprises. They claim that the underlying problem is not mode confusion but what they call a "knowledge gap", i. e. the user's insufficient perception prevents him or her from tracking the system's mode. As far as we are aware, there is no publication so far that defines "mode" and "mode confusion" rigorously. Therefore, our paper clarifies these notions. Section 2 introduces the domain of our case study, which later serves as a running example. Sections 3 and 4 present a suitable system modelling approach and clarify different world views, which enables us to present rigorous definitions in Sect. 5. Section 6 works out the value of such definitions, which comprises a foundation for the automated detection of mode confusion problems and a classification of mode confusion problems by cause, which in turn leads to recommendations for avoiding mode confusion problems. A summary and ideas for future work conclude the paper.
2 Case Study Wheelchair
Our case study has a service robotics background: we examine the cooperative obstacle avoidance behaviour of our wheelchair robot. The Bremen Autonomous Wheelchair “Rolland” is a shared-control service robot which realizes intelligent and safe transport for handicapped and elderly people [22, 23]. The vehicle is a commercial off-the-shelf power wheelchair. It has been equipped with a control PC, a ring of sonar sensors, and a laser range finder. Rolland is jointly controlled by its user and by the software. Depending on the active operation mode, either the user or the automation is in charge of driving the wheelchair.
3 Precise Modelling
Before we can discuss mode confusion problems, some remarks on modelling a technical system in general are necessary. The user of a running technical system has a strict black-box view. Since we want to solve the user’s problems, we must take the same black-box point of view. This statement appears to be obvious, but has far-reaching consequences for the notion of mode. The user has no way of observing the current internal state, or mode, of the technical system.
Nevertheless, it is possible to describe a technical system in an entirely black-box view. Our software engineering approach is inspired by the work of Parnas [24, 25], even though we start out with events instead of variables, as he does. We can observe (only) the environment of the technical system. When something relevant happens, we call this an event. When the technical system is the control unit of an automated wheelchair, then an event may be that the user pushes the joystick forward, that the wheelchair starts to move, or an event may as well be that the distance between the wheelchair and a wall ahead becomes smaller than the threshold of 70 cm. The technical system has been constructed according to some requirements document REQ. It contains the requirements on the technical system, which we call SYSREQ, and those on the system's environment NAT. However, if we deal with an existing system for which no (more) requirements specification is available, it might be necessary to "reverse engineer" it from the implementation. For the wheelchair, SYSREQ should state that the event of the wheelchair starting to move follows the event that the joystick is pushed forward. SYSREQ should also state what happens after the event of approaching a wall. Of course, the wheelchair should not crash into a wall in front of it, even if the joystick is pushed forward. We can describe this entirely in terms of observable events, by referring to the history of events until the current point of time. If the wheelchair has approached a wall, and if it has not yet moved back, it must not move forward further. For this description, no reference to an internal state is necessary. In order to implement the requirements SYSREQ on a technical system, one usually needs several assumptions about the environment of the technical system to be constructed. For example, physical laws guarantee that a wheelchair will not crash into a wall ahead unless it has approached it closer than 70 cm and has continued to move for a certain amount of time. We call the documentation of assumptions about the environment NAT. NAT must be true even before the technical system is constructed. It is the implementer's task to ensure that SYSREQ is true provided that NAT holds.
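For illustration only, the following minimal CSPM sketch shows how such a requirement can be stated purely over observable events, with the relevant part of the event history encoded in the process state. The channel names and the process structure are our own assumptions, not the authors' actual specification.

  channel joystick_forward, move_forward, move_back, wall_close

  -- Away from any wall: pushing the joystick may be followed by forward motion.
  SYSREQ = joystick_forward -> move_forward -> SYSREQ
           [] wall_close -> NEAR_WALL
           [] move_back -> SYSREQ

  -- After approaching a wall: joystick pushes remain observable, but forward
  -- motion is refused until the wheelchair has moved back.
  NEAR_WALL = joystick_forward -> NEAR_WALL
              [] move_back -> SYSREQ

An environment assumption NAT would be expressed, under the same caveat, as a further process over observable events and considered together with SYSREQ.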
4 Clarification of World Views
4.1 Where are the Boundaries?
The control software of a technical system cannot observe physical events directly. Instead, the technical system is designed such that sensor devices generate internal input events for the software, and the software’s output events are translated by actuator devices into physical events, again. Neither sensors nor actuators are perfectly precise and fast, therefore we have a distinct set of software events. Accordingly, the requirements on the technical system and the requirements on the software cannot be the same. For example, the wheelchair’s ultrasonic distance sensors for the different directions can be activated in turns only, resulting in a noticeable delay for detecting obstacles. We call the software
Fig. 1. System requirements SYSREQ vs. software requirements SOF (environment events → IN → software events → SOF → software events → OUT → environment events)
requirements SOF, the requirements on the input sensors IN and the requirements on the output actuators OUT. Figure 1 shows the relationships among them. An important consequence is that the software SOF must compensate for any imperfectness of the sensors and actuators so that the requirements SYSREQ are satisfied. When defining SOF, we definitely need to take care whether we refer to the boundary of SOF or of SYSREQ. This becomes even more important when we consider the user who cooperates with the technical system. He or she observes the same variables of the environment as the technical system does. But the user observes them through his/her own set of senses SENS. SENS has its own imperfections. For example, a human cannot see behind his/her back. Our automated wheelchair will perceive a wall behind it when moving backwards, but the user will probably not. Therefore, we need to distinguish what actually happens in reality (specified in REQ, i. e. the composition of SYSREQ and NAT) from the user's mental model MMOD of it. When making a statement about MMOD, we definitely need to take care whether we refer to the boundary of MMOD or of REQ. When we define the interfaces precisely, it turns out that there is an obvious potential for a de-synchronisation of the software's perception of reality with the user's perception of it. And when we analyse this phenomenon, it is important to distinguish between the three different interfaces: environment to machine (or to user), software to input/output devices, and mental to senses. As a result, we are able to establish a precise relation between reality as it is perceived by the user and his/her mental model of it. This relation will be the basis of our definition of mode confusion.
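To make the first of these interface distinctions concrete, here is a hedged CSPM sketch, referring back to Fig. 1, of how IN, SOF and OUT could be composed into the technical system as observed at its environment interface. All channel and process names are our own and purely illustrative.

  channel env_in, env_out   -- environment events
  channel sw_in, sw_out     -- software events

  IN  = env_in -> sw_in -> IN      -- sensors translate environment events into software input events
  SOF = sw_in -> sw_out -> SOF     -- the software reacts on software events only
  OUT = sw_out -> env_out -> OUT   -- actuators translate software output events back

  -- The components run in parallel, synchronising on the software events,
  -- which are then hidden because the user cannot observe them.
  SYSTEM = ((IN [| {| sw_in |} |] SOF) [| {| sw_out |} |] OUT) \ {| sw_in, sw_out |}
  -- The implementation obligation would then read: assert SYSREQ [F= SYSTEM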
4.2 Brief Introduction to Refinement
As will be explained later, we use a kind of specification/implementation relation in the following sections. Such relations can be modelled rigorously by the concept of refinement. There exist a number of formalisms to express refinement relations. We use CSP [26] as specification language and the refinement semantics proposed by Roscoe [27]. One reason is that there is good tool support for performing automated refinement checks of CSP specifications with the tool FDR [27]. This section shall clarify the terminology for readers who are not familiar with the concepts. In CSP, the behaviour of a process P is described by the set traces(P ) of the event sequences it can perform. Since we must pay attention to what can be
done as well as to what cannot be done, the traces model is not sufficient in our domain. We have to enhance it by so-called failures.
Definition 1 (Failure). A failure of a process P is a pair (s, X) of a trace s (s ∈ traces(P)) and a so-called refusal set X of events that may be blocked by P after the execution of s.
If an output event o is in the refusal set X of P, and if there also exists a continuation of s which performs o, then process P may decide internally and non-deterministically whether o will be performed or not.
Definition 2 (Failure Refinement). P refines S in the failures model, written S ⊑F P, iff traces(P) ⊆ traces(S) and also failures(P) ⊆ failures(S).
This means that P can neither accept an event nor refuse one unless S does; S can do at least every trace which P can do, and additionally P will refuse no more than S does. Failure refinement allows us to distinguish between external and internal choice in processes, i.e. whether there is non-determinism. As this aspect is relevant for our application area, we use failure refinement as the appropriate kind of refinement relation.
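As an aside for readers unfamiliar with the notation, the following tiny CSPM fragment (our own illustration with invented event names, not part of the authors' case study) shows the difference between external and internal choice and how FDR is asked to check a failure refinement:

  channel warn, stop

  -- A mental model with an external choice: both events stay available.
  MMOD_EXT = warn -> STOP [] stop -> STOP
  -- A mental model that anticipates an internal decision of the machine.
  MMOD_INT = warn -> STOP |~| stop -> STOP
  -- The machine resolves the choice internally and may thus refuse either event.
  SYS = warn -> STOP |~| stop -> STOP

  assert MMOD_EXT [F= SYS   -- fails: SYS may refuse an event that MMOD_EXT must offer
  assert MMOD_INT [F= SYS   -- holds: the internal choice is expected, so no surprise

In FDR, the left-hand process of [F= plays the role of the specification and the right-hand one that of the implementation.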
4.3 Relation between Reality and the Mental Model
Our approach is based on the motto "The user must not be surprised" as an important design goal for shared-control systems. This means that the perceived reality must not exhibit any behaviour which cannot occur according to the mental model. Additionally, the user must not be surprised because something expected does not happen. When the mental model prescribes some behaviour as necessary, reality must not refuse to perform it. These two aspects are described by the notion of failure refinement, as defined in the previous section. There cannot be any direct refinement relation between a description of reality and the mental model, since they are defined over different sets of events (i.e., environment/mental). We understand the user's senses SENS as a relation from environment events to mental events. SENS(REQ) is the user's perception of what happens in reality. The user is not surprised if SENS(REQ) is a failure refinement of MMOD. As a consequence, the user's perception of reality must be in an implementation/specification relationship to the mental model. Please note that an equality relation always implies a failure refinement relation, while the converse is not the case. If the user does not know how the system will behave with regard to some aspect, but knows that he/she does not know, then he/she will experience no surprise nevertheless. Such indifference can be expressed mathematically by a non-deterministic internal choice in the mental model.
4.4 Abstractions
When the user concentrates on safety, he/she performs an on-the-fly simplification of his/her mental model MMOD towards the safety-relevant part
MMODSAFE. This helps him/her to analyse the current problem with the limited mental capacity. Analogously, we perform a simplification of the requirements document REQ to the safety-relevant part of it, REQSAFE. REQSAFE can be either an explicit, separate chapter of REQ, or we can express it implicitly by specifying an abstraction function, i.e., by describing which aspects of REQ are safety-relevant. We abstract REQ for three reasons: MMODSAFE is defined over a set of abstracted mental events, and it can be compared to another description only if it is defined over the same abstracted set; we would like to establish the correctness of the safety-relevant part without having to investigate the correctness of everything; and our model-checking tool support demands that the descriptions are restricted to certain complexity limits. We express the abstraction functions mathematically in CSP by functions over processes. Mostly, such an abstraction function maps an entire set of events onto a single abstracted event. For example, it is irrelevant whether the wheelchair's speed is 81.5 or 82 cm/s when approaching an obstacle – all such events with a speed parameter greater than 80 cm/s will be abstracted to a single event with the speed parameter fast. Other transformations are hiding (or concealment [26]) and renaming. But the formalism also allows for arbitrary transformations of behaviours; a simple example being a certain event sequence pattern mapped onto a new abstract event. We use the abstraction functions AR for REQ and AM for MMOD, respectively. The relation SENS from the environment events to the mental events must be abstracted in an analogous way. It should have become clear by now that SENS needs to be essentially faithful, i.e., a bijection which does no more than some renaming of events. If SENS is "lossy", we are already bound to experience mode confusion problems. For our practical work, we therefore first make sure that SENS is such a bijection, and then merge it into REQ, even before we perform the actual abstraction step which enables the use of the model-checking tool. Figure 2 shows the relationships among the different descriptions.
Fig. 2. Relationships between the different refinement relations
In order that the user is not surprised with respect to safety, there must be a failure refinement relation on the abstract level between SENSSAFE(REQSAFE) and MMODSAFE, too.
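To make the idea of an abstraction function expressed as a CSP function over processes concrete, here is a hedged CSPM sketch of how AR could map all concrete speed events above 80 cm/s onto a single abstract event. The channel names and value ranges are our own assumptions, chosen only for illustration.

  datatype AbsSpeed = slow | fast
  channel speed : {0..200}      -- detailed events: speed in cm/s
  channel abs_speed : AbsSpeed  -- abstracted, safety-relevant events

  -- A_R maps every concrete speed event onto its safety-relevant abstraction
  -- by relational renaming; all speeds above 80 cm/s become the single event fast.
  A_R(P) = P [[ speed.v <- abs_speed.fast | v <- {81..200} ]]
             [[ speed.v <- abs_speed.slow | v <- {0..80} ]]

A_R(REQ) then ranges over the abstracted events only and can be compared directly against MMODSAFE.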
5 A Rigorous View of Mode and of Mode Confusion
We will now present our rigorous definitions of mode and mode confusion. We will then motivate and discuss our choices. In the following, let REQSAFE be a safety-relevant black-box requirements specification, let SENSSAFE be a relation between environment events and mental events representing the user's senses, and let MMODSAFE be a safety-relevant mental model of the behaviour of REQSAFE.
Definition 3 (Potential future behaviour). A potential future behaviour is a set of failures.
Definition 4 (Mode). A mode of SENSSAFE(REQSAFE) is a potential future behaviour. And, a mode of MMODSAFE is a potential future behaviour.
Definition 5 (Mode confusion). A mode confusion between SENSSAFE(REQSAFE) and MMODSAFE occurs if and only if SENSSAFE(REQSAFE) is not a failure refinement of MMODSAFE, i.e., iff MMODSAFE ⊑F SENSSAFE(REQSAFE) does not hold.
After the technical system T has moved through a history of events, it is in some "state". Since we have to take a black-box view, we can distinguish two "states" only if T may behave differently in the future. We describe the potential future behaviour by a set of failures, such that we state both what T can do and what T can refuse to do. This definition of "state" is rather different from the intuition in a white-box view, but necessarily so. Our next step to the notion of "mode" then is more conventional. We use the notion of "state", if at all, in the context of the non-abstracted descriptions. Two states of a wheelchair are different, for example, if the steerable wheels will be commanded to a steering angle of 30 degrees or 35 degrees, respectively, within the next second. These states are equivalent with regard to obstacle avoidance. Therefore, both states are mapped to the same abstracted behaviour by the safety-relevance abstraction function. We call such a distinct safety-relevant potential future behaviour a mode. Usually, many states of the non-abstracted description are mapped together to such a mode. On a formal level, both a state and a mode are a potential future behaviour. The difference between both is that there is some important safety-relevant distinction between any two modes, which need not be the case for two states. We can now go on to mode confusions. The perceived reality and the user's mental model of it are in different modes at a certain point of time if and only if the perceived reality and the mental model might behave differently in the future, with respect to some safety-relevant aspect. Only if no such situation can arise in any possible execution trace is there no mode confusion. This means that the user's safety-relevant mental model must be a specification of the perceived reality. Expressed the other way around, the perceived reality must be an implementation of the user's safety-relevant mental model. This specification/implementation relationship can be described rigorously by failure refinement. If we have precise descriptions of both safety-relevant behaviours, we can
rigorously check whether a mode confusion occurs. Since model-checking tool support exists, this check can even be automated. Please note that we referred to the reality, as described by REQ, which not only includes the system’s requirements SYSREQ but also the environment requirements NAT. This restricts the behaviour of SYSREQ by NAT: behaviour forbidden by physical laws is not relevant for mode confusions. Our mathematical description allows for some interesting analysis of consequences. It is known in the literature that implicit mode changes may be a cause of mode confusion. In our description, an implicit mode change appears as an “internal choice” of the system, also known as a (spontaneous) “τ transition”. The refinement relation dictates that any such internal choice must appear in the specification, too, which is the user’s mental model in our case. This is possible: if the user expects that the system chooses internally between different behaviours, he/she will not be surprised, at least in principle. The problem is that the user must keep in mind all potential behaviours resulting from such a choice. If there is no clarifying event for a long time, the space of potential behaviours may grow very large and impractical to handle in practice.
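To illustrate how Definition 5 can be turned into such an automated check, the following hedged CSPM sketch (again with invented events, and with SENS assumed to have already been merged into REQ as described in Sect. 4.4) lets FDR search for a mode confusion:

  datatype Speed = slow | fast
  channel approach_obstacle : Speed
  channel brake, keep_moving

  -- Perceived reality after abstraction: on a fast approach the wheelchair
  -- decides internally whether it brakes or keeps moving.
  REQ_SAFE = approach_obstacle?s ->
               (if s == fast then (brake -> REQ_SAFE |~| keep_moving -> REQ_SAFE)
                else keep_moving -> REQ_SAFE)

  -- Safety-relevant mental model: the user expects braking on every fast approach.
  MMOD_SAFE = approach_obstacle?s ->
                (if s == fast then brake -> MMOD_SAFE
                 else keep_moving -> MMOD_SAFE)

  -- Definition 5: a mode confusion exists iff this assertion fails.
  assert MMOD_SAFE [F= REQ_SAFE

Here FDR would report a counterexample, namely the trace in which the wheelchair keeps moving after a fast approach, i.e. a mode confusion; removing the internal choice from REQ_SAFE, or adding it to MMOD_SAFE, makes the assertion hold.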
6 Results
Our definitions form a foundation for detecting mode confusions by model-checking. This has opened new possibilities for a comprehensive analysis of mode confusion problems, which we currently explore in practice. Our clarification of world views in Sect. 4 enables us to classify mode confusion problems into three classes:
1. Mode confusion problems which arise from an incorrect observation of the technical system or its environment. Formally, this is the case when SENS(REQ) is not a failure refinement of MMOD, but where SENS(REQ) would be a failure refinement of MMOD, provided the user's senses SENS were a perfect mapping from environment events to mental events. The imperfections of SENS may have physical or psychological reasons: either the sense organs are not perfect, for example eyes which cannot see behind the back; or an event is sensed, but is not recognised consciously, for example because the user is distracted, or because the user is currently flooded with too many events. ("Heard, but not listened to.") Please note that our notion of mode confusion problem also comprises the "knowledge gap" discussed in the research critique by Hourizi and Johnson [6, 21] (see Sect. 1). In our work, it appears as a mode confusion problem arising from an incorrect observation due to psychological reasons.
2. Mode confusion problems which arise from incorrect knowledge of the human about the technical system or its environment. Formally, this is the case when SENS(REQ) is not a failure refinement of MMOD, and when a perfect SENS would make no difference.
3. Mode confusion problems which arise from the incorrect abstraction of the user's knowledge to the safety-relevant aspects of it. Formally, this means that SENS(REQ) is a failure refinement of MMOD, but SENSSAFE(REQSAFE) is not a failure refinement of MMODSAFE. Since the safety-relevant requirements abstraction function AR is correct by definition, the user's mental safety-relevance abstraction function AM must be wrong in this case (compare Figure 2 above).
In contrast to previous classifications of mode confusion problems, this classification is by cause and not phenomenological, as, e.g., the one by Leveson [10]. The above causes of mode confusion problems lead directly to some recommendations for avoiding them. In order to avoid an incorrect observation of the technical system and its environment, we must check whether the user can physically observe all safety-relevant environment events, and we must check whether the user's senses are sufficiently precise to ensure an accurate translation of these environment events to mental events. If this is not the case, then we must change the system requirements. We must add an environment event controlled by the machine and observed by the user which indicates the corresponding software input event. This measure has been recommended by others too, of course, but our rigorous view now indicates more clearly when it must be applied. Avoiding an incorrect observation also requires checking whether psychology ensures that observed safety-relevant environment events become conscious. Our approach points out clearly the necessity of this check. The check itself and any measures belong to the field of psychology, in which we are not experts.
Establishing correct knowledge of the user about the technical system and its environment can be achieved by documenting their requirements rigorously. This enables us to conceive user training material, such as a manual, which is complete with respect to functionality. This training material must not only be complete but also learnable. Complexity is an important learning obstacle. Therefore, the requirements of the technical system should allow as few non-deterministic internal choices as possible, since tracking all alternative outcomes is complex. This generalises and justifies the recommendation by others to eliminate "implicit mode changes" [10, 8]. Internal non-determinism may arise not only from the software, but also from the machine's sensor devices. If they are imprecise, the user cannot predict the software input events. We can eliminate both kinds of non-deterministic internal choice by the same measure as used against an incorrect physical observation: we add an environment event controlled by the machine which indicates the software's choice or the input device's choice, respectively. Ensuring a correct mental abstraction process is mainly a psychological question and mostly beyond the scope of this paper. Our work leads to the basic recommendation to either write an explicit, rigorous safety-relevance requirements document or to indicate the safety-relevant aspects clearly in the general requirements document. The latter is equivalent to making explicit the safety-relevance
abstraction function for the machine, AR. Either measure makes it easier to conceive training material which helps the user to concentrate on safety-relevant aspects.
7 Summary and Future Work
We present a rigorous way of modelling the user and the machine in a shared-control system. This enables us to propose precise definitions of "mode" and "mode confusion". In our modelling approach, we extend the commonly used distinction between the machine and the user's mental model of it by explicitly separating these and their safety-relevant abstractions. Furthermore, we show that distinguishing three different interfaces during the design phase reduces the potential for mode confusion. Our proposition that the user must not be surprised leads directly to the conclusion that the relationship between the mental model and the machine must be one of specification to implementation, in the mathematical sense of refinement. Mode confusions can occur if and only if this relation is not satisfied. A result of this insight is a new classification of mode confusions by cause, leading to a number of design recommendations for shared-control systems which help to avoid mode confusion problems. Since tools to model-check refinement relations exist, our approach supports the automated detection of remaining mode confusion problems. For illustration, we presented a case study on a wheelchair robot as a running example. A detailed version of the case study is discussed in [28]. Our work lends itself to extension in several directions. We currently work in our case study on exploiting the new potential for detecting mode confusion problems by model-checking. Furthermore, the recommendations for avoiding mode confusion problems can be tried out. Experts in psychology will be able to implement the non-technical ones of our rules by concrete measures. Finally, we see still more application domains beyond aviation and robotics.
References
[1] Rushby, J.: Modeling the human in human factors. In: Proc. of SAFECOMP 2001. Volume 2187 of LNCS, Springer (2001) 86–91
[2] Norman, D.: Some observations on mental models. In Gentner, D., Stevens, A., eds.: Mental Models. Lawrence Erlbaum Associates Inc., Hillsdale, NJ, USA (1983)
[3] Crow, J., Javaux, D., Rushby, J.: Models and mechanized methods that integrate human factors into automation design. In Abbott, K., Speyer, J. J., Boy, G., eds.: Proc. of the Int'l Conf. on Human-Computer Interaction in Aeronautics: HCI-Aero 2000, Toulouse, France (2000)
[4] Butler, R., Miller, S., Pott, J., Carreño, V.: A formal methods approach to the analysis of mode confusion. In: Proc. of the 17th Digital Avionics Systems Conf., Bellevue, Washington, USA (1998)
[5] Rushby, J.: Analyzing cockpit interfaces using formal methods. In Bowman, H., ed.: Proc. of FM-Elsewhere. Volume 43 of Electronic Notes in Theoretical Computer Science, Pisa, Italy, Elsevier (2000)
[6] Hourizi, R., Johnson, P.: Beyond mode error: Supporting strategic knowledge structures to enhance cockpit safety. In: Proc. of IHM-HCI 2001, Lille, France, Springer (2001)
[7] Sarter, N., Woods, D.: How in the world did we ever get into that mode? Mode error and awareness in supervisory control. Human Factors 37 (1995) 5–19
[8] Degani, A., Shafto, M., Kirlik, A.: Modes in human-machine systems: Constructs, representation and classification. Int'l Journal of Aviation Psychology 9 (1999) 125–138
[9] Buth, B.: Formal and Semi-Formal Methods for the Analysis of Industrial Control Systems – Habilitation Thesis. Univ. Bremen (2001)
[10] Leveson, N., Pinnel, L., Sandys, S., Koga, S., Reese, J.: Analyzing software specifications for mode confusion potential. In: Workshop on Human Error and System Development, Glasgow, UK (1997)
[11] Doherty, G.: A Pragmatic Approach to the Formal Specification of Interactive Systems. PhD thesis, University of York, Dept. of Computer Science (1998)
[12] Wright, P., Fields, B., Harrison, M.: Deriving human-error tolerance requirements from tasks. In: Proc. of the 1st Int'l Conf. on Requirements Engineering, Colorado, USA, IEEE (1994) 135–142
[13] Degani, A., Heymann, M.: Pilot-autopilot interaction: A formal perspective. In Abbott, K., Speyer, J. J., Boy, G., eds.: Proc. of the Int'l Conf. on Human-Computer Interaction in Aeronautics: HCI-Aero 2000, Toulouse, France (2000) 157–168
[14] Rodriguez, M., Zimmermann, M., Katahira, M., de Villepin, M., Ingram, B., Leveson, N.: Identifying mode confusion potential in software design. In: Proc. of the Int'l Conf. on Digital Aviation Systems, Philadelphia, PA, USA (2000)
[15] Zimmermann, M., Rodriguez, M., Ingram, B., Katahira, M., de Villepin, M., Leveson, N.: Making formal methods practical. In: Proc. of the Int'l Conf. on Digital Aviation Systems, Philadelphia, PA, USA (2000)
[16] Rushby, J., Crow, J., Palmer, E.: An automated method to detect potential mode confusions. In: Proc. of the 18th AIAA/IEEE Digital Avionics Systems Conf., St. Louis, Montana, USA (1999)
[17] Lüttgen, G., Carreño, V.: Analyzing mode confusion via model checking. In Dams, D., Gerth, R., Leue, S., Massink, M., eds.: SPIN '99. Volume 1680 of LNCS, Berlin Heidelberg, Springer (1999) 120–135
[18] Lankenau, A.: Avoiding mode confusion in service-robots. In Mokhtari, M., ed.: Integration of Assistive Technology in the Information Age, Proc. of the 7th Int'l Conf. on Rehabilitation Robotics, Evry, France, IOS Press (2001) 162–167
[19] Palmer, E.: "Oops, it didn't arm." – A case study of two automation surprises. In: Proc. of the 8th Int'l Symp. on Aviation Psychology (1995)
[20] Javaux, D.: Explaining Sarter & Woods' classical results. The cognitive complexity of pilot-autopilot interaction on the Boeing 737-EFIS. In: Proc. of HESSD '98 (1998) 62–77
[21] Hourizi, R., Johnson, P.: Unmasking mode errors: A new application of task knowledge principles to the knowledge gaps in cockpit design. In: Proc. of INTERACT 2001 – The 8th IFIP Conf. on Human Computer Interaction, Tokyo, Japan (2001)
[22] Röfer, T., Lankenau, A.: Architecture and applications of the Bremen Autonomous Wheelchair. Information Sciences 126 (2000) 1–20
[23] Lankenau, A., Röfer, T.: The Bremen Autonomous Wheelchair – a versatile and safe mobility assistant. IEEE Robotics and Automation Magazine, "Reinventing the Wheelchair" 7 (2001) 29–37
[24] Parnas, D. L., Madey, J.: Functional documents for computer systems. Science of Computer Programming 25 (1995) 41–61
[25] van Schouwen, A. J., Parnas, D. L., Madey, J.: Documentation of requirements for computer systems. In: IEEE Int'l Symp. on Requirements Engineering – RE'93, San Diego, California, USA, IEEE Comp. Soc. Press (1993) 198–207
[26] Hoare, C.: Communicating Sequential Processes. Prentice-Hall, Englewood Cliffs, New Jersey, USA (1985)
[27] Roscoe, A. W.: The Theory and Practice of Concurrency. Prentice-Hall (1997)
[28] Lankenau, A.: Bremen Autonomous Wheelchair "Rolland": Self-Localization and Shared-Control – Challenges in Mobile Service Robotics. PhD thesis, Universität Bremen, Dept. of Mathematics and Computer Science (2002) To appear
Dependability as Ordinary Action
Alexander Voß (1), Roger Slack (1), Rob Procter (1), Robin Williams (2), Mark Hartswood (1), and Mark Rouncefield (3)
(1) School of Informatics, University of Edinburgh, UK, {av,rslack,rnp,mjh}@cogsci.ed.ac.uk
(2) Research Centre for Social Sciences, University of Edinburgh, UK, [email protected]
(3) Department of Computing, University of Lancaster, UK, [email protected]
Abstract. This paper presents an ethnomethodologically informed study of the ways that more-or-less dependable systems are part of the everyday lifeworld of society members. Through case study material we explicate how dependability is a practical achievement and how it is constituted as a common sense notion. We show how attending to the logical grammar of dependability can clarify some issues and potential conceptual confusions around the term that occur between lay and ‘professional’ uses. The paper ends with a call to consider dependability in its everyday ordinary language context as well as more ‘professional’ uses of this term.
1 Introduction
In this paper we are concerned with the ways in which people experience dependability. We are interested in explicating in-vivo ethnographic accounts of living with systems that are more or less reliable and the practices that this being 'more or less dependable' occasions. The situated practical actions of living with systems (e.g. work-arounds and so on) are important to us in that they show how society members¹ experience dependability as a practical matter. Drawing on the work of the later Wittgenstein [13], we seek to explicate what dependability means in an ordinary language sense and to provide an analysis of the ways in which systems come to be seen as dependable, and the work involved in making them more or less dependable. Such an analysis of ordinary language uses of terms is not intended as a remedy or corrective to 'professional' uses, but to show how we might capitalise on lay uses of the term and thereby secure in part a role for ethnographic analysis of what, following Livingston [7], we call the 'lived world' of working with more or less dependable systems. We draw on a case study of practical IT use and development [11] that illustrates how dependability is realised in and as part of people's everyday ordinary activities.
¹ This points to the skills persons have, what they know and do competently in a particular setting. In this usage we also stress mundane, banal competence as opposed to professionalised conduct.
2 The Case Study
The case study organisation, EngineCo, produces mass-customised diesel engines from 11 to 190 kW. Production in its plant was designed to work along a strict production orthodoxy and large parts of it are automated. Since the plant was built in the early 1990s, significant changes have been made to keep up with changing customer demands and to keep the plant operational in a difficult economic environment. The organisation makes heavy use of a wide range of information technologies and, to a large extent, their operation depends on complex ensembles of these technologies. An ethnographic study of the working practices of control room workers has been conducted over the course of the last two years [10] as a predicate for participatory design activities. The ethnographic method is dedicated to observing in detail everyday working practices and seeks to explicate the numerous, situated ways in which those practices are actually achieved [4]. Interviews with staff were recorded, and notes made of activities observed and artifacts employed. The data also includes copious notes and transcriptions of talk of 'members' (i.e. regular participants in the work setting) as they went about their everyday work. Ethnography is attentive to the ways in which work actually 'gets done', the recognition of the tacit skills and cooperative activities through which work is accomplished as an everyday, practical activity and it aims to make these processes and practices 'visible'. As noted above, the production environment at EngineCo is shaped according to a particular just-in-time (JIT) production orthodoxy. Material is delivered to an external logistics provider that operates a high-shelf storage facility near the plant on EngineCo's behalf. Upon EngineCo's order, the logistics provider delivers parts to the plant. Consequently, the plant itself was not designed to store large numbers of parts, containing buffer spaces for only four hours of production. The layout of production is basically linear, with an engine picking up its component parts as it moves from one side of the plant to the other. The production of engines is divided into two main steps: the basic engine is produced on an assembly line while customer-specific configuration is done in stationary assembly workspaces. Central to production is the Assembly Control Host which controls all processes within the plant, interacting with local systems in the various functional units of the plant (e.g., assembly lines) as well as with the company's ERP system (SAP R3). The Assembly Control Host is custom-built rather than being part of the ERP system. It has been developed and is now operated and maintained by an external IT service provider which has personnel located in the plant. A basic precondition for production to work along the lines of the JIT regime is that all parts are available in time for production. This notion of buildability is the key concept in the production management orthodoxy at EngineCo. Located within the plant, an assembly planning department is responsible for the buildability of engines, assuring that all component parts as well as the various pieces of information needed (such as workers' instructions) are available before production starts. They are also responsible for scheduling production orders in time to meet the agreed delivery dates. Assembly planners create a schedule for
production taking into consideration their knowledge about the current status of the plant, upcoming events and the requirements of control room workers.
Doing Dependability: Normal Natural Troubles
Instances of undependability in this setting are quite frequent but are not normally catastrophic. Rather, they are ordinary, mundane events that occasion situated practical (as opposed to legal) inquiry and repair. This is in contrast to much of the extant literature which has focused on dependability issues as fatal issues, e.g. studies of such cases as the London Ambulance Service [1] or Therac-25 [6]. The study points to some of the worldly contingencies that control room workers routinely deal with as a part of their planning and scheduling work. More precisely we might say that all plans are contingent on what, following Suchman [9], we call 'situated actions'. Due to problems with the availability of certain parts, especially crankcases, and because of ever-increasing customer demands, the notion of buildability was renegotiated [10] in order not to let the plant fall idle. Today, there are 'green', 'orange', and 'red' engines in the plant that are, respectively: strictly buildable, waiting for a part known to be on its way, or waiting for something that is not available and doesn't have a delivery date. Control room workers effectively share responsibility with Assembly Planning for ensuring that engines are buildable, as is illustrated by the following extracts from the control room shiftbook:
As soon as crankcases for 4-cylinders are available, schedule order number 56678651 (very urgent for Company X).
Engines are red even when only loose material is missing.
The first example shows how control room workers effectively assign material to orders and how their decisions may be influenced by various contingencies. Choosing the order in which to schedule engines is a situated accomplishment rather than a straightforward priority based decision wherein the importance of the engine dictates its order. Control room workers need to attend to the way scheduling an engine might influence the ‘flow’ of other engines through the plant and take into consideration the workload a particular type of engine places on workers on the shop floor, i.e. they have to attend to the ‘working division of labour’ [8]. The second example refers to a problem with the IT systems which does not allow them to start production of engines which are missing loose material (e.g., manuals). Clearly, while a missing crankcase effectively prevents production of the engine, loose material is not needed until the engine is actually shipped to the customer (and perhaps not even then in very urgent cases). By redefining details of the working division of labour [8], EngineCo has effectively addressed a situation that was impossible to predict during the original planning of the plant. This is not to say that the notion of buildability has ceased to exist. Rather, the general notion as originally inscribed in working practices has, by appropriation, been localised to take into consideration the
‘worldly contingencies’ – situations which arise in and as a part of the everyday practical work of the plant and its members and which are not, for example, involved with setting up a new system or introducing new machinery or practices – of production in EngineCo’s plant. Where, previously, buildability was a verifiable property of an engine in relationship to the inventory, now buildability of ‘orange’ and ‘red’ engines is an informed prediction based on members’ knowledge about various kinds of socio-material circumstances. In our research we have found a series of expectable, ‘normal’ or ‘ordinary’ troubles whose solution is readily available to members in, and as a part of, their working practices. That is, such problems do not normally occasion recourse to anything other than the ‘usual solutions’. Usual solutions invoke what we call horizons of tractability. By this we mean that a problem of the usual kind contains within it the candidate (used-before-and-seen-to-work) solution to that problem. These problems and their solutions are normal and natural and putatively soluble in, and as a part of, everyday work. From the shiftbook: SMR [suspended monorail] trouble 14:15 to 16:30, engines not registered into SMR, took 25 engines off the line using emergency organisation. Info for Peter: part no. 04767534, box was empty upon delivery, so I booked 64 parts out of the inventory.
The emergency organisation involved picking up the engines by forklift truck and moving them to a location where they can be picked up by the autonomous carrier system. A number of locations have been made available for this purpose where forklift truck drivers can access the Assembly Control Host to update the location information for the engine they just reintroduced into the system. This is one of many examples where non-automated activity leads to a temporary discrepancy between the representation and the represented, which has to be compensated for. The second example illustrates the same point. Updating the inventory in response to various kinds of events is a regular activity in the control room and the fact that control room workers have acquired authority to effect such transactions is witness to the normality of this kind of problem compensation activity. Workers are also able to assess the potential impacts of seen-before problem situations and they take measures to avoid them:
From the shiftbook:
Carrier control system broken down 10:45–11:05 resulting in delayed transports, peak number of transports in the system = 110
If in the carrier control system you can't switch from the list of transport orders to the visualisation, don't reboot the PC if the number of transport orders is more than about 70.
In the first two lines of the above example, workers report on problems with the system that controls the autonomous carriers that supply material to the workstations in the plant. The recording of a breakdown in the shiftbook is a way
to make this incident accountable to fellow workers, including those working on another shift. The entry contains a number of statements which, on the surface, seem to be rather uninformative. However, they point to a number of normal, natural troubles that can result from this particular incident such as material being stored in places that are far from the workstations where it’s going to be needed. This will affect the length of transports for some time after the root problem has gone away. The result of this is that since transports take longer, more of them will queue up in the carrier control system. Such ‘ripple effects’ are quite common in this production context. In effect, because of the breakdown of the control system, the ‘transport situation’ might be considered problematic for quite a long time. The next extract can be read in this same kind of context as being part of the process of workers’ making sense of, and responding to the potential undependability of the carrier control system. It has become part of the local practice to avoid certain actions that might result in the breakdown of the carrier system if the ‘transport situation’ is regarded as problematic by control room workers: From a video recording of control room work: Pete: Hey, the carrier control is still not running properly. Let’s not run the optimisation, ok Steve? Steve: We didn’t run it this morning either, because we had 40 transports.
Other problems that are not susceptible to these remedies are also interesting to us in that they demand a solution – members cannot remain indifferent to their presence – but that solution is not a normal or usual one (by definition). In order to keep production running, members have to find and evaluate possible solutions quickly, taking into consideration the present situation, the resources presently available, as well as, ideally, any (possibly long-term and remote) consequences their activities might have: From fieldwork notes: A material storage tower went offline. Material could be moved out of the tower to the line but no messages to the Assembly Control Host were generated when boxes were emptied. Control room workers solved this problem by marking all material in the tower ‘faulty’ which resulted in new material being ordered from the logistics provider. This material was then supplied to the line using forklift trucks. [...] A material requirements planner called to ask why so many parts were suddenly ‘faulty’.
Such situated problem-solving results in work-arounds which are initially specific to the situation at hand but may become part of the repertoire of used-before-and-seen-to-work candidate solutions. They may be further generalised through processes of social learning [12] as members share them with colleagues or they might get factored into the larger socio-material assemblage that makes up the working environment. This process of problem solution and social learning, however, is critically dependent on members’ orientation to the larger context, their making the problem solution accountable to fellow members and their ability to
judge the consequences. The following fieldwork material illustrates how problem solutions can get factored into ongoing systems development as well as how they can adversely affect the success of the system: From an interview with one of the system developers responsible for the ongoing development of the Assembly Control Host: [Such a complex system] will always have flaws somewhere but if the user has to work with the system and there’s a problem he will find a work-around himself and the whole system works. [...] The whole thing works, of course, only if the user really wants to work with it. If he says: “Look, I have to move this box from here to there and it doesn’t work. Crap system! I’ll let a forklift do this, I will not use your bloody system” then all is lost. Then our location information is wrong cause the driver doesn’t always give the correct information; then it will never fly. [... If they come to us and say] that something’s not working, we will say “oh! we’ll quickly have to create a bug fix” and, for the moment, I’ll do this manually without the system, then it works, the system moves on, everything stays correct, the whole plant works and if the next day we can introduce a bug fix the whole thing moves on smoothly.
The plans that members come up with within this horizon of tractability do not usually work one way only – it is our experience that an unexpected problem can become a normal problem susceptible to the usual solutions in, and through, the skillful and planful conduct of members. That is to say, the boundaries between the types of problem are semi-permeable (at least). The order of the potentially problematic universe is not similarly problematic for all members: different members will view different problems in a variety of ways and, through the phenomenon of organisational memory [5], this may lead to the resolution of the problem in, and through, the ability to improvise or to recognize some kind of similarities inherent in this and a previous problem. It is important to note that problem detection and solving is ‘lived work’ [7] and that it is also situated. That is, it is not to be divorced from the plans and procedures through which it is undertaken and the machinery and interactions that both support and realise it. Working practices and the structure of the workplace afford various kinds of activities that allow members to check the proper progress of production and to detect and respond to troubles. These very ‘mundane’ (i.e., everyday) activities complement the planned-for, made-explicit and formalised measures such as testing. As in other collaborative work (see e.g., [2]), members are aware of, and orient to, the work of their colleagues. This is supported by the affordances of their socio-material working environment as the following example illustrates: From a video recording of control room work: Oil pipes are missing at the assembly line and Jim calls workers outside the control room to ask if they “have them lying around”. This is overheard by Mark who claims that: “Chris has them”. He subsequently calls Chris to confirm this: “Chris, did you take all the oil pipes that were at the line?” Having
confirmed that Chris has the oil pipes, he explains why he thought that Chris had them: “I have seen the boxes standing there”.
Here, the visibility of situations and events within the plant leads to Mark being aware of where the parts in question are. The problem that the location of the parts was not accurately recorded in the information system was immediately compensated for by his knowledge of the plant situation. Likewise, Jim’s knowledge of working practices leads him to call specific people who are likely to have the parts. Mark’s observation makes further telephone calls unnecessary (another example of the mutual monitoring that goes on in control rooms and similar facilities is to be found in [3]). Video recording continued: Now that the whereabouts of the oil pipes has been established, the question remains why Chris has them. Mark explains that this was related to conversion work Chris is involved in at the moment. This leads Jim to ask if there are enough parts in stock to deal with the conversion work as well as other production orders. Mark explains how the inventory matches the need.
Having solved the problem of locating the parts, there is the question of how the problem emerged and what further problems may lie ahead. It is not immediately obvious that Chris should have the parts but Mark knows that Chris is involved in some conversion work resulting from a previous problem. Again, awareness of what is happening within the plant is crucial as information about the conversion work is unlikely to be captured in information systems as the work Chris is carrying out is not part of the normal operation of the plant. Rather, it is improvised work done to deal with a previous problem. Jim raises the question whether enough oil pipes are available to deal with the conversion work as well as normal production. Again, it is Mark who can fill in the required information and demonstrate to Jim how the parts in the inventory match the needs. As Jim comments in a similar situation: “What one of us doesn’t know, the other does.” Problem detection and solving is very much a collaborative activity depending on the situated and highly condensed exchange of information between members. By saying that Chris has taken the parts from the line, Mark also points to a set of possible reasons as members are well aware who Chris is, where he works and what his usual activities are. Video recording continued: Since it was first established that parts were missing, production has moved on and there is the question what to do with the engines that are missing oil pipes. Jim and Mark discuss if the material structure of the engine allows them to be assembled in ‘stationary assembly’.
Workers in the plant are aware of the material properties of the engines produced and are thus able to relate the material artefact presented to them to the process of its construction. In the example above, Mark and Jim discuss this relationship
in order to find out if the problem of missing oil pipes can be dealt with in stationary assembly, i.e., after the engines have left the assembly line. They have to attend to such issues as the proper order in which parts can be assembled. The knowledge of the material properties of engines also allows members to detect troubles, i.e., the product itself affords checking of its proper progress through production (cf. [2]). From a video recording of control room work: Jack has ‘found’ an engine that, according to the IT system, was delivered to the customer quite a while ago. It is, however, physically present in the engine buffer and Jack calls a colleague in quality control to find out the reason for this. “It’s a 4-cylinder F200, ‘conversion [customer]’ it says here, a very old engine. The engine is missing parts, screws are loose, ... if it’s not ready yet – I wanted to know what’s with this engine – it’s been sitting in the buffer for quite a while.”
Here, the physical appearance is an indication of the engine’s unusual ‘biography’. This, together with the fact that the engine has “been sitting in the buffer for quite a while”, makes the case interesting for Jack. These worldly contingencies are interesting for us since they invite consideration of the ‘seen but unnoticed’ aspects of work – that is, those aspects which pass the members by in, and as a part of, their everyday work but which, when there are problems or questions, are subject to inquiry (e.g., have you tried this or that? Did you do this or that? What were you doing when it happened?). The answer to such questions, especially to the latter, illustrates the seen-but-unnoticed character of work in that, when called upon to do so, members can provide such accounts, although they do not do so in the course of ordinary work.
3
Dependability as a Society Member’s Phenomenon
A central problem for us is the manner in which the term ‘dependability’ has been used in the ‘professional’ literature to be found at, for example, this conference. We argue that there is a need to complement this with a consideration of the ways in which dependability is realised as a practical matter by members and over time. This is not to say that we reject notions of dependability offered by this literature or that our comments here are incommensurable: the point is that we want to look at dependability and similar terms by doing an ethnography of what it means for a system to be reliable or dependable as a practical matter for society members engaged in using that system with just the resources and the knowledge they have. That is, we are interested in what it means to be dependable or reliable in context. In other words, while we are interested in the notions of dependability invoked in the ‘professional’ literature and these inform our discussions, we find that these definitions are dependability professionals’ objects as opposed to society members’ objects. We feel that we should consider how society members experience dependability in context, and that is what we
present here. Indeed it is our contention that such lay uses are important for understanding what we could mean by ‘dependability’. Our aim here, then, is to bring forward the lay uses, the practical procedures and knowledges that are invoked in working with more or less dependable systems and to consider this alongside ‘professional’ uses of terms and the metrics that realise them. Think, for instance, of the phrase “Tom is very dependable” – we would not want to say qua lay member that we would want to have some kind of metric testing, for example, how many times Tom had arrived at the cinema at the time specified. We would say that such metrics are not something with which members have to deal – it would be unusual to state apropos of Tom’s dependability: “over the last month Tom has turned up on time in n% of cases, where ‘on time’ means +/– 10 minutes”. Such metrics treat dependability in a sense that we might consider outwith the bounds of everyday language. They are somewhat problematic for the purposes of needing to know if Tom will arrive before we go into the cinema. This is not to suggest that ‘professional’ metrics and definitions of dependability have no value but that their use in everyday language is limited. Our aim here, then, is to focus on what it means to live with more or less dependable systems and to do so in the natural attitude through ordinary language and situated actions such as repair. As for humans, for machines the notion of being dependable is an accountable matter – by this we point to a phenomenon of which, if asked, members could give an account; in the example given above, Tom might report that he is late because his bus was late, this being an account of his lateness, and we might say the same of machines, that this or that job could not be done because the machine broke down. It is also a matter that might well occasion work arounds or other situated actions which we should consider when examining what we call the ‘logical grammar’ of dependability. By this we mean the ways that the concept might be used. Consider “this machine is totally undependable, let’s buy it” – such a use cannot be said to be acceptable except as an ironic utterance. Uses such as “you cannot always depend on the machine but we usually find ways of making it work” point us to the ways people treat notions such as dependability not simply as common understandings but as common understandings intimately related to practical actions. Our study of control room work shows that in a strict sense the system is not 100% reliable but in showing how members make it work, we aim to provide a complement to such metrics and to show the work of making systems dependable. This is also our reason for recommending that one do an ethnography since it is only by so doing that one might see the work arounds in action and come to know just how the system is unreliable or cannot be depended on. As practical matters, dependability is important for members; yet the senses in which members treat such terms seem to be of little consequence for those who write on dependability. We want to argue that if one does examine natural language uses of these terms (and others) the benefit will be in a fuller appreciation of what it means to work with (or around) such systems. Consideration of technology in its (social) context illuminates the practical actions that make technologies workable-with and which realise horizons of dependability as society members’ objects. Such an exercise might appear as if it is trivial – playing with words – but we find value in it in that it shifts attention to how people cope with technology and away from metrics and measures that find their use in the technical realm but which have little value on the shop floor. It also shows us something of the ‘missing what’ of making technologies reliable or dependable, the practical actions that occur, the work arounds, the procedures adopted and so on – the majority of which we would not expect to find in the literature. In other words, we want to present a consideration of the ways in which dependability is ad hoced into being. It is only by doing the ethnography that such features might be found. We might be seen as providing an outsider’s comment on something that has been professionalised and fine-tuned, yet we would argue that such issues are of merit to professionals and that they should be examined in considering what we mean by ‘dependability’ and that maybe the ethnography together with the consideration of such terms in natural language will, in Wittgenstein’s terms [13], be ‘therapeutic’. Therapeutic in the sense that it opens up some elbow room in which to do the kinds of ethnographic work that illustrates how knowledge is deployed within the working division of labour and how members in settings such as EngineCo treat knowledge as a practical resource for making more or less dependable systems work. This directs our attention to knowledge in and as part of practical action and, we would argue, forms a complement to the work currently being undertaken in the area of dependability.
4
Conclusions
It is important to us not to be seen as having resorted to some form of social constructivism since we believe that such approaches – with their faux naïf insistence that things could have been otherwise if people had agreed – are at best unhelpful. Social constructivist approaches appear to us to locate their territories and then make claims that there is no ‘there’ there, so to speak, without agreement. We would not wish to be heard to endorse an approach that treats dependability as a contingent matter, as ‘how you look at it’. Our approach is to regard these things as worldly achievements that require one to look at the practices that exist in and as part of their achievement. This is why we recommend ‘doing the ethnography’ to show what it means to live with systems that are more or less dependable. Through examination of the ‘lived work’ of working with undependable systems (including the work arounds etc. that this involves) we aim to complement existing work in this area. We also believe that a focus on the grammar of dependability is important – Wittgenstein’s insistence on inquiring into what we mean when we use terms such as ‘dependable’ further focuses our attention on the realisation of dependability as lived work.
We propose that the study of dependability can only be enhanced and strengthened by attending to ‘lay’ uses of the term (by ‘lay’ we do not suggest some impoverished version of the term but a necessary complement, rather than an alternative, to the ‘professional’ uses to be found in the literature) and by focussing on the work that goes on to make systems more or less dependable. We do not argue that ethnographic studies or grammatical investigations should replace the work currently undertaken under the rubric of dependability, but that there is what we would call a ‘missing how’ that needs to be addressed and that this can be done satisfactorily in and through ethnographic research on the procedures and situated actions involved in making systems dependable. There is also a sense in which the study of dependability can be developed through the securing of a deeper understanding of the practices by which it is constituted. Ethnographies of the making of dependability measures/metrics might be useful in that they afford those involved the opportunity to reflect on their practice. We have illustrated the need to ‘do the ethnography’ and to consider the manner in which a consideration of Wittgensteinian philosophy might assist in clarifying what we mean by dependability. We have provided examples from our own work on the realisation of dependability in the workplace, and shown how practical actions such as work-arounds contribute to the notion of dependability. It must be kept in mind that dependability is a situated concept and that when one considers the constitution of dependable systems one must keep in mind the settings in which such systems are used and the accompanying work practices. When we look at, for example, the work of the engine plant system we find that the workers engage in a series of situated practical actions in order to have the system be reliable. That is to say, dependability is not simply an inherent property of the system itself but of the work in which it is enmeshed. We can therefore speak of the ‘lived work’ of dependability in that, having done the ethnography, we can see that there is a reflexive relationship between work practice and dependable systems. The aim of this paper has been to demonstrate not that ‘professional’ discourses of dependability have no place in our considerations, but that there is an important practical counterpart to these in lay notions of dependability and the work practice that goes on in and as a part of working with (un-)dependable systems. It is our recommendation that researchers consider this often neglected component when employing these concepts. In addition, we hope to have shown the essentially fragile nature of the JIT system and to have shown how the agility of the work practice, predicated on the autonomy accorded to plant workers, is necessary to keep the system running. The more or less dependable system that comprises the factory requires workers to be accorded autonomy in order to have things work. We have focused attention on the ways that this goes on and how reliability and dependability are practical outcomes of the deployment of knowledge by control room workers in an organisation whose production orthodoxy requires agility to repair its rather fragile nature and to make it work.
The research reported here is funded by the UK Engineering and Physical Sciences Research Council (award numbers 00304580 and GR/N 13999). We would like to thank staff at the case study organisation for their help and participation.
References
[1] P. Beynon-Davies. Information systems failure and risk assessment: the case of the London Ambulance Service Computer Aided Despatch System. In European Conference on Information Systems, 1995.
[2] Mark Hartswood and Rob Procter. Design guidelines for dealing with breakdowns and repairs in collaborative work settings. International Journal of Human-Computer Studies, 53:91–120, 2000.
[3] Christian Heath, Marina Jirotka, Paul Luff and John Hindmarsh. Unpacking collaboration: the interactional organisation of trading in a city dealing room. Journal of Computer Supported Cooperative Work, 3:147–165, 1994.
[4] John Hughes, Val King, Tom Rodden, Hans Andersen. The role of ethnography in interactive systems design. interactions, pages 56–65, April 1995.
[5] J. A. Hughes, J. O’Brien, M. Rouncefield. Organisational memory and CSCW: supporting the ‘Mavis’ phenomenon. Proceedings of OzCHI, 1996.
[6] Nancy Leveson and Clark S. Turner. An investigation of the Therac-25 accidents. IEEE Computer, 26(7):18–41, 1993.
[7] Eric Livingston. The Ethnomethodological Foundations of Mathematics. Routledge, Kegan, Paul, London, 1986.
[8] Wes Sharrock, John A. Hughes. Ethnography in the Workplace: Remarks on its theoretical basis. TeamEthno-Online, Issue 1, November 2001. Available at http://www.teamethno-online.org/Issue1/Wes.html (accessed 14th Feb. 2002)
[9] Lucy A. Suchman. Plans and Situated Actions: The Problem of Human-Machine Communication. Cambridge University Press, 1987.
[10] Alexander Voß, Rob Procter, Robin Williams. Innovation in Use: Interleaving day-to-day operation and systems development. PDC’2000 Proceedings of the Participatory Design Conference, T. Cherkasky, J. Greenbaum, P. Mambrey, J. K. Pors (eds.), pages 192–201, New York, 2000.
[11] Alexander Voß, Rob Procter, Roger Slack, Mark Hartswood, Robin Williams. Production Management and Ordinary Action: an investigation of situated, resourceful action in production planning and control. Proceedings of the 20th UK Planning and Scheduling SIG Workshop, Edinburgh, Dec. 2001.
[12] Robin Williams, Roger Slack and James Stewart. Social Learning in Multimedia. Final Report of the EC Targeted Socio-Economic Research Project: 4141 PL 951003. Research Centre for Social Sciences, The University of Edinburgh, 2000.
[13] Ludwig Wittgenstein. Philosophical Investigations. Blackwell, Oxford, 1953 (2001).
Practical Solutions to Key Recovery Based on PKI in IP Security Yoon-Jung Rhee and Tai-Yun Kim Dept. of Computer Science & Engineering, Korea University Anam-dong Seungbuk-gu, Seoul, Korea {genuine,tykim}@netlab.korea.ac.kr
Abstract. IPSec is a security protocol suite that provides encryption and authentication services for IP messages at the network layer of the Internet. Key recovery has been the subject of a lot of discussion, of much controversy and of extensive research. Key recovery, however, might be needed at a corporate level, as a form of key management. The basic observation of the present paper is that the cryptographic solutions proposed so far completely ignore the communication context. We propose an example scheme that provides key recovery capability by adding key recovery information to an IP datagram. It is possible to take advantage of the communication environment in order to design key recovery protocols that are better suited and more efficient.
1
Introduction
Internet Protocol Security (IPSec) is a security protocol suite that provides encryption and authentication services for Internet Protocol (IP) messages at the network layer of the Internet [5,6,7,8]. Two major protocols of IPSec are the Authentication Header (AH) [7], which provides authentication and integrity protection, and the Encapsulating Security Payload (ESP) [8], which provides encryption as well as (optional) authentication and integrity protection of IP payloads. IPSec offers a number of advantages over other protocols being used or proposed for Internet security. Since it operates at the network layer, IPSec can be used to secure any protocol that can be encapsulated in IP, without any additional requirements. Moreover, IPSec can also be used to secure non-IP networks, such as Frame Relay, since many parts of IPSec (e.g., ESP) do not necessarily require encapsulation in IP. Key recovery (KR) has been the subject of a lot of discussion, of much controversy and of extensive research, encouraged by the rapid development of worldwide networks such as the Internet. A large-scale public key infrastructure is required in order to manage signature keys and to allow secure encryption. However, a completely liberal use of cryptography is not fully accepted by governments and companies, so escrowing mechanisms need to be developed in order to fulfill current regulations. Because of the technical complexity of this problem, many rather
unsatisfactory proposals have been published. Some of them are based on tamper-resistant hardware, others make extensive use of trusted third parties. Furthermore, most of them notably increase the number of messages exchanged by the various parties, as well as the size of the communications. For these reasons, the widespread opinion of the research community, expressed in a technical report [13] written by well-known experts, is that large-scale deployment of a key recovery system is still beyond the current competency of cryptography. Despite this fact, key recovery might be needed at a corporate level, as a form of key management. The basic observation of the present paper is that the cryptographic solutions that have been proposed so far completely ignore the communication context. Static systems are put forward for key recovery at the IP layer in the Internet. This paper proposes a method for carrying byte-oriented key recovery information in a manner compatible with the IPSec architecture. We design a key recovery protocol that is connection-oriented, interactive and more robust than other proposals.
2
Background on Key Recovery
In this section, we describe the need for key recovery and practical key recovery protocols compatible with Internet standards. The history of key recovery started in April 1993, with the proposal by the U.S. government of the Escrow Encryption Standard [14], EES, also known as the CLIPPER project. Afterwards, many key recovery schemes have been proposed. To protect user privacy, the confidentiality of data is needed. For this, key recovery seems useless, but there are some scenarios where key recovery may be needed [15,16]:
- when the decryption key has been lost or the user is not present to provide the key;
- where commercial organizations want to monitor their encrypted traffic without alerting the communicating parties, for example to check that employees are not violating an organization’s policy;
- when a national government wants to decrypt intercepted data for the investigation of serious crimes or for national security reasons.

3 Related Protocols
On one hand the Internet Engineering Task Force (IETF) standardization effort has led to Internet Security Association and Key Management Protocol (ISAKMP) [5] and IPSec [7,8] for low layers of the ISO model for all protocols above the transport layer. Each of these protocols splits the security protocol into two phases. The first phase enables communicating peers to negotiate the security parameters of the association (encryption algorithm, Hashed Message Authentication Code (HMAC) [11,12] mechanisms, encryption mode), the session key, etc. Moreover this first phase
can be split again into an authentication stage and a negotiation stage, but this results in an increase in the number of exchanges (from 3 to 6), which accordingly decreases performance (the aggressive mode and main mode [4]). The second phase allows encryption of the transmitted messages by means of the cryptographic algorithms defined during the first phase and adds integrity and authentication services with HMAC methods. On the other hand, publications about key recovery systems come from the cryptography community: cryptographic primitives are used to design high-level mechanisms, which cannot fit easily into standards such as Requests for Comments (RFCs). We have studied the Royal Holloway Protocol (RHP) [1], which comes from pure academic research in cryptography, and the Key Recovery Alliance (KRA) protocols. Both schemes are based on the encapsulation mechanism. Additional key recovery data are sent together with the encrypted message, enabling the Key Escrow Agent (KEA) to recover the key. These data are included in a Key Recovery Field (KRF). Another possibility is to escrow partial or total information about the secret key, such as in CLIPPER. However, this technique requires more cryptographic mechanisms to securely manage the escrowed key (threshold secret sharing, proactive systems, etc.). The main drawback of the first system is that the security relies on one single key owned by the Trusted Third Party (TTP) and that this key may be subject to attacks. The main problem raised by escrowing the user’s secret is the need to keep the key in a protected location.

3.1 RHP Encapsulation
The RHP architecture is based on a non-interactive mechanism with a single exchanged message and uses the Diffie-Hellman scheme. The RHP system allows messages sent to be decrypted using the user’s private receive key. Each user is registered with a TTP, denoted TTP_A for user A. The notation used here is as follows:

A, B: the communicating peers
TTP_A, TTP_B: A’s TTP, B’s TTP
p: a prime shared between TTP_A and TTP_B
g: an element shared between TTP_A and TTP_B
K(TTP_A, TTP_B): a secret key shared between TTP_A and TTP_B
K_pr-s(A): A’s private send key (random value x, 1 < x < p − 1)
K_pu-s(A): A’s public send key (g^x mod p)
K_pr-r(A): A’s private receive key (a, 1 < a < p − 1, derived from A’s name and K(TTP_A, TTP_B))
K_pu-r(A): A’s public receive key (= g^a mod p)

The following is the RHP protocol description.

Fig. 1. RHP protocol description

1. A obtains K_pu-r(B) (= g^b mod p). TTP_A can compute K_pr-r(B) (= b) from B’s name and K(TTP_A, TTP_B).
2. A derives a shared key, (g^b mod p)^x mod p = g^xb mod p, from K_pr-s(A). This is the session key, or the encryption key for the session key.
3. A transmits K_pu-s(A), signed by TTP_A, and K_pu-r(B). This information serves both as a KRF and as a means of distributing the shared key to B.
4. Upon receipt, B verifies K_pu-s(A) and derives the shared key from A’s public send key and K_pr-r(B).
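For illustration only, the following Python sketch reproduces the Diffie-Hellman arithmetic behind steps 2 and 4. The parameters p and g and the private values are toy placeholders rather than values from the RHP specification, and the signing of K_pu-s(A) by TTP_A is omitted.

```python
# Illustrative sketch of the RHP shared-key derivation (not the authors' implementation).
# p and g are toy values; real deployments use large, standardised Diffie-Hellman groups.

p = 0xFFFFFFFFFFFFFFC5   # toy prime shared between TTP_A and TTP_B
g = 5                    # shared element

# A's private/public send key pair
x = 0x1234567            # K_pr-s(A), random with 1 < x < p - 1
K_pu_s_A = pow(g, x, p)  # K_pu-s(A) = g^x mod p

# B's receive key pair; b is derived by the TTPs from B's name and
# K(TTP_A, TTP_B), so either TTP can regenerate it.
b = 0x89ABCDE            # K_pr-r(B)
K_pu_r_B = pow(g, b, p)  # K_pu-r(B) = g^b mod p

# Step 2: A derives the shared key g^(xb) mod p from K_pu-r(B) and its own x.
shared_A = pow(K_pu_r_B, x, p)

# Step 4: B (or a TTP holding b) derives the same key from K_pu-s(A) and b.
shared_B = pow(K_pu_s_A, b, p)

assert shared_A == shared_B   # both sides hold g^(xb) mod p
```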
The main advantage of the RHP is that it is robust in terms of basic interoperability. But the drawback of the RHP is that it mixes key negotiation and key recovery. It is difficult to integrate this scheme inside the security protocols of the ISAKMP since the protocol has only one phase. Another drawback is that the KRF is sent only once. In fact, this is a major disadvantage in the system since the session can be long and the KEA can miss the beginning. We refer to this difficulty as the session long-term problem. It is necessary to send the KRF more than once. However, the advantage of this system is that it encrypts the session key with the shared key, so that the security depends on the communicating peers and not on the TTP. But since the private receive keys depend on the TTP, this advantage disappears. Finally, this solution is a hybrid between encapsulation and escrow mechanisms because the private send key is escrowed and the private receive key can be regenerated by both TTPs.

3.2 KRA Encapsulation
The KRA system proposes to encrypt the session key with the public key of the TTPs. The Key Recovery Header (KRH) [9] is designed to provide a means of transmitting the KRF across the network so that it may be intercepted by an entity attempting to perform key recovery. The KRH carries keying information about the ESP security association. Therefore, the KRH is used in conjunction with an ESP security association. In the ISAKMP, the use of the KRH can be negotiated in the same manner as other IPSec protocols (e.g., AH and ESP) [10]. Figure 2 shows IP packets with the KRH used with IPv4.
(a) KRH used with ESP: IP Header | KRH | ESP | Payload
(b) KRH used with AH and ESP: IP Header | AH | KRH | ESP | Payload

Fig. 2. IP packets with the KRH in IPv4
Various schemes using this technique have been proposed, such as TIS CKE (Commercial Key Escrow) [3] or IBM SKR (Secure Key Recovery) [2]. The system is quite simple and allows many variations according to the cryptographic encryption schemes. This proposal separates the key recovery information and the key exchange. The system modularity is also compatible with the IETF recommendation. But the KRF contains the encryption of the same key under a lot of TTP public keys. Thus, the KRF can rapidly grow and one must take proper care against broadcast message attacks. In the KRA solution it is not necessary to send a KRF in each IP packet inside IPSec [9]. The intervals at which the initiator and responder send the KRF are established independently [10]. But since the KRF size is big, the KRF cannot be included in the IP header. So, it can be sent in the IPSec header that is a part of the IP packet data. This decreases the available bandwidth. The second drawback is that the session key is encrypted under the TTP public key. Finally, this solution is not robust because if this key is compromised, the system collapses.
4
The Proposed Key Recovery for IPSec
In this section, we propose some key recovery solutions for IPSec, which improve on the previous protocols described in Section 3. The main problem with the RHP proposal is that the protocol is connectionless-oriented. Therefore, the protocol is not well suited to IPSec or ISAKMP, which are connection-oriented and allow interactivity. The KRA’s proposal seems a better solution than the RHP. Still, the security of the session key depends on a fixed public key of the TTPs for all communications. Our solution is based on IETF protocols in order to improve the security of the system, the network communication, and the interoperability for cross-certification. We can integrate the modified RHP method in the IETF protocols (ISAKMP, IPSec) if we realize a real Diffie-Hellman key exchange such as in Oakley [4] in order to negotiate a shared key. After this first phase, the KRF is sent with the data.

4.1 Negotiation of Security Association for KRF
In phase 2 of ISAKMP, the negotiation of the security association for key recovery takes place. The proposed system allows the session key that is sent to be encrypted by the communicating peers using a Diffie-Hellman shared key between the two peers and a time stamp, and
decrypted by the communicating peers and their TTPs using each user’s private key. The notation used in our proposed protocol is shown below.
TT: a time stamp
K_pr(A): A’s private key; TTP_A escrows it (x, 1 < x < p − 1)
K_pu(A): A’s public key (g^x mod p)
K_DH(A,B): a Diffie-Hellman shared key between A and B (A: (g^y mod p)^x mod p = g^xy mod p; B: (g^x mod p)^y mod p = g^xy mod p)
K_ek-sk(A->B): an encryption key for the session key in case A is the initiator and B is the responder (f(g^xy mod p, TT), where f is a one-way function)
The following is the mechanism used in the proposed protocol.
Fig. 3. Negotiation of Security Association for KRF used in the Proposed Protocol
1. A obtains K_pu(B), generates TT, derives K_DH(A,B), and calculates K_ek-sk(A->B). B obtains K_pu(A) and derives K_DH(A,B).
2. A transmits {TT}_K_DH(A,B) to B.
3. Upon receipt, B calculates K_ek-sk(A->B) from {TT}_K_DH(A,B).

In the proposed protocol, we simplify the RHP by reducing the number of required keys. In step 2, we can improve the freshness of the encryption key for the session key, K_ek-sk(A->B), by generating a different key every time using the time stamp and a one-way function, and we reduce the dependency of this encryption key on the TTPs. Consequently, the proposed protocol can be more robust than the RHP because it reduces the influence of escrowing the private receive keys with the TTP, as is done in the RHP.
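As a purely illustrative sketch of steps 1–3, the Python fragment below derives K_ek-sk(A->B) from the Diffie-Hellman shared key and the time stamp. The paper does not fix the one-way function f, so SHA-256 is used here only as a stand-in, and p, g and the private keys are toy values.

```python
# Minimal sketch of the proposed K_ek-sk derivation (Section 4.1); all
# concrete parameter values and the choice of f are assumptions.
import hashlib
import time

p = 0xFFFFFFFFFFFFFFC5
g = 5

x = 0x1111111            # K_pr(A), escrowed at TTP_A
y = 0x2222222            # K_pr(B), escrowed at TTP_B
K_pu_A = pow(g, x, p)
K_pu_B = pow(g, y, p)

# Both peers derive the Diffie-Hellman shared key K_DH(A,B) = g^(xy) mod p.
K_DH_A = pow(K_pu_B, x, p)
K_DH_B = pow(K_pu_A, y, p)
assert K_DH_A == K_DH_B

def f(dh_key: int, tt: int) -> bytes:
    """Placeholder one-way function combining the DH key and the time stamp."""
    return hashlib.sha256(dh_key.to_bytes(32, "big") + tt.to_bytes(8, "big")).digest()

# Step 1: A generates TT and computes K_ek-sk(A->B) = f(g^(xy) mod p, TT).
TT = int(time.time())
K_ek_sk = f(K_DH_A, TT)

# Step 3: B recovers TT from {TT}_K_DH(A,B) and recomputes the same key;
# a TTP holding the escrowed private key can do likewise.
assert f(K_DH_B, TT) == K_ek_sk
```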
4.2 Transmission of KRF
During the IPSec session, we send the KRF with the encrypted message, as KRA does. The following is our proposed KRH format. The KRH holds key recovery information for an ESP security association. The format of the KRH is shown in Figure 4; its fields are, in order: Next Header, Length, Reserved, Security Parameter Index (SPI), Encrypted Time Stamp, KRF Length, Key Recovery Field (KRF, variable length), Validation Field Type, Validation Field Length, and Validation Field Value (variable length).

Fig. 4. Key Recovery Header format
Next Header. 8 bits wide. Identifies the next payload after the Key Recovery Header. The values in this field are the set of IP Protocol Numbers as defined in the most recent RFC from the Internet Assigned Numbers Authority (IANA) describing ‘Assigned Numbers’.
Length. 8 bits wide. The length of the KRH in 32-bit words. The minimum value is 0 words, which is only used in the degenerate case of a ‘null’ key recovery mechanism.
Security Parameters Index (SPI). A 32-bit pseudo-random value identifying the security association for this datagram. The SPI value 0 is reserved to indicate that ‘no security association exists’.
Encrypted Time Stamp. The value of {TT}_K_DH(A,B) generated in phase 2 of ISAKMP. It needs to be transmitted with the KRF because it is required when the corresponding TTP recovers the KRF, but it is not escrowed to the TTP.
Key Recovery Field Length. Number of 32-bit words in the Key Recovery Field.
Key Recovery Field. The key recovery data. It contains the session key of the current IPSec session encrypted with the encryption key (K_ek-sk(A->B)) from ISAKMP.
Validation Field Type. Identifies the technique used to generate the Validation Field.
Validation Field Length. Number of 32-bit words in the Validation Field Value. The Validation Field Length must be consistent with the Validation Field Type.
Validation Field Value. The Validation Field Value is calculated over the entire KRH.
The TTPs can recover the key as well as the user (by executing a Diffie-Hellman operation) since they escrow the user’s private key and obtain the time stamp from the KRH. Even if we keep the same secret key, a KRF must be sent, since the session key is not escrowed. Hence, the KRF is sent several times, according to an acceptable bandwidth degradation. We send the KRF in the IPSec packet as part of the IP packet data. The KRF only depends upon a specific user. This allows sending the KRF in a single direction according to the user’s policy. User A can choose to send (or not) the session key encrypted with his TTP’s public key, and user B can do the same. This is an interesting feature compared to the RHP, since in the RHP scheme both TTPs can decrypt all messages without communication with each other.
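To make the layout concrete, the following sketch serialises a KRH as described above. The widths of the Reserved field, the Encrypted Time Stamp and the validation type/length fields are not fixed by the text, so the values chosen here (and the example field contents) are assumptions made only to keep the fields 32-bit aligned.

```python
# Rough serialisation sketch of the KRH; field widths not given in the text
# are assumptions, and the example values are arbitrary.
import struct

def build_krh(next_header: int, spi: int, enc_timestamp: bytes,
              krf: bytes, vf_type: int, vf_value: bytes) -> bytes:
    # keep everything aligned to whole 32-bit words
    assert len(enc_timestamp) % 4 == 0 and len(krf) % 4 == 0 and len(vf_value) % 4 == 0
    body = (
        struct.pack("!I", spi) +                        # Security Parameter Index
        enc_timestamp +                                  # {TT}_K_DH(A,B)
        struct.pack("!I", len(krf) // 4) + krf +         # KRF Length (words) + KRF
        struct.pack("!HH", vf_type, len(vf_value) // 4) + vf_value
    )
    length_words = (4 + len(body)) // 4                  # KRH length in 32-bit words
    header = struct.pack("!BBH", next_header, length_words, 0)  # next header, length, reserved
    return header + body

# Example: a KRH carrying a 16-byte encrypted session key as the KRF.
krh = build_krh(next_header=50, spi=0x12345678,
                enc_timestamp=b"\x00" * 8, krf=b"\x11" * 16,
                vf_type=1, vf_value=b"\x22" * 20)
```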
4.3 Comparison of Protocols
Our proposal is a mix of the modified RHP and the KRA solutions that combines the advantages of both systems. This scheme is based on an escrow mechanism. First, we keep the interoperability of the RHP, improve robustness compared with the RHP, and include it in the Internet protocols. Secondly, the KRA solution is used, but to gain robustness we encrypt the session key with a time stamp and a shared key established by Diffie-Hellman key exchange between the communicating users (or with the user’s public key), not with the TTPs’ public keys. Therefore it can provide a method for carrying byte-oriented key recovery information in a manner compatible with the IPSec architecture. We compare the existing protocols and our proposed protocol. In Table 1, we show the comparison between the RHP, the KRA and the proposed protocols.

Table 1. Comparison of protocols (O: high support, low: low support, X: not supported)

                               RHP   KRA   The Proposed
Compatibility with IETF        X     O     O
Robustness                     X     low   O
Reducing overhead of network   low   low   O

5 Conclusion and Future Works
We have proposed an example scheme that provides key recovery capability by adding key recovery information to an IP datagram. It is possible to take advantage of the communication
environment in order to design key recovery protocols that are better suited and more efficient. We have designed a key recovery protocol that is suitable for connection-oriented communication and more robust than the RHP or KRA proposals by combining the advantages of the modified RHP and the KRA solutions. As future work, we plan an analysis and evaluation of the performance of the mechanism. To progress, we will apply this by modifying an existing IPSec system in order to obtain exact analysis results.
References
1. N. Jefferies, C. Mitchell, and M. Walker, “A Proposed Architecture for Trusted Third Party Services”, in Cryptography: Policy and Algorithms, Proceedings: International Conference Brisbane, Lecture Notes in Computer Science, LNCS 1029, Springer-Verlag, 1995
2. R. Gennaro, P. Karger, S. Matyas, M. Peyravian, A. Roginsky, D. Safford, M. Zollett, and N. Zunic. Two-Phase Cryptography Key Recovery System. In Computers & Security, pages 481-506. Elsevier Sciences Ltd, 1997
3. D. M. Balenson, C. M. Ellison, S. B. Lipner and S. T. Walker, “A New Approach to Software Key Encryption”, Trusted Information Systems
4. The Oakley Key Determination Protocol (RFC 2412)
5. Internet Security Association and Key Management Protocol (ISAKMP) (RFC 2408)
6. The Internet Key Exchange (IKE) (RFC 2409)
7. IP Authentication Header (AH) (RFC 2402)
8. IP Encapsulating Security Payload (ESP) (RFC 2406)
9. T. Markham and C. Williams, Key Recovery Header for IPSEC, Computers & Security, 19, 2000, Elsevier Science
10. D. Balenson and T. Markham, ISAKMP Key Recovery Extensions, Computers & Security, 19, 2000, Elsevier Science
11. The Use of HMAC-MD5-96 within ESP and AH (RFC 2403)
12. The Use of HMAC-SHA-1-96 within ESP and AH (RFC 2404)
13. H. Abelson, R. Anderson, S. Bellovin, J. Benaloh, M. Blaze, W. Diffie, J. Gilmore, P. Neumann, R. Rivest, J. Schiller, and B. Schneier. The Risks of Key Recovery, Key Escrow, and Trusted Third-Party Encryption. Technical report, 1997. Available from http://www.crypto.com/key-study
14. NIST, “Escrow Encryption Standard (EES)”, Federal Information Processing Standard Publication (FIPS PUB) 185, 1994
15. J. Nieto, D. Park, C. Boyd, and E. Dawson, “Key Recovery in Third Generation Wireless Communication Systems”, Public Key Cryptography-PKC2000, LNCS 1751, pp. 223-237, 2000
16. K. Rantos and C. Mitchell, “Key Recovery in ASPeCT Authentication and Initialization of Payment Protocol”, Proc. of ACTS Mobile Summit, Sorrento, Italy, June 1999
Redundant Data Acquisition in a Distributed Security Compound Thomas Droste Institute of Computer Science Department of Electrical Engineering and Information Sciences Ruhr-University Bochum, 44801 Bochum, Germany
[email protected]
Abstract. This paper introduces a new concept for an additional security mechanism which works on every host inside a local network. It focuses on the redundant data acquisition used to obtain the complete net-wide network traffic for later analysis. The compound itself has a distributed structure. Different components act together on different hosts in the security compound. Therefore, the acquisition and analysis are done net-wide by hosts with free resources, parallel to their usual work. Because the hosts, in particular workstations, change dynamically over the day, the compound must adapt to the actual availability of all hosts. It must be guaranteed that every transferred packet inside the local network is recorded. The network traffic at each host in the network is recorded by a minimum of two others. The recorded traffic is combined at a node in order to get a single complete stream for analysis. The resulting problems at the different states of the redundant data acquisition are described and the solutions used are presented.
1
Introduction
The distributed security compound is implemented on each host in the local network. It works parallel to other security mechanisms like firewalls, or intrusion detection systems (IDS) and intrusion response systems (IRS). The intention is to obtain an additional security mechanism which works transparently on the established network and on different hosts. The achieved knowledge and analysis depend on the whole net-wide network traffic transferred inside a local network. A security violation or unusual behavior can then be detected in the whole network. An advantage is the use of usual workstations for the compound. No additional hardware is necessary. All functions and components are added to the existing host configuration. Different components for compound detection, network data acquisition, distributed analysis, compound communication, and reaction on security violations act together [3]. A basic component is the compound detection. The whole network is scanned for hosts which have to be integrated into the security compound. Hosts without the
needed components are detected, too. Therefore, a new host in the network - a newly added member or a possible intruder with physical access to the local network - will be identified as such a system. Each host can acquire the network traffic (see Sect. 2), which is used for the later analysis. The analysis of recorded and combined data is split and distributed to the compound members. The compound communication is possible among all components. Cryptographic mechanisms are required to encode the communication inside the compound and to protect the recorded and transferred data. If a local preanalysis process or a distributed analysis recognizes a security violation, the reaction component will be activated. In that case, an internal compromised host is decoupled from the compound and access from/to other hosts is restricted. The local reaction component can e.g. shut down the host or log off the local user. All components of the security compound work as services in the background, i.e. transparent to the local user. They start interacting before the user logs on and remain active all the time the system is up.
2
Acquisition of Network Traffic
The acquisition is realized parallel to the normal network traffic. The complete network traffic inside a collision domain d_CD,i (where i is the number of the collision domain) is recorded at the network adapter of a host. Contrary to host-based traffic d_r,i,j (where j is the host in collision domain i), all packets are recorded. Assuming that all hosts record the traffic, the resulting traffic in a collision domain is given by
d_{CD,i} = \sum_{j=1}^{n_i} d_{r,i,j}    (1)
The net-wide traffic dges to record is given by
d_{ges} = \sum_{i=1}^{n_{CD}} d_{CD,i}    (2)
The recorded data d include each transmitted frame on the link layer. The data for the higher layers, i.e. header and encapsulated data, are completely included [12]. Each frame is marked with a timestamp. After recording the data, a transformation is done by adding the MAC address (medium access control, the hardware address of the network interface [13]) of the recording host and conversion into a readable packet order (layers are dissolved) for the combining process and the later analysis. This disintegration is done at most up to the transport layer. The beginning of a converted frame looks as follows [11]: date;time;mac_source;mac_dest;mac_host;frame_type;... Parallel to the dissolved data d_r´ the original frames d_r´´ are stored for later requests from analysis processes.
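Purely as an illustration of this conversion step, the following sketch rewrites a raw Ethernet frame into such a semicolon-separated record; the exact timestamp format and any details beyond the fields listed above are assumptions.

```python
# Illustrative sketch of the frame conversion: each captured frame is
# timestamped and rewritten as a record starting with date, time, source and
# destination MAC, the MAC of the recording host and the frame type.
from datetime import datetime

def convert_frame(raw: bytes, recording_host_mac: str) -> str:
    ts = datetime.now()
    mac_dest = raw[0:6].hex(":")
    mac_source = raw[6:12].hex(":")
    frame_type = int.from_bytes(raw[12:14], "big")   # EtherType / length field
    return ";".join([
        ts.strftime("%Y-%m-%d"),          # date
        ts.strftime("%H:%M:%S.%f"),       # time
        mac_source,
        mac_dest,
        recording_host_mac,               # MAC of the recording host
        hex(frame_type),
    ]) + ";..."                           # dissolved higher-layer fields would follow

# Example with a dummy Ethernet frame header:
record = convert_frame(bytes.fromhex("ffffffffffff00112233445508004500"),
                       "00:aa:bb:cc:dd:ee")
```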
The traffic in a collision domain d_CD can be recorded at any point inside this physically combined network. Therefore, each host inside the collision domain can record this traffic. A problem is the acquisition in switched networks. There, the traffic is controlled by the switch. Every packet is transferred directly to the station(s) behind the corresponding port. Other stations cannot recognize the traffic. If only one host is connected to a port, it will act as a single collision domain (n_i = 1). The recorded data at the host d_r,i,1 is equal to d_CD,i. A detection of hosts which are not part of the compound is still possible because of the administrative traffic (e.g. DNS (domain name system [10]) queries). It does not depend on the topology (e.g. ethernet 10Base-2/10Base-T with CSMA/CD (carrier-sense multiple access with collision detection [13]) or a switched network). As soon as communication takes place to any compound member, this host is detected.
3
Generation of Redundancy
For the analysis, it is important to have a maximum of information about the network traffic. To combine the traffic from each host dr´ a filtering of duplicated packets is necessary. The result is the net-wide network data dcombine, given by
d_{combine} = \bigcup_{k=1}^{n_k} d_{r,k}', \quad n_k = \sum_{i=1}^{n_{CD}} n_i    (3)
n_k is the number of hosts in the compound, including all hosts n_i in each of the n_CD collision domains. The redundancy degree increases with the number of compound members. Therefore, it is not necessary to record the complete traffic at all hosts. A reduction from this highly redundant level to a smaller one is forced. This is done by distribution of acquiring rules. Each compound member j gets a list of two or more other hosts to monitor (scalable). The recorded traffic d_r,j (see Sect. 2) is filtered with the list. Now, the resulting output d_r,j´ depends only on the traffic of the hosts in the list (and the host’s own traffic). A rough estimation of the relation between the number of hosts and the recorded data is given in Table 1.

Table 1. Dependency of recorded and combined data for a particular host j

Number of hosts         Recorded data         Combined data
low (n_i ≤ 3)           d_r,j´ = d_ges´       d_r,j / d_combine = 1
medium (3 < n_i ≤ 15)   d_r,j´ < d_ges´       d_r,j / d_combine ≈ 3/n_i
high (n_i > 15)         d_r,j´ << d_ges´      d_r,j / d_combine ≈ 1/n_i … 3/n_i
The mentioned relation depends on the traffic, due to the addressing of packets. Three groups are possible: communication between a host and its recording partner, communication inside the network or net-wide requests (see Sect. 2, administrative traffic), and communication between an internal and an external host (outside the local network, e.g. to the internet). The resulting traffic cannot be predicted. It depends on the real network traffic.

3.1 Loss of Packets
The requirement that the active components work transparently creates a problem. The goal is to use the distributed compound parallel to the usual work of the local user. Depending on the host (processor speed, actual system load), packets can be lost. This problem occurs if a large data stream is transferred and the system is not able to save the packets on the local hard disk. An example is the transfer of a large file (i.e. in single packets). It can be sent to another system or be received by a host. While sending the file d_r,send, the host itself (sender) and the hosts monitoring it acquire d_r,send and must store all transferred packets. The sender stores the file in a network record d_r,receive. The write process itself is the bottleneck. The sending and reception are independent. The buffer of the network card is not fast enough to store the complete data (d_r,receive ≤ d_r,send). For the monitoring hosts, the problem occurs if the local system load is high or an additional network transfer is in progress. While receiving a file, the host (receiver) and the monitoring hosts also have the latter problem. To solve this problem, the network acquisition can be adapted. Thus, not the complete frame is processed. Only the beginning of the frame d_r,1..b (a defined number b of bytes) is recorded and stored. This part has all the necessary information to extract the used protocols (d_r,1..b → d_r´). This is done by coupling the acquisition to the system load, as sketched below. For the analysis of the communication, the complete data (here the file) is not required. Only if data on the application layer is needed must it be recorded or available for recording. E.g. virus scans should proceed before a transmission starts on the local host, or at a central point in case of a connection to an external system. Missing packets of a large transfer are of lower priority if the content itself is irrelevant. Packets not belonging to the large transfer must be recorded. This is done by the above-mentioned reduction of the frame size to acquire. The redundancy is still given.
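The following minimal sketch illustrates one way such a load-coupled reduction could look; the load metric, the threshold and the value of b are assumptions, not values given in the text.

```python
# Sketch of load-coupled frame truncation (assumed details): under high local
# system load only the first b bytes of each frame are stored, which still
# contains the link/IP/transport headers needed to dissolve the protocols.
import os

TRUNCATED_BYTES = 96   # assumed value of b

def bytes_to_store(frame: bytes, load_threshold: float = 1.0) -> bytes:
    load1, _, _ = os.getloadavg()        # simple stand-in for the load metric
    if load1 > load_threshold:
        return frame[:TRUNCATED_BYTES]   # keep headers only (d_r,1..b)
    return frame                          # keep the complete frame
```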
3.2 Dynamic Update
A second reason for the use of different sources is the availability of hosts. If a host is temporarily not reachable or usable due to extremely high system load, system compromise, hardware problems or network errors, the network traffic must be acquired by other hosts. The redundancy for all hosts must be recreated to a minimum factor of two. Building redundancy is delegated based on the compound detection. The remaining systems are advised to monitor new hosts at random. Each host is monitored by at least two compound members.
In case of a network error, e.g. when a connection between sub-networks is broken, independent compounds will be established. After the networks merge again, the compound will be recombined and a new random association for monitoring is made. Additionally, an update of the association is done periodically and whenever a host joins the compound or is shut down.

3.3 Trusted Relationship
A trust degree is used to include the relationship among the hosts, i.e. which host is allowed to monitor another. This trust level is given in the host configuration. Basing on this value, hosts with the same trust level build a group. Increasing the trust level (↑,>0) results in a lowering of confidential (↓). Only hosts on the same trust level monitor each other to guarantee redundancy. If there are too few hosts in a group (i.e. minimum redundancy of two) hosts from the next lower trust level (higher confidential) take over the monitoring. Table 2. Hosts Rl,j (trust level l, host number j) in a compound associated with trust levels
Trust level   Hosts R
0             R0.1   R0.2   R0.3   R0.4
1             R1.1   R1.2   R1.3
2             R2.1
3             R3.1   R3.2
An example of a small compound is given in Table 2.
Fig. 1. Result of associated hosts based on Table 2
The group of trust level 1 monitors trust level 2, and the two compound members in trust level 3 are monitored by the host in trust level 2 (Fig. 1 shows an example); this is done to guarantee the redundancy. The shown example presents only one simple case of a possible association. With each new update of the association (see Sect. 3.2) a new result is forced at random [7]. For a larger number of hosts per trust level it is not necessary that lower trust levels monitor higher ones; the groups are then independent of each other. To increase the redundancy factor, one host can be monitored by a larger number of hosts.
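A minimal sketch of how such a random association could be built is given below. It assumes hosts grouped by trust level, a minimum redundancy factor of two, and borrowing monitors from the next lower trust level when a group is too small; the class and method names are hypothetical and not part of the original services.

```java
import java.util.*;

// Sketch: random monitoring association per trust level with a minimum
// redundancy of two monitors per host (illustrative only).
public final class MonitoringAssociation {

    /**
     * hostsByLevel maps a trust level to the hosts in that group.
     * Returns, for every host, the set of hosts that monitor it.
     */
    static Map<String, Set<String>> associate(SortedMap<Integer, List<String>> hostsByLevel,
                                              Random rnd) {
        Map<String, Set<String>> monitors = new HashMap<>();
        for (Map.Entry<Integer, List<String>> e : hostsByLevel.entrySet()) {
            int level = e.getKey();
            List<String> group = e.getValue();
            // Candidate monitors: same trust level; if the group is too small,
            // borrow hosts from the next lower trust level (higher confidentiality).
            List<String> candidates = new ArrayList<>(group);
            if (candidates.size() < 3 && hostsByLevel.containsKey(level - 1)) {
                candidates.addAll(hostsByLevel.get(level - 1));
            }
            for (String host : group) {
                List<String> others = new ArrayList<>(candidates);
                others.remove(host);
                Collections.shuffle(others, rnd);          // random association
                int k = Math.min(2, others.size());        // minimum redundancy factor of two
                monitors.put(host, new HashSet<>(others.subList(0, k)));
            }
        }
        return monitors;
    }

    public static void main(String[] args) {
        // Compound from Table 2 / Fig. 1.
        SortedMap<Integer, List<String>> compound = new TreeMap<>();
        compound.put(0, Arrays.asList("R0.1", "R0.2", "R0.3", "R0.4"));
        compound.put(1, Arrays.asList("R1.1", "R1.2", "R1.3"));
        compound.put(2, Arrays.asList("R2.1"));
        compound.put(3, Arrays.asList("R3.1", "R3.2"));
        System.out.println(associate(compound, new Random()));
    }
}
```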
4 Combining Acquired Data
To use the redundancy of the recorded data, the single-host data dr,j´ is combined at a node into dcombine (see Equation 3). The combination process checks every packet for equality with the packets from the other hosts dr,j. The result is a single packet stream without redundant packets [5].
Fig. 2. Process of combination of (parted) data dr,j´ (j = 1..3) to dcombine with time synchronization
Fig. 2 shows a combination process with synchronization. Here, the data of three hosts are available. The timestamp t11 of the recorded data of host 1 dr,1 is the lowest
and is used as the time base. Beginning with the first packet of host 1 (in dr,1), the synchronization time tsync for the combined data dcombine is set to this time. The later data (from dr,2 and dr,3) are sorted into this time line. A synchronization between an input data stream (dr,j) and the combined data (dcombine) is done when the first identical packets occur. In the figure the synchronization is shown indirectly by the lines between the combined data and two of the input data streams. Each time a new (additional, following) input data stream from a host is used, a new synchronization must be done. The former time difference in relation to the combined data is used until a packet suitable for synchronizing is available. The combination and sorting process for one single packet can be done in less than 1 ms (on a current system assembled for usual work). Recorded data of 10,000 packets have a size of 1.5 Mbyte (about 160 bytes per packet).
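The combination step can be sketched as follows. The sketch assumes that the per-host clock offsets are already known (instead of being derived from the first identical packet, as described above) and that packet equality can be decided from the payload alone; the class and field names are hypothetical.

```java
import java.util.*;

// Sketch: combine the (partial) recordings d_r,j' of several hosts into one
// duplicate-free stream d_combine (simplified model, not the measured implementation).
public final class Combine {

    static final class Packet {
        final long timestamp;     // acquisition time in ms
        final String payload;     // simplified packet content used for the equality check
        Packet(long t, String p) { timestamp = t; payload = p; }
        public String toString() { return timestamp + ":" + payload; }
    }

    /** Merge several host recordings, shift them by their clock offset and drop duplicates. */
    static List<Packet> combine(Map<String, List<Packet>> recordings,
                                Map<String, Long> clockOffset) {
        List<Packet> all = new ArrayList<>();
        Set<String> seen = new HashSet<>();                 // payloads already taken over
        for (Map.Entry<String, List<Packet>> e : recordings.entrySet()) {
            long offset = clockOffset.getOrDefault(e.getKey(), 0L);
            for (Packet p : e.getValue()) {
                if (seen.add(p.payload)) {                  // keep only the first copy of a packet
                    all.add(new Packet(p.timestamp + offset, p.payload));
                }
            }
        }
        all.sort(Comparator.comparingLong(p -> p.timestamp)); // one single time line
        return all;
    }

    public static void main(String[] args) {
        Map<String, List<Packet>> rec = new LinkedHashMap<>();
        rec.put("host1", Arrays.asList(new Packet(10, "A"), new Packet(12, "B")));
        rec.put("host2", Arrays.asList(new Packet(13, "B"), new Packet(15, "C")));
        Map<String, Long> offsets = new HashMap<>();
        offsets.put("host2", -1L);                          // host2 clock runs 1 ms ahead
        System.out.println(combine(rec, offsets));          // [10:A, 12:B, 14:C]
    }
}
```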
5 Summary & Outlook
The acquisition of data is one step in the security compound. Redundancy is important to guarantee the availability of correct data for the analysis [8]. The distributed structure helps, but also produces a number of problems [6][14], some of which were shown in this article. The main goal is to have a security compound running in the background parallel to the usual user work. With this improved security the safety will be increased, too. All components and resources can be observed by adapting the mechanisms of the compound to the safety requirements. The mechanisms for acquiring and for combining data are implemented separately. The acquisition is done by a packet driver with an additional specialized service. Depending on the local system and network load some packets may get lost in a single system. This is compensated by the other systems. The dynamic update is done by an extra service, which adapts to the currently available hosts. Test series verified the implementation. The combining process resulted in correctly acquired traffic for each series of measurements. Only at the beginning of the combined traffic did gaps occur in the time line. A more exact synchronization among all hosts solved this problem. For further use, an optimization is necessary to speed up the combining process. With the acquired and combined data the analysis of the traffic can start. Examples are pattern matching against known attacks (e.g. port scans), anomaly detection (as intrusion detection systems do) or other forms of analysis. A main advantage is the possibility to keep a compound in case the network is parted (malfunction of coupling elements): a new security compound is established by the dynamic detection in the single network partitions. Furthermore, the availability of the complete net-wide network traffic for the analysis is a factor other security systems cannot use. The distributed analysis inside the compound needs no additional hardware and is faster (and especially safer due to the distributed structure) than an implementation on a single system.
6 References

1. Chapman, D.B., Zwicky, E.D.: Building Internet Firewalls. O´Reilly & Associates Inc. (1995)
2. Droste, T.: Data Exchange via Multiple Connections. 4th International Conference on Telecommunications in Modern Satellite, Cable and Broadcasting Services, TELSIKS´99. University of Niš (1999), Proceedings of Papers, vol. 1
3. Droste, T.: Sicherheitsdienste in einem Rechnerverbund. In: Horster, P.: Kommunikationssicherheit im Zeichen des Internet. Vieweg & Sohn, Braunschweig (2001) 1-12
4. Droste, T.: Weighted Communication in a Security Compound. 5th International Conference on Telecommunications in Modern Satellite, Cable and Broadcasting Services, TELSIKS´01. University of Niš (2001), Proceedings of Papers, vol. 2
5. Droste, T., Ruhl, A.: Aufgaben bei verteilter Datenakquisition und dessen Zusammenführung für die Analyse. In: Horster, P.: Enterprise Security. IT-Verlag (2002)
6. Droste, T., Vogel, M.: Richtlinien für eine verteilte Sicherheitsinfrastruktur. In: Horster, P.: Enterprise Security. IT-Verlag (2002)
7. Hamburg, D.: Entwurf von Rotationsmechanismen zur verteilten Datenakquisition. Studienarbeit S292, Institute of Computer Science, Ruhr-University Bochum (2002)
8. Höper, K.: Analyse eines verteilten Datenstroms auf Vollständigkeit. Studienarbeit S293, Institute of Computer Science, Ruhr-University Bochum (2002)
9. Hunt, C.: TCP/IP Network Administration. O´Reilly & Associates Inc., Sebastopol (1994)
10. Naugle, M.G.: Network Protocol Handbook. McGraw-Hill series on computer communication. McGraw-Hill, New York (1994)
11. Schüppel, V.: Entwicklung einer Client-/Server-Applikation zum Informationsaustausch von Systemdaten zwischen entfernten Systemen. Diplomarbeit D341, Institute of Computer Science, Ruhr-University Bochum (1999)
12. Stevens, W.R.: TCP/IP Illustrated, Volume 1. Addison-Wesley Publ. Company, Reading (1994)
13. Taylor, E.: The McGraw-Hill Internetworking Handbook. McGraw-Hill Inc., New York (1994)
14. Vogel, M.: Entwicklung, Analyse und Definition von Richtlinien für eine verteilte Sicherheitsinfrastruktur. Diplomarbeit D363, Institute of Computer Science, Ruhr-University Bochum (2002)
Survivability Strategy for a Security Critical Process

Ferdinand J. Dafelmair

TÜV Süddeutschland, Westendstrasse 199, 80686 München
[email protected] Phone: +49-89-5791-1464 Fax: +49-89-5791-2902
Abstract. Security critical processes in the context of this paper are understood to be information technology processes that are implemented to achieve a high level of integrity, authenticity and confidentiality of data processing. It is a matter of fact that high-level security may only be achieved through adequate physical protection. The classical approach is to lock such processes away in computing centers protected by physical barriers with high drag factors – well known to cause substantial investment and operational costs. This paper describes a new, recently patented method to protect such a security critical process by different means – not to keep the attacker out, but to take the valuables out when the attacker comes in. Practical application of this strategy is detailed by giving the example of protecting a Trust Center of a Public Key Infrastructure by such means.
1 Introduction
The operation process of a Trust Center (TC) as the heart of a Public Key Infrastructure (PKI) is a typical security critical process. A high-level TC contains at least facilities for Key Generation (KG), where user keys may be generated under well-known conditions, a Certification Authority (CA), which issues digital certificates that state the relationship between key owner and key, and a system that takes care that both the user keys and the certificates are written to, e.g., smart cards as a personal security environment for the PKI user. Additional systems to be protected inside a TC might be the Online Certificate Status Verification Service (OCSP) and the Timestamp Service (TS). These systems use master keys to digitally sign certificates or to sign authoritative answers (OCSP and TS). Therefore a TC usually contains IT-systems for the various services, equipped with special cryptographic processing modules, and some kind of machinery that handles the personalization of smart cards, meaning a sub-process where blank cards receive personal data for the PKI user. The smart card may be considered a small data safe by itself, protecting the data written to it. Some major security goals of such a TC may be summarized as follows:
- SG1: the master keys¹ must remain confidential under any circumstances
- SG2: key generation of user keys must remain confidential under any circumstances²
- SG3: the master keys may not be replaced by any means
- SG4: the personalization process of smart cards may not be tampered with

¹ Those of the CA as well as those of an eventually installed OCSP and TS.
² Several ways of key generation are possible, either on smart card or externally with special cryptographic hardware. To avoid any kind of tampering, even key generation on a smart card should be done inside the secure environment.
From a PKI point of view several more security goals have to be considered but the simplified view is enough to explain the survivability strategy this paper focuses on.
Fig. 1. The main components of the Trust Center Module
2 Logical Protection Substitutes Physical Barriers
The TC processes need careful protection – both information technology protection of data communication and physical protection against tampering with the hardware the processes run on. Physical barriers are the classical means to avoid attackers tampering
with the processes run inside the TC behind the barriers. The drag factor of such barriers varies and so do the costs. If protection against attackers with a high degree of determination as well as sophisticated equipment is necessary, costs increase dramatically and substantially exceed the costs for information technology protection. To achieve the security goals mentioned above, the following alternative strategy may be used [1]: The security critical processes inside the TC are locked away behind lightweight physical barriers that comprise a compact module – the so-called Trust Center Module (TCM). The module includes all IT-systems as well as the Smart Card Personalization Equipment (CPE) including the Smart Card Personalization Machine (CPM). Data that needs to be stored inside the TC for reference or logging is signed and encrypted with auxiliary data signing and enciphering keys. The module is closed during operation and deserted. Blank smart cards are automatically fed into the module through a transport mechanism and automatically returned in personalized form. Full information technology protection is achieved by, e.g., firewall systems and network intrusion detection systems. Figure 1 shows the main components of the TC module. The drag factor of the physical barrier is kept low but sufficiently high to ensure that enough reaction time remains in case of physical attacks. Typically 1-2 minutes are enough time for reaction. The reaction itself consists of the following steps:
- Forced deletion of the master keys
- Shutdown of the CA and all other systems that require master keys
- Invalidation of smart cards currently in the CPM and deactivation of the CPM
- Forced deletion of the auxiliary data signing and enciphering keys
These 4 steps of reaction make sure that the security goals SG1-SG4 are met in case of such physical attacks. The next question is what kind of attacks are considered and how they are detected.
3 Detection – The Basis for Proper Reaction
Physical attacks need to be differentiated into the following two categories:

- Cat1: attacks that leave the physical enclosure of the process intact and only try to gain information from electromagnetic emission.
- Cat2: attacks that penetrate the physical enclosure to insert probes or tools into the interior or to achieve human access.

Cat1 attacks are rendered useless through sophisticated shielding using highly permeable metal and filters for decoupling lines. Cat2 attacks, however, need to be detected with a high level of reliability. Any deficiency in detection directly questions the achievement of the security goals. Therefore diverse basic physical effects are used to detect Cat2 attacks. Each of the physical criteria used to determine a Cat2 attack is focused on the detection of any kind of enclosure penetration, from a laser-drilled 10 mil hole up to opening the door. Figure 2 shows how the walls of the enclosure, including door, floor and top, are built to facilitate this detection.
The wall consists of a hollow space between an inner and an outer aluminum plate, which is kept at a low pressure level of around 15 mbar abs. A patented shielding and detection plate (SDP) is located in the middle of the hollow space. The SDP forms a capacitor with one electrode grounded and the other at a voltage of 1.5 kV. If the outer plate of the wall gets penetrated, the pressure level in the hollow space increases due to inflowing air. This increase is detected and is the first criterion for a Cat2 attack. To successfully penetrate the wall, the SDP must be penetrated, which affects the dielectric of the SDP. Dielectric defects of the SDP cause a discharging current that is the second criterion for a Cat2 attack. Finally, position sensors at the door register any door movement and thus provide the third criterion for a Cat2 attack. Each of the criteria is measured using 3 redundant sensors. Due to this sophisticated detection technology any kind of attempt to physically penetrate the enclosure gets registered and triggers the proper reaction of the internal Security Control System (SCS).
4 Safety Design for Dependable Operation
The Security Control System (SCS) of the TCM is responsible for detecting attacks and carrying out the proper reaction to fulfill the security goals. The design of the SCS follows proven design principles from the safety domain. The entire system consists of 3 independent programmable logic controllers (PLCs) interconnected with each other to form a 2-out-of-3 system. Such a system fulfills the single failure criterion, making sure that despite a single failure in one of the PLCs the functions are still carried out with the high accuracy of a redundant system. Thus the design achieves both reliability and fail-safe operation, because if no 2 PLCs remain that agree on the actions to be taken, the predefined safe state will be enforced by the voting safety-relay logic. The data acquisition peripherals as well as the sensors follow this three-train architecture. Once an attack is detected or the SCS is forced into the safe state, the relay contacts of the voting logic switch off the key stores for the master keys and the auxiliary data signing and enciphering keys. At the same moment the corresponding systems get a signal that initiates a shutdown of these systems, and after a configurable delay time the power of the systems gets interrupted and the CPM gets blocked. To withstand a total loss of power, the TCM has an internal emergency backup power supply that ensures enough power to achieve a controlled shutdown. In case of total power loss an uncontrolled shutdown will take place, which still does not violate any of the security goals since keys are solely kept in key storage environments based on volatile memory technology that immediately loses its information in case of power loss. Important logging information is additionally printed inside the module on non-forgeable paper such that a total loss of power does not jeopardize the required traceability of actions. Besides the detection of attacks the SCS is in charge of carrying out instrumentation and control functions. These include startup and ordinary shutdown control, temperature
and heat removal control, leakage detection of the water-driven heat removal system, and supervision functions for various other parameters. Furthermore the SCS controls the logging of security critical and operational events.
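The voting behaviour of such a 2-out-of-3 architecture can be pictured with a small functional sketch. The following Java fragment only models the decision logic and the subsequent reaction steps, not the PLC implementation; the sensor values shown are hypothetical.

```java
// Sketch: functional model of a 2-out-of-3 (2oo3) vote over three redundant
// channels, as used conceptually by the SCS (illustrative only, not PLC code).
public final class TwoOutOfThree {

    /** Returns true if at least two of the three channels report an attack. */
    static boolean vote(boolean ch1, boolean ch2, boolean ch3) {
        int trips = (ch1 ? 1 : 0) + (ch2 ? 1 : 0) + (ch3 ? 1 : 0);
        return trips >= 2;
    }

    /** Reaction that enforces the safe state once the vote trips. */
    static void enforceSafeState() {
        System.out.println("1. delete master keys (volatile key store switched off)");
        System.out.println("2. shut down CA and all other systems that require master keys");
        System.out.println("3. invalidate cards in the CPM and block the CPM");
        System.out.println("4. delete auxiliary data signing and enciphering keys");
    }

    public static void main(String[] args) {
        // Hypothetical sensor readings for one criterion (e.g. pressure increase):
        boolean sensorA = true, sensorB = true, sensorC = false;  // one sensor fails low
        if (vote(sensorA, sensorB, sensorC)) {
            enforceSafeState();   // a single deviating sensor does not mask the attack
        }
    }
}
```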
5 Closed Automation Provides against Tampering
One major source of security problems is associated with the reliability of operational staff. From an efficiency point of view it may not be feasible to guarantee continuous mutual supervision of two individual operators according to the 4-eyes principle. Therefore security critical processes and the related manufacturing equipment should be designed such that interaction between the operational staff and the equipment is limited to only those actions absolutely and inherently necessary for carrying out the operator's tasks. Following this principle, the smart card personalization machine, where the critical process of smart card personalization takes place, is located entirely inside the TCM; only a card feeder for blank card input and a card outlet are accessible to the operator. Smart card transport and contacting are closed away inside the TCM, and the smart card personalization machine is designed such that long values of Mean Time Between Failures (MTBF) are achieved. This was made possible through using air flow and gravity as the driving force for card movement, thus avoiding smart card transportation by contact and limiting moving parts to the absolute minimum. Therefore service may be reduced to once a year. This highly reliable automation in a closed environment is a prerequisite for the implementation of the whole survivability strategy of the TCM, since the strategy relies on long-term unattended operation of the internal TCM systems: any opening of the door leads to a shutdown of the entire system. Since no human interaction at all with the systems inside the TCM is necessary during operation, the implementation of access control is quite straightforward. There is no differentiation between roles, and any normal access is logged using the attack detection systems already explained. Although there is no way to tamper with the critical processes or the systems they run on, it is crucial to verify the integrity of both hardware and software for all the systems inside the TCM before closing the door and starting operation. For practical reasons it is advisable to restrict access to the TCM through some simple methods like a conventional door lock or a closed room to provide against disruption of availability. Furthermore it is good practice to feed integrity status warning messages of the TCM into the facility management system and to make sure that proper actions are taken.
6 Verifiable Logging – Proof of Correct Operation
Following a survivability strategy of sensing and active reaction instead of passive protection it is crucial to furnish proofs of correct operation. This is done by different
forms of logging. The IT-systems of the PKI³, which are the actual payload of the TCM environment, do their own logging. The details of this logging are not considered here, but it is necessary to mention that these logs are serialized, signed and eventually encrypted to ensure a proper level of confidentiality and traceable integrity, authenticity and non-repudiation. These tamper-protected logs are stored inside the TCM and they are replicated through communication lines to external locations. The logs may be verified at any time by checking their signatures. Logs that document correct operation of the SCS cannot be implemented this way because PLCs cannot carry out cryptographic operations. To avoid additional auxiliary logging systems that introduce new complexity and would need to be at least redundant, the following method may be used to furnish proof of correct operation. As mentioned before, the TCM is vulnerable only in the shutdown state with the door open. As with any other security approach, there finally need to be human beings who take over responsibility for correct operation. In the case of the TCM they are called Security Officers (SO) and there are at least three of them. They act under mutual supervision and they are forced by design to act by mutual consent. They start the TCM operation by verifying the integrity of the installed systems using several auxiliary technologies. During the first system startup of the TCM they also generate the master keys and the auxiliary signing and encryption keys inside the TCM on non-forgeable hardware. Now the system is ready for use. To gain verifiable proof of the TCM system integrity over the entire period of operation, the security officers load serialized paper into redundant printers connected to the PLCs. The serial numbers of the papers are recorded, signed and stored in a safe location. They then trigger the PLCs to carry out an integrity self-check and record the result together with a timestamp as the first item on the serialized paper. Then they close the door, which automatically starts the penetration detection. This and all following major incidents are logged by the printers. The next time the TCM is regularly opened, the printers indicate the history of operation. Serialization of the printer paper provides against printout counterfeit, and redundant printouts level out single output failures.

³ E.g. the Certification Authority, the Card Personalization Equipment, the Timestamp System and the OCSP System.
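The kind of tamper-protected logging described above for the PKI systems can be pictured with a minimal sketch based on the standard java.security API: a log entry is serialized and signed, and the signature can be checked at any later time. This is only an illustration under assumed key handling, not the TCM's actual logging format, key management or encryption scheme.

```java
import java.security.*;
import java.util.Base64;

// Sketch: serializing and signing a log record so that its integrity and
// authenticity can be verified later (illustrative only).
public final class SignedLog {

    public static void main(String[] args) throws GeneralSecurityException {
        KeyPair keys = KeyPairGenerator.getInstance("RSA").generateKeyPair();

        // A serialized log entry: sequence number, timestamp and event text.
        String entry = "000042;" + System.currentTimeMillis() + ";door closed, detection armed";

        // Sign the entry with a (auxiliary) signing key kept inside the module.
        Signature signer = Signature.getInstance("SHA256withRSA");
        signer.initSign(keys.getPrivate());
        signer.update(entry.getBytes());
        String signature = Base64.getEncoder().encodeToString(signer.sign());

        // Verification can be done at any time, also outside the module.
        Signature verifier = Signature.getInstance("SHA256withRSA");
        verifier.initVerify(keys.getPublic());
        verifier.update(entry.getBytes());
        System.out.println(entry + " | signature valid: "
                + verifier.verify(Base64.getDecoder().decode(signature)));
    }
}
```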
7 Comprehensive MMI Supports Operational Security
As security is not just a matter of dealing with deliberate attacks or technical failures but also with human errors, the operator interface needs attention as well. As already stated above, human interaction with the security critical process is exactly tailored to the operational task. The complex processing inside the TCM needs a human interface that hides complexity and provides robust and self-explaining controls. To solve this problem, experience from industrial automation and process control determined the design. It features an industrial touch panel for controlling the smart card personalization machine with animated state visualization and a browser-based production control system.
The operator has the sole task of personalizing blank smart cards for a lot of already registered users. Therefore a tailored production control system provides him with a list of lots to be processed and some additional information necessary for packaging and delivering the production. The operator can neither influence any data written to the smart cards nor personalize smart cards that are not already registered. The interface supports atomic transaction processing, avoiding unknown intermediate processing states of user entries or lots to be processed. The operator receives visual guidance, including animated process visualization that lets him monitor the processing inside the smart card personalization machine even from the browser interface.
8 Three Steps to Survive the Attack
Up to now the explanation of the approach focussed on the detection of attacks, reaching a safe shutdown state, proving correct operation and avoiding human errors. The question still has to be answered how survivability is achieved and how the system regains its operational state once it has been brought to a shutdown. To really survive the attack, three principles had to be implemented:

Trustworthy Key Backup: The master keys are the main target of an attack, and in order to avoid that an attacker could get access to them, they are destroyed immediately after an attempt of attack has been detected. The security officers, however, who act under mutual supervision and by mutual consent and who initially created the master keys and the auxiliary signing and encryption keys using special key management functionality based on cryptographic hardware, control the security of the entire system and therefore have the power to recreate deleted keys. To accomplish this, the security officers have the possibility to create a backup of the keys. The design of the built-in key management allows them to split the keys and export them to the personal smart cards of the security officers such that a single security officer only has control over a fraction of the key. Multiple exports are possible to ensure sufficient backup availability. Backups are secured by means of conventional safe storage.

Trustworthy Key Restoration: The key management functionality also provides the reverse process of restoring the keys, such that the security officers may recombine the original keys out of their fractions, but only if they act by mutual consent. After key restoration, the system is in the same fully operational state as originally after the initial key generation during the first startup.

Trustworthy Cryptography: The operation of TC processes, as well as of many other security critical processes, relies on data that must survive attacks and which can vary in size up to quite substantial amounts. To secure such data, trustworthy cryptography is applied. The security officers create auxiliary data signing and encryption keys that are treated like master
keys. These keys are used to sign and optionally encrypt such data whenever they are created during normal operation. After this the data get replicated to one or more storage locations outside the TCM, such that dislocation ensures adequate availability of the data. Further protection of this exported data is not required since the data is only accessible through the auxiliary data signing and encryption keys, which are kept under non-disclosure inside the TCM. In case of attacks that lead to destruction of part of the TCM or the data within it, it is still possible to recover the data on a second TCM system. The example shows that this new survivability strategy also has to rely on backup strategies in case of disastrous events, just as classical approaches do. The major advantage, however, is the cost of the backup, which is again only a fraction of the cost of a classical solution.
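The splitting of keys into fractions for the security officers and their later recombination can be illustrated with a simple sketch. The concrete, patented key management scheme is not described in the text, so the following Java fragment uses a plain XOR split (all fractions are required to restore the key) purely as a stand-in.

```java
import java.security.SecureRandom;
import java.util.Arrays;

// Sketch: splitting a key into fractions and restoring it, as a stand-in for
// the key backup/restoration described above. A plain XOR split is used; the
// actual TCM key management scheme is not disclosed in the text.
public final class KeySplit {

    /** Split the key into n fractions; the key is the XOR of all fractions. */
    static byte[][] split(byte[] key, int n) {
        SecureRandom rnd = new SecureRandom();
        byte[][] shares = new byte[n][key.length];
        byte[] last = key.clone();
        for (int i = 0; i < n - 1; i++) {
            rnd.nextBytes(shares[i]);                       // random fraction
            for (int j = 0; j < key.length; j++) last[j] ^= shares[i][j];
        }
        shares[n - 1] = last;                               // last fraction completes the XOR
        return shares;
    }

    /** Restore the key from all fractions (mutual consent of all officers). */
    static byte[] restore(byte[][] shares) {
        byte[] key = new byte[shares[0].length];
        for (byte[] s : shares)
            for (int j = 0; j < key.length; j++) key[j] ^= s[j];
        return key;
    }

    public static void main(String[] args) {
        byte[] masterKey = new byte[32];
        new SecureRandom().nextBytes(masterKey);
        byte[][] officerCards = split(masterKey, 3);        // three security officers
        System.out.println(Arrays.equals(masterKey, restore(officerCards)));  // true
    }
}
```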
9 Commercial Aspects
Compared to classical security environments for TCs based on physical barriers with high drag factors, the new approach based on the TCM leads towards substantial cost reductions as follows:

• Space requirements for a TCM are much lower compared to classical solutions based on high security computing centers
• Construction costs for a TCM are only a fraction of those of a high security computing center
• Operation costs are extremely low due to automation, with only limited requirements for auxiliary operational staff
• Full backup solutions require much less investment than with classical solutions
• Due to deserted operation no trusted staff with security clearance is required for operation
The modular design of the TCM solution furthermore allows extended operation periods that are similar to those of process industry installations and cover at least 10 years, because all IT-systems with their much shorter innovation cycles can easily be substituted.
10 Conclusion
This paper gave an example of how the consequent application of cryptographic technologies has the potential to change traditional paradigms of how to achieve survivability for security critical processes. The principle is certainly not applicable to every security critical process, but it shows how costs can be brought down for such important and fundamental security technologies as a public key infrastructure. Companies that shrank from investing huge amounts into conventional trust center solutions for their own PKI now have a chance to get an adequate solution at a fraction of the costs. They might think even more about how this survivability
approach may be extended to other fields of company-wide information processing like digital archiving – always keeping in mind the basic principle: “If you can’t keep the enemy out, let him in and laugh at him.”
References

1. Deutsches Patent und Markenamt, Patentschrift DE 199 15 668 A1, H04L9/10, Trust-Center-Einheit
2. Deutsches Patent und Markenamt, Patentschrift DE 199 21 170 C2, H04L9/10, Sicherheitselement
Statistical Comparison of Two Sum-of-Disjoint-Product Algorithms for Reliability and Safety Evaluation

Klaus Heidtmann

Department of Computer Science, Hamburg University, Vogt-Kölln-Str. 30, D-22527 Hamburg, Germany
[email protected]
Abstract. The evaluation of system reliability and safety is important for the design of new systems and the improvement or further development of existing systems. Especially the probability that a system operates (safely), computed from the probabilities that its components operate, is a vital system characteristic, and its computation is a nontrivial task. The most often used method to solve this problem is to derive disjoint events from the description of the system structure and to sum up the probabilities of these disjoint events to quantify system reliability or safety. To compute disjoint products as the logical representation of disjoint events, Abraham's algorithm inverts single variables indicating the state of a component and therefore produces a huge number of disjoint products. To avoid this disadvantage Heidtmann developed a new method which inverts multiple variables at once and results in a much smaller number of disjoint products, as confirmed by some examples. This paper quantifies this advantage by statistical methods and statistical characteristics for both algorithms, presenting measurements of the number of produced disjoint products and the computation time of both algorithms for a large sample of randomly generated systems. These empirical values are used to investigate the efficiency of both algorithms by statistical means, showing that the difference between both algorithms grows exponentially with system size and that Heidtmann's method is significantly superior. The results were obtained using our Java tool for system reliability and safety computation, which is available on the WWW.
1 Introduction
Fault trees and reliability networks (also known as two-, K- or multi-terminal networks) illustrate the causal relationship between system and component failures. Both kinds of graphs, representing the so-called system structure, are frequently used for reliability and safety modeling of complex technical systems. The advantage that they possess over Markov chains and stochastic Petri nets is a concise representation
and efficient solution algorithms. The class of algorithms known as sum-of-disjoint-products (SDP) is the one most often used for such reliability resp. safety calculations and is therefore an important technique for reliability and safety analysis. It is applied to compute various exact or approximate measures related to the reliability and safety of technical systems like availability, dependability, reliability, performability, safety, and risk [Hei97]. SDP algorithms also play an important role in the calculation of degrees of support in the context of probabilistic assumption-based reasoning and the Dempster-Shafer theory of evidence [BeK95, Koh95, KoM95]. SDP methods start by generating the minpaths for reliability graphs or the mincuts for fault trees. Given the reliabilities of the components and the minimal sets of components (called minpaths) that allow a system to operate (safely), an SDP algorithm computes the reliability resp. safety, i.e. the probability that all components in at least one of these sets are operational. The system reliability or safety is the probability of the union of the minimal sets. This is a union-of-products problem. The idea of SDP is to convert the sum of products into a sum of disjoint products so that the probability of the sum of products can be expressed as the sum of the probabilities of the disjoint products. If component failures are statistically independent, then the probability of each disjoint product is the product of the corresponding component reliabilities. If this assumption of statistical independence is not valid in practice, the result computed under this assumption may serve as an approximate bound. In general, when using the SDP approach, the evaluation is performed in three phases: 1) enumeration of all minimal sets (minpaths or mincuts), 2) computation of the disjoint products by an SDP algorithm, and 3) calculation of the system reliability measure by assigning component indices to the reliability formula as a (weighted) sum of the disjoint products. Results reported in the literature show that usually the time of the second phase dominates the total running time needed for the evaluation. It is thus important to reduce the work of the SDP algorithm by obtaining fewer disjoint products, resulting in a smaller probabilistic reliability formula, because this reduces the rounding error, the storage space, and the computation time of the reliability calculation. That fewer disjoint products result in a smaller probabilistic reliability formula is extremely important in system modeling, when this formula (as a sum of products) is evaluated repeatedly for various combinations of component indices. So the quality of an SDP algorithm can be measured by the number of disjoint products it generates. As one of the first, Abraham introduced an algorithm for the generation of disjoint products using single variable inversion, which therefore produces a huge number of disjoint products [Abr79]. Further work on this technique of single variable inversion yielded only marginal improvements. To avoid this disadvantage of producing an immense number of disjoint products, Heidtmann presented a fundamentally new approach (sometimes called KDH88) applying inversion to multiple or grouped variables, which he called subproducts [Hei89]. Later a similar but less powerful [LuT98a,b] version was published in [VeT91], where the method of multiple-variable inversion was verified.
Heidtmann [Hei95] extended his technique to non-coherent (non-monotone) systems, and this algorithm was verified in [BeM96]. The extended sum of disjoint products algorithm is applicable to all coherent and non-coherent system structures. It applies not only to those reliability and safety structures which can be specified in terms of Boolean Logic, but also to dynamic
reliability and safety problems which can be supported by logical extensions like Temporal Logic [Hei91, Hei92, Hei97]. Unfortunately its computational cost grows exponentially with the size of the considered system. Only for specific classes of system structures like k-out-of-n systems [BaH84, Hei86], k-to-l-out-of-n structures [Hei81, Rus87, UpP93], and series-parallel systems are linear time algorithms known [Mis93, Hei95, Hei97]. A lot of comparative computations published in the literature [Hei89, VeT91, Hei95, BeM96, Hei97, LuT98a-c, CDR99] confirm that for both methods the number of produced disjoint products and the computation time grow exponentially with the size of the analyzed system. But these examples also indicate that the newer technique of Heidtmann using multiple variable inversion is superior to approaches considering only single variables like Abraham's algorithm. This was mathematically proved by [And92] in the sense that Abraham's technique is in no case superior to Heidtmann's algorithm. In [BeM96] it is said: "Compared to the algorithm of Abraham, Heidtmann's method is considerably more efficient and will generate a much smaller number of disjoint products in most cases. A simple look at the results shows the strong superiority of Heidtmann's algorithm over the algorithm of Abraham, because it produces much less disjoint products in much less time: [in] 29 cases the improvement (reduction) in the number of disjoint products generated is more than 50% and [in] 17 cases it is more than 80%. The improvement becomes larger as the number of system components (edges) increases." These results were based on a very small sample of 35 systems only.

Consequently, we want to investigate a much larger sample of systems to answer the following quantitative questions with much more confidence: How much better is the newer approach of Heidtmann? Is it significantly superior? Are the trends observed on small samples also valid for a much larger sample? In this paper we derive statements of statistical validity to answer these questions. This means that we investigate a large and statistically valid sample of systems to obtain results such as by how much Heidtmann's algorithm is better on average, or whether it is significantly better than Abraham's method. Reliability and safety computation for large systems imposes high storage and computation time requirements on computers. To quantify and characterize these requirements we study the best known sum of disjoint products techniques for a huge number of systems. As results we present detailed measurements and stochastic characteristics of the output of both algorithms. In general the presented results give some insight into the stochastic characteristics of both algorithms and can be applied to design new algorithms and to manage the computation process skillfully.

First, Section 2 presents an example to illustrate both methods and especially their difference. Then details of their implementation and their use from our homepage are discussed. Section 3 presents the experimental environment and describes the performed computational experiments, followed by the first results on the number of disjoint products as the characteristic attribute of SDP algorithms. This is confirmed by the observed computation times. In Section 4 regression analysis is applied to the measured values to characterize the increase of the number of disjoint products and the computation time by exponential expressions.
We apply these expressions for instance to estimate the number of disjoint products and the computation time for larger reliability graphs.
2 Illustration of Both Algorithms and Their Implementation
In order to explain the purpose, the different approaches and the different results of the two SDP algorithms compared in this paper, let us start with a short example. Assume that a system with 5 components operates if components 1, 2 and 3 operate. It may also operate if components 4 and 5 operate. So this system has the minsets (minpaths) {1,2,3} and {4,5}. For this example system Abraham's method produces the following disjoint events when it makes the last minset {4,5} disjoint to the former one {1,2,3}:

A1. Component 1 is not operational, components 4 and 5 are operational.
A2. Component 2 is not operational, components 1, 4, and 5 are operational.
A3. Component 3 is not operational, components 1, 2, 4, and 5 are operational.

In all three cases the first condition represents the single variable inversion (negation) technique. In a formal Boolean Logic notation, where the AND-operator (conjunction) is written as a product, these disjoint events are represented by the following disjoint products:

¬x1x4x5, x1¬x2x4x5, x1x2¬x3x4x5

This results in the reliability formula of Equation 1 for the system reliability R depending on the component reliabilities r1, r2, r3, r4, r5 in the case of statistically independent component failures. For this example system the probabilistic reliability formula implied by Abraham's method consists of 4 summands (see Equation 1):

R = r1r2r3 + (1-r1)r4r5 + r1(1-r2)r4r5 + r1r2(1-r3)r4r5    (1)
Heidtmann's method results in the following single event, which is disjoint to the minpath {1,2,3}:

H1. It is not true that components 1 and 2 and 3 are all operational, while components 4 and 5 are operational.

Here we invert or negate multiple variables (x1x2x3), i.e. a group of component states (components 1 and 2 and 3 are operational). In a formal notation of Boolean Logic this reads ¬(x1x2x3)x4x5 (or equivalently (¬x1∨¬x2∨¬x3)x4x5 after applying the Shannon inversion theorem with the OR-operator ∨). This yields the following reliability formula with only two summands (see Equation 2):

R = r1r2r3 + (1-r1r2r3)r4r5    (2)
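Both formulas necessarily yield the same numerical reliability; a short Java check with arbitrarily assumed component reliabilities illustrates this and also shows how much shorter Heidtmann's formula is to evaluate.

```java
// Sketch: numerical check that Equations (1) and (2) give the same system
// reliability; the component reliabilities r1..r5 are arbitrary example values.
public final class SdpExample {

    public static void main(String[] args) {
        double r1 = 0.9, r2 = 0.8, r3 = 0.95, r4 = 0.85, r5 = 0.7;

        // Equation (1): four disjoint products from Abraham's single variable inversion.
        double abraham = r1 * r2 * r3
                + (1 - r1) * r4 * r5
                + r1 * (1 - r2) * r4 * r5
                + r1 * r2 * (1 - r3) * r4 * r5;

        // Equation (2): two disjoint products from Heidtmann's multiple variable inversion.
        double heidtmann = r1 * r2 * r3 + (1 - r1 * r2 * r3) * r4 * r5;

        System.out.println(abraham);    // both print the same reliability,
        System.out.println(heidtmann);  // 0.87202 for the values above
    }
}
```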
So far we have explained both algorithms by example. Detailed textual and formal descriptions can be found in [Hei89, VeT91, Hei95, Hei97]. The last cited reference also includes programs in Pascal. Most implementations of the algorithms use the programming language C [VeT91, SoR91, LuT98a/b, TKK00, Vah98] and some of them are included in tools for reliability analysis, e.g. MOSEL [ABS99, Her00], SHARPE [TrM93, STP95, PTV97], CAREL [SoR91], and SyRelAn [Vah95, Vah98]. Another implementation uses Common Lisp [BeM96]. The experiments discussed in the following are based on our own implementation in Java. Our implementation, the experiments and the following results refer to the class of systems called reliability graphs. All nodes of these graphs are assumed to be perfectly reliable so that they cannot fail and need not be considered in the reliability
analysis, while the edges may fail, so they represent the components of the system which are susceptible to failure. A reliability graph is said to be operational if and only if all terminal nodes, a marked subset of all nodes, are connected via operational edges. So the reliability of a reliability graph is the probability that all of its terminal nodes are connected by paths of operational edges. An algorithm for randomly generating reliability graphs with a given number of edges as well as Abraham's and Heidtmann's algorithms were implemented in Java 2 and executed by the Java 2 Runtime Environment. These programs were used for the experiments which are discussed in the following sections. Moreover, we integrated these programs as Java applets into a Web tool named ReNeT (Reliability of Network Topologies) which can be run from our homepage http://www.informatik.uni-hamburg.de/TKRN/world/tools on your own computer. There is a graphical user interface where you may enter your reliability graph by drawing nodes (black circles for terminal nodes (left mouse button) and white circles for others (right mouse button)) and connecting them with edges. After drawing you can enter the probabilities of component (edge) failures. The program then computes and shows both reliability formulas generated by Abraham's resp. Heidtmann's algorithm as different sums of disjoint products. Finally it shows the reliability of the system or reliability graph you entered. The following section presents our comprehensive experimental measurements, which were produced on a notebook with a Pentium II 366 processor, 64 MB RAM, 128 MB swap, and SuSE Linux 7.1, kernel 2.4.0.
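To make the semantics of a reliability graph concrete, the following Java sketch computes the reliability of a small example graph by brute-force state enumeration. It is a reference computation under assumed edge reliabilities, not part of the ReNeT tool, and is feasible only for very small graphs; the SDP algorithms discussed in this paper avoid this full enumeration.

```java
import java.util.*;

// Sketch: brute-force reference computation of the reliability of a small
// reliability graph by enumerating all 2^m edge states and checking whether
// all terminal nodes are connected (illustration only, exponential effort).
public final class BruteForceReliability {

    // An undirected edge between two perfectly reliable nodes.
    static final class Edge {
        final int u, v; final double reliability;
        Edge(int u, int v, double r) { this.u = u; this.v = v; this.reliability = r; }
    }

    static double reliability(int nodeCount, List<Edge> edges, Set<Integer> terminals) {
        int m = edges.size();
        double total = 0.0;
        for (int state = 0; state < (1 << m); state++) {
            double p = 1.0;
            int[] parent = new int[nodeCount];
            for (int i = 0; i < nodeCount; i++) parent[i] = i;
            for (int i = 0; i < m; i++) {
                Edge e = edges.get(i);
                boolean up = (state & (1 << i)) != 0;        // is edge i operational in this state?
                p *= up ? e.reliability : 1 - e.reliability;
                if (up) union(parent, e.u, e.v);
            }
            if (allConnected(parent, terminals)) total += p; // this state contributes its probability
        }
        return total;
    }

    static boolean allConnected(int[] parent, Set<Integer> terminals) {
        int root = -1;
        for (int t : terminals) {
            if (root == -1) root = find(parent, t);
            else if (find(parent, t) != root) return false;
        }
        return true;
    }

    static int find(int[] p, int x) { while (p[x] != x) x = p[x] = p[p[x]]; return x; }
    static void union(int[] p, int a, int b) { p[find(p, a)] = find(p, b); }

    public static void main(String[] args) {
        // Hypothetical 4-node graph: terminal nodes 0 and 3, two node-disjoint
        // paths 0-1-3 (edge reliability 0.9 each) and 0-2-3 (0.8 each).
        List<Edge> edges = Arrays.asList(
                new Edge(0, 1, 0.9), new Edge(1, 3, 0.9),
                new Edge(0, 2, 0.8), new Edge(2, 3, 0.8));
        Set<Integer> terminals = new HashSet<>(Arrays.asList(0, 3));
        System.out.println(reliability(4, edges, terminals)); // 0.9316 = 1 - (1-0.81)(1-0.64)
    }
}
```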
3 Results of Measurement
In this section we give first computational results to compare the performance of Heidtmann's algorithm with the algorithm of Abraham. Altogether 3200 randomly generated reliability graphs with 5 to 16 unreliable edges were investigated, representing 3200 systems with 5 to 16 unreliable components. In more detail, for each number of edges from 5 to 12 we randomly produced and analyzed 400 reliability graphs, and for 13 to 16 edges we generated and investigated 100 graphs for each number of edges. First we show the detailed results for reliability graphs with 16 edges. In this case we randomly generated 100 graphs, computed their disjoint products by each of the two algorithms and counted the number of the resulting disjoint products. In accordance with the theoretical result that Abraham's algorithm produces at least as many disjoint products as Heidtmann's method, the number of actually computed disjoint products was smaller for Heidtmann's algorithm for each of the investigated graphs. For Fig. 1 we arranged the resulting values by increasing number of disjoint products with regard to Heidtmann's algorithm. It can be seen that in some cases the Abraham method produces more than twice as many disjoint products as Heidtmann's algorithm. In the worst case Abraham's algorithm produces about 8000 disjoint products whereas only about 2000 disjoint products result from Heidtmann's method. So we notice small to large differences in the number of disjoint products for both algorithms. The same applies to the computation time, which can be seen from Fig. 2, noticing the logarithmic scale. In single cases, when the number of produced disjoint products is nearly identical for both methods,
Abraham's algorithm may become a little faster than Heidtmann's method, because the Abraham procedure to derive disjoint products is a little bit simpler. All in all, the values of both attributes evaluated for the large sample show the great advantage of Heidtmann's method over Abraham's.
Fig. 1. Number of disjoint products (left) and computation time (right) for 100 sample graphs with 16 edges produced by Abraham's and Heidtmann's algorithms
Now we present the mean value of the number of disjoint products and the mean computation time in milliseconds for every sample of 400 resp. 100 graphs with an identical number of edges. These mean values are given in Table 1 and illustrated in Fig. 2. They seem to grow exponentially; this will be investigated in the following section.

Table 1. Empirical mean of the number of disjoint products and the computation time in ms for 400 resp. 100 sample graphs resulting from Abraham's and Heidtmann's algorithms
Fig. 2. Comparison of the mean number of disjoint products (left) and the mean computation time (right) for both algorithms (notice the logarithmic scale!)
To show the influence of different operating systems on the computation time, Table 2 includes the empirical mean values of the computation time in milliseconds for 100 sample graphs with 10 edges and both algorithms.

Table 2. Empirical mean of the computation time (ms) on computers with different operating systems for 100 sample graphs with 10 edges

Operating System    Abraham    Heidtmann
Linux               173.7      65.3
Windows NT          114.7      94.0
Solaris             203.8      124.2

4 Statistical Comparison
To see the increase of the mean value of the number of disjoint products and the mean computation time with an increasing number of edges, we illustrate the values of Table 1 in Fig. 3. Here it is obvious that both values for Heidtmann's algorithm follow exponential functions. The corresponding figure for Abraham's method, whose values increase much more rapidly, is not presented here. So the measured results lead to the hypothesis that neither the number of disjoint products produced nor the computation time follows a normal distribution. The corresponding statistical tests (Kolmogoroff-Smirnov) confirmed this hypothesis. The regression analysis of SPSS (a statistical standard software package) was used to compute an approximation of the mean value of the number of disjoint
products depending on the number of edges n. It yields the following exponential functions: 0.0106 exp(0.5594 n) for Heidtmann's algorithm, as seen in Fig. 3 (left), and 0.0869 exp(0.6276 n) for Abraham's technique. A similar approximation of the computation time yields 0.0015 exp(1.0839 n) for Heidtmann's algorithm, as seen in Fig. 3 (right), and 0.0011 exp(1.1212 n) for Abraham's method. Using these expressions we can estimate the number of disjoint products and the computation time for reliability graphs with more than 16 edges by regression analysis, without explicit generation of the graphs and their further investigation. For systems with 17 to 20 components and both algorithms the estimated numbers of disjoint products and the estimated computation times are given in Table 3. Here the immensely increasing superiority of Heidtmann's technique over the method of Abraham is obvious.
Fig. 3. Mean numbers of disjoint products (left) and computation time (right) for sample graphs with 5 to 16 edges produced by Heidtmann's algorithm
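For completeness, fitting an exponential model a·exp(b·n) can be reproduced by ordinary least squares on the log-transformed means, which is essentially what the SPSS regression does here. The following Java sketch only illustrates that procedure; the input values are synthetic test data generated in the code, not the measured means from Table 1, and the coefficients of the paper are not reproduced.

```java
// Sketch: fitting the exponential model y = a * exp(b * n) by ordinary least
// squares on ln(y) (log-linear regression), as used above for the mean values.
public final class ExponentialFit {

    /** Returns {a, b} for y ≈ a * exp(b * n). */
    static double[] fit(double[] n, double[] y) {
        int m = n.length;
        double sx = 0, sy = 0, sxx = 0, sxy = 0;
        for (int i = 0; i < m; i++) {
            double ly = Math.log(y[i]);          // log-transform the mean values
            sx += n[i]; sy += ly; sxx += n[i] * n[i]; sxy += n[i] * ly;
        }
        double b = (m * sxy - sx * sy) / (m * sxx - sx * sx);
        double a = Math.exp((sy - b * sx) / m);
        return new double[] { a, b };
    }

    public static void main(String[] args) {
        double[] edges = { 5, 6, 7, 8, 9, 10 };
        double[] means = new double[edges.length];
        for (int i = 0; i < edges.length; i++)
            means[i] = 0.5 * Math.exp(0.6 * edges[i]);   // synthetic test data, not measurements
        double[] ab = fit(edges, means);                  // recovers a ≈ 0.5, b ≈ 0.6
        System.out.printf("a = %.4f, b = %.4f%n", ab[0], ab[1]);
        // Extrapolation, e.g. for n = 17 edges:
        System.out.println(ab[0] * Math.exp(ab[1] * 17));
    }
}
```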
As the number of disjoint products and the computation time do not follow normal distributions, as tested with the Kolmogoroff-Smirnov test, we applied the Wilcoxon test for related samples. It is clear that for a specific reliability graph the number of disjoint products produced by Abraham's algorithm and the corresponding number of Heidtmann's algorithm are related. The same applies to the computation times of both algorithms applied to the same reliability graph. So the numbers of disjoint products and the computation times can be arranged in pairs, where each pair belongs to the same reliability graph. We applied the Wilcoxon test to the samples of 400 randomly generated reliability graphs each with 5 to 12 edges and to the samples of 100 randomly generated graphs each with 13 to 16 edges. The results of the Wilcoxon test show that for all these 11 samples the values of the number of disjoint products and the computation time differ with high significance. The same result is obtained when we apply the test to the whole sample of 3200 reliability graphs. In all cases the level of significance is zero, which means high significance. So the
algorithm of Heidtmann is highly significantly superior to Abraham's method as far as the number of produced disjoint products and the computation time are concerned.

Table 3. Estimates for the number of disjoint products and for the computation time of reliability graphs with more than 16 edges for both algorithms
Number of Components (Edges)   Estimated Number of Disjoint Products   Estimated Computation Time (ms)
                               Abraham      Heidtmann                  Abraham        Heidtmann
17                             3,740        1,444                      1,006.48       152.73
18                             7,006        2,527                      3,384.97       451.53
19                             13,124       4,422                      11,384.28      1,334.90
20                             24,585       7,738                      38,287.39      3,946.49
5 Conclusion and Outlook
This paper presented comprehensive measurements and a detailed stochastic characterization of the output of two frequently used algorithms for reliability and safety computation, which is an important step of reliability and safety analysis. First the two sum of disjoint products algorithms of Abraham and Heidtmann were explained by an example, which also illustrated the typical difference in the number of disjoint products. This attribute together with the computation time of these two algorithms was measured for 3200 reliability graphs. The results of this measurement were discussed beginning with the illustration of the measured values for 100 randomly generated reliability graphs with 16 edges, which correspond to systems with 16 unreliable components. The number of disjoint products and the computation time grow exponentially with the number of components for both algorithms, as does the difference between their numbers of disjoint products and their computation times. Heidtmann's algorithm was significantly superior with respect to both attributes. Very similar output was observed for the 3100 other reliability graphs with 5 to 15 edges. Obviously the single variable inversion technique of Abraham generally produces very large numbers of disjoint products, while the multiple variable inversion of Heidtmann's method achieves appreciably fewer disjoint products. This is very important as it results in much shorter computation time for the set of disjoint products, additionally reduces the storage space and produces a much smaller reliability formula. Therefore the computation time for the evaluation of the reliability formula is reduced as well as the rounding error. No observed values of the performed measurements were normally (Gaussian) distributed. Based on the empirical results the number of disjoint products and the computation time for both algorithms and their differences were estimated for larger systems by regression analysis. All in all the values of both attributes, i.e. the number of disjoint products and the computation time, evaluated for a large sample show the great advantage of Heidtmann's algorithm over Abraham's. Because of the exponential growth, reliability and safety measures are often approximated for large systems. If only a given subset of a system's minsets is
considered for approximate computation, Heidtmann's algorithm computes the approximate value faster than Abraham's method. If special resources like computation time or storage are limited, then Heidtmann's technique uses these restricted resources much more efficiently than Abraham's algorithm, producing closer approximations than Abraham's method with identical resources. The quantification of these advantages as well as the use of both algorithms for approximation purposes in general needs further investigation and will be the subject of further studies. The derived stochastic characteristics can be applied to the development of more efficient algorithms for reliability computation, so that they produce fewer disjoint products and a smaller reliability formula. In the context of approximation our characterization can be used straightforwardly to derive some comparative aspects. Furthermore, these results can be applied to estimate characteristic attributes for larger systems, for instance by regression analysis, or our Java code can be used to analyze your own systems. In this case there are two ways for your own reliability analyses. On the one hand our tool is extremely useful as a Java applet, e.g. when applying it via the WWW directly to your reliability problem, executing the reliability computation on your own computer. On the other hand our Java programs can serve as a basis for local installations of both algorithms. In both cases our programs can be used for your own reliability and safety analysis. Moreover, the behavior of the algorithms can be investigated for various systems by means of measurement, or the influence of different system structures on the efficiency of the algorithms can be observed. It may also be possible to exploit new areas of application for this efficient SDP algorithm [YuT98]. Besides coherent and non-coherent system structures as well as the solution of dynamic reliability and safety problems using Temporal Logic [Hei91, Hei92, Hei97], it is important to know and use the algorithm that minimizes the reliability formula (minimal sum of disjoint products), since such formulas also play a central role in probabilistic assumption-based truth maintenance systems [LaL89] and probabilistic assumption-based reasoning [Koh95, KoM95, BeK95]. In many practical applications it is possible to reduce large systems by well-known reduction methods [Mis83, Hei97] so that the reduced system can further be analysed by SDP methods. The often used example of a two-terminal ARPA network with 24 edges (system components) can be reduced to 8 edges by series- and polygon-to-chain reduction [Hei97].
References [Abr79]
Abraham J.A., An improved algorithm for network reliability, IEEE Trans. Reliability Vol. 28, No. 1, 1979 58-61, also in: [RaA90], 89-92 [ABS99] Almasi B., Bolch G., Sztrik J., Modeling Terminal Systems using MOSEL, Proc. Europ. Simulation Symp. ESS'99, Erlangen, Germany, 1999 [And92] Anders J.M., Methods for the reliability analysis of complex binary systems, PhD thesis, Dept. Mathematics, Humboldt-University, Berlin, 1992 [BaH 84] Barlow R.E., Heidtmann K.D., Computing k-out-of-n system reliability, IEEE Trans. Reliability, 33, 3, 1984
[BeK95] Besnard P., Kohlas J., Evidence Theory based on general consequence relations, Intern. J. Foundations of Computer Science 6, 1995, 119-135
[BeM96] Bertschy R., Monney P.A., A generalization of the algorithm of Heidtmann to non-monotone formulas, J. of Computational and Applied Mathematics, Dec. 1996, Vol. 76, No. 1-2, 55-76
[CDR99] Chatelet E., Dutuit Y., Rauzy A., Bouhoufani T., An optimized procedure to generate sums of disjoint products, Reliability Engineering and System Safety, Sept. 1999, Vol. 65, No. 3, 289-294
[Hei81] Heidtmann K.D., A class of noncoherent systems and their reliability analysis, Dig. 11th Ann. Intern. Symp. Fault-Tolerant Computing, FTCS 11, Portland/USA, 1981
[Hei86] Heidtmann K.D., Minset splitting for improved reliability computation, IEEE Trans. Reliability 35, 5, 1986
[Hei89] Heidtmann K.D., Smaller sums of disjoint products by subproduct inversion, IEEE Trans. Reliability 38, 3, 1989, 305-311
[Hei91] Heidtmann K.D., Temporal Logic applied to reliability modeling of fault-tolerant systems, Proc. 2nd Intern. Symp. Formal Techniques in Real-Time and Fault-Tolerant Systems, Nijmegen/Netherlands, 1992; in: Vytopil J. (ed.), Lecture Notes in Computer Science, No. 571, Springer, Berlin, 1991
[Hei92] Heidtmann K.D., Deterministic reliability modeling of dynamic redundancy, IEEE Trans. Reliability 41, 3, 1992, 378-385
[Hei95] Heidtmann K.D., Methoden zur Zuverlässigkeitsanalyse unter besonderer Berücksichtigung von Rechnernetzen (Methods for reliability analysis with special emphasis on computer communication networks), Habilitationsschrift, Dept. Comp. Science, Hamburg University, 1995 (in German)
[Hei97] Heidtmann K.D., Zuverlässigkeitsbewertung technischer Systeme (Reliability analysis of technical systems), Teubner, Stuttgart, 1997 (in German)
[Her00] Herold H., MOSEL, An Universal Language for Modeling Computer, Communication, and Manufacturing Systems, PhD Thesis, Techn. Faculty, University of Erlangen, 2000
[Koh95] Kohlas J., Mathematical foundations of evidence theory, in: Coletti G., Dubois D., Scozzafava R. (eds.), Mathematical Models for Handling Partial Knowledge in Artificial Intelligence, Plenum Press, New York, 1995, 31-64
[KoM95] Kohlas J., Monney P.A., A Mathematical Theory of Hints, An Approach to the Dempster-Shafer Theory of Evidence, Lecture Notes in Economics and Mathematical Systems Vol. 425, Springer, Berlin, 1995
[LaL89] Laskey K.B., Lehner P.E., Assumptions, belief and probabilities, Artificial Intelligence 41, 1989, 65-77
[LuT98a] Luo T., Trivedi K.S., An improved multiple variable inversion algorithm for reliability calculation, 10th Intern. Conf. Tools'98, Palma de Mallorca, Spain, Sept. 1998; in: Puigjaner R., Savino N.N., Serra B. (eds.), Computer Performance Evaluation, Modelling Techniques and Tools, Springer, 1998
[LuT98b] Luo T., Trivedi K.S., An improved algorithm for coherent-system reliability, IEEE Trans. Reliability, March 1998, Vol. 47, No. 1, 73-78
[LuT98c] Luo T., Trivedi K.S., Using Multiple Inversion Techniques to Analyze Fault-trees with Inversion Gates, 28th Ann. Fault Tolerant Computing Symp., FTCS 98, Munich, 1998
[Mis93] Misra K.B., New trends in system reliability evaluation, Elsevier Publishers, 1993
[PTV97] Puliafito A., Tomarchio O., Vita L., Porting SHARPE on the Web, Proc. TOOLS'97, Saint Malo, June 1997
[RaA90] Rai S., Agrawal D.P., Distributed Computing Network Reliability, IEEE Computer Society Press, Washington, 1990
[Rus87] Rushdi A.M., Efficient computation of k-to-l-out-of-n systems, Reliability Engineering 17, 1987, 157-163
[RVT95] Rai S., Veeraraghavan, Trivedi K.S., A Survey of Efficient Reliability Computation Using Disjoint Products Approach, Networks 25, 3, 1995, 147-163
[SoR91] Soh S., Rai S., CAREL: computer aided reliability estimator for distributed computing networks, IEEE Trans. Parallel and Distributed Systems 2, 2, 1991, 199-213
[STP95] Sahner R., Trivedi K.S., Puliafito A., Performance and Reliability Analysis of Computer Systems - An Example Based Approach Using the SHARPE Software Package, Kluwer Academic Publishers, Massachusetts, 1995
[TKK00] Tsuchiya T., Kajikawa T., Kikuno T., Parallelizing SDP (Sum of Disjoint Products) Algorithms for fast reliability analysis, IEICE Trans. Inf. & Syst., Vol. E83, No. 5, May 2000, 1183-1186
[TrM93] Trivedi K.S., Malhotra M., Reliability and Performability Techniques and Tools: A Survey, Proc. 7th ITG/GI Conf. Measurement, Modelling and Evaluation of Computer and Communication Systems, Aachen University of Technology, 1993, 27-48
[UpP93] Upadhyaya, Pham, Analysis of a class of noncoherent systems and an architecture for the computation of the system reliability, IEEE Trans. Computers 42, 4, 1993
[Vah95] Vahl A., Reliability Assessment of Complex System Structures - A Software Tool for Design Support, Proc. 9th Symp. Quality and Reliability in Electronics, Relectronic'95, Budapest, 1995, 161-166
[Vah98] Vahl A., Interaktive Zuverlässigkeitsanalyse von Flugzeug-Systemarchitekturen, PhD Thesis, Technical University Hamburg-Harburg, Flugzeug-Systemtechnik, VDI-Verlag, Düsseldorf, 1998 (in German)
[VeT91] Veeraraghavan, Trivedi K.S., An improved algorithm for the symbolic reliability analysis of networks, IEEE Trans. on Reliability, Vol. 40, No. 3, Aug. 1991, 347-358
[YuT98] Ma Yue, Trivedi K.S., An algorithm for reliability analysis of phased mission systems, Intern. Symposium on Software Reliability Engineering, ISSRE 1998
Safety and Security Analysis of Object-Oriented Models

Kevin Lano, David Clark, and Kelly Androutsopoulos

Department of Computer Science, King's College London, Strand, London WC2R 2LS, UK
Phone: +44 (0)207 2832, Fax: +44 (0)207 2851
[email protected]

Abstract. In this paper we review existing approaches for the safety and security analysis of object-oriented software designs, and identify ways in which these approaches can be improved and made more rigorous.
1 Introduction
Object-oriented software is increasingly being used for critical systems: the widespread use of Java for smartcard based security-critical applications in particular is one example where intensive verification and proof techniques have been used [4, 6]. C++ has also been used in safety-critical and safety-related systems, and it seems likely that the use of such languages and methods will increase, as the impact of the standardisation of UML [10] and the convergence of commercial software development on Java and Java-like environments throughout the software industry spreads into the safety-critical systems domain. There are good engineering reasons for this trend, in that the capabilities of developers seem to be generally enhanced through the use of such methods and languages: it becomes easier to analyse and review the software and to detect and correct errors, because of the increased opportunities for modularising and structuring the software into relatively independent subcomponents with restricted interfaces. Java in particular could be regarded as a safe(r) subset of C++, eliminating many of the features, such as pointers and programmer-defined memory management, which are a frequent cause of subtle errors in C and C++ programs. It would be straightforward to adapt the MISRA C rules [3] to Java; indeed many of these rules are already common practice for Java programmers. Assertion-based static analysis tools for Java are now available [11]. In this paper we examine how safety analysis techniques such as HAZOPS and fault tree analysis (FTA) can be adapted to object-oriented system and software modelling notations, particularly UML.
2 Hazard Analysis of Object-Oriented Models
Standards such as [8] recommend that hazard identification, hazard analysis and risk analysis should be carried out for critical systems. Recommended techniques for these processes include HAZOPS [9] and Fault Tree Analysis (FTA).
Table 1. 00-58 Guideword Interpretations for Event
  No          Event does not happen
  As well as  Another event takes place as well
  Other than  An unexpected event occurs instead of the expected event
HAZOPS consists of a process of systematic exploration of analysis and design diagrams and documentation of a system for possible deviations from required behaviour or design intent, and identification of hazards caused by such deviations. Based on a design diagram, the elements of the system are identified, and for each element attributes are identified, for example, the position of a valve. A set of guidewords such as NO , REVERSE , AS WELL AS are then applied to each attribute to identify possible deviations (eg, the valve being open when it should be closed). HAZOPS can be applied at an early stage of system construction, where the main features of a system and intended behaviour are known, but detailed algorithms or decomposition of modules have not been produced. FTA derives a detailed causality structure for each credible hazard produced from HAZOPS, and assigns maximum acceptable probability bounds to events based on the maximum acceptable probabilities of their consequent hazards. Def-Stan 00-58 [9] gives some guidelines for the hazard analysis of state transition diagrams and entity-relationship-attribute models using HAZOPS. These guidelines and the examples given are unclear and ambiguous in some cases, which we discuss below. We provide an alternative and systematic set of guidelines for HAZOPS of UML notations, and give examples of their use.
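As a minimal sketch of how this systematic exploration can be organised (all element, attribute and guideword names below are hypothetical examples, not taken from Def-Stan 00-58), the cross product of design elements, their attributes and the guidewords yields the list of candidate deviations that the HAZOPS team then examines for causes, consequences and hazards:

import java.util.List;

// Minimal sketch of systematic HAZOPS deviation enumeration; all element,
// attribute and guideword names here are hypothetical examples.
final class HazopSketch {
    record Element(String name, List<String> attributes) {}

    public static void main(String[] args) {
        List<Element> elements = List.of(
                new Element("valve", List.of("position")),
                new Element("railway switch", List.of("commanded position", "confirmation event")));
        List<String> guidewords = List.of("NO", "REVERSE", "AS WELL AS", "OTHER THAN");

        // Each (element, attribute, guideword) triple is a candidate deviation to be
        // assessed by the HAZOPS team for causes, consequences and resulting hazards.
        for (Element e : elements)
            for (String attribute : e.attributes())
                for (String guideword : guidewords)
                    System.out.printf("%s / %s / %s%n", e.name(), attribute, guideword);
    }
}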
2.1 State Transition Diagrams
For state transition diagrams Def-Stan 00-58 Part 2 considers the transitions as the basic elements of analysis and the attributes event and action of transitions. For events there are the following guidewords and interpretations (Table 1). However “Event does not happen” is ambiguous – its meaning depends on what the state machine model represents: a specification of a control system,
Table 2. Revised Guideword Interpretations for Event
  No          Event not received by control system: either it occurs but is not transmitted to the controller because of sensor or other failure, or it does not occur even though expected
  As well as  Another event is detected by the control system as well as the intended event
  Other than  An unexpected event is detected instead of the expected event
Table 3. 00-58 Guideword Interpretations for Action
  No          No action takes place
  As well as  Additional (unwanted) actions take place
  Part of     An incomplete action is performed
  Other than  An incorrect action takes place
or a description of the equipment under control (EUC). In the former case the guidewords can be alternatively interpreted as follows (Table 2). For example, having commanded a railway switch (points) to change position, the control system expects an event confirming the completion of the close. Suitable action must be taken if this event does not occur within a given time period. An initial HAZOPS would consider deviations at the complete system level, not examining details of the control system behaviour but instead issues of how deviations from design intent in the controlled system can lead to hazards. For example failure of the switch to close can lead to derailment if undetected. At an initial design level HAZOPS can consider the intended state transition behaviour of the control system and identify more specific ways that deviations from design intent can occur. Eg: the switch may actually close but the control system may fail to detect this due to faulty sensors, connections or control logic, etc. A similar distinction is necessary on the output side of a state machine as well. The attribute in this case is the transition action, which has the following guidewords (Table 3). Again we can regard the above as appropriate when considering the overall behaviour of a system, and the following as appropriate when considering deviations of a software control system (Table 4). For states, deviations could be as given in Table 5. Consideration of timing also needs to be relative to either the controlled system or the controller behaviour. A significant time delay between a real-world event and its detection by a controller would be a deviation considered by the latter analysis for example, whereas a railway switch closing more slowly than intended, because of ice, for example, would be a deviation considered by the former.
Table 4. Revised Guideword Interpretations for Action
  No          No action is produced by the controller, or this action is not transmitted to/carried out by actuators
  As well as  Additional (unwanted) actions are generated/performed
  Part of     An incomplete action is generated/performed
  Other than  An incorrect action is generated/performed
Table 5. Guideword Interpretations for State
  No          Object not in expected state
  Other than  Object in an unexpected state
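For the No deviation on a controller input (the railway switch example above, where an expected confirmation event may never arrive), a control implementation typically arms a timeout when the command is issued. The following Java fragment is a hedged sketch of that idea only; the names, the timeout value and the recovery action are assumptions, not part of any cited design.

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Illustrative controller fragment: react if an expected confirmation event
// does not arrive within a given time window (the No guideword on an event).
final class ConfirmationMonitor {
    private final ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();

    // Arm a timeout after commanding the switch; cancel the returned future when
    // the confirmation event is received. If it fires, either the event did not
    // occur or it occurred but was not transmitted (sensor or link failure), and
    // the onMissing action (e.g. raise an alarm, move to a safe state) is taken.
    ScheduledFuture<?> expectConfirmationWithin(long millis, Runnable onMissing) {
        return timer.schedule(onMissing, millis, TimeUnit.MILLISECONDS);
    }
}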
2.2 Class Diagrams
Entity-relationship diagrams are considered in section A.5.2 of 00-58, Part 2. However the guideword interpretations suggested tend towards consideration of design flaws of the diagram (eg: "there is a required relationship that is not shown on the diagram") instead of deviations from design intent of the system described by the diagram. An example of the latter could be: "not all train locations are recorded in the data of the control system" – ie, a discrepancy exists between what the system is supposed to do, according to the diagram, and what it may do in a "part of" deviation. We suggest the following guideword interpretations for relationships (Table 6). For classes we could have the interpretations given in Table 7. For attributes of classes, we could have the interpretations given in Table 8.
2.3 Sequence Diagrams
Sequence diagrams in UML describe the objects and messages involved in a use case (for example, the system reaction to a particular external request or event). Compared to class diagrams and statecharts, sequence diagrams are imprecise, representing examples or instances of behaviour which could be more comprehensively and accurately defined in these other models. However in the development process sequence diagrams may precede the construction of statechart models, and may therefore be analysed earlier.
Table 6. Guideword Interpretations for Relationships
  No          No information about this relationship between two objects is recorded by the control system, even though the relation is true in the real world; or the relation does not hold between two real-world objects when it is expected to
  More/less   The number of objects in the relationship with another does not obey the cardinality restrictions expected. This may either be a feature of the real world or erroneous data held by the control system
  Part of     Some semantic constraints of the relation given in the diagram hold, but others do not (either in the real world or in the control system data)
  Other than  The specified relation does not occur between some objects; another unintended relation is present instead between these objects
Table 7. Guideword Interpretations for Classes
  No          An object is not a member of an expected class
  More/Less   A class has more/fewer instances than expected
  Part of     Some of the class constraints are true, others are not
  Other than  An object is a member of an unintended class
Table 8. Guideword Interpretations for Attributes
  No          An object has no value for this attribute when it should do
  More/Less   An attribute has a higher/lower value than expected (or more/fewer values than expected, if it is multivalued)
  Part of     Some of the attribute constraints are true, others are not
  Other than  Attribute is a member of an incorrect type
The main entities in a sequence diagram are objects and messages. Objects have the following attributes: (i) their (runtime) class; (ii) their lifetime. Messages have the following attributes: (i) data; (ii) time of occurrence; (iii) source object; (iv) destination object; (v) condition; (vi) delay; (vii) duration (in the case of synchronous messages). For each attribute of objects and messages in a sequence diagram we can apply guidewords, as follows. Table 9 considers the time of occurrence of a message. For the attribute of target object we could have the interpretations given in Table 10. Table 11 gives interpretations for message conditions.
2.4 Def-Stan 00-58 Case Studies and Problems
Annex C of 00-58 part 2 illustrates some PES HAZOPS. These examples are unclear in several cases due to omission of the ‘interpretation’ column in the tables. This interpretation is still only implicit in the example dialogues and should be made explicit. For example, “Other information is present in the data flow” in C.5.1. Figure 15 has errors (eg, the conditions on the transitions between
Table 9. Guideword Interpretations for Message Timing
  NO/NONE     Message not sent when it should be
  OTHER THAN  Message sent at wrong time
  AS WELL AS  Message sent at correct time and also an incorrect time
  MORE THAN   Message sent later/more often than intended
  LESS THAN   Message sent earlier/less often than intended
  SOONER      Message sent earlier within message sequence than intended
  LATER       Message sent later within message sequence than intended
Table 10. Guideword Interpretations for Message Destination
  NO/NONE     Message not sent when intended (to any destination)
  OTHER THAN  Message sent to wrong object
  AS WELL AS  Message sent to correct object and also an incorrect object
  REVERSE     Source and destination objects are reversed
  MORE THAN   Message sent to more objects than intended
  LESS THAN   Message sent to fewer objects than intended
Table 11. Guideword Interpretations for Message Conditions
  NO/NONE     Condition not true when intended
  OTHER THAN  Condition has unintended truth value
  REVERSE     Condition has opposite truth value to intended
  AS WELL AS  An unintended property is true in addition to the condition
“Creep” and “low speed” are swapped). In D.4.1 the analysis could be more accurately expressed as a deviation OTHER THAN on the Event attribute of the state machine, ie, an incorrect event is detected by the system. One cause of this incorrect event being detected is wheel spin leading to an incorrect high speed measurement. A consequence is that the system will fail to be activated when it should be (because the real speed is ≤ MIN). D.4.2 again concerns the Event attribute on a transition, and guideword NO: failure of the system to register the "vehicle becomes present" event when it should.
Fig. 1. Analysis Class Diagram for Production Cell (classes Blank, Location, Feedbelt, Press, Deposit Belt and Table, with association located_at and boolean attribute pressed; invariant: { (b,l) : located_at & l : DepositBelt => b.pressed })
Table 12. Guideword Interpretations for Case Study
  Attribute: Blank/table relationship
    No       Blank never placed on table after leaving feed belt. Or system has failed to detect it on table.
    More     2 or more blanks are on the table.
  Attribute: Blank/deposit belt relationship
    Part of  There is an unpressed blank on the deposit belt.
2.5 Case Study: Production Cell
An example of application of the above approach is the following, taken from an analysis of a production cell, in which blanks (car body templates) are moved along a feed belt onto an elevating table to be picked up by a robot arm and then transferred to a press for pressing before being transferred to a deposit belt [5]. Figure 1 shows part of the UML class diagram for this system. Applying HAZOPS to this diagram, using table 6, we could have the analyses given in Table 12. In general either the real world deviates from the model described in the class diagram (and the software either correctly represents this failure or it does not), or the real world conforms to the model but the software has incorrect knowledge. These three cases may be significantly different in terms of deviation causes and consequences, so should be analysed separately. For instance if a blank has failed to be pressed, and the software fails to detect this, hazards may arise with later processing stages. On the other hand if the blank has been pressed but the software fails to detect this, it may be re-pressed, possibly causing damage to the press. Figure 2 shows the controller state machine for the feed belt. s2 is the blank sensor at the end of the feed belt, stm a signal indicating that the feed belt is safe to move, and bm is the belt motor. Hazard states are indicated by a cross. HAZOPS of this state machine, using tables 2 and 4, identifies the following potential deviations and hazards (Table 13).
Table 13. Guideword Interpretations for Feed belt state machine
  Attribute: Event (for transition off_on_on → off_off_on); Guideword: No
    Interpretation and Consequences: stmoff not detected when it should be. Control system leaves motor on, resulting in hazard state off_off_on in EUC.
  Attribute: Action (for transition on_off_on → on_off_off); Guideword: Other than
    Interpretation and Consequences: bmSeton issued instead of bmSetoff. EUC stays in state on_off_on, which is hazardous.
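The distinction just drawn between the plant's actual state and the control software's knowledge of it can be made concrete with a small sketch (class, field and location names are illustrative assumptions, not taken from the cited production-cell design) that checks the Figure 1 invariant, namely that every blank located at the deposit belt must have been pressed, against either view:

import java.util.Map;

// Illustrative check of the Figure 1 invariant: every blank located at the
// deposit belt must be pressed. Names are assumptions for illustration only.
final class ProductionCellInvariant {
    record Blank(String id, boolean pressed) {}

    // locatedAt maps each blank to its current location, e.g. "feed belt",
    // "table", "press" or "deposit belt".
    static boolean depositBeltBlanksPressed(Map<Blank, String> locatedAt) {
        return locatedAt.entrySet().stream()
                .filter(e -> "deposit belt".equals(e.getValue()))
                .allMatch(e -> e.getKey().pressed());
    }
    // Evaluating this predicate separately on the plant's actual state and on the
    // controller's recorded state distinguishes a real "part of" deviation (an
    // unpressed blank really is on the belt) from erroneous knowledge held by
    // the control software.
}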
Fig. 2. Controller state machine for Feed belt (states are tuples s2_stm_bm; sensor transitions on events s2on, s2off, stmon, stmoff; actuator transitions /bmSeton, /bmSetoff)
3 Security Analysis
Many safety-critical systems have security issues – eg, in a railway network management system, communication between a train coordinator and train drivers must be authenticated to ensure safe operation of the railway. Other systems may not have direct safety implications (eg, an online banking system) but have security aspects with critical consequences. Traditional HAZOPS concerns deviations from intended functionality, in general. For security HAZOPS we want to focus on the relevant components (elements which store, process or transmit data internally, or which receive/transmit data to the external world) and relevant attributes, guidewords and interpretations. Deviation analysis of object-oriented models for security properties can follow the same set of guidewords and interpretations as given above. The elements and attributes of interest will be focussed on security, and similarly for the causes and consequences of deviations. For example in the model of Figure 3 a “part of” deviation could be that some user with a security level below admin is logged into the secure host.
Fig. 3. Part of Class Diagram of Secure System (classes User, with attribute permissions, and Secure Host, linked by the association logged_on; invariant: { (user,host): logged_on => user.permissions >= admin })
The aim of security HAZOPS is to identify deviations from design intent which have security implications, ie, which affect the system's ability to:
– ensure confidentiality and integrity of information it stores/processes/internally transmits (Confidentiality/Integrity)
– provide authentication of information it produces (outputs) and to authenticate input information (Authentication)
– ensure availability of services/information (Availability)
– provide a trace of all user interactions, such as login attempts to the system (Non-repudiation).
The team approach and other organisational aspects of HAZOPS defined in Def-Stan 00-58 could still be used, although the range of expertise should be relevant (eg, including experts in the encryption/networking approaches used in the system). The components involved will be:
– Physical components: processing nodes (including intermediate relay nodes in a communication network); memory; connectors; input devices (eg, keyboard; smartcard reader; biometric sensor); output devices (eg, CD burner; smartcard writer; VDU)
– Information components: file, email, network packet. Subcomponents such as particular record fields/email header, etc.
Therefore the guidewords and interpretations could be:
– NO/NONE – complete absence of intended CIA/A/NR property for component being considered
– LESS – inadequate CIA/A/NR property compared with intention
– PART OF – only part of intended CIA/A/NR achieved
– OTHER THAN – intended CIA/A/NR property achieved, but some unintended property also present.
LATE, EARLY, BEFORE, AFTER, INTERRUPT could also be relevant. The paper [12] gives more focussed guidewords, using the idea of a guidephrase template. We can extend this concept to consider the additional properties of authentication and non-repudiation. CIA/A/NR properties may be lost due either to device behaviour (technical failure/cause) or to human actions. Human actions may either be deliberate hostile actions (by insiders or outsiders) or unintentional human failures (or correct procedures which nonetheless lead to a loss of CIA/A/NR). Negations of CIA/A/NR properties are: disclosure, manipulation (subcases being fabrication, amendment, removal, addition, etc), denial, misauthentication (includes authentication of invalid data and inauthentication of valid data), repudiation. The adapted guidephrase template is given in Table 14. Examples of these and their interpretations could be:
Table 14. Revised Guidephrase Template
  (Pre) Guideword:  Deliberate | Unintentional
  Attribute:        Disclosure | Manipulation | Denial | Misauthentication | Repudiation
                    ... of COMPONENT by ...
  (Post) Guideword: Insider | Outsider | Technical behaviour/functionality
– Unintentional misauthentication of connection request to firewall by technical behaviour. Interpretation (1): firewall authenticates sender of request when it should not. Interpretation (2): firewall does not authenticate a valid sender of request and denies the connection when it should not.
– Deliberate disclosure of patient record by insider. Interpretation: staff member provides patient information intentionally to a third party.
We could apply this approach to an example of an internet medical analysis system (Figure 4):
– The company provides a server to perform analysis of data which customers upload to the website, and download results from.
– Data is encrypted by a private key method between the client and server, but stored in decrypted form on the server and the smartcard.
– The smartcard reader at the client computer reads data from the smartcard for transmission to the server.
– Customers have a password/login for the website; this is checked against their customer ID data on the smartcard (written by the server on registration/initialisation of the card). Only if both agree is the login accepted.
– Each customer is allocated a separate directory at the server and cannot load/read the data of any other customer.
Fig. 4. Architecture of Medical Analysis System (Client PC with smartcard reader and data storage, connected via the Internet to the Server)
An example security HAZOPS analysis from this system is given in Table 15. Detailed risk analysis can then be carried out:
Table 15. Example Analysis of Medical System
  Guidephrase: Unintentional misauthentication of smartcard by technical function
    Interpretation: An invalid smartcard is authenticated when it should not be
    Causes: Software fault in checker, or HW failure in reader
    Consequences: Valid user can use invalid card: may corrupt data stored on server
  Guidephrase: Deliberate manipulation of server data by outsider
    Interpretation: Server data altered by unauthorised person
    Causes: Outsider gains login access to server, or via unsecured webpages or scripts
    Consequences: Analysis data may be altered so incorrect results are given
– Given a specific loss of security situation from the HAZOP, we can compute the approximate likelihoods and severity, and therefore the risk level.
– The probability of decryption of encrypted data can be estimated using mathematical theory; other probabilities, of interception, intrusion and denial of service attacks, are less clear, although estimates based on previous experience could be used.
4 Related Work
State machine hazard analysis has been used informally in the HCI field [2] and tool support for generating fault trees from state machines has been developed as part of RSML [7]. The CORAS Esprit project (http://www.nr.no/coras/) is investigating security analysis of object-oriented systems, using scenario analysis. Such analysis could be based on the identification of individual deviations using the techniques given here. Tools for HAZOPS analysis such as PHA-Pro 5 (http://www.woodhill.co.uk/pha/pha.htm) and PHAWorks 5 (http://www.primatech.com/software/phaworks5.htm) deal mainly with process control system HAZOPS support, and do not cover programmable electronic system designs.
5 Conclusion
We have described approaches for HAZOPS of UML notations and for security HAZOPS. These revised guidewords and interpretations have been implemented
in a hazard analysis tool for UML class diagrams and E/E/PES P&I diagrams within the RSDS support environment [1].
References

[1] K. Androutsopoulos, The RSDS Tool, Department of Computer Science, King's College, 2001. http://www.dcs.kcl.ac.uk/pg/kelly/Tools/
[2] Alan Dix, Janet Finlay, Gregory Abowd, Russell Beale, Human-Computer Interaction, 2nd Edition, Prentice Hall, 1998.
[3] ISO, Guidelines for the Use of the C Language in Vehicle Based Software, ISO TR/15497. Also at: http://www.misra.org.uk/.
[4] J. L. Lanet, A. Requet, Formal Proof of Smart Card Applets Correctness, Proceedings of 3rd Smart Card Research and Advanced Application Conference (CARDIS '98), Sept. 1998.
[5] K. Lano, D. Clark, K. Androutsopoulos, P. Kan, Invariant-based Synthesis of Fault-tolerant Systems, FTRTFT 2000, Pune, India, 2000.
[6] P. Lartigue, D. Sabatier, The use of the B Formal Method for the Design and Validation of the Transaction Mechanism for Smart Card Applications, Proceedings of FM '99, pp. 348-368, Springer-Verlag, 1999.
[7] Nancy G. Leveson, Designing a Requirements Specification Language for Reactive Systems, invited talk, Z User Meeting, 1998, Springer-Verlag, 1998.
[8] Ministry of Defence, Defence Standard 00-56, Issue 2, 1996.
[9] Ministry of Defence, Defence Standard 00-58, Issue 2, 2000.
[10] Rational Software et al., OMG Unified Modeling Language Specification Version 1.4, 2001.
[11] K. R. Leino, J. Saxe, R. Stata, Checking Java programs with Guarded Commands, in Formal Techniques for Java Programs, technical report 251, Fernuniversität Hagen, 1999.
[12] R. Winther, O-A. Johansen, B. A. Gran, Security Assessments of Safety Critical Systems using HAZOPS, Safecomp 2000.
The CORAS Framework for a Model-Based Risk Management Process

Rune Fredriksen (1), Monica Kristiansen (1), Bjørn Axel Gran (1), Ketil Stølen (2), Tom Arthur Opperud (3), and Theo Dimitrakos (4)

(1) Institute for Energy Technology, Halden, Norway
    {Rune.Fredriksen,Monica Kristiansen,Bjorn-Axel Gran}@hrp.no, http://www.ife.no
(2) Sintef Telecom and Informatics, Oslo, Norway
    [email protected], http://www.sintef.no
(3) Telenor Communications AS R&D, Fornebu, Norway
    [email protected], http://www.telenor.no
(4) CLRC Rutherford Appleton Laboratory (RAL), Oxfordshire, UK
    [email protected], http://www.rl.ac.uk
Abstract. CORAS is a research and technological development project under the Information Society Technologies (IST) Programme (Commission of the European Communities, Directorate-General Information Society). One of the main objectives of CORAS is to develop a practical framework, exploiting methods for risk analysis, semiformal methods for object-oriented modelling, and computerised tools, for a precise, unambiguous, and efficient risk assessment of security critical systems. This paper presents the CORAS framework and the related conclusions from the CORAS project so far.
1 Introduction
CORAS [1] is a research and technological development project under the Information Society Technologies (IST) Programme (Commission of the European Communities, Directorate-General Information Society). CORAS started up in January 2001 and runs until July 2003. The CORAS main objectives are as follows:
– To develop a practical framework, exploiting methods for risk analysis, semiformal methods for object-oriented modelling, and computerised tools, for a precise, unambiguous, and efficient risk assessment of security critical systems.
– To apply the framework in two security critical application domains: telemedicine and e-commerce.
– To assess the applicability, usability, and efficiency of the framework.
– To promote the exploitation potential of the CORAS framework.
2 The CORAS Framework
This section provides a high-level overview of the CORAS framework for a model-based risk management process. By "a model-based risk management process" we mean a tight integration of state-of-the-art UML-oriented modelling technology (UML = Unified Modeling Language) [2] in the risk management process. The CORAS model-based risk management process employs modelling technology for three main purposes:
– Providing descriptions of the target of analysis at the right level of abstraction.
– As a medium for communication and interaction between different groups of stakeholders involved in a risk analysis.
– To document results and the assumptions on which these results depend.
A model-based risk management process is motivated by several factors:
– Risk assessment requires correct descriptions of the target system, its context and all security relevant features. The modelling technology improves the precision of such descriptions. Improved precision is expected to improve the quality of risk assessment results.
– The graphical style of UML furthers communication and interaction between stakeholders involved in a risk assessment. This is expected to improve the quality of results, and also speed up the risk identification and assessment process since the danger of wasting time and resources on misconceptions is reduced.
– The modelling technology facilitates a more precise documentation of risk assessment results and the assumptions on which their validity depends. This is expected to reduce maintenance costs by increasing the possibilities for reuse.
– The modelling technology provides a solid basis for the integration of assessment methods that should improve the effectiveness of the assessment process.
– The modelling technology is supported by a rich set of tools from which the risk analysis may benefit. This may improve quality (as in the case of the two first bullets) and reduce costs (as in the case of the second bullet). It also furthers productivity and maintenance.
– The modelling technology provides a basis for tighter integration of the risk management process in the system development process. This may considerably reduce development costs and ensure that the specified security level is achieved.
The CORAS framework for a model-based risk management process has four main anchor-points: a system documentation framework based on the Reference Model for Open Distributed Processing (RM-ODP) [3], a risk management process based on the risk management standard AS/NZS 4360 [4], a system development process based on the Rational Unified Process (RUP) [5], and a platform
for tool-integration based on eXtensible Markup Language (XML) [6]. In the following we describe the four anchor-points and the model-based risk management process in further detail.
2.1 The CORAS Risk Management Process
The CORAS risk management process provides a sequencing of the risk management process into the following five sub-processes:
1. Context Identification: Identify the context of the analysis that will follow. The approach proposed here is to select usage scenarios of the system under examination.
2. Risk Identification: Identify the threats to assets and the vulnerabilities of these assets.
3. Risk Analysis: Assign values to the consequence and the likelihood of occurrence of each threat identified in sub-process 2.
4. Risk Evaluation: Identify the level of risk associated with the threats already identified and assessed in the previous sub-processes.
5. Risk Treatment: Address the treatment of the identified risks.
The initial experimentation with UML diagrams can be summarised into the following:
1. UML use case diagrams support the identification of both the users of a system (actors) and the tasks (use cases) they must undertake with the system. UML scenario descriptions can be used to give more detailed input to the identification of different usage scenarios in the CORAS risk management process.
2. UML class/object diagrams identify the classes/objects needed to achieve the tasks, which the system must help to perform, and the relationships between the classes/objects. While class diagrams give the relationships between general classes, object diagrams present the instantiated classes. This distinction could be important when communicating with users of the system.
Fig. 1. Overview over the CORAS risk management process (Sub-process 1: Identify context; Sub-process 2: Identify risks; Sub-process 3: Analyse risks; Sub-process 4: Risk evaluation; Sub-process 5: Risk treatment)
3. UML sequence diagrams describe some aspects of system behaviour by e.g. showing which messages are passed between objects and in what order they must occur. This gives a dynamic picture of the system and essential information to the identification of important usage scenarios.
4. UML activity diagrams describe how activities are co-ordinated and record the dependencies between activities.
5. UML state chart diagrams or UML activity diagrams can be used to represent state transition diagrams. The UML state chart diagram may be used to identify the sequence of state transitions that leads to a security break.
2.2 The CORAS System Documentation Framework
The CORAS system documentation framework is based on RM-ODP. RM-ODP defines the standard reference model for distributed systems architecture, based on object-oriented techniques, accepted at the international level. RM-ODP is adopted by ISO (ISO/IEC 10746 series: 1995) as well as by the International Telecommunication Union (ITU) (ITU-T X.900 series: 1995). As indicated by Figure 2, RM-ODP divides the system documentation into five viewpoints. It also provides modelling, specification and structuring terminology, a conformance module addressing implementation and consistency requirements, as well as a distribution module defining transparencies and functions required to realise these transparencies. The CORAS framework extends RM-ODP with:
1. Concepts and terminology for risk management and security.
2. Carefully defined viewpoint-perspectives targeting model-based risk management of security-critical systems.
3. Libraries of standard modelling components.
4. Additional support for conformance checking.
5. A risk management module containing risk assessment methods, risk management processes, and a specification of the international standards on which CORAS is based.
2.3 The CORAS Platform for Tool Integration
The CORAS platform is based on data integration implemented in terms of XML technology. Figure 3 outlines the overall structure. The platform is built up around an internal data representation formalised in XML/XMI (characterised by XML schema). Standard XML tools provide much of the basic functionality. This functionality allows experimentation with the CORAS platform and can be used by the CORAS crew during the trials. Based on the eXtensible Stylesheet Language (XSL), relevant aspects of the internal data representation may be mapped to the internal data representations of other tools (and the other way around). This allows the integration of sophisticated case-tools targeting system development as well as risk analysis tools and tools for vulnerability and threat management.
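A hedged sketch of such an XSL-based mapping, written with the standard Java XSLT API (the file names and the stylesheet are hypothetical; the actual CORAS schemas and stylesheets are not shown here):

import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Illustrative data-integration step: map an internal XML model to a
// tool-specific format via an XSL stylesheet. All file names are hypothetical.
final class XslMapping {
    public static void main(String[] args) throws Exception {
        Transformer toRiskTool = TransformerFactory.newInstance()
                .newTransformer(new StreamSource("internal-to-risk-tool.xsl"));
        toRiskTool.transform(new StreamSource("internal-model.xml"),
                             new StreamResult("risk-tool-input.xml"));
    }
}

A second stylesheet applied in the opposite direction would realise the "other way around" mapping mentioned above, importing results from the external tool back into the internal representation.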
Fig. 2. The main components of RM-ODP (the five viewpoints: enterprise, information, computation, engineering, technology; the ODP Foundations: modelling, specification and structuring terminology; the Conformance Module: implementation and consistency requirements; the Distribution Module: transparencies and functions)
2.4 The CORAS Model-Based Risk Management Process
The CORAS methodology for a model-based risk management process builds on:
– HAZard and OPerability study (HAZOP) [7];
– Fault Tree Analysis (FTA) [8];
– Failure Mode and Effect Criticality Analysis (FMECA) [9];
– Markov analysis methods (Markov) [10];
– Goals Means Task Analysis (GMTA) [11];
– CCTA Risk Analysis and Management Methodology (CRAMM) [12].
These methods are to a large extent complementary. They address confidentiality, integrity, availability as well as accountability; in fact, all types of risks, threats, hazards associated with the target system can potentially be revealed and dealt with. They also cover all phases in the system development and maintenance process. In addition to the selected methods other methods may also be needed to implement the different sub-processes in the CORAS risk management process. So far two additional methods have been identified. These are Cause-Consequence Analysis (CCA) [13] and Event-Tree Analysis (ETA) [13]. The CORAS risk management process tightly integrates state-of-the-art technology for semiformal object-oriented modelling. Modelling is not only used to provide a precise description of the target system, but also to describe its context and possible threats. Furthermore, description techniques are employed to document the risk assessment results and the assumptions on which these results
Fig. 3. The meta-model of the CORAS platform (an XML internal representation and XML tools providing basic functionality, connected via XSL mappings to commercial modelling tools, commercial risk analysis tools, and commercial vulnerability and threat management tools)
depend. Finally, graphical UML-based modelling provides a medium for communication and interaction between different groups of stakeholders involved in risk identification and assessment. The following table gives a brief summary and a preliminary guideline as to which methods should be applied for which sub-process in the CORAS model-based risk management process. This guideline will be updated further during the progress of the CORAS project. Risk management requires a firm but nevertheless easily understandable basis for communication between different groups of stakeholders. Graphical object-oriented modelling techniques have proved well suited in this respect for requirements capture and analysis. We believe they are equally suited as a language for communication in the case of risk management. Entity relation diagrams, sequence charts, dataflow diagrams and state diagrams represent mature paradigms used daily in the IT industry throughout the world. They are supported by a wide set of sophisticated case-tool technologies, they are to a large extent complementary and, together, they support all stages in a system development. Policies related to risk management and security are important input to risk assessment of security critical systems. Moreover, results from a risk assessment will often indicate the need for additional policies. Ponder [14] is a very expressive declarative, object-oriented language for specifying security and management policies for distributed systems. Ponder may benefit from an integration with graphical modelling techniques. Although the four kinds of graphical modelling techniques and Ponder are very general paradigms they do not always provide
Table 1. How the RA methods apply to the CORAS risk management process
  Sub-process: Context Identification
    Goal: Identify the context of the analysis (e.g. areas of concern, assets and security requirements).
    Recommended methods: CRAMM, HAZOP
  Sub-process: Risk Identification
    Goal: Identify threats.
    Recommended methods: HAZOP
  Sub-process: Risk Analysis
    Goal: Find consequence and likelihood of occurrence.
    Recommended methods: FMECA, CCA, ETA
  Sub-process: Risk Evaluation
    Goal: Evaluate risk (e.g. risk level, prioritise, categorise, determine interrelationships and prioritise).
    Recommended methods: CRAMM
  Sub-process: Risk Treatment
    Goal: Identify treatment options and assess alternative approaches.
    Recommended methods: HAZOP
the required expressiveness. Predicate logic based approaches like OCL [15] in addition to contract-oriented modelling are therefore also needed.
2.5 The CORAS System Development and Maintenance Process
The CORAS system development and maintenance process is based on an integration of the AS/NZS 4360 standard for risk management and an adaptation of the RUP for system development. RUP is adapted to support RM-ODP inspired viewpoint oriented modelling. Emphasis is placed on describing the evolution of the correlation between risk management and viewpoint oriented modelling throughout the system’s development and maintenance lifecycle. In analogy to RUP, the CORAS process is both stepwise incremental and iterative. In each phase of the system lifecycle, sufficiently refined versions of the system (or its model) are constructed through subsequent iterations. Then the system lifecycle moves from one phase into another. In analogy to the RM-ODP viewpoints, the viewpoints of the CORAS framework are not layered; they are different abstractions of the same system focusing on different areas of concern. Therefore, information in all viewpoints may be relevant to all phases of the lifecycle.
3 Standard Modeling Components
Much is common from one risk assessment to the next. CORAS aims to exploit this by providing libraries of reusable specification fragments targeting the risk management process and risk assessment. These reusable specification fragments are in the following referred to as standard modelling components. They will typically be UML diagrams annotated with constraints expressed in OCL
(Object Constraint Language), or in other specification languages suitable for this purpose. The process of developing standard modelling components will continue in the CORAS project. In this phase the focus has been to:
1. Build libraries of standard modelling components for the various security models developed in CORAS.
2. Provide guidelines for the structuring and maintenance of standard modelling components.
The following preliminary results and conclusions have been reached:
1. Standard modelling components may serve multiple purposes in a process for model-based risk management. They can represent general patterns for security architectures, or security policies. They can also represent the generic parts of different classes of threat scenarios, as well as schemes for recording risk assessment results and the assumptions on which they depend.
2. In order to make effective use of such a library, there is need for a computerised repository supporting standard database features like storage, update, rule-based use, access, search, maintenance, configuration management, version control, etc.
3. XMI offers a standardised textual XML-based representation of UML specifications. Since UML is the main modelling methodology of the CORAS framework and XML has been chosen as the main CORAS technology for tool integration, the repository should support XMI based exchange of models.
4. The UML meta-model is defined in Meta Object Facility (MOF) [16]. In relation to a CORAS repository, MOF may serve as a means to define a recommended subset of UML for expressing standard modelling components, required UML extensions to support a model-based risk management process, as well as the grammar of component packets. The repository should therefore be MOF based.
5. To support effective and smooth development of a consistent library, a single CORAS repository that all partners in the consortium may access via the Internet would be useful.
6. The OMG standards MOF and XMI ensure open access to the library and flexible exchange of standard modelling components between tools. There are already commercial tools for building repositories supporting MOF and XMI on the market; others are under development. The consortium will formalise the library of standard modelling components in terms of MOF and XMI using a UML CASE-tool suitable for this purpose.
4 The CORAS Trials
The trials in CORAS are performed within two different areas: e-commerce and telemedicine. The purpose of the trials is to experiment with all aspects of the
Fig. 4. An example of the UML sequence diagram used in the trial (a login interaction between Web Client, Web Server, Application Server and Data Storage involving the messages reqLogin(), Create(sn), Add(sn), return(status) and retLoginPage(sn))
framework during its development, provide feedback for improvements and offer an overall assessment. The first e-commerce trial was based on the authentication mechanism. Among other models an indicative UML sequence diagram for starting the FMECA method, see Figure 4, was used. It is important to stress that the sequence diagram presented here is only one example of typical possible behaviours of the system. Scenarios like unsuccessful login, visitor accessing the platform, registration of new user, etc., could also be modelled. A more detailed description of the CORAS trials will be provided in the reports from the CORAS project. This trial was focused on sub-process 2 (identify risks) and on gaining familiarity with the use of CRAMM, HAZOP, FTA and FMECA for this purpose. The results from the first e-commerce trial are divided into four partly overlapping classes:
1. Experiences with the use of the specific risk analysis methods.
2. Experiences from the overall process.
3. Input to changes to the way the trials are performed.
4. Input to minor changes of the CORAS risk management process.

4.1 Experiences with the Use of the Specific Risk Analysis Methods
The individual methods used during the first e-commerce risk analysis trial session provided the following main results:
– CRAMM was useful for identification of important system assets.
– HAZOP worked well with security-related guidewords/attributes [17] that reflected the security issues addressed.
– FTA was useful for structured/systematic risk analysis, but was time-consuming and might present scalability problems.
– FMEA worked well, but has to be well organised before it is applied and it may even be prepared beforehand by the system developers.
The trial also demonstrated, through the interactions between the models on the board and the risk analysis methods, that model-based risk assessment provides an effective medium for communication and interaction between different groups of stakeholders involved in a risk assessment.
4.2 Experiences from the Overall Process
The CORAS risk management process was initially difficult to follow without guidance from experienced risk analysts. In particular, the interfacing between models and the objective of using each method were not initially clear. During the process it became obvious that sufficient input of documentation, including models, was critical to obtain valuable results. The process did, however, provide identification of threats and some important issues were discovered despite time limitations. The different methods provided complementary results, and the application of more than one method was very successful.
4.3 Input to Changes to the Way the Trials Are Performed
One of the objectives of the first e-commerce trial was to provide input on how the following trials should be performed. Four major issues were addressed:
1. The trials should be more realistic, regarding the people that participate, the duration and the functionality that is analysed.
2. The CORAS risk management process should be followed more formally.
3. Documentation, including models, should be provided in sufficient time before the trial so that clarifications can be provided in time.
4. Tool support for the different risk analysis methods would make the application of the methods more productive.
4.4 Input to Minor Changes of the CORAS Risk Management Process
The major results from this trial for the subsequent updates of the CORAS risk management process are:
1. Guidelines for the application of the CORAS risk management process need to be provided;
2. The terminology in use needs to be defined in more detail; and
3. Templates for the different risk analysis methods need to be available.
5 Conclusions
This paper presents the preliminary CORAS model-based risk management process. The main objective of the CORAS project is to develop a framework to support risk assessment of security critical systems, such as telemedicine or e-commerce systems. A hypothesis, that risk analysis methods traditionally used in a safety context can be applied in a security context, has been evaluated, and will be evaluated further during the forthcoming trials. This paper also presents the experiences from the first trial in the project. The different methods provided complementary results, and the use of more than one method seemed to be an effective approach. The first trial experiences also demonstrated the advantages of the interactions between the models on the board and the risk analysis methods. In addition the trial provided the identification of threats and some important issues for further follow-up. The trials to be performed during spring 2002 will provide feedback for updated versions of the recommendations developed in the CORAS project.
References

[1] CORAS: "A Platform for Risk Analysis of Security Critical Systems", IST-2000-25031, 2000. http://www.nr.no/coras/
[2] OMG: UML proposal to the Object Management Group (OMG), Version 1.4, 2000.
[3] ISO/IEC 10746: Basic Reference Model of Open Distributed Processing, 1999.
[4] AS/NZS 4360: Risk Management. Australian/New Zealand Standard, 1999.
[5] Kruchten, P.: The Rational Unified Process, An Introduction, Addison-Wesley, 1999.
[6] W3C: Extensible Markup Language (XML) 1.0, October 2000.
[7] Redmill F., Chudleigh M., Catmur J.: Hazop and Software Hazop, Wiley, 1999.
[8] Andrews J. D., Moss T. R.: Reliability and Risk Assessment, 1st Ed., Longman Group UK, 1993.
[9] Bouti A., Kadi A. D.: A state-of-the-art review of FMEA/FMECA, International Journal of Reliability, Quality and Safety Engineering, vol. 1, no. 4, pp. 515-543, 1994.
[10] Littlewood B.: A Reliability Model for Systems with Markov Structure, Appl. Stat., 24 (2), pp. 172-177, 1975.
[11] Hollnagel E.: Human Reliability Analysis: Context and Control, Academic Press, London, UK, 1993.
[12] Barber B., Davey J.: Use of the CRAMM in Health Information Systems, MEDINFO 92, ed. Lun K. C., Degoulet P., Piemme T. E. and Rienhoff O., North Holland Publishing Co, Amsterdam, pp. 1589-1593, 1992.
[13] Henley E. J., and Kumamoto H.: Probabilistic Risk Assessment and Management for Engineers and Scientists, 2nd Ed., IEEE Press, 1996.
[14] Damianou N., Dulay N., Lupu E., and Sloman M.: Ponder: A Language for Specifying Security and Management Policies for Distributed Systems. The Language Specification - Version 2.2. Research Report DoC 2000/1, Department of Computing, Imperial College, London, April 2000.
[15] Warmer Jos B., and Kleppe Anneke G.: The Object Constraint Language - precise modeling with UML. Addison-Wesley, 1999.
[16] OMG: Meta Object Facility. Object Management Group (OMG), http://www.omg.org
[17] Winther Rune et al.: Security Assessments of Safety Critical Systems Using HAZOPs, in: U. Voges (Ed.), SAFECOMP 2001, LNCS 2187, pp. 14-24, Springer-Verlag Berlin Heidelberg, 2001.
Software Challenges in Aviation Systems

John C. Knight

Department of Computer Science, University of Virginia, 151 Engineer’s Way, P.O. Box 400740, Charlottesville, VA 22904-4740, USA
[email protected]

Abstract. The role of computers in aviation is extensive and growing. Many crucial systems, both on board and on the ground, rely for their correct operation on sophisticated computer systems. This dependence is increasing as more and more functionality is implemented using computers and as entirely new systems are developed. Several new concepts are being developed specifically to address current safety issues in aviation such as runway incursions. This paper summarizes some of the system issues and the resulting challenges to the safety and software engineering research communities.
1 Introduction
The operation of modern commercial air transports depends on digital systems for a number of services. Some of these services, e.g., autopilots, operate on board, and others, e.g., current air-traffic management systems, operate on the ground. In many cases, the systems interact with each other via data links of one form or another, e.g., ground system interrogation of on-board transponders, and aircraft broadcast of position and other status information [1]. This dependence on digital systems also extends, in a significant way, to general aviation aircraft. In most cases, digital systems in aviation are safety-critical. Some systems, such as a primary flight-control system [2], are essential for normal aircraft operation. Others, such as some displays and communications systems, are important but only crucial under specific circumstances or at specific times. Any complex digital system will be software intensive, and so the correct operation of many aviation systems relies upon the correct operation of the associated software. The stated requirement for the reliability of a flight-crucial system on a commercial air transport is 10^-10 failures per hour, where a failure could lead to loss of the aircraft [2, 3]. This is a system requirement, not a software requirement, and so the software must not merely meet this goal but exceed it, because the hardware components of the system will not be perfect. The development of digital aviation systems presents many complex technical challenges because the dependability requirements are so high. Some of the difficulties encountered are summarized in this paper. In the next section, aviation systems are reviewed from the perspectives of enhanced functionality and enhanced safety, and the characteristics of such systems are discussed. In Section 3, some of the challenges that arise in software engineering are presented.
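As a rough illustration of this budget argument (ours, not the paper's; it simply treats the hardware and software contributions to the failure rate as additive):

\lambda_{\mathrm{system}} \approx \lambda_{\mathrm{hardware}} + \lambda_{\mathrm{software}} \le 10^{-10}\ \mathrm{failures/hour}
\quad\Longrightarrow\quad
\lambda_{\mathrm{software}} \le 10^{-10} - \lambda_{\mathrm{hardware}} < 10^{-10}\ \mathrm{failures/hour},

so any non-zero hardware failure rate pushes the software target strictly below an already extreme system target.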
2 Aviation Systems
2.1 Enhanced Functionality
The trend of reduced digital hardware costs and the coincident reduction in hardware size and power consumption has led to an increasing use of digital systems in aviation. In some cases, digital implementations have replaced older analog-based designs. In other cases, entirely new concepts become possible thanks to digital systems. An example of the former is autopilots. Autopilots used to be based on analog electronics but are now almost entirely digital. The basic ideas behind the operation of an autopilot have remained the same through this transition, but modern digital autopilots are characterized by greater functionality and flexibility. Examples of entirely new concepts are modern full-authority digital engine controllers (FADECs) and envelope protection systems. FADECs manage large aircraft engines and monitor their performance with a sophistication that would be essentially impossible in anything but a digital implementation. Similarly, comprehensive envelope protection is only possible using a digital implementation. Functionality enhancement is taking place in both on-board and ground-based systems. Flight deck automation is very extensive, and this has led to the use of the term "glass cockpit", since most information displays are now computer displays [4, 5]. Ground-based automation is extensive and growing. Much of the development that is taking place is designed to support Free Flight [6] and the Wide Area Augmentation System (WAAS) [7], a GPS-based precision guidance system for aircraft navigation and landing. Both Free Flight and WAAS depend heavily on computing and digital communications. It is difficult to obtain accurate estimates of the number of processing units, the precise communications architecture, and the amount of software in an aviation system, for many reasons. It is sometimes not clear what constitutes "a processor", for example, because so much specialized electronics is involved. Similarly, software is sometimes held in read-only memories and called "firmware" rather than software. In addition, digital systems are often used for non-safety-related functions and so are not of interest. Finally, many of the details of digital systems in aviation applications are considered proprietary and are not made available. Although some details are not available, it is clear that there are many safety-critical digital systems in present aviation applications. It is also clear that these systems are extremely complex in many cases. Both aircraft on-board systems and ground-based systems are often sophisticated computer networks, and these systems also interact. In some cases, such as WAAS, the architecture is a wide-area network with very high dependability and real-time performance requirements. Given the continuing technological trends, it is to be expected that there will be many more such systems in the future.
2.2 Enhanced Safety
The stimulus for developing new and enhanced digital systems is evolving. While the change from analog to digital implementation of major systems will no doubt
continue, there are major programs underway to develop techniques that will address safety issues explicitly [8]. Three of the major concerns in aviation safety are: (1) accidents caused by Controlled Flight Into Terrain (CFIT); (2) collisions during ground operations, take off, or landing; and (3) mechanical degradation or failure. CFIT occurs when a perfectly serviceable aircraft under the control of its pilots impacts the ground, usually because the crew was distracted. CFIT was involved in 37% of 76 approach and landing accidents or serious incidents from 1984-97 [9, 10, 11], and CFIT incidents continue to occur [12]. The prevention of collisions on the ground is a major goal of the Federal Aviation Administration [13]. During the decade of the 1990s, 16 separate accident categories (including "unknown") were identified in the worldwide commercial jet fleet [14]. The category responsible for the most fatalities (2,111) was CFIT. An analysis of these categories by Miller has suggested that nine of the categories (responsible for 79% of the accidents) might be addressable by automation [15]. Thus, there is a very strong incentive to develop new technologies that address safety explicitly, and this, together with the rapidly rising volume of commercial air traffic, is the motivation for the various aviation safety programs [8]. These new programs are expected to yield entirely new systems that will enhance the safe operation of aircraft. The Aircraft Condition Analysis and Management System (ACAMS), for example, is designed to diagnose and predict faults in various aircraft subsystems so as to assess their flight integrity and airworthiness [16]. The ACAMS system operates with on-board components that diagnose problems and ground-based components that inform maintenance and other personnel. Another important new direction in aviation safety is structural health monitoring. The concept is to develop systems that perform detailed observation of aircraft structures in real time during operation. They are expected to provide major benefits by warning of structural problems such as cracks while the problems are still of insignificant size. The approach being followed is to develop sensors that can be installed in critical components of the airframe and to use computers to acquire and analyze the data returned by the sensors. For an example of such a system, see the work of Munns et al. [17]. A significant innovation in ground-based systems is automatic alerting of potential runway incursions. At modern airports, the level of ground traffic is so high that it is difficult to prevent various forms of traffic from entering runways that are being used for flight operations. The worst accident in aviation history, with 583 fatalities, occurred in Tenerife, Canary Islands, in March 1977, and was the result of a runway incursion. Research is underway to develop systems that will warn pilots of possible incursions so that collisions can be avoided [18].
2.3 Characteristics of Enhanced Systems
Inevitably, new aviation systems, whether for enhanced functionality or enhanced safety, will be complex—even more so than current systems. Considerable hardware will be required for the computation, storage and communication that will be
required, and extensive hardware replication will be present to address dependability goals. Replication will, in most cases, have to go beyond simple duplication or triplication because the reliability requirements cannot be met with these architectures. Replication will obviously also extend into power and sensor subsystems. The functional complexity of the systems being designed is such that they will certainly be software intensive. But functionality is not the only requirement that will be addressed by software. Among other things, it will be necessary to develop extensive amounts of software to manage redundant components, to undertake error detection in subsystems such as sensors and communications, and to carry out routine health monitoring and logging. The inevitable conclusion of a brief study of the expected system structures is that very large amounts of ultra-dependable software will be at the heart of future aviation systems. It is impossible to estimate the total volume of software that might be expected in a future commercial transport, but it is certain that the number of lines will be measured in the hundreds of millions. Not all of that software will be flight crucial, but much of it will be.
3 Software Challenges
The development of software for future aviation applications will require that many technical challenges be addressed. Most of these challenges derive from the required dependability goal and the approaches that might be used to meet it. An important aspect of the goal is assurance that the goal is met. In this section, six of the most prominent challenges are reviewed. These six challenges are:
• Requirements Specification. Erroneous specification is a major source of defects and subsequent failures of safety-critical systems. Many failures occur in systems whose software is perfect; it is simply not the software that is needed, because the specification is defective. Vast amounts of research have been conducted in specification technology, but errors in specifications continue to occur. It is clear that the formal languages which have been developed offer tremendous advantages, yet they are rarely used even for the development of safety-critical software.
• Verification. Verification is a complex process. Testing remains the dominant approach to verification, but testing is able to provide assurance only in the very simplest of systems. It has been shown that it is impossible to assess ultra-high dependability using testing in a manner reminiscent of statistical sampling, a process known as life testing [19, 20]; a rough illustration of the scale involved is sketched after this list. The only viable alternative is to use formal verification, and case studies in the use of formal verification have been quite successful. However, formal verification presently has many limitations, for example in handling floating-point arithmetic and concurrent systems, that preclude its
comprehensive and routine use in aviation systems. In addition, formal verification is usually applied to a relatively high-level representation of the program, such as a high-level programming language. Thus it depends upon a comprehensive formal semantic definition of the representation and an independent verification of the process that translates the high-level representation to the final binary form.
• Application Scale. Building the number of ultra-dependable systems that will be required in future aviation systems will not be possible with present levels of productivity. The cost of developing a flight-crucial software system is extremely high because large amounts of human effort are employed. Far better synthesis and analysis tools and techniques are required that provide the ability to develop safety-critical software having the requisite dependability with far less effort.
• Commercial Off-The-Shelf Components. The use of commercial off-the-shelf (COTS) components as a means of reducing costs is attractive in all software systems. COTS components are used routinely in many application domains, and the result is a wide variety of inexpensive components with impressive functionality, including operating systems, compilers, graphics systems and network services. In aviation systems, COTS components could be used in a variety of ways were it not for the issue of dependability. If an aviation system is to meet the required dependability goals, it is necessary to base any dependability argument on extensive knowledge of everything used in building the system. This knowledge must include knowledge of the system itself as well as of all components in the environment that are used to produce the final binary form of the software. COTS components, no matter what their source, are built for a mass market. As such, they are not built to meet the requirements of ultra-dependable applications; they are built to meet the requirements of the mass market. Making the situation worse, COTS components are sold in binary form only. The source code and details of the development process used in creating a COTS component are rarely available, and even if they are available, they usually reflect a development process that does not have the rigor necessary for ultra-dependable applications. If COTS components are to be useful in safety-critical aviation applications, it will be necessary to develop techniques that permit complete assurance that defects in the COTS components cannot affect safety.
• Development Cost and Schedule Management. Managing the development of major software systems and estimating the cost of that development have always been difficult, but they appear to be especially difficult for aviation systems. Development of the WAAS system, for example, was originally estimated to cost $892.4M, but the current program cost estimate is $2,900M. Deployment of WAAS was originally expected to begin in 1998 and finish in 2001; the current schedule is to start deployment in 2003, and no date for completion has been projected. WAAS is not an isolated
example [21]. The need to develop many systems of the complexity of WAAS indicates that success will depend on vastly improved cost estimation and project management.
• System Security. Many future aviation systems will be faced with the possibility of external threats. Unless a system is entirely self-contained, any external digital interface represents an opportunity for an adversary to attack the system. It is not necessary for an adversary to have physical access: of necessity, many systems will communicate by radio, and digital radio links present significant opportunities for unauthorized access. Present critical networks are notoriously lacking in security, and this problem must be dealt with for aviation systems. Even something as simple as a denial-of-service attack, effected by swamping data links or by jamming radio links, could have serious consequences if the target were a component of the air-traffic network. Far worse is the prospect of intelligent tampering with the network so as to disrupt service. Dealing with tampering requires effective authentication. Again, this is not a solved problem, and it must be dealt with if aviation systems are to be trustworthy.
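As a rough, back-of-the-envelope illustration of the life-testing problem mentioned under Verification (the calculation is ours, not the paper's), demonstrating a failure rate of 10^-10 per hour by testing alone requires on the order of the reciprocal of that rate in failure-free operating hours:

T_{\mathrm{test}} \gtrsim \frac{1}{\lambda} = \frac{1}{10^{-10}\ \mathrm{failures/hour}} = 10^{10}\ \mathrm{hours} \approx 1.1 \times 10^{6}\ \mathrm{years}.

Even with many copies under test in parallel, the required exposure is far beyond anything achievable, which is the essence of the infeasibility argument of Finelli and Butler [19].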
4 Summary
The application of computers in aviation systems is increasing, and so is the range of applications being developed. If the requisite productivity and dependability goals for these systems are to be met, significant new technology will be required. Further details about many aspects of aviation in general, and safety in particular, can be found in many sources, including the Federal Aviation Administration [22], the National Transportation Safety Board [23], NASA's Aviation Safety program [8], NASA's Aviation Safety Reporting System [24], Honeywell International, Inc. [25], and Rockwell Collins, Inc. [26].
Acknowledgments It is a pleasure to thank Ms. Kelly Hayhurst of NASA’s Langley Research Center for her suggestions for the content of this paper. This work was supported in part by NASA under grant number NAG-1-2290.
References
1. Automatic Dependant Surveillance – Broadcast (ADS-B) System. http://www.ads-b.com
2. Yeh, Y.C.: Design Considerations in Boeing 777 Fly-By-Wire Computers. 3rd IEEE International High-Assurance Systems Engineering Symposium (1998)
3. RTCA Incorporated: Software Considerations in Airborne Systems and Equipment Certification. RTCA document number RTCA/DO-178B (1992)
4. Swenson, E.H.: Into The Glass Cockpit. Navy Aviation News (May-June, 1998) http://www.history.navy.mil/nan/1998/0598/cockpit.pdf
5. Inside The Glass Cockpit: IEEE Spectrum http://www.spectrum.ieee.org/publicaccess/0995ckpt.html
6. Federal Aviation Administration: Welcome to Free Flight http://ffp1.faa.gov/home/home.asp
7. Federal Aviation Administration: Wide Area Augmentation System http://gps.faa.gov/Programs/WAAS/waas.htm
8. NASA Aviation Safety Program, http://avsp.larc.nasa.gov/
9. Aviation Week and Space Technology, Industry Outlook (January 15, 2001)
10. Aviation Week and Space Technology, Industry Outlook (November 27, 2000)
11. Aviation Week and Space Technology, Industry Outlook (July 17, 2000)
12. Bateman, Donald: CFIT Accident Statistics. Honeywell International Incorporated, http://www.egpws.com/general_information/cfitstats.htm
13. Aviation Week and Space Technology, Industry Outlook (June 26, 2000)
14. Aviation Week and Space Technology (July 2001)
15. Miller, S., personal communication (2002)
16. ARINC Engineering Services LLC: Aircraft Condition Analysis and Management System. http://avsp.larc.nasa.gov/images_saap_ACAMSdemo.html
17. Munns, T.E. et al.: Health Monitoring for Airframe Structural Characterization. NASA Contractor Report 2002-211428, February 2002
18. Young, S.D., Jones, D.R.: Runway Incursion Prevention: A Technology Solution. Flight Safety Foundation's 54th Annual International Air Safety Seminar, the International Federation of Airworthiness' 31st International Conference, Athens, Greece (November 2001)
19. Finelli, G.B., Butler, R.W.: The Infeasibility of Quantifying the Reliability of Life-Critical Real-Time Software. IEEE Transactions on Software Engineering, pp. 3-12 (January 1993)
20. Ammann, P.A., Brilliant, S.S., Knight, J.C.: The Effect of Imperfect Error Detection on Reliability Assessment via Life Testing. IEEE Transactions on Software Engineering, pp. 142-148 (February 1994)
21. U.S. Department of Transportation, memorandum from the Inspector General to various addressees: Status of Federal Aviation Administration's Major Acquisitions. (February 22, 2002) http://www.oig.dot.gov/show_pdf.php?id=701
22. Federal Aviation Administration, http://www.faa.gov
23. National Transportation Safety Board, http://www.ntsb.gov
24. NASA Aviation Safety Reporting System, http://asrs.arc.nasa.gov/
25. Honeywell International Incorporated: Enhanced Ground Proximity Warning Systems http://www.egpws.com/
26. Rockwell Collins Incorporated. http://www.rockwellcollins.com
A Strategy for Improving the Efficiency of Procedure Verification Wenhui Zhang Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences P.O.Box 8718, 100080 Beijing, China
[email protected]
Abstract. Verification of operating procedures by model checking has been discussed in [11, 12]. As an execution of a procedure may affect or be affected by many processes, a model of the procedure with its related processes can be very large. We modify the procedure verification approach of [11, 12] by introducing two strategies that make use of detailed knowledge of procedures in order to reduce the complexity of model checking. A case study demonstrates the potential advantages of the strategies and shows that they may improve the efficiency of procedure verification significantly and therefore scale up the applicability of the verification approach.
1 Introduction
Operating procedures are documents telling operators what to do in various situations. They are widely used in process industries including the nuclear power industry. The correctness of such procedures is of great importance to the safe operation of power plants [7, 8]. Verification of operating procedures by model checking has been discussed in [11, 12]. For the verification of a procedure, the basic approach is to formalize the procedure specification, formulate logic formulas (or assertions) for correctness requirements (such as invariants and goals), create an abstract model of relevant plant processes, and specify a set of the possible initial states of the plant. In order to obtain a compact model, we may use different techniques including cone of influence reduction [1], semantic minimization [10], state information compression [4], and other types of abstraction techniques [2, 9]. After we have created these specifications, we use a model checker to verify the specification against the logical formulas, the plant processes and the initial states. As an execution of a procedure may affect or be affected by many processes, a model of the procedure with its related processes could be very large. It is necessary to have strategies for reducing the complexity in order to scale up the applicability of model checking. In this paper, we introduce two strategies that make use of detailed knowledge of procedures in order to reduce the complexity of model checking. Let the model be a parallel composition of the processes P, S1, ..., Sn where P is the model of the procedure and S1, ..., Sn are the related processes referred to as the environment
processes. Unlike models where all processes are equally important, in models of this type we are only interested in the correctness of P. This simplifies the verification task, and we can take advantage of this fact by using specialized strategies for procedure verification. Formally, the problem could be stated as P || S1 || · · · || Sn |= ϕ, where we are interested in the correctness of P with respect to ϕ in the environment consisting of the processes S1, ..., Sn running in parallel with P. The two strategies (where one is a modification of the other) for increasing the efficiency of the verification of such problems are presented in Section 2. In Section 3, we propose a modification of the procedure verification approach presented in [11, 12] and present a case study to demonstrate the potential advantages of the strategies. In Section 4, we present a discussion of the strategies and the conclusion that the strategies may improve the efficiency of procedure verification significantly and scale up the applicability of the procedure verification approach presented in [11, 12].
2 Verification Strategies
Let T be a system and x be the global variable array of T. The system is in the state v if the value of x at the current moment is v. A trace of T is a sequence of states. The property of such a trace can be specified by propositional linear temporal logic (PLTL) formulas [3].
– ϕ is a PLTL formula if ϕ is of the form z = w, where z ∈ x and w is a value.
– Logical connectives of PLTL include ¬, ∧, ∨ and →. If ϕ and ψ are PLTL formulas, then so are ¬ϕ, ϕ ∧ ψ, ϕ ∨ ψ, and ϕ → ψ.
– Temporal operators include X (next-time), U (until), <> (future) and [] (always). If ϕ and ψ are PLTL formulas, then so are X ϕ, ϕ U ψ, <>ϕ, and []ϕ.
Let t be a trace of T. Let HEAD(t) be the first element of t and TAIL_i(t) the trace constructed from t by removing the first i elements of t. For convenience, we write TAIL(t) for TAIL_1(t). The relation "t satisfies ϕ", written t |= ϕ, is defined as follows:
t |= x = v     iff  the statement x = v is true in HEAD(t).
t |= ¬ϕ        iff  it is not the case that t |= ϕ.
t |= ϕ ∧ ψ     iff  t |= ϕ and t |= ψ.
t |= ϕ ∨ ψ     iff  t |= ϕ or t |= ψ.
t |= ϕ → ψ     iff  t |= ϕ implies t |= ψ.
t |= X ϕ       iff  TAIL(t) |= ϕ.
t |= ϕ U ψ     iff  there exists k such that TAIL_k(t) |= ψ and TAIL_i(t) |= ϕ for 0 ≤ i < k.
t |= <>ϕ       iff  there exists k such that TAIL_k(t) |= ϕ.
t |= []ϕ       iff  t |= ϕ and TAIL(t) |= []ϕ.
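To make the satisfaction relation concrete, here is a small illustrative Python sketch (ours, not part of the paper; it treats a trace as a finite list of states, so [] and <> are only finite-trace approximations of the semantics above):

# Illustrative PLTL evaluation over a *finite* trace of states.
# A state is a dict mapping variable names to values; a formula is a nested tuple.

def tail(trace, i=1):
    return trace[i:]

def sat(trace, f):
    if not trace:
        return False
    op = f[0]
    if op == 'atom':            # ('atom', x, v) encodes the formula x = v
        _, x, v = f
        return trace[0].get(x) == v
    if op == 'not':
        return not sat(trace, f[1])
    if op == 'and':
        return sat(trace, f[1]) and sat(trace, f[2])
    if op == 'or':
        return sat(trace, f[1]) or sat(trace, f[2])
    if op == 'implies':
        return (not sat(trace, f[1])) or sat(trace, f[2])
    if op == 'X':
        return sat(tail(trace), f[1])
    if op == 'U':               # ('U', phi, psi)
        return any(sat(tail(trace, k), f[2]) and
                   all(sat(tail(trace, i), f[1]) for i in range(k))
                   for k in range(len(trace)))
    if op == 'F':               # <> (future)
        return any(sat(tail(trace, k), f[1]) for k in range(len(trace)))
    if op == 'G':               # [] (always)
        return all(sat(tail(trace, k), f[1]) for k in range(len(trace)))
    raise ValueError(op)

# Example: [](p -> <>q) on a short trace.
trace = [{'p': 1, 'q': 0}, {'p': 0, 'q': 1}, {'p': 0, 'q': 0}]
f = ('G', ('implies', ('atom', 'p', 1), ('F', ('atom', 'q', 1))))
print(sat(trace, f))   # True for this trace

A model checker such as Spin evaluates properties over all computations of the model rather than over a single finite trace, but the clause-by-clause structure mirrors the definition above.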
Let T be the set of the traces of T and ϕ be a propositional linear temporal logic formula. T satisfies ϕ, written T |= ϕ, if and only if ∀t ∈ T. (t |= ϕ). Let e(T) be a modification of T which extends the global variable array of T and adds, to the processes of T, assignment statements of the form y = expr where y is one of the added variables. Let ϕ(x) be an X-free (X for next-time) propositional linear temporal logic formula that only involves variables of x. The following rule is sound:
    e(T) |= ϕ(x)
    -------------    (R1)
    T |= ϕ(x)
The set of traces of T can be constructed from that of e(T) by deleting the variables not in x, and their respective values, from the states of e(T), and deleting (some of) the stuttering states. These actions do not affect the validity of ϕ(x), since the formula does not involve variables outside x and does not involve the temporal operator X. The condition that ϕ(x) be an X-free formula can be relaxed if the additional assignments can be attached to existing statements with atomic constructions. The intention is to verify e(T) |= ϕ(x) instead of T |= ϕ(x). However, additional effort is necessary in order to get benefit out of this construction, because the problem e(T) |= ϕ(x) is more complicated than T |= ϕ(x), as there are additional variables and additional assignment statements. The idea is therefore to utilize the additional variables in e(T) to add procedure knowledge in order to improve the efficiency of model checking. We consider two verification strategies:
– The first, referred to as STR-1 reduction, utilizes knowledge of the process P in order to discard traces with irrelevant execution statements.
– The second, referred to as STR-2 reduction, is a modification of the first strategy in which the knowledge is inserted into the environment processes.
2.1 STR-1 Reduction
This strategy is based on deduction rule (R1). Instead of verifying P || S1 || · · · || Sn |= ϕ we consider two steps:
1. Extend P to e(P). The problem to be verified after this step is: e(P) || S1 || · · · || Sn |= ϕ. According to the deduction rule, a verification of e(P) || S1 || · · · || Sn |= ϕ is also a verification of P || S1 || · · · || Sn |= ϕ.
2. Write a formula ψ to represent some useful knowledge of the procedure execution to be used to improve the efficiency of model checking. The problem to be verified after this step is: e(P) || S1 || · · · || Sn |= ψ → ϕ. The purpose is to use ψ to discard (or eliminate at an early point of the model checking process) traces that do not affect the validity of ϕ (provided that the procedure knowledge is correctly represented).
Example: Consider an example where we have two processes S1 and S2 specified in Promela (the process meta-language provided with the model checker Spin [5, 6]) as follows:

proctype S1() {
  byte i, k, num;
  atomic{ c0?num; c0!num; }
  do
  :: k < num  -> i = i+1; k = k+i;
  :: k > num  -> c1!0;
  :: k == num -> c1!1;
  od;
}

proctype S2() {
  byte i, num;
  atomic{ c0?num; c0!num; }
  do
  :: i*i < num  -> i = i+1;
  :: i*i > num  -> c2!0;
  :: i*i == num -> c2!1;
  od;
}

S1 is a process that reads a value from the channel c0 and tests whether it equals 1 + 2 + · · · + k for some k. It reports through the channel c1, with 1 meaning that it has found a k such that n1 = 1 + 2 + · · · + k. S2 is a process that reads a value and tests whether it equals k^2 for some k. It reports through the channel c2 in a similar manner. Suppose that we have a procedure which first puts a number in the channel c0 and then uses S1 or S2 for performing a test to determine the property of the number. Suppose further that the choice of using S1 or S2 is determined by an input from the environment, modeled by the process E as follows.
proctype P() {
  c0!n1;
  e0?a1;
  if
  :: a1==1; do :: c1?r1; od;
  :: a1==2; do :: c2?r1; od;
  fi;
}

proctype E() {
  if
  :: e0!1;
  :: e0!2;
  fi;
}

The process E randomly puts 1 or 2 into e0. The process P puts the number n1 into c0, reads from e0, and gets the reported values through channel c1 or c2 according to the input from e0. Assume that we want to verify that there is no k such that n1 = 1 + 2 + · · · + k or n1 = k^2, and we represent the property by ϕ : [](r1 ≠ 1), i.e., we want to verify: P || E || S1 || S2 |= ϕ. The knowledge we gained from the construction of the Promela model tells us that Si (with i = 1, 2 respectively) is not relevant with respect to the execution of P unless P executes along the path with a1==i. We apply the STR-1 reduction strategy and specify P, i.e. e(P), as follows (the changes are the statements that assign to b):

proctype P() {
  c0!n1;
  e0?a1;
  if
  :: atomic{ a1==1; b = 1 }; do :: c1?r1; od;
  :: atomic{ a1==2; b = 2 }; do :: c2?r1; od;
  fi;
}

In this specification, b is a new global variable used to hold the path information. Let ψ be (b ≠ 2 → last ≠ 2) ∧ (b ≠ 1 → last ≠ 1), where last ≠ i means that process i (i.e. Si in the above Promela specification) is not the last process that has just executed an action. We verify P || E || S1 || S2 |= []ψ → ϕ instead of P || E || S1 || S2 |= ϕ. As there are additional conditional statements, the number of relevant traces is reduced. For n1 = 2, the number of transitions in model checking using this strategy is 228; the number of transitions in model checking without using this strategy is 506 (with Spin 3.4.2). Note that the strategy is not compatible with partial order reduction, and the model checked with the strategy was therefore compiled with the option -DNOREDUCE, while the run without the strategy did not use this option.
Remarks: The use of the formula ψ is not compatible with fairness constraints (with the implication that one cannot exclude any executable process). In the next subsection, we introduce a modified version which does not suffer from this limitation.
2.2 STR-2 Reduction
In this approach, we code the knowledge into the processes instead of adding it to the formula to be proved. For instance, the knowledge [](b ≠ 1 → last ≠ 1) can be coded into process S1 by inserting the guard b == 1 at appropriate places. So instead of verifying P || S1 || · · · || Sn |= ϕ we verify
e(P) || S1' || · · · || Sn' |= ϕ
where S1', ..., Sn' are modifications of, respectively, S1, ..., Sn obtained by adding guards to appropriate statements. The soundness relies on the analysis of the main process and the environment processes, and on whether one puts the guards in correctly.
Example: Let E, P, S1 and S2 be as previously specified. To use the STR-2 reduction strategy we specify S1' and S2' as follows:

proctype S1() {
  byte i, k, num;
  b == 1;
  atomic{ c0?num; c0!num; }
  do
  :: k < num  -> i = i+1; k = k+i;
  :: k > num  -> c1!0;
  :: k == num -> c1!1;
  od;
}

proctype S2() {
  byte i, num;
  b == 2;
  atomic{ c0?num; c0!num; }
  do
  :: i*i < num  -> i = i+1;
  :: i*i > num  -> c2!0;
  :: i*i == num -> c2!1;
  od;
}

The problem P || E || S1 || S2 |= ϕ is now reduced to P || E || S1' || S2' |= ϕ. As there are additional conditional statements in S1' and S2', the number of potential execution paths of S1' and S2' is smaller than that of S1 and S2. For n1 = 2, the number of transitions in model checking using this strategy is 38.
2.3 Summary
The following table presents the number of transitions, the number of states and the peak memory usage in model checking the example specification for n1 = 2 with the two reduction strategies:

Strategy           States  Transitions  Memory
No Strategy           396          506  1.493mb
STR-1 Reduction       146          228  1.493mb
STR-2 Reduction        34           38  1.493mb
The use of the reduction strategies can significantly reduce the model checking complexity (however, in this example, no advantage with respect to memory usage has been achieved, because the example is too simple). The applicability of the strategies depends on the property to be proved as well as on the structure of the specification; in particular, the first strategy is not compatible with fairness constraints. The idea of simplifying proofs by extending the proposition to be proved has long been an important principle in automated reasoning. We are applying this idea to the verification of procedures. In this particular application, the applicability of this idea is restricted to properties where fairness is not required, and it is therefore necessary to modify the idea in order to deal with properties with fairness constraints, i.e., instead of extending the proposition, we code the procedure knowledge (i.e., add appropriate guards) into the environment processes.
3 Application – A Case Study
An approach to the verification of operating procedures has been discussed in [11, 12]. In the light of the above discussion, we modify the approach to include the following steps:
– create a process representing the procedure;
– create abstract processes representing the plant processes;
– create processes for modeling the interaction between the procedure process and the plant processes;
– create a process for modeling the initial state of the plant;
– formulate and formalize the correctness requirements of the procedure;
– analyze different paths of the procedure and determine the relevant plant processes with respect to the paths, in order to use the proposed strategies;
– verify the procedure by model checking.
The potential benefits of the strategies are illustrated by a case study, which is the verification of an operating procedure with seeded errors. The operating procedure considered here is "PROCEDURE D-YB-001 — The steam generator control valve opens and remains open" [11]. It is a disturbance procedure to be applied when one of the steam generator control valves (located on the feed water side) opens and remains open.
Description of the Procedure: The primary goal of the procedure is to check the control valves (there are 4 such valves), identify and repair the valve that has a problem (i.e. opens and remains open). After the defective valve is identified, the basic action sequence of the operator is as follows: Start isolating the steam generator. Manipulate the primary loop. Continue isolation of the steam generator. Repair the steam generator control valve and prepare to increase power. Increase the power.
There are 93 instructions (each of the instructions consists of one or several actions) involving around 40 plant units including 4 control valves, 4 cycling valves, 4 steam level meters, 4 pumps, 4 pump speed meters and 4 protection signals.
Seeded Errors: In order to demonstrate the benefits (with respect to model checking times) of the reduction strategies for detection of errors and for verification, we seed 5 errors into the procedure. These seeded errors are as follows:
1. A wrong condition in a wait-statement - leading to a fail stop.
2. A wrong condition in a conditional-jump - leading to a loop.
3. A missing instruction - leading to a loop.
4. A wrong reference - leading to an unreachable instruction.
5. A missing instruction - leading to an unreached goal.
Creating Models: In this step, we create the model of the procedure and the related processes. The model in this case study consists of 19 processes in total:
– 1 procedure process for modeling the procedure (with the 5 seeded errors).
– 14 simple plant processes: 1 for each of the 4 cycling valves, 1 for each of the 4 pump speed meters, 1 for each of the 4 steam level meters, 1 for modeling notification of the supervisor, and 1 for modeling opening and closing valves with protection signals (the consequence is that a valve may not be opened, although an instruction for opening the valve is executed).
– 4 other processes: 3 processes for modeling the interactions (for dealing with the 3 main types of procedure elements: checking the value of a system state, waiting for a system state, and performing an action) between the procedure process and the plant processes, and 1 initialization process for choosing an initial state from the set of possible initial states of the plant.
Formulating Correctness Requirements: For the purpose of this paper, one correctness requirement is sufficient, and it is as follows: "Every execution of the procedure terminates and upon the completion of the execution, all control valves (identified by the names: RL33S002, RL35S002, RL72S002 and RL74S002) are normal or the supervisor is notified". For verification, correctness requirements are specified using propositional linear temporal logic, and we specify the given requirement by the following formulas (referred to as ϕ0 and ϕ1, respectively):
ϕ0:  [] (ProcedureCompleted==Yes ->
         ((Valve[RL33S002]==Normal && Valve[RL35S002]==Normal &&
           Valve[RL72S002]==Normal && Valve[RL74S002]==Normal)
          || Supervisor==Notified))

ϕ1:  <> (ProcedureCompleted==Yes)
where ProcedureCompleted is a variable to be assigned the value "Yes" right before the end of an execution of the procedure, RL33S002, RL35S002, RL72S002 and RL74S002 are identifiers of the control valves which may have values of the type {"Normal", "Defective"}, and "Valve" is an array that maps a valve identifier to the current state of the valve.
Analyzing Procedure Paths: In this step, we analyze the procedure and the plant processes in order to group different execution paths and find out the plant processes relevant to the different paths. The procedure starts with identifying the symptoms of the problem and a defective valve. There are 5 branches of execution, depending on whether the valve RL33S002, RL35S002, RL72S002 or RL74S002 is defective, or none of them is defective. The plant process for opening and closing of valves is relevant for all of the executions except in the case where no valves are defective. The other relevant plant processes for the 5 branches are as follows:
– None of the valves is defective: In this case, none of the other processes is relevant.
– The valve RL33S002 is defective: 1 process for the relevant (1 of 4) pump speed meter, 1 process for the relevant (1 of 4) steam level meter and 1 process for the relevant (1 of 4) cycling valve.
– The cases where the valve RL35S002, RL72S002 or RL74S002 is defective are similar to the previous case.
3.1 STR-1 Reduction
We modified the procedure process by adding the variable i and assigning the path information to i at the entrance of each of the 4 main branches (i.e. except the branch dealing with the case where no valves are defective). Let b(i) be the proposition representing that the procedure process is executing in the i-th (i ∈ {1, 2, 3, 4}) branch. Let ps(i), sl(i) and cv(i) represent respectively the i-th pump speed process, the i-th steam level process and the i-th cycling valve process. We define ψ as the conjunction of the following set of formulas: {¬b(k) → last ≠ ps(k) ∧ last ≠ sl(k) ∧ last ≠ cv(k) | k ∈ {1, 2, 3, 4}}. In the verification, the test results were generated by Spin 3.4.2. The option for the verification of ϕ0 is -a, for checking acceptance cycles. The options for the verification of ϕ1 are -a and -f, for checking acceptance cycles with weak fairness. The models were compiled with the option -DNOREDUCE. The error detection and verification steps using this approach were as follows:
– Instead of verifying ϕ0, we verify []ψ → ϕ0.
– The verification detects errors 5 and 4, with model checking times of 0.4 and 9.9 seconds, respectively.
– The model checking time for re-checking []ψ → ϕ0 is 10.2 seconds.
The subgoal ϕ1 is not verified here, because the verification requires the weak fairness constraint and we cannot use STR-1 reduction (i.e. verifying []ψ → ϕ1 instead of ϕ1).
3.2 STR-2 Reduction
We modified the processes ps(i), sl(i), cv(i) for i = 1, 2, 3, 4 by coding the knowledge represented by the formula []ψ into the plant processes (in other words, by putting the condition b(i) as a guard on appropriate statements) and tried to verify whether ϕ0 and ϕ1 hold. The error detection and verification steps using this approach were as follows:
– Verification of ϕ0 detects errors 5 and 4, with model checking times of 0.3 and 1.4 seconds, respectively. The time for re-checking ϕ0 is 1.5 seconds.
– Verification of ϕ1 detects errors 2, 3 and 1, with model checking times of 1.3, 1.4 and 1.4 seconds, respectively. The time for re-checking ϕ1 is 80.4 seconds.
– After that, the formula ϕ0 was re-checked, with a model checking time of 1.8 seconds.
3.3 Summary
The following table sums up the model checking times for the different verification tasks with the different verification strategies. For simplicity, the re-checking of ϕ0 after the subtask of checking ϕ1 is not shown in the table. For comparison, we have also carried out the verification without using the proposed strategies (the data is presented in the column marked "No Strategy").

Task  Error  No Strategy  Strategy 1  Strategy 2
ϕ0      5          2.6         0.4         0.3
        4        207.4         9.9         1.4
        0        214.1        10.2         1.5
ϕ1      2          4.4                     1.3
        3         27.5                     1.4
        1          5.6                     1.4
        0       7134.5                    80.4

The first column gives the two verification subtasks. The second column gives the numbers of the errors detected in the verification; the item "0" means that no errors were detected at that point of the verification. The third column gives the model checking times (all in seconds) with the original verification approach, the fourth column the model checking times with STR-1 reduction, and the fifth column the model checking times with STR-2 reduction. This table shows that the use of STR-1 reduction can significantly reduce the model checking time (compared with the verification where no strategy was used) in the task where STR-1 reduction is applicable. STR-2 reduction is better than STR-1 reduction in the
case study. STR-2 reduction also has the advantage that it can be applied to verify properties that need fairness constraints. For further comparison, the number of visited states, the number of transitions, and the memory usage for the verification of ϕ0 and ϕ1 when no errors were found are presented in the following table (with k for thousand and mb for megabyte). The data in the table are also clearly in favor of the proposed reduction strategies.

Task  Type of Data   No Strategy  Strategy 1  Strategy 2
ϕ0    States             117.6k       21.2k        2.1k
      Transitions       1450.1k       60.9k        7.2k
      Memory              19.1mb       6.1mb       3.4mb
ϕ1    States             134.8k                    2.5k
      Transitions      57706.2k                  482.3k
      Memory              39.8mb                  20.7mb

As far as this case study is concerned, the use of the strategies is not time-saving if one takes into account the time used for the preparation and analysis of the problem. However, the main purpose of the strategies is not to save a few hours of model checking time; it is to scale up the applicability of model checking, such that problems that are originally infeasible for model checking may become feasible with the strategies. For instance, we may have problems that cannot be solved within the available memory using the original approach, but can be solved by using the proposed strategies.
4 Discussion and Concluding Remarks
We have modified the procedure verification approach of [11, 12] by adding a step for analyzing the paths of procedures and using the proposed strategies. This has improved the efficiency of procedure verification significantly and has therefore scaled up the applicability of the verification approach. The strategies are suitable for procedure verification (though they may not be easily generalizable to other types of models), because procedures normally have many paths and the relevance of the environment processes can differ from path to path. The strategies are based on an analysis of the main procedure process and the environment processes. After the analysis of the procedure paths, for STR-1 reduction, all we need to do is to register some path information in the main process and add procedure knowledge to the property we want to verify. The procedure knowledge considered in this paper is simple: it is of the form "if a condition is met, then the execution of a given process is irrelevant" and is easy to formulate. For the modified version, instead of adding the procedure knowledge to the property, we add it to the environment processes. The reliability of the strategies depends on a correct analysis of the main process and the environment processes. For STR-1 reduction, one problem is the complexity of the representation of the procedure knowledge. The conversion of a formula (such as []ψ → ϕ0) to the
corresponding never-claim (using the converter provided by Spin) may require much time and memory when the formula is long. The second problem is that it is not compatible with fairness constraints. For STR-2 reduction, there are no such problems. On the other hand, the additional complexity of the modified version lies in the modification of the environment processes, which may be a greater source of errors than adding the formula []ψ. The strategies are flexible and can be combined with other complexity reduction approaches that do not affect the execution structure of the processes. We may use some of the techniques mentioned in the introduction (e.g. various abstraction techniques) to reduce the complexity of a model and then use the strategies (in the cases where they are applicable) to further reduce the model checking time.
Acknowledgment The author thanks Huimin Lin, Terje Sivertsen and Jian Zhang for reading an earlier draft of this paper and providing many helpful suggestions and comments. The author also thanks the anonymous referees for their constructive criticism, which helped improve this paper.
References
[1] S. Berezin, S. Campos and E. M. Clarke. Compositional Reasoning in Model Checking. Proceedings of COMPOS'97. Lecture Notes in Computer Science 1536: 81-102, 1998.
[2] E. M. Clarke, O. Grumberg and D. E. Long. Model Checking and Abstraction. ACM Transactions on Programming Languages and Systems 16(5): 1512-1542, 1994.
[3] E. A. Emerson. Temporal and Modal Logic. Handbook of Theoretical Computer Science (B): 997-1072, 1990.
[4] J. Gregoire. Verification Model Reduction through Abstraction. Formal Design Techniques VII, 280-282, 1995.
[5] G. J. Holzmann. Design and Validation of Computer Protocols. Prentice Hall, New Jersey, 1991.
[6] G. J. Holzmann. The model checker Spin. IEEE Transactions on Software Engineering 23(5): 279-295, May 1997.
[7] J. G. Kemeny. Report of the President's Commission on the Accident at Three Mile Island. U. S. Government Accounting Office, 1979.
[8] N. G. Leveson. Software System Safety and Computers. Addison-Wesley Publishing Company, 1995.
[9] C. Loiseaux, S. Graf, J. Sifakis, A. Bouajjani and S. Bensalem. Property preserving abstractions for the verification of concurrent systems. Journal of Formal Methods in System Design 6: 1-35, 1995.
[10] V. Roy and R. de Simone. Auto/Autograph. In Computer Aided Verification. DIMACS Series in Discrete Mathematics and Theoretical Computer Science 3: 235-250, June 1990.
[11] W. Zhang. Model checking operator procedures. Lecture Notes in Computer Science 1680: 200-215. SPIN 1999, Toulouse, France.
[12] W. Zhang. Validation of control system specifications with abstract plant models. Lecture Notes in Computer Science 1943: 53-62. SAFECOMP 2000, Rotterdam, The Netherlands.
Verification of the SSL/TLS Protocol Using a Model Checkable Logic of Belief and Time
Massimo Benerecetti (1), Maurizio Panti (2), Luca Spalazzi (2), and Simone Tacconi (2)
(1) Dept. of Physics, University of Naples "Federico II", Napoli, Italy
(2) Istituto di Informatica, University of Ancona, Ancona, Italy
[email protected] {panti,spalazzi,tacconi}@inform.unian.it
Abstract. The paper shows how a model checkable logic of belief and time (MATL) can be used to check properties of security protocols employed in computer networks. In MATL, entities participating in protocols are modeled as concurrent processes able to have beliefs about other entities. The approach is applied to the verification of TLS, the Internet Standard protocol that the IETF derived from Netscape's SSL 3.0. The results of our analysis show that the protocol satisfies all the security requirements for which it was designed.
1 Introduction
In this paper we apply a symbolic model checker (called NuMAS [5, 2]) for a logic of belief and time to the verification of the TLS protocol. TLS is the Internet Standard protocol that the IETF derived [8] from Netscape's SSL 3.0. The verification methodology is based on our previous work [3, 4, 6]. The application of model checking to security protocol verification is not new (e.g., see [11, 13, 14]). However, in previous work, security protocols are verified by introducing the notion of an intruder and then verifying whether the intruder can attack a given protocol. This approach allows one to directly find a trace of a possible attack, but it may not make clear what the protocol flaw really is. This line of work usually employs temporal logics or process algebras. A different approach makes use of logics of belief or knowledge to specify and verify security protocols (see, e.g., [1, 7, 16]). The use of such logics requires no model of the intruder and allows one to identify what the protocol flaw is, making it possible to specify (and check) security properties in a more natural way. However, in this approach, verification is usually performed proof-theoretically. Our approach can be seen as a combination of the above two: we employ a logic called MATL (MultiAgent Temporal Logic), able to express both temporal aspects and beliefs (thus following the line of work based on logics of belief or knowledge, which does not use a model of the intruder); verification, on the other hand, is performed by means of a symbolic model checker (called NuMAS [2]). NuMAS is built on the work described in [5], where model checking is applied to the BDI attitudes (i.e., Belief, Desire, and Intention) of agents. Our work aims
at the use of MATL to model security protocols, and uses NuMAS for their verification. The paper is structured as follows. In Section 2 we briefly describe the SSL/TLS protocol. Section 3 provides a brief introduction to MATL. The use of MATL as a logic for security protocols is described in Section 4. Section 5 describes the formal specifications for the usual requirements of security protocols. The results of the verification of the SSL/TLS protocol are reported in Section 6. Finally, some conclusions are drawn in Section 7.
2 The SSL/TLS Protocol
The Secure Sockets Layer, commonly referred to as SSL, is a cryptographic protocol originally developed by Netscape in order to protect the traffic conveyed by HTTP applications and, potentially, by other types of applications. The first two versions of this protocol had several flaws that limited its application. Version 3.0 of the protocol [10] was published as an Internet Draft, and the efforts of the Internet Engineering Task Force (IETF) led to the definition of an Internet Standard protocol, named Transport Layer Security (TLS) [8]. As a consequence, the standard version is often mentioned under the name SSL/TLS. TLS is not a monolithic protocol, but has a two-layer architecture. The lower protocol, named TLS Record Protocol, is placed on top of a reliable transport protocol (i.e., TCP) and aims at guaranteeing the privacy and the integrity of the connection. The higher protocol, named TLS Handshake Protocol, is layered over the TLS Record Protocol and allows entities to mutually authenticate and agree on an encryption algorithm and a shared secret key. After the execution of this protocol, entities are able to exchange application data in a secure manner by means of the Record Protocol. The Handshake Protocol is the most complex and crucial part of TLS, since the achievement of the privacy and integrity requirements provided by the Record Protocol depends upon the cryptographic parameters negotiated during its execution. For this reason, we concentrate our analysis on the TLS Handshake Protocol. In order to reduce the high degree of complexity of the protocol, we conduct the formal analysis on an abstract version of TLS. This simplification allows us to leave the implementation details of the protocol out of the verification, so as to concentrate on its intrinsic design features. We based our verification on the description provided in [14], depicted in Figure 1. In the figure, C and S denote the client and server, respectively. {m}KX is a component m encrypted with the public key of the entity X, SX{m} is a component m signed by the entity X, and H(m) is the hash of the component m. Moreover, CertX is a digital certificate of the entity X, VerX the protocol version number used by the entity X, SuiteX the cryptographic suite preferred by the entity X, and NX a random number (nonce) generated by the entity X. Finally, SID is the session identifier, PMS is the so-called pre-master secret, Mex the concatenation of all messages exchanged up to this step, and KSC is the shared secret key agreed between the entities.
(1) C → S : C, VerC, SuiteC, NC, SID                               Client Hello
(2) S → C : S, VerS, SuiteS, NS, SID, CertS                        Server Hello
(3) C → S : CertC, {VerC, PMS}KS, SC{H(KSC, Mex)}                  Client Verify
(4) S → C : {H(S, KSC, Mex)}KSC                                    Client Finished
(5) C → S : {H(C, KSC, Mex)}KSC                                    Server Finished

Fig. 1. The simplified SSL/TLS handshake protocol
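To make the message notation above concrete, the following short Python sketch (purely illustrative, ours rather than part of the paper's formal model; the constructors enc, sig and h are symbolic placeholders, not real cryptography) builds the five abstract handshake messages of Fig. 1 as nested terms:

# Symbolic message constructors: labeled tuples, no real cryptography.
def enc(m, k):  return ('enc', m, k)     # {m}k  : m encrypted under key k
def sig(x, m):  return ('sig', x, m)     # Sx{m} : m signed by entity x
def h(*parts):  return ('hash', parts)   # H(...) : hash of the concatenation

C, S = 'C', 'S'
VerC, VerS, SuiteC, SuiteS = 'VerC', 'VerS', 'SuiteC', 'SuiteS'
NC, NS, SID, CertC, CertS = 'NC', 'NS', 'SID', 'CertC', 'CertS'
KS, KSC, PMS, Mex = 'KS', 'KSC', 'PMS', 'Mex'

handshake = [
    (C, S, (C, VerC, SuiteC, NC, SID)),                            # (1)
    (S, C, (S, VerS, SuiteS, NS, SID, CertS)),                     # (2)
    (C, S, (CertC, enc((VerC, PMS), KS), sig(C, h(KSC, Mex)))),    # (3)
    (S, C, enc(h(S, KSC, Mex), KSC)),                              # (4)
    (C, S, enc(h(C, KSC, Mex), KSC)),                              # (5)
]
for n, (sender, receiver, body) in enumerate(handshake, 1):
    print(f'({n}) {sender} -> {receiver} :', body)

Each tuple records only who sends what to whom, which is all that is needed to follow the notation used in the rest of the paper.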
3 A Brief Introduction to MATL
In this section we briefly introduce MATL [4, 5], a model checkable logic of belief and time. The intuitive idea underlying MATL is to model entities engaged in a protocol session as finite state processes. Suppose we have a set I of entities. Each entity is seen as a process having beliefs about (itself and) other entities. We adopt the usual syntax for beliefs: Bi φ means that entity i believes φ, and φ is a belief of i. Bi is the belief operator for i. The idea is then to associate to each (level of) nesting of belief operators a process evolving over time, each of which intuitively corresponds to a "view" of that process. View Structure. Let B = {B1, ..., Bn}, where each index 1, ..., n ∈ I corresponds to an entity. Let B* denote the set of finite strings of the form B1, ..., Bn with Bi ∈ B. We call any α ∈ B* a view. Each view in B* corresponds to a possible nesting of belief operators. We also allow for the empty string ε. The intuition is that ε represents the view of an external observer (e.g., the designer) which, from the outside, "sees" the behavior of the overall protocol. Example 1. Figure 2 depicts the structure of views for the SSL/TLS protocol. The beliefs of entity C are represented by view BC and are modeled by a process playing the client's role in the protocol. The beliefs that C has about (the behavior of) entity S are represented by view BC BS and are modeled by a process playing S's role in the protocol. Things work in the same way for any arbitrary nesting of belief operators. The beliefs of entity S (the view BS) are modeled similarly. Language. We associate a language Lα to each view α ∈ B*. Intuitively, each Lα is the language used to express what is true (and false) about the process of view α. We employ the Computational Tree Logic (CTL) [9], a well-known propositional branching-time temporal logic widely used in formal verification. For each α, let Pα be the set of propositional atoms (called local atoms), expressing the atomic properties of the process α. Each Pα allows for the definition of a different language, called a MATL language (on Pα). A MATL language Lα on Pα is the smallest CTL language containing the set of local atoms Pα and the belief atoms Bi φ, for any formula φ of LαBi. In particular, Lε is used to speak about the whole protocol. The language LBi (LBj) is the language adopted to represent i's (j's) beliefs. i's beliefs about j's beliefs are specified by the language of the view Bi Bj. Given a family {Pα} of sets of local atoms, the family of MATL
Fig. 2. The structure of views for the SSL/TLS protocol and the proposition C believes S sees X

We write α : φ (called a labeled formula) to mean that φ is a formula of Lα. For instance, the formula AG (p → B_i ¬q) ∈ Lε (denoted by ε : AG (p → B_i ¬q)) intuitively means that in every future state (the CTL operator AG), if p is true then entity i believes that q is false. In order to employ MATL as a logic suitable for security protocols, we need to define appropriate sets of local atoms Pα, one for each process α. First of all, a logic for security protocols has propositions about which messages a generic entity P sends or receives. Therefore, we introduce atoms of the form rec X and send_Q X (where X can be a full message of a given protocol but not a fragment of a message), which represent P's communicative acts of receiving X and sending X to Q, respectively. This means that we need to introduce the local atoms rec X and send_Q X in P_{B_P} (see Figure 2). Furthermore, we introduce the notions of seeing and saying (fragments of) messages by means of local atoms such as sees X and said X. This allows us to take into account the temporal aspects of a protocol. Indeed, rec and send represent the acts of receiving and sending a message during a session; they allow us to capture the occurrence of those events by looking at the sequence of states. This is different from the notion of the fragments of messages that an entity has (atom sees) or uses when composing its messages (atom said). The atom sees represents both the notion of possessing (what an entity has because it is initially available or newly generated) and seeing (what has been obtained from a message sent by another entity). A logic for security protocols also has local atoms of the form fresh(X), expressing the freshness of the fragment/message X; the intuitive meaning is that X has been generated during the current protocol session. Local atoms of the form pubk_P K and prik_P K^{-1} mean that K is the public key of P and K^{-1} the corresponding private key; they can be directly added as local atoms to the MATL languages. Finally, a logic for security protocols also has local atoms such as P says X, expressing that entity P has sent X recently. This can be expressed in MATL by the formula B_P says X.
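To make the view structure and the local atoms more concrete, the sketch below (our own illustration, not part of MATL or NuMAS) enumerates the views of B* up to nesting depth two for the two entities and attaches a few local atoms, echoing Example 2, to the first-level views.

```python
from itertools import product

ENTITIES = ["C", "S"]

def views(max_depth):
    """All views in B* up to a given nesting depth; "" stands for the external view epsilon."""
    result = [""]
    for depth in range(1, max_depth + 1):
        for combo in product(ENTITIES, repeat=depth):
            result.append("".join("B" + e for e in combo))
    return result

# Hypothetical local atoms for the first-level views (see Example 2 below).
local_atoms = {
    "BC": ["said SID", "fresh SID", "fresh NC", "sees NS", "pubk_S KS"],
    "BS": ["sees SID", "said NS", "prik_S KS^-1", "sees {VerC,PMS}KS"],
}

for v in views(2):
    label = v if v else "epsilon"
    print(f"view {label:8s} atoms: {local_atoms.get(v, ['...'])}")
```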
Example 2. We can set the atoms Pα for the views B_C and B_S as follows:

P_{B_C} = { said SID, sees NS, said {VerC, PMS}KS, S_C{H(KSC, Mex)}, fresh {H(S, KSC, Mex)}KSC, fresh SID, fresh NC, pubk_S KS, ... }

P_{B_S} = { sees SID, said NS, prik_S KS^{-1}, sees {VerC, PMS}KS, S_C{H(KSC, Mex)}, ... }
For instance, the atom said SID in view B_C represents C sending SID to S (the message of step (1) in the SSL/TLS protocol). The local atoms of the other views can be defined similarly. Since each view αB_i (with i = C, S) models the (beliefs about the) behavior of entity i, its set of local atoms will be that of view B_i (see [4]).

Semantics. To understand the semantics of the family of languages {Lα}α∈B* (hereafter we drop the subscript), we need to understand two key facts. On the one hand, the semantics of a formula depends on the view. For instance, the formula p in the view B_i expresses the fact that i believes that p is true; the same formula in the view B_j expresses the fact that j believes that p is true. As a consequence, the semantics associates locally to each view α a set of pairs ⟨m, s⟩, where m = ⟨S, J, R, L⟩ is a CTL structure, with S a set of states, J ⊆ S the set of initial states, R the transition relation, and L : S → 2^{Pα} the labeling function; and s is a reachable state of m (a state s of a CTL structure is said to be reachable if there is a path leading from an initial state of the CTL structure to state s). On the other hand, there are formulae in different views which have the same intended meaning. For instance, B_j p in view B_i and p in view B_i B_j both mean that i believes that j believes that p is true. This implies that only certain interpretations of different views are compatible with each other, namely those which agree on the truth values of the formulae with the same intended meaning. To capture this notion of compatibility we introduce the notion of a chain as a finite sequence c_ε, ..., c_β, ..., c_α of interpretations ⟨m, s⟩ of the languages of the corresponding views. A compatibility relation C on the MATL languages {Lα} is then a set of chains passing through the local interpretations of the views. Intuitively, C will contain all those chains c whose elements c_α, c_β (where α, β are two views in B*) assign the same truth values to the formulae with the same intended meaning. The notion of satisfiability local to a view is the standard satisfiability relation between CTL structures and CTL formulae. The (global) satisfiability of formulae by a compatibility relation C needs to take into account the chains, and is defined as follows: for any chain c ∈ C and for any formula φ ∈ Lβ, c_β |= φ if and only if φ is satisfied by the local interpretation c_β = ⟨m, s⟩ of view β.
Fig. 3. The notion of compatibility in the SSL/TLS protocol

Example 3. Let us consider the situation of Figure 3, where chains are represented by dotted lines. The formula B_C (B_S said X ∧ sees X) is satisfied by the interpretation in view ε. This means that such an interpretation must be compatible with the interpretations in view B_C that satisfy B_S said X ∧ sees X. This is indeed the case for both ⟨m_C, S'_C⟩ and ⟨m_C, S''_C⟩. Therefore, both of them must be compatible with interpretations in view B_C B_S that satisfy said X. This is the case for both ⟨m_S, S'_S⟩ and ⟨m_S, S''_S⟩.

We are now ready to define the notion of a model for MATL (called a MATL structure).

Definition 1 (MATL structure). A nonempty compatibility relation C for a family of MATL languages on {Pα} is a MATL structure on {Pα} if, for any chain c ∈ C:
1. c_α |= B_i φ iff for every chain c' ∈ C, c'_α = c_α implies c'_{αB_i} |= φ;
2. if c_α = ⟨m, s⟩, then for any state s' of m there is a chain c' ∈ C such that c'_α = ⟨m, s'⟩.

Briefly: the nonemptiness condition on C guarantees that there is at least one consistent view in the model. The only-if part of condition 1 guarantees that each view has correct beliefs, i.e., any time B_i φ holds at a view, φ holds in the view one level down in the chain. The if part is the dual property and ensures the completeness of each view. Notice that the two conditions above give belief operators in MATL the same strength as modal K(m), where m is the number of entities (see [5]).
4 MATL as a Logic for Security Protocols
MATL is expressive enough to be used as a logic for security protocols. Furthermore, it has a temporal component that usually is not present in the other
logics of authentication (e.g., see [1, 7, 16]). In order to show how the properties of security protocols can be expressed within MATL, we now impose some constraints on the models in order to capture the intended behavior of a protocol. These constraints can be formalized with a set of sound axioms. This is similar to what happens with several logics of authentication (see for example [1, 16]); indeed MATL encompasses such logics. Here, for the sake of readability, we show how it is possible to translate into MATL some of the most significant axioms that have been proposed in most logics of authentication. As a first example, let us consider the message meaning axioms. Usually, such axioms correspond to the following schema:

shk_{P,Q} K ∧ P sees {X}K → Q said X

Intuitively, it means that when an entity P receives a (fragment of) message encrypted with K, and K is a key shared by P and Q, then it is possible to conclude that the message comes from Q. The above axiom schema can be formulated in MATL as follows:

P : shk_{P,Q} K ∧ sees {X}K → B_Q said X    (1)
where with P : Ax we emphasize which view (P) the axiom Ax belongs to. Message meaning is often used together with nonce verification, which has the following schema:

Q said X ∧ fresh(X) → Q says X

This schema expresses that when an entity Q has sent X (i.e., Q said X) recently (i.e., fresh(X)), then we can assert that Q says X. In MATL this becomes

P : B_Q said X ∧ fresh(X) → B_Q says X    (2)
As a consequence, it is important to establish whether a fragment of a message is fresh. The following axioms help with this task:

fresh(Xi) → fresh(X1, ..., Xn)
fresh(X) → fresh({X}K)

Intuitively, they mean that when a fragment is fresh, then the message containing such a fragment (respectively, the encryption of the fragment) is fresh as well. In MATL they can be inserted in the appropriate views without modification. Another important set of axioms establishes how a message can be decomposed. For example, in [1] we have the following schemata:

P sees (X1 ... Xn) → P sees Xi
P sees {X}K ∧ P has K → P sees X

The first schema states that an entity sees each component of any compound unencrypted message it sees, while the second schema states that an entity can decrypt a message encrypted with a given key when it knows the key.
In MATL the above axiom schemata can be expressed in each view without modification. The following axiom schemata relate sent (received) messages to what an entity says (sees):

P : send_Q X → said X    (3)
P : rec X → sees X    (4)
The next axiom schemata capture the idea that an entity sees what it previously said or what it says:

P : said X → sees X    (5)
P : says X → sees X    (6)

The following schema expresses the ability of an entity to compose a new message starting from the (fragments of) messages it already sees:

P : sees X1 ∧ ... ∧ sees Xn → sees (X1 ... Xn)    (7)
In modeling security protocols such as SSL/TLS, we also need to take into account hash functions and signatures. We assume that a signature is reliable, without considering what such a schema looks like¹. This allows us to focus on whether the protocol is trustworthy, and is a usual assumption in security protocol verification. The corresponding axiom schemata are the following:

P : fresh(X) → fresh(H(X))    (8)
P : fresh(X) → fresh(S_Q{X})    (9)
P : sees X → sees H(X)    (10)
P : sees X ∧ sees K_P^{-1} ∧ prik_P K_P^{-1} → sees S_P{X}    (11)
P : sees X ∧ sees K_Q ∧ pubk_Q K_Q → sees {X}K_Q    (12)
where Q and P can be instantiated with the same entity or with different ones. The above schemata are the obvious extension to hash functions and signatures of the axioms about freshness and about the capability of composing messages. The following axiom schema expresses the capability to extract the original message from its signed version when the corresponding public key is known:

P : sees S_Q{X} ∧ sees K_Q → sees X    (13)
The next schema corresponds to the message meaning axiom for signed messages:

P : sees S_Q{X} ∧ pubk_Q K_Q → B_Q said X    (14)
Intuitively, it means that when an entity sees a message signed with the key of Q, then it believes that Q said such a message. Finally, the following axiom expresses that if P sees a secret and Q sees it as well, then the two entities share that secret:

P : sees X ∧ B_Q sees X → shsec_{P,Q} X    (15)

¹ The most common schema is the following: S_Q{X} = (X, {H(X)}K_Q^{-1}).
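To make the role of these axiom schemata concrete, the sketch below (our own illustration, not the NuMAS encoding) applies two of them, decomposition of seen messages and decryption with a known key, as closure rules over a set of sees atoms.

```python
# Illustrative closure of "sees" atoms under two axiom schemata from this section:
#   sees (X1 ... Xn) -> sees Xi            (decomposition)
#   sees {X}K and knowledge of K's inverse -> sees X   (decryption, modelled symbolically)
# Terms are either plain strings, ("cat", parts) or ("enc", key, parts).

def close_sees(seen, private_keys):
    seen = set(seen)
    changed = True
    while changed:
        changed = False
        for term in list(seen):
            if isinstance(term, tuple) and term[0] == "cat":
                new = set(term[1]) - seen
            elif (isinstance(term, tuple) and term[0] == "enc"
                  and term[1] in private_keys):
                new = set(term[2]) - seen
            else:
                new = set()
            if new:
                seen |= new
                changed = True
    return seen

# Message (3) of the handshake as received by the server.
msg3 = ("cat", ("CertC", ("enc", "KS", ("VerC", "PMS")), "Sig_C"))
print(close_sees({msg3}, private_keys={"KS"}))
# With the server's private key, VerC and PMS become visible;
# without it, only the opaque encrypted component would be seen.
```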
5 Security Requirements for SSL/TLS
SSL/TLS is a cryptographic protocol belonging to the family of so-called "authentication and key-exchange" protocols. This means that it must give each entity participating in the protocol assurance of the identities of the other entities; in other words, it must achieve the goal of mutual authentication. Authentication typically depends upon secrets, such as cryptographic keys, that an entity can reveal or somehow use to prove its identity to the others. For this reason, such protocols must provide entities with a secure way to agree on a secret session key for encrypting subsequent messages. In order to express these security goals, we introduce the authentication of secrets requirement, where the secret is composed of the pre-master secret PMS and the shared key KSC together. Moreover, such a protocol must guarantee that the exchanged secrets are new as well, i.e., it must achieve the goal of freshness. In order to formalize this security goal as well, we introduce the freshness of secrets requirement. In MATL, the requirements of authentication and freshness of secrets can be expressed as a set of formulae in the view of each entity.

CLIENT REQUIREMENTS

Authentication of Secrets. The client C is not required to satisfy possession of PMS, since the client generates this secret term itself. Conversely, in every possible protocol execution, if C sends to S the Client Finished message, then PMS has to be a shared secret between C and S. Formally, we write this as follows:

C : AG (send_S ClientFinished → shsec_{C,S} PMS)    (16)
Similarly, since the shared key KSC is calculated by C from PMS and the two nonces after Step (2), it is not necessary to verify its possession. On the other hand, if the client C sends the Client Finished message to S, then KSC needs to be a shared key between C and S. This can be expressed by:

C : AG (send_S ClientFinished → shk_{C,S} KSC)    (17)
Another important authentication requirement is that, if C sends the Client Finished message to S, then C must believe that S shares with it both PMS and KSC, as expressed by the following property:

C : AG (send_S ClientFinished → B_S shk_{C,S} KSC ∧ B_S shsec_{C,S} PMS)    (18)

Freshness of Secrets. This amounts to the requirement that if C receives the Server Finished message, then C must believe that S has recently sent both KSC and PMS. In formulae this can be expressed as follows:

C : AG (rec ServerFinished → B_S says KSC ∧ B_S says PMS)    (19)
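Purely for illustration, the client-side requirements can be collected as labelled CTL-style specifications. The strings below are one possible ASCII rendering we introduce here; they are not the actual NuMAS input syntax, which the paper does not show.

```python
# Hypothetical ASCII rendering of the client requirements (16)-(19), keyed by a short name.
client_requirements = {
    "auth_PMS   (16)": ("C", "AG (send_S ClientFinished -> shsec_{C,S} PMS)"),
    "auth_KSC   (17)": ("C", "AG (send_S ClientFinished -> shk_{C,S} KSC)"),
    "auth_belief(18)": ("C", "AG (send_S ClientFinished -> "
                             "(B_S shk_{C,S} KSC & B_S shsec_{C,S} PMS))"),
    "freshness  (19)": ("C", "AG (rec ServerFinished -> "
                             "(B_S says KSC & B_S says PMS))"),
}

for name, (view, formula) in client_requirements.items():
    print(f"{name:16s} {view} : {formula}")
```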
SERVER REQUIREMENTS

Authentication of Secrets. In the view of S, unlike the case of C's view, we need to verify possession of PMS by S, since this term is not locally generated but is sent by C.
Therefore, we require that if S sends the Server Finished message to C, then S must possess PMS:

S : AG (send_C ServerFinished → sees PMS)    (20)
Moreover, under the same condition, S must share PMS with C:

S : AG (send_C ServerFinished → shsec_{C,S} PMS)    (21)
Similarly to the case of C's view, since the shared key KSC is calculated by S from PMS and the nonces, it is not necessary to verify its possession. On the other hand, we need to verify that if S sends the Server Finished message to C, then KSC must be a shared key between C and S:

S : AG (send_C ServerFinished → shk_{C,S} KSC)    (22)
Finally, if S sends the Server Finished message to C, then S must believe that C shares with it both PMS and KSC:

S : AG (send_C ServerFinished → B_C shk_{C,S} KSC ∧ B_C shsec_{C,S} PMS)    (23)

Freshness of Secrets. This requirement is exactly as for the client C. It amounts to checking that if S receives the Client Finished message, then S must believe that C has recently sent both KSC and PMS:

S : AG (rec ClientFinished → B_C says KSC ∧ B_C says PMS)    (24)

6 Verification of the SSL/TLS Protocol
The core of the verification process performed by NuMAS is a model checking algorithm for the MATL logic. This algorithm is built on top of CTL model checking and is based on the notion of a Multi-Agent Finite State Machine (MAFSM): since a CTL model checking algorithm employs a Finite State Machine (FSM), we have to extend the notion of FSM to accommodate beliefs, which is what the MAFSM does. Following this approach, in order to verify a protocol by means of NuMAS, we need to describe it as a MAFSM. To specify a MAFSM we have to describe the finite state machine of each view, which models the behavior (i.e., the temporal evolution) of the corresponding entity participating in the protocol. As a consequence, we need to specify the propositional atoms (i.e., message variables and freshness variables) and the explicit belief atoms, establishing what the local atoms of each view are and specifying the compatibility relation among the views by means of the set of belief atoms of each view. Moreover, we have to specify how atoms vary during the protocol execution. In particular, we need to model entities sending and receiving messages, by means of local atoms such as send_P X and rec X, derived directly from the protocol description. These atoms follow the sequence of messages in the protocol and, once they become true, they stay true. The behavior of the other atoms (for example, atoms such as sees and says) derives from the axioms described in Section 4.
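The remark that the send/rec atoms follow the message sequence and stay true once set can be pictured as a tiny latching state machine. The sketch below is our own illustration of that behaviour in the client's view, not the MAFSM specification actually fed to NuMAS.

```python
# Minimal illustration of latching "message atoms": in the client's view,
# send/rec atoms become true in protocol order and then remain true.

CLIENT_EVENTS = ["send_S ClientHello", "rec ServerHello",
                 "send_S ClientVerify", "rec ServerFinished",
                 "send_S ClientFinished"]

def run_client(steps):
    state = {atom: False for atom in CLIENT_EVENTS}
    trace = []
    for atom in CLIENT_EVENTS[:steps]:
        state[atom] = True            # once true, never reset
        trace.append(dict(state))
    return trace

for i, snapshot in enumerate(run_client(3), start=1):
    true_atoms = [a for a, v in snapshot.items() if v]
    print(f"state {i}: {true_atoms}")
```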
Table 1. Summary of the verification of the whole SSL/TLS

Parameter                                   Value
Views in the model                          2
State variables (client/server)             120/119
Security specifications checked             18
Time to build the model                     21.2 s
Time to check all the properties            0.21 s
Total time required for the verification    21.41 s
Memory required for the verification        2.1 Mb
In the MAFSM associated with the protocol to be verified, these axioms are introduced as invariants, namely MATL formulae that must be true in every state. Furthermore, we need to express beliefs about other principals. Also in this case, we use boolean variables and constrain their behavior by means of axioms. For more details about the verification process, the reader can refer to our previous work. In particular, [5, 3] show how the semantics of MATL, briefly sketched in Section 3, can be finitely presented so as to allow model checking of MATL formulae. A detailed description of how the finite state MATL structure can be specified can be found in [3, 4], and [2] describes the symbolic model checking algorithm for MATL implemented within the NuMAS model checker. We ran the verification of the SSL/TLS protocol with NuMAS on a PC equipped with a Pentium III (clock frequency 1 GHz) and 256 MB of RAM. Notice that, due to space constraints, in this paper we have only described the analysis of the Handshake Protocol. On the other hand, we have verified the whole SSL/TLS protocol suite, including the Record Layer and the Resumption Protocol. Table 1 reports some parameters characterizing the size of the model devised for the entire version of SSL/TLS and some runtime statistics of the verification. It is worth noticing that, despite the fairly large size of the model (more than one hundred state variables per view, i.e., a state space of more than 2^240 states), NuMAS could build and verify the protocol in a few seconds. All the expected security requirements turned out to be satisfied. This means that the specification of the protocol provided by the IETF is sound. Therefore, we can assert that the design of this protocol is not afflicted by security flaws and so, if correctly implemented, the SSL/TLS protocol can be considered secure.
7 Conclusions
In this paper we have described how a logic of belief and time (MATL) has been used for the verification of the SSL/TLS security protocol. The verification has been performed with NuMAS, a symbolic model checker for MATL. The verification methodology has been applied to the SSL/TLS protocol, the Internet standard protocol proposed by the IETF. The analysis shows that the
protocol satisfies all the desired security requirements. Our results agree with those found by other researchers who have carried out formal [14, 15] or informal [17] analyses of this protocol. The same verification approach has also been applied to other security protocols, in particular the Lu and Smolka variant of SET [12]. The complete verification of this variant of SET required 0.6 seconds on an ordinarily equipped PC, and allowed us to find a flaw in the protocol leading to a possible attack (see [6]). For lack of space, in this paper we have described only the verification of the SSL/TLS protocol.
References

[1] M. Abadi and M. Tuttle. A semantics for a logic of authentication. In Proceedings of the 10th Annual ACM Symposium on Principles of Distributed Computing, pages 201-216, 1991. 126, 132
[2] M. Benerecetti and A. Cimatti. Symbolic Model Checking for Multi-Agent Systems. In CLIMA-2001, Workshop on Computational Logic in Multi-Agent Systems, 2001. Co-located with ICLP'01. 126, 136
[3] M. Benerecetti and F. Giunchiglia. Model checking security protocols using a logic of belief. In Proceedings of the 6th International Conference on Tools and Algorithms for the Construction and Analysis of Systems (TACAS 2000), 2000. 126, 136
[4] M. Benerecetti, F. Giunchiglia, M. Panti, and L. Spalazzi. A Logic of Belief and a Model Checking Algorithm for Security Protocols. In Proceedings of IFIP TC6/WG6.1 International Conference FORTE/PSTV 2000, 2000. 126, 128, 130, 136
[5] M. Benerecetti, F. Giunchiglia, and L. Serafini. Model Checking Multiagent Systems. Journal of Logic and Computation, 8(3):401-423, 1998. 126, 128, 131, 136
[6] M. Benerecetti, M. Panti, L. Spalazzi, and S. Tacconi. Verification of Payment Protocols via MultiAgent Model Checking. In Proceedings of the 14th International Conference on Advanced Information Systems Engineering (CAiSE '02), 2002. 126, 137
[7] M. Burrows, M. Abadi, and R. Needham. A logic of authentication. ACM Transactions on Computer Systems, 8(1):18-36, February 1990. 126, 132
[8] T. Dierks and C. Allen. The TLS Protocol Version 1.0. IETF RFC 2246, 1999. 126, 127
[9] E. A. Emerson. Temporal and Modal Logic. In Handbook of Theoretical Computer Science, volume B, pages 996-1072, 1990. 128
[10] A. Frier, P. Karlton, and P. Kocher. The SSL 3.0 Protocol. Netscape Communications Corp., 1996. 127
[11] G. Lowe. Breaking and Fixing the Needham-Schroeder Public-Key Protocol Using FDR. In Proceedings of the Conference on Tools and Algorithms for the Construction and Analysis of Systems, pages 147-166, 1996. 126
[12] S. Lu and S. A. Smolka. Model Checking the Secure Electronic Transaction (SET) Protocol. In Proceedings of the 7th International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, pages 358-365. IEEE Computer Society, 1999. 137
[13] W. Marrero, E. Clarke, and S. Jha. Model Checking for Security Protocols. In Proceedings of the DIMACS Workshop on Design and Formal Verification of Security Protocols, 1997. 126
[14] J. C. Mitchell, V. Shmatikov, and U. Stern. Finite-State Analysis of SSL 3.0. In Proceedings of the 7th USENIX Security Symposium, pages 201-216, 1998. 126, 127, 137
[15] L. C. Paulson. Inductive Analysis of the Internet Protocol TLS. ACM Transactions on Information and System Security, 2(3):332-351, 1999. 137
[16] P. Syverson and P. C. van Oorschot. On Unifying Some Cryptographic Protocol Logics. In Proceedings of the IEEE Symposium on Research in Security and Privacy, pages 14-28, 1994. 126, 132
[17] D. Wagner and B. Schneier. Analysis of the SSL 3.0 Protocol. In Proceedings of the 2nd USENIX Workshop on Electronic Commerce, pages 29-40, 1996. 137
Reliability Assessment of Legacy Safety-Critical Systems Upgraded with Off-the-Shelf Components

Peter Popov

Centre for Software Reliability, City University, Northampton Square, London, UK
[email protected]
Abstract. Reliability assessment of upgraded legacy systems is an important problem in many safety-related industries. Some parts of the equipment used in the original design of such systems are either not available off-the-shelf (OTS) or have become extremely expensive as a result of being discontinued as mass-production components. Maintaining a legacy system, therefore, demands using different OTS components. Trustworthy reliability assurance after an upgrade with a new OTS component is needed, which combines the evidence about the reliability of the new OTS component with the knowledge about the old system accumulated to date. In these circumstances the Bayesian approach to reliability assessment is invaluable. Earlier studies have used Bayesian inference under simplifying assumptions. Here we study the effect of these assumptions on the accuracy of predictions and discuss the problems, some of them open for future research, of using Bayesian inference for practical reliability assessment.
1 Introduction
The use of off-the-shelf (OTS) components containing software is becoming an increasingly widespread practice, both for the development of new systems and for upgrading existing (i.e. legacy) systems as part of their maintenance. The main reason for the trend is the low cost of OTS components compared with bespoke development, or the discontinuation of older components as mass-production units. In this paper we focus on the reliability assessment of a legacy system upgraded with an OTS component which contains software. Two factors make the reliability assessment in this case significantly different from the assessment of a bespoke system. First, reliability data about the OTS component, if available at all, comes, as a rule, from an unknown environment, possibly different from the target one. Evidence of high reliability in a different environment will give only modest confidence in the reliability of the OTS component in the target environment. Second, acceptance testing of the upgraded system must be, as a rule, short. In some cases postponing the deployment of an upgraded system to undertake a long V&V procedure will simply prevent the vendor from gaining a market advantage. In some other cases, e.g. upgrading a nuclear plant with smart sensors, it is simply prohibitively expensive or even impossible to run long acceptance testing on the upgraded system before it is deployed. And yet in many
cases, e.g. in safety-critical systems, there are very stringent requirements for demonstrably high reliability of systems in which OTS components are used. In these circumstances the Bayesian approach to reliability assessment is very useful. It allows one to combine rigorously both the a priori knowledge about the reliability of a system and its components, and the new (possibly very limited) evidence coming from observing the upgraded system in operation. The simplest way to assess the reliability of a system is to observe its failure behaviour in (real or simulated) operation. If we treat the system as a black box, i.e. ignore its internal structure, standard techniques of statistical inference can be applied to estimate its probability of failure on demand (pfd) on the basis of the amount of realistic testing performed and the number of failures observed. However, this 'black-box' approach to reliability assessment has severe limitations [1], [2]: if we want to demonstrate very small upper bounds on the pfd, the amount of testing required becomes very expensive and eventually infeasible. It is then natural to ask whether we can use additional knowledge about the structure of the system to reduce this problem, i.e. to achieve better confidence for the same amount of testing. This is the problem which we address in this paper. We present a model of reliability assessment of a legacy system upgraded with a single OTS component and discuss the difficulties and limitations of its practical use. In detail, Section 2 presents the problem studied in the paper, and Section 3 presents the main result. In Section 4 we discuss the implications of our results and the difficulties in applying the Bayesian approach to practical reliability assessment, some of them posed as open research questions. Finally, conclusions are presented in Section 5.
2 The Problem
For simplicity we assume that the system under consideration is an on-demand system, i.e. it is called upon when certain predefined circumstances occur in the environment. A typical example of an on-demand system is a safety protection system intended to shut down a plant if the plant leaves its safety envelope.

Fig. 1. Upgrading a legacy system with an OTS component.
We analyse the simplest possible case of system upgrade – the replacement of a single component with an OTS component which interacts with the rest of a legacy system (ROS), as illustrated in Fig. 1. In the rest of the paper we refer to ROS as sub-system A and to the new OTS component as sub-system B. The paper analyses a special case of a system in which both sub-systems are used exactly once per demand.
3 Reliability Assessment: Bayesian Approach
The Bayesian approach to reliability assessment of an upgraded on-demand system is used. The probability of failure on demand (pfd) is the measure of interest.

Fig. 2. Black-box model of a system. The internal structure of the system is unknown. The outputs of the sub-systems are not used in the inference. Only the system output is recorded on each demand and fed into the inference.
If the system is treated as a black box, i.e. we can only distinguish between system failures and successes (Fig. 2), the inference proceeds as follows. Denoting the system pfd as p, the posterior distribution of p after seeing r failures in n demands is:

f_p(x | r, n) ∝ L(n, r | x) f_p(x),    (1)

where f_p(•) is the prior distribution of p, which represents the assessor's belief about p before seeing the result of the test on n demands, and L(n, r | x) is the likelihood of observing r failures in n demands if the pfd were exactly x. In this case (of independent demands) it is given by the binomial distribution, L(n, r | x) = \binom{n}{r} x^r (1 − x)^{n−r}.

The details of system behaviour which may be available but are ignored in the black-box model, such as the outcomes on a demand of the sub-systems which make up the system, are taken into account in the clear-box model. As a rule the predictions obtained with the clear-box and black-box models differ. We have shown elsewhere, in the specific context of a parallel system [3], that the black-box predictions can be over-optimistic or over-pessimistic and that the sign of the error cannot be known in advance: it depends on the prior and the observations. Bayesian inference with a clear-box model is more complex than with the black-box model because a multivariate prior distribution and likelihood are used. The dimensions of the prior distribution depend on the number of sub-systems which make up the system and on whether the system score (a success or a failure) is a deterministic or a non-deterministic function of the scores of the sub-systems involved¹. For instance, modelling a system with two sub-systems (e.g. Fig. 1) and a deterministic system score as a clear box requires a 3-variate prior distribution/likelihood.
¹ Examples of systems with a deterministic system score are parallel systems [3] and serial systems (i.e. systems which fail if at least one of their sub-systems fails). An example of a system with a non-deterministic system score is the system in Fig. 1 if, for the same sub-system scores (e.g. ROS fails, OTS component succeeds), it fails on some demands but succeeds on others. A non-deterministic system score must be explicitly modelled as a separate binary random variable.
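As a concrete illustration of the black-box inference in (1), the following sketch computes a conjugate Beta-binomial posterior for the system pfd. It uses a standard Beta prior on [0, 1] (unlike the truncated priors used in the numerical examples later in the paper), and the prior parameters and observation counts are arbitrary illustrative values, not taken from the paper.

```python
# Black-box Bayesian inference of the pfd, as in (1), with a conjugate Beta prior:
# if p ~ Beta(a, b) and r failures are seen in n independent demands,
# the posterior is Beta(a + r, b + n - r).
from scipy import stats

a, b = 1.0, 99.0          # illustrative prior: mean pfd = 0.01
n, r = 4000, 20           # illustrative observation

posterior = stats.beta(a + r, b + n - r)
print("posterior mean pfd :", posterior.mean())
print("95% upper bound    :", posterior.ppf(0.95))
```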
A clear-box model of a system with the same number of sub-systems but a non-deterministic system score requires a 7-variate prior and likelihood. A clear-box model of a system with 3 sub-systems and a deterministic or a non-deterministic system score requires a 7- or 15-variate distribution/likelihood, respectively, etc. Such an exponential explosion of the complexity of the prior/likelihood with the number of sub-systems poses two difficulties for using clear-box Bayesian inference:

- Defining a multidimensional prior is difficult. It is widely reported that humans are not very good at using probabilities [4]. Increasing the dimensions of the prior distribution makes it very difficult for an assessor to justify a particular prior as matching their informal belief about the reliability of the system and its sub-systems;
- Unless the prior and the likelihood form a conjugate family [5], the complexity of the Bayesian inference itself increases with the number of dimensions of the prior used in the inference, because multiple integrals must be calculated.

These two difficulties are a good incentive to try to simplify the multivariate prior distribution. One way of simplification is assuming various forms of independence between the variates of the distribution which describes the assessor's knowledge (belief) about system reliability. Recently Kuball et al. [6], in a different context, used the assumption of independence between the failures of the sub-systems, which is attractive: it allows one to avoid the difficulties in defining the dependencies that may exist between several failure processes, the most difficult part in defining a multivariate distribution. Once the failures of the sub-systems are assumed independent, however, they will stay so regardless of what is observed in operation, even if overwhelming evidence is received that the sub-system failures are correlated (positively or negatively). This evidence of failure dependence is simply ignored; the only uncertainty affected by the inference is that associated with the pfd of the sub-systems. The multivariate Bayesian inference collapses to a set of univariate inferences, which are easily tractable. Kuball et al. assert that the predictions derived under the assumption of independence will be pessimistic, at least in the case that no failure is observed. Even if this is true, there is no guarantee that 'no failure' will be the only outcome observed in operation, e.g. during acceptance testing after the upgrade. The justification that the 'no failure' case is the only one of interest for the assessor (since any other outcome would imply restarting the acceptance testing afresh) is known to have a problem: Littlewood and Wright have shown [7] that ignoring the previous observations (i.e. rounds of acceptance testing which ended with failures) can produce over-optimistic predictions. It is worth, therefore, studying the consequences of assuming independence between the failures of the sub-systems for a broader range of observations, not just for the 'no failure' case.
3.1 Experimental Setup
Now we formally describe the clear-box model of the system (Fig. 1). The sub-systems A and B are assumed imperfect and their probabilities of failure uncertain. The scores of the sub-systems, which can be observed on a randomly chosen demand, are summarised in Table 1.
Table 1. The combinations of sub-system scores which can be observed on a randomly chosen demand are shown in columns 1-2. The notations used for the probabilities of these combinations are shown in column 3. The numbers of times the score combinations are observed in N trials, r0, r1, r2 and r3 (N = r0 + r1 + r2 + r3) respectively, are shown in the last column.

Sub-system A   Sub-system B   Probability   Observed in N demands
0              0              p00           r0
0              1              p01           r1
1              0              p10           r2
1              1              p11           r3
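The counts r0, ..., r3 of Table 1 can be obtained directly from per-demand scores of the two sub-systems. The trivial tally below uses made-up per-demand outcomes, chosen to reproduce Observation 1 of Section 3.2.

```python
from collections import Counter

# Each observed demand gives a pair (score_A, score_B), 0 = success, 1 = failure.
demands = [(0, 0)] * 3960 + [(0, 1)] * 20 + [(1, 0)] * 20 + [(1, 1)] * 0

tally = Counter(demands)
r0, r1, r2, r3 = tally[(0, 0)], tally[(0, 1)], tally[(1, 0)], tally[(1, 1)]
print("N =", sum(tally.values()))
print("r0, r1, r2, r3 =", r0, r1, r2, r3)
```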
Clearly, the probabilities of failure of sub-systems A and B, pA and pB respectively, can be expressed as pA = p10 + p11 and pB = p01 + p11. p11 represents the probability of coincident failure of both sub-systems A and B on the same demand, and hence the notation pAB ≡ p11 captures better the intuitive meaning of the event it is assigned to. The joint distribution f_{pA,pB,pAB}(•,•,•) describes completely the a priori knowledge of an assessor about the reliability of the upgraded system. It can be shown that for a given observation (r1, r2 and r3 in N demands) the posterior distribution can be calculated as:

f_{pA,pB,pAB}(x, y, z | N, r1, r2, r3) = f_{pA,pB,pAB}(x, y, z) L(N, r1, r2, r3 | x, y, z) / ∫∫∫ f_{pA,pB,pAB}(x, y, z) L(N, r1, r2, r3 | x, y, z) dx dy dz,    (2)

where

L(N, r1, r2, r3 | pA, pB, pAB) = [N! / (r1! r2! r3! (N − r1 − r2 − r3)!)] (pA − pAB)^{r2} (pB − pAB)^{r1} (pAB)^{r3} (1 − pA − pB + pAB)^{N − r1 − r2 − r3}
is the likelihood of the observation. Up to this point the inference is the same no matter how the event 'system failure' is defined, but calculating the marginal distribution of the system pfd, PS, is affected by how this event is defined. We proceed as follows:

1. A serial system: a failure of either of the sub-systems leads to a system failure. The posterior distribution f_{pA,pB,pAB}(•,•,• | N, r1, r2, r3) must be transformed to a new distribution f_{pA,pB,pS}(•,•,• | N, r1, r2, r3), where PS is defined as PS = PA + PB − PAB, from which the marginal distribution of PS, f_{pS}(• | N, r1, r2, r3), is calculated by integrating out the nuisance parameters PA and PB. If the system is treated as a black box (Fig. 2), the system pfd can be inferred using formula (1) above, with the marginal prior distribution of PS, f_{pS}(•), and a binomial likelihood of observing r1 + r2 + r3 system failures in N trials. If the failures of sub-systems A and B are assumed independent, then for any values of PA and PB the probability of joint failure of both sub-systems is PAB = PA PB. Formally, the joint prior can be expressed as:

f*_{pA,pB,pAB}(x, y, z) = f_{pA}(x) f_{pB}(y) d(xy), if z = xy;  0, if z ≠ xy.

The failures of the two sub-systems remain independent in the posterior:

f*_{pA,pB,pAB}(x, y, z | N, r1, r2, r3) = f*_{pA}(x | N, r2 + r3) f*_{pB}(y | N, r1 + r3) d(xy), if z = xy;  0, if z ≠ xy,

where f*_{pA}(• | N, r2 + r3) and f*_{pB}(• | N, r1 + r3) are the marginal posterior distributions of sub-systems A and B, respectively, inferred under independence. The inference for sub-system A proceeds according to (1), using the marginal prior of sub-system A, f_{pA}(•), and a binomial likelihood of observing r2 + r3 failures of sub-system A in N trials. Similarly, f*_{pB}(• | N, r1 + r3) is derived with prior f_{pB}(•) and a binomial likelihood of observing r1 + r3 failures of sub-system B in N trials. The posterior marginal distribution of the system pfd, f*_{pS}(• | N, r1, r2, r3), can be obtained from f*_{pA,pB,pAB}(x, y, z | N, r1, r2, r3) as described above: first the joint posterior is transformed to a form which contains PS as a variate, and then PA and PB are integrated out.

2. The system fails only when sub-system A fails. In this case the probability of system failure is merely the posterior pfd of sub-system A (ROS). The marginal distribution f_{pA}(• | N, r1, r2, r3) can be calculated from f_{pA,pB,pAB}(•,•,• | N, r1, r2, r3) by integrating out PB and PAB. With black-box inference another marginal posterior, f*_{pA}(• | N, r2 + r3), can be obtained using (1) with the marginal prior of the pfd of sub-system A, f_{pA}(•), and a binomial likelihood of observing r2 + r3 failures of sub-system A in N trials. Notice that the marginal distribution f_{pA}(• | N, r1, r2, r3) is different, as a rule, from the marginal distribution f*_{pA}(• | N, r2 + r3) obtained with the black-box inference.

3.2 Numerical Example
Two numerical examples are presented below which illustrate the effect of various simplifying assumptions used in the inference on the accuracy of the predictions. The prior f_{pA,pB,pAB}(•,•,•) was constructed under the assumption that f_{pA}(•) and f_{pB}(•) are both Beta distributions, B(•, a, b), on the interval [0, 0.01] and are independent of each other, i.e. f_{pA,pB}(•,•) = f_{pA}(•) f_{pB}(•). The parameters a and b for the two distributions were chosen as follows: aA = 2, bA = 2 for sub-system A and aB = 3, bB = 3 for sub-system B. If the sub-systems are assumed to fail independently, the parameters above are a sufficient definition of the prior distribution. If the sub-systems are not assumed to fail independently, we specify the conditional distributions f_{pAB|pB,pA}(• | PA, PB), for every pair of values of PA and PB, as Beta distributions B(•, a, b) on the range [0, min(PA, PB)] with parameters aAB = 5, bAB = 5, which completes the definition of the trivariate distribution f_{pA,pB,pAB}(•,•,•). We do not make any claims that the priors used in the examples should be used in practical assessment. They serve illustrative purposes only and yet have been chosen from a reasonable range. Each of the sub-systems, for instance, has an average pfd of 5·10⁻³, which is a value from a typical range for many applications. Two sets of observations were used for the calculations, with the same number of trials, N = 4000:

- Observation 1 (the sub-systems never fail together): r3 = 0, r1 = r2 = 20;
- Observation 2 (the sub-systems always fail together): r3 = 20, r1 = r2 = 0.

The numbers of failures of the sub-systems have been chosen so that the two observations are indistinguishable under the assumption of failure independence: in both cases each of the sub-systems failed 20 times². The observations, however, provide evidence of different correlation between the failures of the sub-systems: the first observation is evidence of strong negative correlation, while the second is evidence of strong positive correlation. The inference results under various assumptions for both observations are summarised in Tables 2 and 3, respectively, which show the percentiles of the marginal prior/posterior distributions of system pfd.

Table 2. Observation 1: strong negative correlation between the failures of the sub-systems (N = 4000, r3 = 0, r1 = r2 = 20). The upper part of the table shows the posteriors if the upgraded system were a 'serial' system, while the lower part shows the posteriors if the system failed only when sub-system A failed.
                                                                50%      75%      90%      95%      99%
Serial system
  Prior system pfd, f_{pS}(•)                                   0.0079   0.0096   0.0114   0.0124   0.0144
  'Proper' posterior pfd, f_{pS}(• | N, r1, r2, r3)             0.01     0.0118   0.012    0.0126   0.0137
  Posterior pfd with independence, f*_{pS}(• | N, r1, r2, r3)   0.0103   0.0112   0.0122   0.0128   0.01393
  Black-box posterior with independence                         0.01     0.011    0.012    0.0125   0.0136
  Black-box posterior without independence                      0.0095   0.01035  0.0113   0.0118   0.0128
Failure of sub-system A (ROS) only leads to a system failure
  Prior system pfd, f_{pA}(•)                                   0.0049   0.0066   0.0080   0.0086   0.0093
  'Proper' posterior pfd, f_{pA}(• | N, r1, r2, r3)             0.0051   0.0059   0.0066   0.0071   0.0079
  Posterior system pfd with independence, f*_{pA}(• | N, r2+r3) 0.005    0.0058   0.0065   0.0069   0.0078

² Equal to the expected number of failures of each of the sub-systems in 4000 demands, as defined by the prior.
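For readers who want to reproduce numbers of this kind, the sketch below evaluates the trivariate posterior (2) on a coarse grid, using the prior of Section 3.2 (independent Beta(2,2) and Beta(3,3) marginals rescaled to [0, 0.01] and a conditional Beta(5,5) for pAB) and Observation 1, and then extracts percentiles of the serial-system pfd PS = PA + PB − PAB. It is a rough numerical illustration of our own, not the paper's actual computation, so the percentiles will only approximate the 'proper' posterior row of Table 2.

```python
import numpy as np
from scipy import stats

N, r1, r2, r3 = 4000, 20, 20, 0                            # Observation 1
xs = np.linspace(1e-6, 0.01 - 1e-6, 60)                    # coarse grid for pA, pB
fA = stats.beta(2, 2).pdf(xs / 0.01) / 0.01                # prior density of pA on [0, 0.01]
fB = stats.beta(3, 3).pdf(xs / 0.01) / 0.01                # prior density of pB on [0, 0.01]

def log_lik(pA, pB, pAB):
    q = 1.0 - pA - pB + pAB
    return (r2 * np.log(pA - pAB) + r1 * np.log(pB - pAB)
            + (r3 * np.log(pAB) if r3 > 0 else 0.0)
            + (N - r1 - r2 - r3) * np.log(q))

pS_vals, weights = [], []
for i, pA in enumerate(xs):
    for j, pB in enumerate(xs):
        top = min(pA, pB)
        zs = np.linspace(top * 1e-3, top * (1 - 1e-3), 20)  # grid for pAB
        dz = zs[1] - zs[0]
        fAB = stats.beta(5, 5).pdf(zs / top) / top          # conditional prior of pAB
        w = fA[i] * fB[j] * fAB * dz * np.exp(log_lik(pA, pB, zs))
        pS_vals.extend(pA + pB - zs)                        # serial-system pfd
        weights.extend(w)

pS_vals, weights = np.array(pS_vals), np.array(weights)
order = np.argsort(pS_vals)
cdf = np.cumsum(weights[order]) / weights.sum()
for q in (0.5, 0.95, 0.99):
    idx = np.searchsorted(cdf, q)
    print(f"approx. {int(q * 100)}% percentile of the serial-system pfd: {pS_vals[order][idx]:.4f}")
```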
The results in Table 2 reveal that the black-box inference produces optimistic posteriors: there is a stochastic ordering between them and the posteriors obtained with the clear-box model, no matter whether independence of failures is assumed or not. Comparing the clear-box predictions with and without failure independence reveals another stochastic ordering: the predictions with independence are conservative. This is in line with the result by Kuball et al. The differences between the various posteriors are minimal. The tendency remains the same (the same ordering between the posteriors was observed) for a wide range of observations with negative correlation between the failures of the sub-systems. Finally, for a non-serial system the independence assumption produces more optimistic predictions than no independence assumption (the last two rows of the table). In other words, independence is not guaranteed to produce conservative predictions: the ordering depends on how the event 'system failure' is defined.

Table 3. Observation 2: strong positive correlation between the failures of the two sub-systems (N = 4000, r3 = 20, r1 = r2 = 0). The results are arranged as in Table 2.
                                                                50%      75%      90%      95%      99%
Serial system
  Prior system pfd, f_{pS}(•)                                   0.0079   0.0096   0.0114   0.0124   0.0144
  'Proper' posterior pfd, f_{pS}(• | N, r1, r2, r3)             0.0051   0.0058   0.0065   0.0069   0.0076
  Posterior pfd with independence, f*_{pS}(• | N, r1, r2, r3)   0.0103   0.0113   0.0123   0.0128   0.0139
  Black-box posterior with independence                         0.0055   0.0063   0.0071   0.0075   0.0084
  Black-box posterior without independence                      0.0059   0.0066   0.0073   0.0078   0.0088
Failure of sub-system A (ROS) only leads to a system failure
  Prior system pfd, f_{pA}(•)                                   0.0049   0.0066   0.0080   0.0086   0.0093
  'Proper' posterior pfd, f_{pA}(• | N, r1, r2, r3)             0.00495  0.0056   0.0063   0.0066   0.0074
  Posterior system pfd with independence, f*_{pA}(• | N, r2+r3) 0.005    0.0058   0.0065   0.0069   0.0078
The results from the black-box inference in Table 3 reveal a pattern different from Table 2. The black-box predictions here are more pessimistic than the 'proper' posteriors, i.e. those obtained with the clear-box model without assuming independence. This is true for both the serial system and the system which only fails when sub-system A fails. The fact that the sign of the error of the black-box predictions changes (from over-estimation in Table 2 to under-estimation in Table 3) is not surprising; it is in line with our result for parallel systems [3]. If we compare the clear-box predictions, with and without independence, the same stochastic ordering is observed no matter how the event 'system failure' is defined. If the system failure is equivalent to a failure of sub-system A, the predictions with independence (i.e. if the effect of sub-system B on sub-system
A is neglected) are more pessimistic than the predictions without independence. In other words, the ordering for this type of system is the opposite of what we saw in Table 2. For serial systems, the predictions shown in Table 3 obtained with independence are, as in Table 2, more pessimistic than those without independence. The pessimism, however, is in this case much more significant than it was in Table 2. The values predicted under independence are almost twice the values without independence: the conservatism for a serial system may become significant. With the second observation (Table 3) the assumption of statistical independence of the failures of the sub-systems is clearly unrealistic: if independence were true, the expected number of joint failures would be 0.1 failures in 4000 trials, while 20 were actually observed!
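The expected-joint-failure figure quoted above follows directly from the marginal failure rates; a one-line check (using the prior average pfd of 5·10⁻³ per sub-system from Section 3.2):

```python
pA = pB = 5e-3          # average pfd of each sub-system under the prior
N = 4000
print("expected joint failures under independence:", N * pA * pB)   # 0.1, vs. 20 observed
```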
4 Discussion
The results presented here are hardly surprising. They simply confirm that simplifications of models or of model parameterisation may lead to errors. If the errors caused by the simplifications were negligible, or at least consistently conservative, reporting on them would not have made much sense. What seems worrying, however, and therefore worth pointing out, is that the errors are neither guaranteed to be negligible nor consistently conservative. Simplifying the model and using black-box inference may lead to over- or under-estimation of system reliability. We reported elsewhere on this phenomenon with respect to a parallel system; here we present similar results for alternative system architectures. One seems justified in concluding that using a more detailed clear-box model always pays off by making the predictions more accurate. In some cases, the accuracy may imply less effort in demonstrating that a reliability target has been reached, i.e. it makes the V&V less expensive. In some other cases, it prevents over-confidence in system reliability, although at the expense of longer acceptance testing. In the particular context of this study, reliability assessment of an upgraded system, there is another reason why the clear-box model should be preferred to the black-box one. We would like to reuse the evidence available to date in the assessment of the upgraded system. This evidence, if available at all, is given for the sub-systems: for sub-system A and, possibly, for sub-system B, but not for the upgraded system as a whole. Using the available evidence seems only possible by first modelling the system explicitly as a clear box and plugging the pieces of evidence into the definition of the joint prior. From the multivariate prior, the marginal prior distribution of the system pfd can be derived and used in a marginal Bayesian inference. It does not seem very sensible, however, to use the marginal inference after carrying out the hard work of identifying a plausible multivariate prior: the gain will be minimal, only in terms of making the inference itself easier, too little of a gain at the expense of inaccurate predictions. The results with the simplified clear-box inference seem particularly worth articulating. Our two examples indicated conservative predictions for a serial system under the independence assumption. One may think that this is universally true for serial systems, as Kuball et al. asserted. Unfortunately, this is not the case: we have found examples of priors/observations for which the conservatism does not hold. An
example of such a prior/observation pair is the following: f_{pA}(•) and f_{pB}(•) are assumed independent Beta distributions on the range [0, 1] with parameters aA = 2, bA = 20 and aB = 2, bB = 2; the conditional distribution f_{pAB|pB,pA}(• | PA, PB) for a pair of values PA and PB is assumed to be a Beta distribution on the range [0, min(PA, PB)] with parameters aAB = 5, bAB = 5; and the observations are N = 40, r1 = 0, r2 = 12, r3 = 12. In this case the posterior system pfd under the assumption of independence is more optimistic than the posterior without independence. The point of this counterexample is that there exist cases in which the assumption of independence may lead to over-optimism. Since the exact conditions under which the predictions become over-optimistic are unknown, assuming independence between the failures of the sub-systems may be dangerous: it may lead to unacceptable errors such as over-confidence in the achieved system reliability. A second problem with the independence assumption is that even when the predictions under this assumption are conservative, the level of conservatism may be significant, which is expensive. This tendency seems to escalate with the number of sub-systems. In the worst case it seems that the level of conservatism in the predicted system reliability is proportional to the number of sub-systems used in the model. For a system of 10 sub-systems, for example, the worst-case under-estimation of system reliability can reach an order of magnitude. The implication is that by using predictions based on the independence assumption the assessor may insist on unnecessarily long acceptance testing until unnecessarily conservative targets are met. We call them unnecessary because the conservatism is merely due to the error caused by the independence assumption. Using the independence assumption in Bayesian inference is in a sense ironic, because it goes against the Bayesian spirit of letting the data 'speak for itself'. Even if the observations provide overwhelming evidence of dependence between the failures of the sub-systems, the strong assumption of independence precludes taking this knowledge into account: in the posteriors the failures of the sub-systems will continue to be modelled as independent processes. Pointing out problems with the simplified solutions does not, of course, solve the problem of reliability assessment of a system made up of sub-systems. The full inference requires a full multivariate prior to be specified, which for a system with more than 3 components seems extremely difficult unless a convenient parametric distribution, e.g. a Dirichlet distribution [5], is used, which in turn is known to be problematic, as reported in [3]. In summary, with the current state of knowledge it does not seem reasonable to go into detailed structural reliability modelling, because of the intrinsic difficulties in specifying the prior without unjustifiable assumptions. Our example of a system upgraded with an OTS component is ideally suited for a 'proper' clear-box Bayesian inference, because only a few sub-systems are used in the model. It is applicable if one can justify that the upgrade is 'perfect', i.e. there are no (design) faults in integrating the new OTS component with the ROS. If this is the case, the following assumptions seem plausible in defining the joint prior after the upgrade:

- the marginal distribution of the pfd of sub-system A (ROS) is available from the observations of the old system before the upgrade;
- the marginal distribution of the pfd of the OTS component will, in general, be unknown for the new environment (in interaction with sub-system A and the system environment). We should take a "conservative" view here and assume that the new OTS component is no better in the new environment than it is reported to have been in other environments. It may even be worth assuming it to be less reliable than the component it replaces, unless we have very strong evidence to believe otherwise. Such strong evidence can only come from the new OTS component being used extensively in an environment similar to the environment created for it by the system under consideration. The new OTS component may have a very good reliability record in various environments; this, however, cannot be used 'automatically' as strong evidence about its reliability in the upgraded system;
- the pfds of sub-systems A and B are independently distributed (as we assumed in the examples) unless there is evidence to support assuming otherwise. In the latter case, the supporting evidence will have to be used in defining the dependence between the pfds of the two sub-systems;
- in specifying the pfd of joint failure of sub-system A and sub-system B we can use 'indifference' within the range of possible values, but sensitivity analysis is worth applying to detect whether 'indifference' leads to gaining high confidence in high system reliability too quickly.

If justifying a 'perfect' upgrade is problematic, at least a 7-variate prior distribution must be used, to allow system failures observed in operation which are neither failures of the ROS nor of the new OTS component to be accommodated in the inference. In this case the marginal distributions of the pfds of the two sub-systems, A and B, are the only 'obvious' constraints which can be used in defining the prior. These, however, are likely to be insufficient to set the parameters of the 7-variate joint prior distribution, and additional assumptions are needed which may be difficult to justify.
5 Conclusions
We have studied the effect of the model chosen, and of a simplifying assumption in parameterising a clear-box model, on the accuracy of Bayesian reliability assessment of a system upgraded with a single OTS component. We have shown:

- that simplifications may lead to over-estimation or under-estimation of system reliability, and the sign of the prediction error is not known in advance. The simplified inference, therefore, is not worth recommending for predicting the reliability of safety-critical systems;
- that even when the simplified predictions are conservative, e.g. the predictions for a serial system under the assumption of independence of failures of the sub-systems, they may be too conservative. In the worst case the conservatism is proportional to the number of sub-systems used in the model. This leads to unjustifiably conservative reliability targets which are expensive to achieve;
- that detailed clear-box modelling of a system is intrinsically difficult because: i) the full inference without simplifications requires specifying a multivariate prior
which is difficult with more than 3 variates, and ii) the simplified inferences (black-box, or clear-box with simplifications) have problems with the accuracy of the predictions;
- how the available knowledge about the reliability of the sub-systems before the upgrade can be reused in constructing a multivariate prior when the upgrade is free of design faults.

The following problems have been identified and are open for further research:

- clear-box inference with the simplifying assumption that the sub-systems fail independently has been shown to lead to over- or under-estimation of system reliability. Identifying the conditions under which the simplified clear-box inference produces conservative results remains an open research problem;
- further studies are needed into multivariate distributions which can be used as prior distributions in a (non-simplified) clear-box Bayesian inference.
Acknowledgement This work was partially supported by the UK Engineering and Physical Sciences Research Council (EPSRC) under the 'Diversity with Off-the-shelf components (DOTS)' project and the 'Interdisciplinary Research Collaboration in Dependability of Computer-Based Systems (DIRC)'.
References
1. Littlewood, B. and L. Strigini, Validation of Ultra-High Dependability for Software-based Systems. Communications of the ACM, 1993. 36(11): p. 69-80.
2. Butler, R.W. and G.B. Finelli, The Infeasibility of Experimental Quantification of Life-Critical Software Reliability. In ACM SIGSOFT '91 Conference on Software for Critical Systems, in ACM SIGSOFT Software Eng. Notes, Vol. 16 (5). 1991. New Orleans, Louisiana.
3. Littlewood, B., P. Popov, and L. Strigini, Assessment of the Reliability of Fault-Tolerant Software: a Bayesian Approach. In 19th International Conference on Computer Safety, Reliability and Security, SAFECOMP'2000. 2000. Rotterdam, the Netherlands: Springer.
4. Strigini, L., Engineering judgement in reliability and safety and its limits: what can we learn from research in psychology? 1994. http://www.csr.city.ac.uk/people/lorenzo.strigini/ls.papers/ExpJudgeReport/
5. Johnson, N.L. and S. Kotz, Distributions in Statistics: Continuous Multivariate Distributions. Wiley Series in Probability and Mathematical Statistics, ed. R.A. Bradley, Hunter, J. S., Kendall, D. G., Watson, G. S. Vol. 4. 1972: John Wiley and Sons, Inc. 333.
6. Kubal, S., May, J., Hughes, G., Structural Software Reliability Estimation. In SAFECOMP '99, 18th International Conference on Computer Safety, Reliability and Security. 1999. Toulouse, France: Springer.
7. Littlewood, B. and D. Wright, Some conservative stopping rules for the operational testing of safety-critical software. IEEE Transactions on Software Engineering, 1997. 23(11): p. 673-683.
Assessment of the Benefit of Redundant Systems

Luping Chen, John May, and Gordon Hughes

Safety Systems Research Centre, Department of Computer Science
University of Bristol, Bristol, BS8 1UB, UK
{chen,jhrm,hughes}@cs.bris.ac.uk
Abstract. The evaluation of the gain in reliability of multi-version software is one of the key issues in the safety assessment of high integrity systems. Fault simulation has been proposed as a practical method to estimate diversity of multi-version software. This paper applies data-flow perturbation as an implementation of the fault injection technique to evaluate redundant systems under various conditions. A protection system is used as an example to illustrate the evaluation of software structural diversity, optimal selection of channel-pairs and the assessment of different designs.
1
Introduction
The potential benefit of voted redundancy incorporating multi-channel software is improved reliability for safety-critical systems [1]. The measurement and assessment of multi-version software is a longstanding topic of research. However, to date no effective and realistic metric or model exists for describing software diversity and evaluating the gain in reliability when diversity/redundancy is employed. Probabilistic models, such as those of Eckhardt and Lee [2] and Littlewood and Miller [3], depend upon parameters that are very difficult to estimate. The main obstacle is the difficulty of observing realistic failure behaviors of a safety-critical system with extra-high reliability. The observations are needed to estimate the values required in the corresponding probability models. However, fault injection has been proposed as a means of assessing software diversity using simulated faults and associated failure behaviors [4][5]. The fault injection approach can provide a quantitative estimation of diversity as a basis for assessment of the degree of fault tolerance achieved by redundant system elements. In turn this allows investigations of the effectiveness of different redundancy design strategies. To estimate software diversity it is not enough to know only the failure probability of each single version. The fault distribution in the software structure will influence diversity greatly because the diversity is decided not only by the failure rates but also by the positions of the failure regions in the input space (the input space is assumed to be shared by the versions). Specifically, it is necessary to understand the overlaps between the failure domains of the different versions caused by various faults. Effective searching methods are required to reveal the failure regions because the
searching tasks are very intensive. An intuitive quantitative index of failure diversity for specific faults is also needed, based on a comparison of the different failure domains of the versions under test. Both these requirements have been developed in [5]. The index requires two types of information: the individual version failure probability estimates and the likely fault distribution pattern of each single version [6]. This paper applies these developed techniques for assessing software diversity to some realistic systems incorporating redundant elements. It outlines the implementation, and then a diversity index table is used to record and compute the software diversity under various conditions. Furthermore, such quantitative estimations are utilized to observe the relative effectiveness of design strategies that are commonly regarded as having an influence on diversity. These design factors (data, algorithmic, design method etc.) are potentially important for forcing diversity between versions. Finally some results from the experiments are presented and a discussion provided of further applications and enhancements of this approach.
2
Fault Injection and Diversity Measurement
The feasibility of the fault injection approach to assess software diversity depends on the development of the following technical elements:
1. Methods to reveal how faults manifest themselves in the input space (the failure rates and positions);
2. Methods to measure the degree of coincident failure resulting from the faults naturally occurring in 'diverse' versions;
3. Introduction of faults into a program or the formulation of a library of 'real' faults, and categorisation of 'possible' faults so that the scale of any study can be controlled.
The first two elements represent the primary objectives, in that with this ability it would be possible to establish the overlapping relation between the failures of versions. However, the third element is important to allow the study to be carried out with limited resources and to understand the fault-failure relation of software. The following sections will summarise the methods and tools that have been developed for the practical utilisation of these elements.

2.1 Diversity Quantification
The concept of "Version Correlation", as clarified by references [2, 3], has been used to explain the difficulty of diversity estimation. The positive covariance caused by "Common faults" was regarded as a significant cause of dependence between versions. Theoretically, if the failures of two systems A and B are independent, their common failure probability is given by the equation P(AB) = P(A)·P(B), where P(A) and P(B) are the failure probabilities of versions A and B respectively. P(AB) is the common failure probability of both versions, i.e. the probability that both fail
together. But in practice, independence cannot be assumed and a measure of the degree of dependence is given by:
Cov(A, B) = P(AB) − P(A)·P(B)    (1)
To reflect the reliability gain by the two-version design, the diversity of a pair of versions A and B of software for an injected fault was generally defined as:
Div(AB) = 1 − P(AB) / Min{P(A), P(B)}    (2)
It is obvious that 0 ≤ Div ≤ 1, where 1 means the two versions have no coincidental failure for the fault, which is the ideal case, and 0 means there is no improvement over the single version because the coincidental failure area is not less than the smaller of the two failure areas of the individual versions. Where P(A) or P(B) is 0, Div(AB) is defined to be 1. Based on the index, we can set up a 'Div table' to record the versions' diversity under various faulty conditions and act as a basis for statistical measures of redundant system behaviour.
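To make the Div index concrete, the following minimal sketch (not from the paper; the input space, failure sets and sizes are made up for illustration) computes Div(AB) from enumerated failure sets of two versions over a shared, finite input space.

```python
# Illustrative sketch (not from the paper): computing the Div index of
# equation (2) from enumerated failure sets of two versions A and B.
# Failure "regions" are represented here simply as sets of failing inputs
# drawn from a common, finite input space of known size.

def div_index(fail_a: set, fail_b: set, input_space_size: int) -> float:
    """Div(AB) = 1 - P(AB) / min(P(A), P(B)); defined as 1 if P(A) or P(B) is 0."""
    p_a = len(fail_a) / input_space_size
    p_b = len(fail_b) / input_space_size
    p_ab = len(fail_a & fail_b) / input_space_size   # coincident failures
    if p_a == 0.0 or p_b == 0.0:
        return 1.0
    return 1.0 - p_ab / min(p_a, p_b)

# Example: a 1000-point input space, overlapping failure regions.
fail_a = set(range(100, 140))            # version A fails on 40 inputs
fail_b = set(range(130, 160))            # version B fails on 30 inputs
print(div_index(fail_a, fail_b, 1000))   # overlap of 10 -> Div = 1 - 10/30 ~ 0.67
```

With the made-up sets above, a coincident failure region of 10 points against the smaller single-version failure region of 30 points gives Div of roughly 0.67.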
2.2 Locating Failure Regions Overlap
Testing multi-version software is different from single version testing in that it is not simply the number of failure points in the input space which matters, but also their position [15]. Only with this extra information can we determine coincidental failures and calculate diversity as described in 2.1. The proposed approach to assess diversity crucially depends on an ability to search a multi-dimensional input space to establish the magnitudes and locations of the regions which stimulate the system to fail. The necessary resolution of the search methods will be related to the accuracy of coincident failure predictions that we wish to claim for the study. The developed searching methods address two distinct problems:
i) Determining an initial failure point given a specific fault;
ii) Searching from this point to establish the size and boundary of any associated failure region.
In the case of software with known faults, the first form of search does not have to be used. Generally, test cases should be effective at finding failures of a program, and preferably should satisfy some requirements or criterion for testing that is repeatable, automatic, and measurable. A review of the possible approaches indicated that each has a different effectiveness for different problems and testing objectives [5]. In our automatic test tool, the neighbourhood of a known failure point is converted into State-Space Search-branches and then a branch-and-bound algorithm is used for searching contiguous areas [5]. Generally, for searching we want to find the approach that results in the lowest number of iterating steps (and hence least computing time). Branch and bound techniques rely on the idea that we can partition our choices into sets using some domain knowledge, and ignore a set when we can determine that the searched for element cannot be within it. In this way we can avoid examining most elements of most sets [7].
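As a rough illustration of the second search problem, the sketch below (a simplified stand-in for the branch-and-bound tool described above; the grid discretisation, the toy oracle and all names are assumptions) grows a contiguous failure region outwards from a known failing point, pruning any direction in which the oracle reports no failure.

```python
# Illustrative sketch (assumed details, not the authors' tool): growing a
# contiguous failure region around a known failing point in a discretised
# 2-D input space. In practice the oracle fails(p) would compare the output
# of the fault-injected version against the golden version on input p.
from collections import deque

def map_failure_region(seed, fails, bounds, max_points=10_000):
    """Breadth-first expansion from a known failure point; returns the set of
    failing grid points reachable from the seed (a crude stand-in for the
    branch-and-bound search described in the text)."""
    (xmin, xmax), (ymin, ymax) = bounds
    region, frontier = set(), deque([seed])
    while frontier and len(region) < max_points:
        x, y = frontier.popleft()
        if (x, y) in region or not (xmin <= x <= xmax and ymin <= y <= ymax):
            continue
        if not fails((x, y)):          # bound: do not expand outside the failure region
            continue
        region.add((x, y))
        frontier.extend([(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)])
    return region

# Toy oracle: a rectangular failure region of 11 x 5 grid points.
fails = lambda p: 10 <= p[0] <= 20 and 5 <= p[1] <= 9
print(len(map_failure_region((12, 7), fails, ((0, 100), (0, 100)))))  # 55
```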
2.3
Fault Injection by Data-Flow Perturbation
Two fault injection techniques have been used in the experiments. Data-flow perturbation (DFP) is based on locations in the data flow in software, which are easily controlled to satisfy the test adequacy criteria on covering all program branches [5,9]. Constant perturbation is based on special constants in quantitative requirement specifications and can be regarded as a special type of DFP [6]. The previous work shows that diversity, as measured by the mapping of common failure regions corresponding to injected faults, (not surprisingly) depends on the pattern of the inserted fault set. Ideally one would wish to use a 'realistic' fault set, but the definition of realism is at present deferred. Fault simulation methods have been increasingly used to assess what method might reveal software defects or what software properties might cause defects to remain hidden [8]. Their application in software testing takes two different forms: the physical modification of the program's source code to make it faulty, and the modification of some part of the program's data state during execution. Some works have shown that the main difficulties and limitations of the first method lie in the huge variety of code alterations. The idea of data-state error was suggested to improve the effectiveness of fault injection techniques, and data-flow perturbation and constant perturbation have been proposed and applied in our empirical approach for software diversity assessment [6]. A data-state error can propagate because one data error can transfer to other data states by data-flow and finally may change the output of the software. The importance of data-flow is that it is the only vehicle for data-state error propagation. Constant perturbation can simulate the faults in data-flows because the data flow is fundamentally controlled by these constants. Many results have been published to support the application of the data-state error method for fault injection. These works began from studying the probability that data-state errors would propagate to the output of the program [9][10], then whether the behaviour of each real error was simulated by the behaviour of such data-mutant errors, and finally whether all artificially injected errors at a given location behave the same [11]. In constant perturbation, the selected constants are distributed at locations over the data-flow chart, and are key parameters that decide the data-flow. The advantage of this method is that we can control the test scale easily while satisfying the test adequacy criteria to cover all program branches.
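The following sketch (hypothetical constant and function names, not taken from the protection-system code) illustrates the idea of constant perturbation: a specification constant is scaled by a perturbation factor, and the failure set is the set of inputs on which the perturbed version disagrees with the golden version.

```python
# Illustrative sketch (hypothetical names): constant perturbation as a fault
# injection technique. A named constant of the program is scaled by a
# perturbation factor, and "failures" are the inputs on which the perturbed
# version disagrees with the golden version.

TRIP_THRESHOLD = 50.0          # constant taken from a hypothetical specification

def golden(x):                 # golden version of one decision in the channel
    return x > TRIP_THRESHOLD

def make_perturbed(factor):
    threshold = TRIP_THRESHOLD * factor     # the injected data-flow fault
    return lambda x: x > threshold

def failure_set(perturbed, inputs):
    return {x for x in inputs if perturbed(x) != golden(x)}

inputs = range(0, 101)
for factor in (0.9, 1.1):      # negative and positive perturbation
    fails = failure_set(make_perturbed(factor), inputs)
    print(factor, len(fails), sorted(fails)[:3])
```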
3
Protection System Evaluation
3.1
Examples of Redundant Systems
Evaluation of software diversity using the fault injection approach has been illustrated, as previously introduced in [6], by testing the software of a nuclear plant protection system project [12]. In the previous work, we mainly focused on developing techniques to realize the approach, such as the data perturbation, fault selection and diversity measurement. This paper aims to introduce its application to the assessment
of redundant systems. The software of the protection system is also used here for an extended application of the approach. The software includes four channels programmed in different languages, all matching the same requirement specification. Two channels, programmed in C and Ada, have been used for our experiments. In each case, the final code resulting from the software development process was used as the "golden version". On a given test, whether "success" or "failure" occurred with the versions derived by injecting faults into the golden versions was determined by comparing the outputs of these versions against their respective golden versions. Redundant systems were built using different combinations of the two channels. Firstly, the C and Ada channels were used to constitute a channel-pair, C-Ada, as in the original system. This pair is used to test how the difference in software structure between channels influences diversity. Secondly, we used two identical copies of the C channel or the Ada channel respectively to constitute two new systems, denoted as the C-C and Ada-Ada pairs. In reality, no one would design a system in this way, because direct copies of the same prototype mean that the two channels in the system will contain coincidental faults and show no diverse behaviors. In our experiments, these two systems are used to study how diversity is influenced by different fault distributions alone. The significance of this simulation is to investigate the situation in which two teams develop channels without deliberately different design strategies. The two channels can show diverse failure behaviors because the two developing teams might incorporate different faults. The randomness of development errors can be simulated by inserting manifestly different faults in the two channels. The application of the fault injection approach to analyse redundant system behaviour included three experiments. In 3.2, the experiment illustrates the measurement of software diversity when considering different influencing factors. Failure diversity occurs via the different structural designs of the channels and via different faults in the channels. Without attempting to design diverse structures, the approach of using different development methods to avoid the same fault classes has merit, irrespective of the methods' ability to produce structural diversity. In 3.3, the experiment compares diversity differences among various channel-pairs. This application provides a way to select an optimal channel-pair, or to build a redundant system with the most diversity under some specified conditions. In 3.4, the effectiveness of diversity design strategies is assessed by the fault injection approach. The experiments, based on a new deliberately designed channel-pair, showed that a forced diverse structural design is feasible, that the designed structures do influence software diversity, and that the design strategies can be assessed by the fault injection approach.
3.2 Diversity Measurement of Different Channel-Pairs
To assess the factors influencing software diversity, all three channel-pairs were selected for testing under various faulty conditions. In each channel (C or Ada version), four patterns of fault (denoted as NP-10_1, NP-10_2, PP-10_1 and PP-10_2) were simulated by constant perturbation. The patterns included different amplitudes and negative or positive perturbation (perturbing a data item by decreasing or increasing its value) [4]. Building a fault set involves selecting the locations (or constants) for inserted faults, and deciding how many times and how large to make the perturbation.
Some locations (constants) are very sensitive to the perturbation, and some large perturbations will cause the software to fail at a very high rate. Such unrealistic faults were excluded from the fault sets for testing diversity, since in practice they would be found by testing. Each final set therefore includes 10 sifted faults, which were selected based on their locations in the versions and the size of the resulting failure rate. In total, eight fault sets for the four fault patterns were set up for the two channels. The calculation of the Div between any pair of faults in two similar sets (with the same perturbation at the same locations or for the same constants), respectively injected into any two channels, forms a value in the Div matrix.
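A minimal sketch of how such a Div matrix can be assembled is given below (the failure sets and input-space size are invented; in the experiments they would come from the search procedure of Section 2.2).

```python
# Illustrative sketch (assumed data shapes): building the Div matrix of
# Section 3.2. fail_ch1[i] and fail_ch2[j] are the failure sets observed when
# fault i (resp. j) of the fault set is injected into channel 1 (resp. 2);
# every (i, j) pair of faults gives one matrix entry via equation (2).

def div(a, b, n):
    pa, pb, pab = len(a) / n, len(b) / n, len(a & b) / n
    return 1.0 if pa == 0 or pb == 0 else 1.0 - pab / min(pa, pb)

def div_matrix(fail_ch1, fail_ch2, input_space_size):
    return [[div(fa, fb, input_space_size) for fb in fail_ch2] for fa in fail_ch1]

# Toy example with 3 faults per channel over a 1000-point input space.
fail_c   = [set(range(0, 20)), set(range(50, 60)), set(range(200, 230))]
fail_ada = [set(range(10, 25)), set(range(50, 60)), set(range(400, 410))]
for row in div_matrix(fail_c, fail_ada, 1000):
    print([round(v, 2) for v in row])
```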
Fig. 3.1. Div values for C-Ada pair
Fig. 3.2. Div values for Ada-Ada pair
Fig. 3.3. Div values for C-C pair
Figures 3.1 to 3.3 are the graphical representations of the Div matrix values for C-Ada, Ada-Ada and C-C with a similar fault pattern. The results obtained by inserting faults with other patterns gave the same picture and are thus omitted for brevity. In all three situations (Fig. 3.1 to Fig. 3.3), the fault numbers are arranged in the same order for both channels. For example, in Figure 3.1, No. 1 on the X-axis (which represents a fault in the Ada channel) is the same as No. S1 on the Y-axis (which represents a fault in the C channel). The C-Ada pair was tested mainly to explore the effects of structural diversity. In Fig. 3.1, it can be seen that the Div values form an unsymmetrical matrix and the diagonal elements are not always zero. This means that, even for the same fault, the C and Ada channels may have different failure regions. This is direct evidence of the design diversity of the two versions, because "diversity by different fault" has been precluded. From the Div values of the C-C and Ada-Ada pairs, as in Fig. 3.2 and Fig. 3.3, we can observe the effects on diversity of varying fault distributions alone. Their Div values form a symmetrical matrix and all diagonal elements are zero, but the off-diagonal elements are mostly greater than zero. This shows that different faults in the two channels of a C-C or Ada-Ada pair can cause diverse failure behaviors, which is the main factor to consider in diversity design for such channel-pairs. It also shows that the C-C and Ada-Ada pairs may have different diversities because the structures of the C and Ada channels are different. In the next section, we will discuss how to use this difference to select a channel-pair. Through synthesizing the test results of all three channel-pairs, we can also conclude that:
• Statistically, the common failure probability obtained from the test results is higher than the value obtained by assuming failure independence (i.e. by multiplying the failure probabilities of the single versions). This result is consistent with the general belief that the common failure probability will be underestimated if failure independence is assumed.
• By varying the size of perturbation (same constants in each version), the diversity achieved was observed to increase as the single-version failure probabilities decreased. This result will be used to assess design strategies in 3.3.

3.3 The Selection of Version-Pairs
Similar to the experiments in 3.2, we can compare the results for all pairs to investigate which pair has the more diverse behaviour, as shown in Fig. 3.4. No pair shows an obvious advantage, and this is perhaps because no special design effort was put into ensuring diversity. Using our 'same faults' analysis, on average the C-Ada and C-C pairs gave higher diversity. One possible reason for this is that the Ada version used more global variables, so that there is higher branch coupling, which naturally tends to result in more common failure regions.
Fig. 3.4. Comparison of version-pair
These specific conclusions are not suggested as being generally true because of the restricted nature of the injected faults used. However, the experiments have demonstrated the feasibility of the approach to compare channel-pairs and assess design strategies given more realistic fault conditions, i.e. the same experimental procedures would be repeated for different types of injected fault set.
3.4 Assessment of System Designs on Diversity
Using development teams with different backgrounds to design multi-version software has been a traditional approach to claim/enhance diversity [1][13]. As discussed before, two factors play essential roles in the production of diversity. One concerns the distribution of faults, i.e. random uncertainties relating to human mistakes. To exploit this factor to increase diversity, the aim would be to reduce the number of faults and avoid the 'same' faults occurring in different versions. The second factor concerns using different software structures, possibly resulting from the
use of different programming languages and design/development methods. The experiments showed that the structures existing in the diverse versions used could protect the software from coincidental failures even assuming the same faults are present in the versions. In the present case, differences in the software structure are entirely 'coincidental'. In practice, to force version diversity, it is plausible that special strategies can be employed at the software design stage [14]. In this section, the experiment aims to use the fault injection approach to assess the change introduced by the deliberate design of a new version. In particular, we want to show three points:
• Design structure can influence software diversity.
• This influence can be controlled.
• Design strategies can be assessed by the fault injection approach.
The way in which software structure could influence failure diversity is currently poorly understood. Some research suggests that the branch coupling in a single channel can influence failure areas [10]. This is because the more branches are coupled, the more inputs are connected by the same data-flow. Therefore, a fault contained in such a data-flow can cause more inputs to manifest as failure points. Our experiment investigates the use of branch coupling. To introduce a clear change, we purposely attempt to design a channel-pair with lower diversity by increasing the branch coupling. Therefore, in this experiment, we select a channel, re-design its branch coupling, and observe the failure behaviour of the new channel-pair. A piece of source code of a module named "analyse_values.c" from the C channel of the protection system [12] was selected and re-written. Part of the calculation of "combined trips" was changed to introduce more data-flow branches. The correctness of the new version was verified and validated by back-to-back testing with the original version of the channel. Two version pairs were used for comparison: the original C-C pair and the Cnew-Cnew pair. The same fault patterns as in 3.3 were used for the injection test on the Cnew-Cnew pair, which produced the results shown in Figure 3.5 for comparison with Figure 3.3 for the C-C pair.
Fig. 3.5. Div of Cnew-Cnew pair to DFP set 1
A general comparison is given in Figure 3.6. We compare the behaviours of the C-C pair and the Cnew-Cnew pair. Two fault sets (F1, F2) are used on the C-C pair and a third fault set (F3) on the Cnew-Cnew pair. In all cases faults of the 'same type' are used (same locations for inserting faults), but the magnitude of perturbation varied. The fault sets are chosen so that, for the single-version failure probability, APS(F1, C-C) > APS(F3, Cnew-Cnew) > APS(F2, C-C). The fact that the Div of the Cnew-Cnew pair was lower than that of the C-C pair, irrespective of the APS values, shows that we have succeeded in controlling diversity using design. Here APS denotes the average failure probability of a single version.
Fig. 3.6. Assessment of new channel-pair (Div, AveP and APS for fault sets F1 and F2 on the C-C pair and fault set F3 on the Cnew-Cnew pair)
4
Conclusion
This paper applies a fault injection approach to see how the measured diversity of multi-version software is affected by the code structure in version pairs and by fault types. The approach uses mechanistic failure models based on the concept of input space testing/searching linked to the introduction of representative faults by the methods of data-flow and constant perturbation. The techniques have been demonstrated on industrial software, representative of two 'diverse' software channels. This approach allows quantification of the degree of diversity between the two versions. The approach allows the design and programming factors that influence diverse failure behaviour to be studied in a quantitative way. From analyses of these experiments, we can derive the following initial conclusions:
• Under all four injected fault patterns, Div was found to be lower than the diversity estimate produced if version failure was assumed to be independent;
• Even when they contain the same fault, versions can possess different failure domains in the input space, so design for diversity using software structure is possible;
• One way to achieve failure diversity is by ensuring different faults in versions, as illustrated by testing the same versions (C-C, Ada-Ada) with different fault distributions;
• Div measurement can be used to assess the different design strategies for software diversity.
These experimental results show stable trends in their sensitivity to different test environments by considering different diversity strategies, different fault patterns and different fault profiles etc. Therefore, in general, this approach can be used to assess the actual diversity, rather than assuming independence, and thus establish the effectiveness of the various factors (e.g. different version designs and combinations) to improve diversity. Further research will consider how to extend the experimental scale, e.g. to more complex fault conditions, to lower APS, etc., and use such experiments to check theoretical models and common beliefs as a means of defining effective strategies for achieving diversity.
Acknowledgements The work presented in this paper comprises aspects of a study (DISPO) performed as part of the UK Nuclear Safety Research programme, funded and controlled by the Industry Management Committee together with elements from the SSRC Generic Research Programme funded by British Energy, National Air Traffic Services, Lloyd's Register, and the Health and Safety Executive.
References
1. Avizienis, A. and J. P. J. Kelly, Fault tolerance by design diversity: concepts and experiments, IEEE Computer, Aug. 1984, pp. 67-80
2. Eckhardt, D. & Lee, L., "A theoretical basis for the analysis of multiversion software subject to coincident errors", IEEE Trans. Software Eng., Vol. SE-11, 1985
3. Littlewood, B. & Miller, D., "A conceptual model of multi-version software", Proc. of FTCS-17, IEEE, 1987
4. Arlat, J. et al., Fault injection for dependability validation - a methodology and some applications, IEEE Trans. Software Eng., Vol. 16, no. 2, Feb. 1990, pp. 166-182
5. Chen, L., Napier, J., May, J., Hughes, G.: Testing the diversity of multi-version software using fault injection. Procs of Advances in Safety and Reliability, SARSS (1999) 13.1-13.10
6. Chen, L., May, J., Hughes, G., A Constant Perturbation Method for Evaluation of Structural Diversity in Multiversion Software, Lecture Notes in Computer Science 1943: Computer Safety, Reliability and Security, Floor Koornneef & Meine van der Meulen (Eds.), Springer, Oct. 2000
7. Kumar, V. & Kanal, L.N., "A general Branch and Bound Formulation for Understanding And/Or Tree Search Procedures", Artificial Intelligence, 21, pp. 179-198, 1983
8. Voas, J. M., McGraw, G.: Software Fault Injection: Inoculating programs against errors. Wiley Computer Publishing, 1998
9. Voas, J. M., A dynamic failure model for performing propagation and infection analysis on computer programs, PhD Thesis, College of William and Mary, Williamsburg, VA, USA, 1990
10. Murill, B.W., Error flow in computer programs, PhD thesis, College of William and Mary, Williamsburg, VA, USA, 1991
11. Michael, C.C., Jones, R.C., On the uniformity of error propagation in software, Technical Report RSTR-96-003-4, RST Corporation, USA
12. Quirk, W.J. and Wall, D.N., "Customer Functional Requirements for the Protection System to be used as the DARTS Example", DARTS consortium deliverable report DARTS-032-HAR-160190-G supplied under the HSE programme on Software Reliability, June 1991
13. Mitra, S., N.R. Saxena, and E.J. McCluskey, "A Design Diversity Metric and Reliability Analysis for Redundant Systems," Proc. 1999 Int. Test Conf., pp. 662-671, Atlantic City, NJ, Sep. 28-30, 1999
14. Geoghegan, S.J. & Avresky, D.R., "Method for designing and placing check sets based on control flow analysis of programs", Proceedings of the International Symposium on Software Reliability Engineering, ISSRE, pp. 256-265, 1996
15. Bishop, P.G., The variation of software survival time for different operational input profiles (or why you can wait a long time for a big bug to fail), Proc. 23rd IEEE Int. Symp. on Fault-Tolerant Computing (FTCS-23), Toulouse, France, pp. 98-107, 1993
Estimating Residual Faults from Code Coverage

Peter G. Bishop

Adelard and Centre for Software Reliability, City University, Northampton Square, London EC1V 0HB, UK
[email protected], [email protected]
Abstract. Many reliability prediction techniques require an estimate for the number of residual faults. In this paper, a new theory is developed for using test coverage to estimate the number of residual faults. This theory is applied to a specific example with known faults and the results agree well with the theory. The theory is used to justify the use of linear extrapolation to estimate residual faults. It is also shown that it is important to establish the amount of unreachable code in order to make a realistic residual fault estimate.
1
Introduction
Many reliability prediction techniques require an estimate for the number of residual faults [2]. There are a number of different methods of achieving this; one approach is to use the size of the program combined with an estimate for the fault density [3]. The fault density measure might be based on generic values, or past experience, including models of the software development process [1,7]. An interesting alternative approach was suggested by Malaiya, Denton and Li [4,5] who showed that the growth in the number of faults detected was almost linearly correlated with the growth in coverage. This was illustrated by an analysis of the coverage data from an earlier experiment performed on the PREPRO program by Pasquini, Crespo and Matrella [6]. This paper develops a general theory for relating code coverage to detected faults that differs from the one developed in [4]. The results of applying this model are presented, and used to justify a simple linear extrapolation method for estimating residual faults.
2
Coverage Growth Theory
We have developed a coverage growth model that uses different modelling assumptions from those in [4]. The assumptions in our model are that:
• there is a fixed execution rate distribution for the segments of code in the program (for a given input profile)
• faults are evenly distributed over the executable code (regardless of execution rate)
• there is a fixed probability of failure, f, per execution of the faulty code
The full theory is presented in the Appendix, but the general result is that:
M(t) = N0 · C(f·t)    (1)
where the following definitions are used:
N0      initial number of faults
N(t)    residual faults at time t
M(t)    detected faults at time t, i.e. N0 − N(t)
C(f·t)  fraction of covered code at time f·t
f       probability of failure per execution of the faulty code
Note that this theory should work with any coverage measure; a fine-grained measure like MCDC coverage should have a higher value of f than a coarser measure like statement coverage. Hence a measure with a larger f value will detect the same number of faults at a lower coverage value. With this theory, the increase in detected faults M is always linear with increase in coverage C if f = 1. This is true even if the coverage growth curve against time is nonlinear (e.g. an exponential rise curve) because the fault detection curve will mirror the coverage growth curve; as a result, a linear relationship is maintained between M and C. If f < 1, a non-linear coverage growth curve will lead to a non-linear relationship between M and C. In the Appendix one particular growth model is analysed where the coverage changes as some inverse power of the number of tests T, i.e. where:
1 − C(t) = 1 / (1 + k·T^p)    (2)
Note that 1 − C(t) is equivalent to the proportion of uncovered code, U(t); this was used in the Appendix as it was mathematically more convenient. With this growth curve, the Appendix shows that the relationship between detected faults M and coverage C is:
M/N0 = 1 − (1 − C) / (f^p + (1 − C)(1 − f^p))    (3)
The normalised growth curves of detected faults M/N0 versus covered code C from equation (3) are shown in the figure below for different values of f^p. Analysis of equation (3) shows that the slope approximates to f^p at U = 1, while the final slope is 1/f^p at U = 0, so if for example f^p = 0.1 then the initial slope is 0.1 and the final slope is 10. This would mean that the relationship between N and U is far from linear (i.e. the initial and final slopes differ by a factor of 100). If the value of p is small (i.e. the growth in coverage against tests has a "long tail") this reduces the effect of having a low value for f, as is illustrated in the table below.
Fig. 1. Normalised graph of faults detected versus coverage (inverse power coverage growth): fraction of detected faults M/N0 versus coverage C for f^p = 1, 0.5, 0.2 and 0.1
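As a quick way to reproduce the shape of these curves, the short sketch below evaluates equation (3) directly for a few values of f^p (an illustrative calculation, not part of the original paper).

```python
# Illustrative sketch: evaluating the normalised growth curve of equation (3),
# M/N0 = 1 - (1 - C) / (f^p + (1 - C)(1 - f^p)), for several values of f^p.
# This reproduces the shape of the curves in Fig. 1.

def detected_fraction(c, fp):
    u = 1.0 - c                      # uncovered fraction
    return 1.0 - u / (fp + u * (1.0 - fp))

for fp in (1.0, 0.5, 0.2, 0.1):
    curve = [round(detected_fraction(c / 10, fp), 2) for c in range(11)]
    print(f"f^p = {fp}: {curve}")

# For f^p = 1 the relationship is exactly linear; for small f^p the curve
# starts with slope ~f^p and ends with slope ~1/f^p, as noted in the text.
```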
Table 1. Example values of f^p

f      p=1    p=0.5   p=0.1
1.0    1.0    1.0     1.0
0.5    0.5    0.71    0.93
0.1    0.1    0.42    0.79
This suggests that the non-linearity can be reduced, permitting more accurate predictions of the number of residual faults. To evaluate the theory, its assumptions and its predictions in more detail, we applied the model to the PREPRO program.
3
Evaluation of the Theory on the PREPRO Example
PREPRO is an off-line program written in C that was developed for the European Space Agency. It computes parameters for an antenna array. It processes an input file containing a specification for the antenna. The antenna description is parsed by the program and, if the specification is valid, a set of antenna parameters computed and sent to the standard output. The program has to detect violations of the specification syntax and invalid numerical values for the antenna. This program is quite complex, containing 7555 executable lines of code. The original experiment did not provide the data necessary to evaluate our model so we obtained a copy of the program and supporting test harness from the original
experimenters [6], so that additional experiments could be performed, namely the measurement of:
• the growth in coverage of executable lines, C(t)
• the mean growth in detected faults, M(t)
• the probability of failure per execution of a faulty line, f
3.1
Measurement of Growth in Line Coverage
In our proposed model, we assumed that there was an equal chance of a fault being present in any executable line of code. We therefore needed to measure the growth in line coverage against tests (as this was not measured in the original experiment). The PREPRO software was instrumented using the Solaris statement coverage tool tcov, and statement coverage was measured after different numbers of tests had been performed. The growth of line coverage against the number of tests is shown below.
Fig. 2. Coverage growth for PREPRO (line coverage): fraction of lines covered C(t) versus tests t
It can be seen that around 20% (1563 out of 7555) of executable lines are uncovered even after a large number of tests. The tcov tool generates an annotated listing of the source showing the number of times each line has been executed. We analysed the entry conditions to the blocks that were not covered during the tests. The results are shown in the table below. Table 2. Analysis of uncovered code blocks
Entry condition to block                                                     Number of blocks   Number of lines
Uncallable ("dangling" function)                                                     1                  6
Internal error (e.g. insufficient space in internal table or string,
  or malloc memory allocation problem)                                              47                248
Unused constructs, detection of invalid syntax                                      231               1309
Total                                                                               279               1563
The first type of uncovered code is definitely unreachable. The second is probably unreachable because the code detecting internal errors cannot be activated as this would require changes to the code to reduce table sizes, field sizes, etc. The final class of uncovered code is potentially reachable given an alternative test harness that covered the description syntax more fully and also breaks the syntax rules. The “asymptote” observed in Fig.3 of 1563 uncovered lines is assumed to represent unreachable code under the test profile generated by the harness. This stable value was achieved after 10 000 tests and was unchanged after 100 000 tests. If we just consider the 5992 lines that are likely to be reachable with the test profile we get the following relationship between uncovered code and the number of tests.
Fig. 3. Uncovered lines (excluding unreachable code) vs. tests (log-log graph): fraction of uncovered reachable lines U(t) versus tests t
This linear relationship between the logarithm of the uncovered fraction and the logarithm of the number of tests suggests that the inverse power law model of coverage growth (equation 2) can be applied to the PREPRO example. From the slope of the line it is estimated that the power law value for coverage growth in PREPRO is p = 0.4.
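The slope estimate can be obtained with an ordinary least-squares fit on the log-log data; the sketch below (synthetic data, not the PREPRO measurements) recovers p from the large-t approximation of equation (2).

```python
# Illustrative sketch (made-up data): estimating the power-law exponent p of
# equation (2) from the slope of log(U) versus log(t). For large t,
# U(t) ~ 1/(k t^p), so the log-log slope tends to -p.
import math

def estimate_p(tests, uncovered_fraction):
    xs = [math.log(t) for t in tests]
    ys = [math.log(u) for u in uncovered_fraction]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
            sum((x - mx) ** 2 for x in xs)
    return -slope

# Synthetic data generated with p = 0.4 and k = 0.5, using the large-t
# approximation U ~ 1/(k t^p); the fit then recovers p exactly.
p_true, k = 0.4, 0.5
tests = [10, 100, 1_000, 10_000, 100_000]
u = [1.0 / (k * t ** p_true) for t in tests]
print(round(estimate_p(tests, u), 2))   # 0.4
```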
3.2 Measurement of the Mean Growth in Detected Faults
The fault detection times will vary with the specific test values chosen. To establish the mean growth in detected faults, we measured the failure rate of each fault inserted individually into PREPRO, using a test harness where the outputs of the “bugged” version were compared against the final version. This led to an estimate for the failure probability per test, λi of each fault under the given test profile. These probabilities were used to calculate the mean number of faults detected after a given number of tests using the following equation:
M(t) = Σ_i (1 − exp(−λ_i t))    (4)
where t is the number of tests. This equation implies that several faults could potentially be detected per test—especially during the early test stages when the faults detected have failure probabilities close to unity. In practice, several successive tests might be needed to remove a set of high probability faults. This probably happened in the original PREPRO experiment where 9 different faults caused observable failures in the first 9 tests. We observed that 4 of the 33 faults documented within the code did not appear to result in any differences compared to the “oracle”. These faults were pointer-related and for the given computer, operating system and compiler, the faulty assignments might not have had an effect (e.g. if temporary variables are automatically set to zero, this may be the desired initial value, or if pointers are null, assignment does not overwrite any program data). The mean growth in detected faults, M(t), is shown below.
Fig. 4. Mean growth in detected faults versus tests
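Equation (4) is straightforward to evaluate once per-fault failure probabilities are available; the sketch below uses made-up λ values purely to illustrate the calculation.

```python
# Illustrative sketch (made-up failure probabilities): the mean number of
# detected faults after t tests, M(t) = sum_i (1 - exp(-lambda_i * t)),
# as in equation (4).
import math

def mean_detected(lambdas, t):
    return sum(1.0 - math.exp(-lam * t) for lam in lambdas)

# Hypothetical per-test failure probabilities spanning several decades.
lambdas = [1e-1, 1e-2, 1e-3, 1e-4, 1e-5]
for t in (10, 100, 1_000, 10_000, 100_000):
    print(t, round(mean_detected(lambdas, t), 2))
```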
3.3
Measurement of Segment Failure Probability f
In the theory it is assumed that f is a constant. To evaluate this we measured the number of failures for each of the known faults (activated individually) under extended testing. Using the statement coverage tool, tcov, we also measured the associated number of executions of the faulty line. Taking the ratio of these two
values we computed the failure probability per execution of the line, f. It was found that f was not constant for all faults; the distribution is shown in the figure below.
Fig. 5. Distribution of segment failure probabilities (failure probability per segment execution f for each fault, sorted by f)
There is clearly a wide distribution of values of f, so to apply the theory we should know the failure rates of individual faults (or the likely distribution of f). However, we know that the range of values of f^p is more restricted than that of f. We can also take the geometric mean of the individual f values to ensure that all values are given equal weight, i.e. fmean = (Π fi)^(1/Nf), where Nf is the number of terms in the product. The geometric mean of the values was found to be fmean = 0.0475.
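For completeness, the geometric mean used here is simply the exponential of the mean logarithm; a tiny sketch (with invented f values) is given below.

```python
# Illustrative sketch: geometric mean of per-fault values of f,
# f_mean = (prod_i f_i)^(1/N_f), computed via logarithms for stability.
import math

def geometric_mean(values):
    return math.exp(sum(math.log(v) for v in values) / len(values))

print(round(geometric_mean([0.5, 0.05, 0.01, 0.2, 0.001]), 4))  # made-up f values
```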
4
Comparison of the Theory with Actual Faults Detected
The estimated fault detection curve, M(t), based on known fault failure probabilities, can be compared with the growth predicted using equation (3) with the values of f and p derived earlier. The earlier analysis gave the following model parameters: f = 0.0475 and p = 0.40, so the value of f^p is f^p = 0.0475^0.4 = 0.295. The comparison of the actual and predicted growth curves is shown in figure 6 below. Note that the theoretical curve is normalised so that 5992 lines covered is viewed as equivalent to C=1 (as this is the asymptotic value under the test conditions).
Fig. 6. Faults detected versus coverage (theory and actual faults): mean detected faults for the known faults compared with the theory prediction for f^p = 0.295
It can be seen that the theory prediction and mean number detected (based on the failure rates of actual faults) are in reasonable agreement.
5
Estimate of Residual Faults
The coverage data obtained for PREPRO are shown below: Table 3. Coverage achieved and faults detected
Total executable lines           7555
Covered after test 100 000       5992
Unreachable                       254
Potentially reachable            1309
Faults detected (M)                29
Taking account of unreachable code, the final fraction of covered code is C = 5992 / (7555 − 254) = 0.785. If we used equation (3) and took C = 0.785 and f^p = 0.295, we would obtain a prediction that the fraction of faults detected is 0.505. However this would be a
misuse of the model, as the curve in figure 6 is normalised so that 5992 lines is regarded as equivalent to C=1. Using the model to extrapolate backwards from C=1, only one fault would be undetected if the coverage was 0.99 (5932 lines covered out of 5992). As this level of coverage was achieved in less than 10 000 tests and 100 000 tests were performed in total, it is reasonable to assume that a high proportion of the faults have been detected for the 5992 lines covered. The theory shows that there can be a non-linear relationship between detected faults and coverage, even if there is an equal probability of a fault in each statement. Indeed, given the measured values of f and p, good agreement is achieved using an assumption of constant fault density. Given a constant fault density and a high probability that faults are detected in the covered code, we estimate the density of faults in the covered code to be 29/5992 = 0.0048 faults/line, and hence the number of faults in uncovered code is N = 1563 · 0.0048 = 7.5, or, if we exclude code that is likely to be always unreachable, N = 6.3. So from the proportion of uncovered code, the estimate for residual faults lies between 6 and 8. The known number of undetected faults for PREPRO is 5, so the prediction is close to the known value.
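The whole estimation step reduces to a linear extrapolation of the observed fault density; the sketch below reproduces the arithmetic with the PREPRO figures quoted above.

```python
# Illustrative sketch: the residual-fault estimate of Section 5, i.e. linear
# extrapolation of the fault density of covered code into the uncovered code.
# The figures below are the PREPRO values quoted in the text.

covered_lines       = 5992
faults_detected     = 29
uncovered_lines     = 1563      # all uncovered lines
reachable_uncovered = 1309      # excluding code judged always unreachable

density = faults_detected / covered_lines               # ~0.0048 faults/line
print(round(density, 4))
print(round(uncovered_lines * density, 1))      # ~7.6 (7.5 in the text, which rounds the density)
print(round(reachable_uncovered * density, 1))  # ~6.3
```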
6
Discussion
The full coverage growth theory is quite complex, but it does explain the shape of the fault-coverage growth curves observed in [4,5] and in our evaluation experiment. Clearly there are difficulties in applying the full theory. We could obtain a coverage versus time curve directly (rather than using p), but there is no easy way of obtaining f. In addition we have a distribution of f values rather than a constant, which is even more difficult to determine. However, our analysis indicates that the use of the full model is not normally necessary. While we have only looked at one type of coverage growth curve, the theory suggests that any slow growth curve would reveal a very high proportion of the faults in the code covered by the tests. It could therefore be argued that the number of residual faults is simply proportional to the fraction of uncovered code (given slow coverage growth). This estimate is affected by unreachable code (e.g. defensive code). To avoid overestimates of N, the code needs to be analysed to determine how many lines of code are actually reachable. In the case of PREPRO the effect is minor; only one less fault is predicted when unreachable code is excluded, so it may be simpler to conservatively assume that all lines of code are reachable.
Ideally, the fault prediction we have made should be assessed by testing the uncovered code. This would require the involvement of the original authors, which is impractical as development halted nearly a decade ago. Nevertheless, the fault density values are consistent with those observed in other software, and deriving the expected faults from a fault density estimate is a well-known approach [1,3,7]. The only difference here is that we are using coverage information to determine density for covered code, rather than treating the program as a “black box”. Clearly, there are limitations in this method (such as unreachable code), which might result in an over-estimate of N, but this would still be a useful input to reliability prediction methods [2].
7
Conclusions
1. A new theory has been presented that relates coverage growth to residual faults.
2. Application of the theory to a specific example suggests that simple linear extrapolation of code coverage can be used to estimate the number of residual faults.
3. The estimate of residual faults can be reduced if some code is unreachable, but it is conservative to assume that all executable code is reachable.
4. The method needs to be evaluated on realistic software to establish what level of accuracy can be achieved in practice.
Acknowledgements This work was funded by the UK (Nuclear) Industrial Management Committee (IMC) Nuclear Safety Research Programme under British Energy Generation UK contract PP/114163/HN with contributions from British Nuclear Fuels plc, British Energy Ltd and British Energy Group UK Ltd. The paper reporting the research was produced under the EPSRC research interdisciplinary programme on dependability (DIRC).
References
1. R.E. Bloomfield, A.S.L. Guerra, "Process Modelling to Support Dependability Arguments", DSN 2002, Washington, DC, 23-26 June 2002
2. W. Farr, "Software Reliability Modeling Survey", in Handbook of Software Reliability Engineering, M. R. Lyu (Ed.), pages 71-117, McGraw-Hill, New York, NY, 1996
3. M. Lipow, "Number of Faults per Line of Code", IEEE Trans. on Software Engineering, SE-8(4):437-439, July 1982
4. Y. K. Malaiya and J. Denton, "Estimating the number of residual defects", HASE'98, 3rd IEEE Int'l High-Assurance Systems Engineering Symposium, Maryland, USA, November 13-14, 1998
5. Y.K. Malaiya, J. Denton and M.N. Li, "Estimating the number of defects: a simple and intuitive approach", Proceedings of the Ninth International Symposium on Software Reliability Engineering, Paderborn, Germany, November 4-7, 1998, pp. 307-315
6. A. Pasquini, A. N. Crespo and P. Matrella, "Sensitivity of reliability growth models to operational profile errors", IEEE Trans. Reliability, vol. 45, no. 4, pp. 531-540, Dec. 1996
7. K. Yasuda, "Software Quality Assurance Activities in Japan", Japanese Perspectives in Software Engineering, 187-205, Addison-Wesley, 1989
Appendix: Coverage Growth Theory
In this analysis of coverage growth theory, the following definitions will be used:
N0     initial number of faults
N(t)   residual faults at time t
M(t)   detected faults at time t, i.e. N0 − N(t)
C(t)   fraction of covered code at time t
U(t)   fraction of uncovered code, i.e. 1 − C(t)
Q      execution rate of a line of executable code
f      probability of failure given execution of a fault in a line of code
We assume the faults are evenly distributed over all lines of code. If we also assume there is a constant probability of failure f per execution of an erroneous line, then the mean number of faults remaining at time t is:

N(t) = N0 ∫₀^∞ pdf(Q) · e^(−fQt) dQ    (5)

Similarly the uncovered lines remaining at time t are:

U(t) = ∫₀^∞ pdf(Q) · e^(−Qt) dQ    (6)
The equations are very similar, apart from the exponential term. Hence we can say by simple substitution
N(t) = N0 · U(f·t)    (7)
Substituting N(t) = N0 – M(t), and U(t) = 1 – C(t) and rearranging we obtain:
M(t) = N0 · C(f·t)    (8)
So when f=1, we get linear growth of faults detected, M, versus coverage, C, regardless of the distribution of execution rates, pdf(Q). For the case where f << 1, there is a time scaling between growth in C and growth in M and in this case the M versus C curve will depend on the distribution of execution rates within the specific program under the specific operational profile. To gain some insight into the relationship between M and C (or equivalently N and U),
the following section derives the relationship for one particular type of coverage growth curve.

Inverse Power Law Coverage Growth

Let us take an example of slow coverage growth, where the uncovered code is some inverse power of t:
U(t) = 1 / (1 + k·t^p)    (9)
where p < 1. Rearranging we obtain:

t = [(1 − U) / (kU)]^(1/p)    (10)
Substituting for t in equation (7), it can be shown that:
N/N0 = U / (f^p + U(1 − f^p))    (11)
For the case where U << f^p:

N/N0 ≈ U / f^p    (12)
So the limiting value of the gradient is:

dN/dU → N0 / f^p  as U → 0    (13)
And the relationship is nearly linear as U decreases to its minimum value. If p < 1, the gradient is less sensitive to the value of f. For example, if f = 0.1 and p = 0.5, then the gradient becomes 0.31.
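As an informal numerical check of the key identity behind equation (7), the sketch below (made-up execution-rate profile and parameter values, and a simplified per-test execution model) compares the expected fraction of undetected faults after t tests with the uncovered fraction at the scaled time f·t; the two columns should roughly agree when per-test execution probabilities are small.

```python
# Illustrative numerical check (not from the paper, simplified assumptions):
# with per-test execution probability q_i for each line and failure
# probability f per execution of a faulty line, the expected fraction of
# undetected faults after t tests is mean_i (1 - f*q_i)^t, while the
# uncovered fraction at "time" f*t is mean_i (1 - q_i)^(f*t). For small q_i
# these agree, which is the relation N(t) = N0 * U(f*t) of equation (7).
import random

random.seed(0)
qs = [10 ** random.uniform(-6, -2) for _ in range(5_000)]   # arbitrary execution profile
f = 0.05

def undetected_fraction(t):
    return sum((1.0 - f * q) ** t for q in qs) / len(qs)

def uncovered_fraction(t):
    return sum((1.0 - q) ** t for q in qs) / len(qs)

for t in (1_000, 10_000, 100_000):
    # the two columns approximately agree
    print(t, round(undetected_fraction(t), 3), round(uncovered_fraction(f * t), 3))
```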
Towards a Metrics Based Verification and Validation Maturity Model

Jef Jacobs1 and Jos Trienekens2

1 Philips Semiconductors, Bld WAY-1, Prof. Holstlaan 4, 5656 AA Eindhoven, The Netherlands
2 Frits Philips Institute, Eindhoven University of Technology, Den Dolech 2, 5600 MB Eindhoven, The Netherlands
Abstract. Many organizations and industries continuously struggle to achieve the foundation of a sound verification and validation process. Because verification and validation is only marginally addressed in software process improvement models like CMMI, a separate verification and validation improvement model has been developed. The framework of the improvement model presented in this paper is based on the "Testing Maturity Model" (TMM). However, considerable enhancements are provided, such as a V&V assessment procedure and a detailed metrics program to determine the effectiveness and efficiency of the improvements. The resulting model is dubbed MB-V2M2 (Metrics Based Verification & Validation Maturity Model). This paper addresses the development approach of MB-V2M2, an outline of the model, early validation experiments, the current status, and a future outlook.
1
Introduction
Software verification and validation is coming of age. Nevertheless, many organizations struggle with the founding of a sound verification and validation process. One of the reasons is that existing software development maturity models, like CMMISM [10], have not adequately addressed verification and validation issues, nor has the nature of a mature verification and validation (V&V) process been well defined. What is a sound V&V process in the first place? How should you organise and implement V&V process improvement? What are the consequences for the organization? In short, guidance for the process of V&V process improvement is badly needed, as well as a method to measure the maturity level of V&V, analogous to the widely adopted Capability Maturity Model Integration (CMMI) for the software development process. In 1999 a rather unique initiative was undertaken: a number of industrial companies, consultancy & service agencies and an academic institute, all operating and residing in the Netherlands, decided to jointly develop, validate and disseminate a V&V improvement model. This model should unite the strengths and remove the weaknesses of known improvement models.
This paper addresses the achievements so far. Chapter 2 describes the objectives of the model and the starting points for the development of the model. Chapter 3 presents the conceptual basis of the model, i.e. the framework, the elements and the structure. Chapter 4 addresses some guidelines for the application of the model in practice. Finally, chapter 5 gives the current status of the model and a future outlook.
2
Objectives and Starting Points
A consortium was formed of industrial companies1, consultancy & service organizations2 and an academic institute3. The industrial partners operate in diverse and high-demanding fields including defence and civil systems, telecommunication and satellites, consumer and professional electronics. The consultancy & service partners operate in software quality, testing, and related vocational training. The academic partner is a technology research institute affiliated with the University of Technology Eindhoven, specialising in R&D for technology-intensive companies. The partners required that the developed model be universally applicable (that is, not geared towards a specific type of business) and identified that the model should at least:
• Describe a coherent and generic set of steps and actions to improve a V&V process
• Provide well-defined V&V maturity levels
• Provide an instrument to assess V&V maturity
• Define measurements to determine the effectiveness of improvement actions
• Recommend a metrics base to select process improvements, to track and control implementation of improvement actions, and to adjust and focus the process improvements.
1 Thales (formerly Hollandse Signaal Apparaten), Lucent Technologies The Netherlands and Philips Electronics The Netherlands.
2 Improve Quality Services, Quality House The Netherlands.
3 Frits Philips Institute, University of Technology – Eindhoven The Netherlands.
After intensive deliberations and preliminary research scans, it was agreed to begin a joint project using an existing test improvement model, TMM (Testing Maturity Model), as a basis [1][2]. The main reasons for this choice, supported by all V&V experts of the consortium partners, were that this model already seems to fulfil quite a few of the objectives that the consortium had in mind, that TMM reflects over forty years of industry-wide software testing evolution, and that TMM was designed to be a counterpart of the software process improvement model CMM [10]. Tentatively, the project was called MB-V2M2 (Metrics Based Verification and Validation Maturity Model), emphasising that TMM was used as a starting point and that a metrics base should be provided. The first step towards the definition of a framework for MB-V2M2 was an extended inventory and examination of existing improvement models and an evaluation of their fit to the model objectives [4][5]. Among the models investigated
Thales (formerly Hollandse Signaal Apparaten), Lucent Technologies The Netherlands and Philips Electronics The Netherlands. 2 Improve Quality Services, Quality House The Netherlands. 3 Frits Philips Institute, University of Technology – Eindhoven The Netherlands.
were general software improvement models like CMM and its successor CMMI, SPICE, Bootstrap, and software-testing-specific models such as TMM, TPI, and TOM, including comparisons of models [9]. In addition, the literature was scanned for best V&V practices and V&V standards as a preparation for the later process area definitions [6]. The literature scan also included approaches for the development and application of metrics [7], as a preparation for the development of a metrics base for MB-V2M2. The model comparisons, as well as a real-life account of the application of TMM, justified the choice of TMM [8][9].
TMM (Testing Maturity Model), developed in 1996 at the Illinois Institute of Technology [1][2], reflects the evolutionary pattern of testing process maturity growth documented over the last several decades, as outlined in Gelperin and Hetzel [3]. A definite strength of TMM is that it is founded on forty years of industrial experience with software testing: it profits from many past struggles to find a sound software testing process. Another strong point of TMM is its design objective: to be a counterpart of the popular software process improvement model CMM. Software process improvement programs can use TMM to complement CMM, as CMM does not adequately address software test issues. On the other hand, it is also possible to improve the test process independently, though one should be aware that the maturity levels for testing and software development must remain close to each other.
Weaknesses of TMM that have been identified are: its rather poor description (compared to the SEI's extensive improvement model), the relative underrepresentation of goals and activities for people management and the test organization, and the lack of explicit attention to the test environment. An issue overlooked or underrepresented by virtually all models, including TMM, is that the improvement actions of the higher maturity levels cannot be performed independently from other organizational entities. To improve the software process from CMM level 4 onwards, alignment with, for example, the marketing and sales department and the manufacturing department is required. To improve the software testing process, the test organization has to be aligned with the development department, marketing and sales, etc. Processes get a wider and wider scope at higher maturity levels and consequently require tuning and alignment with other departments.
Based on an in-depth analysis of the objectives on the one hand and the strengths and weaknesses of the existing models on the other, a framework for a V&V improvement model has been developed.
3
Framework, Elements and Structure of the Model
MB-V2M2 was chosen to be a hybrid model: it shares the characteristics of both a staged and a continuous model. In a continuous model every process area is addressed at every maturity level. This implies that at every maturity level all process areas are simultaneously improved. This seems to be logical: all aspects of the V&V process smoothly grow in maturity at the same time. In contrast, a staged model focuses on different process areas per maturity level, although some process areas can be addressed at multiple maturity levels. The staged nature of MB-V2M2 is reflected in its maturity levels. Different process areas are addressed at these levels. The continuous nature of MB-V2M2 is represented by fundamental factors common to
and addressed at every maturity level and process area. Hereafter the basic elements of the model are described.
Maturity Levels. A maturity level is a well-defined plateau of process growth. Each maturity level stabilizes an important part of the organization's processes. Maturity levels consist of a predefined set of process areas. The maturity levels are measured by the achievement of the goals that apply to each predefined set of process areas. The model consists of five maturity levels, each level being the basis for ongoing process improvement: Initial, Repeatable, Defined, Managed and Aligned, and Optimizing.
Process Areas. A process area consists of a set of related practices in an area that, when performed collectively, satisfy a set of goals considered important for making significant improvement in that area.
Fundamental Factors. A fundamental factor is common to every maturity level and process area. As process improvement progresses, the maturity of these fundamental factors grows smoothly.
Fig. 1. Framework of the MB-V2M2. The figure shows the five maturity levels, their themes and their process areas: Initial (no process areas); Repeatable, Defect Detection (V&V Policy and Goals, V&V Project Planning, V&V Design Methodology, V&V Environment); Defined, Functional Requirements Verification & Validation (Organizational Embedding, Peer Reviews, Risk Based V&V Methodology, Life Cycle Embedding, V&V Tracking & Control); Managed & Aligned, Quality Measurement & Management (Organizational Alignment, Quantitative V&V Management, Quality Measurement); Optimizing, Quality Control (Defect Prevention, Quality Management, Process Change Management). The levels are surrounded by the fundamental factors People, Technology, Process and Organization.
Four fundamental factors are distinguished: People, Technology, Process, and Organization. People refers to all stakeholders directly or indirectly involved in or committed to the test process improvement. Technology refers to all techniques, methods and tools introduced or developed during the ongoing test process improvement. Process refers to all prescriptions, rules, guides and interactions that connect the people, the organization and the technology required to render the process improvements efficient and effective. Organization refers to the corporate environment in which the test process improvement takes place, and which has to cope with the growth in maturity of its people and of the technology and processes used. The framework of the model, with its maturity levels, process areas and fundamental factors, is graphically represented in Figure 1.
Maturity Level 1: Initial. At MB-V2M2 level 1 the main objective of V&V is to show that software products work. V&V is performed by local champions, in an ad hoc way, after coding is done and when time allows. V&V is a spiral of finding and correcting problems, with no separation between the two activities. There is a lack of resources, tools and properly trained staff. There are no process areas at this level.
Maturity Level 2: Repeatable. Via the introduction of basic V&V practices, a basic V&V process emerges at MB-V2M2 level 2. The objective of V&V is defect detection. V&V still consists (exclusively) of executing code, but is now performed in a systematic and managed way. The staff is trained in V&V design techniques and methods. V&V is planned, performed and documented, and is conducted in a controlled V&V environment.
Maturity Level 3: Defined. Further organization of V&V and its embedding into the development life cycle allow the process maturity to rise to MB-V2M2 level 3. V&V has become a real verification of requirements as laid down in a specification document, according to a defined and repeatable process documented in standards, procedures, tools and methods. V&V begins at the requirements phase and continues throughout the life cycle. A V&V organization is in place and V&V is recognised as a profession, including a career development plan and associated training program.
Maturity Level 4: Managed and Aligned. Measured and aligned V&V practices are introduced to reach MB-V2M2 level 4. V&V is now considered as quality measurement of software products, in all aspects. The conditions to operate at this level are created by aligning the way of working with other organizational entities. Quantitative measurements and statistical techniques and methods control the V&V process.
Maturity Level 5: Optimizing. At MB-V2M2 level 5, V&V has evolved into total software product quality control. V&V is a streamlined, defined, managed and repeatable process, whose costs, efficiency and effectiveness can be quantitatively measured. V&V is continuously improved and fine-tuned in all aspects. Defect collection and analysis are practised with mechanical precision to prevent defects from recurring.
The interrelations between the basic elements of the MB-V2M2 are shown in Figure 2. Subsequently, Figure 3 presents the structure of the model.
Fig. 2. Relations between the elements of the MB-V2M2: maturity levels contain process areas and are achieved by goals; process areas are organized by common features, which contain generic practices; generic practices implement activities and show the achievement of goals; fundamental factors mature by means of the generic practices; goals and practices are measured by metrics and by assessment.
Fig. 3. Structure of the MB-V2M2. A maturity level contains process areas; each process area has goals and metrics and is structured by the five common features (Commitment to Perform, Ability to Perform, Activities Performed, Directing Implementation, Verifying Implementation), each of which contains generic practices; the whole rests on the fundamental factors Organization, Processes, Technology and People.
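To make the relations in Figures 2 and 3 concrete, the sketch below captures the element structure as a small data model. It is an illustrative reading of the framework, not part of the MB-V2M2 definition: the goal and metric strings are invented examples, while the process area, common feature and generic practice names are taken from the model as described above.

from dataclasses import dataclass, field
from typing import List

@dataclass
class GenericPractice:
    name: str                 # e.g. "Establish an organizational policy"
    common_feature: str       # one of the five Common Features

@dataclass
class ProcessArea:
    name: str                 # e.g. "V&V Project Planning"
    goals: List[str]          # achievement of goals is what is assessed
    metrics: List[str]        # metrics used to measure goal achievement
    practices: List[GenericPractice] = field(default_factory=list)

@dataclass
class MaturityLevel:
    number: int               # 1 (Initial) .. 5 (Optimizing)
    name: str
    process_areas: List[ProcessArea] = field(default_factory=list)

# Illustrative fragment: one level-2 process area with an invented goal and metric
repeatable = MaturityLevel(2, "Repeatable", [
    ProcessArea(
        name="V&V Project Planning",
        goals=["V&V activities are planned, performed and documented"],
        metrics=["planned versus actual V&V effort"],
        practices=[GenericPractice("Establish an organizational policy",
                                   "Commitment to Perform")],
    ),
])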
To reach the goals in the various process areas, different processes have to be executed. To identify, order and cluster these processes, the concepts of common feature and generic practice have been elaborated.
Common Features and Generic Practices. Common Features are predefined attributes that group Generic Practices into categories. Common Features are model components that are not rated in any way, but are used only to structure the Generic Practices. Five Common Features are distinguished, structuring a total of eleven Generic Practices. These five Common Features are presented in Table 1 and described in detail hereafter.
Commitment to Perform
• Establish an Organizational Policy. The purpose of this generic practice is to define organizational expectations for the process and make these expectations visible to those in the organization that are affected.
Ability to Perform
• Provide Resources. The purpose of this generic practice is to ensure that the resources necessary to perform the process as defined by the plan are available when they are needed. Resources include adequate funding, appropriate physical facilities, skilled people and appropriate tools.
• Assign Responsibility. The purpose of this generic practice is to ensure that there is accountability throughout the life of the process for performing the process and achieving the specified results. The people assigned must have the appropriate authority to perform the assigned responsibilities.
• Train People. The purpose of this generic practice is to ensure that the people have the necessary skills and expertise to perform or support the process. Appropriate training is required for the people who will be performing the work.
• Establish a Defined Process. The purpose of this generic practice is to establish and maintain a description of the process that is tailored from the organization's set of standard processes to address the needs of a specific instantiation. With a defined process, variability in how the processes are performed across the organization is reduced, and process assets, data, and learning can be effectively shared.
Activities Performed
• Activities. The purpose of this generic practice is to describe the activities that must be performed to establish the process.
Directing Implementation
• Manage Configurations. The purpose of this generic practice is to establish and maintain the integrity of the designated work products of the process (or their descriptions) throughout their life cycle.
• Measure the Process. The purpose of this generic practice is to perform direct day-to-day monitoring and control of the process and to collect information derived from planning and performing the process. Appropriate visibility into the process is maintained so that corrective action can be taken when necessary.
• Identify and Involve Relevant Stakeholders. The purpose of this generic practice is to establish and maintain the necessary involvement of stakeholders throughout the execution of the process. Involvement assures that interactions required by the process are accomplished and prevents affected groups or individuals from obstructing process execution.
Table 1. Overview of Common Features and Generic Practices
Common Features | Generic Practices
Commitment to Perform | Establish an organizational policy
Ability to Perform | Provide resources; Assign responsibility; Train people; Establish a defined process
Activities Performed | Activities
Directing Implementation | Manage configurations; Measure the process; Identify and involve relevant stakeholders
Verifying Implementation | Objectively evaluate adherence; Review status with business management
Verifying Implementation
• Objectively Evaluate Adherence. The purpose of this generic practice is to provide credible assurance that the process is implemented as planned and satisfies the relevant policies, requirements, standards, and objectives. People not directly responsible for managing or performing the activities of the process typically evaluate adherence.
• Review Status with Business Management. The purpose of this generic practice is to provide business management with appropriate visibility into the process. Business management includes those levels of management in the organization above the immediate level of management responsible for the process.
4
Application of the MB-V2M2 in Practice
This section addresses some practical guidelines regarding the application of the model in the day-to-day practice of business systems.
Advancing through Maturity Levels. Organizations can achieve progressive improvements in their organizational maturity by first achieving stability at the project level and continuing to the most advanced level: organization-wide continuous process improvement, using both quantitative and qualitative data to make decisions. Since organizational maturity describes the range of expected results that can be achieved by an organization, it is one means of predicting the most likely outcomes from the next project the organization undertakes. For instance, at maturity level 2, the organization has been elevated from ad hoc to disciplined by establishing sound project management. As an organization achieves the generic and specific goals for the set of process areas in a maturity level, it is increasing its organizational maturity and reaping the benefits of process improvement.
Skipping Maturity Levels. The staged representation identifies the maturity levels through which an organization should evolve to establish a culture of excellence. Because each maturity level forms a necessary foundation on which to build the next level, trying to skip maturity levels is usually counterproductive. At the same time, it should be recognized that process improvement efforts should focus on the needs of the organization in the context of its business environment, and that process areas at higher maturity levels may address the current needs of an organization or project. For example, organizations seeking to move from maturity level 1 to maturity level 2 are frequently told to establish a process group, which is addressed by the Organizational Process Focus process area that resides at maturity level 3. While a process group is not a necessary characteristic of a maturity level 2 organization, it can be a useful part of the organization's approach to achieving maturity level 2. This situation is sometimes characterized as "establishing a maturity level 1 engineering process group to bootstrap the maturity level 1 organization to maturity level 2." Maturity level 1 process improvement activities may depend primarily on the insight and competence of the engineering process group staff until an infrastructure to support more disciplined and widespread improvement is in place.
Organizations can institute specific process improvements at any time they choose, even before they are prepared to advance to the maturity level at which the specific practice is recommended. However, organizations should understand that the stability of these improvements is at greater risk, since the foundation for their successful institutionalization has not been completed. Processes without the proper foundation may fail at the very point they are needed most: under stress. A defined process that is characteristic of a maturity level 3 organization can be placed at great risk if maturity level 2 management practices are deficient. For example, management may make a poorly planned schedule commitment or fail to control changes to baselined requirements. Similarly, many organizations collect the detailed data characteristic of maturity level 4, only to find the data uninterpretable because of inconsistencies in processes and measurement definitions.
5
Current Status and Future Outlook
The current status of the MB-V2M2 project is that the exploration phase has been finished and that the design phase is well under way. The latter is reflected in this paper by the descriptions of the MB-V2M2 framework, its elements and its structure. Process areas have already been defined up to level 4, and the key issue of the staged/continuous aspect has been elaborated in terms of the 'fundamental factors' that appear at each level. These fundamental factors are People, Organization, Technology and Process. By making clear at each level and in each process area in what way and to what extent these fundamental factors are addressed, the structure of the MB-V2M2 will be strengthened. Further, this will serve as an aid to check the completeness of the MB-V2M2, both at each level and for the model as a whole. The fundamental factors are also used as the basis for the development of the assessment approach, in the sense that V&V units or departments will be assessed by focussing on these factors in particular.
Current activities are the development of the metrics base and the assessment approach and procedures. Regarding the metrics base, it has been decided that each process area will contain Goal Question Metric [7], [11] procedures to develop a metrics program for the evaluation and determination of the effectiveness of V&V improvement activities. With respect to the assessment approach, a number of checklists have already been developed that will be validated and enriched in the experiments. Various experiments will take place this year, for example real-life experiments in which the usage of the process area descriptions will be validated in V&V projects at the partners' organizations. Further, prototype instruments such as the assessment checklists will be applied in the V&V departments or areas of the partners' organizations.
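To indicate how a Goal Question Metric procedure within a process area might look, a minimal, hypothetical sketch is given below; the goal, questions and metrics are invented examples and are not taken from the MB-V2M2 metrics base.

# Hypothetical GQM fragment for a single process area (illustrative only)
gqm = {
    "goal": "Evaluate how effectively peer reviews detect defects early",
    "questions": {
        "What fraction of defects is found before integration testing?": [
            "defects found in peer reviews / total defects found",
        ],
        "Is the review effort proportionate to the benefit obtained?": [
            "review effort (person-hours) per defect found",
            "estimated rework effort avoided per defect found early",
        ],
    },
}

def metrics_to_collect(tree):
    """Flatten the goal/question/metric tree into the list of metrics."""
    return [m for metrics in tree["questions"].values() for m in metrics]

print(metrics_to_collect(gqm))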
References
1. Burnstein, I., Suwanassart, T., Carlson, C.: Developing a Testing Maturity Model. Part 1. Crosstalk, Journal of Defense Software Engineering, Vol. 9 (1996) 21-24
2. Burnstein, I., Suwanassart, T., Carlson, C.: Developing a Testing Maturity Model. Part 2. Crosstalk, Journal of Defense Software Engineering, Vol. 9 (1996) 19-26
3. Gelperin, D., Hetzel, B.: The Growth of Software Testing. Communications of the ACM, Vol. 31, No. 6 (1988) 687-695
4. Ham, M.: MB-TMM. State of the Art Literature. Part 2. MB-TMM project report 12-7-1-FP-IM (2001)
5. Ham, M.: Testing in Software Process Improvement. MB-TMM project report 12-5-1-FP-IM (2001)
6. Ham, M., Van Veenendaal, E.: MB-TMM. State of the Art Literature. Part 1. MB-TMM project report 12-1-1-FP-IM (2001)
7. Ham, M., Swinkels, R., Trienekens, J.: Metrics and Measurement. MB-TMM project report 13-6-1-FP (2001)
8. Jacobs, J., van Moll, J., Stokes, T.: The Process of Test Process Improvement. XOOTIC Magazine, Vol. 8, No. 2 (2000)
9. Swinkels, R.: A Comparison of TMM and Other Test Process Improvement Models. MB-TMM project report 12-4-1-FP (2001)
10. Paulk, M., Curtis, B., Chrissis, M., Weber, C.: Capability Maturity Model for Software. Software Engineering Institute, Carnegie Mellon University (1993)
11. Van Solingen, R., Berghout, E.: The Goal/Question/Metric Method: A Practical Guide for Quality Improvement of Software Development. McGraw-Hill (1999)
Analysing the Safety of a Software Development Process
Stephen E. Paynter and Bob W. Born
MBDA UK Ltd., Filton, Bristol, UK
{stephen.paynter,bob.born}@mbda.co.uk
Abstract. The UK Defence Standard for developing safety-related software, [16], requires that a safety analysis be performed on the process used to develop safety-related software. This paper describes the experience of performing such a safety analysis, and reflects upon the lessons learnt. It discusses the issues involved in defining the process at the appropriate level of abstraction, and it evaluates the difficulties and benefits of performing Function Failure Analysis and Fault-Tree Analysis on a development process. It concludes that the benefits of performing safety-analysis of a software development process are limited, but if such an analysis must be performed, it is best done to develop a qualitative understanding of the ways the process may fail, rather than to develop a quantitative understanding of the likelihood of the process failing.
1
Introduction
The UK Defence Standard for developing safety-related software, [16], requires that safety analysis be performed on any process used to develop safety-related or safety-critical software. Safety analysis techniques are usually applied to products that may behave dangerously, or to development processes that may be dangerous to the people operating or living near them. The requirement of [16] to apply safety analysis to a software development process is unusual, and there is little in the literature that describes how such an analysis should be performed¹. It is the aim of this paper to describe an attempt to comply with [16] in performing a safety analysis of a software development process, and to draw some conclusions by reflecting on the experience.
The work was performed for a Naval missile project for which the most severe potential accident was catastrophic, involving an uncontrolled missile hitting the launch platform. Fortunately, a large mitigating factor, in the form of the physics of the situation, meant that few uncontrolled missile hazards would result in such an accident. The launch missile control software consequently had a Safety Integrity Level (SIL) of 3².
¹ It is noted that the appendices of [16] provide some guidance.
² Adopting a SIL assignment procedure based on risk and severity in accordance with [15], and not only severity, as assumed by [16].
Although performing a software development process safety analysis is not widely required by other standards, and not commonly performed by projects,
there are, however, good prima facie reasons for performing such an analysis. It promises to: ensure that the process is well defined; provide a degree of insight and understanding into which tasks could fail and the consequences of such failures; and form the basis for arguing that the process is adequate to produce software of a particular integrity, and that further work would not be cost-effective.
There is now a wide consensus that safety-related software needs to be supported by a software safety case, [1]. A software safety case for almost all systems will need to draw on indirect process evidence, and not only on direct testing (and analysis) evidence of the final software. This is primarily because of the difficulty of performing adequate tests to provide statistically significant evidence of sufficient reliability for high integrity software, [10] and [5]. The safety case will therefore need to be based on an argument that the software development process is, in some sense, adequate to produce software which meets the required safety targets. This, in turn, suggests that a safety analysis of the development process is required, so that it is known how the process could fail. The requirement in the UK that the risks of using any system are reduced as far as is reasonable seems almost to demand that some analysis be performed of the software development process to show what has actually been achieved during development, before a further analysis can be made demonstrating that the alternatives are not cost-effective in reducing the risk further.
A safety analysis of a development process differs little from a safety analysis of a product, and involves essentially the same tasks and the application of the same technologies and techniques. It therefore requires a precise statement of the artifact to be analysed (i.e. the process, in this case), a hazard identification phase to determine the hazardous ways in which the artifact may fail, and an analysis of the artifact to determine which aspects of it could be involved in each hazard. This often leads to derived safety requirements for the artifact.
The paper is organised as follows. Section 2 describes the task of process definition, and Section 3 describes the task of hazard identification for a software development process and discusses the experience of applying Function Failure Analysis (FFA) and Fault-Tree Analysis (FTA) to a software development process. Section 4 describes the derived safety requirements that arose from the safety analysis, and conclusions are drawn in Section 5.
2
Process Definition
MBDA UK has a mature and well-documented software development process, which has been supported by internal applied research and process improvement programmes over many years³. It therefore came as some surprise to discover that the existing descriptions of its software development process were not appropriate for the purposes of safety analysis. Modern process definitions are tailored to ease the task of demonstrating compliance with the process, while providing flexibility to allow individual projects the freedom to innovate and apply the design techniques and tools that their specific circumstances and design approach require. This was not felt to be at the right level of detail for a safety analysis. In particular, it was believed that a process definition worth analysing should at least provide enough detail to identify each tool used in the development of the software⁴. However, in contrast to this, other available process definitions that captured the iterative and technical nature of the development process – such as, for example, the steps involved in distributing a software design over a processor network – were considered to be too detailed. One intuitively knows that a process definition must reach a certain level of technical detail for it to be worth analysing. It is unclear to the authors, however, how this "intuitive knowledge" can be used to generate objective criteria by which a process definition can be judged.
³ A sample of the publications that have arisen from this work includes publications on its software development programmes [23] and on the software design methods that it uses: [21] and [22].
⁴ At least, any tool used in the development which either contributed to the final product or which produced information used to support the safety case. This allows tools which are only used to support the software engineer to be excluded. The decision to exclude such tools from the process definition, however, is an important step, as it precludes the analysis results of the tool from being used in the safety case to support the argument for the adequacy of the process or product.
It was therefore found to be necessary to create a bespoke process description, which identified the main tasks (and the tools used to aid those tasks), and the products used and produced by performing those tasks. This drew upon project-specific planning documents, such as the project computing plan, as well as the Company's engineering process. A drawback of performing a safety analysis of a bespoke process description is that it can be difficult to show that this process has been followed. The existing Quality Assurance system will be designed to demonstrate compliance with the standard process. This difficulty must be faced, and extra compliance evidence gathered.
It is common when defining software development processes to treat quality and management issues, for example configuration control, separately. This is because they interact with practically every step of the process, and, for example, integrating configuration control steps would greatly complicate the process description. The decision was taken to follow the same practice when defining the process to be analysed, in spite of tasks like quality management [8] and configuration control being critical in ensuring that the process has been applied to the right products, and that the software installed in the final product is the same software that has undergone the appropriate testing and review. Clearly, the quality control process should be evaluated separately for adequacy, allowing the development process to be analysed under the assumption that it will be followed and that the right products will be used at each stage.
The process was defined using a graphical "data-flow" language loosely based upon the MASCOT design notation [21]. MASCOT diagrams consist of concurrent processes that interact via inter-communication data areas (IDAs) on explicit paths. The processes therefore become tasks, and the IDAs, the products produced by those tasks. Such a model obviates the need to capture explicitly the iterative (spiral) nature of software development, as the tasks may be considered to "run" (i.e. be performed) whenever there is new data for them to process. The process analysed is given in Figure 1. Ovals represent the tasks (double ovals, tasks that are decomposed for further analysis); squares, the products; round-cornered squares, the inputs to the process; and arrows, the flow of information from product to task, or from task to product. Reviews introduced as a result of the safety analysis (see Section 4) are represented by filled diamonds on the relevant arrows. The graphical process definition was supplemented with textual documentation in the analysis report, including a textual description of each task and product, and an analysis of the assumptions concerning the properties of the inputs to the process.
It is argued below that, although not normally part of a process description, it is helpful to annotate each task with information about the competence of the software engineers expected to be performing it, and details of the technology. For example, it might be recorded that the "static analysis" task will be performed by experts⁵ using the SPARK Examiner, and will involve information flow analysis and the check for the absence of run-time errors, or that the "write software" task will be performed by practitioners using Spark-Ada [2], with abstract data types (ADTs) as the structuring principle. This information can be presented in tabular form, for example, Table 1.
Table 1. Process Task Information
Process Task | Expertise | Tools | Technology
1. Design Software (T3) | Practitioner | MADGE (in-house tool) | MASCOT
2. Write Software (T7) | Practitioner | Spark-Ada | ADTs
3. Static Analysis (T9) | Expert | Spark Examiner | Information Flow Analysis and Run-Time Error Checks
4. SHARD Analysis (T6) | etc.
⁵ In the UK the Institution of Electrical Engineers and the British Computer Society have worked with the Government's Health and Safety Executive to define a set of competencies for twelve different job functions in developing safety-related systems, and these competence definitions recognise three levels of expertise, called expert, practitioner, and supervised practitioner [12], [14].
Fig. 1. The Software Development Process (a data-flow diagram showing the development tasks, e.g. Develop V&V Plan, Design Software Architecture, SHARD Analysis, Write Software, Static Code Analysis, Compile Link & Load, Apply Inputs and Execute on Target/Host, Analyse Results & Coverage, Generate Compliance Matrix, Generate C of D and Build Missile, together with their input and output products and the review points introduced by the safety analysis)
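The data-flow style of process definition used above lends itself to a simple machine-readable form in which each task is annotated as in Table 1 and linked to the products it consumes and produces. The fragment below is a minimal sketch only: the task annotations are taken from Table 1, but the product names and the read/write connections are illustrative placeholders rather than the project's actual data flows.

# Minimal sketch of a MASCOT-style process definition: tasks read and write
# products (IDAs). Only three tasks are shown; connections are illustrative.
process_tasks = {
    "T3": {"name": "Design Software", "expertise": "Practitioner",
           "tool": "MADGE", "technology": "MASCOT",
           "reads": ["software_requirements"], "writes": ["software_design"]},
    "T7": {"name": "Write Software", "expertise": "Practitioner",
           "tool": "Spark-Ada", "technology": "ADTs",
           "reads": ["software_design"], "writes": ["application_source_code"]},
    "T9": {"name": "Static Analysis", "expertise": "Expert",
           "tool": "Spark Examiner", "technology": "Information flow analysis",
           "reads": ["application_source_code"], "writes": ["analysis_report"]},
}

def consumers_of(product):
    """Tasks that take the given product as an input."""
    return [tid for tid, t in process_tasks.items() if product in t["reads"]]

print(consumers_of("software_design"))   # ['T7']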
3
Safety Analysis
The first step in a safety analysis is to perform a preliminary hazard analysis to identify the hazards and determine the severity of the accidents to which they could lead. The most important and obvious hazard of a software development process is that faulty software is produced which can fail, causing the worst possible software-controlled hazard in the final product. A related hazard is that the process fails in such a way that no software is installed in the product. The product safety analysis should have identified the hazards of incorrect or missing software, determined the probability of their causing an accident, and the severity of the potential accident should it occur.
Software development is typically considered to be a safe occupation, and hence one might not expect a software development process to pose hazards to the software developers. This, however, is not an invariable rule: hazards could, for example, be identified in the test environments, depending upon the test facilities and the properties of the system under test. However, existing Company procedures should govern such cases, and it is unlikely that an analysis of a process description at the level of detail suggested above will be able to provide any insight into these issues.
Having defined the process, identified the hazards, and determined the probability of their causing an accident and the severity of the accident, it is necessary to attempt some analysis of the process to determine which steps, if they failed, would contribute to which hazards. It is beneficial to perform two forms of analysis: an inductive analysis, such as FFA, HAZOPS, or FMEA, and a deductive analysis, such as FTA [9]. In an inductive analysis, one reasons from the system (i.e. the process, in this case) to the hazards, and in a deductive analysis, one reasons from the hazards towards their causes. Ideally they would always produce the same results, but there is some evidence [3] that strenuous efforts must be made to ensure consistency, perhaps indicating that the different forms of analysis are complementary and lead the analyser to think of different issues.
3.1
Function Failure Analysis
Function Failure Analysis (FFA) was chosen to perform an inductive analysis of the software development process as it is a simple technique to apply, and it seemed to fit well with the data-flow style of process description adopted. Function failure analysis consists of defining a set of failure modes for the functions (or tasks in the process, in this case), and considering the consequences of each function failing in each of these ways. In each case, one determines what other tasks (the "co-effectors") must fail in order for a hazard to result, one records the conceivable hazards and the severity of the (potential) accidents, and one makes an assessment of the likelihood of the process failing in this way, along with a justification of the assessment. A presentation of a typical function failure analysis is given in tabular form in Table 2.
Table 2. Function Failure Analysis
Task | Failure Mode | Consequence | Co-Effectors | Hazard | Accident Severity | Probability | Justification
T1 | None | ... | T2 & T3 | H 1 | Catastrophic | 10^-2 | ...
T1 | Invalid | ... | T2 & T4 | H 2 | Critical | 10^-3 | ...
T1 | Incorrect | ... | T2 & T4 | H 2 | Critical | 10^-3 | ...
T2 | None | – | – | – | – | – | –
etc.
Three failure modes were considered for each of the tasks in the process defined in Figure 1. These were: none – the task is not performed; invalid – the task is performed in such a way that ill-formed products are produced from it; and incorrect – the task is performed so that well-formed but incorrect products are produced. Other failure modes which are sometimes considered, such as too soon, too late, on the wrong inputs and too many times, were not analysed, as it is argued that they are handled by the orthogonal argument on the adequacy of the configuration control process. One tedious implication of performing this kind of analysis is that one needs to consider (albeit briefly) tasks failing in ways that simply could not go unnoticed, such as, for example, the failure to write any software.
However, the problem with adopting FFA in exactly this form when applying it to a software development process is that most, if not all, process failures have the potential for introducing an arbitrary error into the software; and software tends to be of the form that an arbitrary error could have any effect. Hence the hazard and accident for most process errors was that the worst software-controlled hazard could occur. Performing the analysis in exactly this form therefore proves not to be very informative. An alternative approach was therefore adopted, where the effects are traced not to ultimate failures in the product, but to failures in the evidence in the software safety case. This provides a more nuanced understanding of the impact on the software. A typical fragment of a table for such an FFA is given in Table 3, where the numbers in the "impact" column are supposed to reference evidential items used to support a safety case (such as might be described using the goal structured notation). Two kinds of safety case evidence are differentiated: direct evidence, which is used to directly support a claim in the safety case; and indirect evidence, which is used to provide confidence in the quality or applicability of some direct evidence. It should be noticed that the "probability" column was replaced by the less precise "risk assessment" column. This is because, as always when dealing with software, numerical probabilities are controversial and difficult to determine.
Table 3. Modified Function Failure Analysis
Task | Failure Mode | Consequence | Co-Effectors | Kind of safety case evidence | Impact on s/w safety case | Risk Assessment | Justification
T1 | None | ... | T2 & T3 | Direct | Makes E001 inconsistent | Incredible | ...
T1 | Invalid | ... | T2 & T4 | Indirect | Makes E027 incomplete | Raised | ...
T1 | Incorrect | ... | T2 & T4 | Direct | Undermines E114 | Normal | ...
T2 | etc.
A probability figure would be determined by a number of factors, in particular the probability that:
– the task will fail in that way (per project delivery);
– the task failure will be such as will potentially result in a critical fault in the software;
– the co-effector process tasks will fail, allowing the critical fault to enter the software and to remain undetected and unremoved;
– the critical software fault will result in the system failure identified; and
– the hazard will result in the accident.
The probability that a particular task will fail, and that it will fail potentially producing a critical fault in the software, will depend upon the expertise of the engineers performing the task [6] and the technology being used. This is why the basic process definition needs to be supplemented with this information. Ideally, each company would be able to provide statistically significant data on its own task failure rates. However, very few (if any) will be able to do this. Most companies simply do not deliver enough software-intensive products, nor does the technology they use remain stable over many products (for example, CASE tools and compilers change). There is even a dearth of evidence from across the whole industry for the advance in dependability achieved by various qualitative process steps such as the use of formal methods, static analysis, or structural testing [11]. However, recording what technologies are being used in each task in the process is a prerequisite for being able to use this information as it becomes available. Another factor which may influence the results at least as much as the technology being used, if not more, is the complexity of the task to which the technology is being applied. An analysis of the process, however, does not seem to be the proper place to take this into account, it being quintessentially a product issue.
These imponderables mean that it is highly questionable that any quantitative figure for the likelihood of critical errors in the product developed from a safety analysis of a development process will be as believable as figures derived from more directly measured metrics, such as reliability growth modelling [10]. However, the law of diminishing returns means that it is not normally practical for direct measurements to provide statistically significant results for safety-related software. There have been, however, some attempts to combine direct measurements with qualitative evidence about the process, for example, Bayesian Belief Network approaches [17] and test regime assessment [5]. It remains an open question as to how these approaches would compare with figures arising from a safety analysis of the process, should a quantitative approach be pursued.
The solution here was to adopt a qualitative rather than quantitative assessment of the risk. In particular, risks were classified as "incredible", "raised", or "normal", where "incredible" meant it was difficult to imagine how the process could fail in this way and the failure remain undetected; "raised" meant that some novel, untried technology was being applied or the task was notoriously difficult to 'get right'; and "normal" meant that the task was a standard part of the software engineering process, and the engineers would be familiar with performing it.
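Read as a chain of conditional probabilities, the factor list earlier in this section combines multiplicatively. The following is an interpretive sketch of that combination in our own notation; as the surrounding discussion makes clear, the individual terms are rarely known with any confidence for a software process:

\[
P(\text{accident}) \approx P(\text{task fails}) \cdot P(\text{critical fault} \mid \text{task fails}) \cdot P(\text{fault escapes co-effectors} \mid \text{critical fault}) \cdot P(\text{system failure} \mid \text{fault present}) \cdot P(\text{accident} \mid \text{hazard})
\]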
3.2
Fault-Tree Analysis
FTA was chosen to perform a deductive analysis of the software development process as it is a simple and widely used technique for analysing the cause of hazardous events. It was found that, starting from the top fault of "having loaded software containing a potentially hazardous defect", a fault tree could be constructed from the process description given in Figure 1 in the standard way, without any adaptation.
The traditional lessons about constructing good fault-trees apply when they are applied to processes. In particular, one needs to be as specific as possible about the events, and at each stage of the decomposition one needs to trace each event to its immediate cause. It was also found that when introducing the checks on the faults using and-gates, these needed to be introduced as near the bottom of the tree as possible. The danger is that otherwise one has all the constructive steps of the process in one branch of the tree, and all the verification and validation steps in another branch, anded together at the top. Although it might be possible to construct a valid tree with this structure, it fails to give insight into the way the various verification and validation tasks are suitable for capturing failures in different parts of the development process.
Although one of the reasons for performing FTA was to provide a check on the FFA, in the event the FTA did not lead to much re-working of the FFA. It did, however, help in providing a structure to the justifications, and it provided a simple cross-check that all the relevant co-effectors had been identified for each function failure mode.
Using the FTA to produce a numerical value of the probability of a critical error being introduced into the software suffers from all the problems discussed above. Furthermore, fault-trees are highly susceptible to producing erroneous quantitative results when there are common mode effects between the different root causes, [9]. However, the root faults of a fault-tree for a process will suffer from many common mode effects, not least because individual engineers are likely to be involved in more than one task. In retrospect, it is probably fair to conclude that the results achieved from performing FTA did not justify the effort involved.
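To illustrate why common mode effects undermine quantitative fault-tree results for a process, the toy calculation below evaluates a two-branch AND gate whose basic events share a common cause (for example, the same engineer performing both the constructive step and its check). The events, probabilities and the simple common-cause adjustment are invented for illustration and do not come from the project's fault tree.

# Toy AND gate: top event = (construction step fails) AND (review misses it).
p_construct_fails = 0.05
p_review_misses   = 0.05
p_common_cause    = 0.02   # shared factor defeating both branches at once

# Naive evaluation assuming the two basic events are independent
p_top_independent = p_construct_fails * p_review_misses

# Crude common-cause adjustment: the shared cause alone triggers the top event
p_top_common_mode = p_common_cause + (1 - p_common_cause) * p_top_independent

print(round(p_top_independent, 4), round(p_top_common_mode, 3))   # 0.0025 vs 0.022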
4
Derived Safety Requirements
The above analysis did not lead to any revolutionary changes to the proposed process. Indeed, it is difficult to see how this kind of process analysis could ever lead to the identification of new kinds of process steps or technologies. The professional software engineer should already be aware of the technologies and best practices that are available, and should have taken a considered view as to which are appropriate to develop software of the desired integrity. Similarly, it is difficult to see how this kind of analysis could provide much direct guidance on what should be done at each step: for example, whether the Ada or C programming languages should be used, or the kinds of testing that should be performed, [7], [4].
The analysis did, however, lead to some derived safety requirements. These were mostly of the form of introducing specific safety reviews into the process to check that a certain product (such as a specification or test report) had some desirable property (such as adequate treatment of safety requirements). The need for these extra reviews became clear when the FFA identified a potential failure mode that could not be discounted as incredible, and which did not have a large number of co-effectors guarding against the process failure resulting in a fault in the software (or a flaw in the safety case). It is noted that extra reviews do not automatically improve a process. An important step in making reviews worthwhile is ensuring that the right people attend (and the wrong people do not), and that the aim of the review is clear, focused, and known to those attending.
One interesting derived safety requirement, however, applied to the product rather than the process. This was that the product should have an interlock that prevented missile operation (firing) when there was no software operating⁶.
5
Conclusions
This paper has recognised that, apart from the need to comply with a safety standard such as [16], there are strong prima facie arguments for performing a safety analysis of a software development process which is used to produce safety-related software. However, there is little literature which describes how this should be done, or which evaluates the worth of performing such an analysis. An attempt to apply hazard analysis in the form of FFA and FTA to a software development process has been described. Issues concerning the definition of the process have been discussed.
⁶ Although this requirement had already been identified by product safety analysis.
It has been argued that safety analysis techniques should not be used quantitatively on software processes, because the probabilistic risks of the tasks in a software process failing are simply not known. It has also been argued that the focus of techniques such as FFA should be redirected towards the impact process failures will have on the safety case, as this gives a more nuanced understanding of the implications of a failure than considering the impact of the failure on the software. The reason is that almost all process failures could give rise to any kind of software fault being introduced (or left undetected), and the nature of software is that an arbitrary fault must be considered to lead to the worst possible software-controlled hazard.
The safety analysis has been shown to be instrumental in suggesting extra reviews that could be inserted into the process to check that particular products have certain desirable properties, or lack certain undesirable ones.
It is impossible from an experience report such as this to pronounce upon the absolute worth of performing software development process safety analysis. One suspects that the benefits depend significantly upon the quality and maturity of the existing development process. The results here were not felt to be greatly beneficial, but neither were the results of performing process safety analysis pernicious. Most companies' software development processes only change slowly over time, so large parts of a process safety analysis should be able to be re-used from project to project. Therefore, such analysis is probably worth doing for the extra insights that it may give. It seems clear to the authors, however, that the majority of the safety analysis effort should remain targeted on the product.
Acknowledgements
MBDA UK Ltd. funded this research. Our ideas have benefited from conversations with Dr. J. M. Armstrong and Messrs. E. F. McMahon and C. P. Richens.
References
[1] Mike Ainsworth, Katherine Eastaughffe, and Alan Simpson. Safety-Cases for Software Intensive Systems. In Redmill and Anderson [20], pages 1–12.
[2] J. Barnes. High Integrity Ada: The SPARK Approach. Addison-Wesley, 1997.
[3] Jack Crawford. Some Ways of Improving Our Methods of Qualitative Analysis and Why We Need Them. In Redmill and Anderson [20], pages 89–99.
[4] W. J. Cullyer and N. Storey. Tools and Techniques for the Testing of Safety-Critical Software. IEE Computing and Control Engineering Journal, 5(5):239–244, October 1994.
[5] Stewart Gardiner, editor. Testing Safety-Related Software: A Practical Handbook. Springer, 1999.
[6] J. Griffyth. Human Issues in the Software Development Process - Modelling their Influence on Productivity and Integrity. In Redmill and Anderson [19], pages 105–123.
[7] D. C. Ince. Software Testing, chapter 19. In McDermid [13], 1991.
[8] Denis Jackson. New Developments in Quality Management as a Pre-requisite to Safety. In Redmill and Anderson [18], pages 257–269.
[9] Nancy G. Leveson. Safeware: System Safety and Computers. Addison Wesley, 1995.
[10] Bev Littlewood. Software Reliability Modelling, chapter 31. In McDermid [13], 1991.
[11] Bev Littlewood. The Need for Evidence from Disparate Sources to Evaluate Software Safety. In Redmill and Anderson [18], pages 217–231.
[12] R. Malcolm, S. Clarke, S. Hatton, and R. May. Who can you Trust?: Assessing Professional Competence. In Felix Redmill and Tom Anderson, editors, Towards System Safety - Proceedings of the 7th Safety-Critical Systems Symposium, pages 239–255. Springer, 1999.
[13] J. A. McDermid, editor. Software Engineering Reference Book. Butterworth Heinemann, 1991.
[14] Andrew McGettrick and Ray Ward. Towards Meaningful Uptake of Competency Descriptors. In Redmill and Anderson [20], pages 197–205.
[15] Ministry of Defence. Safety Management Requirements for Defence Systems, December 1996. Defence Standard 00-56.
[16] Ministry of Defence. The Procurement of Safety-Critical Software in Defence Equipment, August 1997. Defence Standard 00-55 - Issue 2.
[17] Martin Neil, Bev Littlewood, and Norman Fenton. Applying Bayesian Belief Networks to System Dependability Assessment. In Redmill and Anderson [19], pages 71–94.
[18] Felix Redmill and Tom Anderson, editors. Directions in Safety-Critical Systems - Proceedings of the 1st Safety-Critical Systems Symposium. Springer-Verlag, 1993.
[19] Felix Redmill and Tom Anderson, editors. Safety-Critical Systems: The Convergence of High-Tech and Human Factors - Proceedings of the 4th Safety-Critical Systems Symposium. Springer, 1996.
[20] Felix Redmill and Tom Anderson, editors. Aspects of Safety Management - Proceedings of the 9th Safety-Critical Systems Symposium. Springer, 2001.
[21] H. R. Simpson. The MASCOT Method. Software Engineering Journal, 1(3):103–120, 1986.
[22] H. R. Simpson. Layered Architecture(s): Principles and Practice in Concurrent and Distributed Systems. In Proceedings of the 8th IEEE Symposium on Parallel and Distributed Processing, 1996.
[23] G. Woodward. Rapier 2000 Software Development Programme. Software Engineering Journal, 11(2):82–87, 1996.
Software Criticality Analysis of COTS/SOUP
Peter Bishop¹,², Robin Bloomfield¹,², Tim Clement¹, and Sofia Guerra¹
¹ Adelard  ² CSR, City University
Drysdale Building, Northampton Square, London EC1V 0HB, UK
{pgb,reb,tpc,aslg}@adelard.co.uk
Abstract. This paper describes the Software Criticality Analysis (SCA) approach that was developed to support the justification of commercial off-the-shelf software (COTS) used in a safety-related system. The primary objective of SCA is to assess the importance to safety of the software components within the COTS and to show there is segregation between software components with different safety importance. The approach taken was a combination of Hazops based on design documents and on a detailed analysis of the actual code (100kloc). Considerable effort was spent on validation and ensuring the conservative nature of the results. The results from reverse engineering from the code showed that results based only on architecture and design documents would have been misleading.
1
Introduction
This paper describes the Software Criticality Analysis (SCA) approach that was developed to support the justification of commercial off-the-shelf software (COTS) that is used in a safety-related C&I system. The primary objective of the criticality analysis is to assess the importance to safety of the software components within the COTS and to show that there is segregation between software components with different safety importance.
There are both economic and safety assurance motivations for using SCA on pre-developed software products or COTS in safety-related applications. The software products are typically not implemented to comply with the relevant industry safety standards, and may therefore require additional documentation, testing, field experience studies and static analysis of the code to justify their use in a safety context. There can therefore be considerable benefits in identifying or limiting the safety criticality of software components, as this can reduce the costs of justifying the overall system.
This work can be seen as an extension of the IEC 61508 concept of a SIL, which applies to safety functions, to provide a safety indication of software component criticality. It is also an extension of the various work reported on software failure modes and effects analysis and Hazops. The distinctive features of our analysis include the application of such techniques to COTS and the validation of the assumptions underlying the safety analysis. This validation had a large impact on the
results. The overall approach is very similar to the one advocated in a study we undertook for the UK HSE [2] for justifying “software of uncertain pedigree” (SOUP) and provides more detail to support this overall COTS/SOUP framework.
2
Identifying the Software Concerned
A pre-requisite for the software criticality analysis (SCA) is the identification of the main software components and their interconnections. A number of different representations could be used for this purpose: design documents and expert opinion or source code analysis. In this particular project, both methods of identification were used, as described below. 2.1
Design Documents and Expert Opinion
In the first stage of subsystem identification, the design documentation was analysed to establish the major software components such as: • • • • •
operating system kernel device drivers support services (communications, remote program loading) application level scheduler for running control applications interpreter and “function blocks” for executing control application function
The overall software architecture was re-engineered with the aid of experts used to maintaining and developing the system and from an examination of the system documentation and code. This described the major software partitions and provided a mapping from the software files to the architecture component. (Note the structure of the source files did not readily map into the architecture partitions). At a more detailed level, the design experts provided names of the top-level software components within the system and also identified which elements were to be used in this particular application. The unused software would not represent a threat provided it was really not used or activated. 2.2
Analysis of the Software Structure
There is a risk that design and maintenance documents and expert opinion may not represent the actual software implementation. To provide a rigorous traceable link between the criticality analysis and the software, the structure was reverse engineered from the code. This established the actual architecture and identified the “top” procedures in the software call trees and all the software that is influenced by these as the basis for analysis. There were difficulties in relying on the maintenance documentation or the design documentation that was not written from an assurance viewpoint. As the project continued and the code-based analysis was shown to be feasible, greater reliance was placed on using the code analysis to provide the structure of the software, with the experts and the function call names providing the semantics of the code.
The 100k lines of C code were analysed to identify the top-level call-trees and the supporting subroutines. This analysis was implemented using the CodeSurfer [5] tool, which parses the C source code and generates a description of all the dependencies between the C elements (data and subroutines). The system was then represented as a collection of top-level functions, covering the entry points into the system and the interfaces to major sub-components of the system. These were identified by a combination of analysis of the call graph extracted from the C components of the code using CodeSurfer [5], and domain knowledge. Some of the top-level procedures could be immediately assigned the highest criticality by inspection, as they were essential to the functioning of the entire system. These would typically include low level interrupt service routines that are triggered by hardware events and key scheduling functions. The remaining top-level functions were candidates for a Hazops to determine their criticality. The analysis showed that the conceptually independent components at the design level could actually share code (such as utility functions). In such cases, the shared code must have the same classification as the most critical call tree it supports. The main limitation of such a control flow analysis is that it assumes that there are no “covert channels” (e.g. via data shared between trees) that can affect an apparently independent software call tree. A later SCA validation (described in Section 5) addressed this aspect and this resulted in some changes to the safety classification of the software components. The approach used to assess and classify the safety impact of the identified software components is described in the next section.
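To make this step concrete, the following sketch illustrates how top-level procedures and shared code can be identified once a call graph is available. It is an illustration only: the call graph and procedure names below are invented, and the snippet does not reproduce the CodeSurfer-based tooling actually used in the project.

# Hypothetical call graph: procedure -> procedures it calls directly.
CALLS = {
    "isr_timer": ["schedule", "log_event"],
    "net_receive": ["parse_frame", "log_event"],
    "schedule": ["dispatch_task"],
    "dispatch_task": [],
    "parse_frame": [],
    "log_event": [],
}

def reachable(proc, calls, seen=None):
    """All procedures reachable from proc through the call graph."""
    if seen is None:
        seen = set()
    for callee in calls.get(proc, []):
        if callee not in seen:
            seen.add(callee)
            reachable(callee, calls, seen)
    return seen

# Top-level procedures are those never called by another procedure.
called = {c for callees in CALLS.values() for c in callees}
top_level = [p for p in CALLS if p not in called]

# Shared code: procedures that appear in more than one top-level call tree.
trees = {t: reachable(t, CALLS) for t in top_level}
shared = {p for p in CALLS if sum(p in tree for tree in trees.values()) > 1}

print(top_level)  # ['isr_timer', 'net_receive']
print(shared)     # {'log_event'}: must take the classification of its most critical caller tree

In the project itself the equivalent information was taken from the dependency description generated by CodeSurfer rather than from a hand-built structure such as this.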
3
Assessing the Safety Impact of Failure - HAZOPS
One of the key techniques for assessing the impact of failure is the software equivalent of a hazard and operability study (Hazops) [4]. The Hazops was carried out broadly in accordance with the guidance in Interim Defence Standard 00-58 [3]. The analysis was initially directed towards functions that we expected to have a criticality of less than 4 (the assumed criticality of the system as a whole) in order to derive and justify that lower criticality. The Hazops process also developed recommendations such as: • • •
The need for further study of system behaviour: for example the results were not readily apparent to the Hazops team. The need for further analysis to underwrite the assumptions in the Hazops. Often these concerned the incredibility of certain failure modes. The need to consider additions to design.
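Purely for illustration, a single worksheet entry in the spirit of a Def Stan 00-58 guide-word study might be recorded as follows; every field and value below is invented and is not taken from the project Hazops.

# Invented example of one Hazops worksheet entry (guide-word style as in Def Stan 00-58).
hazop_entry = {
    "item": "application scheduler tick",
    "guide_word": "LATE",
    "deviation": "control application invoked later than its design period",
    "consequence": "delayed protection action on a demand",
    "mitigation": "independent watchdog trip on a missed deadline",
    "recommendation": "confirm watchdog coverage by confirmatory testing",
}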
After the assessment of failure impact the software components were sentenced as discussed in the following section.
4
Ranking Software Components
The basis of the classification scheme is that software and hardware in the system have different types of requirements. Broadly speaking these arise from the need to show correctness for the safety system functions, the correctness of the internal fault handling features and lack of impact when the fault handling features fail. These requirements will depend on the frequency of the challenge, which in turn depends on the overall system architecture and design. In general, because of the redundant architecture and the periodic manual testing, the fault handling features will have lower integrity requirements. The most important events are likely to be: • •
Failure of common protection functions (where correct operation is important). Interference failures of supporting functions (like fault handling and system maintenance) that affect the integrity of the protection functions.
In this section we describe the process of ranking the components previously identified (Section 3) according to risk assessed on likely failure rate and consequence. This ranking builds on the Hazop analysis described above. The results of the criticality analysis are summarised in a Safety Criticality Index (SCI) that defines the importance to safety. Each failure was attributed a value based on its (system-level) consequence, the estimated frequency of demand, and the degree of mitigation of the consequence. The SCI was then derived from this and a sentence encapsulates the results of this assessment. It should be noted that the same software item might be in several classes at once and in general the highest classification would take precedence, which is a conservative assumption. 4.1
Defining the SCI
The SCI is an indication of the importance to safety of a software element. It does not indicate the importance of that software to the qualification process. It may be that very strong and easily defended arguments exist to show that a piece of software meets very stringent requirements and other, more difficult, arguments are needed for lower criticalities. It is useful to distinguish different types of risk from the software: 1. 2. 3.
it defeats a safety function it spuriously initiates a safety function it defeats the handling of a safety function after its initiation
The SCI calculation evaluated these separately and for each type of risk considered: • • •
the worst case consequence on the computer-based safety function the frequency with respect to demands placed on the safety function the mitigation
The basis for the SCI calculation is a semi-quantitative calculation of the risk from the software. These figures are expressed relative to each of the types of risk of the safety functions considered and deal in approximate orders of magnitude. The
Software Criticality Index is defined as the logarithm of the risk. The details are shown in Appendix A. In order to systematically sentence the SCI of the components, rules are needed to relate the consequence of the failure mode to the loss of the system functions (as SCI is relative to this), and to the length of time exposure of the system to the risk. For example, for a process protection application, rules are needed for the severity of: • •
Loss of a controlled process shutdown function. The consequences of spuriously initiating a shutdown. There are a number of other factors that might change the consequence:
•
Failure only affects a single channel of a redundant system (e.g. faulty diagnostic software). • A delayed shutdown rather than a total defeat of the shutdown function. There are also mitigations that would need to be taken into account. For example: • The identification of precise failure modes that defeat a shutdown means that there are often no specific mitigations. However, the failure mode that defeats the shutdown is often one of many. The mitigation is a result of taking into account the fact that another failure may occur first that does not defeat the shutdown. This could be verified by examining the impact of actual historic errors. The frequency component is also influenced by the number of demands on the component. The rules needed in sentencing are for: • • •
Moving from a continuous operation, or that of many demands, to figures for failures per demand. When a second software failure is needed to expose the fault we need a rule for the reduction in frequency index for defeating the safety function. The impact of the hardware failure rate that can lead to a demand on the fault detection and management software. These failures often only lead to loss of a single component in a redundant architecture and the SCI is normally dominated by the consequence of other common mode failures.
4.2
Assigning a Criticality Index (SCI)
The results of the Hazops were reviewed and consolidated and then the definition of SCI was applied to each of the functions considered. For each top-level function, we considered features that impact the consequence, frequency or mitigation. Further analyses on the overall results were performed in order to: • • •
Derive the worst case criticality index in each risk category (e.g. spurious execution of safety function, safety function defeated) for each procedure in the Hazops, by taking the maximum over the subfunctions considered. Derive the worst case criticality index over all categories for each procedure in the Hazops. Map each procedure to the criticality associated with its top-level caller. Where there were many callers, the maximum value was taken.
•
Summarise the criticality assignment to procedures by showing the number of procedures of each criticality.
This analysis required only the correct assignment of SCIs to the top-level procedures. Propagation down the tree is a mechanical matter of taking for each procedure the maximum criticality of all the top-level procedures from which it can be reached, and so to which it can contribute. The coverage obtained in this way can be checked automatically and showed that the analysis had covered all the procedures in the build of the software analysed. These rules for propagation of the criticality index down the call tree are conservative and another pass of the SCI assignments was later undertaken to remove some of the conservatism during the SCA validation.
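A minimal sketch of this mechanical propagation step is given below. The call graph, procedure names and SCI values are invented for illustration; they are not the project data.

# Invented SCIs for the top-level procedures, as they would come out of the Hazops.
TOP_SCI = {"isr_timer": 4, "net_receive": 2}

# Hypothetical call graph: procedure -> direct callees.
CALLS = {
    "isr_timer": ["schedule", "log_event"],
    "net_receive": ["parse_frame", "log_event"],
    "schedule": [],
    "parse_frame": [],
    "log_event": [],
}

def propagate_sci(top_sci, calls):
    """Give every procedure the maximum SCI of the top-level procedures
    from which it can be reached (the conservative rule described above)."""
    sci = dict(top_sci)
    for top, level in top_sci.items():
        stack = list(calls.get(top, []))
        while stack:
            proc = stack.pop()
            if sci.get(proc, -1) < level:
                sci[proc] = level
                stack.extend(calls.get(proc, []))
    return sci

print(propagate_sci(TOP_SCI, CALLS))
# every procedure ends up at the highest SCI of any top-level routine that reaches it

Because only the maximum is propagated, the assignment is conservative, which is the source of the "floating up" effect discussed in Section 6.6.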
5
Validation of the SCA The criticality analysis required validation that:
1. The analysis was applied to the correct object, i.e.: • •
It applied to the actual code being used in the system, i.e. the documents used all related to the same versions as the product. This was addressed by the reengineering of the program structure from the actual code. The Hazops were adequately undertaken (with respect to procedure, people and documentation). This was addressed by the review of the results and by the check on consistency of the SCI assignment as described in Section 5.1.
2. The assumptions of behaviour of software and surrounding system were correct, i.e. that: • The judgements used about the code partitioning were correct. Verification of this involved covert channel analysis (see Section 5.2) and was addressed by subsequent static analysis of pointers [1] and by the trust placed in the reverse engineering and expert judgement. • The mitigations assumed (e.g. benefits from redundant channels, time skew) were correct. The behaviour of the software was as assumed by the experts. This could be checked by confirmatory tests or analysis. The process of assigning software criticality indices to procedures depends on analysis by domain experts and on a detailed understanding of the function of the code and the system level consequences of failure. When dealing with a large body of complex code, it is possible that this analysis will miss some aspect of the function that should affect the criticality assigned. One way to verify the self-consistency of an assignment is to examine the interconnections between procedures that should place a constraint on their relative criticalities. This can be done mechanically and helps to build confidence in the result (although clearly it is possible for an assignment to be internally consistent but not meet the external criticality requirements of the application).
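The sketch below illustrates, on invented data, the general form of such a mechanical check; the specific downwards-call, static-data and upwards-call conditions are defined in Section 5.1 below. The procedure names, variables and criticalities are hypothetical and are not drawn from the analysed code.

# Invented criticality assignment and static-variable accesses.
CRIT = {"read_sensor": 4, "filter_value": 4, "log_event": 2}
WRITES = {"read_sensor": {"raw_value"}, "log_event": {"raw_value"}}
READS = {"filter_value": {"raw_value"}}

def static_data_inconsistencies(crit, writes, reads):
    """Report procedures that modify a variable read by a more critical procedure."""
    # A variable inherits the maximum criticality of any procedure that uses its value.
    var_crit = {}
    for proc, variables in reads.items():
        for var in variables:
            var_crit[var] = max(var_crit.get(var, 0), crit[proc])
    # Any procedure changing that variable should be at least as critical.
    issues = []
    for proc, variables in writes.items():
        for var in variables:
            if crit[proc] < var_crit.get(var, 0):
                issues.append((proc, var, crit[proc], var_crit[var]))
    return issues

print(static_data_inconsistencies(CRIT, WRITES, READS))
# [('log_event', 'raw_value', 2, 4)]: an apparent inconsistency to be investigated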
There are three aspects to such verification. The first is to decide what consistency conditions should be satisfied by an assignment of criticalities. These will be defined in Section 5.1. The second is to determine how these conditions can be evaluated mechanically. Finally, where this information cannot be extracted completely and exactly, we need to decide whether we can make a useful approximation, and whether that approximation is conservative. A restricted analysis limits the amount of consistency checking that can be done, while a conservative analysis may result in apparent inconsistencies that further investigation shows not to be problems with the assignment. Section 6 summarises the results from an application of these methods and describes how the apparent inconsistencies found were resolved. 5.1
Consistency of SCI assignments
The consistency checks we made were based on the data flows between procedures. We assigned to each variable the maximum criticality of any procedure that used its value, and checked that any procedure that affected its value had a criticality at least as high. (This makes the conservative assumption that all variables contribute to the critical function of the procedure). In these consistency checks we considered three ways in which one procedure can affect a value in another: Downwards call consistency. This condition addresses the constraint imposed by the communication of results from a called procedure to a calling one through its results or through pointers to local variables in its parameter list. We made the conservative assumption that all called procedure results can affect any calling procedure result, and hence a called procedure should be no less critical than its caller. Static data consistency. This condition addresses the constraints placed on the criticality of a procedure by the static variables it modifies. The criticality of a procedure that changes a static variable must be at least as high as the criticality of the variable. Upwards call consistency. One procedure may call another that changes static variables, either directly or through a chain of intermediate procedures. Under the conservative assumption that all variables affect the value, the criticality of every caller should thus be at least as high as the criticality of the callee. Checking the consistency conditions above required information about the code, for example the call graph and the subset of procedures returning results directly to their callers, either by an explicit return parameter or via pointers to local variables. This information was extracted mechanically using CodeSurfer. However, the code was too large to analyse in one piece, so it was divided into smaller components (projects). A reasonable amount of effort was needed to synthesise a view of the whole project in a conservative way. 5.2
Covert Channels
In classical safety analysis the idea of a segregation domain is used to define a region within which a failure can propagate. This is done so that interaction of, say, hardware
components of a physical plant are not overlooked when a functional or abstract view of the system is considered. A segregation domain defines an area where common cause failures are likely. In assessing safety-critical software a great deal of effort is usually placed on trying to show segregation or non-interference of software components. This often involves sophisticated and extensive analysis and a design for assurance approach that builds in such segregation. Indeed the level of granularity that we take for a software component is strongly influenced by our ability to demonstrate a segregation domain: there may be no point in producing a very refined analysis within a segregation domain. When we are dealing with software, the representation used defines certain types of interaction (e.g. dataflow) as intended by the designer. It is these representations that are used as a basis for criticality analysis, but the possibility remains that there are unintended interactions or interactions not captured by the notation (e.g. dynamic pointer allocation). Furthermore, some SOUP might not even have this documentation, so that we might be relying on expert judgement to assess the possible flows. Therefore the possibility remains of what, by analogy with the security area, we term covert channels or flows. Covert channel analysis should be addressed by the use of the guide words in the Hazops analysis and by being cautious about what is rejected as incredible. Other techniques that can possibly be used for covert channel analysis include: • • • • •
manual review tool-supported review (e.g. program slicing using a tool like CodeSurfer [5]) symbolic execution (e.g. pointer analysis using Polyspace [6]) static flow analysis formal verification
In the application discussed here this was addressed by program analysis and review.
6
Discussion
This section shows the results of the more detailed analysis and it summarises the issues raised in the development and application of the SCA approach. Some of these issues are common to the subsequent work on static analysis (see [1]). 6.1
Impact of More Detailed Analysis
The more detailed final analysis (involving the re-engineered call trees and the conservative, mechanical propagation of SCIs from the top-level functions downwards) identified many more unused procedures (about 30%). This resulted in a corresponding reduction in the qualification work required. The number of SCI 4 procedures was also reduced, but the detailed assessment has, in many cases, found at least one route by which the top-level procedures can become SCI 3, that was not visible in the initial analysis, so procedures tended to move from SCI 2 to SCI 3.
The impact of the more detailed analysis on the proportion of procedures in each SCI category is shown below. The initial SCI was based on the software architecture partitions.
SCI | Initial SCI (% of procedures) | Detailed SCI (% of procedures)
0 | 10 | 40
1 | 20 | 5
2 | 40 | 0
3 | 0 | 35
4 | 30 | 20
As one might expect, partitions at level 4 are not homogeneously critical and contain a range of functions. It also shows that although the initial broad categorisations are indicative, it would not be sufficient to assure to the average of the partition: the detailed assignment is important. 6.2
Tools
There were problems of scale – CodeSurfer was unable to analyse the program in one go. We needed to split the software into separate projects then devise means for checking for interconnections between projects (via the database of information extracted from all the projects). Clearly tools to cope with more complex systems are desirable and this may mean that tools may need significant redesign, for example, to incrementally build a database of results. While our application had some particular pointer complexities it would seem that about 100kloc is the limit for this tool at present. Other program slicers [7] claim to scale but these comparisons are not properly benchmarked. 6.3
Levels of Detail and Diminishing Returns
There are clearly diminishing returns in the level of detail in the SCA. More detail might limit the amount of software to be analysed, but takes more time to justify and demonstrate segregation and there is a trade-off against having more time for the actual analysis. Progressive application may be the best way of proceeding and this is what we recommend in the following table taken from [2].
SCA | Based on | Remarks
Initial SCA | Expert judgement; top level architecture | If it exists, architecture may not reflect functionality so need to go a level lower.
Design SCA | High level description; software design descriptions | If it exists, description may not capture behaviour sufficiently so need to appeal to expert judgement and code review.
Code SCA | Source code/assembler | May be too detailed to abstract behaviour. Need tools to extract control, data and information flow.
The optimum level of SCA will depend on the costs associated with the subsequent assurance activities. Assurance could make use of testing, static analysis, inspection and evaluation of field experience. Where these costs increase significantly with SCI, expending more SCA effort to limit downstream assurance costs can be justified. On the other hand, where the downstream cost is low, or relatively insensitive to SCI, the optimum choice may be to assure all software to the maximum SCI. There are also pragmatic issues that influence the level of detail needed in an SCA. If the SCA is being used to decide the amount of testing required, all that might be required is an estimate for the highest SCI in a testable collection of software components (some part of the pruned call tree). 6.4
Hazops
A range of expertise is needed for the safety analysis: expertise in the software, the computer equipment and application needed to assess failure consequences. The volume of work can be significant if there are many top-level functions to assess. This has an impact on resources, of course, and also on the credibility of the results as large Hazops can become difficult to sustain when there are only a few experts available. The representation used for the Hazops is a key issue. The differences between the actual code and the block diagrams and the prevalence of unused code (not just from configuration of the product but also left in from development and testing) meant that validation of the SCI required detailed analysis at the code level. There are also some issues with the Hazops process as to how much can be done by individuals and then reviewed (a desktop approach) and how much needs to be done by groups. Apart from formal outputs, the Hazops led to increased understanding of the system and it would be useful to somehow recognise this. Hazops requires knowledge of the consequence of failures. There is a need to validate some of these judgements about the behaviour of complex systems: failure impact may be amenable to some automated support (e.g. see for example [8]). 6.5
SCI Scheme
There is an issue of how to have a consistent index when some of the application is demand driven and some continuous – this is also an issue with the use of IEC61508 and SILs generally. In the safety application, there was a need for continuous control of the process during the shutdown that can take many hours. This meant that fail-stop was not an option for that part of the equipment and this can complicate the SCI and safety arguments. It meant that the use of arguments that faults would be detected is not sufficient. The SCI scheme is necessary for COTS/SOUP but the definitions and approach could be put on a more rigorous basis. The link to standards could also be elaborated and the need for pre-defined categories related to SILs discussed (see for example the proposals in [9]).
6.6
SCA Validation
There was a problem of conservatism causing everything to float up a criticality level (as is often the case with multilevel security). We therefore need conservative but usable propagation rules. The detailed SCA reduced the amount of level 4 software at the expense of needing to show certain verifications. The pragmatic approach to reducing the propagation was to verify some properties so that, given this verification, the SCI assignments were correct. 6.7
Overall Assurance Activities
The criticality analysis provides a focus for the overall assurance activities and the safety case design. In practice, there is the issue of how many criticality indexes levels are really needed. More importantly we lack an empirical evidence (rather than expert judgement, compliance with standards) of what techniques and what rigour is needed for the different criticality indexes and the different types of software. This is an unresolved problem in all current standards.
7
Conclusions
The overall conclusions of this work are that: 1.
2.
3. 4.
5.
6.
Software criticality analysis enabled the safety assurance activities to be focused on specific software components. Not only does this bring safety benefits but also reduced the effort required for a whole range of subsequent validation activities – software reviews, testing, field experience studies as well as static analysis. A useful by-product of SCA is the identification of the software and its structure – this provides an independent assessment of the system design documents and where these are of doubtful provenance provides some re-engineering of them. The Hazops method can be adapted to assess the safety impact of software component failures. The safety criticality index (SCI) used to classify the criticality of software components was useful, but in practice the number of levels requiring different treatment in subsequent qualification is fairly small. So there is scope for simplification and closer coupling between the definition of the SCI and its subsequent use. The level of detail considered in the SCA should be a trade-off between the effort required to validate software at a given criticality and the effort required to justify the partitioning of components into different criticality levels. A more detailed SCA might reduce the amount of software classified at the highest criticality, but the effort required for classification might be more than the validation effort saved. The optimum level depends on the effort required in the subsequent validation activities. There is a need for improved tool support for analysing code structure and for assessing the impact of failures. This is especially true if the approach here is to be increased in scale to even larger systems. The use of different programming
languages or the adherence to safe subsets and coding guidelines would increase the scale of what the tools can cope with. However, many systems of interest will be written in C, developed over many years by different groups of people.
7.
In general, some expertise in the application, design and code is needed to make the SCA tractable. However, in the cases when expertise is not available, this may be compensated by further testing, static analysis or code review. Further conclusions on the generality of the approach will result from our current work on applying the SCA approach to a different class of SOUP/COTS.
Acknowledgements This work was partly funded under the HSE Generic Nuclear Safety Research Programme under contract 40051634 and is published with the permission of the Industry Management Committee (IMC). This work was assisted by the EPSRC Dependability Interdisciplinary Research Collaboration (DIRC), grant GR/N13999. The views expressed in this report are those of the authors and do not necessarily represent the views of the members of the IMC or the Health and Safety Commission/Executive. The IMC does not accept liability for any damage or loss incurred as a result of the information contained in this paper.
References [1] PG Bishop, RE Bloomfield, Tim Clement, Sofia Guerra and Claire Jones. Static Analysis of COTS Used in Safety Application. Adelard document D198/4308/2, 2001. [2] PG Bishop, RE Bloomfield and PKD Froome. Justifying the use of software of uncertain pedigree (SOUP) in safety-related applications. Report No: CRR336 HSE Books 2001 ISBN 0 7176 2010 7, http://www.hse.gov.uk/research/crr_pdf/2001/crr01336.pdf. [3] Interim Defence Standard 00-58, Hazop studies on Systems Containing Programmable Electronics. Part 1: Requirements. Part 2: General Application Guidance. Issue 2 MoD 2000. [4] D J Burns, R M Pitblado, A Modified Hazop Methodology for Safety Critical System Assessment, in Directions in Safety-critical Systems, Felix Redmill and Tom Anderson (eds), Springer Verlag, 1993. [5] CodeSurfer user guide and technical reference. Version 1.0, Grammatech, 1999. [6] PolySpace Technologies, http://www.polyspace.com. [7] F. Tip, “A Survey of Program Slicing Techniques”, Journal of Programming Languages, Vol.3, No.3, pp.121-189, September, 1995. http://citeseer.nj.nec.com/tip95survey.html. [8] T Cichocki and J Gorski, Formal support for fault modelling and analysis, in U Voges (ed): SAFECOMP 2001, LNCS 2187, pp 202-211, Springer-Verlag, 2001. [9] Rainer Faller, Project Experience with IEC 61508 and Its Consequences, in U Voges (ed): SAFECOMP 2001, LNCS 2187, pp 212-226, Springer-Verlag, 2001.
Appendix A: Treating the SCI Probabilistically
There is a probabilistic basis to the SCI assignment formula given in the main body of the report. Let us define:
riskP - the annual risk of demand failures
conP - the consequences of the protection system failing on demand
pfdP - the probability of failure of the protection system per demand
demand_rate - trip demands per year
For each software component i, let us define:
risk_i - the risk from component i failing on demand
con_i - the consequence of component i failing on demand
pf_sw_i - the probability of failure on demand given erroneous behaviour of software component i
psw_err_i - the probability of erroneous behaviour in component i given a software demand
nsw_dem_i - the number of times software component i is activated on a demand
For a given safety function, the risk is:
riskP = conP ⋅ pfdP ⋅ demand_rate
For a software component of the system:
risk_i = con_i ⋅ pf_sw_i ⋅ psw_err_i ⋅ nsw_dem_i ⋅ demand_rate
We can express this as a relative risk:
risk_ratio_i = con_i ⋅ pf_sw_i ⋅ psw_err_i ⋅ nsw_dem_i / (conP ⋅ pfdP)
We are interested in the impact of the erroneous behaviour of software component i. If we set psw_err_i = 1, or differentiate with respect to psw_err_i, we obtain:
criticality_i = (con_i / conP) ⋅ pf_sw_i ⋅ nsw_dem_i / pfdP
Taking logs to get an index of importance and defining:
c_i = log(con_i / conP)
m_i = log(1/pf_sw_i) (as pf_sw_i ≤ 1, m_i ≥ 0)
f_i = log(nsw_dem_i)
sci_i = log(criticality_i)
SIL = log(1/pfdP) (relates to the lower bound on the pfd target)
we simply get:
sci_i = SIL + c_i + f_i − m_i
In practice the sentencing scheme described in the main paper uses a consequence index that was a combination of SIL + c_i. In the equation above c_i is a negative index, as con_i may be less severe than the conP consequence that may arise in the worst case (e.g. a delayed trip rather than no trip). Hence the overall consequence is related to the SIL (0..4) of the safety function, but modified by consideration of the expected result of the component failure. If the software component can contribute to different safety function failures, then a similar calculation is required for each one. The maximum criticality of the software component can then be determined.
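As a purely illustrative reading of this formula (the values below are invented and do not correspond to any component analysed in this work): for a safety function with SIL = 3, a component whose worst-case consequence is one order of magnitude less severe than the system-level consequence (c_i = −1), which is activated about ten times per demand (f_i = 1), and for which there is only a one-in-ten chance that the erroneous behaviour actually defeats the function on a given demand (m_i = 1), the index is sci_i = 3 − 1 + 1 − 1 = 2.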
Methods of Increasing Modelling Power for Safety Analysis, Applied to a Turbine Digital Control System Andrea Bobbio1, Ester Ciancamerla2, Giuliana Franceschinis1, Rossano Gaeta3 Michele Minichino2, and Luigi Portinale1 1
DISTA, Università del Piemonte Orientale, 15100 Alessandria, Italy 2 ENEA CR Casaccia, 00060 Roma, Italy 3 Dipartimento di Informatica, Università di Torino, 10150 Torino, Italy
Abstract. The paper describes a probabilistic approach, based on methods of increasing modelling power and different analytical tractability, to analyse the safety of a turbine digital control system. First, a Fault-Tree (FT) has been built to model the system, assuming independent failures and binary states of its components. To include multi-states and sequentially dependent failures of the system components and to perform diagnoses, the FT has been converted into a Bayesian Net. Moreover, to accommodate repair activity, the FT has been converted into a Stochastic Petri Net. Due to the very large state space of the resulting model, a coloured Petri Net model has been built to alleviate the state explosion problem. Safety measures have been computed, referring to the emergent standard IEC 61508. The applicability, the limits and the main selection criteria of the investigated methods are provided.
1
Introduction
The paper describes a probabilistic approach to analyse the safety of a turbine digital control system. The system belongs to the co-generative plant ICARO, in operation at ENEA, CR Casaccia. The plant is composed of two sections: a gas turbine section, for producing electrical power, and the heat exchange section, for extracting heat from the turbine exhaust gases. The gas turbine digital control system performs both control and protection functions of the gas turbine section [1]. A failure or a deterioration in the gas turbine digital control system could result in a reduction of plant efficiency, in a reduction of plant availability and, moreover, in a reduction of plant safety through damage to the engine, which is safety critical because of its high capital cost. The use of a digital system, while it increases benefits on one side, on the other side increases risks, due to the vulnerability of such systems to random failures and design errors. For digital systems, the demand for safety is more and more urgent even in conventional application domains, like the ICARO co-generative plant, as proven by the increasing demand for conformity to the IEC 61508 standard [2]. The IEC 61508 standard does
not address any specific sector. A very important concept in IEC 61508 is that of Safety Integrity Level (SIL). The determination of the appropriate SIL is based on the concept of risk and IEC61508 provides a number of different methods, quantitative and qualitative, for determining it. The safety analysis of the gas turbine digital control system has been performed starting from a Fault-Tree model [3], on which independent failures and binary states of system components have been assumed. To include multi states and sequentially dependent failures of some components of the system and to perform diagnoses, FT has been converted into a Bayesian Net [4]. Then, to accommodate repair activities, FT has been converted into a Stochastic Petri Net [6]. The Petri Net model resulted unmanageable, due to a very large space of states. To manage such a complexity, coloured Petri Net, a higher level class of Petri Nets have been adopted. Safety measures have been computed referring to the SILs of IEC 61508. The paper is organised in the following sections. We start in section 2 with the concept of SIL, according to IEC 61508 standard. Section 3 deals with the description of the case study. Sections 4, 5 and 6 provide the models of the case study by using the different methods and the performed measures. In section 7 there are the conclusions.
2
IEC 61508 Standard
Process industry requires that well defined safety requirements must be achieved, as hazards may be present in process installations. IEC 61508 is based on a principle referred to with the name As Low As Reasonably Practicable (ALARP). ALARP defines the tolerable risk as that risk where additional spending on risk reduction would be in disproportion to the actually obtainable reduction of risk. The strategy proposed by IEC 61508 takes into account both random as systematic errors, and gives emphasis not only to technical requirements, but also to the management of the safety activities for the whole safety lifecycle. IEC 61508 has introduced the concept of Safety Integrity Level (SIL), attempting to homogenise the concept of safety requirements for the Safety Instrumented Systems. According to IEC 61508 the SIL is defined as “discrete level (one out of a possible four for specifying the safety integrity requirements of the safety functions to be allocated to the E/E/PE safety-related systems, where safety integrity level 4 has the highest level of safety integrity and safety integrity level 1 has the lowest”. Table 1. Safety Integrity Levels: Target Failure Measures
Safety Integrity Level | Low demand mode of operation (probability of failure to perform its design function on demand) | Continuous/high demand mode of operation (probability of a dangerous failure per hour)
4 | >=10^-5 to <10^-4 | >=10^-9 to <10^-8
3 | >=10^-4 to <10^-3 | >=10^-8 to <10^-7
2 | >=10^-3 to <10^-2 | >=10^-7 to <10^-6
1 | >=10^-2 to <10^-1 | >=10^-6 to <10^-5
The target dependability measures for the 4 SILs are specified in Table 1, for systems with low demand mode of operation and with continuous (or high demand) mode of operation. The determination of the appropriate SIL for a safety-related system is a difficult task, and is largely related to the experience and judgement of the team doing the job. IEC 61508 offers suitable criteria and guidelines for assigning the appropriate SIL as a function of the level of fault-tolerance and of the coverage of the diagnostic.
3
Gas Turbine Digital Control System
Gas turbine digital control system performs both control functions and protection functions [1] of the gas turbine section of the cogenerative plant, ICARO, in operation at ENEA CR Casaccia.
Fig. 1. High level architecture of the gas turbine control system
Control functions address the normal run operation and all plant sequencing needed from starting to stopping operations. At any time a shutdown request will cause the control system to enter in its emergency shutdown state and carry out the shutdown actions which include the de-energisation of related relays. Protection functions consist in providing the engine protection by independent overtemperature and overspeed shutdowns. The high level system architecture is shown in figure 1. A Main Controller implements the control functions processing its own signals and sharing 2 thermocouples and 1 speed probe transducer signals with a Backup Unit which implements protection functions. "Watchdog" relays are associated to each hardware circuit board for shutdown requests. The main controller and the Back up unit have separate processors and independent power supplies (operating from the same supply inlet). Figure 2 details the hardware structure of the Main Controller and the Backup Unit into elementary components which failure rates have been assumed constant (table 2).
Di - Digital input; Ai - Analog input; Cpu - 32-bit microprocessor; Mem - Memory; I/O - I/O bus;
Do - Digital output; Ao - Analog output; Wd - Watchdog relay;
PS - Power Supply inlet; SM/SB - Supply circuit of Main Controller / Backup Unit;
Ro - Relay output.
Fig. 2. Main Controller and Backup Unit hardware structure
Table 2. Component /Failure Rate (f/h)
Component | Failure rate (f/h)
I/OM, I/OB | λIO = 2.0 10^-9
Th1, Th2 | λTh = 2.0 10^-9
Speed | λSp = 2.0 10^-9
Mem | λM = 5.0 10^-8
DoM | λDo = 2.5 10^-7
AoM | λAo = 2.5 10^-7
RoB | λRo = 2.5 10^-7
DiM | λDi = 3.0 10^-7
AiM, AiB | λAi = 3.0 10^-7
PS | λPS = 3.0 10^-7
SM | λSM = 3.0 10^-7
SB | λSB = 3.0 10^-7
CpuM, CpuB | λCpu = 5.0 10^-7
WdM, WdB | λWd = 2.5 10^-7
4
The Fault Tree Model
At the starting level of the safety analysis we adopt a FT model1 (figure 3).
Fig. 3. Fault Tree model for safety critical failures
The Fault Tree analysis [3] is based on the following simplifying assumptions: components (and the system) have binary behaviour (up or down) and failure events are statistically independent. Qualitative and quantitative analyses of the FT have been carried out. Qualitative analysis has been aimed at enucleating the most critical failure paths: Minimal Cut Sets (mcs). Quantitative analysis has been aimed at evaluating measures useful to characterise safety. The analysis has found 43 mcs. The most critical mcs sorted by order are shown in table 3. 1
PS and Trans Sig are repeated events, joining different FT subtree.
The following measures2 have been performed:
- Unreliability versus time (Table 4);
- Safe Mission Time (SMT), computed as the time interval in which the system unreliability is strictly lower than a pre-assigned threshold. Fixing a limit for the unreliability of U = 1.0 × 10^-3, the Safe Mission Time is SMT = 280,000 h;
- Mean Time To Failure (which we consider a less significant measure than the SMT). The Mean Time To Failure for the Top Event is MTTF = 6.02 × 10^6 h;
- Most critical failure paths (mcs);
- SIL evaluation limited to the Table 1 requirements of the IEC 61508 standard.
The most critical mcs, sorted by order, are shown in Table 3. Unreliability versus time and failure frequency have been computed (Table 4) for SIL evaluation according to IEC 61508. Comparing the failure frequency of dangerous failures in the table (third column) with the SIL target failure measures (Table 1), SIL 3 is obtained up to 500,000 h.
Table 3. Most critical mcs
No. | Minimal Cut Set
1 | PS, WdB, WdM
2 | CpuB, CpuM, WdB, WdM
3 | Speed, WdB, WdM
4 | AiB, CpuM, WdB, WdM
5 | DiM, CpuB, WdB, WdM
6 | SupplM, CpuB, WdB, WdM
7 | CpuM, SupplB, WdB, WdM
8 | AiM, CpuB, WdB, WdM
Table 4. Unreliability versus time and failure frequency
Time t (h) | TE Unreliability | Failure frequency
50,000 | 3.2365 10^-6 | 6.473 10^-11
100,000 | 3.1398 10^-5 | 3.139 10^-10
150,000 | 1.2032 10^-4 | 8.021 10^-10
200,000 | 3.1088 10^-4 | 1.544 10^-9
250,000 | 6.4391 10^-4 | 2.575 10^-9
300,000 | 1.1570 10^-3 | 3.856 10^-9
350,000 | 1.8826 10^-3 | 5.378 10^-9
400,000 | 2.8473 10^-3 | 7.117 10^-9
450,000 | 4.0717 10^-3 | 9.048 10^-9
500,000 | 5.5703 10^-3 | 1.114 10^-8
2
by using SHARPE, ASTRA and ITEM software tools.
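As a rough illustration of how the figures in Tables 2 to 4 hang together, the sketch below applies the standard rare-event cut-set approximation by hand. It is not the exact computation performed by the tools listed above, it uses only three of the 43 cut sets, and its result will therefore not match Table 4 exactly.

from math import exp

# Failure rates (f/h) from Table 2 for the components involved.
lam = {"PS": 3.0e-7, "WdB": 2.5e-7, "WdM": 2.5e-7,
       "CpuB": 5.0e-7, "CpuM": 5.0e-7, "Speed": 2.0e-9}

# Three of the most critical minimal cut sets of Table 3 (the analysis found 43 in total).
mcs = [("PS", "WdB", "WdM"),
       ("CpuB", "CpuM", "WdB", "WdM"),
       ("Speed", "WdB", "WdM")]

def unreliability(component, t):
    """Unreliability of a non-repaired component with constant failure rate at time t."""
    return 1.0 - exp(-lam[component] * t)

def top_event_unreliability(t):
    """Rare-event approximation: sum over cut sets of the product of component unreliabilities."""
    total = 0.0
    for cut in mcs:
        product = 1.0
        for component in cut:
            product *= unreliability(component, t)
        total += product
    return total

print(top_event_unreliability(500000))
# about 2.6e-3 from these three cut sets alone; including all 43 cut sets approaches the Table 4 figure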
5
The Bayesian Network Model
According to the translation algorithm presented in [4], the Bayesian Network derived from the FT of figure 3 is reported in Figure 4. Gray ovals represent root nodes (corresponding to the basic events in the FT), while white ovals represent non-root nodes. Every node in the BN is a binary node, since the variable associated to it is a binary variable. The binary values of the variables associated to the nodes represents the presence of a failure condition (true value) or an operational condition (false value). The only probabilistic nodes of the BN are the root nodes (gray nodes). All the other nodes in the BN (white ovals) are deterministic nodes whose Conditional Probability Tables contains only 0 or 1 and are determined by the type of the gate in the FT they refer to ( AND and OR gates) [4].
Fig. 4. The Bayesian Net model
Each root node must be assigned a prior probability value, coincident with the corresponding probability of the leaf node in the FT. Since the information about the failure probability of the system components is in the form of a constant failure rate (Table 2), the probability of the true value is obtained by computing the probability of a generic component C (with failure rate λC) at a specific mission time t as Pr(C = true) = 1 − e^(−λC t).
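A small sketch of this prior computation, using three of the failure rates from Table 2, is given below; the snippet is illustrative only, and the published figures were obtained with dedicated BN tools.

from math import exp

# Failure rates (f/h) for three components, taken from Table 2.
failure_rate = {"CpuM": 5.0e-7, "WdM": 2.5e-7, "Th1": 2.0e-9}

def prior_fault_probability(lam, t_hours):
    """Pr(C = true) = 1 - exp(-lambda * t) for a constant failure rate."""
    return 1.0 - exp(-lam * t_hours)

for name, lam in failure_rate.items():
    for t in (1e5, 5e5):
        print(name, int(t), round(prior_fault_probability(lam, t), 6))
# e.g. CpuM at 500,000 h -> about 0.221; Th1 at 500,000 h -> about 0.001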
Given the prior failure probabilities of system components (i.e. basic events in the FT) computed at different mission times (from t = 1 * 10 5 h to t = 5 * 10 5 h), we can evaluate the unreliability of the TE by computing the probability of node TE in the BN of Figure 5 given a null evidence (prior probability computation). 5.1
Posterior Analysis
The novelty and the strength of the BN approach, in dependability analysis, consists in the possibility of computing posterior probabilities (i.e. diagnoses), in order to analyse the criticality of any combination of nodes including primary events, with respect to partial or total system failure3. To this end, a probabilistic computation has to be carried out, by considering the occurrence of the TE as the evidence provided to the BN. There are two main probabilistic computations that can be performed: 1. the posterior probability of each single component, carrying out a belief updating propagation [8]; 2. the joint posterior probability over the set of components, carrying out a belief revision looking for the most probable configurations of the root variables [8]. Table 5 reports the posteriors of each single component computed4 at time t = 5 *10 5 h. Table 5. Posterior Probabilities for single components
Component | Posterior
WdM | 1
WdB | 1
CpuB | 0.37063624
PS | 0.34525986
CpuM | 0.30848555
AiB | 0.2333944
SB | 0.2333944
RoB | 0.19688544
AiM | 0.19425736
SM | 0.19425736
DiM | 0.19425736
AoM | 0.16387042
DoM | 0.16387042
Mem | 0.03443292
Speed | 0.00247744
I/OB | 0.00167474
I/OM | 0.00139391
Th1 | 0.00100097
Th2 | 0.00100097
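To make the notion of a posterior (diagnostic) probability concrete, the toy example below applies exact enumeration to a miniature two-gate model, TE = (A AND B) OR C, with invented priors; it is not the system BN of Figure 4.

from itertools import product

# Invented prior fault probabilities for three basic events.
prior = {"A": 0.1, "B": 0.2, "C": 0.01}

def top_event(a, b, c):
    """Deterministic gates of the toy model: TE = (A AND B) OR C."""
    return (a and b) or c

p_te = 0.0          # P(TE = true)
p_a_and_te = 0.0    # P(A = faulty and TE = true)
for a, b, c in product([True, False], repeat=3):
    p = 1.0
    for name, value in zip("ABC", (a, b, c)):
        p *= prior[name] if value else 1.0 - prior[name]
    if top_event(a, b, c):
        p_te += p
        if a:
            p_a_and_te += p

print(p_a_and_te / p_te)  # about 0.70, against a prior of only 0.1 for A

Even in this toy case the posterior of A is far larger than its prior, which is exactly the kind of criticality information the posterior analysis provides.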
The posterior probability (criticality) of a component is a more significant measure with respect to its (prior) failure probability. Indeed the order in which components appear are different in the prior and in the posterior computations. We can notice that the two watchdogs WdM and WdB have a criticality 1, since their failures are necessary in order to have a system failure (as it could have been easily deduced from 3
4
Differently to the well known Fussel-Vesely importance measure, which indicates “the sum of unreliabilities of cut sets containing the failure event respect to the sum of all cut sets unreliability” (ITEM software), the posterior analysis indicates the probability of occurrence of any single event, typically a primary event, (or any combination of them) conditioned to the occurrence of any other event (i.e. TE). by using HUGIN and Microsoft Corp.Belief Network tools
the structure of the FT as well). Moreover, the probability of a CPU failure in case of TE occurrence is about 30% for the CpuM of the Main Controller and about 37% for the CpuB of the Backup Unit. Notice that these posterior values are different, even if the failure rate of both CPUs is the same, because of the different role they play in the overall system dependability. The second kind of analysis is much more sophisticated and approaches the criticality problem over a set of components. However, it is worth noting that, differently from mcs computation, all the components (i.e. basic events) are considered in a given configuration, by providing a more precise information. In this case, the posterior joint probability of all the components, given the fact that the TE has occurred, is computed. 5.2
Multi-state Nodes and Sequentially Dependent Failures
In the present section, we discuss the use of BN which enlightens two peculiar features, not considered in FT, namely: the possibility of modeling non-binary events (like events whose behavior is more carefully considered by multi-state variables), and the inclusion of localized dependencies (where the state of a root component influences the state of other root components). A more realistic case for the power supply PS is to find it in three different conditions (states): working, degraded and failed. When PS is in state degraded it induces an anomalous behavior also in the supply equipment (SM) of the Main Controller (MC) and (SB) the Back-Up Unit (BU). The BN, that models the described situation, is reported in figure 5, where just the relevant part of the BN of Figure 4 is reconsidered. The PS node has three states denoted by W for working, deg for degraded and F for failed. The prior probabilities of the PS node in the three different states are reported in the figure. The arcs connecting node PS with both nodes SM and SB indicate a possible influence of the parent node PS on the children nodes SM and SB. This influence is quantified in the CPT’s reported in figure 5, where it is shown that a degradation in PS induces a failure in SM and SB with probability 0.9.
Fig. 5. Portion of the BN showing the influence of a PS degradation
The degradation of the power supply PS does not have a direct effect on the system dependability, but its effect originates a negative influence on the degradation of the other components of the system.
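As a sketch of how such a localized dependency is encoded, the fragment below writes the conditional probability table P(SM | PS) as a simple mapping. Only the 0.9 entry is quoted in the text above; the remaining numbers are placeholders, since the actual values appear only in Figure 5.

# Conditional probability table P(SM | PS) for the three PS states W, deg and F.
# Only the 0.9 entry is quoted in the text; the other numbers are placeholders.
cpt_sm_given_ps = {
    "W":   {"failed": 0.05, "working": 0.95},  # placeholder values
    "deg": {"failed": 0.90, "working": 0.10},  # degradation of PS induces an SM failure
    "F":   {"failed": 1.00, "working": 0.00},  # placeholder values
}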
6
Stochastic Petri Net Models
FT and Bayesian networks are intrinsically acyclic graph structures and hence cannot be invoked to model systems in which actions are possible, in consequence of a failure, to restore the system to a previous condition. To accommodate repair activities, the FT of figure 3 has been converted into a Stochastic Petri Net Model, by a conversion algorithm [7]. Some interesting points have emerged. The FT has 19 basic components and this is a relatively small number for a FT, whose quantitative analysis is based on combinatorial techniques. Instead, for a state space based analysis technique, like Petri Net, the complexity of the problem changes drastically. The state space of a system with 19 basic components consists of 2 19 states. Moreover, the translation rules to convert a FT into a PN introduce several immediate transitions that produces several vanishing states in the reachability set. Hence, the number of states to be generated (tangible plus vanishing) from the generated PN is larger than the value of 2 19 tangible states. The lesson to be learned is that, even a ”small” FT may produce an unmanageable PN. In order to produce useful results with a PN model, some simplification must be adopted. A basic idea stems from the observation that often, due to symmetries or redundancies in the system to be modeled, the model may contain several similar components. To make the model more compact, the similar components may be folded and parameterized, so that only one representative is explicitly included in the model, while, at the same time, the identity of each replica is maintained through a parameter value. To do that a special class of the Stochastic Petri nets, called Stochastic Well-formed Net (SWN) has been used. The SWN formalism was introduced [6] with the aim of alleviating the state space explosion problem that often undermines state space based models. Figure 6 shows the SWN model. The model has been obtained by observing that several basic components have the same failure rates and may be grouped into classes. The symmetry property that is exploited in this case, is that each class contains object with the same failure rate. To make the figure more clear, we have not represented the inhibitor arcs from the TE to all the transitions. The function of these inhibitor arcs can be appreciated only at the analysis level, since they allow to avoid the generation of non relevant absorbing states. The SWN model allows to add new modelling issues that could not have been included in the FT and in the Bayesian models, from others repair activities that that enforce a cyclic behaviour in the system. 6.1
Global Repair
It has been assumed that the system is repaired whenever it fails and that the repair restores the system to its initial marking with all components up and the repair rate is constant (”As good as new” repair policy). A global repair activity has been included in the model. In figure 6, the timed transition, Glob R, models the stochastic duration of the repair activity.
Fig. 6. The SWN model with global repair
The timed transition Glob R has the TE as the only input place. Moreover, a number of additional immediate transitions of higher priority must be inserted in a way that, when the SWN model reaches a state where place TE becomes marked, all the timed transitions in the SWN model are disabled except transition Glob R. Table 6. Steady state unavailability and availability versus repair rate
Repair rate | Unavailability | Availability
0.001 | 0.000166638894 | 0.999833361106
0.01 | 0.000016666389 | 0.999983333611
0.1 | 0.000001666664 | 0.999998333336
1 | 0.000000166667 | 0.999999833333
When transition Glob R fires then the high priority immediate transitions restore the SWN in its initial state thus restarting the model. We have run the SWN model with global repair and with different values of the repair rate µ. The obtained steady state availability is reported in Table 6 as a function of the repair rate.
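As a rough cross-check of these figures, the sketch below uses a simple two-state Markov approximation with a single equivalent system failure rate taken from the fault tree MTTF of Section 4; it is an approximation for orientation only, not the SWN computation itself.

# Equivalent system failure rate taken as 1/MTTF from the fault tree analysis of Section 4.
lam_sys = 1.0 / 6.02e6   # roughly 1.66e-7 per hour

for mu in (0.001, 0.01, 0.1, 1.0):
    unavailability = lam_sys / (lam_sys + mu)
    print(mu, unavailability)
# close to Table 6, e.g. about 1.66e-4 for a repair rate of 0.001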
7
Conclusions
The present paper has investigated classic and advanced methodologies for modelling and quantitatively evaluating safety measures of a gas turbine digital control system, with reference to the IEC 61508 standard. First, a Fault Tree model has been built, adopting simplified basic assumptions. Then the FT has been translated into a BN, which is still an acyclic model but possesses more modelling and analysis power. Furthermore, the FT has been converted into a Stochastic Petri Net, a state space based model. That exponentially increased the complexity of the model (the state explosion problem). To partially alleviate this problem, coloured Petri Nets have been adopted, and system availability has been computed with the coloured Petri Net model. The applicability, the limits and the main selection criteria of the investigated methodologies have been provided.
References
1. A. Bobbio, S. Bologna, E. Ciancamerla, P. Incalcaterra, C. Kropp, M. Minichino, E. Tronci - Advanced techniques for safety analysis applied to the gas turbine control system of ICARO co-generative plant - X Convegno Tecnologie e Sistemi Energetici Complessi "Sergio Stecco", Genova, Italy, June 21-22, 2001
2. IEC 61508: Functional safety of electrical/electronic/programmable electronic safety-related systems
3. R. A. Sahner, K. S. Trivedi, A. Puliafito - Performance and reliability analysis of computer systems - Kluwer, 1998
4. A. Bobbio, L. Portinale, M. Minichino, E. Ciancamerla - Improving the Analysis of Dependable Systems by Mapping Fault Trees into Bayesian Networks - Reliability Engineering and System Safety Journal, vol. 71, N. 3, March 2001, pages 249-260, ISSN 0951-8320
5. S. Bologna, E. Ciancamerla, M. Minichino, A. Bobbio, G. Franceschinis, L. Portinale, and R. Gaeta - "Comparison of methodologies for the safety and dependability assessment of an industrial programmable logic controller", in European Safety and Dependability Conference (ESREL 2001), pages 411-418, September 2001
6. G. Chiola, C. Dutheillet, G. Franceschinis, and S. Haddad - "Stochastic well-formed coloured nets for symmetric modelling applications", IEEE Transactions on Computers, 42:1343-1360, 1993
7. A. Bobbio, G. Franceschinis, L. Portinale, and R. Gaeta - "Dependability Assessment of an Industrial Programmable Logic Controller", International Workshop on Petri Nets and Performance Models (PNPM'01), pages 29-37, September 2001
8. F. V. Jensen - An Introduction to Bayesian Networks - UCL Press, 1996
Checking Safe Trajectories of Aircraft Using Hybrid Automata Ítalo Romani de Oliveira and Paulo Sérgio Cugnasca Escola Politécnica da Universidade de São Paulo Department of Computer Engineering and Digital Systems Av. Prof. Luciano Gualberto, Trav. 3, nº 158 CEP 05508-900 - São Paulo - SP – Brazil {italo.oliveira,paulo.cugnasca}@poli.usp.br
Abstract. The air traffic management system is undergoing a generalized upgrading process. New safety requirements arise based on several emerging technologies. This work presents a model to automatically verify the safety of actions taken by human traffic controllers in frequent situations. The model uses the formalism of hybrid automata, and consists basically of segmenting the space of routes and associating each segment with a location of the automaton. Some kind of trajectory optimisation is also possible with this model.
1
Introduction
With the global implementation of the CNS/ATM (Communication, Navigation, Surveillance / Air Traffic Management) System, which represents a generalized upgrade in the technology of systems used in air transportation and whose implementation deadline is only a few years away, a great demand for navigation and traffic management support tools arises. One of the main goals of CNS/ATM is to optimise system capacity, with respect to the rate of arrivals and departures per minute, at the air transport terminals. The most critical bottleneck of this system is the number of simultaneous aircraft under the responsibility of a single ground controller. This number might be considerably increased, or at least be made compatible with a greater safety level, if the operator could use automatic previewing and verification tools. A great challenge in the construction of such tools is system modelling, because the chosen abstraction must be sufficiently faithful to reality that it states every unsafe condition accurately. Thus, Hybrid Systems Theory was chosen to describe and solve the problem. An appendix with a glossary is provided for a better understanding of this paper.
1.1 Hybrid Systems Theory
Hybrid Systems are systems that present a discrete behaviour in some aspects and a continuous one in others. As an example of a hybrid system, we have
an aircraft that may have the flaps open or closed. When the flaps are open, its aerodynamic behaviour follows a given set of differential equations; when the flaps are closed, this set is distinct. In the same sense, the commands needed to manoeuvre an aircraft in flight are different from those needed on the ground. Because of this, it is said that an aircraft possesses different modes of operation, although, inside each mode, its behaviour and its actuators follow a continuous differential behaviour. Several authors have proposed the application of Hybrid Systems to ATM modelling and similar areas. Some of the principal works are:
• Conflict resolution: synthesis and verification of automatic air traffic conflict detection and resolution. See [1], [2] and [3].
• Sea traffic: verification of a sea traffic management model for harbour zones. See [4].
• Strings of vehicles: modelling a networked automatic driving system for streams of vehicles on traffic ways. See [5].
• High-level analysis of ATM: a systemic approach to the ATM system, considering as its object set ground control, airports, the communications network, control software, aircraft, etc. See [6] and [7].
• Path planning: a calculation method that, given the initial and final points, determines a sub-optimal trajectory between them. See [8].
In order to model hybrid systems, the Theory of Hybrid Automata is used. Given the system we wish to model, one defines a hybrid automaton for each object or subsystem, so that each of its nodes represents an operation mode of that object or subsystem. For each node, its validity conditions are established, as well as its flow differential equations. Over the set of nodes, the possible transitions between them are defined, together with the pre- and post-conditions of each transition. Once the automaton is built and the unsafe states are identified, there are algorithms that perform the reachability analysis of those states, starting from given initial conditions.
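As a rough, hedged illustration of these ingredients (locations with flow equations and invariants, guarded transitions with resets, and a reachability question over unsafe locations), the sketch below encodes a toy hybrid automaton in Python; the dynamics, thresholds and names are invented for illustration and are not the automata used in this paper.

```python
# A minimal sketch of a hybrid automaton with a naive forward exploration;
# all dynamics and names are invented assumptions, not the paper's model.
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

State = Tuple[float, ...]          # continuous variables, e.g. (x, y)

@dataclass
class Location:
    name: str
    flow: Callable[[State], State]       # derivatives of the continuous variables
    invariant: Callable[[State], bool]   # must hold while staying in the location

@dataclass
class Transition:
    source: str
    target: str
    guard: Callable[[State], bool]                  # pre-condition of the edge
    reset: Callable[[State], State] = lambda s: s   # post-condition / variable resets

@dataclass
class HybridAutomaton:
    locations: Dict[str, Location]
    transitions: List[Transition]

    def simulate(self, loc: str, state: State, horizon: float, dt: float = 0.01):
        """Euler-integrate the flow, take the first enabled transition at each step,
        and return the sequence of visited locations."""
        visited, t = [loc], 0.0
        while t < horizon:
            for tr in self.transitions:
                if tr.source == loc and tr.guard(state):
                    state, loc = tr.reset(state), tr.target
                    visited.append(loc)
                    break
            flow = self.locations[loc].flow(state)
            state = tuple(v + dv * dt for v, dv in zip(state, flow))
            if not self.locations[loc].invariant(state):
                break                                # left the location's invariant
            t += dt
        return visited

# Toy aircraft-like example: flying a segment may reach an unsafe "conflict" location.
ha = HybridAutomaton(
    locations={
        "segment": Location("segment", flow=lambda s: (1.0, -0.5), invariant=lambda s: s[0] < 10),
        "conflict": Location("conflict", flow=lambda s: (0.0, 0.0), invariant=lambda s: True),
    },
    transitions=[Transition("segment", "conflict", guard=lambda s: abs(s[1]) < 0.1)],
)
print("unsafe reached:", "conflict" in ha.simulate("segment", (0.0, 3.0), horizon=10.0))
```

Tools such as HyTech perform the same kind of question symbolically over sets of states rather than by simulation, which is what the verification in this paper relies on.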
1.2 Managing Air Traffic Conflict in Terminal Areas and Surroundings
This work intends to establish directives for the modelling of the air traffic conflict problem in terminal areas and their surroundings, specifically in the case of sequencing approach procedures. An air traffic conflict is the situation where an aircraft invades the virtual cylinder of safe distance around another one. See Fig. 1.
Fig. 1. Two aircraft in a conflict situation. The cylinder is attached to the aircraft at its centre
The dimensions of this cylinder vary according to the class of air space, and the class of flight being executed. When a conflict occurs, it is defined that the implicated
aircraft are in an unsafe situation, because the risk of collision is no longer acceptable. This situation must be changed until a safe state is reached, where a correct distance is kept and a low risk of collision is maintained. Thus, in the ATM context, there are two classes of scenarios related to conflicts:
a) Conflict preview: given the actual conditions of a specific region of the air space, it is deduced that a conflict will occur within a certain time interval.
b) Conflict: two or more aircraft are already in a conflict situation.
And, respectively, two classes of actions to hold safety:
a) Anticipated conflict resolution: starting from a conflict preview, the controller executes the actions necessary to avoid the conflict.
b) Conflict resolution: as soon as the effective conflict situation is detected, the controller takes the necessary actions to escape from it.
Safe traffic management requires pro-activity, which means that only the previewed scenarios and actions are considered ordinary. The agent of conflict resolution is the ground traffic controller, who is generally classified into two classes: the terminal controller (APP) and the area controller (ACC). There is no formalized general method to preview a conflict and to solve it beforehand. In practice, the method is based on some rules and heuristics, and on the controller's expertise, which has seemed sufficiently reliable until now. With the increase of air traffic, however, an overload on the controller may occur and, consequently, an increase of the air traffic management risk. The use of aid tools for conflict preview and anticipated resolution might effectively contribute to raising the operational safety level. The process of conflict detection and resolution is illustrated in Fig. 2.
Fig. 2. Anticipated conflict resolution process
A first version of such a tool is drafted, which will perform the conflict preview for the following cases:
1. There are terminal areas that receive a traffic confluence from distinct area controls (for example, the São Paulo APP receives traffic from the Brasília and Curitiba ACCs). As the coordination among the distinct ACCs is not perfect, the APP controller frequently previews a conflict between an aircraft from the first ACC and an aircraft from the second, and must do its anticipated resolution in the terminal area, or close to it.
2. There are pre-defined entrances and exits at the terminal region (see Fig. 3). For each entrance, the area controller is responsible for sequencing aircraft coming from several directions, lining them up along the same entrance route. At intense traffic periods, this sequencing, which is a route confluence situation, requires anticipated conflict detection and resolution.
A model with the hybrid automata of the aircraft and of the star of approach routes, built from approach charts, will be presented. The procedure for composing these automata and the verification method run on the composite automaton will also be presented.
Fig. 3. Route confluence in the São Paulo terminal area
2 The Model
The problem of conflict detection is modelled as follows: first the problem is described, and then the automata are built. Both aircraft a and b in Fig. 4 are at a route confluence. The circle around each is a projection of the cylinder of safe separation. The reference points, indicated as triangles, are those present on the navigation chart. A reference point is an abstract geographic point whose position is described in the chart. The aircraft must pass over each of these points at a given altitude. We assume that, given the conditions at the instant described in Fig. 4, a conflict will occur if the speed, heading and flight level remain unchanged. Hence, the traffic controller will determine that these variables must be changed, which implies the execution of a deviation manoeuvre. Our objective is to verify whether the safety conditions are held during this manoeuvre.
2.1 Routes
Let us start by analysing the data and conditions of each route. Aircraft a and b follow the routes Pa and Pb, respectively, each possessing segments Sai and Sbi. These route segments are determined by the reference points Rai and Rbi, for i = 1, 2. Each reference point belongs to a transition window Wi, which the aircraft must necessarily go through.
Fig. 4. Horizontal plan of routes
Fig. 4 represents the horizontal plan of the routes. The transition windows are defined as vertical rectangles, with defined width and height. The transition windows work as boundaries between locations (discrete states) of the automaton. In other words, each location of the automaton that is not the final state represents the spatial set convex(Wi ∪ Wi+1).
Fig. 5. Transition windows and convex hull
The dashed lines of Fig. 5 indicate convex(W1 ∪ W2), i.e. the convex hull of W1 ∪ W2, a set that will be important in defining one of the safety conditions. It will also be seen that a safe trajectory must go through every transition window in a given sequence.
2.2 Aircraft
The aircraft variables that determine its behaviour are presented in Table 1. Note in Fig. 6 that the original safety cylinder was substituted by the parallelepiped B. This change was done to preserve the linearity conditions necessary for the automatic verification of automata. As in the original cylinder, the horizontal
dimensions of the box are far bigger than its height. Other polyhedra could be used, but we preferred a simple one. The safe conditions of operation are described below; the violation of any of them implies falling into an unsafe state.
i. Conflict absence, i.e., given aircraft a and b, (sb ∉ Ba(sa)) ∧ (sa ∉ Bb(sb)). This means that no aircraft has invaded the safety box of the other. The violation of this condition is the event that will be called get_conflict.
ii. As the aircraft goes through Wi, s ∈ Ti, where Ti = convex(Wi ∪ Wi+1). The succession of these sets for i = 0, 1, ..., n results in the tube of navigation inside which the aircraft must remain along its entire trajectory. The violation of this condition is the event called escaping.
iii. Being the aircraft in Ti, αLi ≤ α ≤ αRi. This defines a horizontal heading considered safe for the aircraft, according to the navigation chart. When this condition is violated, the misaligning event occurs.

Table 1. Aircraft variables
Variable (initial value provided)
v: Scalar speed.
α: Aircraft heading.
s = (x, y, z): Spatial position of the aircraft, as a function of time t and speed u, from a given absolute referential.
ri: Reset of the heading at the i-th window (see footnote 1). It is a discrete variable, with the possible values: 0 ≡ without change; 1 ≡ default heading of segment i; 2 ≡ heading of the reference point Ri+1. The default heading of segment Si is the heading of the straight line linking the points Ri and Ri+1.

Variable (initial value calculated)
u = (x', y', z'): Aircraft speed. The initial value is x' = v·sin α0, y' = v·cos α0, z' = z'0. The value of z'i is calculated from the difference between z0 and the next programmed flight level. Being at Wi, one may determine the next programmed flight level by the following procedure: if ri = 0 or ri = 1, then zi+1 = zi; if ri = 2, then zi+1 = z(Ri+1).
B: Safety box of the aircraft. Given the length l, height h and width w, the aircraft will remain inside the parallelepiped B with these dimensions, the alignment of whose horizontal edge is given by the heading α. This means that B is a function of the speed u and the position s. This parallelepiped always stays vertically aligned with the Y axis.
1. A reset of a variable is the assignment of a new given value to it at a state transition of the system.
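To make the conventions of Table 1 concrete, the small fragment below computes the components of u from v and α and applies the reset options of ri when a window is crossed; the units, example values and helper names are assumptions made for illustration only, not the authors' code.

```python
# Sketch of the Table 1 conventions; names and units are illustrative assumptions.
import math

def velocity(v: float, heading_deg: float, vertical_speed: float):
    """u = (x', y', z') with x' = v*sin(alpha) and y' = v*cos(alpha), heading from North."""
    a = math.radians(heading_deg)
    return (v * math.sin(a), v * math.cos(a), vertical_speed)

def next_flight_level(r_i: int, z_i: float, z_ref_next: float) -> float:
    """Reset options for r_i at window W_i: values 0 and 1 keep the current level,
    value 2 targets the level of the next reference point, as in Table 1."""
    return z_i if r_i in (0, 1) else z_ref_next

# Example: 200 kt at heading 045 deg, climbing; reset r_i = 2 towards the next point.
print(velocity(200.0, 45.0, vertical_speed=5.0))
print(next_flight_level(r_i=2, z_i=12000.0, z_ref_next=15000.0))
```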
w h (x’,y’,z’)
x
Z
y
B
z l Y X Fig. 6. Aircraft variables
iv. Being Wn the last transition window, the aircraft must go through it at a time lower than tmax. The violation of this condition is the event labelled timeout.
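Read as run-time predicates, the conditions above can be sketched as follows; the geometry is deliberately simplified (an axis-aligned box for B and a bounding-box approximation of the navigation tube) and all dimensions are invented, so this is only an informal reading of the conditions, not the authors' formulation.

```python
# Hedged sketch of the safe-operation conditions; thresholds and geometry are invented.
from typing import Tuple

Point = Tuple[float, float, float]

def in_safety_box(s_other: Point, s_own: Point, l: float, w: float, h: float) -> bool:
    """Condition on conflict is violated when another aircraft lies inside B(s_own)."""
    dx, dy, dz = (abs(a - b) for a, b in zip(s_other, s_own))
    return dx <= l / 2 and dy <= w / 2 and dz <= h / 2

def get_conflict(sa: Point, sb: Point, dims=(9000.0, 9000.0, 300.0)) -> bool:
    return in_safety_box(sb, sa, *dims) or in_safety_box(sa, sb, *dims)

def escaping(s: Point, tube_min: Point, tube_max: Point) -> bool:
    """Leaving T_i, here approximated by a bounding box of convex(W_i u W_i+1)."""
    return not all(lo <= v <= hi for v, lo, hi in zip(s, tube_min, tube_max))

def misaligning(alpha: float, alpha_left: float, alpha_right: float) -> bool:
    """Heading outside the safe range for the current segment."""
    return not (alpha_left <= alpha <= alpha_right)

def timeout(t: float, t_max: float) -> bool:
    """The last window was not crossed before t_max."""
    return t > t_max

print(get_conflict((0, 0, 10000), (4000, 1000, 10050)))   # True: inside the box
print(misaligning(95.0, 80.0, 110.0))                      # False: heading within limits
```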
2.3 System Automata
From the above definitions, the basic automaton of the system can be presented, as in Fig. 7. The subscripts a and b refer to aircraft a and b, respectively. Aircraft b has a similar automaton. For the verification of the event get_conflict, which is synchronized between the automata of both aircraft, it is necessary to verify the composite automaton. Let Ha and Hb be the automata of aircraft a and b, respectively. We build the composite automaton and perform the joint verification. The unsafe states are represented by the locations labelled escaped, misaligned, conflict and out_of_time.
2.4 Deviation Manoeuvres
The deviation manoeuvres are determined by the input variables v, rai and rbi. These variables indicate where there will or will not be a change of heading, according to the possible options in Table 1.
Fig. 7. Hybrid automaton Ha of the aircraft a
3 Execution Sample
Aiming to run a verification sample, the practical model had to be simplified. The simplification was done to avoid arithmetic overflow. The tool used was HyTech version 1.04 [11]. The simplification consisted of using only two spatial dimensions and discarding the states misaligned, escaped and out_of_time. The result is shown in Fig. 8.
Fig. 8. Simplified hybrid automata of the aircraft
The event get_conflict requires a conflict monitor, i.e., a new automaton called monitor_a_b (see Fig. 9), which originates this event and runs in parallel with those of the aircraft. This necessity arises from the algorithm's restriction that the invariants of the locations be linear convex sets [12]. Note in Fig. 10 that the safety cylinder of the aircraft became a square, because of the required linearization and the dimensional reduction.
Fig. 9. Conflict monitor automaton
Composing the three automata above and running them on HyTech, the result obtained indicates a conflict region. This region cannot be visualized in its entirety, because it is 7-dimensional. Instead, some trajectories can be observed so as to understand the solution. Fig. 10 illustrates the initial states, the transition windows, and the invariant conditions for each location. Fig. 11 shows the fastest possible conflicting trajectory. Fig. 12 is the trajectory obtained from the trace command of HyTech. In turn, Fig. 13 represents the time-optimal trajectory, which minimizes the time that aircraft a takes to get into the go_aheada state. The optimal trajectories were obtained by running a linear program over the HyTech results.
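The final step mentioned above, extracting a time-optimal trajectory by running a linear program over the reachable region, can be reproduced in spirit with an off-the-shelf LP solver; the constraints below are invented stand-ins for the linear region that HyTech would report, so only the shape of the computation is meaningful.

```python
# Sketch of the post-processing step: minimise the crossing time of aircraft a
# subject to linear constraints standing in for an (assumed) reachable region.
from scipy.optimize import linprog

# Decision variables: [t_a, xdot_a] -- crossing time and horizontal speed of aircraft a.
c = [1.0, 0.0]                      # minimise t_a
A_ub = [[-1.0, 0.0],                # t_a >= 3.5   (assumed conflict boundary)
        [0.0, 1.0],                 # xdot_a <= 3  (assumed flow upper bound)
        [0.0, -1.0]]                # xdot_a >= 1  (assumed flow lower bound)
b_ub = [-3.5, 3.0, -1.0]
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (None, None)])
print(res.x)                        # optimum t_a = 3.5, consistent with Fig. 13
```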
4 Final Remarks
In theory, the model above permits the composition of innumerable routes and aircraft, and the adoption of a myriad of deviation manoeuvres. It would be highly useful to conduct research on the complexity of the problem and practical experiments of verification of systems with a large number of aircraft. On the other hand, in order to make this system really operational, other kinds of constraints will need to be included, such as wake turbulence, aircraft type, etc.
Fig. 10. Initial conditions of the simplified sample. The grayed areas represent the safe areas of the aircraft
Fig. 11. A conflicting trajectory
Fig. 12. A sample safe trajectory. The total time for aircraft a was 8 units of time
Fig. 13. An optimal safe trajectory. The total time for aircraft a was 3.5 units of time. Note that at that time, the aircraft are immediately before the conflict boundary
References

1. Bonifácio, A. L.: Verificação e Síntese de Sistemas Híbridos. MSc. Dissertation, Instituto de Computação, Unicamp, 2000.
2. Tomlin, C.; Lygeros, J.; Sastry, S.: Synthesizing Controllers for Non-linear Hybrid Systems. LNCS 1386, Springer-Verlag, 1998.
3. Tomlin, C.; Lygeros, J.; Sastry, S.: Computing Controllers for Non-linear Hybrid Systems. LNCS 1569, Springer-Verlag, 1999.
4. GodHavn, J. M.; Lauvdal, T.; Egeland, O.: Hybrid Control in Sea Traffic Management Systems. LNCS 1066, Springer-Verlag, 1996.
5. Lygeros, J.; Lynch, N. A.: Strings of Vehicles: Modelling and Safety Conditions. LNCS 1386, Springer-Verlag, 1998.
6. Lynch, N.: High-Level Modelling and Analysis of an Air-Traffic Management System. LNCS 1569, Springer-Verlag, 1999.
7. Lygeros, J.; Pappas, G. J.; Sastry, S.: An Approach to the Verification of the TRACON Automation System. LNCS, Springer-Verlag, 1998.
8. Egerstedt, M.; Koo, T. J.; Hoffmann, F.; Sastry, S.: Path Planning and Flight Controller Scheduling for an Autonomous Helicopter. LNCS 1569, Springer-Verlag, 1999.
9. Alur, R.; Courcoubetis, C.; Henzinger, T. A.; Ho, P.-H.: Hybrid Automata: An Algorithmic Approach to the Specification and Verification of Hybrid Systems. In R. L. Grossman, A. Nerode, A. P. Ravn, and H. Rischel, editors, Hybrid Systems, Lecture Notes in Computer Science 736, pages 209-229. Springer-Verlag, 1993.
10. Alur, R.; Henzinger, T. A.; Ho, P.-H.: Automatic Symbolic Verification of Embedded Systems. In Proceedings of the 14th Annual Real-Time Systems Symposium, pages 2-11. IEEE Computer Society Press, 1993. Full version in IEEE Transactions on Software Engineering, 22(3):181-201, 1996.
11. Henzinger, T. A.; Ho, P.-H.; Wong-Toi, H.: HyTech: The Next Generation. In Proceedings of the 16th Annual Real-Time Systems Symposium, pages 56-65. IEEE Computer Society Press, 1995.
12. Henzinger, T. A.; Ho, P.-H.; Wong-Toi, H.: A User Guide to HyTech. In E. Brinksma, W. R. Cleaveland, K. G. Larsen, T. Margaria, and B. Steffen, editors, TACAS 95: Tools and Algorithms for the Construction and Analysis of Systems, Lecture Notes in Computer Science 1019, pages 41-71. Springer-Verlag, 1995.
13. Henzinger, T. A.; Nicollin, X.; Sifakis, J.; Yovine, S.: Symbolic Model Checking for Real-Time Systems. Information and Computation, 111(2):193-244, 1994.
14. Henzinger, T. A.; Wong-Toi, H.: Linear Phase-Portrait Approximations for Non-linear Hybrid Systems. In R. Alur and T. A. Henzinger, editors, Hybrid Systems III, Lecture Notes in Computer Science. Springer-Verlag, 1995.
Appendix: Glossary

ACC: Air Traffic Control Centre. An area control centre established to provide air traffic control service.
APP: Approach Control. A terminal control centre established to provide air traffic control service.
ATM: Air Traffic Management.
CNS: Communication, Navigation and Surveillance.
Flight level: Band of altitude of the aircraft. The international standard assigns 100 ft for each level. So, an aircraft flying at level 150 has altitudes between 15000 and 15099 ft.
Location: In a hybrid automaton, it represents a discrete state. Every transition between locations is done by edges, graphically represented as arrows in the automaton.
Navigation chart: Pre-defined sequence of reference points which the aircraft must pass over. A chart determines the route that an aircraft must follow.
Heading: Angle of direction of the aircraft at the horizontal plan, measured from the North.
Route: Navigation tube based on a navigation chart.
Trajectory: Path performed by the aircraft. We assume in this model that the trajectories are the union of linear segments, whose vertices belong to the transition windows.
Transition window: A transition window Wi is defined as a vertical rectangle, with defined width and height. The transition windows work as boundaries between locations. In other words, each location of the automaton that is not final represents a set like convex(Wi ∪ Wi+1), in space. A safe trajectory must go through every transition window, in a given sequence.
Wake turbulence: A trail of small horizontal tornados left behind the aircraft due to its movement.
Model-Based On-Line Monitoring Using a State Sensitive Fault Propagation Model Yiannis Papadopoulos Department of Computer Science, University of Hull Hull, HU6 7RX, UK
[email protected]
Abstract. As the safety analyses of critical systems typically cease, or reduce in their utility, after system certification, useful knowledge about the behaviour of the system in conditions of failure remains unused in the operational phase of the system lifecycle. In this paper, we show that this knowledge could be usefully exploited in the context of an on-line, hazard-directed monitoring scheme in which a suitable specification derived from design models and safety analyses forms a reference monitoring model. As a practical application of this approach, we propose a safety monitor that can operate on such models to support the on-line detection, diagnosis and control of hazardous failures in real-time. We discuss the development of the monitoring model and report on a case study that we performed on a laboratory model of an aircraft fuel system.
1 Introduction
The idea of using a model derived from safety assessment as a basis for on-line system monitoring dates back to the late seventies when STAR, an experimental monitoring system, used Cause Consequence Diagrams (CCDs) to monitor the expected propagation of disturbances in a nuclear reactor [1]. CCDs are composed of various interconnected cause and consequence trees, two models which are fundamentally close to the fault tree and event tree, i.e. the causal analyses that typically form the spine of a plant Probabilistic Risk Assessment. In STAR, such CCDs were used as templates for systematic monitoring, detection of abnormal events and diagnosis of the root causes of failures through extrapolation of anomalous event chains. In the early eighties, another system, the EPRI-DAS monitor, used a similar model, the Cause Consequence Tree [2], to perform early matching of anomalous event patterns that could signify the early stages of hazardous disturbances propagating into the system. Once more the monitoring model was a modified fault tree in which delays have been added to determine the precise chronological relationships between events. More recently, the experimental risk monitors developed for the Surry and San Onofre nuclear power plants in the USA [3] also employed a fault-propagation model
to assist real-time monitoring and prediction of plant risks. This is a master fault tree synthesised from the plant safety assessment.

Recently, research has also shown that it is possible to combine the use of such fault propagation models with other contemporary developments in the primary detection of failures. Varde et al. [4], for example, describe a hybrid system in which artificial neural networks perform the initial primary detection of anomalies and transient conditions, while a rule-based monitor then performs diagnoses using a knowledge base that has been derived from the system fault trees. The knowledge base is created as fault trees are translated into a set of production rules, each of which records the logical relationship between two successive levels of the tree.

One important commonality among the above prototypes is that they all use causal models of expected relationships between failures and their effects which are typically similar to a fault tree. Experience from the use of such models, however, has shown that substantial problems can be caused in monitoring by incomplete, inaccurate, non-representative and unreliable models [5]. Indeed, difficulties in the development of representative models have so far prevented the widespread employment of such models. One substantial such difficulty, and often the cause of omissions and inaccuracies in the model, is the lack of sufficient mechanisms for representing the effects that changes in the behaviour or structure of complex dynamic systems have on the fault propagation in those systems. Such changes, often caused by complex mode and state transitions of the system, can in practice confuse a monitoring system and contribute to false, irrelevant and misleading alarms [6].

To illustrate this problem, let us consider for example the case of a typical aircraft fuel system. In this system, there are usually a number of alternative ways of supplying the fuel to the aircraft engines. During operation, the system switches between different states in which it uses different configurations of fuel resources, pumps and pipes to maintain the necessary fuel flow. Initially, for example, the system may supply the fuel from the wing tanks and, when these resources are depleted, it may continue providing the fuel from a central tank. The system also incorporates complex control functions such as fuel sequencing and transfer among a number of separated tanks to maintain an optimal centre of gravity.

If we attempt to analyse such a system with a technique like fault tree analysis we will soon be faced with the difficulties caused by the dynamic behaviour of the system. Indeed, as components are activated, deactivated or perform alternative fuel transfer functions in different operational states, the set of failure modes that may have adverse effects on the system, as well as those effects, change. Inevitably, the causes and propagation of failure in one state of the system are different from those in other states. Representing those state dependencies is obviously crucial in developing accurate, representative and therefore reliable monitoring models. But how can such complex state dependencies be taken into account during the fault tree analysis, and how can they be represented in the structure of fault trees?

In this paper, we draw from earlier work on monitoring using fault propagation models that are similar to fault trees.
Indeed, we propose a system in which a set of fault trees derived from safety assessment is used as a reference model for on-line monitoring. However, to address the problem of complex state dependencies, we complement those fault trees with a dynamic model of system behaviour that can capture the behavioural transformations that occur in complex systems as a hierarchy
of state machines. This model acts as a frame that allows fault trees to be placed and then interpreted correctly in the context of the dynamic operation of a system. In section two we discuss the form of the monitoring model and its development process. In section three we discuss the algorithms required to operate on such models in order to deliver monitoring and diagnostic functions in real time. In section four, we outline a case study that we performed on a laboratory model of an aircraft fuel system and finally in section five we draw conclusions and outline further work.
2 Modelling
The general form of the proposed monitoring model is illustrated in Fig.1. The spine of the model is a hierarchy of abstract state-machines which is developed around the structural decomposition of the system. The structural part of that model records the physical or logical (in the case of software) decomposition of the system into composite and basic blocks (left hand side of Fig.1). This part of the model shows the topology of the system in terms of components and connections, and can be derived from architectural diagrams of varying complexity that may include abstract engineering schematics, piping/instrumentation diagrams and detailed data flow diagrams. The dynamic part of the model is a hierarchy of state machines that determine the behaviour of the system and its subsystems (right hand side of Fig.1). This part of the model can be developed in a popular variant of hierarchical state automata such as state-charts. The model identifies normal states of the system and its subsystems. Beyond normal behaviour, though, the model also identifies transitions to deviant or failed states each representing a loss or a deviation from the normal functions delivered by the system. A HAZOP (Hazard and Operability) style functional hazard analysis technique is used to systematically identify and record such abnormal functional states. Following the application of this technique, the lower layers of the behavioural model identify transitions of low-level subsystems to abnormal functional states, in other words states where those subsystems deviate from their expected normal behaviour. As we progressively move from the leaf nodes towards the higher layers of the behavioural model, the model shows how logical combinations or sequences of lower-level subsystem failures (transitions to abnormal states) propagate upwards and cause functional failure transitions at higher levels of the design. The model also records potential recovery measures at different levels of the design, and the conditions that verify the success, omission or failure of such measures. As Fig.1. illustrates, some of the failure transitions at the low-levels of the design represent the top events of fault trees which record the causes and propagation of failure through the architectures of the corresponding subsystems. Classical fault tree analysis techniques can be used to derive those fault trees. However, a methodology for the semi-automatic construction of such fault trees has also been presented in SafeComp’99 and is elaborated in [7]. According to that methodology (see also [8]) the fault trees which are attached to the state-machines can be semi-mechanically synthesised by traversing the structural model of the system, and by establishing how the local effects of failure (specified by analysts at component level) propagate
through connections in the model and cause failure transitions in the various states of the system.

One difference between the proposed model and a master fault tree is that the proposed model records sequences of failures taking into account the chronological order of the sequence. The model also shows the gradual transformation of lower-level failures into subsystem failures and system malfunctions. Thus, the model not only situates the propagation of failure in the evolving context of the system operation, but also provides an increasingly more abstract and simplified representation of failure in the system. This type of abstraction could help to tackle the problem of state explosion in the representation of large or complex systems. Moreover, in real time, a monitoring model that embodies such a layered view of failure and recovery could assist the translation of low-level failures into system malfunctions and the provision of higher-level functional alarms where appropriate.

The development of the proposed model relies on the application of established and widely used design and safety analysis techniques and notations, such as flow diagrams, state-charts and fault trees. Thus, there would be little value in expanding here on the application of those techniques, which are generally well understood. Also, technical details about how to produce and integrate the various models into the overall specification of Fig. 1 can be found in [9]. In the remainder of this section, we focus on one aspect of modelling which is particularly relevant to the monitoring problem. We discuss how the model, effectively a set of diagrams derived from the design and safety analysis, can be transformed into a more precise specification that could be used for the on-line detection and control of failures in real-time.

In their raw form, the designs and analyses that compose the proposed model (i.e. flow diagrams, state-charts and fault trees) would typically contain qualitative descriptions of events and conditions. However, the on-line detection of those events and conditions would clearly require more precise descriptions. One solution here would be to enhance the model with appropriate expressions that an automated system could evaluate in real-time. Since many of the events and conditions in the model represent symptoms of failures, though, it would be useful to consider first the mechanisms by which system failures are manifested on controlled processes.
Fig. 1. General form of monitoring model
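One way to picture the specification of Fig. 1 is as a tree of subsystem state machines whose abnormal transitions carry a detection expression and, at the lowest levels, an attached fault tree and a recovery measure. The sketch below is an invented, drastically simplified rendering of that structure, not the tool's actual data model.

```python
# Illustrative sketch of the Fig. 1 monitoring model as nested data; all names are invented.
from dataclasses import dataclass, field
from typing import Callable, List, Optional

@dataclass
class FaultTreeNode:
    event: str
    gate: str = "OR"                          # "OR", "AND" or "BASIC"
    children: List["FaultTreeNode"] = field(default_factory=list)

@dataclass
class FailureTransition:
    source: str                               # normal state, e.g. "supplying"
    target: str                               # abnormal state, e.g. "no_flow"
    expression: Callable[[dict], bool]        # detection expression over sensor data
    fault_tree: Optional[FaultTreeNode] = None
    recovery: Optional[str] = None            # corrective measure attached to the transition

@dataclass
class SubsystemModel:
    name: str
    states: List[str]
    failure_transitions: List[FailureTransition]
    children: List["SubsystemModel"] = field(default_factory=list)

# Invented fragment: an engine-feed subsystem whose "no_flow" transition is detected
# from a flow sensor and diagnosed through a two-cause fault tree.
engine_feed = SubsystemModel(
    name="engine_feed",
    states=["supplying", "no_flow"],
    failure_transitions=[FailureTransition(
        source="supplying", target="no_flow",
        expression=lambda sensors: sensors["FR1"] < 0.1,
        fault_tree=FaultTreeNode("no_flow", "OR", [
            FaultTreeNode("valve_blocked", "BASIC"),
            FaultTreeNode("pump_failed", "BASIC"),
        ]),
        recovery="open cross-feed valve and re-balance transfer pumps",
    )],
)
print(engine_feed.failure_transitions[0].expression({"FR1": 0.0}))
```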
In a simple case a failure would cause a violation of a threshold in a single parameter of the system. In that case, the symptom of failure could be described more formally with a single constraint. The deviation “excessive flow”, for example, could be described with an expression of the type “flow>high”, where flow is a flow sensor measurement and high is the maximum allowable value of flow. In general, though, failures can cause alternative or multiple symptoms, and therefore an appropriate syntax for monitoring expressions should allow logical combinations of constraints, where each constraint could describe, for example, one of the symptoms of failure on the monitored process. One limitation of this scheme is that, in practice, constraints may fire in normal operating conditions and in the absence of failures. Indeed, in such conditions, parameter values often exhibit a probabilistic distribution. That is, there is a non-zero probability of violation of normal operating limits in the absence of process failures, which would cause a monitor to raise false alarms. Basic probability theory, however, tells us that the latter could be minimised if there was a mechanism by which we could instruct the monitor to raise an alarm only if abnormal measurements persisted over a period of time and a number of successive readings. The mechanism that we propose for filtering spurious abnormal measurements is, precisely, a monitoring expression that fires only if it remains true over a period of time. To indicate that a certain expression (represented as a logical combination of constraints) is bound by the above semantics we use the syntax T(expression, ∆t) where ∆t is the period in seconds for which expression has to remain true in order for T(expression, ∆t) to fire. ∆t is an interval which always extends from time t-∆t in the past to the present time t. The above mechanism can also be used for preventing false alarms arising from normal transient behaviour. Consider for example, that we wish to monitor a parameter in closed loop control and raise an alarm every time a discrepancy is detected between the current set-point and the actual value of the parameter. We know that a step change in the set-point of the loop is followed by a short period of time in which the control algorithm attempts to bring the value of the parameter to a new steady state. Within this period, the value of the parameter deviates from the new set-point value. To avoid false alarms arising in this situation, we could define a monitoring expression that fires only when abnormal measurements persist over a period that exceeds the time within which the control algorithm would normally correct a deviation from the current set-point. We must point out that even if we employ this type of “timed expression”, persistent noise may still cause false alarms in certain situations, when for example the value of the parameter lies close to the thresholds beyond which the expression fires. To prevent such alarms we need to consider the possible presence of noise in the interpretation of the sensor output. A simple way to achieve this is by relaxing the range of normal measurements to tolerate a reasonable level of noise. The expressions that we have discussed so far allow detection of momentary or more persistent deviations of parameters from intended values or ranges of such values. 
Such expressions would be sufficient for detecting anomalous symptoms in systems where the value of parameters is either stable, moves from one steady state to another, or lies within well-defined ranges of normal values. This, however, is certainly not the case in any arbitrary system. Indeed, parameters in controlled
processes are often manipulated in a way that forces their value to follow a particular trend over time. To detect violations of such trends we have also introduced a set of primitives in the model that allow expressions to reference historical values and calculate parameter trends. Such primitives include, for example, a history operator P(∆t), which returns the value of parameter P at time ∆t in the past, as well as more complex differentiation and integration operators that calculate trends over sequences of historical parameter values. The differentiation operator D(expression, ∆t), for example, when applied to an expression over time ∆t, returns the average change in the value of the expression during an interval which extends from time t-∆t in the past to the present time t. Such operators were sufficient to monitor trends in the example fuel system. However, more complex statistical operators may be required for long-term monitoring of other control schemes.
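To give a flavour of how T(expression, ∆t) and D(expression, ∆t) might be evaluated over buffered samples, here is a small sketch with an invented sampling interface; it is not the implementation used in the monitor described here, and the parameter names, periods and thresholds are assumptions.

```python
# Hedged sketch of the timed and trend operators over a ring buffer of samples.
from collections import deque

class ParameterHistory:
    def __init__(self, period_s: float, horizon_s: float):
        self.period = period_s
        self.buffer = deque(maxlen=int(horizon_s / period_s) + 1)   # ring buffer

    def record(self, value: float):
        self.buffer.append(value)

    def T(self, predicate, dt_s: float) -> bool:
        """Fires only if `predicate` held for every sample in the last dt_s seconds."""
        n = int(dt_s / self.period) + 1
        if len(self.buffer) < n:
            return False        # not enough history: treat as 'unknown', raise no alarm
        return all(predicate(v) for v in list(self.buffer)[-n:])

    def D(self, dt_s: float) -> float:
        """Average rate of change of the parameter over the last dt_s seconds."""
        n = int(dt_s / self.period) + 1
        window = list(self.buffer)[-n:]
        return (window[-1] - window[0]) / dt_s

# Example: a flow 3% above the demand x must persist for 3 s before the alarm fires.
x = 100.0
flow = ParameterHistory(period_s=0.5, horizon_s=10.0)
for sample in [104.0, 104.5, 105.0, 105.5, 106.0, 106.5, 107.0, 107.5]:
    flow.record(sample)
print(flow.T(lambda v: v > 1.03 * x, dt_s=3.0))   # True: the deviation persisted
print(flow.D(dt_s=3.0))                            # positive trend over the window
```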
3 Monitoring Algorithms
Given that information from the design and safety analysis of a system has been integrated to form the specification of Fig.1, and that this specification has been annotated with monitoring expressions as explained above, then it is possible for an automated monitor to operate on this model to detect, diagnose and control hazardous failures in real-time. Fig.2 (next page) gives an outline of the general position and architecture of such a monitor. The monitor lies between the system and its human operators. It also lies between the system and its monitoring model. Parts of that model could in principle be developed in widely used modelling and analysis tools such as Statemate and Fault Tree Plus. However, for the purposes of this work, we have also developed a prototype tool that provides integrated state modelling and analysis capabilities and provides a self-contained environment for the development of the model. Once developed in this tool, the monitoring model is then exported into a model file. Between that file and the on-line monitor lies a parser that can perform syntactical analysis and regenerate (in the memory of the computer) the data structures that define the monitoring strategies for the given system. The on-line monitor itself incorporates three mechanisms that operate on those data structures to complete the various stages of the safety monitoring process. The first of those mechanisms is an event monitor. The role of this mechanism is to detect the primary symptoms of failure or successful recovery on process parameters. This is accomplished as the monitor continuously (strictly speaking, periodically) evaluates a list of events using real-time sensory data. This list contains all the events that represent transitions from the current state of the system and its subsystems. Such events are not restricted to system malfunctions and can also be assertions of operator errors or the undesired effects of such errors on the monitored process. This list of monitored events changes dynamically as the system and its subsystems move from one state to another, and as such transitions are recognised by the monitor. Historical values of monitored parameters are stored and accessed by the monitor from ring buffers the size of which is determined at initialisation time when the monitor analyses the expressions that reference historical values. A system of threevalue logic [10] that introduces a third “unknown” truth value in addition to “true”
and “false” is also employed to enable evaluation of expressions in the context of incomplete information. In practice, this gives the monitor some ability to reason in the presence of (detected) sensor failures and to produce early alarms on the basis of incomplete process data. One limitation of the above scheme is that the conditions detected by the event monitor generally reflect the symptoms of failures in the system, and do not necessarily point out underlying causes of failure. If we assume that the monitor operates on a complex process, however, then we would expect that some of those symptoms would require further diagnosis before appropriate corrective action can be taken. The second mechanism of the safety monitor is precisely a diagnostic engine. In each cycle of the monitor, the engine locates the root failures of detected anomalous symptoms by selectively traversing branches of the fault trees in which the initial symptoms appear as top events. The diagnostic algorithm combines heuristic search and blind depth-first traversal strategies. As the tree is parsed from top to bottom, the engine always checks the first child of the current node.
Fig. 2. The position and architecture of the safety monitor
If there is an expression that can be evaluated, a heuristic search is initiated among the siblings of the child node to decide which child will become the current node, i.e. in which branch the diagnosis will proceed. If there is no such expression, a blind depth first search strategy is initiated with an objective to examine all branches until expressions are found in lower levels or a root failure is diagnosed. At the end of a successful diagnosis, the algorithm returns a set of root causes (or a single cause if the tree contains only “OR” gates) that satisfies the logical structure of the fault tree. It is perhaps important to point out that the diagnosis proceeds rapidly as the value of nodes is calculated instantly from current or historical process information without the need to initiate monitoring of new trends that would postpone the traversal of the tree. In fact, this method is not only faster but also consistent with the semantics of the fault tree as a model. Indeed, as we move downwards in the tree we always move from effects to causes which of course precede their effects in chronological order. However, since the events that we visit during the traversal form a reversed causal chain, they can only be evaluated correctly on the basis of historical process data. In each cycle of the monitor, detected and diagnosed events are finally handled by an event processor, which examines the impact of those events on the state-charts of the system and its subsystems. By effectively executing those state-charts, the event processor initially infers hazardous transitions caused by those events at low subsystem level. Then, it recursively infers hazardous transitions triggered by combinations of such low-level transitions at higher levels of the system representation. It thus determines the functional effects of failure at different levels, provides high level functional alarms and guidance on corrective measures specified at various levels of abstraction in the model. The event processor also keeps track of the current state of the system and its subsystems, and determines which events should be monitored by the event monitor in each state. The processor can itself take action to restore or minimise the effects of failure, assuming of course that the monitor has control authority and corrective measures have been specified in a way that can be interpreted by an automated system. In the current implementation, simple corrective measures can be specified as assignments of values to parameters that define the state of on-off controllers or the set-point of control loops.
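A minimal reading of this diagnostic step is a top-down traversal that evaluates node expressions against process data where they exist, descends blindly otherwise, and returns root causes that satisfy the gate logic. The sketch below follows that idea with invented node names and handles only simple OR/AND structures; it is an illustration of the principle, not the engine described above.

```python
# Sketch of a fault-tree diagnostic traversal; node structure and names are invented.
from dataclasses import dataclass, field
from typing import Callable, List, Optional

@dataclass
class Node:
    event: str
    gate: str = "BASIC"                        # "OR", "AND" or "BASIC" (root cause)
    expression: Optional[Callable[[dict], bool]] = None
    children: List["Node"] = field(default_factory=list)

def diagnose(node: Node, data: dict) -> List[str]:
    """Return root causes consistent with the observed symptom at `node`."""
    if node.gate == "BASIC":
        return [node.event]
    causes: List[str] = []
    for child in node.children:
        if child.expression is not None and not child.expression(data):
            continue                            # expression evaluates false: prune branch
        causes.extend(diagnose(child, data))
        if node.gate == "OR" and causes:
            break                               # one satisfied branch explains an OR gate
    return causes

# Invented example: "no flow to engine" explained by a blocked valve or a stopped pump.
tree = Node("no_flow_to_engine", "OR", children=[
    Node("valve_VL2_blocked", "BASIC", expression=lambda d: d["VL2_pos"] == "closed"),
    Node("pump_PL1_stopped", "BASIC", expression=lambda d: d["PL1_speed"] < 1.0),
])
print(diagnose(tree, {"VL2_pos": "open", "PL1_speed": 0.0}))   # ['pump_PL1_stopped']
```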
4 Case Study
For the purposes of this work, an experimental monitor that can operate on models that conform to the proposed specification was developed and this monitoring approach was applied on a laboratory model of an aircraft fuel system. Our analyses of this system were intentionally based on a very basic control scheme with no safety monitoring or fault tolerance functions. This gave us the opportunity to combine the results into a comprehensive model that provides an extensive specification of potential failure detection and control mechanisms that improve the safety of that system. This specification formed the monitoring model which was then used as a knowledge base for the on-line detection, diagnosis and automatic correction of failures that we injected in the system in a number of monitoring experiments. Such failures included fuel leaks, pipe blockages, valve and pump malfunctions and
corruption of computer control commands that, in reality, could be caused, for example, by electromagnetic interference. In the space provided here, it is only possible to include a brief discussion of the study and highlight the main conclusions that we have drawn in the light of our experiments. For further information the reader is referred to an extensive technical report on this work [9].

The prototype fuel system is a model of the fuel storage and fuel distribution system of a twin-engine aircraft. The configuration of the system is illustrated in Fig. 3 and represents a hypothetical design in which the fuel resources of the aircraft are symmetrically distributed in seven tanks that lie along the longitudinal and lateral axes of the system. A series of pumps can transfer fuel through the system and towards the engines. The direction and speed of those pumps, and hence of the flows in the system, are computer controlled. The figure also shows a number of valves which can be used to block or activate paths and isolate or provide access to fuel resources. Finally, analogue sensors that measure the speed of pumps, fuel flows and fuel levels provide indications of the controlled parameters in the system.

During normal operation, the central valve VC2 is closed and each engine is fed from a separate tank at a variable rate x, which is defined by the current engine thrust setting. Other fuel flows in the system are controlled in a way that fuel resources are always symmetrically distributed in the system and the centre of gravity always lies near the centre of the system. This is achieved by a control scheme in which each pump in the system is controlled in a closed loop with one or more local flow meters to achieve a certain flow which is always proportional to the overall demand x, as illustrated in Fig. 3.
Pump & Speed Sensor
P
Valve L PR4
Flow Meter
L
Level Sensor
FR3
x/7
Left Wing Tanks
F
Right Wing Tanks
PR3
(FR2-FR4)/2
(FL2-FL4)/2 L
L PL5 FL4
L
FL5
L
L PR5
4x/7 FL3
FL2
FR2
x/7
FR4
PL3
VL3
VL4
VR4
VR3 PL4
FR5
L
4x/7 Rear Tank x
PR1
VR2
x
VL2
FR1
FL1 VR1
VC2
PL1
VL1 Starboard Engine
Port Engine
Fig. 3. The fuel system
The first step in the analysis of this system was to develop a structural hierarchy in which the system was first represented as an architecture that contained four subsystems (engine feeding subsystem, left & right wings, central subsystem), and then each subsystem was refined into an architectural diagram that defined basic components and flows. A state-chart for each subsystem was then developed to identify the main normal functional states of the subsystem. The engine feeding subsystem, for example, has only one such state in which it maintains supply to both engines at a variable flow rate x that always equals the current fuel demand. A HAZOP style functional failure analysis assisted the identification of deviations from that normal state, for example conditions in which the subsystem delivers “no flow”, “less flow” and “reverse flow”. Such deviations, in turn, pointed out transitions into temporary or permanently failed states which we recorded in the model of the subsystem. As part of the modelling process, deviations that formed such transitions were also augmented with appropriate expressions to enable their detection in real-time. The deviation “more flow” in the line towards the port engine, for example, was described with the expression T(FR1>1.03*x,3) which fires only if an abnormal measurement of flow in the line, which is 3% above normal demand x, persists for more than 3 sec. The causes of such deviations, such as valve & pump failures, leaks, blockages and omissions or commissions of control signals, were determined in the structure of fault trees that were semi-automatically constructed by traversing the structural model of the system and by using the approach described in [7]. Nodes of those fault trees were also augmented with monitoring expressions and the trees were then used in real-time for the diagnosis of root causes of failure. Recovery measures were also specified in the structure of the model to define appropriate responses to specific failure scenarios. If, for example, there was an interruption of flow in the line towards the starboard engine, and this was due to a blockage of valve VL2 (this could be confirmed via fault tree diagnosis), then one way to restore the flow in the line would be to open valve VC2. However, to maintain the equal distribution of fuel among the various tanks, one would have to redirect flows in the system. By solving a set of simultaneous equations derived from the new topology of the system and the relationship between volume reduction and input/output flows in each tank, it can easily be inferred that the direction of flow between the rear and central tank should be reversed and the set-points of pumps PL3 and PR3 should be changed from {x/7,x/7} to {-6x/7, 8x/7}. The three measures that form this recovery strategy can be described with the following expression that an automated monitor could interpret and execute in real-time: {VC2:=1 and PL3:=-6x/7 and PR3:=8x/7}. Such expressions, indeed, were introduced in the model and enabled the monitor to take automatic responses to deviations and root causes of failures that were detected or diagnosed in real-time. Finally, once we completed the analysis of the four subsystems, we used the results to synthesise a state-chart for the overall fuel system. In general, failures that are handled in the local scope of a subsystem do not need to be considered in a higher level chart. 
Wing subsystems, for example, are obviously able to maintain supply of fuel and balance between wing tanks in cases of single valve failures. Since they do not have any effects at system level, such failures therefore do not need to be considered in the fuel system chart. In generalising this discussion, we could say that
only non-recoverable transitions of subsystems to degraded or failed states need to be considered at a higher level, and this thankfully constrains the size of higher-level charts. Indeed, in our case study, the chart of the fuel system incorporated twenty states, that is, approximately as many states as that of the engine feeding subsystem.

Our monitoring experiments generally confirmed the capacity of the monitor to detect the transient and permanent failures that we injected into the system. Using appropriately tuned timed expressions, the monitor was able to judge the transient or permanent nature of the disturbances caused by valve or pump failures, and thus to decide whether to trigger or filter alarms in response. Complex failure conditions like structural leaks were also successfully detected using complex expressions that fired when significant discrepancies were detected between the reduction of level in tanks and the net volume of fuel that had flowed out of the tanks over specified periods of time.

Although injected failures did not include sensor failures, sensors proved to be in practice the most unreliable element in the system. On many occasions, therefore, we encountered unplanned transient and permanent sensor failures, which gave us the opportunity to examine, to some extent, the response of the event monitor to this class of failures. With the aid of timed expressions the monitor generally responded well to transient sensor failures by filtering transient abnormal measurements. However, in the absence of a generic validation mechanism, permanent sensor failures have often misled the monitor into raising false alarms, and into taking unnecessary and sometimes even hazardous action.
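The leak check described above is essentially a mass balance over a time window: the drop in measured tank level is compared with the net volume pumped out of the tank in the same interval. The fragment below states one plausible form of such an expression; the sensor names, units and tolerances are invented for illustration.

```python
# Hedged sketch of a tank-leak check: the drop in measured level should roughly match
# the net volume pumped out over the same window; names and tolerances are assumptions.
def leak_suspected(level_history, outflow_history, dt_s: float, tolerance: float) -> bool:
    """level_history: tank level samples [volume]; outflow_history: net outflow
    samples [volume/s], both sampled every dt_s seconds."""
    level_drop = level_history[0] - level_history[-1]
    pumped_out = sum(outflow_history) * dt_s          # integrate net outflow
    return (level_drop - pumped_out) > tolerance      # more fuel lost than pumped out

levels = [500.0, 497.0, 494.0, 491.0, 487.0]          # level falls faster than flows explain
flows = [1.0, 1.0, 1.0, 1.0, 1.0]
print(leak_suspected(levels, flows, dt_s=1.0, tolerance=5.0))   # True
```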
As we have seen, the monitor can track down the current functional state of the system and its subsystems, and thus determine the scope in which an event is applicable and should, therefore, be monitored. In practice and in the context of our experiments, this property meant a
significant reduction in the workload on the monitoring system. Beyond helping to minimise the workload of the monitor, this state sensitive monitoring mechanism has also helped in avoiding misinterpretations of the process feedback that often occur in complex and evolving contexts of operation. This in turn, prevented, we believe, a number of false alarms that may otherwise have been triggered if the monitor was unable to determine whether and when seemingly abnormal events become normal events and vice versa.
5 Conclusions
In this paper, we explored the idea of using a specification drawn from design models and safety analyses as a model for on-line safety monitoring. We proposed an automated monitor that can operate on such specifications to draw inferences about the possible presence of anomalies in the controlled process, the root causes of those anomalies, the functional effects of failures on the system, and corrective measures that minimise or remove those effects.

Our experiments demonstrated to some extent that the monitor can deliver those potentially useful monitoring functions. At the same time, they pointed out some weaknesses and limitations of the monitor. Firstly, they highlighted the vulnerability of the monitor to sensor failures and indicated the need for a generic sensor validation mechanism. Such a mechanism could be based on well-known techniques that exploit hardware replication or analytical redundancy. Secondly, we saw that, in circumstances of (unanticipated) dependent failures that require conflicting corrective procedures, the monitor treats those failures independently, with unpredictable and potentially hazardous consequences. We pointed out that the real problem in those circumstances lies in the failure of the model to predict the synchronous occurrence of those failures. This, however, highlights once more an important truth: the quality of monitoring and the correctness of the inferences drawn by the monitor are strongly contingent on the completeness and correctness of the model. The validation of the monitoring model, therefore, is clearly an area that creates scope for further research.

In summary, the monitoring concepts and algorithms proposed in this paper create opportunities for exploiting in real time an enormous amount of knowledge about the behaviour of the system that is typically derived during design and safety analysis. However, although in this paper we demonstrated that this is possible, the limitations revealed so far and insufficient project experience mean that substantial work will be required before a conclusive evaluation of the real value and scalability of this approach is possible.
Acknowledgements This study was partly supported by the IST project SETTA (IST contract number 10043). The author would like to thank John Firth, John McDermid (University of York), Christian Scheidler, Günter Heiner (Daimler Chrysler) and Matthias Maruhn (EADS Airbus) for supporting the development of those ideas at various stages of the work.
On Diversity, and the Elusiveness of Independence
Bev Littlewood
Centre for Software Reliability, City University, Northampton Square, London EC1V 0HB
[email protected]
1 Extended Abstract
Diversity, as a means of avoiding mistakes, is ubiquitous in human affairs. Whenever we invite someone else to check our work, we are taking advantage of the fact that they are different from us. In particular, we expect that their different outlook may allow them to see problems that we have missed. In this talk I shall look at the uses of diversity in systems dependability engineering.

In contrast to diversity, redundancy has been used in engineering from time immemorial to obtain dependability. Mathematical theories of reliability involving redundancy of components go back over half a century. But redundancy and diversity are not the same thing. Redundancy, involving the use of multiple copies of similar (‘identical’) components (e.g. in parallel), can be effective in protecting against random failures of hardware. In some cases, it is reasonable to believe that failures of such components will be statistically independent: in that case very elementary mathematics can show that systems of arbitrarily high reliability can be built from components of arbitrarily low reliability. In practice, assumptions of independence need to be treated with some scepticism, but redundancy can nevertheless still bring benefits in reliability.

What redundancy cannot protect against, of course, is the possibility of different components containing common failure modes, for example design defects which will show themselves on every component of a particular type whenever certain conditions arise. Whilst this problem has been familiar to reliability and safety engineers for decades, it became particularly acute when systems dependability began to depend heavily on the correct functioning of software. Clearly, there are no software reliability benefits to be gained by the use of simple redundancy, i.e. merely exact replication of a single program. Since software, unlike hardware, does not suffer from ‘random failures’ (in the jargon its failures are ‘systematic’), failures of identical copies will always be coincident. Design diversity, on the other hand (creating different versions using different teams and perhaps different methods) may be a good way of making software reliable, by providing some protection against the possibility of common design faults in different versions. Certainly, there is some industrial experience of design-diverse fault tolerant systems exhibiting high operational reliability (although the jury is out on the issue of whether this is the most cost-effective way of obtaining high reliability).
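The ‘very elementary mathematics’ referred to above can be sketched in a single line (the notation is ours, not part of the original abstract): for n nominally identical components in parallel, each failing on a demand with probability p, the independence assumption gives

P(\text{system fails}) = \prod_{i=1}^{n} P(\text{component } i \text{ fails}) = p^{n}, \qquad P(\text{system survives}) = 1 - p^{n} \rightarrow 1 \ \text{as } n \rightarrow \infty,

so arbitrarily high system reliability appears attainable from arbitrarily unreliable parts; the catch, as the rest of the abstract argues, is the independence assumption itself.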
What is clear, however, from both theoretical and experimental evidence, is that claims for statistical independence of diverse software versions¹ are not tenable. Instead, it is likely that two (or more) versions will show positive association in their failures. This means that the simple mathematics based on independence assumptions will be incorrect; indeed it will give dangerously optimistic answers. To assess the system reliability we need to estimate not just the version reliabilities, but the level of dependence between versions as well.

In recent years there has been considerable research into understanding the nature of this dependence. It has centred upon probabilistic models of variation of ‘difficulty’ across the demand (or input) space of software. The central idea is that different demands vary in their difficulty: to the human designer in providing a correct ‘solution’ that will become a part of a program, and thus eventually to the program when it executes. The earliest models assume that what is difficult for one programming team will be difficult for another. Thus we might expect that if program version A has failed on a particular demand, then this suggests the demand is a ‘difficult’ one, and so program version B becomes more likely to fail on that demand. Dependence between failures of design-diverse versions therefore arises as a result of variation of ‘difficulty’: the more variation there is, the greater the dependence.

This model, and subsequent refinements of it, go some way to providing a formal understanding of the relationship between the process of building diverse program versions and their subsequent failure behaviour. In particular, they shed new light on the different meanings of that over-used word ‘independence’. They show that even though two programs have been developed ‘independently’, they will not fail independently. The apparent goal of some early work on software fault tolerance, to appeal to ‘independence’ in order to claim high system reliability using software versions of modest reliability, as had been done for hardware, turns out to be illusory. On the other hand, the models tell us that diversity is nevertheless ‘a good thing’ in certain precise and formally expressed ways.

In this talk I shall briefly describe these models, and show that they can be used to model diversity in other dependability-related contexts. For example, the formalism can be used to model diverse methods of finding faults in a single program: it provides an understanding of the trade-off between ‘effectiveness’ and ‘diversity’ when different fault-finding methods are available (as is usually the case). I shall also speculate about their applicability to the use of diversity in reliability and safety cases: e.g. ‘independent’ argument legs; e.g. ‘independent’ V&V.

As will be clear from the above, much of the talk will concern the vexed question of ‘independence’. These models, for the most part, are bad news for seekers after independence. Are there novel ways in which we might seek, and make justifiable claims about, independence? Software is interesting here because of the possibility that it can be made ‘perfect’, i.e. fault-free and perfectly reliable, in certain circumstances. Such a claim for perfection is rather different from a claim for high reliability. Indeed, I might believe a claim that a program has a zero failure rate, whilst not believing a claim that another program has a failure rate of less than 10⁻⁹ per hour. The reason for my apparently paradoxical view is that the arguments here are very different. The claim for perfection might be based upon utter simplicity, a formal specification of the engineering requirements, and a formal verification of the program. The claim for better than 10⁻⁹ per hour, on the other hand, seems to accept the presence of faults (presumably because the program's complexity precludes claims of perfection), but nevertheless asserts that the faults will have an incredibly small effect. I shall discuss a rather speculative approach to design diversity in which independence may be believable between claims for fault-freeness and reliability.

¹ Whilst the language of software will be used here, these remarks about design faults apply equally well to dependability issues arising from design defects in any complex systems, including those that are just hardware-based.
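A minimal sketch of the difficulty-variation argument, in the spirit of the Eckhardt-Lee family of models (the notation and simplifications are ours, not the author's): let \theta(x) be the probability that a version produced by the development process fails on demand x, and let X be a randomly chosen demand, so that a single version fails on X with probability E[\theta(X)]. For two versions A and B developed ‘independently’,

P(A \text{ and } B \text{ fail on } X) = E\big[\theta(X)^{2}\big] = \big(E[\theta(X)]\big)^{2} + \operatorname{Var}\big[\theta(X)\big] \;\ge\; P(A \text{ fails}) \, P(B \text{ fails}),

with equality only when every demand is equally ‘difficult’. Any variation in difficulty therefore makes coincident failures more likely than the naive independence calculation suggests, which is exactly the positive association described above; refinements in which the two development processes have different difficulty functions replace the variance by a covariance, which can in principle even be negative.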
An Approach to a New Network Security Architecture for Academic Environments
MahdiReza Mohajerani and Ali Moeini
University of Tehran, No. 286, Keshavarz Blvd, 14166, Tehran, Iran
{mahdi, moeini}@ut.ac.ir
Abstract. The rapidly growing interconnectivity of IT systems, and the convergence of their technology, renders these systems increasingly vulnerable to malicious attacks. Universities and academic institutions also face concerns about the security of computing resources and information; however, traditional security architectures are not effective for academic or research environments. This paper presents an approach to a new security architecture for universities and academic centers. While still protecting information and computing resources behind a security perimeter, this architecture supports information dissemination and allows users to develop and test insecure software and protocols. We also propose a method for auditing the security policy, based on a fuzzy logic intrusion detection system, to check the network for possible violations.
1 Introduction
With the growth of IT systems, computer security is rapidly becoming a critical business concern. Security in computer networks is important so as to maintain reliable operation and to protect the integrity and privacy of stored information. Network attacks cause organizations several hours or days of downtime and serious breaches in data confidentiality and integrity. Depending on the level of the attack and the type of information that has been compromised, the consequences of network attacks vary in degree from mildly annoying to completely debilitating, and the cost of recovery from attacks can range from hundreds to millions of dollars [1][2]. One technological tool that is widely vulnerable to these threats is the Internet. Because of this, security has become one of the primary concerns when an organization connects its private network to the Internet: to prevent destruction of data by an intruder, maintain the privacy of local information, and prevent unauthorized use of computing resources. To provide the required level of protection, an organization needs to prevent unauthorized users from accessing resources on the private network and to protect against the unauthorized export of private information [3]. Even if an organization is not connected to the Internet, it may still want to establish an internal security policy to manage user access to portions of the network and protect sensitive or secret information. Academic centers, as one of the major
users of the Internet, also need security; however, because of their special structure and requirements, the traditional solutions and policies that limit access to the Internet are not effective for them. This paper presents the basic rules of a new security architecture for academic centers. The aim of this architecture is to support information dissemination and to allow users to develop and test insecure software and protocols, while protecting information and computing resources behind the security perimeters. To this end, Section 2 presents the general structure of a security architecture and policy. In Section 3, the differences between security solutions in corporate and academic environments are discussed. In Section 4, the proposed structure for the security policy and architecture of an academic center is presented. In Section 5, the results of evaluating the auditing of the security policy are shown. Section 6 concludes the paper.
2 Security Architecture and Policy
The objective of a network security architecture is to provide the conceptual design of the network security infrastructure, the related security mechanisms, and the related security policies and procedures. The security architecture links the components of the security infrastructure into one cohesive unit. The goal of this cohesive unit is to protect corporate information [4]. The security architecture should be developed by both the network design and the IT security teams. It is typically integrated into the existing enterprise network and is dependent on the IT services that are offered through the network infrastructure. The access and security requirements of each IT service should be defined before the network is divided into modules with clearly identified trust levels. Each module can be treated separately and assigned a different security model. The goal is to have layers of security so that a "successful" intruder's access is constrained to a limited part of the network. Just as the bulkhead design in a ship can contain a leak so that the entire ship does not sink, the layered security design limits the damage a security breach has on the health of the entire network. In addition, the architecture should define common security services to be implemented across the network.

Usually, the primary prerequisite for implementing network security, and the driver for the security design process, is the security policy. A security policy is a formal statement, supported by a company's highest levels of management, regarding the rules by which employees who have access to any corporate resource must abide. The security policy should address two main issues: the security requirements as driven by the business needs of the organization, and the implementation guidelines regarding the available technology. This policy covers senior management's directives to create a computer security program, establish its goals and assign responsibilities, as well as low-level technical security rules for particular systems [5]. After the key decisions have been made, the security architecture should be deployed in a phased format, addressing the most critical areas first. The most important security services are the services that enforce the security policy (such as perimeter security) and the services that audit the security policy (such as intrusion detection systems).
2.1 The Perimeter Security
Perimeter security solutions control access to critical network applications, data, and services so that only legitimate users and information can pass through the network. This access control is handled by routers and switches with access control lists (ACLs) and by dedicated firewall appliances. A firewall is a safeguard one can use to control access between a trusted network and a less trusted one [6]. A firewall is not a single component; it is a strategy comprising a system or a group of systems that enforces a security policy between two networks (in our case, between the organization's network and the Internet). For the firewall to be effective, all traffic to and from the Internet must pass through the firewall [6].

A firewall normally includes mechanisms for protection at the network layer, the transport layer and the application layer. At the network layer, IP packets are routed according to predefined rules. Firewalls can usually translate internal IP addresses to valid Internet IP addresses (NAT, or Network Address Translation). They can also replace all internal addresses with the firewall address (also called Port Address Translation). At the transport layer, access to TCP and UDP ports can be granted or blocked, depending on the IP addresses of both sender and receiver. This allows access control for many TCP services, but does not work at all for others. At the application layer, proxy servers (also called application gateways) can be used to accept requests for a particular application and either forward the request to the final destination or block it. Ideally, proxies should be transparent to the end user. Proxies are stripped-down, reliable versions of standard applications with access control and forwarding built in. Typical proxies include HTTP (for WWW), telnet, ftp, etc. [7].

Firewalls can be configured in a number of different architectures, providing various levels of security at different costs of installation and operation [12]. The simplest firewall architecture is the Basic Filter Architecture (screening router), the cheapest (and least secure) setup, which uses a router (able to filter inbound and outbound packets on each interface) to screen access to one or more internal servers. At the other end there is the DMZ Architecture, an extension of the screened host architecture. The classical firewall setup is a packet filter between the outside and a "semi-secure" or De-Militarized Zone (DMZ) subnet where the proxies lie (this allows the outside only restricted access to services in the DMZ). The DMZ is further separated from the internal network by another packet filter, which only allows connections to/from the proxies. Organizations should match their risk profile to the type of firewall architecture selected. Usually, one of the above architectures or a composition of them is selected.

2.2 The Intrusion Detection System
An Intrusion Detection System (IDS) is a computer program that attempts to perform intrusion detection by either misuse or anomaly detection, or a combination of the two techniques. An IDS should preferably perform its task in real time. IDSs are usually classified as host-based or network-based. Host-based systems base their decisions on information obtained from a single host (usually audit trails), while network-based systems obtain data by monitoring the flow of information in the network to which the hosts are connected. Notice that the definition of an IDS does not include preventing the intrusion from occurring, only detecting it and reporting it to an operator.
3 Network Security in Academic Environments
Most corporate environments have deployed firewalls to block (or heavily restrict) access to internal data and computing resources from untrusted hosts, and to limit access to untrusted hosts from inside. A typical corporate firewall is a strong security perimeter around the employees who collaborate within the corporation. Academic institutions also face concerns about the security of computing resources and information. The security problems in these environments fall into two categories: problems with research information and problems with administrative information. Research groups often need to maintain the privacy of their work, ideas for future research, or results of research in progress. Administrative organizations need to prevent leakage of student grades, personal contact information, and faculty and staff personnel records. Moreover, the cost of security compromises is high. A research group could lose its competitive edge, and administrative organizations could face legal proceedings for unauthorized information release. On the other hand, academic and research institutions are ideal environments for hackers and intruders; many of them are physically located at these sites and are highly motivated to access and modify grades and other information. There are several reports of break-ins and deletion of data from educational institutions [8].

Although corporate and academic environments face common security problems, they cannot choose similar methods to solve them, because of their different structures. In a corporate environment, the natural place to draw a security perimeter is around the corporation itself. However, in an academic environment, it is very difficult to draw a perimeter surrounding all of the people who need to access information resources, and only those people. This is mainly because of the different types of information resources in these environments and also the different users who want to access them. So if the firewall perimeter is chosen too big it includes untrusted people, and if it is chosen too small it excludes some of the authorized people [9]. In addition, corporations can put serious limitations on Internet connectivity in the name of security, but research organizations simply cannot function under such limitations. First, trusted users need unrestricted and transparent access to Internet resources (including World-Wide-Web, FTP, Gopher, electronic mail, etc.) located outside the firewall. Researchers rely on fingertip access to on-line library catalogs and bibliographies, preprints of papers, and other network resources supporting collaborative work. Second, trusted users need the unrestricted ability to publish and disseminate information to people outside the firewall via anonymous FTP, World-Wide-Web, etc. This dissemination of research results, papers, etc. is critical to the research community. Third, the firewall must allow access to protected resources from trusted users located outside the firewall. An increasing number of users work at home or while traveling. Research collaborators may also need to enter the firewall from remote hosts [8]. Consequently, traditional firewalls do not meet the requirements of academic environments.
4 The Perimeter Security Architecture
A high percentage of security efforts within organizations relies exclusively on perimeter network access controls. Perimeter security protects a network by controlling access to all entry and exit points. As mentioned before, traditional solutions cannot meet the requirements of academic environments because of the special needs of these institutions. One solution is to design different layers and zones in the firewall. Based on the resources, and the people who want to access those resources, the firewall can have different zones. This helps us to solve one of the basic problems of academic environments, which is information dissemination. There are three categories of information in a university:

• The information that is officially disseminated by the university (such as news and events, articles and ...)
• The information that is gathered and used by network users.
• The information that is not allowed to be disseminated publicly.
Based on the above categories, three types of servers may be proposed in the university: public servers, which are used to support information dissemination; experimental servers, which are used by researchers and students to develop and test their own software and protocols; and trusted servers, which are used for administrative purposes or for keeping confidential information. The other requirement of an academic environment is to let its trusted members access the resources of the network from outside the firewall (for example from home or while traveling). Another problem that causes serious trouble for the university is network viruses. These viruses are distributed through the network after users access particular sites. Proxy servers can be used to control this problem. Of course, these proxy servers should be transparent. To achieve those goals, the proposed network security policy was designed based on six basic rules:

i. Packets to or from the public servers are unrestricted if they use authorized ports. An authorized port is the port on which the particular service is offered. Of course, each public server should also be protected itself; server-level security means enforcing stronger access controls at that level.
ii. Packets to or from the experimental servers are unrestricted. These servers can be located outside the firewall perimeter.
iii. Packets to or from the authorized ports of trusted servers are allowed only from or to the authorized clients inside the firewall.
iv. All outgoing packets are allowed to travel outside after port address translation. Incoming packets are allowed if they can be determined to be responses to outbound requests.
v. Packets to or from trusted users on hosts outside the firewall are allowed.
vi. All requests from particular applications, such as HTTP, should be passed through the proxy server.
Rule i is based on our need to support information dissemination in a research environment. We have to separate the public servers from our trusted hosts, protect them at the server level, and accept the fact that they may be compromised; therefore we should have a plan to recover them from information kept securely behind the firewall. Rule ii follows from our recognition that researchers and students sometimes need to develop and test insecure software and protocols on the Internet. Of course, they should be alerted that their server is not secure and that their information may be corrupted. Rule iii is based on the fact that we want to protect the confidential information. These servers are our most important resources to be protected, and we put them in a special secure zone. Rule iv follows from our recognition that open network access is a necessary component of a research environment. On the other hand, we do not want to allow users to set up Internet servers without permission. The address translation prevents outside systems from accessing internal resources, except the ones that are listed as public servers. Rule v grants access to protected resources to users as they work from home or while traveling, as well as to collaborators located outside the research group. Rule vi is based on the need to block some sites on the Internet which contain viruses. This security policy addresses the needs of academic environments, and indeed the needs of many corporate environments; a sketch of how these rules might be encoded is given below.

In the above policy the experimental servers are different from the others. Because they involve interactions with unauthenticated hosts and users, they pose considerable security risks. Since they are used without restrictions, various types of services are presented on them. So these servers should be located outside the firewall, and physical separation should be created between them and other servers. These servers can be recovered periodically to make sure that they have not been hacked. The public servers are also vulnerable to threats, because these servers also involve interactions with public users and hosts. But the difference is that only restricted services are presented on them, so they can be protected more easily and the internal users can be kept off these servers. However, since we assume that the servers may get corrupted, each server should be restored automatically and regularly from uncompromised sources.

The firewall enforces the security policy between the internal network and the Internet, so the firewall structure is designed based on the security policy. According to our security policy, our firewall structure is a combination of the structures mentioned in Section 2. We have a router for the connection to the Internet, which can be used as a screening router. The most important part of the firewall is the bastion host. This is where our filtering rules and the different zones of the firewall are defined. Our proxy server is also installed on it. The bastion host can be a single-purpose hardware device or a PC with a normal operating system such as Linux. The proposed firewall architecture is shown in Fig. 1.
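As an illustration only, the intent of rules i-vi could be captured along the following lines (a hypothetical Python sketch; the addresses, port numbers and zone names are invented, and a real deployment would express the policy in the filtering language of the screening router and bastion host rather than in application code):

# Illustrative zone membership and authorized ports (hypothetical values; a real
# policy would use address ranges rather than single representative hosts).
PUBLIC_SERVERS       = {"192.0.2.10"}        # rule i
EXPERIMENTAL_SERVERS = {"192.0.2.200"}       # rule ii (outside the perimeter)
TRUSTED_SERVERS      = {"10.0.0.5"}          # rule iii
TRUSTED_CLIENTS_IN   = {"10.0.1.20"}         # authorized internal clients
TRUSTED_USERS_OUT    = {"198.51.100.7"}      # rule v: trusted users working remotely
PUBLIC_PORTS  = {80, 443, 25}                # services offered by the public servers
TRUSTED_PORTS = {443}                        # services offered by the trusted servers
PROXIED_PORTS = {80}                         # rule vi: HTTP must pass through the proxy

def decide(src, dst, dport, direction, is_reply_to_outbound=False):
    """Return 'accept', 'proxy' or 'drop' for a single packet (sketch only)."""
    # Rule vi: particular applications (e.g. HTTP) go through the transparent proxy.
    if direction == "out" and dport in PROXIED_PORTS:
        return "proxy"
    # Rule ii: experimental servers are unrestricted.
    if dst in EXPERIMENTAL_SERVERS or src in EXPERIMENTAL_SERVERS:
        return "accept"
    # Rule v: trusted users located outside the firewall may enter the perimeter.
    if src in TRUSTED_USERS_OUT or dst in TRUSTED_USERS_OUT:
        return "accept"
    # Rule i: public servers are reachable on their authorized ports only
    # (inbound requests shown; their replies are covered by rule iv).
    if dst in PUBLIC_SERVERS:
        return "accept" if dport in PUBLIC_PORTS else "drop"
    # Rule iii: trusted servers accept authorized ports from internal clients only.
    if dst in TRUSTED_SERVERS:
        return "accept" if dport in TRUSTED_PORTS and src in TRUSTED_CLIENTS_IN else "drop"
    # Rule iv: everything outbound leaves after port address translation; inbound
    # packets are accepted only when they match an earlier outbound request.
    if direction == "out":
        return "accept"
    return "accept" if is_reply_to_outbound else "drop"

In practice these decisions are spread over the screening router, the bastion host's filtering rules and the proxy; the function above is only meant to make the precedence of the rules explicit.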
Fig. 1. The proposed perimeter security architecture (screening router, firewall, security perimeter and trusted zone connections, with public, trusted and experimental servers, and trusted and untrusted users)
5 Auditing the Security Policy
After implementing the perimeter security based on the proposed security policy, it is important to check the network for possible violations of the policy. The key point in our network security policy is the type of the servers. The auditing system should therefore be designed to detect unauthorized servers in the network, so that they can be blocked by the firewall. A fuzzy recognition engine (based on FIRE, developed by J. E. Dickerson et al. [14]) is utilized for this purpose, in combination with a normal detector, to detect the unauthorized servers. The detection system is based on the autonomous agents developed at Purdue by Zamboni et al. [11], using independent entities working collectively. There are three distinct independent components in the architecture, called agents, transceivers and monitors. They each play different roles as part of the system, but their function can be altered only by the operating system, not by other processes; thus they are autonomous. An agent monitors the processes of a host and reports abnormal behavior to a transceiver. It can communicate with another local agent only through a transceiver. A transceiver controls local agents and acts as the external communication tool between a host and a monitor. A transceiver can perform appropriate data processing and report to the monitors or other agents. A monitor is similar to a transceiver but it
also controls entities in several hosts. Monitors combine higher-level reports, correlate data, and send alarms or reports to the User Interface [11][14]. The fuzzy recognition engine is used as the correlation engine for the intrusion detection system (Fig. 2). Several agents can be developed under this system. Since choosing the best data elements to monitor in the network stream is critical to the effectiveness of the detection system, and in order to conserve storage space, the system records only information about the services used, the length of the connection, the type of the connection, the source and the destination. In order to identify an unauthorized server, we should identify unusual service ports in use on the network, unusual numbers of connections from foreign hosts, and unusual amounts of network traffic load to/from a host on the network. The main difference between our work and FIRE is that we use a normal detector beside the fuzzy system [10]. While the normal detector identifies the unusual service ports, the fuzzy recognition system identifies the unusual numbers of connections from foreign hosts and unusual amounts of network traffic load to/from a host on the network. The combination of the results of the two systems detects the unauthorized server. To test this fuzzy system, we gathered data for 10 days. The results show that using this system, 93% of the unauthorized servers were detected [13].
Fig. 2. The Intrusion Detection System
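To make the combination of the normal (crisp) detector and the fuzzy correlation stage concrete, here is a hypothetical sketch in Python; the membership functions, thresholds and feature names are ours and are not taken from FIRE or from the authors' implementation.

AUTHORIZED_PORTS = {22, 25, 80, 443}    # ports on which this host is allowed to serve

def ramp(x, low, high):
    # Piecewise-linear fuzzy membership for "unusually high":
    # 0 below `low`, 1 above `high`, linear in between.
    if x <= low:
        return 0.0
    if x >= high:
        return 1.0
    return (x - low) / (high - low)

def normal_detector(host):
    # Crisp check: is the host offering a service on a port the policy does not allow?
    return any(p not in AUTHORIZED_PORTS for p in host["listening_ports"])

def fuzzy_score(host):
    # Fuzzy check on the remaining features: number of connections from foreign
    # hosts and network traffic load, each graded as "unusually high".
    high_conns = ramp(host["foreign_connections"], low=20, high=500)
    high_load  = ramp(host["traffic_mbytes"],      low=50, high=2000)
    return min(high_conns, high_load)     # simple fuzzy AND of the two memberships

def unauthorized_server(host, threshold=0.5):
    # A host is reported as an unauthorized server when the crisp detector fires
    # and the fuzzy evidence agrees.
    return normal_detector(host) and fuzzy_score(host) >= threshold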
6 Conclusions
In this paper, the proposed security policy and firewall architecture for an academic center were presented. This firewall meets the needs of academic environments. The firewall is based on six security policy rules and is largely transparent to trusted users, and therefore retains the sense of "openness" critical in a research environment. This transparency and perceived openness actually increase security by eliminating the temptation for users to bypass our security mechanisms. In designing the firewall, we recognized that network security for research institutions is a problem in its own right and that traditional corporate firewalls impose excessive restrictions. We therefore categorized the university's information and, based on that, designed the layers and zones of the firewall. In addition, each server inside or outside the firewall
should have its own server-level security. We also proposed an auditing system based on fuzzy logic recognition to detect violations of our security policy.
Acknowledgements The Informatics Center of the University of Tehran (UTIC) has been studying the appropriate structure of the network security policy and the firewall of the university since 1997 [15], and this paper is the result of that research. We would like to thank UTIC for providing the testing environment for this project.
References
1. Ramachandran, J.: Designing Security Architecture Solutions, John Wiley and Sons (2002)
2. Benjamin, R., Gladman, B.: Protecting IT Systems from Cyber Crime, The Computer Journal, Vol. 4, No. 7 (1998)
3. Goncalves, M.: Firewall Complete, McGraw-Hill (1999)
4. Arconati, N.: One Approach to Enterprise Security Architecture, SANS Institute, http://www.sans.org (2002)
5. Cisco Systems: Network Security: An Executive Overview, http://www.cisco.com/warp/public/cc/so/neso/sqso/netsp_pl.htm
6. Semeria, C.: Internet Firewalls and Security, 3Com Technical Paper (1996)
7. Boran Consulting: The IT Security Cookbook, http://secinf.net/info/misc/boran (1999)
8. Greenwald, M., et al.: Designing an Academic Firewall: Policy, Practice and Experience with SURF, in Proc. of the 1996 Symposium on Network and Distributed Systems Security, IEEE (1996)
9. Nelson, T.: Firewall Benchmarking with firebench II, Exodus Performance Lab (1998)
10. Molitor, A.: Measuring Firewall Performance, http://web.ranum.com/pubs/fwperf/molitor.htm (1999)
11. Zamboni, D.: An Architecture for Intrusion Detection using Autonomous Agents, COAST Technical Report 98/05, COAST Lab, Purdue University (1998)
12. Guttman, B., Bagwill, R.: Implementing Internet Firewall Security Policy, NIST Special Publication (1998)
13. Mohajerani, M. R., Moeini, A.: Designing an Intelligent Intrusion Detection System, Internal Technical Report (in Persian), University of Tehran Informatics and Statistics Center (2001)
14. Dickerson, J. E., Juslin, J., Koukousoula, O., Dickerson, J. A.: Fuzzy Intrusion Detection, Proc. of the 20th NAFIPS International Conference (2001)
15. Mohajerani, M. R.: To Design the Network Security Policy of the University of Tehran, Internal Technical Report (in Persian), University of Tehran Informatics and Statistics Center (2000)
A Watchdog Processor Architecture with Minimal Performance Overhead
Francisco Rodríguez, José Carlos Campelo, and Juan José Serrano
Grupo de Sistemas Tolerantes a Fallos - Fault Tolerant Systems Group
Departamento de Informática de Sistemas y Computadoras
Universidad Politécnica de Valencia, 46022-Valencia, Spain
{prodrig,jcampelo,jserrano}@disca.upv.es
http://www.disca.upv.es/gstf
Abstract. Control flow monitoring using a watchdog processor is a well-known technique to increase the dependability of a microprocessor system. Most approaches embed reference signatures for the watchdog processor into the processor instruction stream, creating noticeable memory and performance overheads. A novel watchdog processor architecture using embedded signatures is presented that minimizes the memory overhead and nullifies the performance penalty on the main processor without sacrificing error detection coverage or latency. This scheme is called Interleaved Signature Instruction Stream (ISIS) in order to reflect the fact that signatures and main instructions are two independent streams that co-exist in the system.
1 Introduction
In the “Model for the Future” foreseen by Avizienis in [1], the urgent need to incorporate dependability into everyday computing is clear: “Yet, it is alarming to observe that the explosive growth of complexity, speed, and performance of single-chip processors has not been paralleled by the inclusion of more on-chip error detection and recovery features”. Efficient error detection is of fundamental importance in dependable computing systems. As the vast majority of faults are transient, the use of a concurrent Error Detection Mechanism (EDM) is of utmost interest, as high coverage and low detection latency are needed to recover the system from the error. And as experiments demonstrate [2, 3, 4, 5], a high percentage of non-overwritten errors results in control flow errors.

Siewiorek states in [6] that “To succeed in the commodity market, fault-tolerant techniques need to be sought which will be transparent to end users”. A fault-tolerant technique can be considered transparent only if it results in minimal overhead in silicon, memory size or processor speed. Although redundant systems can achieve the best degree of fault tolerance, the high overheads implied limit their applicability in everyday computing elements.
The work presented here provides concurrent detection of control flow errors with no performance penalty and minimal memory and silicon overheads. No modifications are needed in the instruction set of the processor used as a testbed, and the architectural ones are so small that they can be enabled and disabled under software control to allow binary compatibility with existing software. The watchdog processor is very simple, and its design can be applied to other processors as well. The paper is structured as follows: the next section presents a set of basic definitions and is followed by an outline of related work in the literature. Section 4 presents the system architecture in which the watchdog is embedded. Section 5 discusses error detection capabilities, signature characteristics and placement, and the modifications needed to the original architecture of the processor. A memory overhead comparison with similar work is performed afterwards, to finish with the conclusions.
2 Basic Definitions
The following definitions are taken from [5]:

1. A branch instruction is an instruction that can break the sequential flow of execution, like a procedure call, a conditional jump or a return-from-procedure instruction.
2. A branch-in point is an instruction used as the destination of a branch instruction, or the entry point of, for example, an interrupt handler.
3. A program is partitioned into branch-free intervals and branch instructions. The beginning of a branch-free interval is a branch-in instruction or the instruction following a branch. A branch-free interval is ended by a branch or a branch-in instruction.
4. A basic block is only a branch-free interval if it is ended by a branch-in. It is the branch-free interval and its following branch instruction otherwise.

With the definitions above, a program can be represented by a Control Flow Graph (CFG). Vertices in this graph are used to represent basic blocks and directed arcs are used to represent legal paths between blocks. Figure 1 shows some examples for simple High Level Language constructs.

Fig. 1. CFGs for some HLL constructs (if-then-else and switch examples, showing taken and not-taken conditional branches, a fall-through block, a multiple fan-out block and a multiple fan-in block)

We call block fall-through the situation where two basic blocks are separated with no branch-out instruction in between. The blocks are divided only because the first branch-free interval is ended by a following branch-in instruction that starts the second block. In [7] a block that receives more than two transfers of control flow is said to be a branch fan-in block. We distinguish whether the control flow transfer is due to a non-taken conditional branch (that is, both blocks are contiguous in memory) and say that a multiple fan-in block is reachable from more than one out-of-sequence vertex in the CFG. A branch instruction with more than one out-of-sequence target is represented in the CFG by two or more arcs departing from the same vertex, where at least two of them are targeted to out-of-sequence vertices. These are said to be multiple fan-out blocks.

A derived signature is a value assigned to each instruction block. The term derived means the signature is not an arbitrarily assigned value but is calculated from the block's instructions. Derived signatures are usually obtained by XORing the instruction opcodes or by using such opcodes to feed a Linear Feedback Shift Register (LFSR). These values are calculated at compile time and used as references by the EDM to verify the correctness of the executed instructions. If signatures are interspersed or hashed with the processor instructions, the method is generally known as Embedded Signature Monitoring (ESM).

A watchdog processor is a hardware EDM used to detect Control Flow Errors (CFE) and/or corruption of the instructions executed by the processor, usually employing derived signatures and an ESM technique. In this case it performs signature calculations from the instruction opcodes that are actually executed by the main processor, checking these run-time values against their references. If any difference is found, the error in the main processor instruction stream is detected and an Error Recovery Mechanism (ERM) is activated.
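By way of illustration, a derived signature of the LFSR kind could be computed as follows (a generic Python sketch; the width matches the 16-bit signatures used later in the paper, but the polynomial, seed and bit ordering are our own arbitrary choices, not those of ISIS):

def lfsr16_signature(opcodes, poly=0x8005, seed=0xFFFF):
    # Compress a block's 32-bit instruction opcodes into a 16-bit signature by
    # shifting them, bit by bit, through a CRC-style linear feedback shift register.
    # The compiler computes this reference value off-line; the watchdog recomputes
    # it from the opcodes actually retired by the pipeline and compares the two.
    sig = seed
    for op in opcodes:
        for i in range(32):
            in_bit = (op >> (31 - i)) & 1
            feedback = ((sig >> 15) & 1) ^ in_bit
            sig = (sig << 1) & 0xFFFF
            if feedback:
                sig ^= poly
    return sig

block = [0x27BDFFE8, 0xAFBF0014, 0x0C100010, 0x00000000]   # arbitrary example opcodes
reference = lfsr16_signature(block)                         # embedded at compile time
assert reference == lfsr16_signature(block)                 # run-time recomputation matches
assert reference != lfsr16_signature([0x27BDFFE8, 0xAFBF0014, 0x0C100011, 0x00000000])

The last line shows that a corrupted opcode produces a different run-time signature and is therefore detected when the watchdog compares it against the embedded reference.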
3 Related Work
Several hardware approaches using a watchdog processor and derived signatures for concurrent error detection have been proposed. The most relevant works are outlined below: Ohlsson et al. present in [5] a watchdog processor built into a RISC processor. A specialized tst instruction is inserted in the delay slot of every branch instruction, testing the signature of the preceding block. An instruction counter is also used to time-out an instruction sequence when a branch instruction is not executed in the specified range. Other watchdog supporting instructions are added to the processor instruction set to save and restore the value of the instruction counter on procedure calls. The watchdog processor used by Galla et al. in [8] to verify correct execution of a communications controller of the Time-Triggered Architecture uses a similar
approach. A check instruction is inserted at appropriate places to trigger the checking process with the reference signature that is stored in the subsequent word. In the case of a branch, the branch delay slot is used to insert an adjustment value for the signature to ensure that the run-time signature is the same at the check instruction independent of the path followed. An instruction counter is also used by the watchdog. The counter is loaded during the check instruction and decremented for every instruction executed; a time-out is issued if the counter reaches zero before a new check instruction is executed. Due to the nature of the communications architecture, no interrupts are processed by the controller. Thus saving the run-time signature or instruction counter is not necessary.

The ERC32 is a SPARC processor augmented with parity bits and a program flow control mechanism presented by Gaisler in [9]. In the ERC32, a test instruction to verify the processor control flow is also inserted in the delay slot of every branch to verify the instruction bits of the preceding block. In his work, the test instruction is a slightly modified version of the original nop instruction and no other modifications to the instruction set are needed.

A different error detection mechanism is presented by Kim and Somani in [10]. The decoded signals for the pipeline control are checked on a per-instruction basis and their references are retrieved from a watchdog private cache. If the run-time signature of a given instruction cannot be checked because its reference counterpart is not found in the cache, it is stored in the cache and used as the reference for future executions. No signatures or program modifications are needed because reference signatures are generated at run-time, thus creating no overhead. The drawback of this approach is that the watchdog processor cannot check all instructions. An instruction can be checked only if it has been previously executed and only if its reference has not been displaced from the watchdog private cache to store other signatures. Although the error is detected before the instruction is committed and no overheads are created, the error coverage is poor.

More recently, hardware additions to modern processor architectures have been proposed to re-execute instructions and perform a comparison to verify that no errors have been produced before instructions are committed. Some of these proposals are outlined below for the sake of completeness, but they are out of the scope of this work because: i) hardware additions, spare components and/or time redundancy are used to detect all possible errors by re-execution of all instructions, so not only errors in the instruction bits or execution flow are detected, but data errors as well; ii) they require either a complete redesign of the processor control unit or the addition of a complete execution unit capable of carrying out the same set of instructions as the main processor, although its control unit can be simpler. These include, to name a few:

– REESE (Nickel and Somani, [11]) and AR-SMT (Rotenberg, [12]). Both works take advantage of the simultaneous multi-threading architecture to execute every instruction twice. The instructions of the first thread, along with their operands and results, are stored in a queue (a delay buffer in
Rotenberg's work) and re-executed. Results of both executions are compared before the instructions are committed.
– The microprocessor design approach of Weaver and Austin in [13] to achieve fault tolerance is the substitution of the commitment stage of a pipeline processor with a checker processor. Instructions, along with their inputs, addresses and the results obtained, are passed to the checker processor, where instructions are re-executed and results can be verified before they are committed.
– The O3RS design of Mendelson and Suri in [14] and a modified multiscalar architecture used by Rashid et al. in [15] use spare components in a processor capable of issuing more than one instruction per cycle to re-execute instructions.
4 System Architecture
The system (see Fig. 2) is built around a soft core of a MIPS R3000 processor clone developed in synthesizable VHDL [16]. It is a 4-stage pipelined RISC processor running the MIPS-I and MIPS-II Instruction Set Architecture [17]. Instruction and data bus interfaces are designed as AMBA AHB bus masters providing external memory access. This processor is provided with a Memory Management Unit (MMU) inside the System Control Coprocessor (CP0 in the MIPS nomenclature) to perform virtual-to-physical address mapping, to isolate memory areas of different processes and to check correct alignment of memory references. To minimize the performance penalty, the instruction cache is designed with two read ports that can provide two instructions simultaneously, one for each processor. On a cache hit, no interference exists even if the other processor is waiting for a cache line refill because of a cache miss.
Fig. 2. System architecture: the R3000 processor core (with its System Control Coprocessor and TLB) and the watchdog processor share the instruction cache, the watchdog being fed with the retired instructions; the instruction, data and watchdog signature paths reach the external memory through AMBA AHB bus masters, the AHB arbiter and the external bus interface.
To reduce the instruction cache complexity, a single write port is provided that must be shared by both processors. When simultaneous cache misses happen, cache refills are served in a First-Come First-Served fashion. If they happen in the same clock cycle, the main processor is given priority. This arrangement takes advantage of spatial and temporal locality in the application program to increase cache hits for signatures. As we use an ESM technique and signatures are interleaved with processor instructions, when both processors produce a cache miss they request the same memory block most of the time, as both reference words in the same program area.

No modification is needed in the processor instruction set, due to the fact that signature instructions are neither fetched nor executed by the main processor. This allows us to maintain binary compatibility with existing software. If access to the source code is not possible, the program can be run without modification (and no concurrent flow error detection capability will be provided). This is possible because the watchdog processor and the processor's modified architecture can be enabled and disabled under software control running with superuser privileges. If these features are disabled, our processor behaves as an off-the-shelf MIPS processor. Thus, if binary compatibility is needed for a given task, these features must be disabled by the OS every time the task resumes execution.

The watchdog processor is fed with the instructions from the main processor pipeline as they are retired. When these instructions enter the watchdog, the run-time signatures and address bits are calculated at the same rate as the instructions arrive. When a block ends, these values are stored in a FIFO memory to decouple the signature checking process. This FIFO allows a large set of instructions to be retired from the pipeline while the watchdog is waiting for a cache refill in order to get a reference signature instruction. In a similar way, the FIFO can be emptied by the watchdog while the main processor pipeline is stalled due to a memory operation. When this memory is full, the pipeline is forced to wait for the watchdog checking process to read some data from the FIFO.
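A purely behavioural sketch of the decoupling FIFO just described (Python; the depth and the interface are invented for illustration, the real mechanism is hardware inside the watchdog):

from collections import deque

class WatchdogFifo:
    # Decouples the main pipeline (producer of per-block run-time values) from the
    # watchdog checking process (consumer, which may be waiting for a cache refill
    # to obtain the reference signature).
    def __init__(self, depth=8):               # depth is an illustrative value only
        self.depth = depth
        self.entries = deque()

    def pipeline_push(self, block_values):
        # Called when a block's last instruction retires.  Returns False when the
        # FIFO is full, i.e. the pipeline must stall until the watchdog catches up.
        if len(self.entries) >= self.depth:
            return False                        # back-pressure: stall the pipeline
        self.entries.append(block_values)
        return True

    def watchdog_pop(self):
        # Called by the watchdog once the reference signature is available.
        return self.entries.popleft() if self.entries else None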
5 Interleaved Signature Instruction Stream
Block signatures are placed at the beginning of every basic block in our scheme. These reference signatures are used by the watchdog processor only and are not processed in any way by the main processor. Two completely independent, interleaved instruction streams coexist in our system: the application instruction stream, which is divided into blocks and executed by the main processor, and the signature stream, used by the watchdog processor. We have called our technique Interleaved Signature Instruction Stream (ISIS) due to this fact. The signature word (see Fig. 3 for a field description) provides enough information to the watchdog processor to check the following block properties:

1. Block Length. The watchdog processor checks a block's signature when the last instruction of the block is retired from the processor pipeline. Instead of relying on the branch instruction at the end of the block to perform the signature checking, the watchdog counts the instructions as they are retired. In this way, the watchdog can anticipate when the last instruction comes and detect a CFE if a branch occurs too early or too late.
2. Block Signature. The block instructions are compacted using a 16-bit LFSR that will be used by the watchdog to verify that the correct instructions have been retired from the processor pipeline.
3. Block Target Address. In the case of a non multiple fan-out block with a target address that can be determined at compile time, a 3-bit signature is computed from the address difference between the branch and the out-of-sequence target instruction. These parity bits are used at run-time to provide some confidence that the instruction reached after the branch is the correct one.
4. Block Origin Address. When the branch of a multiple fan-out block is executed, the watchdog cannot check all possible destinations, even if they are obtainable at compile time. In our scheme, every possible destination block is provided with a 3-bit signature of the address difference between the originating branch and the start of the block, much the same as the previous Block Target Address check. Thus, instead of checking that the target instruction is the correct one, the watchdog processor checks (at the target block) that the originating branch is the correct one in this case.

The signature instruction encoding has been designed in such a way that a main processor instruction cannot be misinterpreted as a watchdog signature instruction. This provides an additional check when a branch instruction is executed by the main processor: a signature instruction must be found immediately preceding the first instruction of every block. This also helps to detect a CFE if a branch erroneously reaches a signature instruction, because the encoding used will force an illegal instruction exception to be raised. Furthermore, the block type helps the watchdog processor to check whether the execution flow is correct. For example, in the case of a multiple fan-out block, the block type reflects the need to check the address signature at the target block. Even if an incorrect branch is taken to the initial instruction of a block, the target's signature instruction must have coded into its type that it is a block where the origin address must be checked, or a CFE exception will be raised.

Instructions in the MIPS processor must be placed at word boundaries; a memory alignment exception is raised if this requirement is not met. Taking advantage of this mechanism, the watchdog processor computes address differences as 30-bit values.
Fig. 3. Block signature encoding (the block signature instruction contains an Opcode field, a Type field, 3-bit Block Target Add and Block Origin Add fields, a Length field, and a 16-bit Signature field)
Fig. 4. Example of an address checking uncovered case (a shared block reached by branches from two multiple fan-out blocks, one branch address-protected and the other address-unprotected)
as 30-bit values. Given that the branch instruction type used most of the time by the compiler uses a 16-bit offset to reach the target instruction, the differences obtained at run-time for the Block Target Address and Block Origin Address checks are usually half empty, so every parity bit protects 5 such bits (10 in the worst case). To our knowledge, Block Origin Address checking has never been proposed in the literature. The solutions offered so far to manage jumps with multiple targets use justifying signatures (see [7] for an example) to patch the run-time signature and delay the check until a common branch-in point is encountered, increasing the error detection latency. Not all jumps can be covered with address checking, however: neither jumps with run-time computed addresses nor jumps to a multiple fan-in block that is shared by several multiple fan-out blocks (see Fig. 4 for an example). In the latter case, one address signature per origin would be needed in the fan-in block in order to maintain the address checking process, which is not possible. Currently, only Block Origin Address checks from non-multiple fan-out blocks can be covered for such shared blocks.
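To make the run-time checks concrete, the following C fragment sketches, in software, the kind of per-block computation the watchdog performs: retired opcodes are compacted with a 16-bit LFSR and the 30-bit word-address difference of a branch is folded into a 3-bit parity signature. The LFSR feedback taps, the parity grouping and the function names are illustrative assumptions, not a description of the actual hardware.

#include <stdint.h>

/* Illustrative 16-bit LFSR step (feedback polynomial chosen arbitrarily);
   the real watchdog compacts each retired opcode into such a register. */
static uint16_t lfsr_step(uint16_t sig, uint32_t opcode)
{
    sig ^= (uint16_t)(opcode ^ (opcode >> 16));
    for (int i = 0; i < 16; i++) {
        uint16_t fb = sig & 1u;
        sig >>= 1;
        if (fb)
            sig ^= 0xB400u;            /* assumed feedback taps */
    }
    return sig;
}

/* Fold a 30-bit word-address difference into a 3-bit signature:
   each of the three parity bits covers 10 of the 30 bits
   (MIPS instructions are word aligned, hence the >> 2). */
static uint8_t addr_diff_parity(uint32_t branch_addr, uint32_t target_addr)
{
    uint32_t diff = ((target_addr - branch_addr) >> 2) & 0x3FFFFFFFu;
    uint8_t sig = 0;
    for (int bit = 0; bit < 30; bit++)
        sig ^= (uint8_t)(((diff >> bit) & 1u) << (bit % 3));
    return sig;                         /* compared against the 3-bit field */
}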
5.1
Processor Architecture Modifications
Isolating the reference signatures from the instructions fed into the processor pipeline results in a minimal performance overhead in the application program. Slight architectural modifications are needed in the main processor in order to achieve this. First of all, when a conditional branch instruction ends a basic block, a second block follows immediately. The second block's signature sits between them, and the main processor must skip it. In order to effectively jump over the signature, the signature size is added to the Program Counter if the branch is not taken. In the same way, when a procedure call instruction ends a basic block, the block to be executed after the procedure returns immediately follows the first one. Again, the second block's signature must be taken into account when
Fig. 5. An if-then-else example (a). After block signatures and jump insertion (b) (the fall-through from the then-block to next-block is replaced by an explicit jump, while the conditional branch skips the inserted block signature through an automatic PC addition)
calculating the procedure return address. Again, this is achieved by an automatic addition of the signature size to the PC. The PC additions mentioned above can be generated automatically at run-time because the control unit decodes a branch or procedure call instruction at the end of the block; such an instruction is a clear indication that the block end is near. As the processor has a pipelined architecture, the next instruction is executed in all cases (this is known as the branch delay slot), so the control unit has a clock cycle to prepare for the addition. Although the instruction in the delay slot is placed after the branch, it logically belongs to the same block, as it is executed even if the branch is taken. However, in the case of a block fall-through, the control unit has no clue as to when the first block ends, so the signature cannot be jumped over automatically. In this case, the compiler explicitly adds an unconditional jump to skip it. This is the only case where a processor instruction must be added in order to isolate the main processor from the signature stream. Figure 5a shows an example of an if-then-else construct with a fall-through block that needs such an addition (shown in Fig. 5b).
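A minimal C sketch of the fetch-path adjustment described above follows; it models, in software, how the not-taken path and the procedure return address could account for the interleaved signature word. The constant names, the 4-byte signature word and the single branch delay slot are illustrative assumptions, not a description of the actual RTL.

#include <stdint.h>

#define WORD       4u     /* MIPS instructions are word aligned        */
#define SIG_SIZE   4u     /* one signature word per basic block        */
#define DELAY_SLOT WORD   /* one branch delay slot (assumed)           */

/* Address of the first instruction of the fall-through block after a
   not-taken conditional branch: the fetch must hop over the following
   block's signature word (the delay slot itself still executes). */
static uint32_t not_taken_resume(uint32_t branch_pc)
{
    return branch_pc + WORD + DELAY_SLOT + SIG_SIZE;
}

/* Return address stored by a procedure call that ends a block: it must
   also skip the signature of the block that follows the call. */
static uint32_t return_address(uint32_t call_pc)
{
    return call_pc + WORD + DELAY_SLOT + SIG_SIZE;
}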
6
Overhead Analysis
Although we do not yet have enough experimental data to assess the memory and performance overhead of our system, a qualitative analysis of the memory overhead based on related work is possible. A purely software approach to concurrent error detection was evaluated by Wildner in [18]. This control-flow EDM is called Compiler-Assisted Self Checking of Structural Integrity (CASC) and is based on address hashing and signature justifying to protect the return address of every procedure. At procedure entry, the return address is extracted from the link register into a general-purpose register to be operated on. The first operation is the inversion of the LSB of the return address so as to provoke a misalignment exception in the
case of a CFE. An add instruction is inserted at each basic block to justify the procedure signature and, at the exit point, the final justification and re-inversion of the LSB are computed and the result is transferred to the link register before returning from the procedure. In the case of a CFE, the return address obtained is likely to cause a misalignment exception, thus catching the error. The experiments carried out on a RISC SPARC processor resulted in a code-size overhead for the SPECint92 benchmarks varying from 0% to 28% (18.76% on average), depending on the run-time library used and the benchmark itself. The hardware watchdog of Ohlsson et al. presented in [5] uses a tst instruction per basic block, taking advantage of the branch delay slot of a pipelined RISC processor called TRIP. One of the detection mechanisms used by the watchdog is an instruction counter that issues a time-out exception if a branch instruction is not executed within the specified interval. When a procedure is called, two instructions are inserted to save the block instruction counter, and another instruction is inserted at the procedure end to restore it. Their watchdog code-size overhead is evaluated to be between 13% and 25%; the latter value comes from the heap sort algorithm, which shows a mean basic block of 4.8 instructions. ISIS inserts a single word per basic block, without special treatment for procedure entry and exit blocks, so the CASC or TRIP overhead can be taken as an upper bound on the ISIS memory overhead. Hennessy and Patterson state in [19] that the average length of a basic block for a RISC processor is between 7 and 8 instructions. The reasoning that evaluates the memory overhead as 1/L, where L is the basic block length, is used by Ohlsson and Rimén in [20] to evaluate the memory overhead of their Implicit Signature Checking (ISC) method. The same value (7-8 instructions per block) is used by Shirvani and McCluskey in [21] to perform the same analysis on several software signature checking techniques. Applying this evaluation method to ISIS results in a mean memory overhead of about 12%-15% (1/8 = 12.5% for 8-instruction blocks and 1/7 ≈ 14.3% for 7-instruction blocks). An additional word must be accounted for to eliminate fall-through blocks. The overhead of these insertions has yet to be studied methodically, but initial experiments show a negligible impact on the overall memory overhead.
7
Conclusion
We have presented a novel technique to embed signatures into the execution flow of a RISC processor that provides a set of error checking procedures to verify that the flow of executed instructions is correct. These checking procedures include a block length count, a signature of the instruction opcodes computed with an LFSR, and address checking when a branch is executed. All these checks are performed on a per-block basis, in order to reduce the error detection latency of our hardware Error Detection Mechanism. One of these address checking procedures has not been published before: the Block Origin Address check, used when a branch has multiple valid targets, which consists of delaying the branch check until the target instruction is reached and then verifying that the branch comes from the correct origin vertex
in the CFG. This technique solves the address checking problem that arises when a branch has multiple valid destinations, for example the table-based jumps used when the OS dispatches a service request. Not all software cases can be covered with address checking, however: when a CFG vertex is targeted from two or more multiple fan-out vertices, the Block Origin Address check becomes ineffective. We have named our signature embedding technique Interleaved Signature Instruction Stream (ISIS) to reflect the important fact that the signature instructions processed by the watchdog processor and the main processor instructions are two completely independent streams. ISIS has been implemented in a RISC processor and the modifications that signature embedding demands of the original architecture have been discussed. These modifications are very simple and can be enabled and disabled by software with superuser privileges to maintain binary compatibility with existing software. No specific features of the processor have been used, so porting ISIS to a different processor architecture is quite straightforward. The memory overhead has been studied by comparison with other methods, and the analysis shows a memory overhead between 12% and 15%, although we have not yet performed a methodical study. As a negligible number of instructions is added to the original program, the performance is expected to remain basically unaltered.
Acknowledgements This work is supported by the Spanish Government Comisión Interministerial de Ciencia y Tecnología under project CICYT TAP99-0443-C05-02.
References
[1] Avizienis, A.: Building Dependable Systems: How to Keep Up with Complexity. Proc. of the 25th Fault Tolerant Computing Symposium (FTCS-25), 4-14, Pasadena, California, 1995. 261
[2] Gunneflo, U., Karlsson, J., Torin, J.: Evaluation of Error Detection Schemes Using Fault Injection by Heavy-ion Radiation. Proc. of the 19th Fault Tolerant Computing Symposium (FTCS-19), 340-347, Chicago, Illinois, 1989. 261
[3] Czeck, E. W., Siewiorek, D. P.: Effects of Transient Gate-Level Faults on Program Behavior. Proc. of the 20th Fault Tolerant Computing Symposium (FTCS-20), 236-243, Newcastle upon Tyne, U. K., 1990. 261
[4] Gaisler, J.: Evaluation of a 32-bit Microprocessor with Built-in Concurrent Error Detection. Proc. of the 27th Fault Tolerant Computing Symposium (FTCS-27), 42-46, Seattle, Washington, 1997. 261
[5] Ohlsson, J., Rimén, M., Gunneflo, U.: A Study of the Effects of Transient Fault Injection into a 32-bit RISC with Built-in Watchdog. Proc. of the 22nd Fault Tolerant Computing Symposium (FTCS-22), 316-325, Boston, Massachusetts, 1992. 261, 262, 263, 270
[6] Siewiorek, D. P.: Niche Successes to Ubiquitous Invisibility: Fault-Tolerant Computing Past, Present, and Future. Proc. of the 25th Fault Tolerant Computing Symposium (FTCS-25), 26-33, Pasadena, California, 1995. 261
[7] Oh, N., Shirvani, P. P., McCluskey, E. J.: Control Flow Checking by Software Signatures. IEEE Transactions on Reliability - Special Section on Fault Tolerant VLSI Systems, March, 2001. 262, 268
[8] Galla, T. M., Sprachmann, M., Steininger, A., Temple, C.: Control Flow Monitoring for a Time-Triggered Communication Controller. Proceedings of the 10th European Workshop on Dependable Computing (EWDC-10), 43-48, Vienna, Austria, 1999. 263
[9] Gaisler, J.: Concurrent Error-Detection and Modular Fault-Tolerance in a 32-bit Processing Core for Embedded Space Flight Applications. Proc. of the 24th Fault Tolerant Computing Symposium (FTCS-24), 128-130, Austin, Texas, 1994. 264
[10] Kim, S., Somani, A. K.: On-Line Integrity Monitoring of Microprocessor Control Logic. Proc. Intl. Conference on Computer Design: VLSI in Computers and Processors (ICCD-01), 314-319, Austin, Texas, 2001. 264
[11] Nickel, J. B., Somani, A. K.: REESE: A Method of Soft Error Detection in Microprocessors. Proc. of the 2001 Intl. Conference on Dependable Systems and Networks (DSN-2001), 401-410, Goteborg, Sweden, 2001. 264
[12] Rotenberg, E.: AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors. Proc. of the 29th Fault Tolerant Computing Symposium (FTCS-29), 84-91, Madison, Wisconsin, 1999. 264
[13] Weaver, C., Austin, T.: A Fault Tolerant Approach to Microprocessor Design. Proc. of the 2001 Intl. Conference on Dependable Systems and Networks (DSN-2001), 411-420, Goteborg, Sweden, 2001. 265
[14] Mendelson, A., Suri, N.: Designing High-Performance & Reliable Superscalar Architectures. The Out of Order Reliable Superscalar (O3RS) Approach. Proc. of the 2000 Intl. Conference on Dependable Systems and Networks (DSN-2000), 473-481, New York, USA, 2000. 265
[15] Rashid, F., Saluja, K. K., Ramanathan, P.: Fault Tolerance Through Re-execution in Multiscalar Architecture. Proc. of the 2000 Intl. Conference on Dependable Systems and Networks (DSN-2000), 482-491, New York, USA, 2000. 265
[16] IEEE Std. 1076-1993: VHDL Language Reference Manual. The Institute of Electrical and Electronics Engineers Inc., New York, 1995. 265
[17] MIPS32 Architecture for Programmers, volume I: Introduction to the MIPS32 Architecture. MIPS Technologies, 2001. 265
[18] Wildner, U.: Experimental Evaluation of Assigned Signature Checking With Return Address Hashing on Different Platforms. Proc. of the 6th Intl. Working Conference on Dependable Computing for Critical Applications, 1-16, Germany, 1997. 269
[19] Hennessy, J. L., Patterson, D. A.: Computer Architecture. A Quantitative Approach, 2nd edition, Morgan Kaufmann Pub., Inc., 1996. 270
[20] Ohlsson, J., Rimén, M.: Implicit Signature Checking. Proc. of the 25th Fault Tolerant Computing Symposium (FTCS-25), 218-227, Pasadena, California, 1995. 270
[21] Shirvani, P. P., McCluskey, E. J.: Fault-Tolerant Systems in a Space Environment: The CRC ARGOS Project. Center for Reliable Computing, Technical Report CRC-98-2, Stanford, California, 1998. 270
Model-Checking Based on Fluid Petri Nets for the Temperature Control System of the ICARO Co-generative Plant
Marco Gribaudo1, A. Horváth2, A. Bobbio3, Enrico Tronci4, Ester Ciancamerla5, and Michele Minichino5
1 Dip. di Informatica, Università di Torino, 10149 Torino, Italy
[email protected]
2 Dept. of Telecommunications, Univ. of Technology and Economics, Budapest, Hungary
[email protected]
3 Dip. di Informatica, Università del Piemonte Orientale, 15100 Alessandria, Italy
[email protected]
4 Dip. di Informatica, Università di Roma "La Sapienza", 00198 Roma, Italy
[email protected]
5 ENEA, CR Casaccia, 00060 Roma, Italy
{ciancamerla,minichino}@casaccia.enea.it
Abstract. The modeling and analysis of hybrid systems is a recent and challenging research area which is currently dominated by two main lines: a functional analysis based on the description of the system in terms of discrete state (hybrid) automata (whose goal is to ascertain conformity and reachability properties), and a stochastic analysis (whose aim is to provide performance and dependability measures). This paper investigates a unifying view between formal methods and stochastic methods by proposing an analysis methodology for hybrid systems based on Fluid Petri Nets (FPN). It is shown that the same FPN model can be fed to a functional analyser for model checking as well as to a stochastic analyser for performance evaluation. We illustrate our approach and show its usefulness by applying it to a “real world” hybrid system: the temperature control system of a co-generative plant.
1
Introduction
This paper investigates an approach to model checking starting from a fluid Petri net (FPN) model, for formally verifying the functional and safety properties of hybrid systems. The paper shows that FPNs [1, 11, 9] can constitute a suitable formalism for modeling hybrid systems, like the system under study, where a discrete state controller operates according to the variation of suitable continuous quantities (temperature, heat consumption). The parameters of the models are usually affected by uncertainty. A common and simple way to account for parameter uncertainty is to assign them a range of variation (between a minimum and a maximum value), without any specification of the actual value
assumed by the parameter in a specific realization (non-determinism). Hybrid automata [3] and discretized model checking tools [6] operate along this line. If a weight can be assigned to the parameter uncertainty through a probability distribution, we resolve the non-determinism by defining a stochastic model: the FPN formalism [11, 8] has been proposed to include stochastic specifications. However, the paper intends to show that an FPN model of a hybrid system can be utilized as an input model both for functional analysis and for stochastic analysis. In particular, the paper shows that the FPN model can be translated in terms of a hybrid automaton [2, 15] or a discrete model checker [5]. FPNs are an extension of Petri nets able to model systems with the coexistence of discrete and continuous variables [1, 11, 9]. The main characteristic of FPNs is that the primitives (places, transitions and arcs) are partitioned into two groups: discrete primitives that handle discrete tokens (as in standard Petri nets) and continuous (or fluid) primitives that handle continuous quantities (referred to as fluid). Hence, in a single formalism, both discrete and continuous variables can be accommodated and their mutual interaction represented. Even if Petri nets and model checking rely on very different conceptual and methodological bases (one coming from the world of performance analysis and the other from the world of formal methods), the paper nevertheless attempts to gain cross-fertilization from the two areas. The main goal of the research work presented in this paper is to investigate the possibility of defining a methodology in which a common FPN model is used both for formal specification and verification with model checking tools and for performance analysis. We describe our approach and show its usefulness by using a meaningful “real world” application. Namely, we assume as a case study the control system of the temperature of the primary and secondary circuits of the heat exchange section of the ICARO co-generative plant [4] in operation at the ENEA CR Casaccia centre. The plant under study is composed of two sections: the gas turbine section for producing electrical power and the heat exchange section for extracting heat from the turbine exhaust gases. The paper is organized as follows. Section 2 describes our case study. Section 3 introduces the main elements of the FPN formalism, provides the FPN model of the case study, and describes its conversion into a hybrid automaton. Section 4 shows how the same FPN model can be translated into a discrete model checker (NuSMV [14]) and provides some of our experimental results. Section 5 discusses scalability and complexity, and Section 6 gives the conclusions.
2
Temperature Control System
The ICARO co-generative plant is composed of two sections: the electrical power generation and the heat extraction from the turbine exhaust gases. The exhaust gases can be conveyed to a re-heating chamber to heat the water of a primary circuit and then, through a heat exchanger, to heat the water of a secondary circuit, which is in fact the heating circuit of the ENEA Research Center. If the
thermal energy required by the end user is higher than the thermal energy of the exhaust gases, fresh methane gas can be fired in the re-heating chamber where the combustion occurs. The flow of the fresh methane gas is regulated by the control system through the position of a valve. The block diagram of the temperature control of the primary and secondary circuits is depicted in Figure 1. The control of the thermal energy used to heat the primary circuit is performed by regulating both the flow rate of the exhaust gases through the diverter D and the flow rate of the fresh methane gas through the valve V. T1 is the temperature of the primary circuit, T2 is the temperature of the secondary circuit, and u is the thermal request by the end user. The controller has two distinct regimes (two discrete states) represented by the position 1 or 2 of the switch W in Figure 1. Position 1 is the normal operational condition, position 2 is the safety condition. In position 1, the control is based on a proportional-integrative measure (performed by block PI1) of the error of temperature T2 with respect to a (constant) set point temperature Ts. Conversely, in position 2, the control is based on a proportional-integrative measure (performed by block PI2) of the error of temperature T1 with respect to a (constant) set point temperature Ts. Normally, the switch W is in position 1 and the control is performed on T2 to keep the temperature delivered to the end user constant. Switching from position 1 to position 2 occurs for safety reasons, when the value of T2 is higher than a critical value defined as the set point Ts augmented by a hysteresis value Th; the control is then locked to the temperature of the primary circuit T1, until T1 becomes lower than the set point Ts. The output of the proportional-integrative block (either PI1 or PI2, depending on the position of the switch W) is the variable y, which represents the request for thermal energy. When y is lower than a split point value Ys the control just acts on the diverter D (flow of the exhaust gases); when the diverter is completely
Fig. 1. Temperature control of the primary and secondary circuits of the ICARO plant (block PI1 acts on the T2 error and block PI2 on the T1 error, selected by the switch W; the resulting request y drives the diverter D and, above the split point Ys, the fresh gas valve V)
open and the request for thermal energy y is greater than Ys, the control also acts on the flow rate of the fresh methane gas by opening the valve V. The heating request is computed by the function f(y) represented in Figure 2. Since the temperature T2 is monitored when W is in position 1, and the temperature T1 is monitored in state 2, the function f(y) depends on y2 when W = 1 and on y1 when W = 2. The function f(y) is defined as the sum of two non-deterministic components: g1(y), which represents the state of the valve V, and g2(y), which represents the state of the diverter D. The non-determinism is introduced by the parameters αmin, αmax, which give the minimal and maximal heat induced by the fresh methane gas, and βmin, βmax, which define the minimal and maximal heat induced by the exhaust gases. Finally, the heat exchange between the primary and the secondary circuit is approximated by the linear function γ(T1 − T2), proportional (through a constant γ) to the temperature difference.
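The two-regime behaviour described above can be sketched as a small simulation in C. The sketch uses a forward-Euler step, treats f(y) deterministically (no αmin/αmax or βmin/βmax ranges), and picks arbitrary numeric constants; it only illustrates the switching logic and is not the model analysed in the paper.

#include <stdio.h>

/* Illustrative constants (assumed values, not the plant's real parameters) */
#define TS     140.0   /* set point temperature             */
#define TH       1.0   /* hysteresis                        */
#define YS       0.5   /* split point of the heat request   */
#define GAMMA    2.0   /* primary/secondary exchange rate   */
#define ALPHA    1.5   /* heat induced by fresh methane gas */
#define BETA     1.0   /* heat induced by the exhaust gases */
#define U        0.8   /* end-user heat consumption         */
#define DT       0.1   /* Euler integration step            */

/* Heating request f(y) = g1(y) + g2(y): below Ys only the diverter acts,
   above Ys the fresh-gas valve V is opened as well. */
static double f(double y)
{
    double g2 = (y < YS) ? BETA * (y / YS) : BETA;               /* diverter D */
    double g1 = (y < YS) ? 0.0 : ALPHA * (y - YS) / (1.0 - YS);  /* valve V    */
    return g1 + g2;
}

int main(void)
{
    double T1 = TS, T2 = TS, y1 = 0.0, y2 = 0.0;
    int W = 1;                                   /* controller regime      */

    for (int k = 0; k < 1600; k++) {
        double y = (W == 1) ? y2 : y1;           /* active PI output       */
        if (y < 0.0) y = 0.0;
        if (y > 1.0) y = 1.0;
        double exch = GAMMA * (T1 - T2);

        T1 += DT * (f(y) - exch);
        T2 += DT * (exch - U);
        y1 += DT * (TS - T1);                    /* integral of T1 error   */
        y2 += DT * (TS - T2);                    /* integral of T2 error   */

        if (W == 1 && T2 > TS + TH) W = 2;       /* switch to safety mode  */
        else if (W == 2 && T1 < TS) W = 1;       /* back to normal mode    */
    }
    printf("W=%d T1=%.2f T2=%.2f\n", W, T1, T2);
    return 0;
}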
3
Fluid Petri Nets
Fluid Petri Nets (FPN) are an extension of standard Petri Nets [13] where, in addition to the normal places that contain a discrete number of tokens, new places are added that contain a continuous quantity (fluid). Hence, this extension is well suited to modeling and analyzing hybrid systems. Two main formalisms have been developed in the area of FPN: the Continuous or Hybrid Petri net (HPN) formalism [1], and the Fluid Stochastic Petri net (FSPN) formalism [11, 9]. A complete presentation of FPN is beyond the scope of the present paper; an extensive discussion of FPN in performance analysis can be found in [8]. Discrete places are drawn according to the standard notation and contain a discrete number of tokens that are moved along discrete arcs. Fluid places are drawn as two concentric circles and contain a real variable (the fluid level). The fluid flows along fluid arcs (drawn as a double line to suggest a pipe) according to an instantaneous flow rate. The discrete part of the FPN regulates the flow of the fluid through the continuous part, and the enabling conditions of a transition depend only on the discrete part.
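A possible in-memory representation of these primitives is sketched below in C; the struct layout and field names are illustrative assumptions, not a standard FPN data structure.

/* Simplified representation of the FPN primitives described above. */
typedef struct {
    int tokens;                 /* discrete place: number of tokens      */
} DiscretePlace;

typedef struct {
    double level;               /* fluid place: continuous fluid level   */
    double lower_bound;         /* e.g. Ts for place Primary             */
    double upper_bound;         /* e.g. Ts + Th for place Secondary      */
} FluidPlace;

typedef struct {
    FluidPlace *from, *to;      /* fluid arc, drawn as a "pipe"          */
    double (*rate)(void);       /* instantaneous flow rate               */
} FluidArc;

typedef struct {
    const DiscretePlace *input; /* enabling depends only on the discrete */
    int (*guard)(void);         /* part, e.g. T2 > Ts + Th for Sw12      */
} Transition;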
Fig. 2. The heating request function f(y) = g1(y) + g2(y), with g1(y) bounded by αmin and αmax, g2(y) bounded by βmin and βmax, and the split point Ys separating the two regimes
3.1
A FPN Description of the System
The FPN modeling the case study of Figure 1 is represented in Figure 3. The FPN contains two discrete places: P1, which is marked when the switch W is in state 1, and P2, which is marked when the switch W is in state 2. Fluid place Primary (whose marking is denoted by T1, and which has a lower bound at Ts) represents the temperature of the primary circuit, and fluid place Secondary (whose marking is denoted by T2 and which has an upper bound at Ts + Th) represents the temperature of the secondary circuit. The fluid arcs labeled with γ(T1 − T2) represent the heat exchange between the primary and the secondary circuit. The system jumps from state 1 to state 2 due to the firing of the immediate transition Sw12. This transition has an associated guard T2 > Ts + Th that makes the transition fire (inducing a change of state) as soon as the temperature T2 exceeds the setpoint Ts augmented by a hysteresis value Th. The change from state 2 to state 1 is modeled by the immediate transition Sw21, whose firing is controlled by the guard T1 < Ts that makes the transition fire when the temperature T1 goes below the setpoint Ts. In order to simplify the figure, we have connected the fluid arcs directly to the immediate transitions. The meaning of this unusual feature is that fluid flows across the arcs as long as the immediate transitions are enabled, regardless of the value of the guards. The output fluid arc of place Secondary represents the end-user demand. The label on this arc is [u1, u2], indicating the possible range of variation of the user demand. Fluid place CTR2, whose marking is denoted by y1, models the output of the proportional-integrator PI1. This is achieved by connecting to place CTR1 an input fluid arc, characterized by a variable flow rate equal to T2, and an output fluid arc with a constant fluid rate equal to the setpoint Ts. In a similar way, the output of the proportional-integrator PI2 is modeled by fluid place CTR2 (whose marking is denoted by y2). The fluid arcs that connect transitions Sw12 and Sw21 to fluid place Primary represent the heating up of the primary circuit.
Fig. 3. FPN model of the temperature controller (discrete places P1 and P2; immediate transitions Sw12 with guard T2 > Ts + Th and Sw21 with guard T1 < Ts; fluid places Primary (T1), Secondary (T2), CTR1 and CTR2 (y1, y2); heat-exchange arcs γ(T1 − T2); heating arcs f(y1) and f(y2); user-demand arc [u1, u2])
3.2
From FPN to Hybrid Automata
A hybrid automaton [3] is a finite state machine whose nodes (called control modes) contain real-valued variables with a definition of their first derivatives and possible bounds on their values. The edges represent discrete events and are labeled with guarded assignments on the real variables. Given a hybrid automaton and a legal formula on its variables, the model checking problem asks one to compute a region that satisfies the predicate, or to find at least one counterexample that contradicts the predicate. In order to use an FPN model in a model checking environment, the FPN formalism should be converted into a hybrid automaton. A general conversion algorithm could be envisaged using the technique proposed in [15] and some of the ideas presented in [2]. The application of the general algorithm to the case study FPN of Figure 3 provides the hybrid automaton [3] of Figure 4. The hybrid automaton has the real variables T1, T2, y1 and y2 (corresponding to the fluid variables of the FPN) and two control modes P1 and P2 (corresponding to the two discrete markings of the FPN). Each continuous variable has a derivative equal to the flow rate of the corresponding fluid place in that state. Transitions from control mode P1 to P2 and from P2 to P1 are labeled with the guards of the immediate transitions that cause the state change. State P1 also has the associated bound (invariant condition) T2 ≤ Ts + Th, and P2 the bound T1 ≥ Ts, reflecting the bounds placed on the corresponding fluid places. The model of Figure 4 could be analyzed by means of appropriate tools for hybrid automata [10].
4
Analysis of the FPN Model via NuSMV
Discrete model checking is based on a finite state machine model in which the variables and their derivatives are discretized and in which time increases with a predefined time step. The parameters and their derivatives can be assigned uncertainty ranges (e.g. a min and a max value) with non-deterministic logic. The predicates to be checked are specified using Computational Tree Logic (CTL) or Real Time CTL (RTCTL) [6, 7].
Fig. 4. Hybrid automaton obtained from the FPN of Figure 3 (initial mode P1 with T1 = T2 = Ts and y1 = y2 = 0; in both modes T2' = γ(T1 − T2) − [u1, u2], y1' = Ts − T1 and y2' = Ts − T2; T1' = f(y2) − γ(T1 − T2) in P1 and T1' = f(y1) − γ(T1 − T2) in P2; invariants T2 ≤ Ts + Th in P1 and T1 ≥ Ts in P2; mode switches occur at T2 = Ts + Th and T1 = Ts)
In order to show the generality of our approach and to give insight into the class of models that can be automatically derived from the FPN description, we sketch, in brief, how the FPN can be converted into a discrete model to be checked using discrete model checking techniques, and we present some typical analyses and results that can be obtained from the converted model. For the purpose of the present paper, we have chosen the language NuSMV, for which an analysis tool is available [14].
4.1
Converting a FPN into a NuSMV Model
In the present section, we describe the main steps that are required to convert an FPN model into the NuSMV language, and we provide an excerpt of the NuSMV specifications for the case study at hand in the Appendix. A more detailed description of the conversion algorithm and the complete NuSMV specifications are given in [12]. The conversion algorithm requires the following steps:
1. Definition of the variables. The discrete part of the FPN model is the marking, and it is directly translated into a discrete variable in NuSMV. All the continuous variables (fluid levels of the FPN) and their rates of variation must instead be suitably discretized. These facts are described in NuSMV under the keyword VAR.
2. The second step requires that the FPN constants used in the fluid rate functions or in the enabling conditions be defined. Moreover, the ranges of variation (min and max values) for the continuous variables and for their rates must be set. All these quantities (constants and bounds) must be suitably discretized and rescaled according to the discretization intervals chosen in step 1 above. The constants and the bounds are listed under the keyword DEFINE.
3. In order to analyze the behavior of the control system versus time, a time step (in arbitrary units) is assumed and the dynamic evolution of the system at integer multiples of the time step must be described. The evolution of the model is stated under the keyword TRANS and must be described marking by marking.
4. Finally, the fourth step consists in defining an initial state from which the dynamic evolution of the model starts. The initial state of the model is described under the keyword INIT.
We now particularize the above general points to the present case study (refer to the Appendix and to [12] for the specification obtained in the NuSMV language). The discrete part of the FPN model is reflected in the variable marking, whose value is either 1 or 2. Furthermore, all the continuous variables (fluid levels of the FPN) and their range of variation must be discretized. Let x be a fluid variable in the FPN whose fluid place is lower bounded by Bl and upper bounded by Bu. We define a discretization step δ such that the continuous range of variation of x is discretized in n = ⌈(Bu − Bl)/δ⌉ steps (⌈·⌉ denotes the smallest integer not less than its argument). With this assumption, the possible discretized
values of the level x are defined in NuSMV as x: 0..n, where x = i means that the corresponding value is Bl + iδ. In the FPN of Figure 3, four fluid variables are defined: y1, y2 and T1, T2. In the NuSMV description, the variables representing the fluid levels y1, y2, T1 and T2 are denoted by y1, y2, T1, and T2. The fluid levels y1 and y2, of fluid places CTR1 and CTR2, respectively, are normalized in the range [0, 1] and discretized with a step interval of 1/30. The normalization constant for y1 and y2 is denoted by dy and represents how fast the system reacts to the temperature difference with respect to the setpoint. The fluid levels T1 and T2 of fluid places Primary and Secondary, respectively, are bounded between Tl = 138 and Tu = 145, and the discretization step chosen for these variables is 0.1. The list of constants and bounds defined in step 2 above includes:
– y1max, y2max, T1max, T2max: rescaled bounds on the continuous variables;
– alphamin, alphamax and betamin, betamax: non-deterministic range of the heat induced by the methane gas and by the exhaust gases, respectively;
– sp, hys, ys: setpoint temperature Ts, hysteresis value Th and split point Ys;
– dy: system reaction speed to the output of the proportional integrators PI;
– gamma: rate of heat exchange between the primary and secondary circuit;
– u1, u2: non-deterministic range of the heat consumption of the end user.
The second part under the keyword DEFINE defines the rates at which the continuous variables change in each discrete state, and the bounds on the rates. For marking=1, these are the following (similar definitions hold for marking=2):
– m1_y1 gives the (deterministic) fluid rate of place CTR1 in state 1;
– m1_y2 gives the (deterministic) fluid rate of place CTR2 in state 1;
– m1_T1_min and m1_T1_max give the minimal and maximal flow rate of fluid place Primary in state 1;
– m1_T2_min and m1_T2_max give the minimal and maximal flow rate of fluid place Secondary in state 1.
Since in the present model we have two markings (states), the evolution description is restricted to four expressions:
– possible changes of the variables inside marking=1 (marking=2);
– jump from marking=1 to marking=2 (from marking=2 to marking=1).
Finally, the initial state of the model is described under the keyword INIT.
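As a concrete illustration of the discretization and of the kind of update the TRANS relation encodes for marking=1, the following C fragment mimics both. The constants and the use of rand() to pick a rate are illustrative assumptions; NuSMV itself explores every rate in the declared non-deterministic range rather than sampling one.

#include <math.h>
#include <stdlib.h>

/* Discretize a fluid level bounded by [Bl, Bu] with step delta:
   the level x is stored as an integer i in 0..n, meaning x = Bl + i*delta. */
static int discretize(double x, double Bl, double Bu, double delta, int *n)
{
    *n = (int)ceil((Bu - Bl) / delta);
    int i = (int)((x - Bl) / delta + 0.5);
    if (i < 0) i = 0;
    if (i > *n) i = *n;
    return i;
}

/* One discrete-time update of T1 in marking=1: the rate is chosen
   non-deterministically between m1_T1_min and m1_T1_max (assumed
   m1_T1_min <= m1_T1_max; here resolved with rand()). */
static int step_T1(int T1, int m1_T1_min, int m1_T1_max, int T1max)
{
    int rate = m1_T1_min + rand() % (m1_T1_max - m1_T1_min + 1);
    int next = T1 + rate;
    if (next < 0) next = 0;
    if (next > T1max) next = T1max;
    return next;
}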
4.2
NuSMV Results
NuSMV is a model checking tool and also contains a simulation engine to explore the dynamics of the system. To increase the readability of the results, we report the variables in their true units (and not in the rescaled units used by NuSMV). Figures 5 and 6 depict the evolution of the temperatures (T1 and T2) and of y1 and y2, respectively, for the same simulation trace, starting from the initial state [y1=0, y2=0, T1=138, T2=138].
Fig. 5. Change of temperature given by a simulation trace (T1 and T2 versus time)
Fig. 6. Change of y1 and y2 given by a simulation trace (y1 and y2 versus time)
The design specification for the temperatures of the system (given as invariant conditions or bounds) is: 139 ≤ T1 ≤ 144 and 139 ≤ T2 ≤ 141. If the invariant does not hold, i.e. the temperatures exceed the bounds, NuSMV produces a counterexample as in Table 1. The table shows a case with [gamma=2, dy=10, Ts=140]. Both temperatures start initially from 141 and decrease because of the heat consumption of the user. As T2 (T1) reaches Ts, y1 (y2) starts to increase. However, the reaction is not fast enough to avoid the undesirable condition and the secondary temperature T2 crosses the lower bound. Modifying the design parameters may avoid this situation (for instance, setting [gamma=2, dy=1/10, Ts=139.8], i.e. speeding up the reaction of the system and reducing the setpoint temperature). Using RTCTL (Real-Time Computational Tree Logic [7]) expressions, one can check the trajectory along which the system proceeds. For example, starting from the lowest possible temperatures (T1=T2=138), the formula AF (AG (T1>=139 & T1<=144 & T2>=139 & T2<=141)) is true if the system is guaranteed to get back to a stable state and remain there forever. Setting [gamma=2, dy=1/10, sp=18], the formula evaluates to true. The same formula, with the same settings, evaluates to true as well if the system starts from the upper bound of the temperatures. Knowing the timing behavior of the system, one can use NuSMV to compute the minimal or maximal time needed to reach a given set of states from an initial condition. For example, the command COMPUTE MIN[y1=0 & y2=0 & T1=145 & T2=145, AG (T1>=139 & T1<=144 &
Table 1. Counterexample
Step  State  T1     T2     y1    y2
1     1      141    141    0     0
2     1      141    140.7  0     0
3     1      140.9  140.5  0     0
4     1      140.7  140.4  0     0
5     1      140.6  140.2  0     0
6     1      140.4  140.1  0     0
7     1      140.3  139.9  0     0
8     1      140.1  139.8  1/30  0
9     1      140    139.6  2/30  0
10    1      139.8  139.5  3/30  1/30
11    1      139.7  139.3  4/30  2/30
12    1      139.5  139.2  5/30  3/30
13    1      139.4  139    6/30  4/30
14    1      139.2  138.9  8/30  5/30
T2>=139 & T2<=141)] COMPUTE MAX[y1=0 & y2=0 & T1=145 & T2=145, AG (T1>=139 & T1<=144 & T2>=139 & T2<=141)] gives the length of the minimal and maximal paths that lead from the initial condition [y1=0, y2=0, T1=145, T2=145] of high temperatures (out of the required range) to temperatures inside the required range, in such a way that the system does not leave this range in the future. The above commands with parameters [gamma=2, dy=1/10, sp=18] result in a minimal path of length 21 and a maximal path of length 64.
5
Scalability and Complexity
The scalability and the complexity of the method mainly depend on the hybrid automata solution component. The process of translating an FPN into a hybrid system is exponential in the dimension of the FPN: it requires the creation of its reachability graph, which is done through a depth-first visit of its state space. This step is clearly exponential in the dimension of the model (see for example [9]). After the model has been translated, the complexity of the analysis depends on the complexity of the algorithms used by the NuSMV package. The scalability of the technique is thus limited by two different aspects: the exponential complexity of the translation process, and the solution complexity of the hybrid automata analysis technique. At the present time, these constraints limit the applicability of the proposed technique to very small models (in terms of FPN description elements).
6
Conclusion
Using a real-world hybrid system as a case study, we presented an approach to integrating FPNs and model checking via hybrid automata and NuSMV. Such integration turns out to be conceptually useful and effective in practice. In fact, it allowed us to comfortably model and verify the temperature control system of the co-generative plant ICARO at ENEA (CR).
Acknowledgment A. Bobbio and M. Gribaudo were partially supported by the Italian Ministry of Education under Grant ISIDE.
References
[1] H. Alla and R. David. Continuous and hybrid Petri nets. Journal of Systems Circuits and Computers, 8(1):159–188, Feb 1998. 273, 274, 276
[2] M. Allam. Sur l'analyse quantitative des réseaux de Petri hybrides: une approche basée sur les automates hybrides. Technical report, PhD Thesis, Institut National Polytechnique de Grenoble (in French), 1998. 274, 278
[3] R. Alur, T. A. Henzinger, and P. H. Ho. Automatic symbolic verification of embedded systems. IEEE Transactions on Software Engineering, 22:181–201, 1996. 274, 278
[4] A. Bobbio, S. Bologna, E. Ciancamerla, P. Incalcaterra, C. Kropp, M. Minichino, and E. Tronci. Advanced techniques for safety analysis applied to the gas turbine control system of ICARO co-generative plant. In X Convegno Tecnologie e Sistemi Energetici Complessi, pages 339–350, 2001. 274
[5] A. Bobbio and A. Horváth. Petri nets with discrete phase timing: A bridge between stochastic and functional analysis. In Second International Workshop on Models for Time-Critical Systems (MTCS 2001), pages 22–38, 2001. 274
[6] E. M. Clarke, E. A. Emerson, and A. P. Sistla. Automatic verification of finite state concurrent systems using temporal logic specifications: A practical approach. ACM Transactions on Programming Languages and Systems, 8(2):244–263, 1986. 274, 278
[7] E. A. Emerson, A. K. Mok, A. P. Sistla, and J. Srinivasan. Quantitative Temporal Reasoning. Journal of Real Time Systems, 4:331–352, 1992. 278, 281
[8] M. Gribaudo. Hybrid formalism for performance evaluation: Theory and applications. Technical report, PhD Thesis, Dipartimento di Informatica, Università di Torino, 2001. 274, 276
[9] M. Gribaudo, M. Sereno, A. Horváth, and A. Bobbio. Fluid stochastic Petri nets augmented with flush-out arcs: Modelling and analysis. Discrete Event Dynamic Systems, 11 (1/2):97–117, January 2001. 273, 274, 276, 282
[10] T. A. Henzinger, P. H. Ho, and H. Wong-Toi. A user guide to HyTech. In Proceedings 1st Workshop Tools and Algorithms for the Construction and Analysis of Systems - TACAS, pages 41–71. Springer Verlag, LNCS Vol 1019 - http://www.eecs.berkeley.edu/tah/HyTech, 1995. 278
[11] G. Horton, V. Kulkarni, D. Nicol, and K. Trivedi. Fluid stochastic Petri nets: Theory, application and solution techniques. European Journal of Operational Research, 105(1):184–201, 1998. 273, 274, 276
[12] A. Horváth, M. Gribaudo, and A. Bobbio. From FPN to NuSMV: The temperature control system of the ICARO cogenerative plant. Technical report, Università del Piemonte Orientale, Feb 2002, http://www.di.unipmn.it. 279
[13] T. Murata. Petri nets: properties, analysis and applications. Proceedings of the IEEE, 77:541–580, 1989. 276
[14] NuSMV. http://nusmv.irst.itc.it/index.html. 274, 279
[15] B. Tuffin, D. S. Chen, and K. Trivedi. Comparison of hybrid systems and fluid stochastic Petri nets. Discrete Event Dynamic Systems, 11 (1/2):77–95, January 2001. 274, 278
Assertion Checking Environment (ACE) for Formal Verification of C Programs
Babita Sharma1, S. D. Dhodapkar1, and S. Ramesh2
1 Reactor Control Division, Bhabha Atomic Research Centre, Mumbai 400085, India, fax (091-22-5505151)
[email protected] [email protected]
2 Centre for Formal Design and Verification of Software, IIT Bombay, Mumbai 400076, India
[email protected]
Abstract. In this paper we describe an Assertion Checking Environment (ACE) for compositional verification of programs which are written in an industrially sponsored safe sub-set of the C programming language called MISRA C [1]. The theory is based on Hoare logic [2] and the C programs are verified using a static assertion checking technique. First the functional specifications of the program, captured in the form of pre- and post-conditions for each C function, are derived from the specifications. These pre- and post-conditions are then introduced as assertions (annotations or formal comments) in the program code. The assertions are then proved formally using ACE and the theorem proving tool Stanford Temporal Prover (STeP) [3]. ACE has been developed by us and consists of a translator c2spl, a GUI and some utility programs. The technique and tools developed are targeted towards verification of embedded software.
1
Introduction
Safety-critical systems need to be highly reliable and there is an obligation to demonstrate that the integrity level of the software is commensurate with the safety class of the software. One of the main constituent attributes of software integrity is freedom from defects. It is known that techniques such as inspection, peer review and dynamic testing, which are used extensively for verification of software, are very effective in detecting a number of bugs, but they cannot guarantee the absence of bugs [4]. Therefore techniques based on formal methods are becoming more and more relevant due to their ability to provide mathematical proofs of desired properties of software, and are being increasingly recommended for verification of high-integrity or safety-critical software. Formal verification techniques employ mathematical methods and logical reasoning to determine
Corresponding author
whether software satisfies its specifications or not, and hence are able to provide more dependable evidence of freedom from defects. This paper describes the Assertion Checking Environment (ACE) developed by us and interfaced with the theorem prover STeP [3]. ACE is used for statically checking (i.e. without executing the program unit) assertions about a program unit. The assertions themselves are derived from the specifications of the program unit. While dynamic assertion checking has been around for a long time and is well supported by many dynamic analysis tools, it requires execution of the program and the development of test drivers and stubs. This contributes enormously to the verification effort. Also, dynamic assertion checking is more akin to testing and suffers from the same limitations. In contrast, the static method checks an assertion completely with respect to the whole program unit. In the context of safety-critical software, verifying that software units satisfy their specifications was our target application of ACE. Another application of assertion checking is safety verification. In safety verification of software, the top-level safety properties are analyzed during design/implementation to arrive at safety properties or Safety Verification Conditions (SVCs) at the software unit level, which are then verified. Such work on proving Safety Verification Conditions (SVCs) on a safe sub-set of Ada has been reported in [5]. In that work the actual proofs of SVCs were obtained using the SPARK Simplifier and Proof Checker [6]. The work reported here was motivated by our requirement for a static assertion checking tool for the C language, which continues to be a popular language among embedded system designers. The paper is organized as follows. Section 2 describes the rationale behind the design of ACE. Section 3 gives details on verification using ACE, i.e. the inputs to ACE and compositional verification of MISRA C compliant programs. This is followed by a description of the c2spl implementation in section 4 and an example in section 5. Section 6 describes our experience with the initial use of ACE. Section 7 summarizes the work, compares it with similar work and discusses possible future work.
2
ACE - Design Rationale
Formal verification, in general, involves checking that a model of the system/software satisfies its specifications. The model, which describes the system/software formally, is usually specified in specialized and abstract high-level modeling languages. The formal model could be finite state or infinite state, and this would govern the choice of verification engine. The overall formal verification scheme is shown in fig. 1. The design of ACE is based on consideration of three main issues which arise in the context of verification of C programs against their specifications.
– In order to apply formal verification, the verifier has to construct a formal model of a C program in the input language of the verification tool used. This process is time-consuming and error-prone and has been one of the impediments in the use of formal verification.
Fig. 1. Formal Verification Technique (the system/software requirements are formalized into formal specifications, the system/software description is turned into an abstract model by model generation, and both are fed to the verification engine)
– The formal modeling of the full-featured C language in general poses several problems because of pointer arithmetic, aliasing, side effects, etc. Most programming guidelines used in the design of safety-critical or high-integrity software recommend restrictions on the use of language features which are not secure and/or may be difficult to verify. This approach can actually help overcome the problem of generating formal models of C programs. Hence a well-defined sub-set of the C language was required.
– To reduce the effort in discharging proofs, a powerful verification engine is required. Since the formal model of a C program is in general an infinite state model, a theorem proving tool would be required.
These issues have been addressed in ACE as described below: ACE provides a tool for automatic translation of C program units to SPL programs. SPL is the input language of the theorem prover STeP. The C language programs to be checked by ACE are assumed to be compliant to the MISRA C standard. The MISRA subset of C [1] is an industrially sponsored subset of C recommended for programming safety-critical systems. It prohibits or restricts the use of unspecified, undefined, implementation-defined and locale-specific features of the C language. It also omits all features of C, such as recursion and dynamic memory allocation, which are proscribed by most programming guidelines for safety-critical software. ACE has been interfaced with a powerful theorem proving tool called STeP, thus making proving properties easy. Stanford Temporal Prover (STeP) [3] is a GUI-based theorem prover developed at Stanford University. It supports the formal verification of reactive systems. The inputs to STeP are the system model given in the Simple Programming Language (SPL) and the specification (SPEC) of the system containing the axioms (properties that are known to be true for the system) and the properties to be proved about the system. The axioms and properties are expressed by temporal logic formulae. The syntax of SPL is like an imperative language. It has basic data-types, basic control-flow constructs and
allows the definition of new data-types. It also has features to model concurrency using parallel composition. This feature is not used by ACE, as ACE is meant for the verification of sequential program code/units. STeP provides a collection of simplification and decision procedures to automatically check the validity of a large class of first-order and temporal formulas. It provides verification rules that reduce temporal properties to first-order verification conditions. The heuristics and the commonly used steps to prove properties can be coded in the form of a tactic which, when invoked, can automatically discharge the property. By analyzing the SPL code, STeP can also generate local invariants. These local invariants greatly simplify the deduction of properties in general, and in particular of properties of programs having loops.
3
Verification Using ACE
In ACE the classical Hoare [2] triple (pre-condition, program, post-condition) expresses the functional behaviour of a program unit. The Hoare triple, written as {pre}p{post}, states that the execution of program p beginning in a state satisfying pre will eventually result in a state satisfying post. The assertion pre is called the pre-condition and the assertion post the post-condition of the program p. While carrying out verification, pre and post are introduced as annotations in p, and post becomes the proof obligation to be discharged. c2spl is used to produce the formal SPL version of p, which is used in carrying out the proof. In real programs, a program unit p can call other program units. This is handled using compositional verification, explained later in section 3.2.
3.1
Inputs to ACE
The present ACE implementation can be used for the unit-level verification of sequential C programs. The vertical arrows in fig. 2 show the steps that are followed. The functions enclosed in the dashed box are implemented in ACE and are available to the user through a GUI. The unit-level specifications are obtained from the Software Detailed Design (SDD) document and are converted to formal specifications in the form of pre- and post-conditions of C functions. ACE allows the user to interactively insert these pre- and post-conditions in the function code as annotations (formal comments). The user can also insert assertions at intermediate points in the C program. The utility program gen_cg can then be used to build the call-graph. This is used to sequence the verification of functions, as explained in section 3.2. The main component of ACE, c2spl, can be invoked interactively to translate any C function into its semantically equivalent SPL program. c2spl also processes annotations during translation to generate a SPEC file containing the properties to be proved. The SPL code and SPEC file are then used as input to STeP for carrying out the proofs.
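For illustration, a function annotated in the ACE syntax might look as follows. The function, its specification and the variable names are invented for this example; only the /*pre ... end*/, /*assert ... end*/ and /*post ... end*/ annotation forms are taken from ACE (cf. Table 3), and the use of RET for the return value and the exact placement of the annotations are assumptions based on the description in this paper.

/* Hypothetical MISRA C function annotated for ACE: the pre-condition
   bounds the raw 12-bit input and the post-condition bounds the result. */
int scale_reading(int raw)
{
    /*pre raw >= 0 & raw <= 4095 end*/
    int value;

    value = (raw * 100) / 4095;      /* scale to 0..100 */

    /*assert value >= 0 & value <= 100 end*/

    /*post RET >= 0 & RET <= 100 end*/
    return value;
}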
Fig. 2. Process of source code verification (the user annotates the MISRA C source code against the unit-level software specifications, gen_cg generates the call-graph, c2spl translates the target function into SPL code and a SPEC file, and the properties are proved with STeP)
3.2
Compositional Verification
The verification process begins from the leaf nodes of the call-graph, as these functions do not invoke any other functions. For example, consider the part of a program call-graph shown in fig. 3. Let preB, preD and preE denote the pre-conditions of functions B, D and E respectively, and postB, postD and postE denote their post-conditions. The proofs for the lower-level functions D and E are discharged first and their pre- and post-conditions are recorded in the property-recorder of ACE. The user may need to prove intermediate assertions to prove the post-conditions. For proving the correctness of a non-leaf function such as B, the functions called from this non-leaf function are treated as black boxes and only their specifications stored in the property-recorder are used. In the example shown in fig. 3, postD and postE are used by ACE to generate the postfunc type of annotations in B at the locations of the function-calls to D and E respectively. These annotations are translated to axioms in the SPEC file by c2spl. Since postD and postE were proved true assuming preD and preE to be valid, these pre-conditions become additional proof-obligations to be discharged when proving the parent function B. Hence preD and preE are used to generate prefunc type annotations in B, as shown in fig. 3. The prefunc and postfunc type annotations corresponding to called functions are generated automatically by ACE by replacing formal parameters with actual parameters.
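The pattern of fig. 3 can be written out as follows for a caller B and a callee D; the function bodies and the formulas are invented, and only the prefunc/postfunc annotation forms and their placement around the call are taken from the description above.

/* Hypothetical leaf function D, verified first. */
int D(int a)
{
    /*pre a >= 0 end*/
    /*post RET >= a end*/
    return a + 1;
}

/* Hypothetical caller B: ACE surrounds the call to D with the (already
   proved) specification of D, instantiated with the actual parameter n. */
int B(int n)
{
    /*pre n >= 0 end*/
    int x;

    /*prefunc n >= 0 end*/          /* pre-condition of D: new obligation */
    x = D(n);
    /*postfunc x >= n end*/         /* post-condition of D: used as axiom */

    /*post RET >= n end*/
    return x;
}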
Fig. 3. Part of Call-graph illustrating Compositional Verification (functions B, D and E annotated with their pre- and post-conditions; in B, the calls to D and E are bracketed by the generated prefunc and postfunc annotations)
4
c2spl Translator Details
The translator c2spl has two main functions, viz.
– translate the input program, compliant to MISRA C, to a semantically equivalent program in SPL;
– translate the different types of annotations into specifications of the translated program.
The translation process is elaborated below.
4.1
MISRA C to SPL Translation
A parse-tree walker handles the syntax-directed translation of C statements to SPL. c2spl assumes that the input program is MISRA compliant. However, since the parser ctool [7] used by c2spl handles full ANSI C, a MISRA compliance checking tool needs to be used before attempting to use c2spl for translation. Translation of Control-Flow Constructs. SPL provides many control-flow constructs similar to those provided by the C language, such as if-then-else, while-do and the for loop. The switch statement of C, for which there is no corresponding construct in SPL, is modeled using the if-then-else-if... construct, and the do-while looping construct of C is translated to the repeat-until construct of SPL. It may be noted that MISRA C prohibits the use of fall-through in switch statements.
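As an illustration of this control-flow mapping, consider the do-while loop below; the SPL form shown in the trailing comment is a sketch of the repeat-until translation described above, not verbatim c2spl output.

/* MISRA-compliant C fragment with a do-while loop. */
int count_down(int n)
{
    int steps = 0;
    do {
        n--;
        steps++;
    } while (n > 0);
    return steps;
}

/* Sketch of the corresponding SPL (illustrative, not verbatim c2spl output):
 *   repeat
 *     n := n - 1;
 *     steps := steps + 1
 *   until n <= 0
 */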
Table 1. Mapping of Basic Data-types
C Data-type                   SPL Data-type   Additional Axiom
unsigned char                 [0..255]
signed char                   [-128..127]
[signed] [short|long] int     int
unsigned [short|long] int     int             ✷(var ≥ 0)
float, double, long double    rat
Table 2. Translation of Bit-fields
C Syntax                                 SPL Translation
struct [struct tag] {                    type [ struct tag | str <struct Counter> ] = {
  unsigned int ident list1 : 1             ident list1 : bool
  unsigned int ident list2 : constant      ident list2 : [0..2^constant − 1]
  ..                                       ..
} ident list;                            }
                                         local ident list : [ struct tag | str <struct Counter> ]
struct struct tag ident list;            local ident list : struct tag
Translation of Data-Types. SPL has the data-types int, rat (rational) and bool. The mapping of the basic data-types from C to SPL is shown in Table 1. For data-types in C which are unsigned, an axiom stating that the value of the variable is always non-negative is generated by c2spl during translation. This is shown in the last column of Table 1, where ✷ is a temporal operator read as 'always'. A variable in SPL can have mode in, out or local. Variables of mode in cannot be assigned any value, out variables cannot be read, and local variables can be both read and written. Variables qualified with the const keyword in C are declared to have mode in in the SPL code. A dummy variable RET is introduced by c2spl to model the return statement of C. This variable is declared as an out variable. All other variables are declared as local variables. Arrays, structures, unions, enumerations and typedefs are also translated by c2spl into semantically equivalent constructs. Table 2 shows the translation of bit-fields. An extra variable str<struct_Counter> is introduced if the structure tag is missing in the C declaration. c2spl generates additional axioms to handle unions of bit-fields. The translation of pointers in the absence of aliasing is also handled.
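A small illustration of these rules (variable and function names invented): the const parameter would receive mode in, the generated RET variable mode out, the remaining variables mode local, and the unsigned variable an additional axiom stating that it is always non-negative.

/* Hypothetical fragment illustrating the variable-mode mapping. */
unsigned int counter;          /* local; extra axiom: always (counter >= 0) */

int offset(const int base)     /* base  -> mode in                           */
{                              /* RET   -> dummy out variable for the value
                                  returned by this function                  */
    int delta = 2;             /* delta -> mode local                        */
    return base + delta;       /* modelled as an assignment to RET           */
}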
4.2 Formal Annotations and Their Translation
ACE provides syntax for inserting seven types of formal annotations. Annotations are enclosed in /* and */ and hence are treated as comments by C compilers. The annotations specifying the functional requirements are written in the
Table 3. Translation of Formal Annotations
Annotation                 SPL code        Specification
/*pre formula1 end*/       pre1:skip       AXIOM a1 : ✷(pre1 → formula1)
/*post formula2 end*/      post2:skip      PROPERTY p2 : ✷(post2 → formula2)
/*assert formula3 end*/    assert3:skip    PROPERTY p3 : ✷(assert3 → formula3)

Table 4. Translation of Function Specifications
Annotation                 SPL code        Specification
/*prefunc formula4 end*/   pref1:skip      PROPERTY p1 : ✷(pref1 → formula4)
function-call
/*postfunc formula5 end*/  postf2:skip     AXIOM a2 : ✷(postf2 → formula5)
syntax shown in column 1 of Table 3. Their translations, as performed by c2spl, are also shown in the table. Here formula1 denotes a pre-condition, formula2 a post-condition and formula3 an intermediate assertion. The symbols pre1, post2 and assert3 are labels denoting control locations, generated by c2spl, to establish the correspondence between formulae and the locations at which they hold. The formulae are written in linear temporal logic in the syntax of STeP. The effect of calls to library functions can be expressed using prefunc and postfunc annotations. These annotations are inserted before and after the function-call as shown in Table 4. The translation done by c2spl is shown alongside. ACE provides a var type of annotation to support the declaration of auxiliary variables. It also provides an always type of annotation to specify a global invariant which is known to be true. The always annotation is translated to an axiom by c2spl and is mainly used to handle unions of bit-fields.
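A minimal annotated function in this syntax might look as follows (the function and its conditions are invented, and the use of RET for the returned value follows Sect. 4.1); c2spl would turn the pre annotation into an axiom and the assert and post annotations into properties at the corresponding control locations.

/* Hypothetical annotated C function illustrating the pre, assert
   and post annotation types of ACE. */
int clamp(int v)
{ /*pre (v >= -100) /\ (v <= 100) end*/
    if (v < 0) {
        v = 0;
    }
    /*assert (v >= 0) /\ (v <= 100) end*/
    return v;
    /*post (RET >= 0) /\ (RET <= 100) end*/
}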
5 Example

The C functions given below read data (InputX and InputY) from 12-bit registers and convert them to values (final_data) suitable for display. The specification file generated for the function get_inputsXY follows the C code. Due to lack of space the SPL code generated by c2spl is not given. The function read_from_reg is a library function whose specification is inserted in the body of function get_inputsXY as a postfunc type assertion, while for the functions change_to_v and convert_to_d the proofs are carried out separately and their specifications are automatically inserted in the SPEC file as shown below.

C Code

#include
#define RCD3_X 1
#define RCD3_Y 2
#define RCD2 2
#define RCD3 3
typedef unsigned short WORD;
typedef unsigned char BYTE;
struct RCD3_data { double X, Y; };
extern int read_from_reg(int regnum, WORD *D);
/****** Function prototype declarations *******/
void get_inputsXY(struct RCD3_data *final_data)
{
    WORD InputX, InputY;
    BYTE input_src = RCD3;
    double tempX, tempY;
    int ret1, ret2;
    ret1 = read_from_reg( 1, &InputX );
    /*postfunc ( InputX >= 0 /\ InputX <= 4095 ) end*/
    ret2 = read_from_reg( 2, &InputY );
    /*postfunc ( InputY >= 0 /\ InputY <= 4095 ) end*/
    change_to_v(InputX, input_src, &tempX );
    /*assert !(tempX < 0 \/ tempX > 5) end*/
    change_to_v(InputY, input_src, &tempY);
    final_data->X = tempX;
    final_data->Y = tempY;
    convert_to_d(1, tempX, final_data);
    /*assert (#X final_data >= -180) /\ (#X final_data <= 180) end*/
    if((final_data->X > 80) || (final_data->X < -80)){
        final_data->X = 0;
    }
    /*post (#Y final_data = ( (5.0/4096.0) * InputY )) /\
           ( !(#Y final_data < 0 \/ #Y final_data > 5)) /\
           (!((#X final_data > 80) \/ (#X final_data < -80))) end*/
}
/******************************************************/
void change_to_v(WORD D_input, BYTE input_src, double *ptr)
{
    /*pre (input_src = RCD2) \/ (input_src = RCD3) end*/
    switch( input_src ){
        case RCD2 : *ptr = ( (5/2048) * D_input - 5.0 ); break;
        case RCD3 : *ptr = ( (5/4096) * D_input ); break;
        default : break;
    }
    /*post ((input_src = RCD3) /\ (ptr = ((5/4096) * D_input))) \/
           ((input_src=RCD2) /\ (ptr = ((5/2048) * D_input - 5.0))) end*/
}
/******************************************************/
void convert_to_d(WORD src, double input, struct RCD3_data *deg)
{
    /*pre (src = RCD3_X) \/ (src = RCD3_Y) end*/
    switch (src) {
        case RCD3_Y : deg->Y = ((180 / 5.0) * input - 90.0); break;
        case RCD3_X : deg->X = ((360.0 / 5.0) * input) - 180.0; break;
        default : break;
    }
    /*post (src = RCD3_X /\ #X deg = (360.0 / 5.0) * input - 180.0) \/
           (src = RCD3_Y /\ #Y deg = (180 / 5.0) * input - 90.0) end*/
}

SPEC File for Function get_inputsXY:

SPEC
AXIOM    a1 : postf1 ==> ( InputX >= 0 /\ InputX <= 4095 )
AXIOM    a2 : postf2 ==> ( InputY >= 0 /\ InputY <= 4095 )
PROPERTY p3 : prefunc3 ==> (input_src = 2) \/ (input_src = 3)
AXIOM    a4 : postf4 ==> ((input_src = 3) /\ (tempX = ((5.0/4096.0) * InputX))) \/
                         ((input_src = 2) /\ (tempX = ((5.0/2048.0) * InputX - 5.0)))
PROPERTY p5 : assert5 ==> !(tempX < 0 \/ tempX > 5)
PROPERTY p6 : prefunc6 ==> (input_src = 2) \/ (input_src = 3)
AXIOM    a7 : postf7 ==> ((input_src = 3) /\ (tempY = ((5.0/4096.0) * InputY))) \/
                         ((input_src = 2) /\ (tempY = ((5.0/2048.0) * InputY - 5.0)))
PROPERTY p8 : prefunc8 ==> (1 = 1) \/ (1 = 2)
AXIOM    a9 : postf9 ==> ( (1 = 1) /\ (#X final_data = ((360.0 / 5.0) * tempX) - 180.0)) \/
                         ((1 = 2) /\ (#Y final_data = (180 / 5.0) * tempX - 90.0))
PROPERTY p10 : assert10 ==> (#X final_data >= -180) /\ (#X final_data <= 180)
PROPERTY p11 : post11 ==> (#Y final_data = ((5.0/4096.0) * InputY)) /\
                          ( !(#Y final_data < 0 \/ #Y final_data > 5) ) /\
                          (!((#X final_data > 80) \/ (#X final_data < -80)))

In the SPEC file given above, axioms a1 and a2 correspond to the post-conditions of the library function read_from_reg. Identifiers postf1 and postf2 denote control-locations in the SPL code. Property p3 and axiom a4 are the pre- and post-conditions of function change_to_v respectively. Properties p5, p10 and p11 are the proof-obligations supplied by the user. The post-conditions of the functions convert_to_d and change_to_v were discharged automatically using the tactic
repeat(B-INV;else[Check-Valid,skip;WLPC]) and by invoking the corresponding pre-condition. The post-condition and assertions of the function get_inputsXY were proved by repeatedly applying the B-INV and WLPC rules and invoking the postfuncs at the place of the function-calls.
6 Initial Operational Experience
ACE in its current form has been used in the verification of many real programs from safety-critical embedded systems performing control, process interlock, and data-acquisition and display functions. The process interlock software analysed by ACE was generated using a tool. The tool took logic diagrams, composed of function blocks and super-blocks, as input to generate C code that becomes part of the runtime code. It was required to verify that the generated code implemented the logic specified in the diagrams input to the code-generation tool. In this case the post-conditions were obtained from the logic diagrams and the emitted C code was annotated. The post-conditions were then proved, thus validating the translation of diagrams into C code. The software had approximately 6000 lines of code and 54 C functions. Around 500 properties were proved automatically using tactics. In yet another system, where the code was manually developed, the formal specifications were arrived at from the design documents and through discussions with the system designers. Around 110 properties were derived for the software, made up of 4000 lines of code and roughly 40 C functions. For proving properties of some functions, human assistance in the form of selecting the appropriate invariants and axioms was required. The typical properties verified were of the following nature:
– Range checks on variables
– Arithmetic computations
– Properties specifying software controlled actions
– Intermediate asserts on values of variables

7 Conclusion and Future Work
Initial experience with ACE has shown that we could verify embedded system software, developed to comply with stringent quality standards, with relative ease and within reasonable time. The technique of compositional verification helps in proving higher-level properties by splitting the task of verification into small, manageable program units. The properties of small program units, where size and complexity have been controlled, can generally be obtained and proved cheaply. The SPARK Examiner [6], which also supports static assertion checking, is a commercially available tool for programs written in SPARK-Ada, a subset of Ada. It uses a proprietary Simplifier and Proof-checker, while we have interfaced ACE with STeP, a powerful theorem proving tool. The use of proof tactics and
other features of STeP, such as built-in theories, decision procedures, simplification rules and local invariance generation, makes the verification of most embedded system programs easy, which is a great advantage. There are tools such as Bandera [8], Java Pathfinder [9] and Automation Extractor (AX) [10] that take the source code of the system as input and generate a finite-state model of the system. The system properties are then verified by applying model-checking techniques. The work presented in this paper is targeted towards proving the functional correctness of sequential program code and adopts the theorem-proving approach to formal verification. In our future work we propose to use property-guided slicing of C programs prior to translation to SPL. This can further reduce the model size and hence the effort and time required in deducing properties using the theorem prover.
Acknowledgement The authors wish to acknowledge the BRNS, DAE for supporting this work. Thanks are also due to A. K. Bhattacharjee for comments and help with the manuscript.
References
[1] Guidelines for the Use of the C Language in Vehicle Based Software. The Motor Industry Software Reliability Association, 1998.
[2] C. A. R. Hoare: An Axiomatic Basis for Computer Programming. Communications of the ACM, 12:576-580, 1969.
[3] Nikolaj Bjorner et al.: The Stanford Temporal Prover User's Manual. Stanford University, 1998.
[4] E. W. Dijkstra: A Discipline of Programming. Prentice-Hall, 1976.
[5] Ken Wong, Jeff Joyce: Refinement of Safety-Related Hazards into Verifiable Code Assertions. SAFECOMP'98, Heidelberg, Germany, Oct. 5-7, 1998.
[6] John Barnes: High Integrity Ada - The SPARK Approach. Addison Wesley, 1997.
[7] Shawn Flisakowski: Parser and Abstract Syntax Tree Builder for the C Programming Language. ftp site at ftp.cs.wisc.edu:/coral/tmp/spf/ctree 14.tar.gz
[8] J. Corbett, M. Dwyer, et al.: Bandera: Extracting Finite State Models from Java Source Code. Proc. ICSE 2000, Limerick, Ireland.
[9] G. Brat, K. Havelund, S. Park and W. Visser: Java PathFinder - A Second Generation of a Java Model Checker. Workshop on Advances in Verification, July 2000.
[10] G. J. Holzmann: Logic Verification of ANSI-C Code with SPIN. Bell Laboratories, Lucent Technologies.
Safety Analysis of the Height Control System for the Elbtunnel
Frank Ortmeier1, Gerhard Schellhorn1, Andreas Thums1, Wolfgang Reif1, Bernhard Hering2, and Helmut Trappschuh2
1 Lehrstuhl für Softwaretechnik und Programmiersprachen, Universität Augsburg, 86135 Augsburg, Germany
{ortmeier,schellhorn,thums,reif}@informatik.uni-augsburg.de
2 Siemens – I&S ITS IEC OS, 81359 München, Germany
{[email protected],helmut.trappschuh@abgw}.siemens.de
Abstract. Currently a new tunnel tube crossing the river Elbe is being built in Hamburg. Therefore a new height control system is required. A computer examines the signals from light barriers and overhead sensors to detect vehicles which try to drive into a tube with insufficient height. If necessary, it raises an alarm that blocks the road. This paper describes the application of two safety analysis techniques to this embedded system: model checking has been used to prove functional correctness with respect to a formal model. Fault tree analysis has validated the model and considered technical defects. Their combination has uncovered a safety flaw, led to a precise requirement specification for the software, and showed various ways to improve system safety.
1 Introduction

This paper presents the safety analysis of the height control for the Elbtunnel. It is a joint project of the University of Augsburg with Siemens, department 'Industrial Solutions and Services' in Munich, which is a sub-contractor in the Elbtunnel project, responsible for the traffic engineering. The Elbtunnel is located in Hamburg and goes beneath the river Elbe. Currently this tunnel has three tubes, through which vehicles with a maximum height of 4 meters may drive. A new, fourth tube will be going into operation in the year 2003. It is a larger tube and can be used by overhigh vehicles. A height control should prevent these overhigh vehicles from driving into the smaller tubes. It avoids collisions by triggering an emergency stop and locking the tunnel entrance. Because the system consists of software and hardware components, we combine two orthogonal methods, model checking and fault tree analysis, from the domains of software development and engineering respectively. Model checking is used to prove safety properties like 'no collision'. Fault tree analysis (FTA) examines sensor failure and reliability.
We will briefly describe the layout of the tunnel, the location of the sensors, and its functionality in Sect. 2. The formalization of the system and model checking of safety properties are presented in Sect. 3 and Sect. 4. The fault tree analysis in Sect. 5 completes the safety analysis. Some weaknesses of the system have been discovered which led to the proposals for improvements given in Sect. 6. Finally, Sect. 7 concludes the paper.
2 The Elbtunnel Project

The Elbtunnel project is very complex. Besides building the tunnel tube, it contains traffic engineering aspects like dynamic route control, locking of tunnel tubes, etc. We will consider only a small part of the whole project, the height control. Currently a height control exists for the 'old' three tubes. Light barriers scan the lanes for vehicles which are higher than 4 meters and trigger an emergency stop. The existing height control has to be enhanced such that it allows overhigh vehicles to drive through the new, higher tube, but not through the old ones. In the following, we will distinguish between high vehicles (HVs), which may drive through all tubes, and overhigh vehicles (OHVs), which can only drive through the new, fourth tube. Figure 1 sketches the layout of the tunnel. The fourth tube may be used from north to south and the east-tube from south to north. We focus our analysis on the northern entrance, because OHVs may only drive from north to south. The driving direction on each of the four lanes of the mid- and west-tube can be switched, depending on the traffic situation. Flexible barriers, signals and road fires guide drivers to the tubes which are open in their direction. The system uses two different types of sensors. Light barriers (LB) scan all lanes of one direction to detect if an OHV passes. For technical reasons they cannot be installed in such a way that they only supervise one lane. Therefore overhead detectors (OD) are necessary to detect on which lane a HV passes.
HV: high vehicle. OHV: overhigh vehicle. LB: light barrier, detecting OHVs (LBpre and LBpost). OD: overhead detector, detecting HVs and OHVs (undistinguishable); ODright, ODleft and ODfinal. West-, mid-, east-tube: existing tubes; HVs may drive through, but not OHVs. 4. tube: new, higher tube; OHVs may drive through.
Fig. 1. Layout of the northern tunnel entrance
The ODs can distinguish vehicles (e.g. cars) from high vehicles (e.g. buses, trucks), but not HVs from OHVs (but light barriers can!). If the height control detects an OHV heading towards a tube other than the fourth one, an emergency stop is signaled, locking the tunnel entrance. The idea of the height control is that the detection starts if an OHV drives through the light barrier LBpre. To prevent unnecessary alarms through faulty triggering of LBpre, the detection will be switched off after expiration of a timer (30 minutes). Road traffic regulations require that after LBpre both HVs and OHVs have to drive on the right lane through tunnel 4. If nevertheless an OHV drives on the left lane towards the west-tube, detected through the combination of LBpost and ODleft, an emergency stop is triggered. If the OHV drives on the right lane through LBpost, it is still possible for the driver to switch to the left lanes and drive to the west- or mid-tube. To detect this situation, the height control uses the ODfinal detector. To minimize undesired alarms (remember that normal HVs may also trigger the ODs), a second timer will switch off detection at ODfinal after 30 minutes. For safe operation it is necessary that after the location of ODfinal it is impossible to switch lanes. Infrequently, more than one OHV drives on the route. Therefore the height control keeps track of several, but at most three, OHVs.
3 Formal Specification

In this section we define a formal specification of the Elbtunnel using timed automata over a finite set of states. An automaton is shown as a directed graph, where states are nodes. A transition is visualized as an arrow marked with a condition c and a time interval [t1,t2] as shown in Fig. 2. A transition from s1 to s2 may be taken indeterministically at any time between t1 and t2, if the condition c holds from now until this time. The time interval is often just [t,t] (deterministic case) and we abbreviate this to t. If t = 1, a transition will happen in the next step (provided the condition holds) and we have the behavior of an ordinary, untimed automaton. The always true condition is indicated as '−'.

Fig. 2. Transition

The automata defined below are graphic representations of the textual specification that we used in the model checker RAVEN [7], [8], [6]. We also tried the model checker SMV [4], which supports untimed automata only, but is more efficient. Timed automata are translated to untimed ones using intermediate states which say "the system has been in state s1 for n steps", see [6]. The specification consists of two parts. The first specifies the control system that is realized in software and hardware, the second describes the environment, i.e. which possible routes HVs and OHVs can follow. Our aim is to prove (Sect. 4) that the control system correctly reacts to the environment: for example we will prove that if an OHV tries to go into the mid- or west-tube (behavior of the environment) then the control system will go into a state that signals an emergency stop (reaction of the control system).
Both parts of the specification consist of several automata: the control system consists of two automata COpre and COpost, which will be implemented in software. The first counts OHVs between LBpre and LBpost, the second checks whether there are any after LBpost. Each uses a timer (TIpre and TIpost), modeled as an instance of the same automaton with different input and output signals. The environment consists of three identical automata OHV1, OHV2, OHV3 for OHVs. This is sufficient, since it is assumed that at most three OHVs may pass simultaneously through the tubes. Finally, three automata HVleft, HVright, HVfinal model HVs that trigger the sensors ODleft, ODright, and ODfinal. They are instances of a generic automaton describing HVs. Altogether the system consists of ten automata running in parallel:

SYS = COpre ‖ COpost ‖ TIpre ‖ TIpost ‖ OHV1 ‖ OHV2 ‖ OHV3 ‖ HVleft ‖ HVright ‖ HVfinal

The following sections will describe each automaton in detail.

Specification of a Timer. The generic specification of a timer is shown in Fig. 3. Initially the timer is in state 'off', marked with the ingoing arrow. When it receives a 'start' signal, it starts 'running' for time 'runtime'. During this time, another 'start' signal is interpreted as a reset. After time 'runtime' the timer signals 'alarm' and finally turns off again. Two timers, which have a runtime of 30 minutes, are used in the control system. The first (TIpre) is started when an OHV passes through LBpre, i.e. 'start' is instantiated with LBpre. After 30 minutes it signals TIpre.alarm to COpre. Similarly, the second timer is triggered when an OHV passes through LBpost. It signals TIpost.alarm to COpost.

Fig. 3. Timer TI

Specification of Control for LBpre. The control automaton COpre (shown in Fig. 4) controls the number of OHVs between the two light barriers LBpre and LBpost. Starting with a count of 0, every OHV passing LBpre increments the counter, every OHV passing LBpost decrements it. If COpre receives TIpre.alarm, i.e. if for 30 minutes no OHV has passed through LBpost, the counter is reset. Actually the automaton shown in Fig. 4 is not completely given; some edges which correspond to simultaneous events are left out for better readability.

Fig. 4. Control COpre for LBpre
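As a purely illustrative aside (not code from the project), the timer behaviour described above can be rendered as a small C state machine that is advanced once per time unit; the names and the tick-based timing are assumptions of this sketch.

#include <stdbool.h>

enum timer_state { OFF, RUNNING, ALARM };

struct timer {
    enum timer_state state;
    int remaining;           /* time units left until the alarm fires */
};

/* One step of the timer automaton TI: 'start' is the input signal
   (LBpre for TIpre, LBpost for TIpost), 'runtime' is 30 minutes.   */
void timer_step(struct timer *t, bool start, int runtime)
{
    switch (t->state) {
    case OFF:
        if (start) {                       /* off --start--> running  */
            t->state = RUNNING;
            t->remaining = runtime;
        }
        break;
    case RUNNING:
        if (start) {                       /* restart acts as a reset */
            t->remaining = runtime;
        } else if (--t->remaining == 0) {  /* runtime elapsed         */
            t->state = ALARM;              /* signal 'alarm'          */
        }
        break;
    case ALARM:
        t->state = OFF;                    /* alarm, then turn off    */
        break;
    }
}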
Specification of Control for LBpost. Figure 5 shows the automaton COpost which controls OHVs in the area after LBpost. It has fewer states than COpre, since it just needs to know whether there is at least one OHV in the critical section between LBpost and the entries of the tubes, but not how many. If at least one OHV is in the critical section then the automaton is in state 'active', otherwise in 'free'. To signal an emergency stop, the automaton goes to state 'stop'. To avoid false alarms, COpost interprets the interruption of LBpost as a misdetection if COpre has not detected the OHV before (i.e. when COpre = 0). There are two reasons for an emergency stop: either ODfinal signals that an OHV (or a HV) tries to enter the mid- or west-tube while an OHV is in the critical section (i.e. we are in state 'active'). Or an OHV tries to drive through LBpost, but has not obeyed the signs to drive on the right lane. This rule must be refined with two exceptions. If tube 4 is not available, an emergency stop must be caused even if the OHV is on the right lane. The other way round, no emergency stop must be signalled if only tube 4 is available, even if the OHV does not drive on the right lane. This means that the conditions LBstop and LBok for going to the 'stop' resp. 'active' state must be defined as:

LBok   := LBpost and COpre ≠ 0 and (only tube 4 open or (ODright and not ODleft))
LBstop := LBpost and COpre ≠ 0 and (tube 4 closed or not (ODright and not ODleft))

Fig. 5. Control COpost for LBpost
Specification of Overhigh Vehicles. The specification given in Fig. 6 shows the possible behavior of one OHV that drives through the Elbtunnel. It is the core of the environment specification. We can implement the control system such that it behaves exactly as the specification prescribes, but we have no such influence on reality. Whether the real environment behaves according to our model must be validated carefully (e.g. with fault tree analysis, see Sect. 5).

Fig. 6. Overhigh vehicle OHV

The model describes the possible movement of an OHV through the light barriers and the tunnel: Initially the OHV is 'absent'. It remains there for an unknown period of time (idling transition of the 'absent' state). At any time it may reach
LBpre. In the next step the OHV will be 'between' the two light barriers and will remain there for up to 30 minutes. Then it will drive through LBpost. Either it will do so on the right lane, which will trigger ODright, or on the left lane and cause signal ODleft. After passing LBpost it will reach the 'critical' section where the lanes split into the lane to tube 4 and the lane to mid- and west-tube. The OHV may stay in this critical section for up to 30 minutes. Then it will either drive correctly to the entry of 'tube 4' and finally return to state 'absent', or it will pass through ODfinal and reach the entry of the mid- or west-tube (state 'MW'). In this case the control system must be designed such that an emergency stop has been signalled, and we assume that the OHV will be towed away then (return to state 'absent'). Our complete specification runs three automata OHV1, OHV2, OHV3 in parallel. The signal LBpre that is sent to COpre is the disjunction of the OHVi being in state LBpre. Signals LBpost, ODleft etc. are computed analogously.

Specification of High Vehicles. High vehicles which are not overhigh are relevant for the formal specification only since they can trigger the ODs. Therefore it is not necessary to specify which possible routes they drive or how many there potentially may be. Instead we only define three automata ODleft, ODright and ODfinal, which say that at any time the corresponding 'OD' signal may be triggered. Fig. 7 gives a generic automaton that randomly switches between the two states.

Fig. 7. High vehicle HV
4 Proving Safety Properties

The formal specification of the Elbtunnel allows us to state safety properties and to prove them rigorously as theorems over the specification. Since the model has only finitely many states, we can use model checkers, which are able to prove theorems automatically. We used two model checkers, SMV and RAVEN. Both have the ability to generate counter examples, i.e. if a theorem we try to prove turns out to be wrong, they return a run of the system which is a counter example to the theorem. To keep the number of states in the model small, we have also avoided an exact definition of the duration of one step of the automaton in real time. Of course the value of 30 (minutes), which we used for the maximal time that an OHV is between the two LBs, is much more than 30 times longer than the time the OHV needs to cross an LB. For the real experiments we have even used a value of 5 or 6 to keep the runtimes of the proofs short (with a value of 30 proofs take several hours, while proofs using 5 go through in seconds). The exact value used in the proofs does not influence the results, as long as the maximum times and the runtime of the timers agree.
Despite these inadequacies the model can serve to analyze all safety properties we found relevant. The most important is of course that whenever an OHV is driving to the entry of the mid- or west-tube, an emergency stop will be raised. This is formalized by the following formula in temporal logic:

AG((OHV1 = MW ∨ OHV2 = MW ∨ OHV3 = MW) → COpost = stop)

The formula is read as: For every possible run of the system it is always the case (temporal operator AG) that if one of OHV1, OHV2, OHV3 is in state 'MW', then COpost is in state 'stop'. The first verification results have been negative due to several specification errors. There have been basically two classes: simple typing errors, and errors due to simultaneous signals. E.g. our initial model included only transitions for one signal in the control automaton shown in Fig. 4 and left out the case that one OHV passes LBpre while another passes LBpost at the same time. Additional transitions are necessary when two signals occur simultaneously (but they are left out in Fig. 4 for better readability). Both types of errors have easily been corrected, since each failed proof attempt resulted in a trace that clearly showed what went wrong. After we corrected the errors, we finally found that we still could not prove the theorem. It does not hold, as can be demonstrated by the following run:
1. Initially, all OHVs are 'absent', COpre is in state '0', and COpost signals 'free'.
2. Then two OHVs drive through the light barrier LBpre at the same time.
3. LBpre cannot detect this situation, so COpre counts only one OHV and switches into state '1'.
4. The first of the two OHVs drives through the second light barrier fast, resetting the state of COpre to '0'. COpost switches into the 'active' state, and starts timer TIpost.
5. The other OHV takes some time to reach LBpost. COpost assumes the signal from LBpost to be a misdetection, since COpre is in state '0'. Therefore it does not restart timer TIpost.
6. The second OHV now needs longer than the remaining time of TIpost to reach ODfinal. TIpost signals 'alarm' to COpost, which switches to state 'free' again.
7. The OHV triggering ODfinal while COpost is in state 'free' is assumed to be a misdetection and does not cause an emergency stop.
8. The OHV can now enter the mid- or west-tube without having caused an emergency stop.
This run shows that our system is inherently unsafe, since two OHVs that pass simultaneously through LBpre are recognized as one. Whether the flaw is relevant in practice depends on the probability of its occurrence, which is currently being analyzed. To get a provable safety property, we now have two possibilities: either we can modify the model; possible modifications for which the safety property can be proved are discussed in Sect. 6. Or we can weaken the safety property to exclude this critical scenario. Then we can prove that the critical scenario described above is the only reason that may cause safety to be violated:

Theorem 1 (safety). System SYS has the following property: If two OHVs never pass simultaneously through LBpre, then any OHV trying to enter the middle or western tube will cause an emergency stop.
In practice this means that if the height control is implemented following the description of Sect. 3, then it will be safe except for the safety flaw described above. It is interesting to note that we detected the flaw only in the RAVEN specification. We did not find the problem with SMV, since we had made an error in the translation of timed automata to untimed automata. This error resulted in a specification in which all OHVs had to stay for the full 30 minutes between LBpre and LBpost. This prevented the problem from showing up: two OHVs driving simultaneously through LBpre also had to drive through LBpost at the same time. The incident shows that using a specification language which does not explicitly support time is a source for additional specification errors. Safety is not the only important property of the height control in the Elbtunnel (although the most critical one). Another desirable property is the absence of unnecessary alarms. Here we could prove:

Theorem 2 (emergency stops). If tube 4 is open and not the only open tube, then an emergency stop can only be caused by
a) an OHV driving on the left lane through LBpost, or
b) an OHV driving through ODfinal, or
c) an OHV driving on the right lane through LBpost while a high vehicle is driving on the left lane, or
d) a high vehicle at ODfinal while timer TIpost is running.

The first two causes are correctly detected alarms, the other two are false alarms inherent in the technical realization of the system (and of course already known). Formal verification proves the absence of other causes for false alarms. Finally we also analyzed how the availability of tubes influences the situation. This led to the following theorem:

Theorem 3 (tube 4 availability). If tube 4 is not available, then any OHV trying to drive through will cause an emergency stop. If only tube 4 is available, then an emergency stop will never occur.

Summarizing, the formal analysis has led to a precise requirement specification for the automata that should be used in the height control. A safety flaw has been discovered and proven to be the only risk for safety. False alarms due to other reasons than the known ones mentioned in Theorem 2 have been ruled out. The results of formal analysis show that the control system does not have inherent logic faults. The formal proofs of this section do not give an absolute safety guarantee, but only relative to the aspects considered in the formal model. For example we did not consider technical defects, which are covered by the fault tree analysis presented in the next section.
5 Fault Tree Analysis

Another approach to increase the overall system safety is fault tree analysis (FTA) [10]. FTA is a technique for analyzing the possible basic causes (primary failures) for a given hazard (top event).
The top event is always the root of the fault tree and the primary failures are its leaves. All inner nodes of the tree are called intermediate events (see Fig. 8). Starting with the top event, the tree is generated by determining the immediate causes that lead to the top event. They are connected to their consequence through a gate. The gate indicates if all (AND-gate) or any (OR-gate) of the causes are necessary to make the consequence happen. This procedure has to be applied recursively to all causes until the desired level of granularity is reached (this means all causes are primary failures that won't be investigated further).

Fig. 8. FT symbols: event, AND-gate, OR-gate, primary failure

We analyzed two different hazards for the Elbtunnel height control: the collision of an OHV with the tunnel entrance and the tripping of a false alarm. We will use the hazard collision to illustrate FTA (see Fig. 9). The immediate causes of the top event - collision of an OHV with the tunnel entrance - are that either the driver ignores the stop signals OR (this means the causes are connected through an OR-gate) that the signals are not turned on. The first cause is a primary failure. We can do nothing about it, but to disbar the driver from his license. The second cause is an intermediate event. Its immediate causes are a) that the signal lights are broken or b) the signals were not activated. Again the first one is a primary failure and the second is an intermediate event, which has to be investigated further. The minimal set of primary failures which are necessary to make the hazard happen for sure is called a minimal cut set. Cut sets which consist of only one element are called single point failures. This means the system is very susceptible to this primary failure.

Fig. 9. Fault tree for hazard collision

We will not present the whole tree here, but only discuss the most interesting results. One of these is the fact that the original control system had a safety gap. This gap may be seen in one of the branches of the fault tree (see Fig. 10).

Fig. 10. Safety gap

The direct causes for the (intermediate) event OHV not detected at LBpre are malfunctioning of LBpre or synchronous passing of two OHVs through the light barrier LBpre. The malfunction is a primary failure. But the second cause represents a safety gap in system design. In contrast to all other primary failures in the fault tree this event is
a legal scenario. Although all components of the system are working according to their specification, the hazard collision may still occur. This must not happen in a safe control system. The FTA of the control system examines the system's sensitivity to component failures. There are no AND-gates in the collision fault tree. This means that there is no redundancy in the system - making it effective on the one hand but susceptible to failure of each component on the other hand. The false alarm fault tree is different. This tree has several AND-gates. Especially, misdetection of the pre-control light barrier LBpre appears in almost all minimal cut sets (at least in all those scenarios where no OHV is involved at all). This means that the system is resistant to single point failures (i.e. only one primary failure occurs) with regard to the triggering of false alarms. Most failure modes in this fault tree are misdetections. These failures are by far more probable than a not detected OHV; e.g. a light barrier can easily interpret the interruption by a passing bird as an OHV, but it is very improbable that it still detects the light beam while an OHV is passing through. Fault tree analysis yields some other interesting results. We could show that all the measures taken are complementary in their effects on the two discussed main hazards collision and false alarm, as shown in Fig. 11. E.g. the pre-control LBpre decreases the risk of false alarms, but - less obvious - it increases the risk for collisions as well. However this is a qualitative proposition. Of course it decreases the first probability much more than it increases the second one. These results correlate with the intention of the system to decrease the high number of false alarms significantly, while still keeping the height control safe in terms of collision detection (the actual height control triggers about 700 false alarms per year).

Fig. 11. Complementary effects of the timer runtimes TI1 and TI2, the pre-control LBpre, and ODleft/right on the risks of collision and false alarm
Improvements
The safety analysis led to some suggestions for improvements and changes in the control logic as well as in the physical placement of the sensors. These changes were handed back as analysis feedback to our partners. Their cost-value benefit is being discussed and it is very likely that they will be implemented. In the first part of this section we describe two possible solutions to close the safety leak. Then we will explain some improvements for the overall performance and quality of the system, which were discovered through the safety analysis. Measures to Close the Safety Gap The first suggestion is the better one in terms of failure probability, but the second one can be implemented without any additional costs.
Installing additional ODs at LBpre. The first possibility of closing the safety gap of simultaneous passing of two OHVs at LBpre is to install additional ODs at the pre-control signal bridge. With ODs above each lane one can keep track of the actual number of HVs that are simultaneously passing the pre-control checkpoint. Every time the light barrier is triggered, the counter COpre (see Fig. 4) will not only be increased by one but by the number of HVs detected by the (new) ODs. With this information it may be assured that the counter is always at least equal to the number of OHVs in the area between pre- and post-control. The counter can still be incorrect, e.g. simultaneous passing of an OHV and a HV through the pre-control LBpre will increase it by two (instead of one), but it can only be higher than the actual number of OHVs. This closes the safety gap. It increases the probability of false alarms only insignificantly.

Never stop TIpre. An alternative to the solution described above is to keep TIpre always active for its maximum running time and restart it with every new OHV entering the controlled area. If this solution is chosen, the counter COpre will be redundant; it will be replaced by an automaton similar to COpost. The advantage of this measure is of course that no additional sensors must be installed. Only a change in the control logic is required. On the other hand it has a higher increase in false alarm probability than the option with ODs (as the alarm-triggering detectors are kept active longer).

Changes to Improve Overall System Quality. We will now give useful advice for changes which increase the overall performance and quality of the system. Unfortunately quantitative information on failure probabilities and hazard costs was not available to us. But the presented changes will improve the overall quality for all realistic failure rates. Both measures described here are aimed at reducing the number of false alarms.

Additional light barrier at entrance of tube 4. The FTA showed that one important factor for the hazard false alarm is the total activation time of ODfinal. This is because each HV passing one of the ODfinal sensors immediately leads to a false alarm if the sensor is activated. As described above, these detectors are active while TIpost is running. An additional light barrier at the entrance of tube 4 can be used to detect OHVs that are leaving the endangered area between post control and tunnel entrance. This can be used to stop TIpost and keep it running only while there are OHVs left in the last section. It will be necessary to use a counter to keep track of the number of OHVs in the post sector. Another advantage is that the timeout for TIpost may be chosen much more conservatively without significantly increasing the risk of false alarms, but decreasing the risk of collisions a lot. It is important to make the risk of misdetections of this additional light barrier as low as possible, as misdetections could immediately lead to collisions. This can be done by installing a pair of light barriers instead of a single one and connecting them with an AND-connector.

Distinguished alarms at post control. To further decrease the risk of false alarms, one may only trigger an alarm at post control if ODleft detects a HV. If neither ODleft nor ODright detects a HV (or OHV), the ODfinal will be activated without
triggering an alarm. This means the system assumes that the post control light barrier had a misdetection if both ODs can't detect a high vehicle. But it still activates the ODfinal just in case (if either ODleft or ODright is defective). This measure increases the risk of collisions almost unnoticeably. But the reaction time between the alarm signals and the potential collision will decrease.
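As a sketch of how the first measure could affect the counter update (illustrative only; the function, its parameters and the exact update policy are invented here, not taken from the project):

/* Sketch of the revised COpre update for the measure "Installing
   additional ODs at LBpre": when LBpre is interrupted, the counter
   grows by the number of high vehicles seen by the new pre-control
   ODs (assumed here to be at least one), so it may over-count but
   should not under-count the OHVs that entered.                    */
void co_pre_update(int *counter, int lb_pre_triggered,
                   int hvs_under_pre_ods, int lb_post_triggered)
{
    if (lb_pre_triggered) {
        int n = (hvs_under_pre_ods > 1) ? hvs_under_pre_ods : 1;
        *counter += n;
    }
    if (lb_post_triggered && *counter > 0) {
        *counter -= 1;      /* one OHV has left the section at LBpost */
    }
}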
7 Conclusion

The safety analysis of the height control for the Elbtunnel has shown the benefit of combining formal verification and fault tree analysis. Both formal model checking and safety analysis are important to examine a complex system. Formal verification gives a precise requirements specification for the control system and discovers logical flaws in the control software. In our case study it has revealed the safety problem of two OHVs passing LBpre at the same time. On the other hand, building an adequate model for the system and its environment is not a simple process. To minimize specification errors an orthogonal analysis technique like FTA is needed. FTA also addresses the issue of component failures. This analysis gives a lot of useful hints for improving the system. The revised system can be formally model checked again, leading to a synergy effect. The presented case study, which required an effort of 3 person months, combines both techniques by exchanging results. A short overview of how to combine FTA and formal methods can be found in [5], and a detailed paper is in preparation. For a tighter integration a FTA semantics was developed [9], which is used to formally verify the completeness of fault trees. Proof support [2] is integrated into the interactive specification and verification system KIV [1], using statecharts [3] as the formal modeling language. In conclusion, we find that the combination of formal methods and FTA is a suitable analysis technique for embedded systems. Our analysis has made the system safer (by detecting and closing the safety gap), led to design improvements, and increased overall system quality.
References
[1] M. Balser, W. Reif, G. Schellhorn, K. Stenzel, and A. Thums. Formal system development with KIV. In T. Maibaum, editor, Fundamental Approaches to Software Engineering, number 1783 in LNCS. Springer, 2000.
[2] M. Balser and A. Thums. Interactive verification of statecharts. In Integration of Software Specification Techniques (INT'02), 2002.
[3] D. Harel. Statecharts: A visual formalism for complex systems. Science of Computer Programming, 8(3), 1987.
[4] K. L. McMillan. Symbolic Model Checking. Kluwer Academic Publishers, 1990.
[5] W. Reif, G. Schellhorn, and A. Thums. Safety analysis of a radio-based crossing control system using formal methods. In 9th IFAC Symposium on Control in Transportation Systems 2000, 2000.
[6] J. Ruf. RAVEN: Real-time analyzing and verification environment. Technical Report WSI 2000-3, University of Tübingen, Wilhelm-Schickard-Institute, January 2000.
[7] Jürgen Ruf and Thomas Kropf. Symbolic Model Checking for a Discrete Clocked Temporal Logic with Intervals. In E. Cerny and D. K. Probst, editors, Conference on Correct Hardware Design and Verification Methods (CHARME), pages 146-166, Montreal, 1997. IFIP WG 10.5, Chapman and Hall.
[8] Jürgen Ruf and Thomas Kropf. Modeling and Checking Networks of Communicating Real-Time Systems. In Correct Hardware Design and Verification Methods (CHARME 99), pages 265-279. IFIP WG 10.5, Springer, September 1999.
[9] G. Schellhorn, A. Thums, and W. Reif. Formal fault tree semantics. In The Sixth World Conference on Integrated Design & Process Technology, 2002. (to appear).
[10] W. E. Vesely, F. F. Goldberg, N. H. Roberts, and D. F. Haasl. Fault Tree Handbook. Washington, D. C., 1981. NUREG-0492.
Dependability and Configurability: Partners or Competitors in Pervasive Computing?
Titos Saridakis
NOKIA Research Center, PO Box 407, FIN-00045, Finland
[email protected]
Abstract. To foster commercial strength pervasive services, dependability and configurability concerns must be integrated tightly with the offered functionality in order for the pervasive services to gain the end-user's trust while keeping their presence transparent to him. This paper presents the way dependability and configurability correlate with pervasive services and analyzes their common denominators and their competing forces. The common denominators are used to derive a set of design guidelines that promote the integration of dependability and configurability aspects. The competing forces are used for revealing a number of challenges that software designers must face in pervasive computing.
1 Introduction

The explosive evolution of wireless communications in the past decade, combined with the equally impressive improvements in hand-held device technology (mobile phones, personal digital assistants or PDAs, palmtop computers, etc.), has opened the way for the development of ubiquitous applications and pervasive services. In [12], where pervasive computing is seen as the incremental evolution of distributed systems and mobile computing (see Fig. 1), four research thrusts are identified which distinguish pervasive computing from its predecessors: smart spaces (i.e. context awareness), invisibility (i.e. minimal user distraction), localized scalability (i.e. scalability within the environment where a pervasive service is used), and masking of uneven conditioning (i.e. graceful service quality degradation when space "smartness" decreases). All these four characteristics of pervasive computing imply directly or indirectly some form of configurability. A pervasive service should be able to adapt to different contexts and to different user profiles. It should also be reconfigurable in order to adjust to changes in its execution environment (e.g. increase of network traffic and decrease of space "smartness"). Hence, the quality attribute of configurability is inherently found in all pervasive services. On the other hand, different dependability concerns (i.e. availability, reliability, safety and security according to [8]) are tightly related to the four characteristics of pervasive computing. Availability of a service is of crucial importance when dealing with the scalability of that service. In a similar manner, reliability goes hand in hand with minimal user distraction.
Distributed systems: remote communication, fault tolerance, high availability, distributed security. Mobile computing adds: mobile networking, mobile information access, adaptive applications, energy-aware systems, location sensitivity. Pervasive computing adds: smart spaces, invisibility, localized scalability, uneven conditioning.
Fig. 1. The incremental evolution from distributed system and mobile computing to pervasive computing
And last, but most certainly not least, different aspects of the quality attribute of security are related to all four characteristics of pervasive computing (user privacy, authentication and authorization that are invisible to the user, and accountability in spaces with low "smartness" are only a few examples). The dependability and configurability concerns in the context of pervasive computing are not entirely new; rather, most of them are rooted back in distributed systems and mobile computing. However, the intensive needs of pervasive computing accentuate and transform these concerns in a way that makes their integration with the functionality provided by pervasive services indispensable. The main objective of this paper is to analyze the roles played by dependability and configurability in pervasive computing. The analysis of these roles reveals the common points of configurability, availability, reliability, safety and security which support the integration of these five quality attributes in the development of pervasive services. On the other hand, the analysis uncovers a number of discrimination factors among the aforementioned quality attributes which resist their integration. The analysis also identifies critical areas in the design and development of pervasive services that need to be carefully dealt with in order to result in successful products. The remainder of this paper is structured as follows: the next section describes the system model that is used in this paper, and is followed by a quick revision of the dependability and configurability quality attributes in section 3.
Section 4 presents the common denominators of the aforementioned quality attributes and discusses how they fit with the needs of pervasive services. Section 5, on the antipode, presents the competing forces that appear when trying to combine the same five quality attributes in the context of pervasive services. A discussion on the issues related to the integration of dependability and configurability concerns in pervasive computing takes place in section 6. The paper concludes in section 7 with a summary of the presented challenges in the development of pervasive services and a mention of the open issues.
2 System Model

The purpose of our system model is to provide the abstractions that describe the entities which participate in a pervasive computing environment. A number of scenarios have been described in the literature (e.g. see [4, 12]) giving a coarse-grained view on pervasive computing environments. In brief, these scenarios describe users who use personal, office and commercial information while moving from one physical location to another. The way they use this information depends on the devices they are carrying (personal and wearable computers) and the devices offered by the physical location in which the information access takes place (screens, projectors, printers, etc). Hence, these scenarios can be abstractly described in terms of devices, information, use of information, and physical locations where the information is used. Our system model provides four abstractions for capturing the above terms describing a pervasive computing environment. The asset abstraction captures all kinds of information that a user might want to use, which can be personal or office data, advertisements, flight-schedule updates, etc. The service abstraction captures the manipulation of assets when a user uses the corresponding information. The capability abstraction captures the devices that a user may use to manipulate and perceive a set of assets, which usually corresponds to the devices where information is manipulated and the devices that present to the user the outcome of that manipulation. Finally, the context abstraction captures the physical location in which assets are manipulated by services using the available capabilities. However, the context abstraction goes beyond the physical location by covering also other factors that can parametrize the interaction of assets, services and capabilities. These factors include time, the occurrence of pre-specified events, the emotional condition of the user, etc. Besides the above abstractions, our system model also defines four relations that capture the interactions among these abstractions. These relations are informally described below. In a given context, a service uses a capability to access an asset when the software that implements a service can manipulate the information corresponding to the given asset while executing on the device that corresponds to the given capability. When a service accesses an asset in a given context, it may employ some other service in order to accomplish its designated manipulation of the asset, in which case we say that the former service depends-on the latter. Finally, the contains relation is defined for contexts and
Fig. 2. Entities and relations defined by the presented system model
all four abstractions defined above, i.e. a context contains the services, capabilities, assets and other contexts that are available in it. Hence, a context is a container of services, capabilities, assets and other contexts, and constrains them and their interactions according to the policies1 associated with the factors that parametrize the context. The accesses, uses and depends-on relations may cross context boundaries as long as one of the contexts is (directly or indirectly) contained in the other or there exists a wider context which (directly or indirectly) contains both contexts. Fig. 2 illustrates graphically the entities and the relations defined by the presented system model. Based on this model, a pervasive service is nothing more than the instantiation of the service abstraction presented above, i.e. a case where, in a given context, the software implementing the functionality captured by a service runs on a device represented by a given capability and operates on a given set of assets. Most of the properties that characterize a pervasive service are inherited from distributed systems and mobile computing (see Fig.1), including fault tolerance, high availability, distributed security and adaptability. Pervasive computing adds the invisibility property and strengthens the requirements regarding availability, adaptability (graceful degradation of service provision), scalability, and security 1
1 The word “policy” has no particular meaning in this paper, other than describing the constraints that a given context imposes on the contained assets, services and capabilities. In practice, such policies can be represented as context-specific assets which participate in every interaction among assets, services and capabilities contained by the given context.
(user privacy and trust) properties. All these properties are directly related to the dependability and configurability quality attributes of a system which are briefly summarized in the next section.
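As an aside not found in the original paper, the four abstractions and their relations can be restated as a small code sketch; all class, field and method names below are hypothetical and serve only to picture the model in programmatic form.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the four abstractions (asset, service, capability, context)
// and the uses/accesses/depends-on/contains relations described above.
class Asset { String name; Asset(String n) { name = n; } }
class Capability { String device; Capability(String d) { device = d; } }

class Service {
    String name;
    Set<Service> dependsOn = new HashSet<>();     // depends-on relation
    Service(String n) { name = n; }
    // uses/accesses: in a given context the service manipulates an asset on a capability
    void access(Context ctx, Capability cap, Asset asset) {
        System.out.println(name + " uses " + cap.device + " to access " + asset.name
                + " in context " + ctx.name);
    }
}

class Context {
    String name;                                   // location plus time, events, user state, ...
    Set<Object> contains = new HashSet<>();        // contains: services, capabilities, assets, sub-contexts
    Context(String n) { name = n; }
}

public class ModelSketch {
    public static void main(String[] args) {
        Context office = new Context("office");
        Asset report = new Asset("quarterly report");
        Capability screen = new Capability("wall screen");
        Service viewer = new Service("document viewer");
        office.contains.add(report);
        office.contains.add(screen);
        office.contains.add(viewer);
        viewer.access(office, screen, report);     // allowed because all three share the context
    }
}
```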
3 Dependability and Configurability
Dependable systems have been a domain of research for the past four decades. In general, dependability is a property that describes the trustworthiness of a system and it consists of the following four quality attributes [8]: availability, which describes the continuous accessibility of a system’s functionality; reliability, which describes the uninterrupted delivery of a system’s functionality; safety, which describes the preservation of a correct system state; and security, which describes the prevention of unauthorized operations on a system.

Configurability is another research domain that has a long history, especially in the domain of distributed systems. The quality attribute of configurability describes the capability of a system to change the configuration of its constituents as a response to an external event (e.g. user input), as a reaction to an internal event (e.g. failure detection), or as part of its specified behavior (i.e. programmed reconfiguration). In the system model presented in the previous section, all these five quality attributes (availability, reliability, safety, security, and configurability) can be used to qualify assets, services and capabilities but not contexts.

In the area of distributed systems, dependability has been very closely associated with fault tolerance techniques (e.g. see [9] and [3]). In fact, fault tolerance refers to techniques for dealing with the occurrence of failure in the operation of a system and covers mostly safety and reliability issues. Availability is partially addressed by fault tolerance techniques, and is more related to load balancing issues. Security is even less related to fault tolerance techniques2 and deals mostly with access control, intrusion detection and communication privacy. On the other hand, a wide variety of fault tolerance techniques are closely related to configurability (e.g. after the occurrence of a failure is detected, the system reconfigures itself to isolate the failed part). Hence, in distributed systems the relation between reliability and safety on one hand and configurability on the other is quite strong. The relation between availability and configurability is less direct and it mainly relates to the establishment of a configuration where the part of the system that needs to access some remote functionality gets in contact with the part of the system that provides that functionality. Finally, security appears not to relate to the other four quality attributes addressed in this paper.

Mobile computing extends the problems that the designers of distributed systems are facing with concerns regarding the poor resources of the capabilities, the high fluctuations in connectivity characteristics, and an unpredictable factor of hazardousness [11]. This brings new dimensions to dependability and
2 Fault tolerance techniques which deal with Byzantine failures consider security attacks as a special kind of failure where the failed part of the system exhibits unpredictable behavior.
configurability, which are further accentuated in pervasive computing. Failures, load balancing and security attacks are not the only concerns that influence the availability, reliability, safety and security of services, assets and capabilities. Context composition (i.e. containment under the same wider context), proximity issues in short range wireless networks, the absence of a centralized point of information (e.g. which may play the role of a security server) and the limited physical resources of the capabilities are only a few of the factors that make dependability an indispensable constituent quality in pervasive computing. In addition, the very nature of pervasive computing entails the need for dynamic reconfiguration and adaptation of services, assets and capabilities of a given context to a wide range (possibly not defined a priori) of services, assets and capabilities in other contexts with which the former may get in contact.
4 Common Denominators
The first common denominator across all five quality attributes presented in the previous section is the detection of conditions which call for some reaction that will prevent any behavior of the system that lies outside its specifications. Such conditions can be different for each quality attribute, e.g. failure occurrence for reliability, exceeded load for availability, unauthorized access for security, and user action that leads to reconfiguration. Nevertheless, in every case the system reaction and the triggering of the mechanisms that ensure the “correct” behavior of the system rely on the detection of the conditions which call for some system reaction. The detection of such conditions can be differentiated depending on whether it is set on assets, services or capabilities as well as with respect to the quality attribute with which it is associated. Still, mechanisms that guarantee availability, reliability, safety, security and configurability are all based on the detection of some type of condition. Closely related to the detection of conditions that can lead to specification violations is the “design for adaptation” characteristic that can be found in all five quality attributes studied in this paper. A direct consequence of detecting a condition that leads to specification violation is to take appropriate actions for preventing the violation. The action which expresses this design for adaptation can be different depending on the case (e.g. stop unauthorized accesses, connect to an unoccupied resource, retrieve the assets or service results from a non-faulty replica, etc). This common characteristic of dependability and configurability aligns perfectly with the property of self-tuning and the invisibility thrust which are fundamental elements of pervasive computing [12]. Another common denominator for the quality attributes of availability and reliability is redundancy which appears in the form of replication of assets, services or devices and is used to mask resource overload or failure (these are the common cases in distributed systems) as well as inaccessibility, ad hoc changes in the composition of contexts, etc. A different form of redundancy can be found in mechanisms that ensure security of assets and more specifically their integrity. This redundancy concerns some extra information which is represented as CRC
codes, digital signatures and certification tokens. For reasons that will be revealed in the following section, it is worth mentioning that this kind of redundancy has also been used in fault tolerance techniques (CRC codes are used to detect corrupted data).
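As a concrete illustration of this last kind of redundancy (not taken from the paper), a checksum attached to a piece of data lets the receiver detect corruption; the sketch below uses the standard java.util.zip.CRC32 class and a hypothetical payload.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class CrcSketch {
    // Compute a CRC-32 checksum over the payload (the redundant information).
    static long checksum(byte[] payload) {
        CRC32 crc = new CRC32();
        crc.update(payload);
        return crc.getValue();
    }

    public static void main(String[] args) {
        byte[] sent = "flight-schedule update".getBytes(StandardCharsets.UTF_8);
        long attached = checksum(sent);            // redundancy attached to the asset

        byte[] received = sent.clone();
        received[0] ^= 0x01;                       // simulate corruption in transit

        // The mismatch only reveals *that* the data changed, not *why*
        // (transient fault or attack) - exactly the interpretation problem of Sect. 5.
        System.out.println(checksum(received) == attached ? "data accepted" : "corruption detected");
    }
}
```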
5 Competing Forces
On the antipode of the common denominators of the five quality attributes discussed in this paper, a number of competing forces strive to complicate the development of pervasive services. An interesting remark is that these competing forces stem directly or indirectly from the common denominators identified above.

The first common characteristic of availability, reliability, safety, security and configurability identified in the previous section is the detection of conditions that trigger the mechanisms which guarantee the provision of the five aforementioned quality attributes. The same characteristic yields an important force that complicates the development of pervasive services: the interpretation of a phenomenon as a certain type of condition. For example, the interception of a corrupted message by a security detection mechanism will be interpreted as a security attack on the service or asset integrity and, depending on the security policy, it may result in prohibiting any communication with the sender of the message. However, if the message was corrupted due to some transient communication failure, the most efficient reaction of the receiver would be to request from the sender the re-transmission of the message. But without any means to reveal the true nature of the message corruption, the interpretation of this phenomenon will be decided by the detection mechanism that will first intercept it.

The problem of the wrong detection mechanism intercepting a condition (or a detection mechanism intercepting the wrong condition) is not specific to pervasive computing. The same situation can appear in distributed systems and mobile computing, only in these cases the impact is smaller. Based on the above example, in distributed systems a central security authority can be employed to resolve the misunderstanding created by the misinterpretation of the corrupted message as a security attack. In pervasive computing, however, such central authorities do not exist, and even if they do in some exceptional case, they are not accessible at any given moment, mainly due to the lack of a global connectivity network. This situation is closer to mobile computing, with the exception of the minimal user distraction constraint which is a characteristic of pervasive computing. The implication of this latter fact is that in pervasive computing the misunderstanding must be resolved without the intervention of a central authority (or the user) and it should result in the minimum possible inconvenience for the user.

The second common denominator, the “design for adaptation”, is probably the factor that produces the fiercest obstacles in integrating all five quality attributes discussed in this paper with the design of pervasive services. The “design for adaptation” reflects the different strategies which are employed to deal
with the occurrence of conditions that affect the availability, reliability, safety, security or configurability of a pervasive service. The combination of security and reliability alone is known to be notoriously difficult (e.g. see [7]). The same statement is also true for the combination of security with any of the three other quality attributes above. Increasing security checks results in an increase in the time to access an asset or to use a capability, which in turn decreases the availability of the latter. Similarly, the availability of a service is inversely proportional to the number of checkpoints that are put in place by the mechanism which guarantees the reliability of a service or an asset.

The problem of composing the design decisions regarding different quality attributes is not met for the first time in pervasive computing. The development of different views on a system’s architecture in order to better comprehend the implications of the requirements regarding different quality attributes has already been suggested and tried out (e.g. see [5]). However, in pervasive computing the alternative design choices for resolving the requirements regarding availability, reliability, safety, security and configurability are not as many as in distributed systems and in mobile computing. For example, replication in space (e.g. state machine approach or active replication) will rarely be affordable as a reliability solution due to the limited resources of the capabilities. In the same manner, centralized security is not an option in pervasive computing and security solutions must be fundamentally distributed (e.g. distributed trust [6]). The same argument applies to configuration mechanisms based on a central reconfiguration manager (e.g. see [1]). The bottom line is that the composability of different architectural views was already a difficult issue in distributed systems where a variety of design alternatives exist for different quality attributes. In pervasive computing this difficulty is amplified by the lack of alternative design solutions, which is the result of the limited resources at the disposal of hand-held devices.

This brings us to the third force that opposes the integration of availability, reliability and security: redundancy. Each of those three quality attributes uses some form of redundancy but, as imposed by the differences in their nature, these redundancy forms are different and in many cases contradictory. For example, replicating an asset to guarantee its availability runs contrary to traditional security practices for protecting secrets. Similarly, encrypting and certifying an asset, which adds some redundant information, increases the time needed to access that asset, which fatally decreases its availability. When redundancy for reliability purposes enters the above picture, the result is a mess of conflicting redundancy types and strategies on how to employ them to serve all three quality attributes in question. In addition to this, the choices of redundancy types that can be employed in pervasive computing are few. Limited resources in terms of memory, computing power and energy do not permit solutions based on cryptographic techniques on the leading edge of distributed system technology (high consumption of computing power and energy), nor do they permit greedy replication schemes for availability and active replication for reliability (high memory consumption). For the same reasons, fat
proxy solutions used in mobile computing for disconnected mode operation are also excluded.
6 Discussion
The issue of balancing the competing forces in the integration of dependability and configurability concerns is not specific to pervasive computing. Distributed systems and mobile computing face the same problem, only on a different scale (e.g. limited resource devices, a number of services, assets and capabilities in a context that may vary with time, etc.) and under assumptions which are not valid in pervasive computing (e.g. central authority, presence of capabilities, services and assets in the surrounding environment of a service, etc.). The traditional approach in distributed systems, which also applies in mobile computing, is to sacrifice one of the system aspects related to dependability or configurability in order to guarantee the others, or to compromise the guarantees provided by many of these aspects in order to provide a little bit of everything.

However, both the above alternatives are just not good enough for pervasive computing. Pervasive services that neglect security issues (e.g. integrity or privacy of services, assets and capabilities) will have significantly limited acceptance by end-users, service providers or device manufacturers regardless of the degree of availability, reliability and configurability they offer. Similar cases hold when any of the constituent attributes of dependability or configurability is neglected. For example, neglecting availability issues in favor of reliability, i.e. assuring that the user of a service will receive the expected result but providing no guarantees about when the service will be accessible, results in very weak user trust in the services. On the other hand, compromising the guarantees regarding some of the aforementioned quality attributes does not necessarily increase the trust of the service users and providers. Such approaches can easily result in degraded systems which are neither secure, nor reliable, nor configurable.

In order to ensure the integration of dependability and configurability concerns in the development of pervasive services, the designer must resolve the conflicts arising from the competing forces presented in Section 5. This integration is a very challenging task which will push the system modeling and software architecture domains to their limits and probably foster their evolution. Still, for each of the three identified competing forces there is a simple guideline that can be used as a starting point when dealing with the integration of dependability and configurability in the design of pervasive services.

Regarding the detection of conditions that activate the mechanisms responsible for the dependability and configurability guarantees in a system, the designer must dissociate the detection of an event from its interpretation as a condition of a certain type, since it is strongly probable that the same event in different contexts will have to be interpreted as a different condition. In distributed systems an event has a pre-assigned meaning as a condition of some type (e.g. if a timeout expires in communication over a network it is taken to mean that the remote server is down or overloaded). This is a consequence of the assumption
that the network composition is more or less fixed and nodes do not enter and exit the network all the time. Hence, the appearance and removal of nodes happen under the supervision of some system administrator (human or software component). So, delays in responses to network communication are assumed to be events that signify node overload or failure. Mobile computing deals with the events associated with network communication in a more flexible way, since network disconnections are part of the system specification. However, this is not the case with security issues, where the usual approach is to trust either a central authority which is occasionally accessible over the network or the information kept in some security profile on the mobile terminal. This results in rigid security policies which fail to adapt to the variety of circumstances that may arise in different contexts in pervasive computing.

Following the dissociation of event detection from event interpretation as a condition of a specific type, the second force identified in Section 5 (i.e. design for adaptation) must be adjusted. In a similar way that the same event may signify different conditions in different contexts, the adaptation policy for the same condition must be parametrized by the context in which it applies. The first impact of this flexibility of adaptation policies is that the quality attributes of availability, reliability, safety, security and configurability must have more than one mechanism guaranteeing them. For example, privacy may depend on an RSA encryption mechanism in a context where communication takes place over Ethernet, but be content with the ciphering provided by the modulation and compression performed by a CDMA-based radio communication. Similarly, reliability may be based on an active replication fault tolerance mechanism in contexts with rich capabilities, but be content with a primary-backup mechanism where capabilities are scarce and communication delays are high. Design for adaptation in pervasive computing must take into consideration context-dependent parameters that may or may not be known at the design phase. Hence, adaptability in pervasive computing is not only adjusting the system behavior according to the conditions to which event occurrences are translated. It is also adjusting the adaptation policies to the specific characteristics of the context in which a given pervasive service operates. Since all the adaptation policies that may apply in different contexts cannot be known in advance, the adaptation policies must be adaptive themselves.

The design issues related to the use of redundancy in pervasive services are directly related to the adaptability of the adaptation policies. For example, using replication of services and/or assets to achieve availability and reliability must not be a choice fixed at design time; rather it should be possible to select the most appropriate form of redundancy for a given context. Security related redundancy must also be adjustable to context characteristics. Encryption and digital signatures might not be necessary for guaranteeing the integrity of an asset when the given asset is accessed in an attack-proof context. Redundancy schemes with conflicting interests for the quality attributes of availability, reliability and security must be prioritized on a per-context basis. This will allow the graceful degradation of the qualities guaranteed by a pervasive service to
adapt to the characteristics of different contexts. For example, while assuring maximum security guarantees, lower reliability and availability guarantees can be provided for accessing an asset in a given context (e.g. an uncertified context). In a different context, where physical security means allow the security guarantees provided by the system to be relaxed, the reliability and availability guarantees for the access of the same asset can be maximized.
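To make these guidelines more tangible, the following sketch (not part of the original paper; all type and method names are hypothetical) separates the detection of a raw event from its context-specific interpretation and reaction, so that the same event can trigger different mechanisms in different contexts.

```java
import java.util.HashMap;
import java.util.Map;

// Raw events are detected without any built-in meaning.
enum Event { MESSAGE_CORRUPTED, RESPONSE_TIMEOUT }

// Conditions are context-specific interpretations of events.
enum Condition { TRANSIENT_COMM_FAILURE, SECURITY_ATTACK, NODE_OVERLOAD }

interface AdaptationPolicy {
    Condition interpret(Event e);                  // interpretation is separated from detection
    String react(Condition c);                     // mechanism to trigger for a given condition
}

public class GuidelineSketch {
    // Each context carries its own (replaceable, hence adaptive) policy.
    static Map<String, AdaptationPolicy> policies = new HashMap<>();

    public static void main(String[] args) {
        policies.put("uncertified context", new AdaptationPolicy() {
            public Condition interpret(Event e) { return Condition.SECURITY_ATTACK; }
            public String react(Condition c) { return "block sender, keep maximum security"; }
        });
        policies.put("trusted office context", new AdaptationPolicy() {
            public Condition interpret(Event e) { return Condition.TRANSIENT_COMM_FAILURE; }
            public String react(Condition c) { return "request re-transmission"; }
        });

        Event detected = Event.MESSAGE_CORRUPTED;  // the same detected event ...
        for (String ctx : policies.keySet()) {     // ... is interpreted differently per context
            AdaptationPolicy p = policies.get(ctx);
            System.out.println(ctx + ": " + p.react(p.interpret(detected)));
        }
    }
}
```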
7 Summary
The tight relation of pervasive computing with the quality attributes of dependability and configurability suggests that it is not possible to deliver successful pervasive services without integrating dependability and configurability considerations in their design. Pervasive computing does not introduce the integration of functional system properties and quality attributes as a new design concern. This is a concern already addressed before (e.g. see [2]). What is new in pervasive computing is that this integration is no longer a desirable action in order to increase the quality of a system; rather, it becomes an absolute necessity in order to provide successful pervasive services. This means that pervasive computing characteristics (smart-spaces, invisibility, localized scalability and graceful degradation) must be harmoniously interwoven with the condition detection, adaptability and redundancy characteristics that constitute a double-edged sword for the dependability and configurability quality attributes.

Although there is no widely applicable, systematic method to support the system designer in the aforementioned challenging integration task, there are three simple guidelines which can serve as a starting point for attempting the harmonious integration of dependability and configurability in pervasive services. First, the dissociation of the detection of events in a context from their interpretation as conditions of a certain type. The interpretation must happen separately and in a context-specific way, which will allow the same event to signify different conditions (and hence trigger different mechanisms) in different contexts. Second, the design provisions which enable the adaptation of adaptation policies to the characteristics specific to each context where the policies are applied. Finally, the inclusion of redundancy schemes as part of the adaptation policies, which implies that different redundancy schemes must exist for assets and services and that the selection of the one to be applied in a given context will depend on the characteristics of the context in question. These guidelines are very much in line with the self-tuning property of pervasive computing [12]. In fact, the first guideline is an enabler for self-tuning, while the second and the third elaborate on how to put self-tuning considerations into the design of dependable and configurable pervasive services. We anticipate substantial support from the system modeling and software architecture activities for the integration of dependability and configurability in the design of pervasive services.

Finally, the quality attribute of timeliness must be considered in conjunction with dependability and configurability. This is another big design challenge
since on one hand real-time embedded devices form a significant number of capabilities in pervasive computing, and on the other the integration of timeliness, dependability and adaptability is shown to be far from trivial (e.g. see [10]).
References
[1] C. Bidan, V. Issarny, T. Saridakis, and A. Zarras. A Dynamic Reconfiguration Service for CORBA. In Proceedings of the 4th International Conference on Configurable Distributed Systems, pages 35–42, 1998. 316
[2] J. Bosch and P. Molin. Software Architecture Design: Evaluation and Transformation. In Proceedings of the Conference on Engineering of Computer-Based Systems, pages 4–10, 1999. 319
[3] F. Cristian. Understanding Fault-Tolerant Distributed Systems. Communications of the ACM, 34(2):56–78, February 1991. 313
[4] R. Grimm, T. Anderson, B. Bershad, and D. Wetherall. A system architecture for pervasive computing. In Proceedings of the 9th ACM SIGOPS European Workshop, pages 177–182, September 2000. 311
[5] V. Issarny, T. Saridakis, and A. Zarras. Multi-View Description of Software Architectures. In Proceedings of the 3rd International Workshop on Software Architecture, pages 81–84, 1998. 316
[6] L. Kagal, T. Finin, and A. Joshi. Trust-Based Security in Pervasive Computing Environments. IEEE Computer, 34(12):154–157, December 2001. 316
[7] K. Kwiat. Can Reliability and Security be Joined Reliably and Securely? In Proceedings of the IEEE Symposium on Reliable Distributed Systems, pages 72–73, 2001. 316
[8] J. C. Laprie, editor. Dependability: Basic Concepts and Terminology, volume 5 of Dependable Computing and Fault-Tolerant Systems. Springer-Verlag, 1992. 309, 313
[9] V. P. Nelson. Fault-Tolerant Computing: Fundamental Concepts. IEEE Computer, 23(7):19–25, July 1990. 313
[10] P. Richardson, L. Sieh, and A. M. Elkateeb. Fault-Tolerant Adaptive Scheduling for Embedded Real-Time Systems. IEEE Micro, 21(5):41–51, September-October 2001. 320
[11] M. Satyanarayanan. Fundamental Challenges in Mobile Computing. In Proceedings of the 15th Annual ACM Symposium on Principles of Distributed Computing, pages 1–7, 1996. 313
[12] M. Satyanarayanan. Pervasive Computing: Vision and Challenges. IEEE Personal Communications, 8(4):10–17, August 2001. 309, 311, 314, 319
Architectural Considerations in the Certification of Modular Systems
Iain Bate and Tim Kelly
Department of Computer Science, University of York, York, YO10 5DD, UK
{iain.bate,tim.kelly}@cs.york.ac.uk
Abstract. The adoption of Integrated Modular Avionics (IMA) in the aerospace industry offers potential benefits of improved flexibility in function allocation, reduced development costs and improved maintainability. However, it requires a new certification approach. The traditional approach to certification is to prepare monolithic safety cases as bespoke developments for a specific system in a fixed configuration. However, this nullifies the benefits of flexibility and reduced rework claimed of IMA-based systems and will necessitate the development of new safety cases for all possible (current and future) configurations of the architecture. This paper discusses a modular approach to safety case construction, whereby the safety case is partitioned into separable arguments of safety corresponding with the components of the system architecture. Such an approach relies upon properties of the IMA system architecture (such as segregation and location independence) having been established. The paper describes how such properties can be assessed to show that they are met, and how trade-offs can be performed during architecture definition, reusing information and techniques from the safety argument process.
1 Introduction
Integrated Modular Avionics (IMA) offers potential benefits of improved flexibility in function allocation, reduced development costs and improved maintainability. However, it poses significant problems in certification. The traditional approach to certification relies heavily upon a system being statically defined as a complete entity and the corresponding (bespoke) system safety case being constructed. However, a principal motivation behind IMA is that there is through-life (and potentially runtime) flexibility in the system configuration. An IMA system can support many possible mappings of the functionality required to the underlying computing platform. In constructing a safety case for IMA an attempt could be made to enumerate and justify all possible configurations within the architecture. However, this approach is unfeasibly expensive for all but a small number of processing units and functions. Another approach is to establish the safety case for a specific configuration within the
architecture. However, this nullifies the benefit of flexibility in using an IMA solution and will necessitate the development of completely new safety cases for future modifications or additions to the architecture. A more promising alternative is to attempt to establish a modular, compositional approach to constructing safety arguments that has a correspondence with the structure of the underlying system architecture. However, to create such arguments requires a system architecture that has been designed with explicit consideration of enabling properties such as independence (e.g. including both non-interference and location ‘transparency’), increased flexibility in functional integration, and low coupling between components. An additional problem is that these properties are non-orthogonal and trade-offs must be made when defining the architecture.
2 Safety Case Modules
Defining a safety case ‘module’ involves defining the objectives, evidence, argument and context associated with one aspect of the safety case. Assuming a top-down progression of objectives-argument-evidence, safety cases can be partitioned into modules both horizontally and vertically:

Vertical (Hierarchical) Partitioning - The claims of one safety argument can be thought of as objectives for another. For example, the claims regarding software safety made within a system safety case can serve as the objectives of the software safety case.

Horizontal Partitioning - One argument can provide the assumed context of another. For example, the argument that “All system hazards have been identified” can be the assumed context of an argument that “All identified system hazards have been sufficiently mitigated”.

In defining a safety case module it is essential to identify the ways in which the safety case module depends upon the arguments, evidence or assumed context of other modules. A safety case module should therefore be defined by the following interface:

1. Objectives addressed by the module
2. Evidence presented within the module
3. Context defined within the module
4. Arguments requiring support from other modules

Inter-module dependencies:

5. Reliance on objectives addressed elsewhere
6. Reliance on evidence presented elsewhere
7. Reliance on context defined elsewhere

The principal need for having such well-defined interfaces for each safety case module arises from being able to ensure that modules are being used consistently and correctly in their target application context (i.e. when composed with other modules).
2.1 Safety Case Module Composition
Safety case modules can be usefully composed if their objectives and arguments complement each other – i.e. one or more of the objectives supported by a module match one or more of the arguments requiring support in the other. For example, the software safety argument is usefully composed with the system safety argument if the software argument supports one or more of the objectives set by the system argument. At the same time, an important side-condition is that the collective evidence and assumed context of one module is consistent with that presented in the other. For example, an operational usage context assumed within the software safety argument must be consistent with that put forward within the system level argument.

The definition of safety case module interfaces and satisfaction of conditions across interfaces upon composition is analogous to the long-established rely-guarantee approach to specifying the behaviour of software modules. Jones in [1] talks of ‘rely’ conditions that express the assumptions that can be made about the interrelations (interference) between operations and ‘guarantee’ conditions that constrain the end-effect assuming that the ‘rely’ conditions are satisfied. For a safety case module, the rely conditions can be thought of as items 4 to 7 of the interface (listed at the start of Sect. 2) whilst item 1 (objectives addressed) defines the guarantee conditions. Items 2 (evidence presented) and 3 (context defined) must continue to hold (i.e. not be contradicted by inconsistent evidence or context) during composition of modules. The defined context of one module may also conflict with the evidence presented in another. There may also simply be a problem of consistency between the system models defined within multiple modules. For example, assuming a conventional system safety argument / software safety argument decomposition (as defined by U.K. Defence Standards 00-56 [2] and 00-55 [3]) consistency must be assured between the state machine model of the software (which, in addition to modelling the internal state changes of the software will almost inevitably model the external – system – triggers to state changes) and the system level view of the external stimuli. As with checking the consistency of safety analyses, the problem of checking the consistency of multiple, diversely represented, models is also a significant challenge in its own right.
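To make the rely-guarantee analogy concrete, the sketch below (purely illustrative; the class, method and module names are not from the paper) represents a module by the interface items listed in Sect. 2 and checks the simplest matching condition: every argument requiring support in one module must be covered by an objective addressed in the other.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative rendering of the module interface (a subset of items 1-7 of Sect. 2).
class SafetyCaseModule {
    String name;
    Set<String> objectivesAddressed;            // item 1: "guarantee" conditions
    Set<String> evidencePresented;              // item 2
    Set<String> contextDefined;                 // item 3
    Set<String> argumentsRequiringSupport;      // item 4: "rely" conditions

    SafetyCaseModule(String name, List<String> objectives, List<String> evidence,
                     List<String> context, List<String> needsSupport) {
        this.name = name;
        this.objectivesAddressed = new HashSet<>(objectives);
        this.evidencePresented = new HashSet<>(evidence);
        this.contextDefined = new HashSet<>(context);
        this.argumentsRequiringSupport = new HashSet<>(needsSupport);
    }
}

public class CompositionSketch {
    // Matching condition: rely conditions of 'client' are among supplier's guarantees.
    // (Checking evidence/context consistency needs model-specific comparisons and is
    // deliberately left out of this sketch.)
    static boolean canCompose(SafetyCaseModule client, SafetyCaseModule supplier) {
        return supplier.objectivesAddressed.containsAll(client.argumentsRequiringSupport);
    }

    public static void main(String[] args) {
        SafetyCaseModule system = new SafetyCaseModule("SystemSafetyArg",
                Arrays.asList("All system hazards sufficiently mitigated"),
                Arrays.asList("System hazard analysis"),
                Arrays.asList("Operational usage: on-ground maintenance"),
                Arrays.asList("Software contribution to hazards acceptable"));
        SafetyCaseModule software = new SafetyCaseModule("SoftwareSafetyArg",
                Arrays.asList("Software contribution to hazards acceptable"),
                Arrays.asList("Static analysis results"),
                Arrays.asList("Operational usage: on-ground maintenance"),
                Arrays.<String>asList());
        System.out.println(canCompose(system, software));   // prints: true
    }
}
```

A contract recorded at composition time (Sect. 2.3) would essentially freeze such a matching so that it can be re-checked whenever one of the modules is modified or substituted.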
2.2 The Challenge of Compositionality
It is widely recognised (e.g. by Perrow [4] and Leveson [5]) that relatively low risks are posed by independent component failures in safety-critical systems. However, it is not expected that, in a safety case architecture where modules are defined to correspond with a modular system structure, a complete, comprehensive and defensible argument can be achieved by merely composing the arguments of safety for individual system modules. Safety is a whole system, rather than a ‘sum of parts’, property. Combination of effects and emergent behaviour must be additionally addressed within the overall safety case architecture (i.e. within their own modules of the safety case). Modularity in reasoning should not be confused with modularity (and assumed independence) in system behaviour.
2.3 Safety Case Module ‘Contracts’
Where a successful match (composition) can be made of two or more modules, a contract should be recorded of the agreed relationship between the modules. This contract aids in assessing whether the relationship continues to hold and the (combined) argument continues to be sustained if at a later stage one of the argument modules is modified or a replacement module substituted. This is a commonplace approach in component-based software engineering where contracts are drawn up of the services a software component requires of, and provides to, its peer components, e.g. as in Meyer’s Eiffel contracts [6]. In software component contracts, if a component continues to fulfil its side of the contract with its peer components (regardless of internal component implementation detail or change) the overall system functionality is expected to be maintained. Similarly, contracts between safety case modules allow the overall argument to be sustained whilst the internal details of module arguments (including use of evidence) are changed or entirely substituted for alternative arguments, provided that the guarantees of the module contract continue to be upheld.
2.4 Safety Case Architecture
We define safety case architecture as the high-level organisation of the safety case into modules of argument and the interdependencies that exist between them. In deciding upon the partitioning of the safety case, many of the same principles apply as for system architecture definition, for example:

High Cohesion/Low Coupling – each safety case module should address a logically cohesive set of objectives and (to improve maintainability) should minimise the amount of cross-referencing to, and dependency on, other modules.

Supporting Work Division & Contractual Boundaries – module boundaries should be defined to correspond with the division of labour and organisational / contractual boundaries such that interfaces and responsibilities are clearly identified and documented.

Isolating Change – arguments that are expected to change (e.g. when making anticipated additions to system functionality) should ideally be located in modules separate from those modules where change to the argument is less likely (e.g. safety arguments concerning operating system integrity).

The principal aim in attempting to adopt a modular safety case architecture for IMA-based systems is for the modular structure of the safety case to correspond as far as is possible with the modular partitioning of the hardware and software of the actual system.
2.5 Reasoning about Interactions and Independence
One of the main impediments to reasoning separately about individual applications running on an IMA-based architecture is the degree to which applications interact or interfere with one another. The European railways safety standard CENELEC ENV
50129 [7] makes an interesting distinction between those interactions between system components that are intentional (e.g. component X is meant to communicate with component Y) and those that are unintentional (e.g. the impact of electromagnetic interference generated by one component on another). A further observation made in ENV 50129 is that there is a class of interactions that are unintentional but created through intentional connections. An example of this form of interaction is the influence of a failed processing node that is ‘babbling’ and interfering with another node through the intentional connection of a shared databus. Ideally, ‘once-for-all’ arguments are established by appeal to the properties of the IMA infrastructure to address unintentional interactions. For example, an argument of “non-interference through shared scheduler” could be established by appeal to the priority-based scheduling scheme offered by the scheduler. It is not possible to provide “once-for-all” arguments for the intentional interactions between components – as these can only be determined for a given configuration of components. However, it is desirable to separate those arguments addressing the logical intent of the interaction from those addressing the integrity of the medium of interaction. The following section describes how properties of the system architecture, such as those discussed above, can be explicitly considered as part of the architecture definition activity.
3 Evaluating Required Qualities during System Architecture Definition
In defining system architecture it is important to consider the following activities:

1. Derivation of choices – identifies where different design solutions are available for satisfying a goal.
2. Manage sensitivities – identifies dependencies between components such that consideration of whether and how to relax them can be made. A benefit of relaxing dependencies could be a reduced impact of change.
3. Evaluation of options – allows questions to be derived whose answers can be used for identifying solutions that do/do not meet the system properties, judging how well the properties are met and indicating where refinements of the design might add benefit.
4. Influence on the design – identifies constraints on how components should be designed to support the meeting of the system’s overall objectives.

A technique (the Architecture Trade-Off Analysis Method – ATAM [8]) for evaluating architectures for their support of architectural qualities, and trade-offs in achieving those qualities, has been developed by the Software Engineering Institute. Our proposed approach is intended for use within the nine-step process of ATAM. The differences between our strategy and other existing approaches, e.g. ATAM, include the following.
1. The techniques used in our approach are already accepted and widely used (e.g. nuclear propulsion system and missile system safety arguments) [2], and as such processes exist for ensuring the correctness and consistency of the results obtained.
2. The techniques offer: (a) strong traceability and a rigorous method for deriving the attributes and questions with which designs are analysed; (b) the ability to capture design rationale and assumptions, which is essential if component reuse is to be achieved.
3. Information generated from their original intended use can be reused, rather than repeating the effort.
4. The method is equally intended as a design technique to assist in the evaluation of the architectural design and implementation strategy as it is for evaluating a design at particular fixed stages of the process.
3.1 Analysing Different Design Solutions and Performing Trade-Offs
Figure 1 provides a diagrammatic overview of the proposed method. Stage (1) of the trade-off analysis method is producing a model of the system to be assessed. This model should be decomposed to a uniform level of abstraction. Currently our work uses UML [9] for this purpose; however, the method could be applied to any modelling approach that clearly identifies components and their couplings. Arguments are then produced (stage (2)) for each coupling, at an abstraction level corresponding to (but lower than) that of the system model so that the impact of later choices can be captured. (An overview of the Goal Structuring Notation symbols is shown in Figure 2; further details of the notation can be found in [10].) The arguments are derived from the top-level properties of the particular system being developed. The properties often of interest are lifecycle cost, dependability, and maintainability. Clearly these properties can be broken down further, e.g. dependability may be decomposed into reliability, safety and timing (as described in [11]). Safety may further involve providing guarantees of independence between functionality. In practice, the arguments should be generic or based on patterns where possible.

Stage (3) then uses the information in the argument to derive options and evaluate particular solutions. Part of this activity uses representative scenarios to evaluate the solutions. Based on the findings of stage (3), the design is modified to fix problems that are identified – this may require stages (1)-(3) to be repeated to show the revised design is appropriate. When this is complete and all necessary design choices have been made, the process returns to stage (1), where the system is then decomposed to the next level of abstraction using guidance from the goal structure. Components reused from another context could be incorporated as part of the decomposition. Only proceeding when design choices and problem fixing are complete is preferred to allowing trade-offs across components at different stages of decomposition because the abstractions and assumptions are consistent.
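Purely as an illustration (not from the paper), stage 3(b) can be pictured as a walk over the goal structure that turns undeveloped leaf goals into evaluation questions; the node identifiers below are taken from Fig. 5, everything else is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// A minimal GSN-like node: a goal with sub-goals; leaves without support are "undeveloped".
class Goal {
    String id;
    String claim;
    List<Goal> solvedBy = new ArrayList<>();
    Goal(String id, String claim) { this.id = id; this.claim = claim; }
}

public class QuestionExtraction {
    // Stage 3(b): every undeveloped leaf goal yields a question for evaluating a solution.
    static void collectQuestions(Goal g, List<String> out) {
        if (g.solvedBy.isEmpty()) {
            out.add(g.id + " - Can the claim \"" + g.claim + "\" be justified?");
        } else {
            for (Goal child : g.solvedBy) collectQuestions(child, out);
        }
    }

    public static void main(String[] args) {
        Goal robust = new Goal("G0002", "Component is robust to changes");
        robust.solvedBy.add(new Goal("G0011", "Make operations integrity less susceptible to time variations"));
        robust.solvedBy.add(new Goal("G0012", "Make operations integrity less dependent on value"));

        List<String> questions = new ArrayList<>();
        collectQuestions(robust, questions);
        questions.forEach(System.out::println);
    }
}
```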
[Figure 1: a diagram of the method showing Stage 1 - Modelling the system, Stage 2 - Arguing about key properties, Stage 3(a) - Elicitation and evaluation of choices, Stage 3(b) - Extracting questions from the arguments (using scenarios), and Stage 3(c) - Evaluating whether claims are satisfied, with feedback loops labelled Refine Design, Improve Design and Make Design Choices (multiple-criteria optimisation).]
Fig. 1. Overview of the Method
[Figure 2: the GSN symbols used in the arguments - Goal, Solution, Context, Assumption (marked ‘A’), Choice, and the SolvedBy and InContextOf connectors.]
Fig. 2. Goal Structuring Notation (GSN) Symbols
[Figure 3: UML class diagram with four classes - Sensor (value, health; read_data(), send_data()), Actuator (value, health; read_data(), send_data()), Calculations (sensor_data, actuator_data, health; read_data(), send_data(), transform_data()) and Health Monitoring (system_health; read_data(), calculate_health(), perform_health(), update_maintenance_state()).]
Fig. 3. Class Diagram for the Control Loop
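As a readability aid only, the class diagram of Fig. 3 can be transcribed into code; the attribute and operation names come from the figure, while the types and the empty method bodies are guesses added here.

```java
// Transcription of the Fig. 3 class diagram; attribute types and method bodies are assumptions.
class Sensor {
    private double value;
    private boolean health;
    public double readData() { return value; }
    public void sendData() { /* pass the sampled value on */ }
}

class Actuator {
    private double value;
    private boolean health;
    public double readData() { return value; }
    public void sendData() { /* drive the actuator */ }
}

class Calculations {
    private double sensorData;
    private double actuatorData;
    private boolean health;
    public void readData() { /* obtain sensor data */ }
    public void sendData() { /* forward actuator data */ }
    public void transformData() { /* control-law computation */ }
}

class HealthMonitoring {
    private boolean systemHealth;
    public void readData() { /* observe inputs/outputs of Calculations */ }
    public void calculateHealth() { /* compare against expected behaviour */ }
    public void performHealth() { /* take appropriate action */ }
    public void updateMaintenanceState() { /* plan maintenance activities */ }
}

public class ControlLoopSketch {
    public static void main(String[] args) {
        // Instantiating the four elements of the loop; wiring is omitted.
        new Sensor(); new Actuator(); new Calculations(); new HealthMonitoring();
    }
}
```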
3.2 Example – Simple Control System
The example being considered is a continuous control loop that has health monitoring to check whether the loop is complying with the defined correct behaviour (i.e. accuracy, responsiveness and stability) and then takes appropriate actions if it does not. At the highest level of abstraction the control loop (the architectural model of which is shown in Figure 3) consists of three elements: a sensor, an actuator and a calculation stage. It should be noted that at this level, the design abstracts from whether the implementation is achieved via hardware or software. The requirements (key safety properties to be maintained are signified by (S), functional properties by (F) and non-functional properties by (NF), with explanations, where needed, in italics) to be met are:

1. the sensors have input limits (S) (F);
2. the actuators have input and output limits (S) (F);
3. the overall process must allow the system to meet the desired control properties, i.e. responsiveness (dependent on errors caused by latency (NF)), stability (dependent on errors due to jitter (NF) and gain at particular frequency responses (F)) [6] (S);
4. where possible the system should allow components that are beginning to fail to be detected at an early stage by comparison with data from other sources (e.g. additional sensors) (NF). Early recognition would allow appropriate actions to be taken, including the planning of maintenance activities.

In practice, as the system development progresses, the component design in Figure 3 would be refined to show more detail. For reasons of space only the calculation-health monitor coupling is considered.

Stage 2 is concerned with producing arguments to support the meeting of objectives. The first one considered here is an objective obtained from decomposing an argument for dependability (the argument is not shown here due to space reasons) that the system’s components are able to tolerate timing errors (goal Timing). From an available argument pattern, the argument in Figure 4 was produced that reasons “Mechanisms in place to tolerate key errors in timing behaviour”, where the context of the argument is the health-monitor component. Figure 4 shows how the argument is split into two parts. Firstly, evidence has to be obtained using appropriate verification techniques that the requirements are met in the implementation, e.g. when and in what order functionality should be performed. Secondly, the health monitor checks for unexpected behaviour. There are two ways in which unexpected behaviour can be detected (a choice is depicted by a black diamond in the arguments) – just one of the techniques could be used or a combination of the two ways. The first way is for the health-monitor component to rely entirely on the results of the internal health monitoring of the calculation component to indicate the current state of the calculations. The second way is for the health-monitor component to monitor the operation of the calculation component by observing the inputs and outputs to the calculation component. In the arguments, the leaf goals (generally at the bottom) have a diamond below them that indicates the development of that part of the argument is not yet complete. The evidence to be provided to support these goals should be quantitative in nature where possible, e.g. results of timing analysis to show timing requirements are met.

Next an objective obtained from decomposing an argument for maintainability (again not shown here due to space reasons) that the system’s components are tolerant to changes is examined. The resultant argument in Figure 5 depicts how it is reasoned that the “Component is robust to changes” in the context of the health-monitor component. There are two separate parts to this: making the integrity of the calculations less dependent on when they are performed, and making the integrity of the calculations less dependent on the values received (i.e. error-tolerant). For the first of these, we could either execute the software faster so that jitter is less of an issue, or we could use a robust algorithm that is less susceptible to the timing properties of the input data (i.e. more tolerant to jitter or the failure of values to arrive).
[Figure 4: GSN goal structure for the timing argument. Its nodes are: goal Timing - Mechanisms in place to tolerate key errors in timing behaviour (context C0010: mechanism = health-monitoring component; assumption A0004: appropriate steps taken when system changes); G0020 - Sufficient information about the bounds of expected timing operation is obtained; G0015 - Timing requirements are specified appropriately (context C0009: appropriate = correct, consistent and complete); G0016 - System implemented in a predictable way (context C0010: expected temporal behaviour concerns when and the order in which functionality is performed); G0017 - Verification techniques available to prove the requirements are met; G0021 - Operation is monitored and unexpected behaviour handled; and a choice between G0022 - Health monitor relies on health information provided to it, and G0023 - Health monitor performs checks based on provided information.]
Fig. 4. Timing Argument
[Figure 5: GSN goal structure for the minimising-change argument. Its nodes are: G0002 - Component is robust to changes (context C0012: component = health monitoring; assumption A0002: the integrity is related to frequency, latency and jitter); G0011 - Make operations integrity less susceptible to time variations; G0012 - Make operations integrity less dependent on value; and a choice between G0013 - Perform functionality faster than the plant's fastest frequency (context C0007: plant = system under control), and G0014 - Make calculations integrity less dependent on input data's timing properties (context C0008: robust algorithms, e.g. H-infinity).]
Fig. 5. Minimising Change Argument
The next stage (stage 3(a)) in the approach is the elicitation and evaluation of choices. This stage extracts the choices and considers their relative pros and cons. The results are presented in Table 1. From Table 1 it can be seen that some of the choices that need to be made about individual components are affected by choices made by other components within the system. For instance, Goal G0014 is a design option of having a more complicated algorithm that is more resilient to changes and variations in the system’s timing properties. However, Goal G0014 is in opposition to Goal G0023 since it would make the health-monitoring component more complex. Stage 3(b) then extracts questions from the argument that can then be used to evaluate whether particular solutions meet the claims from the arguments generated earlier in the process (stage 3(c)). Table 2 presents some of the results of extracting questions from the arguments for claim G0011 and its assumption A0002 from Figure 5. The table includes an evaluation of a solution based on a PID (proportional-integral-derivative) loop.
Table 1. Choices Extracted from the Arguments

Content: Goal G0021 - Operation is monitored and unexpected behaviour handled
  Choice: Goal G0022 - Health monitor relies on health information provided to it
    Pros: Simplicity, since the health monitor doesn't need to access and interpret another component's state.
    Cons: Can a failing/failed component be trusted to interpret error-free data?
  Choice: Goal G0023 - Health monitor performs checks based on provided information
    Pros: Omission failures easily detected and integrity of calculations maintained, assuming data provided is correct.
    Cons: Health monitor is more complex and prone to change due to dependence on the component.

Content: Goal G0011 - Make operations integrity less susceptible to time variations
  Choice: Goal G0013 - Perform functionality faster than the plant's fastest frequency
    Pros: Simple algorithms can be used. These algorithms take less execution time.
    Cons: Period and deadline constraints are tighter. Effects of failures are more significant.
  Choice: Goal G0014 - Make calculations' integrity less dependent on input data's timing properties
    Pros: Period and deadline constraints relaxed. Effects of failures may be reduced.
    Cons: More complicated algorithms have to be used. Algorithms may take more execution time.
Table 2 shows how questions for a particular coupling have different importance associated with them (e.g. Essential versus Value Added). These relate to properties that must be upheld or those whose handling in a different manner may add benefit (e.g. reduced susceptibility to change). The responses are only partial for the solution considered, due to the lack of other design information. As the design evolves the level of detail contained in the table would increase and the table would then be populated with evidence from verification activities, e.g. timing analysis. With the principles that we have established for organising the safety case structure “in-the-large”, and the complementary approach we have described for reasoning about the required properties of the system architecture, we believe it is possible to create a flexible, modular, certification argument for IMA. This is discussed in the following section.

Table 2. Evaluation Based on Argument
Question: Goal G0011 - Can the integrity of the operations be justified?
  Importance: Essential
  Response: More design information needed
  Design Mod.: Dependent on response to questions

Question: Assumption A0002 - Can the dependency between the operation's integrity and the timing properties be relaxed?
  Importance: Value Added
  Response: Only by changing control algorithm used
  Design Mod.: Results of other trade-off analysis needed
4 Example Safety Case Architecture for a Modular System
The principles of defining system and safety case architecture discussed in this paper are embodied in the safety case architecture shown in Figure 6. (The UML package notation is used to represent safety case modules.) The role of each of the modules of the safety case architecture shown in Figure 6 is as follows:

• ApplnAArg - Specific argument for the safety of Application A (one required for each application within the configuration).
• CompilationArg - Argument of the correctness of the compilation process. Ideally established once-for-all.
• HardwareArg - Argument for the correct execution of software on target hardware. Ideally an abstract argument established once-for-all leading to support from specific modules for particular hardware choices.
• ResourcingArg - Overall argument concerning the sufficiency of access to, and integrity of, resources (including time, memory, and communications).
• ApplnInteractionArg - Argument addressing the interactions between applications, split into two legs: one concerning intentional interactions, the second concerning unintentional interactions (leading to the NonInterfArg module).
• InteractionIntArg - Argument addressing the integrity of mechanisms used for intentional interaction between applications. Supporting module for ApplnInteractionArg. Ideally defined once-for-all.
• NonInterfArg - Argument addressing unintentional interactions (e.g. corruption of shared memory) between applications. Supporting module for ApplnInteractionArg. Ideally defined once-for-all.
• PlatFaultMgtArg - Argument concerning the platform fault management strategy (e.g. addressing the general mechanisms of detecting value and timing faults, locking out faulty resources). Ideally established once-for-all. (NB platform fault management can be augmented by additional management at the application level.)
• ModeChangeArg - Argument concerning the ability of the platform to dynamically reconfigure applications (e.g. move an application from one processing unit to another) either due to a mode change or as requested as part of the platform fault management strategy. This argument will address state preservation and recovery.
• SpecificConfigArg - Module arguing the safety of the specific configuration of applications running on the platform. Module supported by a once-for-all argument concerning the safety of configuration rules and specific modules addressing application safety.
• TopLevelArg - The top level (once-for-all) argument of the safety of the platform (in any of its possible configurations) that defines the top level safety case architecture (use of other modules as defined above).
• ConfigurationRulesArg - Module arguing the safety of a defined set of rules governing the possible combinations and configurations of applications on the platform. Ideally defined once-for-all.
• TransientArg - Module arguing the safety of the platform during transient phases (e.g. start-up and shut-down).

An important distinction is drawn above between those arguments that ideally can be established as ‘once-for-all’ arguments that hold regardless of the specific applications placed on the architecture (and should therefore be unaffected by application change) and those that are configuration dependent. In the same way as there is an infrastructure to the IMA system itself, the safety case modules that are established once for all possible application configurations form the infrastructure of this particular safety case architecture. These modules (e.g. NonInterfArg) establish core safety claims such as non-interference between applications by appeal to properties of the underlying system infrastructures. These properties can then be relied upon by the application level arguments.
An important distinction is drawn above between those arguments that ideally can be established as ‘once-for-all’ arguments that hold regardless of the specific applications placed on the architecture (and should therefore be unaffected by application change) and those that are configuration dependent. In the same way as there is an infrastructure to the IMA system itself the safety case modules that are established once for all possible application configurations form the infrastructure of this particular safety case architecture. These modules (e.g. NonInterfArg) establish core safety claims such as non-interference between applications by appeal to properties of the underlying system infrastructures. These properties can then be relied upon by the application level arguments. TopLevelArg Top Level System Argument for the platform + configured applications
SpecificConfigArg Safety argument for the specific configuration of the system
ApplnAArg
ApplnBArg
Specific safety arguments concerning the functionality of Application A
Specific safety arguments concerning the functionality of Application B
ConfigRulesArg ApplnInteractionArg Argument for the safety of interactions between applications
Safety argument based upon an allowable set of configurations
Hardware Arg
CompilationArg (As Example) Arguments of the integrity of the compilation path
NonInterfArg
InteractionIntArg
Arguments of the absence of non-intentional interference between applications
Arguments concerning the integrity of intentional mechanisms for application interaction
Arguments of the correct execution of software on target hardware
PlatformArg Arguments concerning the integrity of the general purpose platform
PlatFaultMgtArg Argument concerning the platform fault management strategy
ResourcingArg Arguments concerning the sufficiency of access to, and integrity of, resources
TransientArg Arguments of the safety of the platform during transient phases
Fig. 6. Safety Case Architecture of Modularised IMA Safety Argument
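To make this distinction concrete, the following Python fragment is an illustration of ours (it is not part of the paper's notation): it records, for the modules of Figure 6, which are intended as once-for-all infrastructure modules and which argument each supports, and derives which modules would have to be revisited when an application changes. The support relations shown are an assumption read off the module descriptions above, not a definitive architecture.

SAFETY_CASE_MODULES = {
    # module: (once_for_all, supports)
    "TopLevelArg":           (True,  []),
    "SpecificConfigArg":     (False, ["TopLevelArg"]),
    "ConfigurationRulesArg": (True,  ["SpecificConfigArg"]),
    "ApplnAArg":             (False, ["SpecificConfigArg"]),
    "ApplnInteractionArg":   (False, ["SpecificConfigArg"]),
    "InteractionIntArg":     (True,  ["ApplnInteractionArg"]),
    "NonInterfArg":          (True,  ["ApplnInteractionArg"]),
    "ResourcingArg":         (True,  ["TopLevelArg"]),
    "PlatFaultMgtArg":       (True,  ["TopLevelArg"]),
    "ModeChangeArg":         (True,  ["TopLevelArg"]),
    "CompilationArg":        (True,  ["TopLevelArg"]),
    "HardwareArg":           (True,  ["TopLevelArg"]),
    "TransientArg":          (True,  ["TopLevelArg"]),
}

def affected_by_application_change():
    """Configuration-dependent modules plus every argument they support."""
    affected = {m for m, (once_for_all, _) in SAFETY_CASE_MODULES.items()
                if not once_for_all}
    changed = True
    while changed:                      # propagate along the 'supports' relation
        changed = False
        for module, (_, supports) in SAFETY_CASE_MODULES.items():
            if module in affected:
                for target in supports:
                    if target not in affected:
                        affected.add(target)
                        changed = True
    return affected

print(sorted(affected_by_application_change()))
# ['ApplnAArg', 'ApplnInteractionArg', 'SpecificConfigArg', 'TopLevelArg']

On this reading, an application change touches only the application-specific and configuration-specific arguments and the top-level argument that composes them; the infrastructure modules remain untouched.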
5 Conclusions
In order to reap the potential benefits of modular construction of safety critical and safety related systems, a modular approach to safety case construction and acceptance is also required. This paper has presented a method to support architectural design and implementation strategy trade-off analysis, one of the key parts of component-based development. Specifically, the method provides guidance on decomposing systems so that the system's objectives are met, and on deciding what functionality the components should fulfil in order to achieve the remaining objectives.
A Problem-Oriented Approach to Common Criteria Certification

Thomas Rottke¹, Denis Hatebur¹, Maritta Heisel², and Monika Heiner³

¹ TÜViT GmbH, System- und Softwarequalität, Am Technologiepark 1, 45032 Essen, Germany, {t.rottke,d.hatebur}@tuvit.de
² Institut für Praktische Informatik und Medieninformatik, Technische Universität Ilmenau, 98693 Ilmenau, Germany, [email protected]
³ Brandenburgische Technische Universität Cottbus, Institut für Informatik, 03013 Cottbus, Germany, [email protected]
Abstract. There is an increasing demand to certify the security of systems according to the Common Criteria (CC). The CC distinguish several evaluation assurance levels (EALs), level EAL7 being the highest and requiring the application of formal techniques. We present a method for requirements engineering and (semi-formal and formal) modeling of systems to be certified according to the higher evaluation assurance levels of the CC. The method is problem oriented, i.e. it is driven by the environment in which the system will operate and by a mission statement. We illustrate our approach by an industrial case study, namely an electronic purse card (EPC) to be implemented on a Java Smart Card. As a novelty, we treat the mutual asymmetric authentication of the card and the terminal into which the card is inserted.
1 Introduction
In daily life, security-critical systems play a more and more important role. For example, smart cards are used for an increasing number of purposes, and e-commerce and other security-critical internet activities become increasingly common. As a consequence, there is a growing demand to certify the security of systems. The common criteria (CC) [1] are an international standard that is used to assess the security of IT products and systems. The CC distinguish several evaluation assurance levels (EALs), level EAL7 being the highest and requiring the application of formal techniques even in the high-level design. Whereas the CC state conditions to be met by secure systems, they do not assist in constructing the systems in such a way that the criteria are met. In this paper, we present a method for requirements engineering and (semi-formal or formal) modeling of systems to be certified according to the higher evaluation
assurance levels of the CC. This method is used by TÜViT Essen¹ in supporting product evaluations. The distinguishing feature of our method is its problem orientation. First, problem orientation means that the starting point of the system modeling is an explicit mission statement, which is expressed in terms of the application domain. This approach is well established in systems engineering [3], but stands in contrast to other requirements engineering approaches, especially in software engineering [9]. Such a mission statement consists of the following parts:
1. external actors and processes
2. objective/mission of the system
3. system services
4. quality of services
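As a small illustration of ours (not a notation prescribed by the method or the CC), such a mission statement can be recorded as a simple structured object; the EPC entries shown are hypothetical examples anticipating the case study of Section 3.

from dataclasses import dataclass, field
from typing import List

@dataclass
class MissionStatement:
    """The four parts of a problem oriented mission statement."""
    external_actors_and_processes: List[str] = field(default_factory=list)
    objective: str = ""
    system_services: List[str] = field(default_factory=list)
    quality_of_services: List[str] = field(default_factory=list)

# Hypothetical excerpt for an electronic purse card system.
epc_mission = MissionStatement(
    external_actors_and_processes=["customer", "store personnel", "bank terminal"],
    objective="anonymous but authentic electronic payment in retail stores",
    system_services=["personalize EPC", "load EPC at bank terminal", "pay with EPC"],
    quality_of_services=["non-repudiation", "stolen cards are unusable"],
)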
From our experience, the mission statement provides the main criteria for assessing, prioritizing, and interpreting requirements.

Second, problem orientation means that we model not only the system to be constructed, but also its environment, as proposed by Jackson [6]. This approach has several advantages:

– Without modeling the environment, only trivial security properties can be assessed. For example, an intruder who belongs to the environment must be taken into account to demonstrate that a system is secure against certain attacks.
– With the model of the environment, a test system can be constructed at the same time as the system itself.
– The problem oriented approach results in a strong correspondence between the reality and the model, which greatly enhances the validation and verification activities required for certification.

In the CC, the environment of the system is contained only indirectly via subjects, which must be part of the security policy model. Another difference from the CC is that our method not only takes into account the new system to be constructed, but also performs an analysis of the current system in its environment.

Figure 1 shows the most important documents that have to be constructed and evaluated for certification:

– The objective of the security target (ST) is to specify the desired security properties of the system in question by means of security requirements and assurance measures.
– The security policy model (SPM) shows the interaction between the system and its environment. This model provides a correspondence between the functional requirements and the functional specification enforced by the security policy.
¹ TÜViT is an independent organization that performs IT safety and security evaluation.
– The functional specification (FSP), high-level design (HLD), low-level design (LLD) and implementation (IMP) are development documents that are subject to evaluation.
– In addition to the development documents, representation correspondences (RCR) documentation is required to ensure that appropriate refinement steps have been performed in the development documents.

Fig. 1. CC documents for development

In the following, we describe our problem oriented approach in Section 2 and then illustrate it in Section 3 by an industrial case study, namely an EPC to be implemented on a Java Smart Card. As a novelty, we treat the mutual asymmetric authentication of the card and the terminal into which the card is inserted. In this case study, we use the notations SDL [2], message sequence charts (MSCs) [5], and colored Petri nets [7]. Finally, we sum up the merits of our method and point out directions for future work.
2 The Method
Our method gives guidance on how to develop the documents required for a CC certification in a systematic way. Because of the systematic development and the use of semi-formal and formal notations, the developed documents can readily be evaluated for conformance with the CC.

To express our method, we use the agenda concept [4]. An agenda is a list of steps or phases to be performed when carrying out some task in the context of software engineering. The result of the task will be a document expressed in some language. Agendas contain informal descriptions of the steps, which may depend on each other. Agendas are not only a means to guide software development activities; they also support quality assurance, because the steps may have validation conditions associated with them. These validation conditions state necessary semantic conditions that the developed artifact must fulfill in order to serve its purpose properly.

Table 1 gives an overview of the method. Note that the method does not terminate with Phase 4. There are two more phases that are beyond the scope of this paper.
Table 1. Agenda for problem oriented requirements engineering and system modeling

Phase | Content | Format | CC Documents | Validation
1. Problem oriented requirements capture | list of requirements | informal | — | reviews
2. Analysis of current system | description of current system status | informal or semi-formal | — | reviews
3. Problem oriented requirements analysis | description of desired system status, mission statement | informal or semi-formal | ST: environment, TOE description, security objectives | each statement of Phase 1 must be incorporated; internal consistency must be guaranteed
4. Problem oriented modeling | context diagram, system interface descriptions, system environment description | possibly formal | ST: functional requirements, summary specification; FSP, SPM | see sub-agenda, Table 2
In Phase 5, the model constructed in Phase 4 is validated. Finally, the model is refined by constructing a high-level design, a low-level design, and an implementation (Phase 6). In this paper, however, we concentrate on the systematic development of the requirements and specification documents. Validation and refinement issues will be treated in separate papers.

Setting up the documents required by the CC need not necessarily proceed in the order prescribed by the CC outline; our process proceeds in a slightly different order. The "CC Documents" column in Table 1 shows in which phases which CC documents are developed.

The purpose of Phase 1 is to collect the requirements for the system. These requirements are expressed in the terminology of the system environment or the application domain, respectively. Requirements capture is performed by conducting interviews and studying documents. The results of Phase 1 are validated by reviewing the minutes of the interviews together with the interview partner and by reviewing the used documents with their authors.

In Phase 2, the current state of affairs must be described, analyzed, and assessed. External actors and entities must be identified; functionalities must be described and decomposed. The results of this phase are domain-specific rules, as well as descriptions of the strengths and weaknesses of the current system. As in Phase 1, the validation of the produced results is done by reviews.

The results of Phases 1 and 2 are not covered by the CC. However, they are needed to provide a firm basis for the preparation of the CC documents.
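To illustrate the agenda concept used here, the following Python sketch of ours (it is not the notation of [4]) represents phases with dependencies and validation conditions; the validation condition shown for Phase 3 is a simplified placeholder for the condition stated in Table 1.

from dataclasses import dataclass, field
from typing import Callable, List, Optional

@dataclass
class Phase:
    name: str
    produces: str                                        # result of the phase
    depends_on: List[str] = field(default_factory=list)
    validate: Optional[Callable[[dict], bool]] = None    # validation condition

# Illustrative agenda for Phases 1-3 of Table 1.
agenda = [
    Phase("1 Requirements capture", "list of requirements"),
    Phase("2 Analysis of current system", "description of current system",
          depends_on=["1 Requirements capture"]),
    Phase("3 Requirements analysis", "mission statement",
          depends_on=["1 Requirements capture", "2 Analysis of current system"],
          validate=lambda results: all(
              req in results["mission statement"]
              for req in results["list of requirements"])),
]

def run(agenda, results):
    """Check dependencies and validation conditions over produced results."""
    done = set()
    for phase in agenda:
        assert all(d in done for d in phase.depends_on), f"{phase.name}: missing dependency"
        if phase.validate is not None:
            assert phase.validate(results), f"{phase.name}: validation failed"
        done.add(phase.name)

run(agenda, {"list of requirements": ["anonymous payment"],
             "mission statement": ["anonymous payment", "non-repudiation"]})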
Table 2. Agenda for problem oriented system modeling

Phase | Content | Format | CC Documents | Validation
4.1. Context modeling | structure of system embedded in its environment | context diagram | SPM, part 1 | must be compatible with Phase 3
4.2. Define constraints and system properties | TOE security requirements, security requirements of environment | instantiated text from CC part 2 catalogue | ST: functional requirements, summary specification | see Phase 3
4.3. Interface definition | data formats, system behavior at external interfaces | data dictionary, MSCs | FSP, part 1 | each service contained in the mission statement must be modeled
4.4. Modeling of system environment | external components and their behavior, environmental constraints and assumptions | CEFSM | SPM, part 2 | must be compatible with Phases 3 and 4.1
4.5. Modeling of system services | service specifications | informal text and CEFSM | FSP, part 2 | see Phase 4.3
The goal of the requirements analysis, i.e. Phase 3, is to qualitatively describe which purpose the new system serves in its environment, and which services it must provide. Strict requirements and constraints for the new system are set up. As in Phase 2 for the existing system, external actors and entities are identified for the desired system. The requirements captured in Phase 1 can now be made more concrete. Thus, the mission statement is set up. The validation condition associated with Phase 3 requires that all requirements captured in Phase 1 be taken into account in the mission statement. In contrast to Phase 2, which makes descriptive statements, Phase 3 makes prescriptive statements.

The purpose of Phase 4 is to define the system and its environment. The system entities and their attributes are defined, as well as the processes and procedures they are involved in. Case distinctions imposed by domain rules are identified with respect to the entities and processes. This phase consists of five sub-phases, which are also represented as an agenda, see Table 2. In Phase 4.1, the boundary between the system and its environment is defined. In Phase 4.2, the security requirements for the target of evaluation (TOE) and for the system environment are instantiated from the CC by defining constraints or properties.
In Phase 4.3, the interface of the system is specified in detail. MSCs are used to represent traces of system services. In Phases 4.4 and 4.5, communicating extended finite state machines (CEFSMs) are set up for the environment as well as for each system service identified in the mission statement. For each service, functional as well as non-functional properties must be defined.
3 Case Study: Electronic Purse Card (EPC)
We now illustrate the method presented in Section 2 by an industrial case study². EPCs are introduced to replace Eurocheque (EC) cards for the payment of goods purchased in retail stores. For reasons of space, we can present only parts of the documents produced when performing our method.

As a novelty, we define an EPC system that uses mutual asymmetric authentication. This security protocol guarantees mandatory authenticity of the two communication partners. In contrast to symmetric authentication, asymmetric authentication procedures have the advantage that they do not need a common secret between the partners. In the case of asymmetric authentication, each communication partner has its own key pair, which is generated independently from other key pairs within the terminals and the cards, respectively. A personalization authority initiates the key generation within the components and the signing of the public key by a certification authority to ensure the correctness of the key generation procedure. By using asymmetric authentication, the e-cash procedure becomes open to other terminal manufacturers and card issuers, as long as their components are personalized by the personalization authority.

In the following, we sketch each of the development phases introduced in Section 2.

Phase 1: problem oriented requirements capture. Requirements for the EPC system include:

– Payment must be simple for all participants.
– Payment is anonymous, but at the same time authentic; non-repudiation is guaranteed.
– Stolen EPCs cannot be used.
– EPCs and terminals can be neither intercepted nor forged.

Phase 2: analysis of current system. In this phase, it is described how payment with EC cards proceeds. Figure 2 shows the different stages. Examples of domain-specific rules are that a personal identification number (PIN) has four digits, and that it must be counted how often a wrong PIN has been entered.
² Similar systems have been evaluated by TÜViT, Essen.
Fig. 2. Payment with Eurocheque card
Fig. 3. EPC system
Some weaknesses of the EC card system are that payments are not anonymous, that the access to the customer's account is protected only by the PIN, and that the connection between the terminal and the EC card is insecure. Hence, customer profiles can be constructed, the customer can be harmed by revelation of the PIN, and the system is not protected against man-in-the-middle attacks.

Phase 3: problem oriented requirements analysis. EPCs function differently from EC cards. Before the customer can pay with the EPC, the card must be loaded. For this purpose, it must be inserted into a bank terminal, and the PIN and the desired amount must be entered. Purchasing goods with the EPC proceeds similarly to paying with the EC card, but the amount is debited from the card instead of from the customer's account. Moreover, the bank and cash terminals and the EPC must be personalized by a personalization authority. This means that a pair of keys (a public and a private one) is generated for each component, where the public key is certified by a certification authority. Figure 3 shows the desired services of the EPC system.

The EPC system, too, has potential weaknesses. For example, a man-in-the-middle attack or spying out the PIN may be possible. An analysis of such weaknesses and the corresponding attack scenarios leads to the following security goals:

– Debiting the customer account is done in a secure environment (bank terminal).
– An EPC is useless if its secrets are unknown.
– Neither the cards nor the terminals can be copied or forged.
– The connections between the terminals and the EPC are encrypted, so that intercepting those connections is useless.
– Transactions take place only between authenticated components.

The following assumptions concerning the environment must be made:

– The personalization of the card is secure.
– The bank terminals are installed in a protected area and cannot be intercepted.

Now, the requirements set up in Phase 1 can be made more concrete. Simplicity of payment means that a payment is performed just by debiting the EPC. The customers only need to type their PIN and to confirm the transaction. The store personnel only needs to specify the amount and to hand the printed receipt to the customer. Anonymity is guaranteed, because the only documentation of the payment is the customer receipt. Because of the authentication mechanism, authenticity and non-repudiation are guaranteed. Stolen cards cannot be used, because the PIN is known only to the card holder, and the card secret will not be revealed by an authenticated terminal. Interception is made useless by encryption, and copying cards is prevented by preventing direct access to the physical card storage. In this paper, we have only given an informal sketch of Phase 3. In real-life projects, this phase is performed much more thoroughly.

Phase 4.1: context modeling. We present two different documents that show the system and its embedding in its environment. Figure 4 shows the security policy model for the EPC system in SDL notation. It shows the EPC in its environment, consisting of an intruder, a terminal, a personalization authority, and a certification authority (CA). The personalization and CA components are not the main concern of our discourse and are therefore drawn with dotted lines. The terminal is used for e-cash transactions. It is personalized, which means that it has a key pair, and its public key is signed by the certification authority. The intruder models a man-in-the-middle attack, i.e. the intruder intercepts the communication between card and terminal and can therefore attack both the card and the terminal. The EPC is the target of evaluation (TOE). The card application includes functionality for mutual asymmetric authentication, PIN check, and credit and debit transactions. It is assumed that the card is personalized.

The components in the SPM are interconnected by channels, shown in the diagram by inscribed arcs. The external channels chUser and chCard represent the interactions between Terminal and User and between Terminal and CardReader. The internal channels connect Terminal and Intruder or Intruder and EPC, respectively. If a system is to be certified according to EAL7, we need a completely formal model, which we express as a colored Petri net (CPN).
Fig. 4. SDL security policy model

Phase 4.2: define constraints and system properties. As an example from the CC part 2 catalogue, we take the component FCS_COP.1.1 (Cryptographic operation, from the class Cryptographic support):

FCS_COP.1.1 The TSF³ shall perform [assignment: list of cryptographic operations] in accordance with a specified cryptographic algorithm [assignment: cryptographic algorithm] and cryptographic key sizes [assignment: cryptographic key sizes] that meet the following: [assignment: list of standards].

For our EPC system, this component is instantiated as follows:

FCS_COP.1.1 The TSF shall perform the mutual authentication procedure in accordance with a specified cryptographic algorithm RSA and cryptographic key sizes of 1024 bit that meet the following: IEEE 1363.

In addition, it is necessary to follow all dependencies between the components. In this case, the FCS_COP.1.1 component requires the cryptographic key generation component to be included.

Phase 4.3: interface definition. As an example, we consider the asymmetric authentication protocol. It is specified by means of a message sequence chart, which is in fact the common specification technique for technical protocols.
³ TOE security function
Fig. 5. MSC of authentication protocol
Figure 5 shows a successful asymmetric mutual authentication between a terminal and an EPC, intercepted by an intruder. The sequence of actions can be divided into three phases:

– public key exchange and check of the public key signatures, using the public key of the CA,
– random number request by the terminal,
– authentication and session key generation.

Each phase starts with a command called by the terminal, followed by an EPC response.
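To make the message flow of Figure 5 concrete, the following Python sketch of ours replays the three phases between a terminal and a card. It is an illustration only: sign, check_sig, encrypt, decrypt and calc_sk are insecure stand-ins (the instantiated component above specifies RSA with 1024-bit keys according to IEEE 1363), and checking the signed card nonce with the terminal's public key is our reading of the exchange.

import hashlib, os

# --- insecure stand-ins for the asymmetric primitives assumed by the protocol ---
def sign(private_key, data):              # placeholder for an RSA signature
    return hashlib.sha256(private_key + data).digest()

def check_sig(public_key, signature, data):
    # stand-in only: here the "public key" equals the signing key
    assert sign(public_key, data) == signature, "signature check failed"

def encrypt(public_key, data):            # placeholder for RSA encryption
    return bytes(a ^ b for a, b in zip(data, hashlib.sha256(public_key).digest()))

decrypt = encrypt                          # the XOR stand-in is its own inverse

def calc_sk(rnd_a, rnd_b):                 # placeholder session-key derivation
    return hashlib.sha256(rnd_a + rnd_b).digest()

def make_keypair():                        # stand-in: public key equals private key
    k = os.urandom(32)
    return k, k

ca_priv, ca_pub = make_keypair()           # certification authority
term_priv, term_pub = make_keypair()       # personalized terminal
card_priv, card_pub = make_keypair()       # personalized EPC
sig_term_pub = sign(ca_priv, term_pub)     # CA-signed public keys (personalization)
sig_card_pub = sign(ca_priv, card_pub)

# Phase 1: public key exchange and check of the CA signatures (sTransferKey)
check_sig(ca_pub, sig_term_pub, term_pub)  # performed by the card
check_sig(ca_pub, sig_card_pub, card_pub)  # performed by the terminal

# Phase 2: random number request by the terminal (sGetRandom)
rnd_card = os.urandom(16)                  # generated by the card

# Phase 3: authentication and session key generation (sMutualAuth)
rnd_term, sk_rnd_term = os.urandom(16), os.urandom(16)
sig_rnd_card = sign(term_priv, rnd_card)   # terminal signs the card's nonce
msg = (sig_rnd_card, encrypt(card_pub, rnd_term), encrypt(card_pub, sk_rnd_term))

# Card side: verify the terminal, answer with its own contribution
sig_rnd_card_rx, encr_rnd_term, encr_sk_rnd_term = msg
check_sig(term_pub, sig_rnd_card_rx, rnd_card)
sk_rnd_card = os.urandom(16)
sig_rnd_term = sign(card_priv, decrypt(card_priv, encr_rnd_term))
session_key_card = calc_sk(sk_rnd_card, decrypt(card_priv, encr_sk_rnd_term))
reply = (sig_rnd_term, encrypt(term_pub, sk_rnd_card))

# Terminal side: verify the card, derive the same session key
sig_rnd_term_rx, encr_sk_rnd_card = reply
check_sig(card_pub, sig_rnd_term_rx, rnd_term)
session_key_term = calc_sk(decrypt(term_priv, encr_sk_rnd_card), sk_rnd_term)
assert session_key_card == session_key_term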
Fig. 6. SDL definition of authentication
Phase 4.5: modeling of system services. We consider the authentication service, where we present an SDL and a CPN version. The EPC part of the asymmetric authentication protocol is modeled as an SDL state machine, see Figure 6. The three phases of the MSC are reflected by the three parts of the state machine. The state machine starts in the state PersonalizedAndInserted.

1. The phase "Public Key Exchange and Check of Public Key Signature" leads to the state WaitForGetRandom or remains in the state PersonalizedAndInserted.
2. The phase "Random Number Request by Terminal" leads to the state WaitForMutualAuth or back to the state PersonalizedAndInserted.
3. The phase "Authentication and Session Key Generation" leads to the state TermAuthenticated or back to the state PersonalizedAndInserted.

The state machine is now modeled formally in CPN, see Figure 7. It is just a translation of the SDL protocol machine into CPN. Therefore the CPN model also contains the same three phases, states and transitions. In CPN, the states of the state machine are modeled by places for simple tokens.
Fig. 7. CPN model of authentication
The channels are also modeled by places, but using more complex tokens (colors). Arc inscriptions are used to model the functionality of the transitions. Conditions are modeled with the CPN guard mechanism.

In the subsequent phases of the method, these documents will be further validated and refined. Because of the formal nature of the documents, the required security properties can be demonstrated in a routine way [8].
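As a plain-Python analogue of the SDL protocol machine of Figure 6 (an illustration of ours, not the SDL or CPN model itself), the EPC side can be written as a small state machine over the same states; the cryptographic checks are abstracted into a boolean parameter.

class CardMutualAuthentication:
    """EPC-side protocol machine with the states of Figure 6."""

    def __init__(self):
        self.state = "PersonalizedAndInserted"

    def receive(self, signal, check_ok=True):
        """Handle one incoming signal; any unexpected signal resets the
        machine to PersonalizedAndInserted, as in the SDL model."""
        if self.state == "PersonalizedAndInserted" and signal == "sTransferKey":
            # phase 1: check the CA signature of the terminal's public key
            self.state = "WaitForGetRandom" if check_ok else "PersonalizedAndInserted"
            return "sTransferKeyReturn" if check_ok else "sTransferKeyError"
        if self.state == "WaitForGetRandom" and signal == "sGetRandom":
            # phase 2: generate and return a random number
            self.state = "WaitForMutualAuth"
            return "sGetRandomReturn"
        if self.state == "WaitForMutualAuth" and signal == "sMutualAuth":
            # phase 3: check the terminal's signature, derive the session key
            self.state = "TermAuthenticated" if check_ok else "PersonalizedAndInserted"
            return "sMutualAuthReturn" if check_ok else "sMutualAuthError"
        self.state = "PersonalizedAndInserted"       # the '*' transitions
        return None

card = CardMutualAuthentication()
assert card.receive("sTransferKey") == "sTransferKeyReturn"
assert card.receive("sGetRandom") == "sGetRandomReturn"
assert card.receive("sMutualAuth") == "sMutualAuthReturn"
assert card.state == "TermAuthenticated"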
4 Conclusions
We have presented a method to model systems in such a way that their certification according to the higher levels of the CC is well prepared. This method shows that problem analysis can be performed in an analytic and systematic way, even though problem analysis is often regarded as an unstructured task that needs, above all, creative techniques. In contrast, we are convinced that problem
analysis requires sound engineering techniques to achieve non-trivial and high-quality results. Using our method, formal documents, as required by the CC, can be developed in an appropriate way. The problem orientation of the method ensures a high degree of correspondence between the system model and the reality. This is due to the facts that the modeling process is oriented towards the system mission and that the requirements are analyzed in terms of the application domain. Such a correspondence is crucial. If it is not given, inadequate models may be set up. Such inadequate models may have serious consequences: relevant properties may be impossible to prove, while irrelevant properties may be proven, which would lead to an unjustified trust in the system.

Our method is systematic and thus repeatable, and it gives guidance on how to model security properties. The risk of omissions is reduced, because the agenda leads the attention of the system engineers to the relevant points.

Because of our method, we are now able to suggest some improvements to the CC. Until now, the CC required security models only for access control policies and information flow policies, because only these belonged to the state of the art. By modeling the system environment, we have succeeded in setting up a formal model also for authentication. To the best of our knowledge, we are the first to propose such a systematic, problem oriented approach to CC certification.

In the future, we will work on validation and refinement in general, and on a complete validation of the authentication SPM in particular.
References

[1] Common criteria. See http://www.commoncriteria.org/.
[2] F. Belina and D. Hogrefe. The CCITT Specification and Description Language SDL. Computer Networks and ISDN Systems, 16(4):311–341, March 1989.
[3] B. Blanchard and W. Fabrycky. Systems Engineering and Analysis. Prentice Hall, 1980.
[4] M. Heisel. Agendas – a concept to guide software development activities. In R. N. Horspool, editor, Proc. Systems Implementation 2000, pages 19–32. Chapman & Hall, London, 1998.
[5] ITU-TS. ITU-TS Recommendation Z.120 Annex B: Formal Semantics of Message Sequence Charts. Technical report, ITU-TS, Geneva, 1998.
[6] M. Jackson. Problem Frames. Analyzing and Structuring Software Development Problems. Addison-Wesley, 2001.
[7] K. Jensen. Colored Petri nets. Lecture Notes in Computer Science: Advances in Petri Nets, 254:248–299, 1986.
[8] K. Jensen. Colored Petri nets, Vol. II. Springer, 1995.
[9] G. Kotonya and I. Sommerville. Requirements Engineering. Wiley, 1997.
Author Index Androutsopoulos, Kelly . . . . . . . . . 82 Bate, Iain . . . . . . . . . . . . . . . . . . . . . 321 Benerecetti, Massimo . . . . . . . . . . 126 Bishop, Peter G. . . . . . . . . . . 163, 198 Bloomfield , Robin . . . . . . . . . . . . . 198 Bobbio, Andrea . . . . . . . . . . . 212, 273 Bologna, Sandro . . . . . . . . . . . . . . . . . 1 Born, Bob W. . . . . . . . . . . . . . . . . . 186 Bredereke, Jan . . . . . . . . . . . . . . . . . . 19 Campelo, Jos´e Carlos . . . . . . . . . . 261 Chen, Luping . . . . . . . . . . . . . . . . . . 151 Ciancamerla, Ester . . . . . . . .212, 273 Clark, David . . . . . . . . . . . . . . . . . . . . 82 Clement, Tim . . . . . . . . . . . . . . . . . .198 Cugnasca, Paulo S´ergio . . . . . . . . 224 Dafelmair, Ferdinand J. . . . . . . . . . 61 Dhodapkar, S. D. . . . . . . . . . . . . . . 284 Dimitrakos, Theo . . . . . . . . . . . . . . . 94 Droste, Thomas . . . . . . . . . . . . . . . . .53 Franceschinis, Giuliana . . . . . . . . 212 Fredriksen, Rune . . . . . . . . . . . . . . . .94 Gaeta, Rossano . . . . . . . . . . . . . . . . 212 Gran, Bjørn Axel . . . . . . . . . . . . . . . 94 Gribaudo, Marco . . . . . . . . . . . . . . 273 Guerra, Sofia . . . . . . . . . . . . . . . . . . 198 Hartswood, Mark . . . . . . . . . . . . . . . 32 Hatebur, Denis . . . . . . . . . . . . . . . . 334 Heidtmann, Klaus . . . . . . . . . . . . . . 70 Heiner, Monika . . . . . . . . . . . . . . . . 334 Heisel, Maritta . . . . . . . . . . . . . . . . 334 Hering, Bernhard . . . . . . . . . . . . . . 296 Hollnagel, Erik . . . . . . . . . . . . . . . . 1, 4 Horv´ ath, A. . . . . . . . . . . . . . . . . . . . 273 Hughes, Gordon . . . . . . . . . . . . . . . 151 Jacobs, Jef . . . . . . . . . . . . . . . . . . . . 175 Kelly, Tim . . . . . . . . . . . . . . . . . . . . . 321 Kim, Tai-Yun . . . . . . . . . . . . . . . . . . . 44 Knight, John C. . . . . . . . . . . . . . . . 106 Kristiansen, Monica . . . . . . . . . . . . .94
Lankenau, Axel . . . . . . . . . . . . . . . . . 19 Lano, Kevin . . . . . . . . . . . . . . . . . . . . 82 Littlewood, Bev . . . . . . . . . . . . . . . 249 May, John . . . . . . . . . . . . . . . . . . . . . 151 Minichino, Michele . . . . . . . . 212, 273 Moeini, Ali . . . . . . . . . . . . . . . . . . . . 252 Mohajerani, MahdiReza . . . . . . . 252 Oliveira, ´Italo Romani de . . . . . . 224 Opperud, Tom Arthur . . . . . . . . . . 94 Ortmeier, Frank . . . . . . . . . . . . . . . 296 Panti, Maurizio . . . . . . . . . . . . . . . . 126 Papadopoulos, Yiannis . . . . . . . . . 236 Paynter, Stephen E. . . . . . . . . . . . 186 Popov, Peter . . . . . . . . . . . . . . . . . . 139 Portinale, Luigi . . . . . . . . . . . . . . . . 212 Procter, Rob . . . . . . . . . . . . . . . . . . . . 32 Ramesh, S. . . . . . . . . . . . . . . . . . . . . 284 Reif, Wolfgang . . . . . . . . . . . . . . . . . 296 Rhee, Yoon-Jung . . . . . . . . . . . . . . . 44 Rodr´ıguez, Francisco . . . . . . . . . . . 261 Rottke, Thomas . . . . . . . . . . . . . . . 334 Rouncefield, Mark . . . . . . . . . . . . . . 32 Saridakis, Titos . . . . . . . . . . . . . . . . 309 Schellhorn, Gerhard . . . . . . . . . . . 296 Serrano, Juan Jos´e . . . . . . . . . . . . .261 Servida, Andrea . . . . . . . . . . . . . . . . 10 Sharma, Babita . . . . . . . . . . . . . . . . 284 Slack, Roger . . . . . . . . . . . . . . . . . . . . 32 Spalazzi, Luca . . . . . . . . . . . . . . . . . 126 Stølen, Ketil . . . . . . . . . . . . . . . . . . . . 94 Tacconi, Simone . . . . . . . . . . . . . . . 126 Thums, Andreas . . . . . . . . . . . . . . . 296 Trappschuh, Helmut . . . . . . . . . . . 296 Trienekens, Jos . . . . . . . . . . . . . . . . 175 Tronci, Enrico . . . . . . . . . . . . . . . . . 273 Voß, Alexander . . . . . . . . . . . . . . . . . 32 Williams, Robin . . . . . . . . . . . . . . . . 32 Zhang, Wenhui . . . . . . . . . . . . . . . . 113