INDUSTRIAL INFORMATION TECHNOLOGY SERIES

Series Editor
RICHARD ZURAWSKI

Published Books

Industrial Communication Technology Handbook
Edited by Richard Zurawski

Embedded Systems Handbook
Edited by Richard Zurawski

Forthcoming Books

Electronic Design Automation for Integrated Circuits Handbook
Luciano Lavagno, Grant Martin, and Lou Scheffer
EMBEDDED SYSTEMS HANDBOOK

Edited by
RICHARD ZURAWSKI

Boca Raton   London   New York

A CRC title, part of the Taylor & Francis imprint, a member of the Taylor & Francis Group, the academic division of T&F Informa plc.
Published in 2006 by
CRC Press, Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742

© 2006 by Taylor & Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group

No claim to original U.S. Government works
Printed in the United States of America on acid-free paper
10 9 8 7 6 5 4 3 2 1

International Standard Book Number-10: 0-8493-2824-1 (Hardcover)
International Standard Book Number-13: 978-0-8493-2824-4 (Hardcover)
Library of Congress Card Number 2005040574

This book contains information obtained from authentic and highly regarded sources. Reprinted material is quoted with permission, and sources are indicated. A wide variety of references are listed. Reasonable efforts have been made to publish reliable data and information, but the author and the publisher cannot assume responsibility for the validity of all materials or for the consequences of their use.

No part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.

Library of Congress Cataloging-in-Publication Data

Embedded systems handbook / edited by Richard Zurawski.
    p. cm.
Includes bibliographical references and index.
ISBN 0-8493-2824-1 (alk. paper)
1. Embedded computer systems--Handbooks, manuals, etc. I. Zurawski, Richard.

TK7895.E42E64 2005
004.16--dc22        2005040574
Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com
and the CRC Press Web site at
http://www.crcpress.com

Taylor & Francis Group is the Academic Division of T&F Informa plc.
To my wife, Celine
International Advisory Board

Alberto Sangiovanni-Vincentelli, University of California, Berkeley, U.S. (Chair)
Giovanni De Micheli, Stanford University, U.S.
Stephen A. Edwards, Columbia University, U.S.
Aarti Gupta, NEC Laboratories, Princeton, U.S.
Rajesh Gupta, University of California, San Diego, U.S.
Axel Jantsch, Royal Institute of Technology, Sweden
Wido Kruijtzer, Philips Research, The Netherlands
Luciano Lavagno, Cadence Berkeley Laboratories, Berkeley, U.S., and Politecnico di Torino, Italy
Robert de Simone, INRIA, France
Grant Martin, Tensilica, U.S.
Pierre G. Paulin, ST Microelectronics, Canada
Antal Rajnák, Volcano AG, Switzerland
Françoise Simonot-Lion, LORIA, France
Thomas Weigert, Motorola, U.S.
Reinhard Wilhelm, University of Saarland, Germany
Lothar Thiele, Swiss Federal Institute of Technology, Switzerland
Preface

Introduction

The purpose of the Embedded Systems Handbook is to provide a reference useful to a broad range of professionals and researchers from industry and academia involved in the evolution of concepts and technologies, as well as the development and use of embedded systems and related technologies.

The book provides a comprehensive overview of the field of embedded systems and applications. The emphasis is on advanced material to cover recent significant research results and technology evolution and developments. It is primarily aimed at experienced professionals from industry and academia, but will also be useful to novices with some university background in embedded systems and related areas. Some of the topics presented in the book have received limited coverage in other publications, owing either to the fast evolution of the technologies involved, or to material confidentiality, or to limited circulation in the case of industry-driven developments.

The book covers extensively the design and validation of real-time embedded systems, design and verification languages, operating systems and scheduling, timing and performance analysis, power aware computing, security in embedded systems, the design of application-specific instruction-set processors (ASIPs), system-on-chip (SoC) and network-on-chip (NoC), testing of core-based ICs, networked embedded systems and sensor networks, and embedded applications to include in-car embedded electronic systems, intelligent sensors, and embedded web servers for industrial automation.

The book contains 46 contributions, written by leading experts from industry and academia directly involved in the creation and evolution of the ideas and technologies treated in the book. Many of the contributions are from industry and industrial research establishments at the forefront of the developments shaping the field of embedded systems: Cadence Systems and Cadence Berkeley Labs (USA), CoWare (USA), Microsoft (USA), Motorola (USA), NEC Laboratories (USA), Philips Research (The Netherlands), ST Microelectronics (Canada), Tensilica (USA), Volcano (Switzerland), etc. The contributions from academia and governmental research organizations are represented by some of the most renowned institutions such as Columbia University, Duke University, Georgia Institute of Technology, Princeton University, Stanford University, University of California at Berkeley/Riverside/San Diego/Santa Barbara, University of Texas at Austin/Dallas, Virginia Tech, Washington University — from the United States; Delft University of Technology (Netherlands), IMAG (France), INRIA/IRISA (France), LORIA-INPL (France), Malardalen University (Sweden), Politecnico di Torino (Italy), Royal Institute of Technology — KTH (Sweden), Swiss Federal Institute of Technology — ETHZ (Switzerland), Technical University of Berlin (Germany), Twente University (The Netherlands), Universidad Politecnica de Madrid (Spain), University of Bologna (Italy), University of Nice Sophia Antipolis (France), University of Oslo (Norway), University of Pavia (Italy), University of Saarbrucken (Germany), University of Toronto (Canada), and many others.

The material presented is in the form of tutorials, surveys, and technology overviews. The contributions are grouped into sections for cohesive and comprehensive presentation of the treated areas. The reports on recent technology developments, deployments, and trends frequently cover material released to the profession for the first time.
The book can be used as a reference (or prescribed text) for university (post)graduate courses: Section I (Embedded Systems) provides “core” material on embedded systems. Selected illustrations of actual applications are presented in Section VI (Embedded Applications). Sections II and III (System-on-Chip Design, and Testing of Embedded Core-Based Integrated Circuits) offer material on recent advances in system-on-chip design and testing of core-based ICs. Sections IV and V (Networked Embedded Systems, and Sensor Networks) are suitable for a course on sensor networks.
The handbook is designed to cover a wide range of topics that comprise the field of embedded systems and applications. The material covered in this volume will be of interest to a wide spectrum of professionals and researchers from industry and academia, as well as graduate students, in the fields of electrical and computer engineering, computer science and software engineering, and mechatronic engineering. It is an indispensable companion for those who seek to learn more about embedded systems and applications, and those who want to stay up to date with recent technical developments in the field. It is also a comprehensive reference for university or professional development courses on embedded systems.
Organization

Embedded systems is a vast field encompassing numerous disciplines. Not every topic, however important, can be covered in a book of reasonable volume without superficial treatment. Choices need to be made with respect to the topics covered, the balance between research material and reports on novel industrial developments and technologies, the balance between so-called "core" topics and new trends, and other aspects. The "time-to-market" is another important factor in making those decisions, along with the availability of qualified authors to cover the topics.

One of the main objectives of any handbook is to give a well-structured and cohesive description of fundamentals of the area under treatment. It is hoped that the section Embedded Systems has achieved this objective. Every effort was made to make sure that each contribution in this section contains introductory material to assist beginners with the navigation through more advanced issues. This section does not strive to replicate or replace university level material, but, rather, tries to address more advanced issues, and recent research and technology developments.

To make this book timely and relevant to a broad range of professionals and researchers, the book includes material reflecting state-of-the-art trends to cover topics such as the design of ASIPs, SoC communication architectures including NoC, the design of heterogeneous SoC, as well as testing of core-based integrated circuits. This material reports on new approaches, methods, technologies, and actual systems. The contributions come from the industry driving those developments, industry-affiliated research institutions, and academic establishments participating in major research initiatives.

Application domains have had a considerable impact on the evolution of embedded systems, in terms of required methodologies and supporting tools, and resulting technologies. A good example is the accelerated evolution of SoC design to meet demands for computing power posed by DSP, network, and multimedia processors. SoCs are slowly making inroads into the area of industrial automation to implement complex field-area intelligent devices which integrate the intelligent sensor/actuator functionality by providing on-chip signal conversion, data and signal processing, and communication functions. There is a growing tendency to network field-area intelligent devices around industrial communication networks. Similar trends appear in automotive electronic systems, where the Electronic Control Units (ECUs) are networked by means of safety-critical communication protocols, such as FlexRay, for the purpose of controlling vehicle functions such as electronic engine control, anti-locking brake system, active suspension, etc. The design of this kind of networked embedded system (this also includes hard real-time industrial control systems) is a challenge in itself due to the distributed nature of processing elements, the sharing of a common communication medium, and safety-critical requirements. With the automotive industry increasingly keen on adopting mechatronic solutions, it was felt that exploring, in detail, the design of in-vehicle electronic embedded systems would be of interest to the readers of this book. The applications part of the book also touches the area of industrial automation (networked control systems), where the issues are similar.
In this case, the focus is on the design of web servers embedded in the intelligent field-area devices, and the security issues arising from internetworking.

Sensor networks are another example of networked embedded systems, although the "embedding" factor is not as evident as in other applications, particularly for wireless and self-organizing networks where the nodes may be embedded in an ecosystem, a battlefield, or a chemical plant, for instance. The area of wireless sensor networks has now evolved into relative maturity. Owing to its novelty and growing importance, it has been included in the book to give a comprehensive overview of the area, and to present new research results which are likely to have a tangible impact on further developments and technology.

The specifics of the design automation of integrated circuits have been deliberately omitted in this book to keep the volume at a reasonable size, and in view of the publication of another handbook which covers these aspects in a comprehensive way: The Electronic Design Automation for Integrated Circuits Handbook, CRC Press, FL, 2005, Editors: Luciano Lavagno, Grant Martin, and Lou Scheffer.

The aim of the Organization section is to provide highlights of the contents of the individual chapters to assist readers with identifying material of interest, and to put topics discussed in a broader context. Where appropriate, a brief explanation of the topic under treatment is provided, particularly for chapters describing novel trends, and with novices in mind.

The book is organized into six sections: Embedded Systems, System-on-Chip Design, Testing of Embedded Core-Based Integrated Circuits, Networked Embedded Systems, Sensor Networks, and Embedded Applications.
I Embedded Systems

This section provides a broad introduction to embedded systems. The presented material offers a combination of fundamental and advanced topics, as well as novel results and approaches, to cover the area fairly comprehensively. The presented topics include issues in real-time and embedded systems, design and validation, design and verification languages, operating systems, timing and performance analysis, power aware computing, and security.
Real-Time and Embedded Systems

This subsection provides a context for the material covered in the book. It gives an overview of real-time and embedded systems and their networking, to include issues, methods, trends, applications, etc.

The focus of the chapter Embedded Systems: Toward Networking of Embedded Systems is on networking of embedded systems. It briefly discusses the rationale for the emergence of these kinds of systems, their benefits, types of systems, the diversity of application domains and the requirements arising from that, as well as security issues. Subsequently, the chapter discusses the design methods for networked embedded systems, which fall into the general category of system-level design. The methods overviewed focus on two separate aspects, namely the network architecture design and the system-on-chip design. The design issues and practices are illustrated by examples from the automotive application domain. After that, the chapter introduces selected application domains for networked embedded systems, namely: industrial and building automation control, and automotive control applications. The focus of the discussion is on the networking aspects. The chapter gives an overview of the networks used in industrial applications, including the industrial Ethernet and its standardization process; building automation control; and networks for automotive control and other applications from the automotive domain — but the emphasis is on networks for safety-critical solutions. Finally, general aspects of wireless sensor/actuator networks are presented, and illustrated by an actual industrial implementation of the concept. At the end of the chapter, a few paragraphs are dedicated to the security issues for networked embedded systems.

An authoritative introduction to real-time systems is provided in Real-Time in Embedded Systems. The chapter covers extensively the areas of design and analysis, with some examples of analysis, as well as tools; operating systems (an in-depth discussion of real-time embedded operating systems is presented in the chapter Real-Time Embedded Operating Systems: Standards and Perspectives); scheduling (the chapter Real-Time Embedded Operating Systems: The Scheduling and Resource Management Aspects presents an authoritative description and analysis of real-time scheduling); communications, to include descriptions of selected fieldbus technologies and Ethernet for real-time communications; and component-based design, as well as testing and debugging. This is essential reading for anyone interested in the area of real-time systems.
Design and Validation of Embedded Systems

The subsection Design and Validation of Embedded Systems contains material presenting design methodology for embedded systems and supporting tools, as well as selected software and hardware implementation aspects. Models of Computation (MoC) — which are essentially abstract representations of computing systems — are used throughout to facilitate the design and validation stages of systems development. The subsection also covers approaches to validation, as well as available methods and tools. The verification methods, together with an overview of verification languages, are presented in the subsection Design and Verification Languages. In addition, the subsection presents novel research material, including a framework used to introduce different models of computation particularly suited to the design of heterogeneous multiprocessor SoC, and a mathematical model of embedded systems based on the theory of agents and interactions.

A comprehensive introduction to the design methodology for embedded systems is presented in the chapter Design of Embedded Systems. It gives an overview of the design issues and stages. Then, the chapter presents, in quite some detail, the functional design, function/architecture and hardware/software codesign, and hardware/software coverification and hardware simulation. Subsequently, the chapter discusses selected software and hardware implementation issues. While discussing different design stages and approaches, the chapter also introduces and evaluates supporting tools.

An excellent introduction to the topic of models of computation, particularly for embedded systems, is presented in the chapter Models of Embedded Computation. The chapter introduces the origin of MoC, and the evolution from models of sequential and parallel computation to attempts to model heterogeneous architectures. In the process, the chapter discusses, in relative detail, selected nonfunctional properties such as power consumption, component interaction in heterogeneous systems, and time. It also presents a new framework used to introduce four different models of computation, and shows how different time abstractions can serve different purposes and needs. The framework is subsequently used to study the coexistence of different computational models; specifically, the interfaces between two different MoCs and the refinement of one MoC into another. This part of the chapter is particularly relevant to the material on the design of heterogeneous multiprocessor SoC presented in the section System-on-Chip Design.

A comprehensive survey of selected models of computation is presented in the chapter Modeling Formalisms for Embedded System Design. The surveyed formalisms include Finite State Machines (FSM), Finite State Machines with Datapath (FSMD), the Moore machine, the Mealy machine, Codesign Finite State Machines (CFSM), Program State Machines (PSM), the Specification and Description Language (SDL), Message Sequence Charts (MSC), Statecharts, Petri nets, synchronous/reactive models, discrete event systems, dataflow models, etc. The presentation of individual models is augmented by numerous examples.
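For readers meeting these formalisms for the first time, the difference between the Moore and Mealy variants is worth fixing with a concrete fragment. The C sketch below is an illustration written for this preface (not an example taken from the chapter); it implements a two-state Mealy machine that raises its output exactly when two consecutive 1-bits arrive on a serial input. In a Moore machine, by contrast, the output would be a function of the state alone.

    #include <stdio.h>

    typedef enum { S0, S1 } state_t;   /* S1 means "previous input bit was 1" */

    /* One transition of the Mealy machine: the output depends on both the
     * current state and the current input, which is the defining property
     * of a Mealy (as opposed to Moore) machine. */
    static int mealy_step(state_t *s, int in)
    {
        int out = (*s == S1) && in;    /* output function lambda(state, input) */
        *s = in ? S1 : S0;             /* next-state function delta(state, input) */
        return out;
    }

    int main(void)
    {
        state_t s = S0;
        int bits[] = { 0, 1, 1, 0, 1, 1, 1 };
        for (int i = 0; i < 7; i++)
            printf("%d", mealy_step(&s, bits[i]));  /* prints 0010011 */
        return 0;
    }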
The chapter System Validation briefly discusses approaches to requirements capture, analysis and validation, and surveys available methods and tools to include: descriptive formal methods such as VDM, Z, B, RAISE (Rigorous Approach to Industrial Software Engineering), CASL (Common Algebraic Specification Language), SCR (Software Cost Reduction), and EVES; deductive verifiers: HOL, Isabelle, PVS, Larch, Nqthm, and Nuprl; state exploration tools: SMV (Symbolic Model Verifier), Spin, COSPAN (COordination SPecification Analysis), MEIJE, CADP, and Murphi. It also presents a mathematical model of embedded systems based on the theory of agents and interactions. To underline a novelty of this formalism, classical theories of concurrency are surveyed to include process algebras, temporal logic, timed automata, (Gurevich’s) ASM (Abstract State Machine), and rewriting logic. As an illustration, the chapter presents a specification of a simple scheduler.
Design and Verification Languages

This section gives a comprehensive overview of languages used to specify, model, verify, and program embedded systems. Some of those languages embody different models of computation discussed in the previous section. A brief overview of Architecture Description Languages (ADL) is presented in
Embedded Applications (Automotive Networks); the use of this class of languages, in the context of describing in-car embedded electronic systems, is illustrated through the EAST-ADL language.

An authoritative introduction to a broad range of languages used in embedded systems is presented in the chapter Languages for Embedded Systems. The chapter surveys some of the most representative and widely used languages. Software languages: assembly languages for complex instruction set computers (CISC), reduced instruction set computers (RISC), digital signal processors (DSPs), and very-long instruction word processors (VLIWs), and for small (4- and 8-bit) microcontrollers; the C and C++ languages; Java; and real-time operating systems. Hardware languages: Verilog and VHDL. Dataflow languages: Kahn Process Networks and Synchronous Dataflow (SDF). Hybrid languages: Esterel, SDL, and SystemC. Each group of languages is characterized for its specific application domains and illustrated with ample code examples.

An in-depth introduction to synchronous languages is presented in The Synchronous Hypothesis and Synchronous Languages. Before introducing the synchronous languages, the chapter discusses the concept of the synchronous hypothesis: the basic notion, mathematical models, and implementation issues. Subsequently, it overviews the structural languages used for modeling and programming synchronous applications. Imperative languages, Esterel and SyncCharts, provide constructs to deal with control-dominated programs. Declarative languages, Lustre and Signal, are particularly suited for applications based on intensive data computation and dataflow organization. Future trends are also covered.

The chapter Introduction to UML and the Modeling of Embedded Systems gives an overview of the use of UML (Unified Modeling Language) for modeling embedded systems. The chapter presents a brief overview of UML and discusses UML features suited to represent the characteristics of embedded systems. The UML constructs, the language use, and other issues are introduced through an example of an automatic teller machine. The chapter also briefly discusses a standardized UML profile (a specification language instantiated from the UML language family) suitable for modeling of embedded systems.

A comprehensive survey and overview of verification languages is presented in the chapter Verification Languages. It describes languages for verification of hardware, software, and embedded systems. The focus is on the support that a verification language provides for dynamic verification based on simulation, as well as static verification based on formal techniques. Before discussing the languages, the chapter provides some background on verification methods. This part introduces the basics of simulation-based verification, formal verification, and assertion-based verification. It also discusses selected logics that form the basis of languages described in the chapter: propositional logic, first-order predicate logic, temporal logics, and regular and ω-regular languages. The hardware verification languages (HVLs) covered include: e, OpenVera, Sugar/PSL, and ForSpec. The languages for software verification overviewed include programming languages: C/C++ and Java; and modeling languages: UML, SDL, and Alloy. Languages for SoCs and embedded systems verification include system-level modeling languages: SystemC, SpecC, and SystemVerilog.
The chapter also surveys domain-specific verification efforts, such as those based on Esterel and hybrid systems.
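As a small illustration of the temporal logics underpinning several of these languages (a generic textbook example, not one drawn from the chapter), a typical liveness requirement such as "every bus request is eventually granted" is written in linear temporal logic as

    \mathbf{G}\,(\mathit{req} \rightarrow \mathbf{F}\,\mathit{grant})

where G ("globally") requires its argument to hold in every state of an execution, and F ("finally") requires it to hold in some future state. Assertion languages such as PSL provide syntactic sugar over formulas of exactly this kind, which a model checker then verifies against all reachable behaviors of a design.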
Operating Systems and Quasi-Static Scheduling

This subsection offers a comprehensive introduction to real-time and embedded operating systems to cover fundamentals and selected advanced issues. To complement this material with new developments, it gives an overview of the operating system interfaces specified by the POSIX 1003.1 international standard and related to real-time programming, and introduces a class of operating systems based on virtual machines. The subsection also includes research material on quasi-static scheduling.

The chapter Real-Time Embedded Operating Systems: Standards and Perspectives provides a comprehensive introduction to the main features of real-time embedded operating systems. It overviews some of the main design and architectural issues of operating systems: system architectures, the process and thread model, processor scheduling, interprocess synchronization and communication, and network support. The chapter presents a comprehensive overview of the operating system interfaces specified by
the POSIX 1003.1 international standard and related to real-time programming. It also gives a short description of selected open-source real-time operating systems, to include eCos, µClinux, RT-Linux and RTAI, and RTEMS. The chapter also presents a fairly comprehensive introduction to a class of operating systems based on virtual machines.

Task scheduling algorithms and resource management policies, put in the context of real-time systems, are the main focus of the chapter Real-Time Embedded Operating Systems: The Scheduling and Resource Management Aspects. The chapter discusses in detail periodic task handling, to include Timeline Scheduling (TS), Rate-Monotonic (RM) scheduling, the Earliest Deadline First (EDF) algorithm, and approaches to handling tasks with deadlines less than their periods; and aperiodic task handling. (The classical utilization-based schedulability bounds for RM and EDF are recalled at the end of this subsection.) Protocols for accessing shared resources discussed include the Priority Inheritance Protocol (PIP) and the Priority Ceiling Protocol (PCP). Novel approaches, which provide efficient support for real-time multimedia systems, for handling transient overloads and execution overruns in soft real-time systems working in dynamic environments are also mentioned in the chapter.

The chapter Quasi-Static Scheduling of Concurrent Specifications presents methods for efficient synthesis of uniprocessor software, with the aim of improving the speed of the scheduled design. The proposed approach starts from a specification represented in terms of concurrent communicating processes, derives an intermediate representation based on Petri nets or Boolean Dataflow Graphs, and finally attempts to obtain a sequential schedule to be implemented on a processor. The potential benefits result from the replacement of explicit communication among processes by data assignment, and from a reduced number of context switches owing to the reduction in the number of processes.
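To fix ideas on the periodic-task results mentioned above: for n independent, preemptive periodic tasks with worst-case execution times C_i, periods T_i, and deadlines equal to periods, the processor utilization and the classical Liu and Layland tests are

    U = \sum_{i=1}^{n} \frac{C_i}{T_i}, \qquad
    \text{RM: } U \le n\,(2^{1/n}-1) \;\Rightarrow\; \text{schedulable}, \qquad
    \text{EDF: } U \le 1 \;\Leftrightarrow\; \text{schedulable}.

The RM bound is sufficient but not necessary, and it decreases toward ln 2 ≈ 0.693 as n grows; EDF, by contrast, is optimal under these assumptions. This is a textbook restatement given here for orientation; the chapter develops the analysis, and its refinements, in full.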
Timing and Performance Analysis

Many embedded systems, particularly hard real-time systems, impose strict restrictions on the execution time of tasks, which are required to complete within certain time bounds. For this class of systems, schedulability analysis requires the upper bounds for the execution times of all tasks to be known in order to verify whether the system meets its timing requirements. The chapter Determining Bounds on Execution Times presents the architecture of the aiT timing-analysis tool and the approach to timing analysis implemented in the tool. In the process, the chapter discusses cache-behavior prediction, pipeline analysis, path analysis using integer linear programming (a generic sketch of such a formulation is given at the end of this subsection), and other issues. The use of this approach is put in the context of upper bounds determination. In addition, the chapter gives a brief overview of other approaches to timing analysis.

The validation of nonfunctional requirements of selected implementation aspects, such as deadlines, throughputs, buffer space, power consumption, etc., comes under performance analysis. The chapter Performance Analysis of Distributed Embedded Systems discusses the issues behind performance analysis and its role in the design process. It also surveys a few selected approaches to performance analysis for distributed embedded systems, to include simulation-based methods, holistic scheduling analysis, and compositional methods. Subsequently, the chapter introduces the performance network approach which, as stated by the authors, is influenced by the worst-case analysis of communication networks. The presented approach allows one to obtain upper and lower bounds on quantities such as end-to-end delay and buffer space; it also covers all possible corner cases independent of their probability.
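For orientation, the ILP-based path analysis mentioned above is commonly posed over the control-flow graph (the implicit path enumeration technique). The formulation below is a generic sketch of that idea, not a description of the aiT tool's internal model. With c_b an upper bound on the execution time of basic block b and x_b its execution count, a bound on the worst-case execution time is obtained as

    \mathrm{WCET} \;\le\; \max \sum_{b \in B} c_b\, x_b
    \quad \text{subject to} \quad
    x_b = \sum_{e \in \mathrm{in}(b)} x_e = \sum_{e \in \mathrm{out}(b)} x_e, \qquad
    x_{\mathrm{body}} \le N \cdot x_{\mathrm{entry}},

where the flow-conservation constraints tie block counts to edge counts, and N is a loop bound supplied by annotation or by a preceding analysis. The per-block bounds c_b themselves come from the cache-behavior and pipeline analyses.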
Power Aware Computing

Embedded nodes, or devices, are frequently battery powered. The growing power dissipation, with the increase in density of integrated circuits and clock frequency, has a direct impact on the cost of packaging and cooling, as well as reliability and lifetime. These and other factors make the design for low power consumption a high priority for embedded systems.

The chapter Power Aware Embedded Computing presents a survey of design techniques and methodologies aimed at reducing static and dynamic power dissipation. The chapter discusses energy and power modeling, to include
instruction-level and function-level power models, micro-architectural power models, memory and bus models, and battery models. Subsequently, the chapter discusses system/application-level optimizations which explore different task implementations exhibiting different power/energy versus quality-of-service characteristics. Energy-efficient processing subsystems (voltage and frequency scaling, dynamic resource scaling, and processor core selection) are also overviewed in the chapter; the first-order power model that motivates voltage and frequency scaling is recalled below. Finally, the chapter discusses energy-efficient memory subsystems: cache hierarchy tuning; novel horizontal and vertical cache partitioning schemes; dynamic scaling of memory elements; software-controlled memories; scratch-pad memories; improving access patterns to on-chip memory; special-purpose memory subsystems for media streaming; code compression; and interconnect optimizations.
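The leverage behind voltage and frequency scaling follows from the standard first-order CMOS power model, recalled here for orientation (a generic approximation, not a formula specific to the chapter):

    P_{\mathrm{dyn}} \approx \alpha\, C_L\, V_{dd}^{2}\, f, \qquad
    P_{\mathrm{static}} \approx I_{\mathrm{leak}}\, V_{dd},

where \alpha is the switching activity, C_L the switched capacitance, V_{dd} the supply voltage, and f the clock frequency. Since the maximum attainable frequency falls roughly in proportion to V_{dd}, lowering the supply voltage reduces the energy per operation approximately quadratically, which is why dynamic voltage and frequency scaling is the workhorse among the techniques surveyed; leakage power must be attacked by separate means, such as the memory-subsystem techniques listed above.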
Security in Embedded Systems

There is a growing trend toward networking of embedded systems. Representative examples of such systems can be found in the automotive, train, and industrial automation domains. Many of those systems are required to be connected to other networks, to include LAN, WAN, and the Internet. For instance, there is a growing demand for remote access to process data at the factory floor. This, however, exposes systems to potential security attacks, which may compromise their integrity and cause damage. The limited resources of embedded systems pose a considerable challenge for the implementation of effective security policies which, in general, are resource demanding.

An excellent introduction to the security issues in embedded systems is presented in the chapter Design Issues in Secure Embedded Systems. The chapter outlines security requirements in computing systems, classifies the abilities of attackers, and discusses security implementation levels. The security constraints in embedded systems design discussed include energy considerations, processing power limitations, flexibility and availability requirements, and cost of implementation. Subsequently, the chapter presents the main issues in the design of secure embedded systems. It also covers, in detail, attacks on cryptographic algorithm implementations in embedded systems, and the corresponding countermeasures.
II System-on-Chip Design

Multi-Processor Systems-on-Chip (MPSoC), which combine the advantages of parallel processing with the high integration levels of SoCs, emerged as a viable solution to meet the demand for computational power required by applications such as network and media processors. The design of MPSoCs typically involves the integration of heterogeneous hardware and software IP components. However, the support for reuse of hardware and software IP components is limited, thus potentially making the design process labor-intensive, error-prone, and expensive.

Selected component-based design methodologies for the integration of heterogeneous hardware and software IP components are presented in this section, together with other issues such as the design of ASIPs, communication architectures including NoC, and platform-based design, to mention some. Those topics are presented in eight chapters introducing the SoC concept and design issues; the design of ASIPs; SoC communication architectures; principles and guidelines for NoC design; platform-based design principles; converter synthesis for incompatible protocols; a component-based design automation approach for multiprocessor SoC platforms; an interface-centric approach to the design and programming of embedded multiprocessors; and an STMicroelectronics-developed exploration multiprocessor SoC platform.

A comprehensive introduction to the SoC concept, in general, and design issues is provided in the chapter System-on-Chip and Network-on-Chip Design. The chapter discusses the basics of SoC; IP cores and virtual components; introduces the concept of architectural platforms and surveys selected industry offerings; and provides a comprehensive overview of the SoC design process.

A retargetable framework for ASIP design is presented in A Novel Methodology for the Design of Application-Specific Instruction-Set Processors. The framework, which is based on machine descriptions in the LISA language, allows for automatic generation of software development tools including an HLL C-compiler, assembler, linker, simulator, and graphical debugger frontend. In addition, synthesizable
hardware description language code can be derived for architecture implementation. The chapter also gives an overview of various machine description languages in the context of their suitability for the design of ASIPs, and discusses the ASIP design flow and the LISA language.

On-chip communication architectures are presented in the chapter State-of-the-Art SoC Communication Architectures. The chapter offers an in-depth description and analysis of the three most relevant architectures, from industrial and research viewpoints, to include the ARM-developed AMBA (Advanced Microcontroller Bus Architecture) and its new interconnect schemes, namely Multi-Layer AHB and AMBA AXI; the IBM-developed CoreConnect; and the STMicroelectronics-developed STBus. In addition, the chapter surveys other architectures such as Wishbone, Sonics SiliconBackplane Micronetwork, Peripheral Interconnect Bus (PI-Bus), Avalon, and CoreFrame. The chapter also offers analysis of selected architectures and extends the discussion of on-chip interconnects to NoC.

Basic principles and guidelines for NoC design are introduced in Network-on-Chip Design for Gigascale Systems-on-Chip. It discusses the rationale for the design paradigm shift of SoC communication architectures from shared busses to NoCs, and briefly surveys related work. Subsequently, the chapter presents details of the NoC building blocks: switch, network interface, and switch-to-switch links. In discussing the design guidelines, the chapter uses a case study of a real NoC architecture (Xpipes) which employs some of the most advanced concepts in NoC design. It also discusses the issue of heterogeneous NoC design, and the effects of mapping the communication requirements of an application onto a domain-specific NoC.

An authoritative discussion of the platform-based design (PBD) concept is provided in the chapter Platform-Based Design for Embedded Systems. The chapter introduces PBD principles and outlines the interplay between micro-architecture platforms and the Application Program Interface (API), or programmer model, which is a unique abstract representation of the architecture platform via the software layer. The chapter also introduces three applications of PBD: network platforms for communication protocol design, fault-tolerant platforms for the design of safety-critical applications, and analog platforms for mixed-signal integrated circuit design.

An approach to the synthesis of interface converters for incompatible protocols in component-based design automation is presented in Interface Specification and Converter Synthesis. The chapter surveys several approaches for synthesizing converters, illustrated by simple examples. It also introduces more advanced frameworks based on abstract algebraic solutions that guarantee converter correctness.

The chapter Hardware/Software Interface Design for SoC presents a component-based design automation approach for MPSoC platforms. It briefly surveys basic concepts of MPSoC design and discusses some related platform- and component-based approaches. It provides a comprehensive overview of hardware/software IP integration issues, to include bus-based and core-based approaches, integrating software IP, communication synthesis (the concept is presented in detail in Interface Specification and Converter Synthesis), and IP derivation. The focal point of the chapter is a new component-based design methodology and the design environment for the integration of heterogeneous hardware and software IP components.
The presented methodology, which adopts the automatic communication synthesis approach and uses a high-level API, generates both hardware and software wrappers, as well as a dedicated operating system for programmable components. The IP integration capabilities of the approach and the accompanying software tools are illustrated by redesigning a part of a VDSL modem.

The chapter Design and Programming of Embedded Multiprocessors: An Interface-Centric Approach presents a design methodology for implementing media processing applications as MPSoCs, centered around the Task Transaction Level (TTL) interface. The TTL interface can be used to build executable specifications; it also provides a platform interface for implementing applications as communicating hardware and software tasks on a platform infrastructure. The chapter introduces the TTL interface in the context of the requirements, and discusses mapping technology which supports structured design and programming of embedded multiprocessor systems. The chapter also presents two case studies of implementations of the TTL interface on different architectures: a multi-DSP
architecture, using an MP3 decoder application to evaluate this implementation; and a smart-imaging multiprocessor.

The STMicroelectronics-developed StepNP™ flexible MPSoC platform and its key architectural components are described in A MultiProcessor SoC Platform and Tools for Communications Applications. The platform was developed with an aim to explore tool and architectural issues in a range of high-speed communications applications, particularly packet processing applications used in network infrastructure SoCs. Subsequently, the chapter reviews the MultiFlex modeling and analysis tools developed to support the StepNP platform. The MultiFlex environment supports two parallel programming models: a distributed system object component (DSOC) message passing model and a symmetrical multiprocessing (SMP) model using shared memory. It maps these models onto the StepNP MPSoC platform. The use of the platform and the supporting environment is illustrated by two examples mapping IPv4 packet forwarding and traffic management applications onto the StepNP platform. Detailed results are presented and discussed for a range of architectural parameters.
III Testing of Embedded Core-Based Integrated Circuits

The ever-increasing circuit densities and operating frequencies, as well as the use of SoC designs, have resulted in enormous test data volume for today's embedded core-based integrated circuits. According to the Semiconductor Industry Association, in the International Technology Roadmap for Semiconductors (ITRS), 2001 Edition, the density of ICs can reach 2 billion transistors per square cm, and 16 billion transistors per chip are likely by 2014. Based on that, according to some estimates (A. Khoche and J. Rivoir, "I/O bandwidth bottleneck for test: is it real?" Test Resource Partitioning Workshop, 2002), the test data volume for ICs in 2014 is likely to increase 150 times relative to 1999. Other problems include the growing disparity between the performance of the design and that of the automatic test equipment, which makes at-speed testing, particularly of high-speed circuits, a challenge and results in increasing yield loss; the high cost of manually developed functional tests; and the growing cost of high-speed and high-pincount testers. This section contains two chapters introducing new techniques addressing some of the issues indicated above.

The chapter Modular Testing and Built-In Self-Test of Embedded Cores in System-on-Chip Integrated Circuits presents a survey of techniques that have been proposed in the literature for reducing test time and test data volume. The techniques surveyed rely on modular testing of embedded cores and built-in self-test (BIST). The material on modular testing of embedded cores in a system-on-a-chip describes wrapper design and optimization, test access mechanism (TAM) design and optimization, test scheduling, integrated TAM optimization and test scheduling, and modular testing of mixed-signal SoCs. In addition, the chapter reviews a recent deterministic BIST approach in which a reconfigurable interconnection network (RIN) is placed between the outputs of the linear-feedback shift register (LFSR) and the inputs of the scan chains in the circuit under test (a software sketch of such an LFSR is given at the end of this section). The RIN, which consists only of multiplexer switches, replaces the phase shifter that is typically used in pseudo-random BIST to reduce correlation between the test data bits that are fed into the scan chains. The proposed approach does not require any circuit redesign and has minimal impact on circuit performance.

Hardware-based self-testing (BIST) techniques have limitations due to performance, area, and design time overhead, as well as problems caused by the application of nonfunctional patterns (which may result in higher power consumption during testing, over-testing, yield loss problems, etc.). The embedded software-based self-testing technique has the potential to alleviate the problems caused by using external testers, as well as structural BIST problems. Embedded software-based self-testing utilizes on-chip programmable resources (such as embedded microprocessors and DSPs) for on-chip test generation, test delivery, signal acquisition, response analysis, and even diagnosis. The chapter Embedded Software-Based Self-Testing for SoC Design discusses processor self-test methods targeting stuck-at faults and delay faults; presents a brief description of a processor self-diagnosis method; presents methods for self-testing of buses and global
interconnects, as well as other nonprogrammable IP cores on the SoC; describes instruction-level design-for-testability (DfT) methods based on the insertion of test instructions to increase the fault coverage and reduce the test application time and test program size; and outlines DSP-based self-test for analog/mixed-signal components.
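To make the pseudo-random BIST machinery mentioned above concrete: the LFSR feeding the scan chains is simply a shift register whose feedback taps are chosen from a primitive polynomial, so that it cycles through all nonzero states. The C model below is an illustrative sketch written for this preface (it models the register in software; it is not a circuit description from the chapter), using the well-known maximal-length 16-bit Galois configuration with taps 16, 14, 13, and 11.

    #include <stdint.h>
    #include <stdio.h>

    /* One clock of a 16-bit Galois LFSR. The tap mask 0xB400 encodes the
     * primitive polynomial x^16 + x^14 + x^13 + x^11 + 1, so the register
     * runs through all 65535 nonzero states before repeating. */
    static uint16_t lfsr_step(uint16_t state)
    {
        uint16_t lsb = state & 1u;   /* bit shifted out toward the scan chain */
        state >>= 1;
        if (lsb)
            state ^= 0xB400u;        /* apply the feedback taps */
        return state;
    }

    int main(void)
    {
        uint16_t state = 0xACE1u;    /* any nonzero seed works */
        for (int i = 0; i < 8; i++) {
            state = lfsr_step(state);
            printf("pattern %d: 0x%04X\n", i, state);  /* pseudo-random patterns */
        }
        return 0;
    }

Seeded with any nonzero value and clocked once per scan-shift cycle, the model reproduces in software the pattern stream a hardware LFSR would deliver to the scan chains.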
IV Networked Embedded Systems

Networked embedded systems (NES) are essentially spatially distributed embedded nodes (implemented on a board, or on a single chip in the future) interconnected by means of wireline and/or wireless communication infrastructure and protocols, interacting with the environment (via sensor/actuator elements) and with each other and, possibly, with a master node performing control and coordination functions, in order to coordinate computing and communication and achieve certain goal(s). An example of a networked embedded system may be an in-vehicle embedded network comprising a collection of ECUs networked by means of safety-critical communication protocols, such as FlexRay or TTP/C, for the purpose of controlling vehicle functions, such as electronic engine control, anti-locking brake system, active suspension, etc. (for details of automotive applications see the last section in the book).

An excellent introduction to NES is presented in the chapter Design Issues in Networked Embedded Systems. This chapter outlines some of the most representative characteristics of NES, and surveys potential applications. It also explains design issues for large-scale distributed NES such as environment interaction, life expectancy of nodes, communication protocols, reconfigurability, security, energy constraints, operating systems, etc. Design methodologies and tools are discussed as well.

The topic of middleware for NES is addressed in Middleware Design and Implementation for Networked Embedded Systems. This chapter discusses the role of middleware in NES and the challenges in design and implementation, such as remote communication, location independence, reuse of the existing infrastructure, providing real-time assurances, providing a robust DOC middleware, reducing the middleware footprint, and support for simulation environments. The focal points of the chapter are the sections describing the design and implementation of nORB (a small-footprint real-time object request broker tailored to specific embedded sensor/actuator applications), and the rationale behind the adopted approach to addressing the NES design and implementation challenges.
V Sensor Networks

Distributed (wireless) sensor networks are a relatively new and exciting proposition for collecting sensory data in a variety of environments. The design of this kind of network poses a particular challenge due to limited computational power and memory size, bandwidth restrictions, power-consumption restrictions if battery powered, communication requirements, and unattended mode of operation in the case of inaccessible and/or hostile environments, to mention some. This section provides a fairly comprehensive discussion of the design issues related to, in particular, self-organizing wireless networks. It introduces fundamental concepts behind sensor networks; discusses architectures, energy-efficient Medium Access Control (MAC), time synchronization, distributed localization, routing, distributed signal processing, and security; and surveys selected software solutions.

A general introduction to the area of wireless sensor networks is provided in Introduction to Wireless Sensor Networks. A comprehensive overview of the topic is provided in Issues and Solutions in Wireless Sensor Networks, which introduces fundamental concepts, selected application areas, design challenges, and other relevant issues.

The chapter Architectures for Wireless Sensor Networks provides an excellent introduction to various aspects of the architecture of wireless sensor networks. It includes the description of a sensor node architecture and its elements: sensor platform, processing unit, communication interface, and power source. In addition, it presents a mathematical model of power consumption by a node, to account for energy consumption by the radio, processor, and sensor elements. The chapter also discusses architectures
of wireless sensor networks developed on the protocol stack approach and the EYES project approach. In the context of the EYES project approach, which consists of only two key system abstraction layers, namely the sensor and networking layer and the distributed services layer, the chapter discusses the distributed services that are required to support applications for wireless sensor networks, and the approaches adopted by various projects.

Energy efficiency is one of the main issues in developing MAC protocols for wireless sensor networks. This is largely due to unattended operation and battery-based power supply, and a need for collaboration as a result of the limited capabilities of individual nodes. Energy-Efficient Medium Access Control offers a comprehensive overview of the issues involved in the design of MAC protocols. It contains a discussion of MAC requirements for wireless sensor networks, such as the hardware characteristics of the node, communication patterns, and others. It surveys 20 medium access protocols specially designed for sensor networks and optimized for energy efficiency. It also discusses the qualitative merits of different organizations: contention-based, slotted, and TDMA-based protocols. In addition, the chapter provides a simulation-based comparison of the performance and energy efficiency of four MAC protocols: Low Power Listening, S-MAC, T-MAC, and L-MAC.

The knowledge of time at a sensor node may be essential for the correct operation of the system. The Time Division Multiple Access (TDMA) scheme (adopted in the TTP/C and FlexRay protocols, for instance — see the section on automotive applications) requires the nodes to be synchronized. The time synchronization issues in sensor networks are discussed in Overview of Time Synchronization Issues in Sensor Networks. The chapter introduces the basics of time synchronization for sensor networks. It also describes design challenges and requirements in developing time synchronization protocols, such as the need to be robust, energy aware, able to operate correctly in the absence of time servers (server-less), light-weight, and able to offer a tunable service. The chapter also overviews factors influencing time synchronization, such as temperature, phase noise, frequency noise, asymmetric delays, and clock glitches. Subsequently, different types of timing techniques are discussed: the Network Time Protocol (NTP), the Timing-sync Protocol for Sensor Networks (TPSN), Reference-Broadcast Synchronization (RBS), and the Time-Diffusion Synchronization Protocol (TDP).

The knowledge of the location of nodes is essential for the base station to process information from sensors and to arrive at valid and meaningful results. The localization issues in ad hoc wireless sensor networks are discussed in Distributed Localization Algorithms. The focus of this presentation is on three distributed localization algorithms for large-scale ad hoc sensor networks which meet the basic requirements for self-organization, robustness, and energy efficiency: ad hoc positioning by Niculescu and Nath, N-hop multilateration by Savvides et al., and robust positioning by Savarese et al. The selected algorithms are evaluated by simulation.

In order to forward information from a sensor node to the base station or to another node for processing, the node requires routing information. The chapter Routing in Sensor Networks provides a comprehensive survey of routing protocols used in sensor networks.
The presentation is divided into flat routing protocols: Sequential Assignment Routing (SAR), directed diffusion, the minimum cost forwarding approach, the Integer Linear Program (ILP) based routing approach, Sensor Protocols for Information via Negotiation (SPIN), geographic routing protocols, the parametric probabilistic routing protocol, and Min-MinMax; and cluster-based routing protocols: Low Energy Adaptive Clustering Hierarchy (LEACH), the Threshold sensitive Energy Efficient sensor Network protocol (TEEN), and a two-level clustering algorithm.

Due to their limited resources, sensor nodes frequently provide incomplete information on the objects of their observation. Thus the complete information has to be reconstructed from data obtained from many nodes, frequently providing redundant data. Distributed data fusion is one of the major challenges in sensor networks. The chapter Distributed Signal Processing in Sensor Networks introduces a novel mathematical model for distributed information fusion, which focuses on solving a benchmark signal processing problem (spectrum estimation) using sensor networks.

With the deployment of sensor networks in areas such as the battlefield or the factory floor, security becomes of paramount importance, and a challenge. The existing solutions are impractical due to the limited capabilities (processing power, available memory, and available energy) of sensor nodes. The chapter
Sensor Network Security gives an introduction to selected specific security challenges in wireless sensor networks: denial of service and routing security, energy-efficient confidentiality and integrity, authenticated broadcast, alternative approaches to key management, and secure data aggregation. Subsequently, it discusses in detail some of the proposed approaches and solutions: the SNEP and µTESLA protocols for confidentiality and integrity of data, and the LEAP protocol and probabilistic key management for key management, to mention some (the one-way key chain at the heart of µTESLA is sketched at the end of this section).

The chapter Software Development for Large-Scale Wireless Sensor Networks presents basic concepts related to software development for wireless sensor networks, as well as selected software solutions. The solutions include: TinyOS, a component-based operating system, and related software packages; Maté, a byte-code interpreter; and TinyDB, a query processing system for extracting information from a network of TinyOS sensor nodes. SensorWare, a software framework for wireless sensor networks, provides querying, dissemination, and fusion of sensor data, as well as coordination of actuators. MiLAN (Middleware Linking Applications and Networks), a middleware concept, aims to exploit information redundancy provided by sensor nodes. EnviroTrack, a TinyOS-based application, provides a convenient way to program sensor network applications that track activities in their physical environment. SeNeTs, a middleware architecture for wireless sensor networks, is designed to support the pre-deployment phase. The chapter also discusses software solutions for simulation, emulation, and test of large-scale sensor networks: TinyOS SIMulator (TOSSIM), a simulator based on the TinyOS framework; EmStar, a software environment for developing and deploying applications for sensor networks consisting of 32-bit embedded Microserver platforms; and SeNeTs, a test and validation environment.
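Returning to the authenticated-broadcast discussion above: the mechanism underlying µTESLA is a one-way key chain, generated by repeatedly applying a one-way function and disclosed in the reverse order of generation. The C sketch below illustrates only the chain logic; it was written for this preface and is not an excerpt from the protocol, and its one_way() mixer is a toy stand-in for the cryptographic hash a real deployment would use.

    #include <stdint.h>

    /* Toy stand-in for a one-way function; NOT cryptographically secure.
     * A real implementation would use a cryptographic hash. */
    static uint32_t one_way(uint32_t x)
    {
        x ^= x >> 16;  x *= 0x45d9f3bu;  x ^= x >> 16;   /* integer mixing */
        return x;
    }

    /* Key chain: K[n] is chosen at random and K[i] = F(K[i+1]); the sender
     * discloses K[1], K[2], ... over time. A receiver holding the
     * authenticated commitment K[0] verifies a key claimed to be K[i] by
     * applying F i times and comparing the result against K[0]. */
    static int key_is_authentic(uint32_t claimed_k_i, unsigned i, uint32_t k_0)
    {
        while (i--)
            claimed_k_i = one_way(claimed_k_i);
        return claimed_k_i == k_0;
    }

In the actual protocol, keys are bound to time intervals and packets carry message authentication codes under keys not yet disclosed; the delayed disclosure is what turns a symmetric primitive into an authenticated broadcast, which is why the scheme suits resource-constrained nodes.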
VI Embedded Applications

The last section in the book, Embedded Applications, focuses on selected applications of embedded systems. It covers the automotive field, industrial automation, and intelligent sensors. The aim of this section is to introduce examples of actual embedded applications in fast-evolving areas which, for various reasons, have not received proper coverage in other publications, particularly in the automotive area.
Automotive Networks

The automotive industry is aggressively adopting mechatronic solutions to replace or duplicate existing mechanical/hydraulic systems. Embedded electronic systems, together with dedicated communication networks and protocols, play pivotal roles in this transition. This subsection contains three chapters that offer a comprehensive overview of the area by presenting topics such as networks and protocols, operating systems and other middleware, scheduling, safety and fault tolerance, and actual development tools used by the automotive industry.

This section begins with a contribution entitled Design and Validation Process of In-Vehicle Embedded Electronic Systems, which provides a comprehensive introduction to the use of embedded systems in automobiles, their design and validation methods, and tools. The chapter identifies and describes a number of specific application domains for in-vehicle embedded systems, such as power train, chassis, body, and telematics and HMI. It then outlines some of the main standards used in the automotive industry to ensure interoperability between components developed by different vendors; this includes networks and protocols, as well as operating systems. The surveyed networks and protocols include (for details of networks and protocols see The Industrial Communication Technology Handbook, CRC Press, 2005, Richard Zurawski, editor) Controller Area Network (CAN), Vehicle Area Network (VAN), J1850, TTP/C (Time-Triggered Protocol), FlexRay, Local Interconnect Network (LIN), Media Oriented System Transport (MOST), and IDB-1394. This material is followed by a brief introduction to OSEK/VDX (Offene Systeme und deren Schnittstellen für die Elektronik im Kraftfahrzeug), a multitasking operating system that has become a standard for automotive applications in Europe. The chapter introduces a new language, EAST-ADL, which offers support for an unambiguous description of in-vehicle embedded electronic
systems at each level of their development. The discussion of the design and validation process and related issues is facilitated by a comprehensive case study drawn from an actual PSA Peugeot-Citroën application. This case study is essential reading for those interested in the development of this kind of embedded system.
The planned adoption of X-by-wire technologies in automotive applications has pushed the automotive industry into the realm of safety-critical systems. There is a substantial body of literature on safety-critical issues and fault tolerance, particularly as applied to components and systems. Less has been published on safety-relevant communication services and fault-tolerant communication systems as mandated by X-by-wire technologies in automotive applications. This is largely due to the novelty of the fast-evolving concepts and solutions, pursued mostly by industrial consortia. Those two topics are presented in detail in Fault-Tolerant Services for Safe In-Car Embedded Systems. The material on safety-relevant communication services discusses some of the main services and functionalities that the communication system should provide to facilitate the design of fault-tolerant automotive applications. This includes services supporting reliable communication, such as robustness against electromagnetic interference (EMI), time-triggered transmission, global time, atomic broadcast, and avoiding "babbling idiots." Also discussed are higher-level services that provide fault-tolerant mechanisms belonging conceptually to the layers above MAC in the OSI reference model, namely group membership service, management of node redundancy, support for functioning modes, etc. The chapter also discusses fault-tolerant communication protocols, including TTP/C, FlexRay, and variants of CAN (TTCAN, RedCAN, and CANcentrate).
The Volcano concept for the design and implementation of in-vehicle networks using the standardized CAN and LIN communication protocols is presented in the chapter Volcano — Enabling Correctness by Design. This chapter provides an in-depth description of the Volcano approach and of a suite of software tools, developed by Volcano Communications Technologies AG, which supports requirements capture, model-based design, automatic code generation, and system-level validation. This is an example of an actual development environment widely used by the automotive industry.
Industrial Automation
The current trend toward flexible and distributed control and automation has accelerated the migration of intelligence and control functions to field devices, particularly sensors and actuators. The increased processing capabilities of those devices were instrumental in the emergence of a trend toward networking of field devices around industrial data networks, making access to any device from any place in the plant, or even globally, technically feasible. The benefits are numerous, including increased flexibility, improved system performance, and ease of system installation, upgrade, and maintenance.
Embedded web servers are increasingly used in industrial automation to provide a Human–Machine Interface (HMI), which allows for web-based configuration, control, and monitoring of devices and industrial processes. An introduction to the design of embedded web servers is presented in the chapter Embedded Web Servers in Distributed Control Systems. The focus of this chapter is on Field Device Web Servers (FDWS). The chapter provides a comprehensive overview of the context in which embedded web servers are usually implemented, as well as the structure of an FDWS application, presenting its component packages and the relationship between the content of the packages and the architecture of a typical embedded site. All this is discussed in the context of an actual FDWS implementation and application deployed at one of the Alstom (France) sites.
Remote access to field devices raises many security challenges. Embedded web servers typically run on processors with limited memory and processing power. These restrictions necessitate the deployment of lightweight security mechanisms. Vendor-tailored versions of standard security protocol suites such as Secure Sockets Layer (SSL) and IP Security Protocol (IPSec) may still not be suitable due to their excessive demand for resources. In applications restricted to the Hypertext Transfer Protocol (HTTP), Digest Access Authentication (DAA), a security extension to HTTP, offers an alternative and viable solution. Those issues are discussed in the chapter HTTP Digest Authentication for Embedded Web
Servers. This chapter overviews the mechanisms and services, as well as potential applications, of HTTP Digest Authentication. It also surveys selected embedded web server implementations for their support of DAA, including Apache 2.0.42, Allegro RomPager 4.05, and GoAhead 2.1.2.
Intelligent Sensors
Advances in the design of embedded systems, the availability of tools, and falling fabrication costs have allowed for the cost-effective migration of intelligence and control functions to field devices, particularly sensors and actuators. Intelligent sensors combine computing, communication, and sensing functions. The trend toward increased functional complexity of those devices necessitates the use of formal descriptive techniques and supporting tools throughout the design and implementation process. The chapter Intelligent Sensors: Analysis and Design tackles some of those issues. It reviews the main characteristics of a generic formal model of an intelligent sensor; subsequently, it discusses an implementation of the model using the CAP language, which was developed specifically for the design of intelligent sensors. A brief introduction to the language is also provided. The whole development process is illustrated with an example of a simple distance-measuring system comprising an ultrasonic transmitter and two receivers.
Locating Topics
To assist readers in locating material, a complete table of contents is presented at the front of the book, and each chapter begins with its own table of contents. Two indexes are provided at the end of the book: an index of the authors contributing to the book, together with the titles of their contributions, and a detailed subject index.
Richard Zurawski
Acknowledgments
My gratitude goes to Luciano Lavagno, Grant Martin, and Alberto Sangiovanni-Vincentelli, who provided advice and support while I was preparing this book. This book would never have had a chance to take off without their assistance. Andreas Willig helped with identifying some authors for the section on Sensor Networks. I would also like to thank the members of the International Advisory Board for their help with the organization of the book and the selection of authors.
I have received tremendous cooperation from all contributing authors, and I would like to thank all of them for that. I would like to express gratitude to my publisher Nora Konopka and the other Taylor & Francis staff involved in the production of the book, particularly Jessica Vakili, Elizabeth Spangenberger, and Gail Renard.
My love goes to my wife, who tolerated the countless hours I spent preparing this book.
About the Editor
Dr. Richard Zurawski is president of ISA Group, San Francisco and Santa Clara, California, involved in providing solutions to Fortune 1000 companies. Prior to that, he held various executive positions with San Francisco Bay Area companies. Dr. Zurawski is a cofounder of the Institute for Societal Automation, Santa Clara, a research and consulting organization.
Dr. Zurawski has close to thirty years of academic and industrial experience, including a regular professorial appointment at the Institute of Industrial Sciences, University of Tokyo, and a full-time R&D advisor position with Kawasaki Electric Corp., Tokyo. He provided consulting services to Kawasaki Electric, Ricoh, and Toshiba Corporations, Japan, and participated in the 1990s in a number of Japanese Intelligent Manufacturing Systems programs.
Dr. Zurawski has served as editor at large for IEEE Transactions on Industrial Informatics and as associate editor for IEEE Transactions on Industrial Electronics; he also served as associate editor for Real-Time Systems: The International Journal of Time-Critical Computing Systems, Kluwer Academic Publishers. He was a guest editor of four special sections in IEEE Transactions on Industrial Electronics and a guest editor of a special issue of the Proceedings of the IEEE dedicated to industrial communication systems. In 1998, he was invited by IEEE Spectrum to contribute material on Java technology to "Technology 1999: Analysis and Forecast Issues." Dr. Zurawski is series editor for The Industrial Information Technology Series, Taylor & Francis Group, Boca Raton, Florida.
Dr. Zurawski has served as a vice president of the Institute of Electrical and Electronics Engineers (IEEE) Industrial Electronics Society (IES) and was on the steering committee of the ASME/IEEE Journal of Microelectromechanical Systems. In 1996, he received the Anthony J. Hornfeck Service Award from the IEEE Industrial Electronics Society. He has served as a general, program, and track chair for a number of IEEE conferences and workshops, and has published extensively on various aspects of formal methods in the design of real-time, embedded, and industrial systems, MEMS, parallel and distributed programming and systems, and control and robotics. He is the editor of The Industrial Information Technology Handbook (2004) and The Industrial Communication Technology Handbook (2005), both published by Taylor & Francis Group.
Dr. Zurawski received his M.Sc. in informatics and automation from the University of Mining and Metallurgy, Krakow, Poland, and his Ph.D. in computer science from La Trobe University, Melbourne, Australia.
Contributors
Parham Aarabi, Department of Electrical and Computer Engineering, University of Toronto, Ontario, Canada
José L. Ayala, Dpto. Ingenieria Electronica, E.T.S.I. Telecomunicacion, Madrid, Spain
João Paulo Barros, Universidade Nova de Lisboa, Faculdade de Ciências e Tecnologia, Caparica, Portugal
Ali Alphan Bayazit, Princeton University, Princeton, New Jersey
Luca Benini, Dipartimento Elettronica Informatica Sistemistica, University of Bologna, Bologna, Italy
Essaid Bensoudane, Advanced System Technology, STMicroelectronics, Ontario, Canada
Davide Bertozzi, Dipartimento Elettronica Informatica Sistemistica, University of Bologna, Bologna, Italy
Jan Blumenthal, Institute of Applied Microelectronics and Computer Science, University of Rostock, Rostock, Germany
Gunnar Braun, CoWare Inc., Aachen, Germany
Giorgio C. Buttazzo, Dip. di Informatica e Sistemistica, University of Pavia, Pavia, Italy
Luca P. Carloni, EECS Department, University of California at Berkeley, Berkeley, California
Wander O. Cesário, SLS Group, TIMA Laboratory, Grenoble, France
Krishnendu Chakrabarty, Department of Electrical and Computer Engineering, Duke University, Durham, North Carolina
S. Chatterjea, Faculty of Electrical Engineering, Mathematics, and Computer Science, University of Twente, Enschede, The Netherlands
Kwang-Ting (Tim) Cheng, Department of Electrical and Computer Engineering, University of California, Santa Barbara, California
Ivan Cibrario Bertolotti, IEIIT, National Research Council, Turin, Italy
Anikó Costa, Universidade Nova de Lisboa, Faculdade de Ciências e Tecnologia, Caparica, Portugal
Mario Crevatin, Corporate Research, ABB Switzerland Ltd, Baden-Dattwil, Switzerland
Fernando De Bernardinis, EECS Department, University of California at Berkeley, Berkeley, California
Erwin de Kock, Philips Research, Eindhoven, The Netherlands
Giovanni De Micheli, Gates Computer Science, Stanford University, Stanford, California
Robert de Simone, INRIA, Sophia-Antipolis, France
Eric Dekneuvel, University of Nice Sophia Antipolis, Biot, France
S. Dulman, Faculty of Electrical Engineering, Mathematics, and Computer Science, University of Twente, Enschede, The Netherlands
Stephen A. Edwards, Department of Computer Science, Columbia University, New York, New York
Gerben Essink, Philips Research, Eindhoven, The Netherlands
A. G. Fragopoulos, Department of Electrical and Computer Engineering, University of Patras, Patras, Greece
Shashidhar Gandham, Department of Computer Science, The University of Texas at Dallas, Richardson, Texas
Christopher Gill, Department of Computer Science and Engineering, Washington University, St. Louis, Missouri
Frank Golatowski, Institute of Applied Microelectronics and Computer Science, University of Rostock, Rostock, Germany
Luís Gomes, Universidade Nova de Lisboa, Faculdade de Ciências e Tecnologia, Caparica, Portugal
Aarti Gupta, NEC Laboratories America, Princeton, New Jersey
Rajesh Gupta, Department of Computer Science and Engineering, University of California at San Diego, San Diego, California
Sumit Gupta, Tallwood Venture Capital, Palo Alto, California
Marc Haase, Institute of Applied Microelectronics and Computer Science, University of Rostock, Rostock, Germany
Gertjan Halkes, Faculty of Electrical Engineering, Mathematics, and Computer Science, Delft University of Technology, Delft, The Netherlands
Matthias Handy, Institute of Applied Microelectronics and Computer Science, University of Rostock, Rostock, Germany
Hans Hansson, Department of Computer Science and Engineering, Mälardalen University, Västerås, Sweden
Øystein Haugen, Department of Informatics, University of Oslo, Oslo, Norway
P. Havinga, Faculty of Electrical Engineering, Mathematics, and Computer Science, University of Twente, Enschede, The Netherlands
Tomas Henriksson, Philips Research, Eindhoven, The Netherlands
Andreas Hoffmann, CoWare Inc., Aachen, Germany
T. Hoffmeijer, Faculty of Electrical Engineering, Mathematics, and Computer Science, University of Twente, Enschede, The Netherlands
J. Hurink, Faculty of Electrical Engineering, Mathematics, and Computer Science, University of Twente, Enschede, The Netherlands
Margarida F. Jacome, Department of Electrical and Computer Engineering, University of Texas at Austin, Austin, Texas
Omid S. Jahromi, Bioscrypt Inc., Markham, Ontario, Canada
Axel Jantsch, Department for Microelectronics and Information Technology, Royal Institute of Technology, Kista, Sweden
A. A. Jerraya, SLS Group, TIMA Laboratory, Grenoble, France
J. V. Kapitonova, Glushkov Institute of Cybernetics, National Academy of Science of Ukraine, Kiev, Ukraine
Alex Kondratyev, Cadence Berkeley Labs, Berkeley, California
Wido Kruijtzer, Philips Research, Eindhoven, The Netherlands
Koen Langendoen, Faculty of Electrical Engineering, Mathematics, and Computer Science, Delft University of Technology, Delft, The Netherlands
Michel Langevin, Advanced System Technology, STMicroelectronics, Ontario, Canada
Luciano Lavagno, Cadence Berkeley Laboratories, Berkeley, California, and Dipartimento di Elettronica, Politecnico di Torino, Italy
A. A. Letichevsky, Glushkov Institute of Cybernetics, National Academy of Science of Ukraine, Kiev, Ukraine
Marisa López-Vallejo, Dpto. Ingenieria Electronica, E.T.S.I. Telecomunicacion, Madrid, Spain
Damien Lyonnard, Advanced System Technology, STMicroelectronics, Ontario, Canada
Yogesh Mahajan, Princeton University, Princeton, New Jersey
Grant Martin, Tensilica Inc., Santa Clara, California
Birger Møller-Pedersen, Department of Informatics, University of Oslo, Oslo, Norway
Ravi Musunuri, Department of Computer Science, The University of Texas at Dallas, Richardson, Texas
Nicolas Navet, Institut National Polytechnique de Lorraine, Nancy, France
Gabriela Nicolescu, Ecole Polytechnique de Montreal, Montreal, Quebec, Canada
Achim Nohl, CoWare Inc., Aachen, Germany
Mikael Nolin, Department of Computer Science and Engineering, Mälardalen University, Västerås, Sweden
Thomas Nolte, Department of Computer Science and Engineering, Mälardalen University, Västerås, Sweden
Claudio Passerone, Dipartimento di Elettronica, Politecnico di Torino, Turin, Italy
Roberto Passerone, Cadence Design Systems, Inc., Berkeley Cadence Labs, Berkeley, California
Hiren D. Patel, Electrical and Computer Engineering, Virginia Tech, Blacksburg, Virginia
Maulin D. Patel, Department of Computer Science, The University of Texas at Dallas, Richardson, Texas
Pierre G. Paulin, Advanced System Technology, STMicroelectronics, Ontario, Canada
Chuck Pilkington, Advanced System Technology, STMicroelectronics, Ontario, Canada
Claudio Pinello, EECS Department, University of California at Berkeley, Berkeley, California
Dumitru Potop-Butucaru, IRISA, Rennes, France
Antal Rajnák, Advanced Engineering Labs, Volcano Communications Technologies AG, Tagerwilen, Switzerland
Anand Ramachandran, Department of Electrical and Computer Engineering, University of Texas at Austin, Austin, Texas
Niels Reijers, Faculty of Electrical Engineering, Mathematics, and Computer Science, Delft University of Technology, Delft, The Netherlands
Alberto L. Sangiovanni-Vincentelli, EECS Department, University of California at Berkeley, Berkeley, California
Udit Saxena, Microsoft Corporation, Seattle, Washington
Guenter Schaefer, Institute of Telecommunication Systems, Technische Universität Berlin, Berlin, Germany
D. N. Serpanos, Department of Electrical and Computer Engineering, University of Patras, Patras, Greece
Marco Sgroi, EECS Department, University of California at Berkeley, Berkeley, California
Sandeep K. Shukla, Electrical and Computer Engineering, Virginia Tech, Blacksburg, Virginia
Françoise Simonot-Lion, Institut National Polytechnique de Lorraine, Nancy, France
YeQiong Song, Université Henri Poincaré, Nancy, France
Weilian Su, Broadband and Wireless Networking Laboratory, School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, Georgia
Venkita Subramonian, Department of Computer Science and Engineering, Washington University, St. Louis, Missouri
Jacek Szymanski, ALSTOM Transport, Meudon-la-Forêt, France
Jean-Pierre Talpin, IRISA, Rennes, France
Lothar Thiele, Computer Engineering and Networks Laboratory, Swiss Federal Institute of Technology, Zurich, Switzerland
Pieter van der Wolf, Philips Research, Eindhoven, The Netherlands
V. A. Volkov, Glushkov Institute of Cybernetics, National Academy of Science of Ukraine, Kiev, Ukraine
Thomas P. von Hoff, Corporate Research, ABB Switzerland Ltd, Baden-Dattwil, Switzerland
A. G. Voyiatzis, Department of Electrical and Computer Engineering, University of Patras, Patras, Greece
Flávio R. Wagner, UFRGS, Instituto de Informática, Porto Alegre, Brazil
Ernesto Wandeler, Computer Engineering and Networks Laboratory, Swiss Federal Institute of Technology, Zurich, Switzerland
Yosinori Watanabe, Cadence Berkeley Labs, Berkeley, California
Thomas Weigert, Global Software Group, Motorola, Schaumburg, Illinois
Reinhard Wilhelm, University of Saarland, Saarbruecken, Germany
Richard Zurawski, ISA Group, San Francisco, California
Contents
SECTION I Embedded Systems

Real-Time and Embedded Systems
1 Embedded Systems: Toward Networking of Embedded Systems, Luciano Lavagno and Richard Zurawski, 1-1
2 Real-Time in Embedded Systems, Hans Hansson, Mikael Nolin, and Thomas Nolte, 2-1

Design and Validation of Embedded Systems
3 Design of Embedded Systems, Luciano Lavagno and Claudio Passerone, 3-1
4 Models of Embedded Computation, Axel Jantsch, 4-1
5 Modeling Formalisms for Embedded System Design, Luís Gomes, João Paulo Barros, and Anikó Costa, 5-1
6 System Validation, J.V. Kapitonova, A.A. Letichevsky, V.A. Volkov, and Thomas Weigert, 6-1

Design and Verification Languages
7 Languages for Embedded Systems, Stephen A. Edwards, 7-1
8 The Synchronous Hypothesis and Synchronous Languages, Dumitru Potop-Butucaru, Robert de Simone, and Jean-Pierre Talpin, 8-1
9 Introduction to UML and the Modeling of Embedded Systems, Øystein Haugen, Birger Møller-Pedersen, and Thomas Weigert, 9-1
10 Verification Languages, Aarti Gupta, Ali Alphan Bayazit, and Yogesh Mahajan, 10-1

Operating Systems and Quasi-Static Scheduling
11 Real-Time Embedded Operating Systems: Standards and Perspectives, Ivan Cibrario Bertolotti, 11-1
12 Real-Time Operating Systems: The Scheduling and Resource Management Aspects, Giorgio C. Buttazzo, 12-1
13 Quasi-Static Scheduling of Concurrent Specifications, Alex Kondratyev, Luciano Lavagno, Claudio Passerone, and Yosinori Watanabe, 13-1

Timing and Performance Analysis
14 Determining Bounds on Execution Times, Reinhard Wilhelm, 14-1
15 Performance Analysis of Distributed Embedded Systems, Lothar Thiele and Ernesto Wandeler, 15-1

Power Aware Computing
16 Power Aware Embedded Computing, Margarida F. Jacome and Anand Ramachandran, 16-1

Security in Embedded Systems
17 Design Issues in Secure Embedded Systems, A.G. Voyiatzis, A.G. Fragopoulos, and D.N. Serpanos, 17-1

SECTION II System-on-Chip Design
18 System-on-Chip and Network-on-Chip Design, Grant Martin, 18-1
19 A Novel Methodology for the Design of Application-Specific Instruction-Set Processors, Andreas Hoffmann, Achim Nohl, and Gunnar Braun, 19-1
20 State-of-the-Art SoC Communication Architectures, José L. Ayala, Marisa López-Vallejo, Davide Bertozzi, and Luca Benini, 20-1
21 Network-on-Chip Design for Gigascale Systems-on-Chip, Davide Bertozzi, Luca Benini, and Giovanni De Micheli, 21-1
22 Platform-Based Design for Embedded Systems, Luca P. Carloni, Fernando De Bernardinis, Claudio Pinello, Alberto L. Sangiovanni-Vincentelli, and Marco Sgroi, 22-1
23 Interface Specification and Converter Synthesis, Roberto Passerone, 23-1
24 Hardware/Software Interface Design for SoC, Wander O. Cesário, Flávio R. Wagner, and A.A. Jerraya, 24-1
25 Design and Programming of Embedded Multiprocessors: An Interface-Centric Approach, Pieter van der Wolf, Erwin de Kock, Tomas Henriksson, Wido Kruijtzer, and Gerben Essink, 25-1
26 A Multiprocessor SoC Platform and Tools for Communications Applications, Pierre G. Paulin, Chuck Pilkington, Michel Langevin, Essaid Bensoudane, Damien Lyonnard, and Gabriela Nicolescu, 26-1

SECTION III Testing of Embedded Core-Based Integrated Circuits
27 Modular Testing and Built-In Self-Test of Embedded Cores in System-on-Chip Integrated Circuits, Krishnendu Chakrabarty, 27-1
28 Embedded Software-Based Self-Testing for SoC Design, Kwang-Ting (Tim) Cheng, 28-1

SECTION IV Networked Embedded Systems
29 Design Issues for Networked Embedded Systems, Sumit Gupta, Hiren D. Patel, Sandeep K. Shukla, and Rajesh Gupta, 29-1
30 Middleware Design and Implementation for Networked Embedded Systems, Venkita Subramonian and Christopher Gill, 30-1

SECTION V Sensor Networks
31 Introduction to Wireless Sensor Networks, S. Dulman, S. Chatterjea, and P. Havinga, 31-1
32 Issues and Solutions in Wireless Sensor Networks, Ravi Musunuri, Shashidhar Gandham, and Maulin D. Patel, 32-1
33 Architectures for Wireless Sensor Networks, S. Dulman, S. Chatterjea, T. Hoffmeijer, P. Havinga, and J. Hurink, 33-1
34 Energy-Efficient Medium Access Control, Koen Langendoen and Gertjan Halkes, 34-1
35 Overview of Time Synchronization Issues in Sensor Networks, Weilian Su, 35-1
36 Distributed Localization Algorithms, Koen Langendoen and Niels Reijers, 36-1
37 Routing in Sensor Networks, Shashidhar Gandham, Ravi Musunuri, and Udit Saxena, 37-1
38 Distributed Signal Processing in Sensor Networks, Omid S. Jahromi and Parham Aarabi, 38-1
39 Sensor Network Security, Guenter Schaefer, 39-1
40 Software Development for Large-Scale Wireless Sensor Networks, Jan Blumenthal, Frank Golatowski, Marc Haase, and Matthias Handy, 40-1

SECTION VI Embedded Applications

Automotive Networks
41 Design and Validation Process of In-Vehicle Embedded Electronic Systems, Françoise Simonot-Lion and YeQiong Song, 41-1
42 Fault-Tolerant Services for Safe In-Car Embedded Systems, Nicolas Navet and Françoise Simonot-Lion, 42-1
43 Volcano — Enabling Correctness by Design, Antal Rajnák, 43-1

Industrial Automation
44 Embedded Web Servers in Distributed Control Systems, Jacek Szymanski, 44-1
45 HTTP Digest Authentication for Embedded Web Servers, Mario Crevatin and Thomas P. von Hoff, 45-1

Intelligent Sensors
46 Intelligent Sensors: Analysis and Design, Eric Dekneuvel, 46-1
I Embedded Systems
Real-Time and Embedded Systems
1 Embedded Systems: Toward Networking of Embedded Systems, Luciano Lavagno and Richard Zurawski
2 Real-Time in Embedded Systems, Hans Hansson, Mikael Nolin, and Thomas Nolte
1 Embedded Systems: Toward Networking of Embedded Systems

Luciano Lavagno, Cadence Berkeley Laboratories and Politecnico di Torino
Richard Zurawski, ISA Group

1.1 Networking of Embedded Systems, 1-1
1.2 Design Methods for Networked Embedded Systems, 1-3
1.3 Networked Embedded Systems, 1-5
Networked Embedded Systems in Industrial Automation • Networked Embedded Systems in Building Automation • Automotive Networked Embedded Systems • Sensor Networks
1.4 Concluding Remarks, 1-14
References, 1-14
1.1 Networking of Embedded Systems
The last two decades have witnessed a remarkable evolution of embedded systems: from systems assembled from discrete components on printed circuit boards (although they still are) to systems assembled from Intellectual Property (IP) components "dropped" onto the silicon of a system on a chip. Systems on a chip offer the potential for embedding complex functionalities and for meeting the demanding performance requirements of applications such as DSPs, network processors, and multimedia processors. Another phase in this evolution, already in progress, is the emergence of distributed embedded systems, frequently termed networked embedded systems, where the word "networked" signifies the importance of the networking infrastructure and communication protocol. A networked embedded system is a collection of spatially and functionally distributed embedded nodes, interconnected by means of a wireline or wireless communication infrastructure and protocols, interacting with the environment (via sensor/actuator elements) and with each other, and, possibly, a master node performing some control and coordination functions, to coordinate computing and communication in order to achieve certain goal(s). Networked embedded systems appear in a variety of application domains, such as automotive, train, aircraft, office building, and industrial applications, primarily for monitoring and control, environment monitoring, and, in the future, control as well.
There have been various reasons for the emergence of networked embedded systems, influenced largely by their application domains. The benefit of using distributed systems and the evolutionary need to replace point-to-point wiring connections in these systems by a single bus are some of the most important ones.
The advances in the design of embedded systems, the availability of tools, and falling fabrication costs of semiconductor devices and systems have allowed for the infusion of intelligence into field devices such as sensors and actuators. The controllers used with these devices typically provide on-chip signal conversion, data processing, and communication functions. The increased functionality, processing, and communication capabilities of controllers have been largely instrumental in the emergence of a widespread trend toward networking of field devices around specialized networks, frequently referred to as field area networks. The field area networks, or fieldbuses [1] (a fieldbus is, in general, a digital, two-way, multi-drop communication link), are networks connecting field devices such as sensors and actuators with field controllers (for instance, Programmable Logic Controllers [PLCs] in industrial automation, or Electronic Control Units [ECUs] in automotive applications), as well as man–machine interfaces, for instance, dashboard displays in cars. The benefits of using those specialized networks are numerous, including increased flexibility attained through a combination of embedded hardware and software, improved system performance, and ease of system installation, upgrade, and maintenance. In automotive and aircraft applications, for instance, they allow for the replacement of mechanical, hydraulic, and pneumatic systems by mechatronic systems, where mechanical or hydraulic components are typically confined to the end-effectors.
Unlike Local Area Networks (LANs), due to the nature of the communication requirements imposed by applications, field area networks tend to have low data rates and small data packets, and typically require real-time capabilities, which mandate determinism of data transfer. However, data rates above 10 Mbit/sec, typical of LANs, have already become commonplace in field area networks. The specialized networks support various communication media such as twisted pair cables, fiber optic channels, power line communication, radio frequency channels, and infrared connections. Based on the physical media employed, they can be divided into three main groups: wireline-based networks using media such as twisted pair cables, fiber optic channels (in hazardous environments like chemical and petrochemical plants), and power lines (in building automation); wireless networks supporting radio frequency channels and infrared connections; and hybrid networks composed of wireline and wireless networks. Although the use of wireline-based field area networks is dominant, wireless technology offers a range of incentives in a number of application areas. In industrial automation, for instance, wireless device (sensor/actuator) networks can provide support for the mobile operation required by mobile robots, and for the monitoring and control of equipment in hazardous and difficult-to-access environments. In a wireless sensor/actuator network, stations may interact with each other on a peer-to-peer basis, and with a base station. The base station may have its transceiver attached to a cable of a (wireline) field area network, giving rise to a hybrid wireless–wireline system [2].
A separate category is wireless sensor networks, envisaged mainly for monitoring purposes, which are discussed in detail in this book. The variety of application domains imposes different functional and nonfunctional requirements on the operation of networked embedded systems. Most of them are required to operate in a reactive way; for instance, systems used for control purposes. With that comes the requirement for real-time operation, in which systems are required to respond within a predefined period of time, mandated by the dynamics of the process under control. A response, in general, may be periodic, to control a specific physical quantity by regulating dedicated end-effector(s); aperiodic, arising from unscheduled events such as an out-of-bounds state of a physical parameter or any other kind of abnormal condition; or sporadic, with no fixed period but with a known minimum time between consecutive occurrences. Broadly speaking, systems which can tolerate a delay in response are called soft real-time systems; in contrast, hard real-time systems require deterministic responses to avoid changes in the system dynamics which may have a negative impact on the process under control, and as a result may lead to economic losses or cause injury to human operators. Representative examples of systems imposing hard real-time requirements on their operation are fly-by-wire in aircraft control and steer-by-wire in automotive applications, to mention a few.
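To make the periodic case concrete, the sketch below shows the canonical structure of a periodic hard real-time task, written here as a portable C fragment against the POSIX clock API; the sensor, actuator, and control-law functions are hypothetical placeholders, and a production system would add priority assignment and a proper overrun-handling policy.

    #include <stdio.h>
    #include <time.h>

    #define PERIOD_NS 10000000L             /* 10 msec control period (example) */

    extern double read_sensor(void);        /* hypothetical input driver */
    extern void   drive_actuator(double u); /* hypothetical output driver */
    extern double control_law(double y);    /* hypothetical controller */

    static void add_ns(struct timespec *t, long ns)
    {
        t->tv_nsec += ns;
        while (t->tv_nsec >= 1000000000L) { t->tv_nsec -= 1000000000L; t->tv_sec++; }
    }

    void periodic_control_task(void)
    {
        struct timespec next, now;
        clock_gettime(CLOCK_MONOTONIC, &next);
        for (;;) {
            drive_actuator(control_law(read_sensor()));  /* one control action */
            add_ns(&next, PERIOD_NS);                    /* next release time */
            clock_gettime(CLOCK_MONOTONIC, &now);
            if (now.tv_sec > next.tv_sec ||
                (now.tv_sec == next.tv_sec && now.tv_nsec > next.tv_nsec))
                fprintf(stderr, "deadline overrun\n");   /* must never occur in a
                                                            hard real-time system */
            /* Sleep until the absolute release time, avoiding cumulative drift. */
            clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
        }
    }

A sporadic activity would instead block on an event (e.g., an interrupt or an incoming message) and enforce its minimum inter-arrival time in software.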
The need to guarantee a deterministic response mandates the use of appropriate scheduling schemes, which are frequently implemented in application-domain-specific real-time operating systems or custom-designed "bare-bone" real-time executives. Most of those issues (real-time scheduling and real-time operating systems) are discussed in a number of chapters in this book.
The networked embedded systems used in safety-critical applications such as fly-by-wire and steer-by-wire require a high level of dependability, to ensure that a system failure does not lead to a state in which human life, property, or the environment is endangered. The dependability issue is critical for technology deployment; various solutions are discussed in this chapter in the context of automotive applications. One of the main bottlenecks in the development of safety-critical systems is the software development process; this issue is briefly discussed in this chapter in the context of the automotive application domain.
As opposed to applications mandating hard real-time operation, such as the majority of industrial automation controls or safety-critical automotive control applications, building automation control systems, for instance, seldom need hard real-time communication; the timing requirements are much more relaxed. Building automation systems tend to have a hierarchical network structure and typically implement all seven layers of the ISO/OSI reference model [3]. In the case of field area networks employed in industrial automation, by contrast, there is little need for routing functionality and end-to-end control. Therefore, typically, only layers 1 (physical layer), 2 (data link layer, implicitly including the medium access control layer), and 7 (application layer, which also covers the user layer) are used in those networks. This diversity of requirements imposed by different application domains (soft/hard real-time, safety criticality, network topology, etc.) has necessitated different solutions and protocols based on different operating principles. The result is a plethora of networks developed for different application domains; some of them are overviewed in a subsequent section.
With the growing trend toward networking of embedded systems and their internetworking with LANs, Wide Area Networks (WANs), and the Internet (for instance, there is a growing demand for remote access to process data on the factory floor), many of those systems may become exposed to security attacks, which may compromise their integrity and cause damage as a result. The limited resources of embedded nodes pose a considerable challenge for the implementation of effective security policies, which, in general, are resource demanding. These restrictions necessitate the deployment of lightweight security mechanisms. Vendor-tailored versions of standard security protocol suites, such as Secure Sockets Layer (SSL) and IP Security Protocol (IPSec), may still not be suitable due to their excessive demand for resources. Potential security solutions for this kind of system depend heavily on the specific device or system protected, the application domain, and the extent of internetworking and its architecture. (The details of potential security measures are presented in two separate chapters of this book.)
1.2 Design Methods for Networked Embedded Systems
Design methods for networked embedded systems fall into the general category of system-level design. They include two separate aspects, which are discussed briefly below. The first aspect is network architecture design, in which communication protocols, interfaces, drivers, and computation nodes are selected and assembled. The second aspect is system-on-chip design, in which the best hardware/software partition is selected and an existing platform is customized, or a new chip is created, for the implementation of a computation or communication node. The two aspects share several similarities, but so far have generally been addressed using ad hoc methodologies and tools, since attempts to create a unified electronic system-level design methodology have so far failed. When one considers the complete networked system, including several digital and analog parts, many more trade-offs can be played at the global level. However, it also means that the interaction between the digital portion of the design activity and the rest is much more complicated, especially in terms of the tools, formats, and standards with which one must interoperate and interface.
In the case of network architecture design, tools such as OpNet and NS are used to identify communication bottlenecks, investigate the effect of parameters such as channel bit error rate, and analyze the influence of the choice of coding, medium access, and error correction mechanisms on the overall system performance.
For wireless networks, tools such as Matlab and Simulink are also used in order to analyze the impact of detailed channel models, thanks to their ability to model both digital and analog components, as well as physical elements, at a high level of abstraction. In all cases, the analysis is essentially functional; that is, it takes into account only in a very limited manner effects such as power consumption, computation time, and cost. This is the main limitation that will need to be addressed in the future, if one wants to model and design in an optimal manner low-power networked embedded systems, such as those envisioned for wireless sensor network applications.
At the system-on-chip architecture level, the first decision to be made is whether to use a platform instance or to design an Application-Specific Integrated Circuit (ASIC) from scratch. The first option builds on the availability of large libraries of IP, in the form of processors, memories, and peripherals, from major silicon vendors. These IP libraries are guaranteed to work together, and hence constitute what is termed a platform. A platform is a set of components, together with usage rules that ensure their correct and seamless interoperation. Platforms are used to speed up time-to-market by enabling rapid implementation of complex architectures. Processors (and the software executing on them) provide the flexibility to adapt to different applications and customizations (e.g., localization and adherence to regional standards), while hardware IPs provide efficient implementation of commonly used functions. Configurable processors can be adapted to the requirements of specific applications and, via instruction extensions, offer considerable performance and power advantages over fixed instruction-set architectures. Thus, a platform is a single abstract model that hides the details of a set of different possible implementations as clusters of lower-level components. The platform, for example, a family of microprocessors, peripherals, and bus protocols, allows developers of application designs to operate without detailed knowledge of the implementation (e.g., the pipelining of the processor or the internal implementation of the UART). At the same time, it allows platform implementors to share design and fabrication costs among a broad range of potential users, broader than if each design were a one-of-a-kind type.
Design methods that exploit the notion of platform generally start from a functional specification, which is then mapped onto an architecture (a platform instance) in order to derive performance information and explore the design space. Full exploitation of the notion of platform results in better reuse, by decoupling independent aspects that would otherwise tie, for example, a given functional specification to low-level implementation details. The guiding principle of separation of concerns distinguishes between:
1. Computation and communication (a minimal code sketch of this separation is given below). This separation is important because refinement of computation is generally done by hand, or by compilation and scheduling, while communication makes use of patterns.
2. Application and platform implementation, because they are often defined and designed independently by different groups or companies.
3. Behavior and performance, which should be kept separate because performance information can represent either nonfunctional requirements (e.g., the maximum response time of an embedded controller) or the result of an implementation choice (e.g., the worst-case execution time of a task). Nonfunctional constraint verification can be performed traditionally, by simulation and prototyping, or with static formal checks, such as schedulability analysis.
Tool support for system-on-chip architectural design is, so far, mostly limited to simulation and interface generation. The first category includes tools such as NC-SystemC from Cadence, ConvergenSC from CoWare, and SystemStudio from Synopsys. Simulators at the system-on-chip level provide abstractions for the main architectural components (processors, memories, busses, and hardware blocks) and permit quick instantiation of complete platform instances from template skeletons. Interface synthesis can take various forms, from the automated instantiation of templates offered by N2C from CoWare, to the automated consistent file generation for software and hardware offered by Beach Solutions. A key aspect of design problems in this space is compatibility with respect to specifications at the interface level (bus and networking standards), the instruction-set architecture level, and the Application Procedural Interface (API) level. Assertion-based verification techniques can be used to ease the problem of verifying compliance with a digital protocol standard (e.g., for a bus).
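As a minimal illustration of the first separation above (computation versus communication), an application can be written against an abstract send/receive interface that a platform instance later binds to a concrete bus or network driver. All names in this C sketch are invented for the example:

    #include <stddef.h>

    /* Abstract communication service: all the application ever sees. */
    typedef struct {
        int (*send)(const void *msg, size_t len);
        int (*recv)(void *buf, size_t maxlen);
    } comm_if_t;

    /* Computation side: platform-independent, reusable across bindings. */
    int publish_sample(const comm_if_t *ch, int sample)
    {
        return ch->send(&sample, sizeof sample);
    }

    /* Communication side: one possible refinement, here onto a UART driver;
     * a CAN or network-on-chip binding would supply the same two functions. */
    static int uart_send(const void *msg, size_t len) { (void)msg; (void)len; return 0; }
    static int uart_recv(void *buf, size_t maxlen)    { (void)buf; (void)maxlen; return 0; }

    const comm_if_t uart_channel = { uart_send, uart_recv };

Replacing uart_channel with another binding changes the communication refinement without touching publish_sample, which is precisely the decoupling that the platform approach aims for.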
Let us consider an example of a design flow in the automotive domain, which can be considered a paradigm of any networked embedded system. Automotive electronic design starts, usually 5 to 10 years before the actual introduction of a product, when a car manufacturer defines the specifications for its future line of vehicles. It is now accepted practice to use the notion of platform in this domain as well, so that the electronic portion (like the mechanical one, which is outside the scope of this discussion) is modularized and componentized, enabling sharing across different models. An ECU generally includes a microcontroller (8, 16, or 32 bits), memory (SRAM, DRAM, and Flash), some ASIC or FPGA for interfacing, one or more in-vehicle network interfaces (e.g., CAN [Controller Area Network] or FlexRay), and several sensor and actuator interfaces (analog/digital and digital/analog converters, pulse-width modulators, power transistors, display drivers, and so on).
The system-level design activity is performed by a relatively small team of architects, who know the domain well (mechanics, electronics, and business), define the specifications for the electronic component suppliers, and interface with the teams that specify the mechanical portions (body and engine). These teams essentially use past experience to perform their job, and currently have serious problems forecasting the state of electronics ten years in advance. Control algorithms are defined in the next design phase, when the first engine models (generally described using Simulink, Matlab, and StateFlow) become available, as a specification for both the electronic design and the engine design. An important aspect of the overall flow is that these models are not frozen until much later, and hence both algorithm design and (often) ECU software design must cope with their changes. Another characteristic is that they are parametric models, sometimes reused across multiple engine generations and classes, whose exact parameter values will be determined only when prototypes or actual products become available. Thus, control algorithms must consider both the allowable ranges and combinations of values for these parameters, and the capability to measure their values, directly or indirectly, from the behavior of the engine and vehicle. Finally, algorithms are often distributed over a network of cooperating ECUs, so deadlines and constraints generally span a number of electronic modules.
While control design progresses, ECU hardware design can start, because rough computational and memory requirements, as well as interfacing standards, sensors, and actuators, are already known. At the end of both control design and hardware design, software implementation can start. As mentioned earlier, most of the software running on modern ECUs is automatically generated (model-based design). In the hardware implementation phase, the electronic subsystem supplier can use off-the-shelf components (such as memories), Application-Specific Standard Products (ASSPs) (such as microcontrollers and standard bus interfaces), and even ASICs and FPGAs (typically for sensor and actuator signal conditioning and conversion). The final phase, called system integration, is generally performed by the car manufacturer again. It can be an extremely lengthy and expensive phase, because it requires the use of expensive detailed models of the controlled system (e.g., the engine, modeled with DSP-based multiprocessors) or even actual car prototypes.
The goal of integration is to ensure smooth subsystem communication (e.g., checking that there are no duplicate module identifiers and that there is enough bandwidth on every in-vehicle bus). Simulation support in this domain is provided by companies such as VaST and Axys (now part of ARM), who sell both fast instruction-set simulators for the processors most commonly used in the networked embedded system domain, and network simulation models exploiting either proprietary simulation engines, as in the case of Virtio, or standard simulators (HDL [Hardware Description Language] or SystemC).
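The bandwidth check mentioned above reduces, in its simplest form, to a worst-case bus utilization bound: each periodic frame contributes its transmission time divided by its period. The C sketch below applies this to a CAN message set; the 47-bit frame overhead and the 1.2 stuffing margin are simplifying assumptions rather than exact CAN arithmetic.

    #include <stdio.h>

    typedef struct { int payload_bytes; double period_s; } can_msg_t;

    /* Worst-case utilization of a CAN bus: sum, over all periodic frames, of
     * (frame length in bits) / (bit rate * period). The constant 47 approximates
     * the overhead of a standard 11-bit-identifier frame, and the factor 1.2
     * conservatively allows for bit stuffing (both simplifying assumptions). */
    double can_utilization(const can_msg_t *m, int n, double bitrate)
    {
        double u = 0.0;
        for (int i = 0; i < n; i++)
            u += ((47 + 8 * m[i].payload_bytes) * 1.2) / (bitrate * m[i].period_s);
        return u;
    }

    int main(void)
    {
        can_msg_t set[] = { { 8, 0.010 }, { 4, 0.020 }, { 2, 0.100 } }; /* example */
        printf("worst-case bus load: %.1f%%\n",
               100.0 * can_utilization(set, 3, 500000.0));  /* 500 kbit/sec bus */
        return 0;
    }

For this example set the load comes out under 4%, far from saturation; response-time (schedulability) analysis refines such bounds with blocking and jitter terms.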
1.3 Networked Embedded Systems
1.3.1 Networked Embedded Systems in Industrial Automation
Although the origins of field area networks can be traced back as far as the end of the 1960s in the nuclear instrumentation domain (the CAMAC network [4]) and the beginning of the 1970s in avionics and aerospace
applications (the MIL-STD-1553 bus [5]), it was the industrial automation area that brought the main thrust of developments. The need for integration of heterogeneous systems, difficult at that time due to the lack of standards, resulted in two major initiatives which have had a lasting impact on the integration concepts and on the architecture of the protocol stack of field area networks. These initiatives were the TOP (Technical and Office Protocol) [6] and MAP (Manufacturing Automation Protocol) [7] projects. The two projects exposed some pitfalls of full seven-layer stack implementations (the ISO/OSI model [3]). As a result, typically only layers 1 (physical layer), 2 (data link layer, implicitly including the medium access control layer), and 7 (application layer, which also covers the user layer) are used in field area networks [8]; this is also prescribed by the international fieldbus standard, IEC 61158 [9]. In IEC 61158, the functions of layers 3 and 4 are recommended to be placed in either layer 2 or layer 7; network and transport layers are not required in the single-segment networks typical of process and industrial automation (the situation is different in building automation, for instance, where routing functionality and end-to-end control may be needed, arising from a hierarchical network structure); the functions of layers 5 and 6 are always covered in layer 7.
The evolution of fieldbus technology, which began well over two decades ago, has resulted in a multitude of solutions reflecting the competing commercial interests of their developers and of standardization bodies, both national and international: IEC [10], ISO [11], ISA [12], CENELEC [13], and CEN [14]. This is also reflected in IEC 61158 (adopted in 2000), which accommodates all national standards and the fieldbus systems championed by user organizations. Subsequently, implementation guidelines were compiled into communication profiles, IEC 61784-1 [15]. Those communication profiles identify seven main systems (or communication profile families) known by brand names: Foundation Fieldbus (H1, HSE, H2), used in process and factory automation; ControlNet and EtherNet/IP, both used in factory automation; PROFIBUS (DP, PA), used in factory and process automation, respectively; PROFInet, used in factory automation; P-Net (RS 485, RS 232), used in factory automation and shipbuilding; WorldFIP, used in factory automation; INTERBUS, INTERBUS TCP/IP, and INTERBUS Subset, used in factory automation; and Swiftnet transport and Swiftnet full stack, used by aircraft manufacturers. The listed application areas are the dominant ones.
Ethernet, the backbone technology of office networks, is increasingly being adopted for communication in factories and plants at the fieldbus level. The random, native CSMA/CD arbitration mechanism is being replaced by other solutions allowing for the deterministic behavior required in real-time communication: to support soft and hard real-time deadlines, to synchronize activities in time (required to control drives, for instance), and to exchange the small data records characteristic of monitoring and control actions. The emerging Real-Time Ethernet (RTE), Ethernet augmented with real-time extensions, under standardization by the IEC/SC65C committee, is a fieldbus technology which incorporates Ethernet for the lower two layers of the OSI model. There are already a number of implementations, which use one of three different approaches to meet real-time requirements.
The first approach retains the TCP/UDP/IP protocol suite unchanged (and is thus subject to nondeterministic delays); all real-time modifications are enforced in the top layer. Implementations in this category include Modbus/TCP [16] (defined by Schneider Electric and supported by Modbus-IDA [17]), EtherNet/IP [18] (defined by Rockwell and supported by the Open DeviceNet Vendor Association (ODVA) [19] and ControlNet International [20]), P-Net on IP [21] (proposed by the Danish P-Net national committee), and Vnet/IP [22] (developed by Yokogawa, Japan). In the second approach, the TCP/UDP/IP protocol suite is bypassed and the Ethernet functionality is accessed directly; in this case, RTE protocols use their own protocol stack in addition to the standard IP protocol stack. The implementations in this category include Ethernet Powerlink (EPL) [23] (defined by Bernecker + Rainer [B&R] and now supported by the Ethernet Powerlink Standardisation Group [24]), TCnet (Time-Critical Control Network) [25] (a proposal from Toshiba), EPA (Ethernet for Plant Automation) [26] (a Chinese proposal), and PROFINET CBA (Component-Based Automation) [27] (defined by several manufacturers including Siemens, and supported by PROFIBUS International [28]). Finally, in the third approach, the Ethernet mechanism and infrastructure themselves are modified. The implementations include SERCOS III [29] (under development by SERCOS), EtherCAT [30] (defined by Beckhoff and supported by the EtherCAT Technology Group [31]), and PROFINET IO [32] (defined by several manufacturers including Siemens, and supported by PROFIBUS International).
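The first approach is the easiest to illustrate in code, since the real-time protocol is just an application-layer convention over ordinary sockets. The C sketch below builds the 12-byte request frame of Modbus/TCP's Read Holding Registers service (function code 0x03); the MBAP header layout is standard Modbus/TCP, while the function name and its use are ours.

    #include <stddef.h>
    #include <stdint.h>

    /* Build a Modbus/TCP "Read Holding Registers" request: a 7-byte MBAP
     * header (transaction id, protocol id = 0, byte count, unit id) followed
     * by a 5-byte PDU (function code, start address, register count). The
     * resulting frame is sent over a plain TCP connection to port 502. */
    size_t modbus_read_holding(uint8_t f[12], uint16_t tid, uint8_t unit,
                               uint16_t addr, uint16_t count)
    {
        f[0] = tid >> 8;    f[1] = tid & 0xFF;    /* transaction identifier */
        f[2] = 0;           f[3] = 0;             /* protocol id: 0 = Modbus */
        f[4] = 0;           f[5] = 6;             /* remaining byte count */
        f[6] = unit;                              /* unit (slave) identifier */
        f[7] = 0x03;                              /* function: read holding regs */
        f[8] = addr >> 8;   f[9] = addr & 0xFF;   /* starting register address */
        f[10] = count >> 8; f[11] = count & 0xFF; /* number of registers */
        return 12;
    }

Determinism in this approach therefore rests entirely on how often, and how predictably, the application issues such requests, which is why it suits soft rather than hard real-time traffic.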
The use of standard components such as protocol stacks, Ethernet controllers, and bridges helps mitigate ownership and maintenance costs. The direct support for Internet technologies allows for vertical integration of the various levels of the industrial enterprise hierarchy, including seamless integration between the automation and business logistics levels for exchanging jobs and production (process) data; transparent data interfaces for all stages of the plant life cycle; Internet- and web-enabled remote diagnostics and maintenance; and electronic orders and transactions. In the case of industrial automation, the advent and use of networking has allowed for both horizontal and vertical integration of industrial enterprises.
1.3.2 Networked Embedded Systems in Building Automation
Another fast-growing application area for networked embedded systems is building automation [33]. Building automation systems aim at the control of the internal environment, as well as the immediate external environment, of a building or building complex. At present, the focus of research and technology development is on commercial buildings (office buildings, exhibition centers, shopping complexes, etc.). In the future, this will also include industrial buildings, which pose substantial challenges to the development of effective monitoring and control solutions. The main services offered by building automation systems typically include: climate control, comprising heating, ventilation, and air conditioning; visual comfort, covering artificial lighting and the control of daylight; safety services such as fire alarms and emergency sound systems; security protection; control of utilities such as power, gas, and water supply; and internal transportation systems such as lifts and escalators.
In terms of the quality-of-service requirements imposed on the field area networks, building automation systems differ considerably from their counterparts in industrial automation. There is seldom a need for hard real-time communication; the timing requirements are much more relaxed. Traffic volume in normal operation is low. Typical traffic is event driven and mostly uses a peer-to-peer communication paradigm. Fault tolerance and network management are important aspects. As with industrial fieldbus systems, a number of bodies are involved in the standardization of technologies for building automation, including the field area networks.
The communication architecture supporting automation systems embedded in buildings typically has three levels: field, automation, and management. The field level involves the operation of elements such as switches, motors, and lighting cells. Peer-to-peer communication is perhaps most evident at that level; toggling a switch should activate the corresponding lighting cell(s), for instance. The automation level is typically used to evaluate new control strategies for the lower level in response to changes in the environment: a reduction in daylight intensity, an external temperature change, etc. LonWorks [34], BACnet [35], and EIB/KNX [36–39] are open-system networks which can be used at more than one level of the communication architecture. A roundup of LonWorks is provided in the following, as a representative example of the specialized field area networks used in building automation.
The EIA-709 layer 3 supports a variety of different addressing schemes and advanced routing capabilities. The entire routable address space of a LonTalk network is referred to as the domain (Figure 1.1). A domain is restricted to 255 subnets; a subnet allows for up to 127 nodes. The total number of addressable nodes in a domain can thus reach 32,385; up to 248 domains can be addressed. Domain gateways can be built between logical domains in order to allow for communication across domain boundaries. Groups can be formed in order to send a single data packet to a group of nodes using a multicast-addressed message.
FIGURE 1.1 Addressing elements in EIA-709 networks. (From D. Loy, Fundamentals of LonWorks/EIA-709 Networks: ANSI/EIA-709 Protocol Standard (LonTalk). In The Industrial Communication Technology Handbook, Zurawski, R. (Ed.), CRC Press, Boca Raton, FL, 2005. With permission.)
Routing is performed between different subnets only. An EIA-709 node can send a unicast-addressed message to exactly one node using either the unique 48-bit node identification (Node ID) address or the logical subnet/node address. A multicast-addressed message can be sent to a group of nodes (group address), to all nodes in a subnet, or to all nodes in the entire domain (broadcast address).

The EIA-709 layer 4 supports four types of services. The unacknowledged service transmits the data packet from the sender to the receiver. The unacknowledged repeated service transmits the same data packet a number of times; the number of retries is programmable. The acknowledged service transmits the data packet and waits for an acknowledgment from the receiver; if no acknowledgment is received by the transmitter, the same data packet is sent again, with a programmable number of retries. The request-response service sends a request message to the receiver; the receiver must respond with a response message, for instance, with statistics information. There is a provision for authentication of acknowledged transmissions, although it is not very efficient.

Network nodes (which typically include a Neuron chip, RAM/Flash, a power source, a clock, a network transceiver, and an input/output interface connecting to sensors and actuators) can be based on Echelon's Neuron chip series manufactured by Motorola, Toshiba, and Cypress; recently, they can also be based on other platform-independent implementations such as the LoyTec LC3020 controller. Neuron chip-based controllers are programmed with Echelon's Neuron C language, which is a derivative of ANSI C. Other controllers, such as the LC3020, are programmed with standard ANSI C. The basic element of Neuron C is the Network Variable (NV), which can be propagated over the network. For instance, the SNVT_temp variable represents temperature in degrees Celsius; SNVT stands for Standard Network Variable Type. Network nodes communicate with each other by exchanging NVs. Another way for nodes to communicate is by using explicit messages. Neuron C programs are used to schedule application events and to react to incoming data packets (receiving NVs) from the network interface. Depending on the network media and the network transceivers, a variety of network topologies are possible with LonWorks nodes, including bus, ring, star, and free topology.
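To make the addressing arithmetic above concrete, the following is a minimal C sketch (this is not Neuron C, and not an official LonWorks API; the type and function names are invented for illustration) of a logical subnet/node address and the resulting per-domain capacity:

```c
#include <stdio.h>
#include <stdint.h>

/* Illustrative sketch only: a logical EIA-709 subnet/node address and
 * the address-space arithmetic described in the text. */
typedef struct {
    uint8_t subnet; /* 1..255 -- one of up to 255 subnets per domain */
    uint8_t node;   /* 1..127 -- node number within the subnet       */
} Eia709Address;

static int eia709_address_is_valid(Eia709Address a)
{
    return a.subnet >= 1 && a.node >= 1 && a.node <= 127;
}

int main(void)
{
    /* 255 subnets x 127 nodes = 32,385 addressable nodes per domain. */
    printf("nodes per domain: %d\n", 255 * 127);

    Eia709Address sensor = { .subnet = 3, .node = 42 };
    printf("address valid: %d\n", eia709_address_is_valid(sensor));
    return 0;
}
```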
As interoperability on all seven OSI layers does not guarantee interworkable products, the LonMark organization [44] has published interoperability guidelines for nodes that use the LonTalk protocol. A number of task groups within LonMark define functional profiles (subsets of all the possible protocol features) for analog input, analog output, temperature sensors, etc. The task groups focus on various types of applications such as home/utility, HVAC, lighting, etc.

LonBuilder and NodeBuilder are development and integration tools offered by Echelon. Both tools allow developers to write Neuron C programs, compile and link them, and download the final application into the target node hardware. NodeBuilder supports debugging of one node at a time. LonBuilder, which supports simultaneous debugging of multiple nodes, has a built-in protocol analyzer and a network binder to create communication relationships between network nodes. Echelon's LNS network operating system provides tools that allow one to install, monitor, control, manage, and maintain control devices, and to transparently perform these services over any IP-based network, including the Internet.
1.3.3 Automotive Networked Embedded Systems

Similar trends appear in automotive electronic systems, where the ECUs are networked by means of one of the automotive-specific communication protocols for the purpose of controlling one of the vehicle functions; for instance, electronic engine control, antilock braking, active suspension, or telematics, to mention a few. In Reference 45, a number of functional domains have been identified for the deployment of automotive networked embedded systems. They include the powertrain domain, involving, in general, control of engine and transmission; the chassis domain, involving control of suspension, steering, braking, etc.; the body domain, involving control of wipers, lights, doors, windows, seats, mirrors, etc.; the telematics domain, involving, mostly, the integration of wireless communications, vehicle monitoring systems, and vehicle location systems; and the multimedia and Human Machine Interface (HMI) domains. The different domains impose varying constraints on the networked embedded systems in terms of performance, safety requirements, and Quality of Service (QoS). For instance, the powertrain and chassis domains mandate real-time control; typically, bounded delay is required, as well as fault-tolerant services.

There are a number of reasons for the interest of the automotive industry in adopting mechatronic solutions, known by the generic name x-by-wire, which aim to replace mechanical, hydraulic, and pneumatic systems by electrical/electronic systems. The main factors seem to be economic in nature, the improved reliability of components, and the increased functionality that can be achieved with a combination of embedded hardware and software. Steer-by-wire, brake-by-wire, and throttle-by-wire systems are representative examples. It seems, however, that certain safety-critical systems such as steer-by-wire and brake-by-wire will be complemented with traditional mechanical/hydraulic backups, for safety reasons.

The dependability of x-by-wire systems is one of the main requirements, as well as one of the main constraints on the adoption of such systems. In this context, a safety-critical x-by-wire system has to ensure that a system failure does not lead to a state in which human life, property, or the environment are endangered, and that a single failure of one component does not lead to a failure of the whole x-by-wire system [46]. On the Safety Integrity Level (SIL) scale, it is required for x-by-wire systems that the probability of failure of a safety-critical system does not exceed 10⁻⁹ per hour per system; this figure corresponds to the SIL4 level. Another equally important requirement for x-by-wire systems is to observe the hard real-time constraints imposed by the system dynamics; the end-to-end response times must be bounded for safety-critical systems. A violation of this requirement may lead to degraded performance of the control system, with further consequences as a result.

Not all automotive electronic systems are safety critical. For instance, systems to control seats, door locks, internal lights, etc., are not. The different performance, safety, and QoS requirements dictated by the various in-car application domains necessitate the adoption of different solutions, which, in turn, has given rise to a significant number of communication protocols for automotive applications.
Time-triggered protocols based on TDMA (Time Division Multiple Access) medium access control technology are particularly well suited for safety-critical solutions, as they provide deterministic access to the medium. In this category, there are two protocols which, in principle, meet the requirements
of x-by-wire applications, namely TTP/C [47] and FlexRay [48] (FlexRay can support a combination of both time-triggered and event-triggered transmissions). The following discussion will focus mostly on TTP/C and FlexRay.

The TTP/C (Time-Triggered Protocol) is a fault-tolerant time-triggered protocol; it is one of two protocols in the Time-Triggered Architecture (TTA) [49], the other being the low-cost fieldbus protocol TTP/A [50]. In TTA, the nodes are connected by two replicated communication channels, forming a cluster. A TTA network may have two different interconnection topologies, namely bus and star. In the bus configuration, each node is connected to two replicated passive buses via bus guardians. The bus guardians are independent units preventing the associated nodes from transmitting outside predetermined time slots, by blocking the transmission path; a good example is the case of a controller with a faulty clock oscillator which attempts to transmit continuously. In the star topology, the guardians are integrated into two replicated central star couplers. The guardians are required to be equipped with their own clocks, a distributed clock synchronization mechanism, and their own power supply. In addition, they should be located at a distance from the protected node to increase immunity to spatial proximity faults. To cope with internal physical faults, TTA employs partitioning of nodes into so-called Fault-Tolerant Units (FTUs), each of which is a collection of several stations performing the same computational functions. As each node is (statically) allocated a transmission slot in a TDMA round, failure of any node, or corruption of a frame, will not cause degradation of the service. In addition, data redundancy allows the correct data value to be ascertained by a voting process.

TTP/C employs a synchronous TDMA medium access control scheme on replicated channels, which ensures fault-tolerant transmission with known delay and bounded jitter between the nodes of a cluster. The use of replicated channels and redundant transmission allows for the masking of a temporary fault on one of the channels. The payload section of the message frame contains up to 240 bytes of data protected by a 24-bit CRC checksum. In TTP/C, the communication is organized into rounds. In a round, different slot sizes may be allocated to different stations; however, slots belonging to the same station are of the same size in successive rounds. Every node must send a message in every round.

Another feature of TTP/C is fault-tolerant clock synchronization, which establishes a global time base without the need for a central time provider. Each node in the cluster contains the message schedule. Based on that information, a node computes the difference between the predetermined and actual arrival times of a correct message. These differences are averaged by a fault-tolerant algorithm, which allows the local clock to be adjusted to keep it in synchrony with the clocks of the other nodes in the cluster. TTP/C also provides a so-called membership service to inform every node about the state of every other node in the cluster; it is also used to implement the fault-tolerant clock synchronization mechanism. This service is based on a distributed agreement mechanism, which identifies nodes with failed links. A node with a transmission fault is excluded from the membership until restarted with a proper state of the protocol.
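The fault-tolerant averaging underlying such clock synchronization can be illustrated with a small C sketch. This shows only the principle: discard the extreme measurements to mask a bounded number of faulty clocks, then average the rest. The actual TTP/C algorithm differs in its details.

```c
#include <stdlib.h>

/* Minimal sketch of a fault-tolerant average of the kind used for clock
 * correction in time-triggered protocols: the k largest and k smallest
 * measured deviations are discarded (to mask up to k faulty clocks) and
 * the remainder is averaged. Illustrative only. */

static int cmp_long(const void *a, const void *b)
{
    long x = *(const long *)a, y = *(const long *)b;
    return (x > y) - (x < y);
}

/* deviations[i]: measured minus expected arrival time (in clock ticks)
 * of a correct message from node i; requires n > 2*k. */
long ft_average(long *deviations, int n, int k)
{
    long sum = 0;
    qsort(deviations, (size_t)n, sizeof deviations[0], cmp_long);
    for (int i = k; i < n - k; i++)    /* drop k extremes at each end */
        sum += deviations[i];
    return sum / (n - 2 * k);          /* correction for the local clock */
}
```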
Another important feature of TTP/C is a clique avoidance algorithm to detect and eliminate the formation of cliques in case the fault hypothesis is violated. In general, fault-tolerant operation based on FTUs cannot be maintained if the fault hypothesis is violated. In such a situation, TTA activates the Never-Give-Up (NGU) strategy [46]. The NGU strategy, which is specific to the application, is initiated by TTP/C in combination with the application, with the aim of continuing operation in a degraded mode.

The TTA infrastructure and the TTP/A and TTP/C protocols have a long history, dating back to 1979 when the Maintainable Real-Time System (MARS) project started at the Technical University of Berlin. Subsequently, the work was carried out at the Vienna University of Technology. The TTP/C protocol has been experimented with and considered for deployment for quite some time. However, to date, there have been no actual implementations of the protocol involving safety-critical systems in commercial automobiles or trucks. In 1995, a "proof of concept," organized jointly by the Vienna University of Technology and DaimlerChrysler, demonstrated a car equipped with a "brake-by-wire" system based on the time-triggered protocol. The TTA design methodology, which distinguishes between node design and architecture design, is supported by a comprehensive set of integrated tools from TTTech. A range of development and prototyping hardware is available from TTTech as well. Austriamicrosystems offers an automotive-certified TTP-C2 communication controller (AS8202NF).
FIGURE 1.2 FlexRay communication cycle. (From D. Millinger and R. Nossal, FlexRay Communication Technology. In The Industrial Communication Technology Handbook, Zurawski, R. (Ed.), CRC Press, Boca Raton, FL, 2005. With permission.)
FlexRay, which appears to be the frontrunner for future automotive safety-critical control applications, employs a modified TDMA medium access control scheme on a single or replicated channel. The payload section of a frame contains up to 254 bytes of data protected by a 24-bit CRC checksum. To cope with transient faults, FlexRay also allows for redundant data transmission over the same channel(s), with a time delay between transmissions.

The FlexRay communication cycle comprises a network communication time and a network idle time (Figure 1.2). Two or more communication cycles can form an application cycle. The network communication time is a sequence of a static segment, a dynamic segment, and a symbol window. The static segment uses a TDMA MAC protocol and comprises static slots of fixed duration. Unlike in TTP/C, the static allocation of slots to a node (communication controller) applies to one channel only; the same slot may be used by another node on the other channel. Also, a node may possess several slots in a static segment. The dynamic segment uses an FTDMA (Flexible Time Division Multiple Access) MAC protocol, which allows for a priority- and demand-driven access pattern. The dynamic segment comprises so-called mini-slots, with each node allocated a certain number of mini-slots, which do not have to be consecutive. The mini-slots are of fixed length, and much shorter than static slots. As the length of a mini-slot is not sufficient to accommodate a frame (a mini-slot only defines a potential start time of a transmission in the dynamic segment), it has to be enlarged to accommodate the transmission of a frame. This, in turn, reduces the number of mini-slots in the remainder of the dynamic segment. A mini-slot remains silent if there is nothing to transmit. The nodes allocated mini-slots toward the end of the dynamic segment are less likely to get transmission time, which enforces a priority scheme. The symbol window is a time slot of fixed duration used for network management purposes. The network idle time is a protocol-specific time window in which no traffic is scheduled on the communication channel; it is used by the communication controllers for the clock synchronization activity, in principle similar to the one described for TTP/C. While the dynamic segment and the symbol window are optional, the network idle time and a minimal static segment are mandatory parts of a communication cycle; a minimum of two static slots (degraded static segment), or four static slots for fault-tolerant clock synchronization, is required. With all that, FlexRay allows for three configurations: pure static; mixed, with both static and dynamic segments, where the bandwidth ratio depends on the application; and pure dynamic, where all bandwidth is allocated to dynamic communication.

FlexRay supports a range of network topologies, offering maximum scalability and considerable flexibility in the arrangement of embedded electronic architectures in automotive applications. The supported configurations include bus, active star, active cascaded stars, and active stars with bus extension. FlexRay also uses bus guardians in the same way as TTP/C. The existing FlexRay communication controllers support communication bit rates of up to 10 Mbit/sec on two channels. The transceiver component of the communication controller also provides a set
of automotive network-specific services. Two major services are alarm handling and wakeup control. In addition to the alarm information received in a frame, an ECU also receives the alarm symbol from the communication controller. This redundancy can be used to validate critical signals; for instance, an air bag fire command. The wakeup service is required where electronic components have a sleep mode to reduce power consumption.

FlexRay is a joint effort of a consortium involving some of the leading car makers and technology providers, including BMW, Bosch, DaimlerChrysler, General Motors, Motorola, Philips, and Volkswagen, as well as Hyundai Kia Motors as a premium associate member with voting rights. DECOMSYS offers Designer Pro, a comprehensive set of tools to support the development process of FlexRay-based applications. The FlexRay protocol specification version 2.0 was released in 2004. Controllers are currently available from Freescale, and in the future from NEC. The latest controller version, MFR4200, implements protocol specification versions 1.0 and 1.1. Austriamicrosystems offers a high-speed automotive bus transceiver for FlexRay (AS8221). A special physical layer for FlexRay is provided by Philips; it supports the topologies described above and a data rate of 10 Mbit/sec on one channel. Two versions of the bus driver will be available.

The Time-Triggered Controller Area Network (TTCAN) [51], which can support a combination of both time-triggered and event-triggered transmissions, utilizes the physical and data-link layers of the CAN protocol. Since this protocol, as standardized, does not provide the necessary dependability services, it is unlikely to play any role in fault-tolerant communication in automotive applications.

TTP/C and FlexRay belong to class D networks in the classification published by the Society of Automotive Engineers [52, 53]. Although the classification dates back to 1994, it is still a reasonable guideline for distinguishing between protocols based on data transmission speed and the functions distributed over the network. The classification comprises four classes. Class A includes networks with a data rate of less than 10 Kbit/sec; representative protocols are the Local Interconnect Network (LIN) [54] and TTP/A [50]. Class A networks are employed largely to implement body domain functions. Class B networks operate within the range of 10 Kbit/sec to 125 Kbit/sec; representative protocols are J1850 [55], low-speed CAN [56], and VAN (Vehicle Area Network) [57]. Class C networks operate within the range of 125 Kbit/sec to 1 Mbit/sec; examples are high-speed CAN [58] and J1939 [59]. Networks in this class are used for the control of the powertrain and chassis domains. High-speed CAN, although used in the control of the powertrain and chassis domains, is not suitable for safety-critical applications as it lacks the necessary fault-tolerant services. Class D networks (not formally defined as yet) include networks with a data rate over 1 Mbit/sec. Networks to support x-by-wire solutions fall into this class, including TTP/C and FlexRay. Also, MOST (Media Oriented System Transport) [60] and IDB-1394 [61], both for multimedia applications, belong to this class.
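Returning to the FlexRay dynamic segment described earlier, its mini-slot-based priority mechanism can be illustrated with a small, spec-inexact C simulation (the node count, the mini-slot budget, and the frame lengths below are invented for illustration):

```c
#include <stdio.h>

/* Illustrative simulation of FlexRay dynamic-segment arbitration (not
 * spec-accurate): nodes get their turn in mini-slot order, a frame
 * transmission consumes several mini-slots from the remaining budget,
 * and nodes whose turn comes after the budget is exhausted cannot send
 * in this cycle -- which yields the priority scheme described above. */
#define NODES 5

int main(void)
{
    int budget = 10;                        /* mini-slots in the dynamic segment */
    int frame_len[NODES] = {4, 0, 3, 0, 2}; /* frame length in mini-slots; 0 = idle */

    for (int node = 0; node < NODES && budget > 0; node++) {
        if (frame_len[node] == 0) {
            budget -= 1;                    /* an empty mini-slot still elapses  */
        } else if (frame_len[node] <= budget) {
            printf("node %d transmits (%d mini-slots)\n", node, frame_len[node]);
            budget -= frame_len[node];      /* transmission enlarges the slot    */
        } else {
            printf("node %d blocked this cycle\n", node);
            break;                          /* not enough room left this cycle   */
        }
    }
    return 0;
}
```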
The cooperative development process of networked embedded automotive applications brings with it a heterogeneity of software and hardware components. Even with the inevitable standardization of those components, of interfaces, and even of complete system architectures, the support for reuse of hardware and software components is limited, potentially making the design of networked embedded automotive applications labor-intensive, error-prone, and expensive. This necessitates the development of component-based design integration methodologies. An interesting approach is based on platform-based design [62], discussed in this book with a view to automotive applications. Some industry standardization initiatives include: OSEK/VDX with its OSEKtime OS (OSEK/VDX Time-Triggered Operating System) [63]; OSEK/VDX Communication [64], which specifies a communication layer that defines common software interfaces and common behavior for internal and external communications among application processes; OSEK/VDX FTCom (Fault-Tolerant Communication) [65], a proposal for a software layer to provide services that facilitate the development of fault-tolerant applications on top of time-triggered networks; HIS (Herstellerinitiative Software) [66], with a broad range of goals including standardization of software modules, specification of process maturity levels, development of software tests, development of software tools, etc.; and ASAM (Association for Standardization of Automation and Measuring Systems) [67], which develops, amongst other projects, a standardized XML-based format for data exchange between tools from different vendors.
One of the main bottlenecks in the development of safety-critical systems is the software development process. The automotive industry clearly needs a software development process model, and supporting tools, suitable for the development of safety-critical software. At present, there are two potential candidates. MISRA (Motor Industry Software Reliability Association) [68] has published recommended practices for safe automotive software; these recommended practices, although automotive specific, do not support x-by-wire. IEC 61508 [69] is an international standard for electrical, electronic, and programmable electronic safety-related systems; it is not automotive specific, but is broadly accepted in other industries.
1.3.4 Sensor Networks

Another trend in the networking of field devices has emerged recently, namely sensor networks, which are another example of networked embedded systems. Here, the "embedding" factor is not as evident as in other applications; this is particularly true for wireless and self-organizing networks, where the nodes may be embedded in an ecosystem or on a battlefield, to mention some possibilities. Although potential applications in the projected areas are still under discussion, wireless sensor/actuator networks are already being deployed by the manufacturing industry.

The use of wireless links with field devices, such as sensors and actuators, allows for flexible installation and maintenance and for the mobile operation required by mobile robots, and alleviates problems with cabling. To operate effectively in the industrial/factory floor environment, a wireless communication system has to guarantee high reliability, low and predictable delay of data transfer (typically less than 10 msec for real-time applications), support for a large number of sensors/actuators (over 100 in a cell of a few meters radius), and low power consumption, to mention some requirements. In industrial environments, the characteristic wireless channel degradation artifacts can be compounded by the presence of electric motors or a variety of equipment causing electric discharges, which contribute to even greater levels of bit errors and packet losses. The problem can be partially alleviated either by designing robust and loss-tolerant applications and control algorithms, or by trying to improve the channel quality; all are subjects of extensive research and development.

In a wireless sensor/actuator network, stations may interact with each other on a peer-to-peer basis, and with the base station. To leverage low cost, small size, and low power consumption, standard Bluetooth (IEEE 802.15.1) 2.4 GHz radio transceivers [70, 71] may be used as the sensor/actuator communication hardware. To meet the requirements for high reliability, low and predictable delay of data transfer, and support for a large number of sensors/actuators, custom optimized communication protocols may be required, as the commercially available solutions such as IEEE 802.15.1, IEEE 802.15.4 [72], and the IEEE 802.11 [73–75] variants may not fulfill all the requirements. The base station may have its transceiver attached to a cable of a fieldbus, giving rise to a hybrid wireless-wireline fieldbus system [2].

A representative example of this kind of system is a wireless sensor/actuator network developed by ABB and deployed in a manufacturing environment [76]. The system, known as WISA (wireless sensor/actuator), has been implemented in a manufacturing cell to network proximity switches, which are among the most widely used position sensors in automated factories, controlling the positions of a variety of equipment, including robotic arms, for instance. The sensor/actuator communication hardware is based on a standard Bluetooth 2.4 GHz radio transceiver and low-power electronics that handle the wireless communication link. The sensors communicate with a wireless base station via antennas mounted in the cell. For the base station, a specialized RF front end was developed to provide collision-free air access by allocating a fixed TDMA time slot to each sensor/actuator. Frequency Hopping (FH) is employed to counter both frequency-selective fading and interference effects, and operates in combination with Automatic Retransmission Requests (ARQs).
The parameters of this TDMA/FH scheme were chosen to satisfy the requirements of up to 120 sensors/actuators per base station. Each wireless node has a response or cycle time of 2 msec, to make full use of the available radio band of 80 MHz width. The FH sequences are cell-specific and were chosen to have low cross-correlations, to permit the parallel operation of many cells on the same factory floor with low self-interference. The base station can handle up to 120 wireless
sensors/actuators and is connected to the control system via a (wireline) fieldbus. To increase capacity, a number of base stations can operate in the same area. WISA provides wireless power supply to the sensors, based on magnetic coupling [77].
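As a rough illustration of the TDMA/FH dimensioning above, the following C sketch computes an upper bound on the per-node slot length and generates a toy cell-specific hop sequence. The channel count and the hop formula are invented for illustration; the real WISA parameters and sequences differ.

```c
#include <stdio.h>

/* Back-of-the-envelope sketch of the TDMA/FH dimensioning described in
 * the text; all concrete choices below are illustrative assumptions. */
#define SENSORS_PER_CELL 120
#define CYCLE_US         2000.0   /* 2 msec response/cycle time */
#define CHANNELS         16       /* assumed number of hop channels */

int main(void)
{
    /* If all 120 sensors share one 2 msec TDMA cycle, each uplink slot
     * can last at most cycle/sensors microseconds. */
    printf("max slot length: %.1f us\n", CYCLE_US / SENSORS_PER_CELL);

    /* Toy cell-specific hop sequence: slot s of cycle c uses channel
     * (s + c * seed) mod CHANNELS, with a per-cell seed chosen to keep
     * cross-correlation between neighboring cells low. */
    int seed = 7;
    for (int cycle = 0; cycle < 3; cycle++)
        printf("cycle %d, slot 5 -> channel %d\n",
               cycle, (5 + cycle * seed) % CHANNELS);
    return 0;
}
```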
1.4 Concluding Remarks

This chapter has presented an overview of trends in the networking of embedded systems, their design, and selected application-domain-specific network technologies. Networked embedded systems appear in a variety of application domains, including automotive, train, aircraft, office building, and industrial automation. With the exception of building automation, the systems discussed in this chapter tend to be confined to a relatively small area and a limited number of nodes, as in the case of an industrial process, an automobile, or a truck. In building automation, networked embedded systems may take on truly large proportions in terms of area covered and number of nodes; for instance, in a LonTalk network, the total number of addressable nodes in a domain can reach 32,385, and up to 248 domains can be addressed. Wireless sensor/actuator networks, as well as wireless-wireline hybrid networks, have started evolving from concept to actual implementations, and are poised to have a major impact on industrial, home, and building automation — at least in these application domains, for a start.

Networked embedded systems pose a multitude of challenges in their design (particularly for safety-critical applications), deployment, and maintenance. The majority of the development environments and tools for specific networking technologies do not have firm foundations in computer science or software engineering models and practices, making the development process labor-intensive, error-prone, and expensive.
References

[1] R. Zurawski (Ed.), The Industrial Communication Systems, Special Issue. Proceedings of the IEEE, 93, June 2005.
[2] J.-D. Decotignie, P. Dallemagne, and A. El-Hoiydi, Architectures for the Interconnection of Wireless and Wireline Fieldbusses. In Proceedings of the 4th IFAC Conference on Fieldbus Systems and Their Applications (FET '2001), Nancy, France, 2001.
[3] H. Zimmermann, OSI Reference Model: The ISO Model of Architecture for Open System Interconnection. IEEE Transactions on Communications, 28, 425–432, 1980.
[4] Costrell, CAMAC Instrumentation System — Introduction and General Description. IEEE Transactions on Nuclear Science, 18, 3–8, 1971.
[5] C.-A. Gifford, A Military Standard for Multiplex Data Bus. In Proceedings of the IEEE 1974 National Aerospace and Electronics Conference, May 13–15, 1974, Dayton, OH, USA, pp. 85–88.
[6] N. Collins, Boeing Architecture and TOP (Technical and Office Protocol). In Networking: A Large Organization Perspective, April 1986, Melbourne, FL, USA, pp. 49–54.
[7] H.A. Schutz, The Role of MAP in Factory Integration. IEEE Transactions on Industrial Electronics, 35, 6–12, 1988.
[8] P. Pleinevaux and J.-D. Decotignie, Time Critical Communication Networks: Field Buses. IEEE Network, 2, 55–63, 1988.
[9] International Electrotechnical Commission, Digital Data Communications for Measurement and Control — Fieldbus for Use in Industrial Control Systems, Part 1: Introduction. IEC 61158-1, IEC, 2003.
[10] International Electrotechnical Commission (IEC). www.iec.ch.
[11] International Organization for Standardization (ISO). www.iso.org.
[12] Instrumentation Society of America (ISA). www.isa.org.
[13] Comité Européen de Normalisation Electrotechnique (CENELEC). www.cenelec.org.
[14] European Committee for Standardization (CEN). www.cenorm.be.
[15] International Electrotechnical Commission, Digital Data Communications for Measurement and Control — Part 1: Profile Sets for Continuous and Discrete Manufacturing Relative to Fieldbus Use in Industrial Control Systems, IEC 61784-1, IEC, 2003.
[16] International Electrotechnical Commission, Real Time Ethernet Modbus-RTPS, Proposal for a Publicly Available Specification for Real-Time Ethernet, document IEC 65C/341/NP, IEC, 2004.
[17] www.modbus-ida.org.
[18] International Electrotechnical Commission, Real Time Ethernet: EtherNet/IP with Time Synchronization, Proposal for a Publicly Available Specification for Real-Time Ethernet, document IEC 65C/361/NP, IEC, 2004.
[19] www.odva.org.
[20] www.controlnet.org.
[21] International Electrotechnical Commission, Real Time Ethernet: P-NET on IP, Proposal for a Publicly Available Specification for Real-Time Ethernet, document IEC 65C/360/NP, IEC, 2004.
[22] International Electrotechnical Commission, Real Time Ethernet Vnet/IP, Proposal for a Publicly Available Specification for Real-Time Ethernet, document IEC 65C/352/NP, IEC, 2004.
[23] International Electrotechnical Commission, Real Time Ethernet EPL (ETHERNET Powerlink), Proposal for a Publicly Available Specification for Real-Time Ethernet, document IEC 65C/356a/NP, IEC, 2004.
[24] www.ethernet-powerlink.org.
[25] International Electrotechnical Commission, Real Time Ethernet TCnet (Time-Critical Control Network), Proposal for a Publicly Available Specification for Real-Time Ethernet, document IEC 65C/353/NP, IEC, 2004.
[26] International Electrotechnical Commission, Real Time Ethernet EPA (Ethernet for Plant Automation), Proposal for a Publicly Available Specification for Real-Time Ethernet, document IEC 65C/357/NP, IEC, 2004.
[27] J. Feld, PROFINET — Scalable Factory Communication for all Applications. In Proceedings of the 2004 IEEE International Workshop on Factory Communication Systems, September 22–24, 2004, Vienna, Austria, pp. 33–38.
[28] www.profibus.org.
[29] International Electrotechnical Commission, Real Time Ethernet SERCOS III, Proposal for a Publicly Available Specification for Real-Time Ethernet, document IEC 65C/358/NP, IEC, 2004.
[30] International Electrotechnical Commission, Real Time Ethernet Control Automation Technology (ETHERCAT), Proposal for a Publicly Available Specification for Real-Time Ethernet, document IEC 65C/355/NP, IEC, 2004.
[31] www.ethercat.org.
[32] International Electrotechnical Commission, Real-Time Ethernet PROFINET IO, Proposal for a Publicly Available Specification for Real-Time Ethernet, document IEC 65C/359/NP, IEC, 2004.
[33] Deborah Snoonian, Smart Buildings. IEEE Spectrum, 40, 18–23, 2003.
[34] D. Loy, D. Dietrich, and H. Schweinzer, Open Control Networks, Kluwer, Dordrecht, 2004.
[35] Steven T. Bushby, BACnet: A Standard Communication Infrastructure for Intelligent Buildings. Automation in Construction, 6, 529–540, 1997.
[36] ENV 13154-2, Data Communication for HVAC Applications — Field Net — Part 2: Protocols, 1998.
[37] EIA/CEA 776.5, CEBus-EIB Router Communications Protocol — The EIB Communications Protocol, 1999.
[38] EN 50090-X, Home and Building Electronic Systems (HBES), 1994–2004.
[39] Konnex Association, Diegem, Belgium. KNX Specifications, V. 1.1, 2004.
[40] www.echelon.com.
[41] Control Network Protocol Specification, ANSI/EIA/CEA-709.1-A, 1999.
[42] Control Network Protocol Specification, EIA/CEA Std. 709.1, Rev. B, 2002.
[43] Tunneling Component Network Protocols Over Internet Protocol Channels, ANSI/EIA/CEA 852, 2002.
[44] www.lonmark.org.
[45] F. Simonot-Lion, In-Car Embedded Electronic Architectures: How to Ensure Their Safety. In Proceedings of the 5th IFAC International Conference on Fieldbus Systems and Their Applications — FeT'2003, July 2003, Aveiro, Portugal.
[46] X-by-Wire Project, Brite-EuRam III Program, X-By-Wire — Safety Related Fault Tolerant Systems in Vehicles, Final Report, 1998.
[47] TTTech Computertechnik GmbH, Time-Triggered Protocol TTP/C, High-Level Specification Document, Protocol Version 1.1, November 2003. www.tttech.com.
[48] FlexRay Consortium, FlexRay Communication System, Protocol Specification, Version 2.0, June 2004. www.flexray.com.
[49] H. Kopetz and G. Bauer, The Time-Triggered Architecture. Proceedings of the IEEE, 91, 112–126, 2003.
[50] H. Kopetz et al., Specification of the TTP/A Protocol, University of Technology, Vienna, 2002.
[51] International Standard Organization, ISO 11898-4, Road Vehicles — Controller Area Network (CAN) — Part 4: Time-Triggered Communication, ISO, 2000.
[52] Society of Automotive Engineers, J2056/1 Class C Application Requirements Classifications. In SAE Handbook, SAE, 1994.
[53] Society of Automotive Engineers, J2056/2 Survey of Known Protocols. In SAE Handbook, Vol. 2, SAE, 1994.
[54] Antal Rajnak, The LIN Standard. In The Industrial Communication Technology Handbook, CRC Press, Boca Raton, FL, 2005.
[55] Society of Automotive Engineers, Class B Data Communications Network Interface — SAE J1850 Standard — Rev. Nov. 1996, 1996.
[56] International Standard Organization, ISO 11519-2, Road Vehicles — Low Speed Serial Data Communication — Part 2: Low Speed Controller Area Network, ISO, 1994.
[57] International Standard Organization, ISO 11519-3, Road Vehicles — Low Speed Serial Data Communication — Part 3: Vehicle Area Network (VAN), ISO, 1994.
[58] International Standard Organization, ISO 11898, Road Vehicles — Interchange of Digital Information — Controller Area Network for High-Speed Communication, ISO, 1994.
[59] SAE J1939 Standards Collection. www.sae.org.
[60] MOST Cooperation, MOST Specification Revision 2.3, August 2004. www.mostnet.de.
[61] www.idbforum.org.
[62] K. Keutzer, S. Malik, A.R. Newton, J. Rabaey, and A. Sangiovanni-Vincentelli, System Level Design: Orthogonalization of Concerns and Platform-Based Design. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 19(12), 1523–1543, 2000.
[63] OSEK Consortium, OSEK/VDX Operating System, Version 2.2.2, July 2004. www.osek-vdx.org.
[64] OSEK Consortium, OSEK/VDX Communication, Version 3.0.3, July 2004. www.osek-vdx.org.
[65] OSEK Consortium, OSEK/VDX Fault-Tolerant Communication, Version 1.0, July 2001. www.osek-vdx.org.
[66] www.automotive-his.de.
[67] www.asam.de.
[68] www.misra.org.uk.
[69] International Electrotechnical Commission, IEC 61508:2000, Parts 1–7, Functional Safety of Electrical/Electronic/Programmable Electronic Safety-Related Systems, 2000.
[70] Bluetooth Consortium, Specification of the Bluetooth System, 1999. www.bluetooth.org.
[71] Bluetooth Special Interest Group, Specification of the Bluetooth System, Version 1.1, December 1999.
[72] LAN/MAN Standards Committee, IEEE Standard for Information Technology — Telecommunications and Information Exchange between Systems — Local and Metropolitan Area Networks — Specific Requirements — Part 15.4: Wireless Medium Access Control (MAC) and Physical Layer
(PHY) Specifications for Low Rate Wireless Personal Area Networks (LR-WPANs), IEEE Computer Society, Washington, 2003.
[73] LAN/MAN Standards Committee of the IEEE Computer Society, IEEE Standard for Information Technology — Telecommunications and Information Exchange between Systems — Local and Metropolitan Networks — Specific Requirements — Part 11: Wireless LAN Medium Access Control (MAC) and Physical Layer (PHY) Specifications: Higher Speed Physical Layer (PHY) Extension in the 2.4 GHz Band, 1999.
[74] LAN/MAN Standards Committee of the IEEE Computer Society, Information Technology — Telecommunications and Information Exchange between Systems — Local and Metropolitan Area Networks — Specific Requirements — Part 11: Wireless LAN Medium Access Control (MAC) and Physical Layer (PHY) Specifications, 1999.
[75] Institute of Electrical and Electronics Engineers, Part 11: Wireless LAN Medium Access Control (MAC) and Physical Layer (PHY) Specifications, Amendment 4: Further Higher Data Rate Extension in the 2.4 GHz Band, ANSI/IEEE Std 802.11, June 2003.
[76] Christoffer Apneseth, Dacfey Dzung, Snorre Kjesbu, Guntram Scheible, and Wolfgang Zimmermann, Introducing Wireless Proximity Switches. ABB Review, 4, 42–49, 2002. www.abb.com/review.
[77] Dacfey Dzung, Christoffer Apneseth, and Jan Endresen, A Wireless Sensor/Actuator Communication System for Real-Time Factory Applications, private communication. IEEE Transactions on Industrial Electronics (submitted).
2
Real-Time in Embedded Systems

Hans Hansson, Mikael Nolin, and Thomas Nolte
Mälardalen University

2.1 Introduction
2.2 Design of RTSs
    Reference Architecture • Models of Interaction • Execution Strategies • Component-Based Design • Tools for Design of RTSs
2.3 Real-Time Operating Systems
    Typical Properties of RTOSs • Mechanisms for Real-Time • Commercial RTOSs
2.4 Real-Time Scheduling
    Introduction to Scheduling • Offline Schedulers • Online Schedulers
2.5 Real-Time Communications
    Communication Techniques • Fieldbuses • Ethernet for Real-Time Communication • Wireless Communication
2.6 Analysis of RTSs
    Timing Properties • Methods for Timing Analysis • Example of Analysis • Trends and Tools
2.7 Component-Based Design of RTS
    Timing Properties and CBD • Real-Time Operating Systems • Real-Time Scheduling
2.8 Testing and Debugging of RTSs
2.9 Summary
References
In this chapter we will provide an introduction to issues, techniques, and trends in real-time systems (RTSs). We will specifically discuss design of RTSs, real-time operating systems (RTOSs), real-time scheduling, real-time communication, real-time analysis, as well as testing and debugging of RTSs. For each of these areas, state-of-the-art tools and standards are presented.
2.1 Introduction

Consider the airbag in the steering wheel of your car. After the detection of a crash (and only then), it should inflate just in time to softly catch your head and prevent it from hitting the steering wheel; not too early — since this would make the airbag deflate before it can catch you; nor too late — since the exploding
airbag could then injure you by blowing up in your face and/or catch you too late to prevent your head from banging into the steering wheel. The computer-controlled airbag system is an example of a RTS. But RTSs come in many different flavors, including vehicles, telecommunication systems, industrial automation systems, household appliances, etc. There is no commonly agreed upon definition of what a RTS is, but the following characterization is (almost) universally accepted:

• RTSs are computer systems that physically interact with the real world.
• RTSs have requirements on the timing of these interactions.

Typically, the real-world interactions are via sensors and actuators, rather than the keyboard and screen of your standard PC. Real-time requirements typically express that an interaction should occur within a specified timing bound. It should be noted that this is quite different from requiring the interaction to be as fast as possible. Essentially all RTSs are embedded in products, and the vast majority of embedded computer systems are RTSs. RTSs are the dominating application of computer technology, as more than 99% of the manufactured processors (more than 8 billion in 2000 [1]) are used in embedded systems.

Returning to the airbag system, we note that, in addition to being a RTS, it is a safety-critical system, that is, a system that, owing to severe risks of damage, has strict Quality of Service (QoS) requirements, including requirements on the functional behavior, robustness, reliability, and timeliness. A typical strict timing property could be that a certain response to an interaction must always occur within some prescribed time; for example, the charge in the airbag must detonate between 10 and 20 msec from the detection of a crash. Violating this must be avoided at any cost, since it would lead to something unacceptable, such as having to spend a couple of months in hospital. A system that is designed to meet strict timing requirements is often referred to as a hard RTS. In contrast, systems for which occasional timing failures are acceptable — possibly because this will not lead to something terrible — are termed soft RTSs.

An illustrative comparison between hard and soft RTSs that highlights the difference between the extremes is shown in Table 2.1. A typical hard RTS could in this context be an engine control system, which must operate with µsec precision, and which will severely damage the engine if timing requirements are violated by more than a few msec. A typical soft RTS could be a banking system, for which timing is important, but where there are no strict deadlines and some variations in timing are acceptable.

Unfortunately, it is impossible to build real systems that satisfy hard real-time requirements, since, owing to the imperfection of hardware (and designers), any system may break. The best that can be achieved is a system that, with very high probability, provides the intended behavior during a finite interval of time. However, on the conceptual level hard real-time makes sense, since it implies a certain amount of rigor in the way the system is designed; for example, it implies an obligation to prove that the strict timing requirements are met.

TABLE 2.1 Typical Characteristics of Hard and Soft RTSs [2]

Characteristic            Hard real-time    Soft real-time
Timing requirements       Hard              Soft
Pacing                    Environment       Computer
Peak-load performance     Predictable       Degraded
Error detection           System            User
Safety                    Critical          Noncritical
Redundancy                Active            Standby
Time granularity          Millisecond       Second
Data files                Small             Large
Data integrity            Short term        Long term
Since the early 1980s a substantial research effort has provided a sound theoretical foundation (e.g., [3,4]) and many practically useful results for the design of hard RTSs. Most notably, hard RTS scheduling has evolved into a mature discipline, using abstract, but realistic, models of tasks executing on single-CPU, multiprocessor, or distributed computer systems, together with associated methods for timing analysis. Such schedulability analysis, for example, the well-known rate-monotonic analysis [5–7], has also found significant use in some industrial segments.

However, hard real-time scheduling is not the cure for all RTSs. Its main weakness is that it is based on analysis of the worst possible scenario. For safety-critical systems this is of course a must, but for other systems, where general customer satisfaction is the main criterion, it may be too costly to design the system for a worst-case scenario that may never occur during the system's lifetime.

If we look at the other end of the spectrum, we find the best-effort approach, which is still the dominating approach in industry. The essence of this approach is to implement the system using some best practice, and then use measurements, testing, and tuning to make sure that the system is of sufficient quality. On one hand, such a system will hopefully satisfy some soft real-time requirement; the weakness being that we do not know which. On the other hand, compared with the hard real-time approach, the system can be better optimized for the available resources. A further difference is that hard RTS methods are essentially applicable to static configurations only, whereas it is less problematic to handle dynamic task creation, etc., in best-effort systems.

Having identified the weaknesses of the hard real-time and best-effort approaches, major efforts are now put into more flexible techniques for soft RTSs. These techniques provide analyzability (as with hard real-time), together with flexibility and resource efficiency (as with best-effort). The basis for the flexible techniques is often a set of quantified QoS characteristics. These are typically related to nonfunctional aspects, such as timeliness, robustness, dependability, and performance. To provide a specified QoS, some sort of resource management is needed. Such QoS management is either handled by the application, by the operating system (OS), by some middleware, or by a mix of the above. The QoS management is often a flexible online mechanism that dynamically adapts the resource allocation to balance between conflicting QoS requirements.
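As an illustration of the kind of schedulability analysis mentioned above, the following C sketch implements the classic Liu and Layland utilization-bound test for rate-monotonic scheduling; the task set used here is made up for illustration.

```c
#include <math.h>
#include <stdio.h>

/* Rate-monotonic utilization-bound test: n periodic tasks with
 * worst-case execution time C and period T are schedulable under RM
 * priorities if total utilization does not exceed n*(2^(1/n) - 1).
 * This is a sufficient, not a necessary, condition. */
typedef struct { double wcet, period; } Task;

int rm_utilization_test(const Task *ts, int n)
{
    double u = 0.0;
    for (int i = 0; i < n; i++)
        u += ts[i].wcet / ts[i].period;           /* total utilization */
    double bound = n * (pow(2.0, 1.0 / n) - 1.0); /* Liu-Layland bound */
    printf("U = %.3f, bound = %.3f\n", u, bound);
    return u <= bound;
}

int main(void)
{
    Task ts[] = { {1, 4}, {2, 8}, {1, 10} };      /* hypothetical task set */
    printf("schedulable: %s\n",
           rm_utilization_test(ts, 3) ? "yes" : "inconclusive");
    return 0;
}
```

Note that a task set failing this test is not necessarily unschedulable; exact response-time analysis would then be needed.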
2.2 Design of RTSs

The main issue in designing RTSs is timeliness, that is, that the system performs its operations at the proper points in time. Not considering timeliness at the design phase will make it virtually impossible to analyze and predict the timely behavior of the RTS. This section presents some important architectural issues for embedded RTSs, together with some supporting commercial tools.
2.2.1 Reference Architecture

A generic system architecture for a RTS is depicted in Figure 2.1. This architecture is a model of any computer-based system interacting with an external environment via sensors and actuators. Since our focus is on the RTS, we will look more into different organizations of that part of the generic architecture in Figure 2.1. The simplest RTS is a single processor, but in many cases the RTS is a distributed computer system consisting of a set of processors interconnected by a communications network. There could be several reasons for making an RTS distributed, including:

• The physical distribution of the application.
• The computational requirements that may not be conveniently provided by a single CPU.
• The need for redundancy to meet availability, reliability, or other safety requirements.
• To reduce the cabling in the system.
FIGURE 2.1 A generic RTS architecture.

FIGURE 2.2 Network infrastructure of Volvo XC90.

Figure 2.2 shows an example of a distributed RTS. In a modern car, like the one depicted in the figure, there are some 20 to 100 computer nodes (which in the automotive industry are called Electronic Control
Units [ECUs]) interconnected with one or more communication networks. The initial motivation for this type of electronic architecture in cars was the need to reduce the amount of cabling. However, the electronic architecture has also led to other significant improvements, including substantial pollution reduction and new safety mechanisms, such as computer-controlled Electronic Stabilization Programs (ESPs). The current development is toward making the most safety-critical vehicle functions, such as braking and steering, completely computer controlled. This is done by removing the mechanical connections (e.g., between steering wheel and front wheels, and between brake pedal and brakes), replacing them with computers and computer networks. Meeting the stringent safety requirements for such functions will require careful introduction of redundancy mechanisms in hardware and communication, as well as software; that is, a safety-critical system architecture is needed (an example of such an architecture is TTA [8]).
2.2.2 Models of Interaction

In Section 2.2.1 we presented the physical organization of a RTS, but for an application programmer this is not the most important aspect of the system architecture. Actually, from an application programmer's
perspective, the system architecture is given more by the execution paradigm (execution strategy) and the interaction model used in the system. In this section we describe what an interaction model is and how it affects the real-time properties of a system, and in Section 2.2.3 we discuss the execution strategies used in RTSs.

A model of interaction describes the rules by which components interact with each other (in this section we will use the term component to denote any type of software unit, such as a task or a module). The interaction model can govern both control flow and data flow between system components. One of the most important design decisions, for all types of systems, is which interaction models to use (sadly, however, this decision is often implicit, and hidden in the system's architectural description).

When designing RTSs, attention should be paid to the timing properties of the interaction models chosen. Some models have a more predictable and robust behavior with respect to timing than other models. Examples of the more predictable models that are commonly used in RTS design are pipes-and-filters, publisher–subscriber, and blackboard. On the other end of the spectrum of interaction models there are models that increase the (timing) unpredictability of the system. These models should, if possible, be avoided when designing RTSs. The two most notable, and commonly used, are client–server and message boxes.

2.2.2.1 Pipes-and-Filters

In this model, both data and control flow are specified using the input and output ports of components. A component becomes eligible for execution when data has arrived on its input ports, and when the component finishes execution it produces output on its output ports. This model fits well for many types of control programs, and control laws are easily mapped to this interaction model. Hence, it has gained widespread use in the real-time community. The real-time properties of this model are also quite nice. Since both data and control flow unidirectionally through a series of components, the order of execution and the end-to-end timing delay usually become predictable. The model also provides a high degree of decoupling in time; that is, components can often execute without having to worry about timing delays caused by other components. Hence, it is usually straightforward to specify the compound timing behavior of a set of components.

2.2.2.2 Publisher–Subscriber

The publisher–subscriber model is similar to the pipes-and-filters model, but it usually decouples data and control flow. That is, a subscriber can usually choose between different forms of triggering for its execution. If the subscriber chooses to be triggered on each new published value, the publisher–subscriber model takes on the form of the pipes-and-filters model. In addition, however, a subscriber could choose to ignore the timing of the published values and simply use the latest published value. Also, in the publisher–subscriber model, the publisher is not necessarily aware of the identity, or even the existence, of its subscribers. This provides a higher degree of decoupling of components.

Similar to the pipes-and-filters model, the publisher–subscriber model provides good timing properties. However, a prerequisite for the analysis of systems using this model is that subscriber components make explicit the values they subscribe to (this is not mandated by the model itself).
However, when using the publisher–subscriber model for embedded systems, it is the norm that subscription information is available (this information is used, for instance, to decide which values are to be published over a communications network, and to decide the receiving nodes of those values).

2.2.2.3 Blackboard

The blackboard model allows variables to be published on a globally available blackboard area. Thus, it resembles the use of global variables. The model allows any component to read or write values to variables in the blackboard. Hence, the software engineering qualities of the blackboard model are questionable. Nevertheless, it is a model that is commonly used, and in some situations it
provides a pragmatic solution to problems that are difficult to address with more stringent interaction models. Software engineering aspects aside, the blackboard model does not introduce any extra elements of unpredictable timing. On the other hand, the flexibility of the model does not help engineers to achieve predictable systems. Since the model does not address the control flow, components can execute relatively undisturbed and decoupled from other components.

2.2.2.4 Client–Server

In the client–server model, a client asynchronously invokes a service of a server. The service invocation passes the control flow (plus any input data) to the server, and control stays at the server until it has completed the service. When the server is done, the control flow (and any return data) is returned to the client, which in turn resumes execution. The client–server model has inherently unpredictable timing. Since services are invoked asynchronously, it is very difficult to a priori assess the load on the server for a certain service invocation. Thus, it is difficult to estimate the delay of the service invocation and, in turn, it is difficult to estimate the response time of the client. This matter is further complicated by the fact that most components often behave both as clients and as servers (a server often uses other servers to implement its own services), leading to very complex and unanalyzable control flow paths.

2.2.2.5 Message Boxes

A component can have a set of message boxes, and components communicate by posting messages in each other's message boxes. Messages are typically handled in First In First Out (FIFO) order, or in priority order (where the sender specifies a priority). Message passing does not change the flow of control for the sender. A component that tries to receive a message from an empty message box, however, blocks on that message box until a message arrives (often the receiver can specify a timeout to prevent indefinite blocking). From a sender's point of view, the message box model has problems similar to those of the client–server model. The data sent by the sender (and the action that the sender expects the receiver to perform) may be delayed in an unpredictable way when the receiver is highly loaded. Also, the asynchronous nature of the message passing makes it difficult to foresee the load of a receiver at any particular moment. Furthermore, from the receiver's point of view, the reading of message boxes is unpredictable in the sense that the receiver may or may not block on the message box. Also, since message boxes often are of limited size, there is a risk that a highly loaded receiver loses messages. Lost messages are another source of unpredictability.
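To contrast with the less predictable models just described, the following is a minimal C sketch of the pipes-and-filters model (all names are invented for illustration): a component runs only when fresh data is present on its input port, consumes it, and triggers the downstream component through its output port.

```c
#include <stdio.h>

/* Minimal pipes-and-filters sketch: a port carries a value plus a
 * freshness flag, and a filter component is eligible to run only when
 * its input port holds fresh data. */
typedef struct { double value; int fresh; } Port;

static void lowpass_filter(Port *in, Port *out)
{
    static double state = 0.0;
    if (!in->fresh)
        return;                    /* no input -> not eligible to run   */
    state = 0.9 * state + 0.1 * in->value;
    in->fresh = 0;                 /* input consumed                    */
    out->value = state;
    out->fresh = 1;                /* triggers the downstream component */
}

int main(void)
{
    Port sensor = { 10.0, 1 }, filtered = { 0.0, 0 };
    lowpass_filter(&sensor, &filtered); /* data and control flow in one direction */
    printf("filtered = %.2f (fresh=%d)\n", filtered.value, filtered.fresh);
    return 0;
}
```

Because data and control flow only forward through such components, the end-to-end delay of a chain can be bounded by summing the per-component execution times, which is exactly the analyzability benefit described above.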
2.2.3 Execution Strategies
There are two main execution paradigms for RTSs: time-triggered and event-triggered. When using time-triggered execution, activities occur at predefined instants in time; for example, a specific sensor value is read exactly every 10 msec and, 2 msec later, the corresponding actuator receives an updated control parameter. In an event-triggered execution, on the other hand, actions are triggered by event occurrences; for example, when the toxic fluid in a tank reaches a certain level, an alarm will go off. It should be noted that the same functionality can typically be implemented in both paradigms; for example, a time-triggered implementation of the above alarm would be to periodically read the level-measuring sensor and activate the alarm when the read level exceeds the maximum allowed. If alarms are rare, the time-triggered version will have a much higher computational overhead than the event-triggered one. On the other hand, the periodic sensor readings will facilitate detection of a malfunctioning sensor. Time-triggered executions are used in many safety-critical systems with high dependability requirements (such as avionic control systems), whereas the majority of other systems are event-triggered. Dependability can also be guaranteed in the event-triggered paradigm, but owing to the observability
provided by the exact timing of time-triggered executions, most experts argue for using time-triggered execution in ultra-dependable systems. The main argument against time-triggered execution is its lack of flexibility and its requirement of pre-runtime schedule generation (which is a nontrivial and possibly time-consuming task). Time-triggered systems are mostly implemented by simple proprietary table-driven dispatchers [9] (see Section 2.4.2 for a discussion on table-driven execution), but complete commercial systems including design tools are also available [10,11]. For the event-triggered paradigm a large number of commercial tools and OSs are available (examples are given in Section 2.3.3). There are also examples of systems integrating the two execution paradigms, thereby aiming at getting the best of both worlds: time-triggered dependability and event-triggered flexibility. One example is the Basement system [12] and its associated real-time kernel Rubus [13].

Since computations in time-triggered systems are statically allocated both in space (to a specific processor) and in time, some sort of configuration tool is often used. This tool assumes that the computations are packaged into schedulable units (corresponding to tasks or threads in an event-triggered system). Typically, for example, in Basement, computations are control-flow based, in the sense that they are defined by sequences of schedulable units, each unit performing a computation based on its inputs and producing outputs to the next unit in the sequence. The system is configured by defining the sequences and their timing requirements. The configuration tool will then automatically (if possible)1 generate a schedule that guarantees that all timing requirements are met. Event-triggered systems typically have richer and more complex Application Programming Interfaces (APIs), defined by the OS and middleware used, which will be elaborated on in Section 2.3.

1 This scheduling problem is theoretically intractable, so the configuration tool has to rely on heuristics that work well in practice, but that do not guarantee to find a solution in all cases where one exists.
2.2.4 Component-Based Design
Component-Based Design (CBD) of software systems is an interesting approach for software engineering in general, and for engineering of RTSs in particular. In CBD, a software component is used to encapsulate some functionality. That functionality is only accessed through the interface of the component. A system is composed by assembling a set of components and connecting their interfaces.

The reason CBD could prove extra useful for RTSs is the possibility to extend components with introspective interfaces. An introspective interface does not provide any functionality per se; rather, the interface can be used to retrieve information about extra-functional properties of the component. Extra-functional properties can include attributes such as memory consumption, execution times, task periods, etc. For RTSs, timing properties are of course of particular interest. Unlike the functional interfaces of components, the introspective interfaces can be available offline, that is, during the component assembly phase. This way, the timing attributes of the system components can be obtained at design time, and tools to analyze the timing behavior of the system can be used. If the introspective interfaces are also available online, they could be used in, for instance, admission control algorithms. An admission controller could query new components for their timing behavior and resource consumption before deciding whether to accept a new component into the system.

Unfortunately, many industry-standard software techniques are based on the client–server or the message-box models of interaction, which we deemed, in Section 2.2.2, unfit for RTSs. This is especially true for the most commonly used component models. For instance, the CORBA Component Model (CCM) [14], Microsoft's COM [15] and .NET [16] models, and Java Beans [17] all have the client–server model as their core model. Also, none of these component technologies allows the specification of extra-functional properties through introspective interfaces. Hence, from the real-time perspective, the biggest advantage of CBD is void for these technologies.

However, there are numerous research projects addressing CBD for real-time and embedded systems (e.g., [18–21]). These projects are addressing the issues left behind by the existing commercial technologies, such as timing predictability (using suitable computational models), support for offline analysis of
component assemblies, and better support for resource-constrained systems. Often, these projects strive to remove the considerable runtime flexibility provided by existing technologies. This runtime flexibility is judged to be the foremost contributor to unpredictability (the flexibility also adds to the runtime complexity and prevents the use of CBD in resource-constrained systems).
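The shape of such an introspective interface can be sketched in a few lines of C. All types and the admission rule below are hypothetical, intended only to show how extra-functional attributes could sit next to the functional interface and be queried by an online admission controller.

#include <stdint.h>
#include <stdbool.h>

typedef struct {
    uint32_t wcet_us;       /* worst-case execution time */
    uint32_t period_us;     /* activation period */
    uint32_t stack_bytes;   /* worst-case stack usage */
} timing_info_t;

typedef struct {
    void (*run)(void);                          /* functional interface */
    const timing_info_t *(*introspect)(void);   /* introspective interface */
} component_t;

/* An online admission controller could use the introspective interface to
   check a new component's CPU demand against the remaining budget
   (both expressed in parts per million of one CPU). */
bool admit(const component_t *c, uint32_t budget_ppm)
{
    const timing_info_t *t = c->introspect();
    uint32_t util_ppm = (uint32_t)((uint64_t)t->wcet_us * 1000000u / t->period_us);
    return util_ppm <= budget_ppm;
}

Offline, a composition tool could read the same timing_info_t records from all components and feed them to the schedulability analyses discussed in Section 2.6.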
2.2.5 Tools for Design of RTSs
In industry, the term real-time system is heavily overloaded and can mean anything from interactive systems to superfast systems or embedded systems. Consequently, it is not easy to judge which tools are suitable for developing RTSs (as we define real-time in this chapter). For instance, UML [22] is commonly used for software design. However, UML's focus is mainly on client–server solutions, and it has proven inapt for RTS design. As a consequence, UML-based tools that extend UML with constructs suitable for real-time programs have emerged. The two best-known products are Rational's Rose RealTime [23] and i-Logix' Rhapsody [24]. These tools provide UML support with the extension of real-time profiles. While giving real-time engineers access to suitable abstractions and computational models, these tools do not provide means to describe timing properties or requirements in a formal way; thus, they do not allow automatic verification of timing requirements.

TeleLogic provides programming and design support using the language SDL [25]. SDL was originally developed as a specification language for the telecom industry, and as such it is highly suitable for describing complex reactive systems. However, its fundamental model of computation is the message-box model, which has inherently unpredictable timing behavior. Nevertheless, for soft embedded RTSs, SDL can give very time- and space-efficient implementations.

For more resource-constrained hard RTSs, design tools are provided by, for example, Arcticus System [13], TTTech [10], and Vector [26]. These tools are instrumental during both system design and implementation, and also provide some timing analysis techniques that allow timing verification of the system (or parts of the system). However, these tools are based on proprietary formats and processes, and have as such reached a limited customer base (mainly within the automotive industry).

Within the near future, UML2 will become an adopted standard [27]. UML2 has support for computational models suitable for RTSs. This support comes mainly in the form of ports that can have protocols associated with them. Ports are either provided or required, hence allowing type-matching of connections between components. UML2 also includes many of the concepts from Rose RealTime, Rhapsody, and SDL. Other future design techniques that are expected to have an impact on the design of RTSs include the EAST/EEA Architecture Description Language (EAST-ADL) [28]. The EAST-ADL is developed by the automotive industry and is a description language that will cover the complete development cycle of distributed, resource-constrained, safety-critical RTSs. Tools to support development with EAST-ADL (which is a UML2-compliant language) are expected to be provided by automotive tool vendors such as ETAS [29], Vector [30], and Siemens [31].
2.3 Real-Time Operating Systems
An RTOS provides services for resource access and resource sharing, very similar to a general-purpose OS. An RTOS, however, provides additional services suited to real-time development and also supports the development process for embedded systems. Using a general-purpose OS when developing RTSs has several drawbacks:

• High resource utilization, for example, large RAM and ROM footprints and high internal CPU demand.
• Difficulty in accessing hardware and devices in a timely manner, for example, no application-level control over interrupts.
• Lack of services to allow timing-sensitive interactions between different processes.
2.3.1 Typical Properties of RTOSs
The state of practice in RTOSs is reflected in Reference 32. Not all OSs are RTOSs. An RTOS is typically multithreaded and preemptible, there has to be a notion of thread priority, predictable thread synchronization has to be supported, priority inheritance should be supported, and the OS behavior should be known [33]. This means that the interrupt latency, the worst-case execution time (WCET) of system calls, and the maximum time during which interrupts are masked must be known.

A commercial RTOS is usually marketed as the runtime component of an embedded development platform. As a general rule of thumb one can say that RTOSs are:

• Suitable for resource-constrained environments. RTSs typically operate in such environments. Most RTOSs can be configured pre-runtime (e.g., at compile time) to include only a subset of the total functionality. Thus, the application developer can choose to leave out unused portions of the RTOS in order to save resources. RTOSs typically store much of their configuration in ROM. This is done mainly for two purposes: (1) to minimize the use of expensive RAM memory, and (2) to minimize the risk that critical data is overwritten by an erroneous application.
• Giving the application programmer easy access to hardware features, including interrupts and devices. Most often the RTOS gives the application programmer means to install Interrupt Service Routines at compile time and/or at runtime. This means that the RTOS leaves all interrupt handling to the application programmer, allowing fast, efficient, and predictable handling of interrupts (see the sketch after this list). In general-purpose OSs, memory-mapped devices are usually protected from direct access using the MMU (Memory Management Unit) of the CPU, hence forcing all device accesses to go through the OS. RTOSs typically do not protect such devices, but allow the application to manipulate the devices directly. This gives faster and more efficient access to the devices. (However, this efficiency comes at the price of an increased risk of erroneous use of the device.)
• Providing services that allow implementation of timing-sensitive code. An RTOS typically has many mechanisms to control the relative timing between different processes in the system. Most notably, an RTOS has a real-time process scheduler whose function is to make sure that the processes execute in the way the application programmer intended them to. We will elaborate more on the issues of scheduling in Section 2.4. An RTOS also provides mechanisms to control the processes' relative performance when accessing shared resources. This can, for instance, be done by priority queues instead of the plain FIFO queues used in general-purpose OSs. Typically, an RTOS supports one or more real-time resource locking protocols, such as priority inheritance or priority ceiling (Section 2.3.2 discusses resource locking protocols further).
• Tailored to fit the embedded systems development process. RTSs are usually constructed in a host environment that is different from the target environment, so-called cross-platform development. Also, it is typical that the whole memory image, including both the RTOS and one or more applications, is created on the host platform and downloaded to the target platform. Hence, most RTOSs are delivered as source code modules or precompiled libraries that are statically linked with the applications at compile time.
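As an illustration of the direct hardware access described above, the following is a hypothetical sketch: install_isr(), the vector number, and the device address are all invented for this example and do not correspond to any particular RTOS or board.

#include <stdint.h>

/* Memory-mapped device register; the address is board-specific and
   invented for this example. */
#define UART_TX (*(volatile uint8_t *)0x40001000u)

/* Hypothetical RTOS call for installing an interrupt handler. */
extern void install_isr(unsigned vector, void (*handler)(void));

/* The application owns the interrupt: the handler writes directly to the
   device with no OS mediation, keeping latency short and predictable. */
static void timer_isr(void)
{
    UART_TX = 'T';
}

void app_init(void)
{
    install_isr(5, timer_isr);   /* the vector number is illustrative only */
}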
2.3.2 Mechanisms for Real-Time
One of the most important functions of an RTOS is to arbitrate access to shared resources in such a way that the timing behavior of the system becomes predictable. The two most obvious resources the RTOS manages access to are:

• The CPU — that is, the RTOS should allow processes to execute in a predictable manner.
• Shared memory areas — that is, the RTOS should resolve contention for shared memory in a way that gives predictable timing.

CPU access is arbitrated with a real-time scheduling policy. Section 2.4 describes real-time scheduling policies in more depth. Examples of scheduling policies that can be used in RTSs
are priority scheduling, deadline scheduling, and rate scheduling. Some of these policies directly use timing attributes (like deadlines) of the tasks to make scheduling decisions, whereas other policies use scheduling parameters (like priority, rate, or bandwidth) that indirectly affect the timing of the tasks. A special form of scheduling, which is also very useful for RTSs, is table-driven (static) scheduling, described further in Section 2.4.2. To summarize, in table-driven scheduling all arbitration decisions have been made offline and the RTOS scheduler just follows a simple table. This gives very good timing predictability, albeit at the expense of system flexibility.

The most important aspect of a real-time scheduling policy is that it should provide means to analyze, a priori, the timing behavior of the system, hence giving a predictable timing behavior of the system. Scheduling in general-purpose OSs normally emphasizes properties such as fairness, throughput, and guaranteed progress; these properties may be adequate in their own right, but they are usually in conflict with the requirement that an RTOS should provide timing predictability.

Shared resources (such as memory areas, semaphores, and mutexes) are also arbitrated by the RTOS. When a task locks a shared resource, it will block all other tasks that subsequently try to lock the resource. In order to achieve predictable blocking times, special real-time resource locking protocols have been proposed ([34,35] provide more details about these protocols).

2.3.2.1 Priority Inheritance Protocol
The priority inheritance protocol (PIP) makes a low-priority task inherit the priority of any higher-priority task that becomes blocked on a resource locked by the lower-priority task. This is a simple and straightforward method to lower the blocking time. However, it is computationally intractable to calculate the worst-case blocking (which may be infinite, since the protocol does not prevent deadlocks). Hence, for hard RTSs, or when timing performance needs to be calculated a priori, the PIP is not adequate.

2.3.2.2 Priority Ceiling Inheritance Protocol
The priority ceiling protocol (PCP) associates with each resource a ceiling value that is equal to the highest priority of any task that may lock the resource. By clever use of the ceiling values of the resources, the RTOS scheduler will manipulate task priorities to avoid the problems of PIP. PCP guarantees freedom from deadlocks, and the worst-case blocking is relatively easy to calculate. However, the computational complexity of keeping track of ceiling values and task priorities gives PCP a high runtime overhead.

2.3.2.3 Immediate Ceiling Priority Inheritance Protocol
The immediate inheritance protocol (IIP) also associates with each resource a ceiling value that is equal to the highest priority of any task that may lock the resource. However, unlike in PCP, in IIP a task is immediately assigned the ceiling priority of the resource it is locking. IIP has the same real-time properties as PCP (including the same worst-case blocking time).2 However, IIP is significantly easier to implement. It is, in fact, for single-node systems, easier to implement than any other resource locking protocol (including non-real-time protocols). In IIP no actual locks need to be implemented; it is enough for the RTOS to adjust the priority of the task that locks or releases a resource.
IIP has other operational benefits; notably, it paves the way for letting multiple tasks use the same stack area. OSs based on IIP can be used to build systems with extremely small footprints [36,37].

2 The average blocking time will, however, be higher in IIP than in PCP.
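The IIP locking rule can be sketched in a few lines. The sketch below assumes a kernel where larger numbers mean higher priority; my_priority() and set_my_priority() are hypothetical kernel calls standing in for whatever the RTOS actually provides.

#include <stdint.h>

typedef struct {
    uint8_t ceiling;   /* highest priority of any task that may lock this resource */
} resource_t;

/* Hypothetical kernel calls. */
extern uint8_t my_priority(void);
extern void    set_my_priority(uint8_t prio);

/* Lock: no lock object or wait queue is needed; raising the caller to the
   ceiling ensures that no other user of the resource can be dispatched. */
static uint8_t iip_lock(resource_t *r)
{
    uint8_t old = my_priority();
    set_my_priority(r->ceiling);
    return old;                    /* saved by the caller for iip_unlock() */
}

/* Unlock: restoring the old priority may cause an immediate preemption. */
static void iip_unlock(uint8_t saved_prio)
{
    set_my_priority(saved_prio);
}

Since a task never blocks after it has started (any potential blocker is excluded by the ceiling), tasks run to completion one after the other at each priority level, which is what allows them to share a single stack.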
2.3.3 Commercial RTOSs
There is an abundance of commercial RTOSs. Most of them provide adequate mechanisms to enable development of RTSs. Some examples are Tornado/VxWorks [38], LYNX [39], OSE [40], QNX [41], RT-Linux [42], and ThreadX [43]. However, the major problem with these OSs is the rich set of
primitives provided. These systems provide both primitives that are suitable for RTSs and primitives that are unfit for RTSs (or that should be used with great care). For instance, they usually provide multiple resource locking protocols, some of which are suitable and some of which are not suitable for real-time. This richness becomes a problem when these OSs are used by inexperienced engineers, and/or when projects are large and project management does not provide clear design guidelines/rules. In these situations, it is very easy to use primitives that will contribute to the timing unpredictability of the developed system. Rather, an RTOS should help the engineers and project managers by providing only mechanisms that help in designing predictable systems. However, there is an obvious conflict between the desire/need of RTOS manufacturers to provide rich interfaces and the stringency needed by designers of RTSs.

There is a smaller set of RTOSs that have been designed to resolve these problems, and at the same time allow extremely lightweight implementations of predictable RTSs. The driving idea is to provide a small set of primitives that guides the engineers toward a good design of their system. Typical examples are the research RTOS Asterix [36] and the commercial RTOS SSX5 [37]. These systems provide a simplified task model, in which tasks cannot suspend themselves (e.g., there is no sleep() primitive) and tasks are restarted from their entry point on each invocation. The only resource locking protocol supported is IIP, and the scheduling policy is fixed-priority scheduling. These limitations make it possible to build an RTOS that is able to run, for example, ten tasks using less than 200 bytes of RAM, while still giving predictable timing behavior [44]. Other commercial systems that follow a similar principle of reducing the degrees of freedom, and hence promote stringent design of predictable RTSs, include Arcticus Systems' Rubus OS [13].

Many of the commercial RTOSs provide standard APIs. The most important RTOS standards are RT-POSIX [45], OSEK [46], and APEX [47]. Here we only deal with POSIX, since it is the most widely adopted RTOS standard, but those interested in automotive and avionic systems should take a closer look at OSEK and APEX, respectively.

The POSIX standard is based on Unix, and its goal is portability of applications at the source code level. The basic POSIX services include task and thread management, file system management, input and output, and event notification via signals. The POSIX real-time interface defines services facilitating concurrent programming and providing predictable timing behavior. Concurrent programming is supported by synchronization and communication mechanisms that allow predictability. Predictable timing behavior is supported by preemptive fixed-priority scheduling, time management with high resolution, and virtual memory management. Several restricted subsets of the standard, intended for different types of systems, have been defined, as well as specific language bindings, for example, for Ada [48].
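As a small illustration of the POSIX real-time interface, the sketch below creates a fixed-priority SCHED_FIFO thread and a mutex using the priority-ceiling protocol (PTHREAD_PRIO_PROTECT). The priority values are arbitrary, error handling is omitted, and the program assumes an OS that implements the relevant POSIX options (and sufficient privileges to use them).

#include <pthread.h>
#include <sched.h>
#include <stddef.h>

static pthread_mutex_t lock;

static void *worker(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    /* ... access the shared resource with bounded blocking ... */
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    /* Mutex using the immediate priority-ceiling protocol. */
    pthread_mutexattr_t ma;
    pthread_mutexattr_init(&ma);
    pthread_mutexattr_setprotocol(&ma, PTHREAD_PRIO_PROTECT);
    pthread_mutexattr_setprioceiling(&ma, 40);   /* highest priority of any user */
    pthread_mutex_init(&lock, &ma);

    /* Thread with an explicit, fixed priority under SCHED_FIFO. */
    pthread_attr_t ta;
    struct sched_param sp = { .sched_priority = 30 };
    pthread_attr_init(&ta);
    pthread_attr_setinheritsched(&ta, PTHREAD_EXPLICIT_SCHED);
    pthread_attr_setschedpolicy(&ta, SCHED_FIFO);
    pthread_attr_setschedparam(&ta, &sp);

    pthread_t t;
    pthread_create(&t, &ta, worker, NULL);
    pthread_join(t, NULL);
    return 0;
}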
2.4 Real-Time Scheduling
Traditionally, real-time schedulers are divided into offline and online schedulers. Offline schedulers make all scheduling decisions before the system is executed. At runtime, a simple dispatcher is used to activate tasks according to the offline-generated schedule. Online schedulers, on the other hand, decide during execution, based on various parameters, which task should execute at any given time. As a multitude of different schedulers has been developed in the research community, this section focuses on the main categories of schedulers that are readily available in existing RTOSs.
2.4.1 Introduction to Scheduling
An RTS consists of a set of real-time programs, which in turn consist of a set of tasks. These tasks are sequential pieces of code, executing on a platform with limited resources. The tasks have different timing
properties, for example, execution times, periods, and deadlines. Several tasks can be allocated to a single processor. The scheduler decides, at each moment, which task to execute.

An RTS can be preemptive or nonpreemptive. In a preemptive system, tasks can preempt each other, letting the task with the highest priority execute. In a nonpreemptive system, a task that has been allowed to start will execute until its completion.

Tasks can be categorized as periodic, sporadic, or aperiodic. Periodic tasks execute with a specified time (the period) between task releases. Aperiodic tasks have no information saying when the task is to be released; usually, aperiodic tasks are triggered by interrupts. Similarly, sporadic tasks have no period, but in contrast to aperiodic tasks, sporadic tasks have a known minimum time between releases. Typically, tasks that perform measurements are periodic, collecting some value(s) every nth time unit. A sporadic task typically reacts to an event/interrupt that we know has a minimum interarrival time, for example, an alarm or the emergency shutdown of a production robot. The minimum interarrival time can be constrained by physical laws, or it can be enforced by some hardware mechanism. If we do not know the minimum time between two consecutive events, we must classify the event-handling task as aperiodic.

A real-time scheduler schedules the real-time tasks sharing the same resource (e.g., a CPU or a network link). The goal of the scheduler is to make sure that the timing requirements of these tasks are satisfied. The scheduler decides, based on the task timing properties, which task is to execute or to use the resource.
2.4.2 Offline Schedulers
Offline schedulers, or table-driven schedulers, work as follows: the scheduler creates a schedule (the table) before the system is started (offline). At runtime, a dispatcher follows the schedule and makes sure that tasks only execute in their predetermined time slots (according to the schedule). Offline schedules are commonly used to implement the time-triggered execution paradigm (described in Section 2.2.3). By creating a schedule offline, complex timing constraints can be handled in a way that would be difficult to do online. The schedule that is created will be used at runtime; therefore, the online behavior of table-driven schedulers is very deterministic. Because of this determinism, table-driven schedulers are commonly used in applications that have very high safety-critical demands. However, since the schedule is created offline, the flexibility is very limited, in the sense that as soon as the system changes (owing to, e.g., added functionality or a change of hardware), a new schedule has to be created and given to the dispatcher. Creating new schedules is nontrivial and sometimes very time-consuming. There also exist combinations of the predictable table-driven scheduling and the more flexible priority-based schedulers, and there exist methods to convert one policy to another [13,49,50].
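A minimal sketch of such a dispatcher is shown below. The schedule table, the task functions, and wait_until() are all hypothetical; in a real system the table would be generated by an offline scheduling tool, and the waiting primitive would be provided by the kernel or a hardware timer.

#include <stdint.h>

typedef struct {
    uint32_t offset_us;     /* release time within the major cycle */
    void   (*task)(void);   /* schedulable unit to run in this slot */
} slot_t;

/* Task functions and timer service assumed to be provided elsewhere. */
extern void task_sensor(void), task_control(void), task_actuate(void);
extern void wait_until(uint64_t abs_time_us);   /* hypothetical */

/* This table would be generated offline by a scheduling tool. */
static const slot_t schedule[] = {
    {    0u, task_sensor  },
    { 2000u, task_control },
    { 4000u, task_actuate },
};
#define MAJOR_CYCLE_US 10000u

void dispatcher(void)
{
    uint64_t cycle_start = 0;
    for (;;) {
        for (unsigned i = 0; i < sizeof schedule / sizeof schedule[0]; i++) {
            wait_until(cycle_start + schedule[i].offset_us);
            schedule[i].task();
        }
        cycle_start += MAJOR_CYCLE_US;   /* repeat the table indefinitely */
    }
}

Note that the runtime logic is trivial: all the intelligence lies in the offline tool that produced the table, which is exactly why the online behavior is so deterministic.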
2.4.3 Online Schedulers
Scheduling policies that make their scheduling decisions during runtime are classified as online schedulers. These schedulers make their scheduling decisions based on some task properties, for example, task priority. Schedulers that base their scheduling decisions on task priorities are also called priority-based schedulers.

2.4.3.1 Priority-Based Schedulers
With priority-based schedulers, flexibility is increased (compared with table-driven schedulers), since the schedule is created online, based on the constraints of the currently active tasks. Hence, priority-based schedulers can cope with changes in workload and added functions, as long as the schedulability of the task set is not violated. However, the exact behavior of priority-based schedulers is harder to predict; therefore, these schedulers are less often used in the most safety-critical applications.

Two common priority-based scheduling policies are Fixed-Priority Scheduling (FPS) and Earliest Deadline First (EDF). The difference between these scheduling policies is whether the priorities of the real-time tasks are fixed or whether they can change during execution (i.e., whether they are dynamic).

In FPS, priorities are assigned to the tasks before execution (offline). The task with the highest priority among all tasks that are available for execution is scheduled for execution. It can be proven that some
priority assignments are better than others. For instance, for a simple task model with strictly periodic, noninterfering tasks with deadlines equal to the periods, the Rate Monotonic (RM) priority assignment has been shown by Liu and Layland [5] to be optimal. In RM, the priority is assigned based on the period of the task: the shorter the period, the higher the priority assigned to the task.

Using EDF, the task with the nearest (earliest) deadline among all available tasks is selected for execution. Therefore, the priority is not fixed; it changes with time. It has been shown that for simple task models EDF is an optimal dynamic priority scheme [5].

2.4.3.2 Scheduling with Aperiodics
In order for priority-based schedulers to cope with aperiodic tasks, different service methods have been presented. The objective of these service methods is to give a good average response time for aperiodic requests, while preserving the timing properties of periodic and sporadic tasks. These services are implemented using special server tasks. In the scheduling literature many types of servers are described. For FPS, for instance, the Sporadic Server (SS) is presented by Sprunt et al. [51]. The SS has a fixed priority chosen according to the RM policy. For EDF, the Dynamic Sporadic Server (DSS) [52,53] extends SS. Other EDF-based schedulers are the Constant Bandwidth Server (CBS), presented by Abeni and Buttazzo [54], and the Total Bandwidth Server (TBS) by Spuri and Buttazzo [52,55]. Each server is characterized partly by its unique mechanism for assigning deadlines, and partly by a set of variables used to configure the server. Examples of such variables are bandwidth, period, and capacity. In Section 2.6 we give examples of how the timing properties of FPS can be calculated.
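The two priority rules of Section 2.4.3.1 are easy to express in code. The sketch below uses a hypothetical task_t type: rm_cmp() orders a task set by RM priority (shorter period first, e.g., as a qsort() comparator), and edf_pick() shows the EDF dispatch rule of selecting, at each scheduling point, the ready task with the earliest absolute deadline.

#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

typedef struct {
    uint32_t period;        /* used offline by RM */
    uint64_t abs_deadline;  /* used online by EDF */
    bool     ready;
} task_t;

/* RM: shorter period means higher priority; sorting a task array with this
   comparator yields the static priority order, highest priority first. */
static int rm_cmp(const void *a, const void *b)
{
    const task_t *x = a, *y = b;
    return (x->period > y->period) - (x->period < y->period);
}

/* EDF: index of the ready task with the earliest absolute deadline,
   or -1 if no task is ready. */
static int edf_pick(const task_t *ts, size_t n)
{
    int best = -1;
    for (size_t i = 0; i < n; i++)
        if (ts[i].ready && (best < 0 || ts[i].abs_deadline < ts[best].abs_deadline))
            best = (int)i;
    return best;
}

The contrast is visible in the code itself: RM does its work once, before the system runs, whereas EDF must reevaluate its choice at every scheduling point as absolute deadlines move forward.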
2.5 Real-Time Communications
Real-time communication aims at providing timely and deterministic communication of data between distributed devices. In many cases, there are requirements to provide guarantees of the real-time properties of these transmissions. There are real-time communication networks of different types, ranging from small fieldbus-based control systems to large Ethernet/Internet distributed applications. There is also a growing interest in wireless solutions. In this section we give a brief introduction to communications in general and real-time communications in particular. We then provide an overview of the currently most popular real-time communication systems and protocols, in both industry and academia.
2.5.1 Communication Techniques
Common access mechanisms used in communication networks are CSMA/CD (Carrier Sense Multiple Access/Collision Detection), CSMA/CA (Carrier Sense Multiple Access/Collision Avoidance), TDMA (Time Division Multiple Access), tokens, central masters, and mini-slotting. These techniques are used in both real-time and non-real-time communication, and each technique has different timing characteristics.

In CSMA/CD, collisions between messages are detected, causing the messages involved in the collision to be retransmitted. CSMA/CD is used, for example, in Ethernet. CSMA/CA, on the other hand, avoids collisions and is therefore more deterministic in its behavior than CSMA/CD. Hence, CSMA/CA is more suitable for hard real-time guarantees, whereas CSMA/CD can provide soft real-time guarantees. Examples of networks that implement CSMA/CA are the Controller Area Network (CAN) and ARINC 629.

TDMA uses time to achieve exclusive use of the network. Messages are sent at predetermined instants in time. Hence, the behavior of TDMA-based networks is very deterministic, that is, very suitable for providing real-time guarantees. One example of a TDMA-based real-time network is TTP.

An alternative way of eliminating collisions on the network is to use tokens. In token-based networks, only the owner of the (network-wide unique) token is allowed to send messages on the network.
Once the token holder is done, or has used its allotted time, the token is passed to another node. Tokens are used in, for example, Profibus.

It is also possible to eliminate collisions by letting one node in the network be the master node. The master node controls the traffic on the network, deciding which messages are allowed to be sent, and when. This approach is used in, for example, LIN and TTP/A.

Finally, mini-slotting can also be used to eliminate collisions. With mini-slotting, as soon as the network is idle and some node would like to transmit a message, the node has to wait for a (node-unique) time before sending any messages. If several competing nodes want to send messages, a node with a longer waiting time will see that another node has already started its transmission; in such a situation, the node has to wait until the network becomes idle again. Hence, collisions are avoided. Mini-slotting can be found in, for example, FlexRay and ARINC 629.
2.5.2 Fieldbuses
Fieldbuses are a family of factory communication networks that evolved in response to the demand to reduce cabling costs in factory automation systems. By moving from a situation in which every controller had its own cables connecting its sensors to the controller (a parallel interface), to a system with a set of controllers sharing a bus (a serial interface), costs could be cut and flexibility increased. This evolution was driven both by the number of cables in the system increasing as the number of sensors and actuators grew, and by controllers moving from being specialized, with their own microchips, to sharing a microprocessor with other controllers. Fieldbuses were soon ready to handle the most demanding applications on the factory floor. Several fieldbus technologies, usually very specialized, were developed by different companies to meet the demands of their applications. Fieldbuses used in the automotive industry are, for example, CAN, TT-CAN, TTP, LIN, and FlexRay. In avionics, ARINC 629 is one of the frequently used communication standards. Profibus is widely used in automation and robotics, while TCN and WorldFIP are very popular communication technologies in trains. We now present each of these fieldbuses in some more detail, outlining key features and specific properties.

2.5.2.1 Controller Area Network
The Controller Area Network (CAN) [56] was standardized by the International Organization for Standardization (ISO) [57] in 1993. Today CAN is a widely used fieldbus, mainly in automotive systems but also in other real-time applications, for example, medical equipment. CAN is an event-triggered broadcast bus designed to operate at speeds of up to 1 Mbps. CAN uses a fixed-priority-based arbitration mechanism that can provide timing guarantees using FPS-type analysis [58,59]. An example of this analysis will be provided in Section 2.6.3.

CAN is a collision-avoidance broadcast bus, using deterministic collision resolution to control access to the bus (so-called CSMA/CA). The basis for the access mechanism is the electrical characteristics of a CAN bus, which allow sending nodes to detect collisions in a nondestructive way. By monitoring the resulting bus value during message arbitration, a node detects whether there are higher-priority messages competing for access to the bus. If this is the case, the node stops the message transmission and tries to retransmit the message as soon as the bus becomes idle again. Hence, the bus behaves like a priority-based queue.

2.5.2.2 Time-Triggered CAN
Time-triggered communication on CAN (TT-CAN) [60] is a standardized session-layer extension to the original CAN. In TT-CAN, the exchange of messages is controlled by the temporal progression of time, and all nodes follow a predefined static schedule. It is also possible to support original event-triggered CAN traffic together with the time-triggered traffic. This traffic is sent in dedicated arbitration windows, using the same arbitration mechanism as native CAN. The static schedule is based on a time-division (TDMA) scheme, where message exchanges may only occur in specific time slots or time windows. Synchronization of the nodes is done using either a
clock synchronization algorithm, or periodic messages from a master node. In the latter case, all nodes in the system synchronize with this message, which gives a reference point in the temporal domain for the static schedule of the message transactions; that is, the master's view of time is referred to as the network's global time. TT-CAN adds a set of new features to the original CAN and, being standardized, several semiconductor vendors manufacture TT-CAN-compliant devices.

2.5.2.3 Flexible Time-Triggered CAN
Flexible time-triggered communication on CAN (FTT-CAN) [61,62] provides a way to schedule CAN in a time-triggered fashion, with support for event-triggered traffic as well. In FTT-CAN, time is partitioned into Elementary Cycles (ECs), which are initiated by a special message, the Trigger Message (TM). This message triggers the start of the EC and contains the schedule for the time-triggered traffic to be sent within this EC. The schedule is calculated and sent by a master node. FTT-CAN supports both periodic and aperiodic traffic by dividing the EC into two parts. In the first part, the asynchronous window, the aperiodic messages are sent; in the second part, the synchronous window, traffic is sent in a time-triggered fashion according to the schedule delivered by the TM. FTT-CAN is still mainly an academic communication protocol.

2.5.2.4 Time-Triggered Protocol
The Time-Triggered Protocol class C (TTP/C) [10,63] is a TDMA-based communication network intended for truly hard real-time communication. TTP/C is available for network speeds of up to 25 Mbps. TTP/C is part of the Time-Triggered Architecture (TTA) by Kopetz [10,64], which is designed for safety-critical applications. TTP/C has support for fault tolerance, clock synchronization, membership services, fast error detection, and consistency checks. Several major automotive companies support this protocol. For less hard RTSs (e.g., soft RTSs), there exists a scaled-down version of TTP/C called TTP/A [10].

2.5.2.5 Local Interconnect Network
The Local Interconnect Network (LIN) [65] was developed by the LIN Consortium (including Audi, BMW, Daimler–Chrysler, Motorola, Volvo, and VW) as a low-cost alternative for small networks. LIN is cheaper than, for example, CAN. LIN uses UART/SCI interface hardware, and transmission speeds of up to 20 kbps are possible. One node in the network is the master node, responsible for synchronization of the bus. The traffic is sent in a time-triggered fashion.

2.5.2.6 FlexRay
FlexRay [66] was proposed in 1999 by several major automotive manufacturers, for example, Daimler–Chrysler and BMW, as a competitive next-generation fieldbus replacing CAN. FlexRay is a real-time communication network that provides both synchronous and asynchronous transmissions, with network speeds of up to 10 Mbps. For synchronous traffic FlexRay uses TDMA, providing deterministic data transmissions with a bounded delay. For asynchronous traffic, mini-slotting is used. Compared with CAN, FlexRay is more suitable for the dependable application domain, including support for redundant transmission channels, bus guardians, and fast error detection and signaling.

2.5.2.7 ARINC 629
For avionic and aerospace communication systems, the ARINC 429 [67] standard and its newer successor ARINC 629 [67] are the most commonly used communication systems today. ARINC 629 supports both periodic and sporadic communication.
The bus is scheduled in bus cycles, which in turn are divided into two parts. In the first part, periodic traffic is sent, and in the second part, sporadic traffic is sent. The arbitration of messages is based on collision avoidance (i.e., CSMA/CA) using mini-slotting. Network speeds are as high as 2 Mbps.
2.5.2.8 Profibus
Profibus [68] is used in process automation and robotics. There are three different versions of Profibus: (1) Profibus-DP, optimized for speed and low cost; (2) Profibus-PA, designed for process automation; and (3) Profibus-FMS, a general-purpose version of Profibus. Profibus provides master–slave communication together with token mechanisms. Profibus is available with data rates up to 12 Mbps.

2.5.2.9 Train Communication Network
The Train Communication Network (TCN) [69] is widely used in trains, and implements the IEC 61375 standard as well as the IEEE 1473 standard. TCN is composed of two networks: the Wire Train Bus (WTB) and the Multifunction Vehicle Bus (MVB). The WTB is the network used to connect the whole train, that is, all vehicles of the train; the network data rate is up to 1 Mbps. The MVB is the network used within one vehicle; here the maximum data rate is 1.5 Mbps. Both the WTB and the MVB are scheduled in cycles called basic periods. Each basic period consists of a periodic phase and a sporadic phase; hence, there is support for both periodic and sporadic traffic. The difference between the WTB and the MVB (apart from the data rate) is the length of the basic periods (1 or 2 msec for the MVB and 25 msec for the WTB).

2.5.2.10 WorldFIP
WorldFIP [70] is a very popular communication network in train control systems. WorldFIP is based on the Producer–Distributor–Consumers (PDC) communication model. Currently, network speeds are as high as 5 Mbps. The WorldFIP protocol defines an application layer that includes PDC and messaging services.
2.5.3 Ethernet for Real-Time Communication
In parallel with the search for the holy grail of real-time communication, Ethernet has established itself as the de facto standard for non-real-time communication. Comparing networking solutions for automation networks and office networks, fieldbuses were the choice for the former. At the same time, Ethernet developed into the standard for office automation, and owing to its popularity, prices on networking solutions dropped.

Ethernet was not originally developed for real-time communication, since the original intention of Ethernet was to maximize throughput (bandwidth). Nowadays, however, a substantial effort is being made to provide real-time communication using Ethernet. The biggest challenge is to provide real-time guarantees using standard Ethernet components. The reason Ethernet is not very suitable for real-time communication is its handling of collisions on the network. Several proposals to minimize or eliminate the occurrence of collisions on Ethernet have been made. The following sections present some of these proposals.

2.5.3.1 TDMA
A simple solution would be to eliminate the occurrence of collisions on the network. This has been explored by, for example, Kopetz et al. [71], using a TDMA protocol on top of Ethernet.

2.5.3.2 Usage of Tokens
Another way to eliminate collisions is the usage of tokens. Token-based solutions [72,73] on Ethernet also eliminate collisions, but are not compatible with standard hardware. A token-based communication protocol is a way to provide real-time guarantees on most types of networks, because such protocols are deterministic in their behavior; however, a dedicated network is required, that is, all nodes sharing the network must obey the token protocol. Examples of token-based protocols are the Timed Token Protocol (TTP) [74] and the IEEE 802.5 Token Ring Protocol.
2.5.3.3 Modified Collision Resolution Algorithm
A different approach is to modify the collision resolution algorithm [75,76]. Using standard Ethernet controllers, the modified collision resolution algorithm is nondeterministic. In order to make a deterministic modified collision resolution algorithm, a major modification of the Ethernet controllers is required [77].

2.5.3.4 Virtual Time and Window Protocols
Another solution for real-time communication using Ethernet is the Virtual Time CSMA (VTCSMA) protocol [78–80], where packets are delayed in a deterministic way in order to eliminate the occurrence of collisions. Moreover, Window Protocols [81] use a global window (a synchronized time interval) that also removes collisions. The window protocol is more dynamic and somewhat more efficient in its behavior than the VTCSMA approach.

2.5.3.5 Master/Slave
A fairly straightforward way of providing real-time traffic on Ethernet is to use a master/slave approach. As a part of the FTT framework [82], FTT Ethernet [83] is proposed as a master/multislave protocol. At the cost of some computational overhead at each node in the system, timely delivery of messages on Ethernet is provided.

2.5.3.6 Traffic Smoothing
The most recent work, requiring no modifications to the hardware or the networking topology (infrastructure), is the usage of traffic smoothing. Traffic smoothing can be used to eliminate bursts of traffic [84,85], which have a severe impact on the timely delivery of message packets on Ethernet. By keeping the network load below a given threshold, a probabilistic guarantee of message delivery can be provided. Hence, traffic smoothing could be a solution for soft RTSs (a sketch of the idea is given at the end of this section).

2.5.3.7 Black Bursts
Black burst [86] implements a collision-avoidance protocol on Ethernet. When a station wants to submit a message, the station waits until the network is idle, that is, no traffic is being transmitted. Then, to avoid collisions, the transmitting station starts jamming the network. Several transmitting stations might start jamming the network at the same time. However, each station uses a jamming signal of unique length, always allowing a unique station to win. Winning means that once the jamming signal is over, the network should be idle, that is, no other stations are jamming the network. If this is the case, the message is transmitted. Otherwise, the losing station waits until the network is idle again, and the mechanism starts over. Hence, no message collisions will occur on the network.

2.5.3.8 Switches
Finally, a completely different approach to achieving real-time communication using Ethernet is to change the infrastructure. One way of doing this is to construct the Ethernet using switches to separate collision domains. Using such switches, a collision-free network is provided. However, this requires new hardware supporting the IEEE 802.1p standard; therefore, it is not as attractive a solution for existing networks as, for example, traffic smoothing.
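The traffic-smoothing idea of Section 2.5.3.6 can be sketched as a simple credit-based (leaky-bucket) regulator placed in front of the Ethernet driver. The rate and burst parameters below are illustrative only, and smoother_tick()/may_send() are hypothetical hooks into the sending path.

#include <stdint.h>
#include <stdbool.h>

#define CREDIT_RATE_BPS  100000u   /* allowed average rate: 100 kbit/s (illustrative) */
#define CREDIT_MAX_BITS   12000u   /* burst allowance (illustrative) */

static uint32_t credit_bits;

/* Called periodically (e.g., every millisecond) to accumulate credit. */
void smoother_tick(uint32_t elapsed_us)
{
    uint64_t earned = (uint64_t)CREDIT_RATE_BPS * elapsed_us / 1000000u;
    uint64_t total  = credit_bits + earned;
    credit_bits = (total > CREDIT_MAX_BITS) ? CREDIT_MAX_BITS : (uint32_t)total;
}

/* Called before each transmission attempt; holding frames back smooths bursts. */
bool may_send(uint32_t frame_bits)
{
    if (credit_bits < frame_bits)
        return false;
    credit_bits -= frame_bits;
    return true;
}

If every node regulates its output this way, the aggregate network load stays below a chosen threshold, which is what yields the probabilistic delivery guarantee mentioned above.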
2.5.4 Wireless Communication
There are no commercially available wireless communication protocols providing real-time guarantees.3 Two of the more commonly used wireless protocols today are IEEE 802.11 (WLAN) and Bluetooth. However, these protocols do not provide the temporal guarantees needed for hard real-time communication. Today, a big effort is being made (as with Ethernet) to provide real-time guarantees for wireless communication, possibly by using either WLAN or Bluetooth.

3 Bluetooth provides real-time guarantees limited to streaming voice traffic.
2.6 Analysis of RTSs
The most important property to analyze in an RTS is its temporal behavior, that is, the timeliness of the system. The analysis should provide strong evidence that the system performs as intended at the correct time. This section gives an overview of the basic properties that are analyzed in an RTS, and concludes with a presentation of trends and tools in the area of RTS analysis.
2.6.1 Timing Properties
Timing analysis is a complex problem. Not only are the techniques used sometimes complicated, but the problem itself is also elusive; for instance, what is the meaning of the term "program execution time"? Is it the average time to execute the program, or the worst possible time, or does it mean some form of "normal" execution time? Under what conditions does a statement regarding program execution times apply? Is the program delayed by interrupts or higher-priority tasks? Does the time include waiting for shared resources?

To straighten out some of these questions, and to be able to study some existing techniques for timing analysis, we categorize timing analysis into three major types. Each type has its own purpose, benefits, and limitations. The types are listed below.

2.6.1.1 Execution Time
This refers to the execution time of a single task (or program, or function, or any other unit of single-threaded sequential code). The result of an execution-time analysis is the time (i.e., the number of clock cycles) the task takes to execute when executing undisturbed on a single CPU; that is, the result should not account for interrupts, preemption, background DMA transfers, DRAM refresh delays, or any other type of interfering background activity.

At first glance, leaving out all types of interference from the execution-time analysis would seem to give unrealistic results. However, the purpose of the execution-time analysis is not to deliver estimates of "real-world" timing when executing the task. Instead, its role is to find out how much computing resource is needed to execute the task. (Hence, background activities that are not related to the task should not be accounted for.)

There are some different types of execution times that can be of interest:

• Worst-case execution time (WCET). This is the worst possible execution time a task could exhibit or, equivalently, the maximum amount of computing resources required to execute the task. The WCET should include any possible atypical task execution, such as exception handling or clean-up after abnormal task termination.
• Best-case execution time (BCET). During some types of real-time analysis, not only the WCET is used; as we will describe later, knowledge about the BCET of tasks is also useful.
• Average execution time (AET). The AET can be useful in calculating throughput figures for a system. However, for most RTS analysis the AET is of less importance, simply because a reasonable approximation of the average case is easy to obtain during testing (where, typically, the average system behavior is studied). Also, knowing only the average, without any other statistical parameters such as the standard deviation or the distribution function, makes statistical analysis difficult. For analysis purposes a more pessimistic metric, such as the 95th percentile, would be more useful. However, analytical techniques using statistical metrics of execution time are scarce and not very well developed.

2.6.1.2 Response Time
The response time of a task is the time it takes from the invocation to the completion of the task; in other words, the time from when the task is first placed in the OS's ready queue to the time when it is removed from the running state and placed in the idle or sleeping state.
Typically, for analysis purposes it is assumed that a task does not voluntarily suspend itself during its execution. That is, the task may not call primitives such as sleep() or delay(). However, involuntary suspension, such as blocking on shared resources, is allowed; that is, primitives such as get_semaphore() and lock_database_tuple() are allowed. When a program voluntarily suspends itself, that program should be broken down into two (or more) analysis tasks.

The response time is typically a system-level property, in that it includes interference from other, unrelated, tasks and parts of the system. The response time also includes delays caused by contention for shared resources. Hence, the response time is only meaningful when considering a complete system or, in distributed systems, a complete node.

2.6.1.3 End-to-End Delay
The described "execution time" and "response time" are useful concepts, since they are relatively easy to understand and have well-defined scopes. However, when trying to establish the temporal correctness of a system, knowing the WCET and/or the response times of tasks is often not enough. Typically, the correctness criterion is stated using end-to-end latency timing requirements, for instance, an upper bound on the delay between the input of a signal and the output of a response. In a given implementation there may be a chain of events taking place between the input of a signal and the output of a response. For instance, one task may be in charge of reading the input and another of generating the response, and the two tasks may have to exchange messages on a communications link before the response can be generated. The end-to-end timing denotes the timing of externally visible events.

2.6.1.4 Jitter
The term jitter is used as a metric for variability in time. For instance, the jitter in execution time of a task is the difference between the task's BCET and WCET. Similarly, the response-time jitter of a task is the difference between its best-case response time and its worst-case response time. Often, control algorithms have requirements that the jitter of the output should be limited; hence, jitter is sometimes a metric as important as the end-to-end delay.

Input to the system can also have jitter. For instance, an interrupt that is expected to be periodic may have a jitter (owing to some imperfection in the process generating the interrupt). In this case, the jitter value is used as a bound on the maximum deviation from the ideal period of the interrupt. Figure 2.3 illustrates the relation between the period and the jitter for this example. Note that jitter should not accumulate over time: for our example, even though two successive interrupts could arrive closer together than one period, in the long run the average interrupt interarrival time will be that of the period.

In the above list of types of time, we only mentioned the time to execute programs. However, in many RTSs other timing properties may also exist. This includes delays on communication networks; other resources, such as hard disk drives, may also cause delays and need to be analyzed. The types of time introduced above can all be mapped onto different types of resources; for instance, the WCET of a task corresponds to the maximum size of a message to be transmitted, and the response time of a message is defined analogously to the response time of a task.
FIGURE 2.3 Jitter used as a bound on variability in periodicity.
2.6.2 Methods for Timing Analysis
When analyzing hard RTSs it is essential that the estimates obtained during timing analysis are safe. An estimate is considered safe if it is guaranteed not to be an underestimation of the actual worst-case time. It is also important that the estimate is tight, meaning that the estimated time is close to the actual worst-case time. For the types of timing defined previously (Section 2.6.1), different methods are available, as described in the following sections.

2.6.2.1 Execution-Time Estimation
For real-time tasks the WCET is the most important execution-time measure to obtain. Sadly, however, it is also often the most difficult measure to obtain. Methods to obtain the WCET of a task can be divided into two categories: (1) static analysis and (2) dynamic analysis. Dynamic analysis is essentially equivalent to testing (i.e., executing the task on the target hardware) and has all the drawbacks/problems that testing exhibits (such as being tedious and error-prone). One major problem with dynamic analysis is that it does not produce safe results. In fact, the result can never exceed the true WCET, and it is very difficult to make sure that the estimated WCET really is the true WCET.

Static analysis, on the other hand, can give guaranteed safe results. Static analysis is performed by analyzing the code (source and/or object code is used) and basically counting the number of clock cycles that the task may use to execute (in the worst possible case). Static analysis uses models of the hardware to predict execution times for each instruction. Hence, for modern hardware it may be very difficult to produce static analyzers that give good results. One source of pessimism in the analysis (i.e., overestimation) is hardware caches: whenever an instruction or data item cannot be guaranteed to reside in the cache, a static analyzer must assume a cache miss. And since modeling the exact state of caches (sometimes of multiple levels), branch predictors, etc. is very difficult and time-consuming, few tools exist that give adequate results for advanced architectures. Also, performing the program-flow and data analysis that exactly calculates, for example, the number of times a loop iterates, or the input parameters for procedures, is difficult. Methods for good hardware and software modeling do exist in the research community; however, combining these methods into good-quality tools has proven tedious.

2.6.2.2 Schedulability Analysis
The goal of schedulability analysis is to determine whether or not a system is schedulable. A system is deemed schedulable if it is guaranteed that all task deadlines will always be met. For statically scheduled (table-driven) systems, the calculation of response times is trivially given by the static schedule. However, for dynamically scheduled systems (such as fixed-priority or deadline scheduling) more advanced techniques have to be used.

There are two main classes of schedulability analysis techniques: (1) response-time analysis and (2) utilization analysis. As the name suggests, a response-time analysis calculates a (safe) estimate of the worst-case response time of a task. That estimate can then be compared with the deadline of the task; if it does not exceed the deadline, the task is schedulable. Utilization analysis, in contrast, does not directly derive the response times for tasks; rather, it gives a Boolean result for each task telling whether or not the task is schedulable.
This result is based on the fraction of utilization of the CPU for a relevant subset of the tasks, hence the term utilization analysis. Both analyses are based on similar types of task models. Typically, however, the task models used for analysis are not the task models provided by commercial RTOSs. This problem can be resolved by mapping one or more OS tasks onto one or more analysis tasks. However, this mapping has to be performed manually and requires an understanding of the limitations of the analysis task model and of the analysis technique used.
2.6.2.3 End-to-End Delay Estimation

The typical way to obtain end-to-end delay estimates is to calculate the response time for each task/message in the end-to-end chain and to sum these response times to obtain an end-to-end estimate. When using a utilization-based analysis technique (in which no response time is calculated), one has to resort to using the task/message deadlines as safe upper bounds on the response times. However, when analyzing distributed RTSs, it may not be possible to calculate all response times in one pass. The reason for this is that delays on one node will lead to jitter on another node, and this jitter may in turn affect the response times on that node. Since jitter can propagate in several steps between nodes, in both directions, there may not exist a right order in which to analyze the nodes. (If A sends a message to B, and B sends a message to A, which node should one analyze first?) Solutions to this type of problem are called holistic schedulability analysis methods (since they consider the whole system). The standard method for holistic response-time analysis is to repeatedly calculate response times for each node (and update the jitter values in the nodes affected by the node just analyzed) until the response times do not change (i.e., a fix point is reached).

2.6.2.4 Jitter Estimation

To calculate the jitter one needs to perform not only a worst-case analysis (of, for instance, response time or end-to-end delay) but also a best-case analysis. However, even though best-case analysis techniques are often conceptually similar to worst-case analysis techniques, little attention has been paid to best-case analysis. One reason for not spending too much time on best-case analysis is that it is quite easy to make a conservative estimate of the best case: the best-case time is never less than zero (0). Hence, many tools simply assume that the BCET (for instance) is zero, whereas great efforts can be spent analyzing the WCET. However, it is important to have tight estimates of the jitter, and to keep the jitter as low as possible. It has been shown that the number of execution paths a multi-tasking RTS can take increases dramatically if jitter increases [87]. Unless the number of possible execution paths is kept as low as possible, it becomes very difficult to achieve good coverage during testing.
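As a simple worked formulation (ours, not taken from the cited analyses), the response-time jitter of a task i can be expressed as the difference between its worst-case and best-case response times:

$$J_i = R_i^{wc} - R_i^{bc}$$

Assuming a BCET of zero lowers $R_i^{bc}$ and therefore enlarges $J_i$: the resulting jitter estimate is safe but pessimistic, which is exactly the trade-off described above.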
2.6.3 Example of Analysis

In this section we give simple examples of schedulability analysis. We show a very simple example of how a set of tasks running on a single CPU can be analyzed, and we also give an example of how the response times for a set of messages sent on a CAN bus can be calculated.

2.6.3.1 Analysis of Tasks

This example is based on task models that are some 30 years old and is intended to give the reader a feeling for how these types of analysis work. Today's methods allow for far richer and more realistic task models, with a resulting increase in the complexity of the equations used (hence they are not suitable for use in our example). In the first example we analyze a small task set as described in Table 2.2, where T, C, and D denote the tasks' period, WCET, and deadline, respectively. In this example T = D for all tasks, and priorities have been assigned in RM order, that is, the highest rate gives the highest priority.
TABLE 2.2 Example Task Set for Analysis

Task    T     C     D     Prio
X       30    10    30    High
Y       40    10    40    Medium
Z       52    12    52    Low
For the task set in Table 2.2 the original analysis techniques of Liu and Layland [5] and of Joseph and Pandya [88] are applicable, and we can perform both utilization-based and response-time based schedulability analysis. We start with the utilization-based analysis; for this task model Liu and Layland's result is that a task set of n tasks is schedulable if its total utilization, U_tot, is bounded by the following equation:

$$U_{tot} \le n(2^{1/n} - 1)$$

Table 2.3 shows the utilization calculations performed for the schedulability analysis. For our example task set, n = 3 and the bound is approximately 0.78. However, the utilization ($U_{tot} = \sum_{i=1}^{n} C_i/T_i$) for our task set is 0.81, which exceeds the bound. Hence, the task set fails the RM test and cannot be deemed schedulable. Joseph and Pandya's response-time analysis allows us to calculate the worst-case response time, R_i, for each task i in our example (Table 2.2). This is done using the following formula:

$$R_i = C_i + \sum_{j \in hp(i)} \left\lceil \frac{R_i}{T_j} \right\rceil C_j \qquad (2.1)$$

where hp(i) denotes the set of tasks with priority higher than that of task i. The observant reader may have noticed that equation 2.1 is not in closed form, in that R_i is not isolated on the left-hand side of the equality. As a matter of fact, R_i cannot be isolated on the left-hand side of the equality; instead, equation 2.1 has to be solved using fix-point iteration. This is done with the recursive formula in equation 2.2, starting with $R_i^0 = 0$ and terminating when a fix point has been reached (i.e., when $R_i^{m+1} = R_i^m$):

$$R_i^{m+1} = C_i + \sum_{j \in hp(i)} \left\lceil \frac{R_i^m}{T_j} \right\rceil C_j \qquad (2.2)$$
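To make the fix-point iteration concrete, the following minimal C sketch (our illustration, not taken from the cited literature) applies both the Liu and Layland utilization test and the recurrence of equation 2.2 to the task set of Table 2.2; tasks are stored in decreasing priority order:

#include <math.h>
#include <stdio.h>

#define NTASKS 3

/* Task set of Table 2.2, ordered from highest to lowest priority. */
static const double T[NTASKS] = { 30, 40, 52 };  /* periods */
static const double C[NTASKS] = { 10, 10, 12 };  /* WCETs */

int main(void)
{
    /* Utilization-based RM test: U_tot <= n(2^(1/n) - 1). */
    double u = 0.0;
    for (int i = 0; i < NTASKS; i++)
        u += C[i] / T[i];
    double bound = NTASKS * (pow(2.0, 1.0 / NTASKS) - 1.0);
    printf("U = %.2f, bound = %.2f: %s\n", u, bound,
           u <= bound ? "schedulable" : "test inconclusive");

    /* Response-time analysis: fix-point iteration of equation 2.2.
       A full implementation would also abort once the response time
       exceeds the deadline, since the iteration may otherwise diverge. */
    for (int i = 0; i < NTASKS; i++) {
        double r = 0.0, next = C[i];
        while (next != r) {
            r = next;
            next = C[i];
            for (int j = 0; j < i; j++)   /* higher-priority tasks only */
                next += ceil(r / T[j]) * C[j];
        }
        printf("R_%d = %.0f\n", i, r);
    }
    return 0;
}

Running this program reproduces the results derived below: the utilization test is inconclusive (0.81 > 0.78), while the response-time iteration yields R = 10, 20, and 52.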
For our example task set, Table 2.4 shows the results of calculating equation 2.1. From the table we can conclude that no deadlines will be missed and that the system is schedulable.

Remarks
As we could see for our example task set in Table 2.2, the utilization-based test could not deem the task set schedulable, whereas the response-time based test could. This situation is symptomatic of the relation between utilization-based and response-time based schedulability tests. That is, the response-time based tests find more task sets schedulable than the utilization-based tests do.

TABLE 2.3 Result of RM Test

Task     T     C     D     Prio      U
X        30    10    30    High      0.33
Y        40    10    40    Medium    0.25
Z        52    12    52    Low       0.23
Total                                0.81
Bound                                0.78
TABLE 2.4 Result of Response-Time Analysis for Tasks

Task     T     C     D     Prio      R     R ≤ D
X        30    10    30    High      10    Yes
Y        40    10    40    Medium    20    Yes
Z        52    12    52    Low       52    Yes
TABLE 2.5 Example CAN-Message Set

Message    T      S    D     Id
X          350    8    300   00010
Y          500    6    400   00100
Z          1000   5    800   00110
TABLE 2.6 Result of Response-Time Analysis for CAN

Message    T      S    D     Id      Prio      C      w      R      R ≤ D
X          350    8    300   00010   High      130    130    260    Yes
Y          500    6    400   00100   Medium    111    260    371    Yes
Z          1000   5    800   00110   Low       102    612    714    Yes
However, as also shown by the example, the response-time based test needs to perform more calculations than the utilization-based tests. For this simple example the extra computational complexity of the response-time test is insignificant. However, when using modern task models (that are capable of modeling realistic systems) the computational complexity of response-time based tests is significant. Unfortunately, for these advanced models, utilization-based tests are not always available.

2.6.3.2 Analysis of Messages

In our second example we show how to calculate the worst-case response times for a set of periodic messages sent over the CAN bus (CAN is described in Section 2.5.2). We use a response-time analysis technique similar to the one we used when we analyzed the task set in Table 2.2. In this example our message set is given in Table 2.5, where T, S, D, and Id denote the messages' period, data size (in bytes), deadline, and CAN identifier, respectively. (The time unit used in this example is the "bit-time," that is, the time it takes to send one bit. For a 1 Mbit/sec CAN bus this means that 1 time unit is $10^{-6}$ sec.) Before we attack the problem of calculating response times we extend Table 2.5 with two columns. First, we need the priority of each message; in CAN this is given by the identifier: the lower the numerical value, the higher the priority. Second, we need to know the worst-case transmission time of each message. The transmission time is given partly by the message data size, but we also need to add time for the frame header and for any stuff bits.⁴ The formula to calculate the transmission time C_i for a message i containing S_i bytes of payload data is given below:

$$C_i = 8S_i + 47 + \left\lfloor \frac{34 + 8S_i - 1}{4} \right\rfloor$$
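As a side note, this bound can be transcribed directly into C (our sketch; integer division implements the floor). For the maximum payload of 8 bytes it evaluates to 135 bit-times, the classical worst-case CAN frame length and the origin of the blocking bound B_i ≤ 135 used below; the slightly smaller C values listed in Table 2.6 were presumably computed with a less pessimistic bit-stuffing assumption.

/* Worst-case transmission time (in bit-times) of a CAN message with
   s bytes of payload: 8s data bits plus 47 bits of frame overhead,
   plus worst-case bit stuffing over the 34 + 8s stuffable bits. */
unsigned can_c(unsigned s)
{
    return 8*s + 47 + (34 + 8*s - 1) / 4;  /* integer division = floor */
}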
In Table 2.6 the two columns Prio and C show the priority assignment and the transmission times for our example message set. Now we have all the data needed to perform the response-time analysis. However, since CAN is a nonpreemptive resource, the structure of the equation is slightly different from that of equation 2.1, which we used for the analysis of tasks. The response-time equation for CAN is given in equation 2.3:

$$R_i = w_i + C_i, \qquad w_i = B_i + \sum_{\forall j \in hp(i)} \left\lceil \frac{w_i + 1}{T_j} \right\rceil C_j \qquad (2.3)$$

⁴ CAN adds stuff bits, if necessary, to avoid the two reserved bit patterns 000000 and 111111. These stuff bits are never seen by the CAN user but have to be accounted for in the timing analysis.
In equation 2.3, B_i denotes the blocking time originating from a lower-priority message already in transmission when message i enters arbitration (B_i ≤ 135, which is the size of the largest possible message), and hp(i) denotes the set of messages with higher priority than message i. Note that (similar to equation 2.1) w_i is not isolated on the left-hand side of the equation, and its value has to be calculated using fix-point iteration (compare with equation 2.2). Applying equation 2.3 we can now calculate the worst-case response times for our example messages. In Table 2.6 the two columns w and R show the results of the calculations, and the final column shows the schedulability verdict for each message. As we can see from Table 2.6, our example message set is schedulable, meaning that the messages will always be transmitted before their deadlines. Note that this analysis was made assuming that there will not be any retransmissions of broken messages. Normally, CAN automatically retransmits any message that has been broken owing to interference on the bus. To account for such automatic retransmissions, an error model needs to be adopted and the response-time equation adjusted accordingly; see, for example, Reference 59.
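The fix-point iteration of equation 2.3 can be sketched in C as follows (again our illustration). It takes the transmission times of Table 2.6 as input and, for simplicity, uses one conservative blocking bound for all messages, equal to the largest transmission time in the set:

#include <math.h>
#include <stdio.h>

#define NMSG 3

/* Message set of Tables 2.5 and 2.6, highest priority first. */
static const double T[NMSG] = { 350, 500, 1000 };  /* periods (bit-times) */
static const double C[NMSG] = { 130, 111, 102 };   /* transmission times */

int main(void)
{
    /* Conservative blocking bound: a message can be blocked by at most
       one lower-priority message already being transmitted; here we use
       the largest transmission time in the set for every message. */
    double B = 0;
    for (int i = 0; i < NMSG; i++)
        if (C[i] > B) B = C[i];

    for (int i = 0; i < NMSG; i++) {
        double w = 0, next = B;
        while (next != w) {              /* fix-point iteration, eq. 2.3 */
            w = next;
            next = B;
            for (int j = 0; j < i; j++)  /* higher-priority interference */
                next += ceil((w + 1) / T[j]) * C[j];
        }
        printf("message %d: w = %.0f, R = %.0f\n", i, w, w + C[i]);
    }
    return 0;
}

With these inputs the program reproduces the values of Table 2.6: w = 130, 260, 612 and R = 260, 371, 714.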
2.6.4 Trends and Tools

As discussed earlier, and also illustrated by our example in Table 2.2, there is a mismatch between the analytical task models and the task models provided by commonly used RTOSs. One of the basic problems is that there is no one-to-one mapping between analysis tasks and RTOS tasks. In fact, for many systems there is an N-to-N mapping between the task types. For instance, an interrupt handler may have to be modeled as several different analysis tasks (one analysis task for each type of interrupt it handles), and one OS task may have to be modeled as several analysis tasks (for instance, one analysis task per call to sleep() primitives). Also, current schedulability analysis techniques cannot adequately model types of task synchronization other than locking/blocking on shared resources. Abstractions such as message queues are difficult to include in the schedulability analysis.⁵ Furthermore, tools to estimate the WCET are scarce. Currently only two tools that give safe WCET estimates are commercially available [90,91]. These problems have led to a low penetration of schedulability analysis in industrial software-development processes. However, in isolated domains, such as real-time networks, some commercial tools that are based on real-time analysis do exist. For instance, Volcano [92,93] provides tools for the CAN bus that allow system designers to specify signals on an abstract level (giving signal attributes such as size, period, and deadline) and automatically derive a mapping of signals to CAN messages where all deadlines are guaranteed to be met. On the software side, tools provided by, for instance, TimeSys [94], Arcticus Systems [13], and TTTech [10] can provide system development environments with timing analysis as an integrated part of the tool suite. However, all these tools require that the software development process be under the complete control of the respective tool. This requirement has limited their use. The widespread use of UML [22] in software design has led to some specialized UML products for real-time engineering [23,24]. However, these products, as of today, do not support timing analysis of the designed systems. There is, however, recent work within the OMG that specifies a profile "Schedulability, Performance, and Time" (SPT) [95], which allows specification of both timing properties and requirements in a standardized way. This will in turn lead to products that can analyze UML models conforming to the SPT profile. The SPT profile has, however, not been received without criticism. Critique has mainly come from researchers active in the timing analysis field, claiming both that the profile is not precise enough and that some important concepts are missing. For instance, the Universidad de Cantabria has instead developed the MAST–UML profile and an associated MAST tool for analyzing MAST–UML models [96,97]. MAST allows modeling of advanced timing properties and requirements, and the tool also provides state-of-the-art timing analysis techniques.

⁵ Techniques to handle more advanced models include timed logic and model checking. However, the computational and conceptual complexity of these techniques has limited their industrial impact. There are, however, examples of commercial tools for this type of verification, for example, Reference 89.
2.7 Component-Based Design of RTS

Component-based design is a current trend in software engineering. In the desktop area, component technologies like COM [15], .NET [16], and Java Beans [17] have gained widespread use. These technologies give substantial benefits, in terms of reduced development time and software complexity, when designing complex and/or distributed systems. However, for RTSs these and other desktop-oriented component technologies do not suffice. As stated before, the main challenge of designing RTSs is the need to consider issues that do not typically apply to general-purpose computing systems. These issues include:

• Constraints on extra-functional properties, such as timing, QoS, and dependability.
• The need to statically predict (and verify) these extra-functional properties.
• Scarce resources, including processing power, memory, and communication bandwidth.

In the commercially available component technologies of today, there is little or no support for these issues. Also on the academic scene, there are no readily available solutions that satisfactorily handle all of them. In the remainder of this chapter we discuss how these issues can be addressed in the context of CBD. In doing so, we also highlight the challenges in designing a CBD process and component technology for the development of RTSs.
2.7.1 Timing Properties and CBD

In general, for systems where timing is crucial there will necessarily be at least some global timing requirements that have to be met. If the system is built from components, this will imply the need for timing parameters/properties of the components and some proof that the global timing requirements are met. In Section 2.6 we introduced the following four types of timing properties:

• execution time
• response time
• end-to-end delay
• jitter
So, how are these related to the use of a CBD methodology?

2.7.1.1 Execution Time

For a component used in a real-time context, an execution-time measure will have to be derived. This is, as discussed in Section 2.6, not an easy or satisfactorily solved problem. Furthermore, since execution time is inherently dependent on the target hardware, and since reuse is the primary motivation for CBD, it is highly desirable that execution times for several targets be available (or, alternatively, that the execution time for new hardware platforms be automatically derivable). The nature of the applied component model may also make execution-time estimation more or less complex. Consider, for instance, a client–server oriented component model, with a server component that provides services of different types, as illustrated in Figure 2.4(a). What does "execution time" mean for such a component? Clearly, a single execution time is not appropriate; rather, the analysis will require a set of execution times related to servicing different requests. On the other hand, for a simple port-based object component model [21] in which components are connected in sequence to form periodically executing transactions (illustrated in Figure 2.4[b]), it could be possible to use a single execution-time measure, corresponding to the execution time required for reading the values at the input ports, performing the computation, and writing values to the output ports.
FIGURE 2.4 (a) A complex server component, providing multiple services to multiple users, and (b) a simple chain of components implementing a single thread of control.
FIGURE 2.5 Tasks and components: (a) one-to-one correspondence, (b) one-to-many correspondence, (c) many-to-one correspondence, (b + c) many-to-many correspondence, and (d) irregular correspondence.
2.7.1.2 Response Time

Response times denote the time from invocation to completion of tasks, and response-time analysis is the activity of statically deriving response-time estimates. The first question to ask from a CBD perspective is: what is the relation between a "task" and a "component"? This is obviously highly dependent on the component model used. As illustrated in Figure 2.5(a), there could be a one-to-one mapping between components and tasks, but in general several components could be implemented in one task (Figure 2.5[b]) or one component could be implemented by several tasks (Figure 2.5[c]); hence, there is a many-to-many relation between components and tasks. In principle, there could even be a more irregular correspondence between components and tasks, as illustrated in Figure 2.5(d). Furthermore, in a distributed system there could be a many-to-many relation between components and processing nodes, making the situation even more complicated. Once we have sorted out the relation between tasks and components, we can calculate the response times of tasks, given that we have an appropriate analysis method for the execution paradigm used and that relevant execution-time measures are available. However, relating these response times to components and to the application-level timing requirements may not be straightforward; this is an issue for the subsequent end-to-end analysis. Another issue with respect to response times is how to handle communication delays in distributed systems. In essence there are two ways to model the communication, as depicted in Figure 2.6. In Figure 2.6(a) the network is abstracted away and the intercomponent communication is handled by the framework. In this case, response-time analysis is made more complicated, since it must account for different delays in intercomponent communication, depending on the physical location of components.
FIGURE 2.6 Components and communication delays: (a) communication delays can be part of the intercomponent communication properties, and (b) communication delays can be timing properties of components.
In Figure 2.6(b), on the other hand, the network is modeled as a component itself, and network delays can be modeled like delays in any other component (and intercomponent communication can be considered instantaneous). However, the choice of how to model network delays also has an impact on the software engineering aspects of the component model. In Figure 2.6(a), the communication is completely hidden from the components (and the software engineers), giving optimizing tools many degrees of freedom with respect to component allocation, signal mapping, and scheduling parameter selection. In Figure 2.6(b), in contrast, the communication is explicitly visible to the components (and the software engineers), putting a larger burden on the software engineers to manually optimize the system.

2.7.1.3 End-to-End Delay

End-to-end delays are application-level timing requirements relating the occurrence in time of one event to the occurrence of another event. As pointed out earlier, how to relate such requirements to the lower-level timing properties of components is highly dependent on both the component model and the timing analysis model. When designing RTSs using CBD, the component structure gives excellent information about the points of interaction between the RTS and its environment. Since end-to-end delays concern timing estimates of, and timing requirements on, such interactions, CBD gives a natural way of stating timing requirements in terms of signals received or generated. (In traditional RTS development, the reception and generation of signals is embedded in the code of tasks and is not externally visible, hence making it difficult to relate response times of tasks to end-to-end requirements.)

2.7.1.4 Jitter

Jitter is an important timing parameter that is related to execution time and that will affect response times and end-to-end delays. There may also be specific jitter requirements. Jitter has the same relation to CBD as end-to-end delay does.

2.7.1.5 Summary of Timing and CBD

As described earlier, there is no single solution for how to apply CBD to RTSs. In some cases, timing analysis is made more complicated when using CBD, for example, when using client–server oriented component models, whereas in other cases CBD actually helps timing analysis, for example, by facilitating the identification of the interfaces/events associated with end-to-end requirements. Further, the characteristics of the component model have a great impact on the analyzability of RTSs designed using CBD. For instance, interaction patterns such as client–server do not map well to established analysis methods and make analysis difficult, whereas pipes-and-filter based patterns (such as the port-based objects component model [21]) map very well to existing analysis methods and allow for a tight analysis of timing behavior. Also, the execution semantics of the component model has an impact on analyzability. The execution semantics restricts how components can be mapped to tasks; for example, in the CORBA Component Model [14] each component is assumed to have its own thread of execution, making it difficult
to map multiple components to a single thread. On the other hand, the simple execution semantics of pipes-and-filter based models allow for automatic mapping of multiple components to a single task, simplifying timing analysis and making better use of system resources.
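As an illustration of such an automatic mapping (a hypothetical sketch of ours — the component names and port functions do not come from any specific framework), a chain of three port-based components could be compiled into a single task whose body simply calls the component step functions in their static order:

typedef double sample_t;
typedef double cmd_t;

/* Hypothetical step functions of three port-based components. */
sample_t read_sensor_port(void);
sample_t filter_step(sample_t in);
cmd_t    controller_step(sample_t in);
void     write_actuator_port(cmd_t out);

/* The whole chain becomes one schedulable task with a single
   execution-time measure: read inputs, compute, write outputs. */
void chain_task(void)
{
    sample_t raw      = read_sensor_port();
    sample_t filtered = filter_step(raw);
    cmd_t    command  = controller_step(filtered);
    write_actuator_port(command);
}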
2.7.2 Real-Time Operating Systems

There are two important aspects regarding CBD and RTOSs: (1) the RTOS may itself be component based, and (2) the RTOS may support or provide a framework for CBD.

2.7.2.1 Component-Based RTOSs

Most RTOSs allow for offline configuration where the engineer can choose to include or exclude large parts of functionality. For instance, which communication protocols to include is typically configurable. However, this type of configurability is not the same as the RTOS being component based (even though the unit of configuration is often referred to as a component in marketing material). For an RTOS to be component based it is required that the components conform to a component model, which is typically not the case in most configurable RTOSs. There has been some research on component-based RTOSs, for instance, the research RTOS VEST [18]. In VEST, schedulers, queue managers, and memory management are built up out of components. Furthermore, special emphasis has been put on predictability and analyzability. However, VEST is still at the research stage and has not been released to the public. Publicly available, however, is the eCos RTOS [98,99], which provides a component-based configuration tool. Using eCos components the RTOS can be configured by the user, and third-party extensions can be provided.

2.7.2.2 RTOSs that Support CBD

Looking at component models in general, and those intended for embedded systems in particular, we observe that they are all supported by some runtime executive or simple RTOS. Many component technologies provide frameworks that are independent of the underlying RTOS; hence, any RTOS can be used to support CBD via such an RTOS-independent framework. Examples include CORBA's ORB [100] and the framework for PECOS [20,101]. Other component technologies have a tighter coupling between the RTOS and the component framework, in that the RTOS explicitly supports the component model by providing the framework (or part of it). Such technologies include:

• Koala [19], a component model and architectural description language from Philips. Koala provides high-level APIs to the computing and audio/video hardware. The computing layer provides a simple proprietary real-time kernel with priority-driven preemptive scheduling. Special techniques for thread sharing are used to limit the number of concurrent threads.
• The Chimera RTOS, which provides an execution framework for the port-based object component model [21], intended for the development of sensor-based control systems, specifically reconfigurable robotics applications. Chimera has multiprocessor support and handles both static and dynamic scheduling, the latter EDF based.
• The Rubus RTOS, which supports a component model in which behaviors are defined by sequences of port-based objects [13]. The Rubus kernel supports predictable execution of statically scheduled periodic tasks (termed red tasks in Rubus) and dynamically fixed-priority preemptive scheduled tasks (termed blue tasks). In addition, support for handling interrupts is provided. In Rubus, support is provided for transforming sets of components into sequential chains of executable code; each such chain is implemented as a single task. Support is also provided for the analysis of response times and end-to-end deadlines, based on execution-time measures that have to be supplied, that is, execution-time analysis is not provided by the framework.
• The Time-Triggered Operating System (TTOS), an adapted and extended version of the MARS OS [71]. Task scheduling in TTOS is based on an offline-generated scheduling table and relies on the global time base provided by the TTP/C communication system. All synchronization is handled
by the offline scheduling. TTOS, and in general the entire TTA, is (just as IEC 61131-3) well suited for the synchronous execution paradigm. In synchronous execution the system is considered sequential, computing in each step (or cycle) a global output based on a global input. The effect of each step is defined by a set of transformation rules. Scheduling is done statically by compiling the set of rules into a sequential program that implements these rules and executes them in some statically defined order. A uniform timing bound for the execution of global steps is assumed. In this context, a component is a design-level entity. TTA defines a protocol for extending the synchronous language paradigm to distributed platforms, allowing distributed components to interoperate, as long as they conform to the imposed timing requirements.
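A minimal sketch (ours; all function names are hypothetical) of the synchronous execution paradigm described in the last item above — one statically scheduled global step per cycle — could look as follows:

void wait_for_next_cycle(void);
void read_global_inputs(void);
void execute_step(void);
void write_global_outputs(void);

/* Synchronous/cyclic execution: in each cycle a global output is
   computed from a global input, with the transformation rules
   compiled into a fixed sequential order. */
void synchronous_executive(void)
{
    for (;;) {
        wait_for_next_cycle();   /* block until the next cycle starts */
        read_global_inputs();    /* sample all inputs at once */
        execute_step();          /* statically ordered transformation rules */
        write_global_outputs();  /* emit all outputs */
    }
}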
2.7.3 Real-Time Scheduling

Ideally, from a CBD perspective, the response time of a component should be independent of the environment in which it is executing (since this would facilitate reuse of the component). However, this is in most cases highly unrealistic, since:

1. The execution time of the task will be different in different target environments.
2. The response time additionally depends on the other tasks competing for the same resources (CPU, etc.) and on the scheduling method used to resolve the resource contention.

Rather than aiming for the nonachievable ideal, a realistic ambition could be to have a component model and framework that allow for analysis of response times based on abstract models of components and their compositions. Time-triggered systems go one step toward the ideal solution, in that components can be temporally isolated from each other. While not having a major impact on the component model, time-triggered systems simplify the implementation of the component framework, since all synchronization between components is resolved offline. Also, from a safety perspective, the time-triggered paradigm gives benefits, in that it reduces the number of possible execution scenarios (owing to the static order of execution of components and to the lack of preemption). Furthermore, in time-triggered component models it is possible to use the structure given by the component composition to synthesize scheduling parameters. For instance, in Rubus [13] and TTA [8] this is already done, by generating the static schedule using the components as schedulable entities. In theory, a similar approach could also be used for dynamically scheduled systems, using a scheduler/task configuration tool to automatically derive mappings of components to tasks and scheduling parameters (such as priorities or deadlines) for the tasks. However, this approach is still at the research stage.
2.8 Testing and Debugging of RTSs

According to a recent study by NIST [102], up to 80% of the life-cycle cost of software is spent on testing and debugging. Despite this importance, there are few results on testing and debugging of RTSs. The main reason is that it is actually quite difficult to test and debug RTSs. Remember that RTSs are timing critical and that they interact with the real world. Since testing and debugging typically involve some instrumentation of the code, the timing behavior of the system will be different during testing/debugging compared with the deployed system. Hence, test cases that were passed during testing may lead to failures in the deployed system, and tests that failed may not cause any problem at all in the deployed system. For debugging the situation is possibly even worse, since, in addition to a similar effect when running the system in a debugger, entering a breakpoint will stop the execution for an unspecified time. The problem with this is that the controlled external process will continue to evolve (e.g., a car will not momentarily stop when the execution of the controlling software is stopped). The result is a behavior of the debugged system that would not be possible in the real system. Also, it is often the case that the external process cannot be completely controlled, which
means that we cannot reproduce the observed behavior, which in turn makes it difficult to use (cyclic) debugging to track down an error that caused a failure. The following are two possible solutions to these problems:

• To build a simulator that faithfully captures the functional as well as the timing behavior of both the RTS and the environment it is controlling. Since this is both time consuming and costly, this approach is only feasible in very special situations. Since such situations are rare, we will not consider this alternative further here.
• To record the RTS's behavior during testing or execution, and then, if a failure is detected, replay the execution in a controlled way. For this to work it is essential that the timing behavior is the same during testing as in the deployed system. This can be achieved either by using nonintrusive hardware recorders or by leaving the software used for instrumentation in the deployed system. The latter comes at a cost in memory space and execution time, but gives the additional benefit that it becomes possible to debug the deployed system as well in case of a failure [103].

An additional problem for most RTSs is that the system consists of several concurrently executing threads. This is also the case for the majority of non-RTSs. This concurrency per se leads to a problematic nondeterminism: owing to race conditions caused by slight variations in execution time, the exact preemption points will vary, causing unpredictability both in terms of the number of scenarios and in terms of being able to predict which scenario will actually be executed in a specific situation. In conclusion, we note that testing and debugging of RTSs are difficult and challenging tasks. The following is a brief account of some of the few results on testing of RTSs reported in the literature:

• Thane and Hansson [87] proposed a method for deterministic testing of distributed RTSs. The key element here is to identify the different execution orderings (serializations of the concurrent system) and treat each of these orderings as a sequential program. The main weakness of this approach is the potentially exponential blow-up of the number of execution orderings.
• For testing of temporal correctness, Tsai et al. [104] provide a monitoring technique that records runtime information. This information is then used to analyze whether the temporal constraints are violated.
• Schütz [105] has proposed a strategy for testing distributed RTSs. The strategy is tailored to the time-triggered MARS system [71].
• Zhu et al. [106] have proposed a framework for regression testing of real-time software in distributed systems. The framework is based on the regression testing process of Onoma et al. [107].

When it comes to RTS debugging, the most promising approach is record/replay [108–112], as mentioned earlier. Using record/replay, first a reference execution of the system is run and observed; second, a replay execution is performed based on the observations made during the reference execution. Observations are performed by instrumenting the system, in order to extract information about the execution. In industrial practice, testing and debugging of multi-tasking RTSs is a time-consuming activity. At best, hardware emulators, for example, Reference 113, are used to get some level of observability without interfering with the observed system.
More often, it is an ad hoc activity, using intrusive instrumentation of the code to observe test results or to try to track down intricate timing errors. However, some tools using the above record/replay method are now emerging on the market, for example, Reference 114.
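To illustrate the recording side of record/replay, the following is a minimal C sketch (entirely our own; the hook name and event encoding are hypothetical and not taken from any of the cited tools). Events such as task switches are written, with a timestamp, into a ring buffer that is left in the deployed system, so that the most recent execution history is available for replay after a failure:

#include <stdint.h>

#define LOG_SIZE 256  /* number of recorded events; a power of two */

struct event {
    uint32_t timestamp;  /* e.g., a free-running hardware timer value */
    uint16_t task_id;    /* identity of the task being switched in */
    uint16_t kind;       /* event type: task switch, interrupt, message, ... */
};

static struct event log_buf[LOG_SIZE];
static uint32_t log_next;

/* Hypothetical hook, assumed to be called by the kernel (with preemption
   disabled) at every task switch. The same instrumentation stays in the
   deployed system, so the timing behavior during testing and in the field
   is identical. */
void log_event(uint32_t now, uint16_t task_id, uint16_t kind)
{
    uint32_t i = log_next++ & (LOG_SIZE - 1);  /* ring-buffer wraparound */
    log_buf[i].timestamp = now;
    log_buf[i].task_id   = task_id;
    log_buf[i].kind      = kind;
}

A replay tool would then read the buffer after a failure and force the same task interleaving (and input history) during a controlled re-execution, making cyclic debugging of an otherwise nonreproducible run possible.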
2.9 Summary

This chapter has presented the most important issues, methods, and trends in the area of embedded RTSs. A wide range of topics has been covered, from the initial design of embedded RTSs to analysis and testing. Important issues discussed and presented are design tools, OSs, and major underlying mechanisms such
as architectures, models of interaction, real-time mechanisms, execution strategies, and scheduling. Moreover, communication, analysis, and testing techniques are presented. Over the years, academia has put effort into improving the various techniques used to compose and design complex embedded RTSs. Standards and industry are following at a slower pace, while also adopting and developing area-specific techniques. Today, we can see diverse techniques used in different application domains, such as automotive, aerospace, and trains. In the area of communications, an effort is being made in academia, and also in some parts of industry, toward using Ethernet. This is a step toward a common technique for several application domains. Different real-time demands have led to domain-specific OSs, architectures, and models of interaction. As many of these have several commonalities, there is a potential for standardization across several domains. However, as this takes time, we will most certainly stay with application-specific techniques for a while, and for specific domains with extreme demands on safety or low cost, specialized solutions will most likely be used also in the future. Therefore, knowledge of the techniques used in, and suitable for, the various domains will remain important.
References [1] Tom R. Halfhill. Embedded Markets Breaks New Ground. Microprocessor Report, 17, 2000. [2] H. Kopetz. Introduction. In Real-Time Systems: Introduction and Overview. Part XVIII of Lecture Notes from ESSES 2003 — European Summer School on Embedded Systems. Ylva Boivie, Hans Hansson, and Sang Lyul Min, Eds., Västerås, Sweden, September 2003. [3] IEEE Computer Society. Technical Committee on Real-Time Systems Home Page. http://www.cs. bu.edu/pub/ieee-rts/. [4] Kluwer. Real-Time Systems (Journal). http://www.wkap.nl/kapis/CGI-BIN/WORLD/ journalhome.htm?0922-6443. [5] C. Liu and J. Layland. Scheduling Algorithms for Multiprogramming in a Hard-Real-Time Environment. Journal of the ACM, 20:46–61, 1973. [6] M.H. Klein, T. Ralya, B. Pollak, R. Obenza, and M.G. Harbour. A Practitioners Handbook for Rate-Monotonic Analysis. Kluwer, Dordrecht, 1998. [7] N.C. Audsley, A. Burns, R.I. Davis, K. Tindell, and A.J. Wellings. Fixed Priority Pre-Emptive Scheduling: An Historical Perspective. Real-Time Systems, 8:129–154, 1995. [8] Hermann Kopetz and Günther Bauer. The Time-Triggered Architecture. Proceedings of the IEEE, Special Issue on Modeling and Design of Embedded Software, 91:112–126, 2003. [9] J. Xu and D.L. Parnas. Scheduling Processes with Release Times, Deadlines, Precedence, and Exclusion Relations. IEEE Transactions on Software Engineering, 16:360–369, 1990. [10] Time Triggered Technologies. http://www.tttech.com. [11] H. Kopetz and G. Grünsteidl. TTP — A Protocol for Fault-Tolerant Real-Time Systems. IEEE Computer, 27(1):14–23, 1994. [12] H. Hansson, H. Lawson, and M. Strömberg. BASEMENT a Distributed Real-Time Architecture for Vehicle Applications. Real-Time Systems, 3:223–244, 1996. [13] Arcticus Systems. The Rubus Operating System. http://www.arcticus.se. [14] OMG. CORBA Component Model 3.0, June 2002. http://www.omg.org/technology/documents/ formal/components.htm. [15] Microsoft. Microsoft .COM Technologies. http://www.microsoft.com/com/. [16] Microsoft. .NET Home Page. http://www.microsoft.com/net/. [17] SUN Microsystems. Introducing Java Beans. http://developer.java.sun.com/developer/online/ Training/Beans/ Beans1/index.html. [18] John A. Stankovic. VEST — A Toolset for Constructing and Analyzing Component-Based Embedded Systems. Lecture Notes in Computer Science, 2211:390–402, 2001. [19] Rob van Ommering. The Koala Component Model. In Building Reliable Component-Based Software Systems. Artech House Publishers, July 2002, pp. 223–236.
[20] P.O. Müller, C.M. Stich, and C. Zeidler. Component-Based Embedded Systems. In Building Reliable Component-Based Software Systems. Artech House Publisher, 2002, pp. 303–323. [21] D.B. Stewart, R.A. Volpe, and P.K. Khosla. Design of Dynamically Reconfigurable Real-Time Software Using Port-Based Objects. IEEE Transactions on Software Engineering, 23(12):759–776, 1997. [22] OMG. Unified Modeling Language (UML), Version 1.5, 2003. http://www.omg.org/technology/ documents/formal/uml.htm. [23] Rational. Rational Rose RealTime. http://www.rational.com/products/rosert. [24] I-Logix. Rhapsody. http://www.ilogix.com/products/rhapsody. [25] TeleLogic. Telelogic tau. http://www.telelogic.com/products/tau. [26] Vector. DaVinci Tool Suite. http://www.vector-informatik.de/. [27] OMG. Unified Modeling Language (UML), Version 2.0 (draft). OMG document ptc/03-09-15, September 2003. [28] ITEA. EAST/EEA Project Site. http://www.east-eea.net. [29] ETAS. http://en.etasgroup.com. [30] Vector. http://www.vector-informatik.com. [31] Siemens. http://www.siemensvdo.com. [32] Comp.realtime FAQ. Available at http://www.faqs.org/faqs/realtime-computing/faq/. [33] Roadmap — Adaptive Real-Time Systems for Quality of Service Management. ARTIST — Project IST-2001-34820, May 2003. http://www.artist-embedded.org/Roadmaps/. [34] G.C. Buttazzo. Hard Real-Time Computing Systems. Kluwer Academic Publishers, Dordrecht, 1997. [35] A. Burns and A. Wellings. Real-Time Systems and Programming Languages, 2nd ed. AddisonWesley, Reading, MA, 1996. [36] The Asterix Real-Time Kernel. http://www.mrtc.mdh.se/projects/asterix/. [37] LiveDevices. Realogy Real-Time Architect, SSX5 Operating System, 1999. http://www.livedevices. com/realtime.shtml. [38] Wind River Systems Inc. VxWorks Programmer’s Guide. http://www.windriver.com/. [39] Lynuxworks. http://www.lynuxworks.com. [40] Enea OSE Systems. Ose. http://www.ose.com. [41] QNX Software Systems. QNX Realtime OS. http://www.qnx.com. [42] List of Real-Time Linux Variants. http://www.realtimelinuxfoundation.org/variants/ variants.html. [43] Express Logic. Threadx. http://www.expresslogic.com. [44] Northern Real-Time Applications. Total Time Predictability. Whitepaper on SSX5, 1998. [45] IEEE. Standard for Information Technology — Standardized Application Environment Profile — POSIX Realtime Application Support (AEP). IEEE Standard P1003.13-1998, 1998. [46] OSEK Group. OSEK/VDX Operating System Specification 2.2.1. http://www.osek-vdx.org/. [47] Airlines Electronic Engineering Committee (AEEC). ARINC 653: Avionics Application Software Standard Interface (Draft 15), June 1996. [48] ISO. Ada95 Reference Manual. ISO/IEC 8652:1995(E), 1995. [49] G. Fohler, T. Lennvall, and R. Dobrin. A Component Based Real-Time Scheduling Architecture. In Architecting Dependable Systems, Vol. LNCS-2677. R. de Lemos, C. Gacek, and A. Romanovsky, Eds., Springer-Verlag, Heidelberg, 2003. [50] J. Mäki-Turja and M. Sjödin. Combining Dynamic and Static Scheduling in Hard Real-Time Systems. Technical report MRTC no. 71, Mälardalen Real-Time Research Centre (MRTC), October 2002. [51] B. Sprunt, L. Sha, and J.P. Lehoczky. Aperiodic Task Scheduling for Hard Real-Time Systems. Real-Time Systems, 1:27–60, 1989.
[52] M. Spuri and G.C. Buttazzo. Efficient Aperiodic Service under Earliest Deadline Scheduling. In Proceedings of the 15th IEEE Real-Time Systems Symposium (RTSS’94). IEEE Computer Society, San Juan, Puerto Rico, December 1994, pp. 2–11. [53] M. Spuri and G.C. Buttazzo. Scheduling Aperiodic Tasks in Dynamic Priority Systems. Real-Time Systems, 10:179–210, 1996. [54] L. Abeni and G. Buttazzo. Integrating Multimedia Applications in Hard Real-Time Systems. In Proceedings of the 19th IEEE Real-Time Systems Symposium (RTSS’98). IEEE Computer Society, Madrid, Spain, December 1998, pp. 4–13. [55] M. Spuri, G.C. Buttazzo, and F. Sensini. Robust Aperiodic Scheduling under Dynamic Priority Systems. In Proceedings of the 16th IEEE Real-Time Systems Symposium (RTSS’95). IEEE Computer Society, Pisa, Italy, December 1995, pp. 210–219. [56] CAN Specification 2.0, Part-A and Part-B. CAN in Automation (CiA), Am Weichselgarten 26, D-91058 Erlangen, 2002. http://www.can-cia.de. [57] Road Vehicles — Interchange of Digital Information — Controller Area Network (CAN) for High Speed Communications, ISO/DIS 11898, February 1992. [58] K.W. Tindell, A. Burns, and A.J. Wellings. Calculating Controller Area Network (CAN) Message Response Times. Control Engineering Practice, 3:1163–1169, 1995. [59] K. Tindell, H. Hansson, and A. Wellings. Analysing Real-Time Communications: Controller Area Network (CAN). In Proceedings of the 15th IEEE Real-Time Systems Symposium (RTSS). IEEE Computer Society Press, December 1994, pp. 259–263. [60] Road Vehicles — Controller Area Network (CAN) — Part 4: Time Triggered Communication. ISO/CD 11898-4. [61] L. Almeida, J.A. Fonseca, and P. Fonseca. Flexible Time-Triggered Communication on a Controller Area Network. In Proceedings of the Work-In-Progress Session of the 19th IEEE Real-Time Systems Symposium (RTSS’98). IEEE Computer Society, Madrid, Spain, December 1998. [62] L. Almeida, J.A. Fonseca, and P. Fonseca. A Flexible Time-Triggered Communication System Based on the Controller Area Network: Experimental Results. In Proceedings of the IFAC International Conference on Fieldbus Technology (FeT). Springer, 1999, pp. 342–350. [63] TTTech Computertechnik AG. Specification of the TTP/C Protocol v0.5, July 1999. [64] H. Kopetz. The Time-Triggered Model of Computation. In Proceedings of the 19th IEEE RealTime Systems Symposium (RTSS’98). IEEE Computer Society, Madrid, Spain, December 1998, pp. 168–177. [65] LIN. Local Interconnect Network. http://www.lin-subbus.de. [66] R. Belschner, J. Berwanger, C. Ebner, H. Eisele, S. Fluhrer, T. Forest, T. Führer, F. Hartwich, B. Hedenetz, R. Hugel, A. Knapp, J. Krammer, A. Millsap, B. Müller, M. Peller, and A. Schedl. FlexRay — Requirements Specification, April 2002. http://www.flexray-group.com. [67] ARINC/RTCA-SC-182/EUROCAE-WG-48. Minimal Operational Performance Standard for Avionics Computer Resources, 1999. [68] PROFIBUS. PROFIBUS International. http://www.profibus.com. [69] H. Kirrmann and P.A. Zuber. The IEC/IEEE Train Communication Network. IEEE Micro, 21:81–92, 2001. [70] WorldFIP. WorldFIP Fieldbus. http://www.worldfip.org. [71] H. Kopetz, A. Damm, C. Koza, and M. Mullozzani. Distributed Fault Tolerant Real-Time Systems: The MARS Approach. IEEE Micro, 9(1):25–40, 1989. [72] C. Venkatramani and T. Chiueh. Supporting Real-Time Traffic on Ethernet. In Proceedings of the 15th IEEE Real-Time Systems Symposium (RTSS’94). IEEE Computer Society, San Juan, Puerto Rico, December 1994, pp. 282–286. [73] D.W. Pritty, J.R. Malone, S.K. Banerjee, and N.L. 
Lawrie. A Real-Time Upgrade for Ethernet Based Factory Networking. In Proceedings of the IECON’95. IEEE Industrial Electronics Society, 1995, pp. 1631–1637.
[74] N. Malcolm and W. Zhao. The Timed Token Protocol for Real-Time Communication. IEEE Computer, 27:35–41, 1994. [75] K.K. Ramakrishnan and H. Yang. The Ethernet Capture Effect: Analysis and Solution. In Proceedings of the 19th IEEE Local Computer Networks Conference (LCNC’94), October 1994, pp. 228–240. [76] M. Molle. A New Binary Logarithmic Arbitration Method for Ethernet. Technical report, TR CSRI-298, CRI, University of Toronto, Canada, 1994. [77] G. Lann and N. Riviere. Real-Time Communications over Broadcast Networks: The CSMA/DCR and the DOD-CSMA/CD Protocols. Technical report, TR 1863, INRIA, 1993. [78] M. Molle and L. Kleinrock. Virtual Time CSMA: Why Two Clocks are Better than One. IEEE Transactions on Communications, 33:919–933, 1985. [79] W. Zhao and K. Ramamritham. A Virtual Time CSMA/CD Protocol for Hard Real-Time Communication. In Proceedings of the 7th IEEE Real-Time Systems Symposium (RTSS’86). IEEE Computer Society, New Orleans, LA, December 1986, pp. 120–127. [80] M. El-Derini and M. El-Sakka. A Novel Protocol Under a Priority Time Constraint for Real-Time Communication Systems. In Proceedings of the 2nd IEEE Workshop on Future Trends of Distributed Computing Systems (FTDCS’90). IEEE Computer Society, Cairo, Egypt, September 1990, pp. 128–134. [81] W. Zhao, J.A. Stankovic, and K. Ramamritham. A Window Protocol for Transmission of TimeConstrained Messages. IEEE Transactions on Computers, 39:1186–1203, 1990. [82] L. Almeida, P. Pedreiras, and J.A. Fonseca. The FTT-CAN Protocol: Why and How? IEEE Transaction on Industrial Electronics, 49(6):1189–1201, 2002. [83] P. Pedreiras, L. Almeida, and P. Gai. The FTT-Ethernet Protocol: Merging Flexibility, Timeliness and Efficiency. In Proceedings of the 14th Euromicro Conference on Real-Time Systems (ECRTS’02). IEEE Computer Society, Vienna, Austria, June 2002, pp. 152–160. [84] S.K. Kweon, K.G. Shin, and G. Workman. Achieving Real-Time Communication over Ethernet with Adaptive Traffic Smoothing. In Proceedings of the Sixth IEEE Real-Time Technology and Applications Symposium (RTAS’00). IEEE Computer Society, Washington DC, USA, June 2000, pp. 90–100. [85] A. Carpenzano, R. Caponetto, L. LoBello, and O. Mirabella. Fuzzy Traffic Smoothing: An Approach for Real-Time Communication over Ethernet Networks. In Proceedings of the Fourth IEEE International Workshop on Factory Communication Systems (WFCS’02). IEEE Industrial Electronics Society, Västerås, Sweden, August 2002, pp. 241–248. [86] J.L. Sobrinho and A.S. Krishnakumar. EQuB-Ethernet Quality of Service Using Black Bursts. In Proceedings of the 23rd IEEE Annual Conference on Local Computer Networks (LCN’98). IEEE Computer Society, Lowell, MA, October 1998, pp. 286–296. [87] H. Thane and H. Hansson. Towards Systematic Testing of Distributed Real-Time Systems. In Proceedings of the 20th IEEE Real-Time Systems Symposium (RTSS). December 1999, pp. 360–369. [88] M. Joseph and P. Pandya. Finding Response Times in a Real-Time System. Computer Journal, 29:390–395, 1986. [89] The Times Tool. http://www.docs.uu.se/docs/rtmv/times. [90] AbsInt. http://www.absint.com. [91] Bound-T Execution Time Analyzer. http://www.bound-t.com. [92] L. Casparsson, A. Rajnak, K. Tindell, and P. Malmberg. Volcano — A Revolution in On-Board Communications. Volvo Technology Report, 1:9–19, 1998. [93] Volcano automotive group. http://www.volcanoautomotive.com. [94] TimeSys. Timewiz — A Modeling and Simulation Tool. http://www.timesys.com/. [95] OMG. 
UML Profile for Schedulability, Performance, and Time Specification. OMG document formal/2003-09-01, September 2003.
[96] J.L. Medina, M. González Harbour, and J.M. Drake. MAST Real-Time View: A Graphic UML Tool for Modeling Object-Oriented Real-Time Systems. In Proceedings of the 22nd IEEE Real-Time Systems Symposium (RTSS). IEEE Computer Society, December 2001, pp. 245–256. [97] MAST home-page. http://mast.unican.es/. [98] A. Massa. Embedded Software Development with eCos. Prentice Hall, New York, November 2002, ISBN: 0130354732. [99] eCos Home Page. http://sources.redhat.com/ecos. [100] OMG. CORBA Home Page. http://www.omg.org/corba/. [101] PECOS Project Web Site. http://www.pecos-project.org. [102] U.S. Department of Commerce. The Economic Impacts of Inadequate Infrastructure for Software Testing. NIST report, May 2002. [103] M. Ronsse, K. De Bosschere, M. Christiaens, J. Chassin de Kergommeaux, and D. Kranzlmüller. Record/Replay for Nondeterministic Program Executions. Communications of the ACM, 46:62–67, 2003. [104] J.J.P. Tsai, K.Y. Fang, and Y.D. Bi. On Realtime Software Testing and Debugging. In Proceedings of the 14th Annual International Computer Software and Application Conference. IEEE Computer Society, November 1990, pp. 512–518. [105] W. Schütz. Fundamental Issues in Testing Distributed Real-Time Systems. Real-Time Systems, 7:129–157, 1994. [106] H. Zhu, P. Hall, and J. May. Software Unit Test Coverage and Adequacy. ACM Computing Surveys, 29(4):366–427, 1997. [107] K. Onoma, W.-T. Tsai, M. Poonawala, and H. Suganuma. Regression Testing in an Industrial Environment. Communications of the ACM, 41:81–86, 1998. [108] J.D. Choi, B. Alpern, T. Ngo, M. Sridharan, and J. Vlissides. A Pertrubation-Free Replay Platform for Cross-Optimized Multithreaded Applications. In Proceedings of the 15th International Parallel and Distributed Processing Symposium. IEEE Computer Society Press, Washington, April 2001. [109] J. Mellor-Crummey and T. LeBlanc. A Software Instruction Counter. In Proceedings of the Third International Conference on Architectural Support for Programming Languages and Operating Systems. ACM, April 1989, pp. 78–86. [110] K.C. Tai, R. Carver, and E. Obaid. Debugging Concurrent ADA Programs by Deterministic Execution. IEEE Transactions on Software Engineering, 17:280–287, 1991. [111] H. Thane and H. Hansson. Using Deterministic Replay for Debugging of Distributed Real-Time Systems. In Proceedings of the 12th Euromicro Conference on Real-Time Systems. IEEE Computer Society Press, Washington, June 2000, pp. 265–272. [112] F. Zambonelli and R. Netzer. An Efficient Logging Algorithm for Incremental Replay of MessagePassing Applications. In Proceedings of the 13th International and 10th Symposium on Parallel and Distributed Processing. IEEE, April 1999, pp. 392–398. [113] Lauterbach. Lauterbach. http://www.laterbach.com. [114] ZealCore. ZealCore Embedded Solutions AB. http://www.zealcore.com.
Design and Validation of Embedded Systems

3 Design of Embedded Systems
  Luciano Lavagno and Claudio Passerone
4 Models of Embedded Computation
  Axel Jantsch
5 Modeling Formalisms for Embedded System Design
  Luís Gomes, João Paulo Barros, and Anikó Costa
6 System Validation
  J.V. Kapitonova, A.A. Letichevsky, V.A. Volkov, and Thomas Weigert
3
Design of Embedded Systems

Luciano Lavagno
Cadence Berkeley Laboratories and Politecnico di Torino

Claudio Passerone
Politecnico di Torino

3.1 The Embedded System Revolution
3.2 Design of Embedded Systems
3.3 Functional Design
3.4 Function–Architecture and Hardware–Software Codesign
3.5 Hardware–Software Coverification and Hardware Simulation
3.6 Software Implementation
    Compilation, Debugging, and Memory Model • Real-Time Scheduling
3.7 Hardware Implementation
    Logic Synthesis and Equivalence Checking • Placement, Routing, and Extraction • Simulation, Formal Verification, and Test Pattern Generation
3.8 Conclusions
References
3.1 The Embedded System Revolution

The world of electronics has witnessed a dramatic growth of its applications in the last few decades. From telecommunications to entertainment, from automotive to banking, almost every aspect of our everyday life employs some kind of electronic component. In most cases, these components are computer-based systems, which are not, however, used or perceived as computers. For instance, they often do not have a keyboard or a display to interact with the user, and they do not run standard operating systems and applications. Sometimes, these systems constitute a self-contained product themselves (e.g., a mobile phone), but they are frequently embedded inside another system, to which they provide better functionality and performance (e.g., the engine control unit of a motor vehicle). We call these computer-based systems embedded systems. The huge success of embedded electronics has several causes. The main one, in our opinion, is that embedded systems bring the advantages of Moore's Law into everyday life, that is, an exponential increase in performance and functionality at an ever decreasing cost. This is possible because of the capabilities of integrated circuit technology and manufacturing, which allow one to build more and more complex devices, and because of the development of new design methodologies, which allow one to use those devices efficiently and cleverly. Traditional steel-based mechanical development, on the other hand, has reached
a plateau near the middle of the twentieth century, and thus is no longer a significant source of innovation, unless coupled with electronic manufacturing technologies (microelectromechanical systems, MEMS) or embedded systems, as argued above.

There are many examples of embedded systems in the real world. For instance, a modern car contains tens of electronic components (control units, sensors, and actuators) that perform very different tasks. The first embedded systems that appeared in cars were related to the control of mechanical aspects, such as the control of the engine, the antilock brake system, and the control of the suspension and transmission. Nowadays, however, cars also have a number of components that are not directly related to mechanical aspects, but rather to the use of the car as a vehicle for moving around, or to the communication needs of the passengers: navigation systems, digital audio and video players, and phones are just a few examples. Moreover, many of these embedded systems are connected together using a network, because they need to share information regarding the state of the car.

Other examples come from the communication industry: a cellular phone is an embedded system whose environment is the mobile network. These are very sophisticated computers whose main task is to send and receive voice, but which are also used as personal digital assistants, for games, to send and receive images and multimedia messages, and to wirelessly browse the Internet. They have been so successful and pervasive that in just a decade they became essential in our lives. Other kinds of embedded systems have significantly changed our lives as well: for instance, ATM and Point-of-Sale (POS) machines changed the way we make payments, and multimedia digital players changed how we listen to music and watch videos.

We are just at the beginning of a revolution that will have an impact on every other industrial sector. Special-purpose embedded systems will proliferate and will be found in almost any object that we use. They will be optimized for the application and offer a natural user interface. They will be flexible, in order to adapt to a changing environment. Most of them will also be wireless, in order to follow us wherever we go and keep us constantly connected with the information we need and the people we care about. Even the role of computers will have to be reconsidered, as many of the applications for which they are used today will be performed by specially designed embedded systems.

What are the consequences of this revolution for the industry? Car manufacturers today need to acquire significant skills in hardware and software design, in addition to the mechanical skills that they already have in-house, or they must outsource these requirements to an external supplier. In either case, a broad variety of skills needs to be mastered, from the design of software architectures for implementing the functionality, to the ability to model performance, because real-time aspects are extremely important in embedded systems, especially those related to safety-critical applications. Embedded system designers must also be able to architect and analyze the performance of networks, as well as validate the functionality that has been implemented over a particular architecture and the communication protocols that are used.
A similar revolution has happened, or is about to happen, in other industrial and socioeconomic areas as well, such as entertainment, tourism, education, agriculture, and government. It is therefore clear that new, more efficient, and easier-to-use embedded electronics design methodologies need to be developed, in order to enable the industry to make full use of the available technology.
3.2 Design of Embedded Systems

Embedded systems are informally defined as a collection of programmable parts surrounded by Application Specific Integrated Circuits (ASICs) and other standard components (Application Specific Standard Parts, ASSPs) that interact continuously with an environment through sensors and actuators. The collection can physically be a set of chips on a board, or a set of modules on an integrated circuit. Software is used for features and flexibility, while dedicated hardware is used for increased performance and reduced power consumption. An example of an architecture of an embedded system is shown in Figure 3.1. The main programmable components are microprocessors and Digital Signal Processors (DSPs), which implement the software partition of the system.
FIGURE 3.1 A reactive real-time embedded system architecture. (Block diagram: a microprocessor/microcontroller, a DSP, a coprocessor, an FPGA, and IP blocks, connected through a bridge to memories, a dual-port memory, and peripherals.)
One can view reconfigurable components, especially if they can be reconfigured at runtime, as programmable components in this respect; they exhibit area, cost, performance, and power characteristics that are intermediate between dedicated hardware and processors. Custom and programmable hardware components, on the other hand, implement application-specific blocks and peripherals. All components are connected through standard and dedicated buses and networks, and data is stored in a set of memories. Often several smaller subsystems are networked together, for example, to control an entire car, or to constitute a cellular or wireless network.

We can identify a set of typical characteristics that are commonly found in embedded systems. For instance, they are usually not very flexible and are designed to always perform the same task: if you buy an engine control embedded system, you cannot use it to control the brakes of your car, or to play games. A PC, on the other hand, is much more flexible because it can perform several very different tasks. An embedded system is often part of a larger controlled system. Moreover, cost, reliability, and safety are often more important criteria than raw performance, because the customer may not even be aware of the presence of the embedded system, and so judges the product by other characteristics, such as its cost, ease of use, or lifetime.

Another common characteristic of many embedded systems is that they need to be designed in an extremely short time to meet their time-to-market. Only a few months should elapse from the conception of a consumer product to the first working prototypes. If these deadlines are not met, the result is an increase in design costs and a simultaneous decrease in profits, because fewer items will be sold; delays in the design cycle may thus make the difference between a successful product and an unsuccessful one.

In the current state of the art, embedded systems are designed with an ad hoc approach that is heavily based on earlier experience with similar products and on manual design. Often the design process requires several iterations to obtain convergence, because the system is not specified in a rigorous and unambiguous fashion, and the levels of abstraction and detail and the design styles in various parts are likely to differ. But as the complexity of embedded systems scales up, this approach is showing its limits, especially regarding design and verification time.

New methodologies are being developed to cope with the increased complexity and to enhance designers' productivity. In the past, a sequence of two steps has always been used to reach this goal: abstraction and clustering. Abstraction means describing an object (e.g., a logic gate made of metal oxide semiconductor [MOS] transistors) using a model where some of the low-level details are ignored (e.g., the Boolean expression representing that logic gate). Clustering means connecting a set of models at the same level of abstraction to obtain a new object, which usually shows new properties that are not part of the isolated models that constitute it. By successively applying these two steps, digital electronic design went from drawing layouts, to transistor schematics, to logic gate netlists, to register transfer level (RTL) descriptions, as shown in Figure 3.2.
FIGURE 3.2 Abstraction and clustering levels in hardware design. (Successive abstraction and clustering steps lead from the transistor model, through the gate-level model and RTL, to the system level, spanning roughly the 1970s, 1980s, 1990s, and 2000+.)
The notion of platform is key to the efficient use of abstraction and clustering. A platform is a single abstract model that hides the details of a set of different possible implementations as clusters of lower-level components. The platform, for example, a family of microprocessors, peripherals, and bus protocols, allows developers of designs at the higher level (generically called "applications" in the following) to operate without detailed knowledge of the implementation (e.g., the pipelining of the processor or the internal implementation of the Universal Asynchronous Receiver/Transmitter [UART]). At the same time, it allows platform implementors to share design and fabrication costs among a broad range of potential users, broader than if each design were one of a kind.

Today we are witnessing the appearance of a new, higher level of abstraction, as a response to the growing complexity of integrated circuits. Objects can be functional descriptions of complex behaviors, or architectural specifications of complete hardware platforms. They make use of formal high-level models that can be used to perform an early and fast validation of the final system implementation, although with reduced detail with respect to a lower-level description.

The relationship between an "application" and the elements of a platform is called a mapping. Such a relationship exists, for example, between logic gates and the geometric patterns of a layout, as well as between RTL statements and gates. At the system level, the mapping is between functional objects with their communication links, and platform elements with their communication paths. Mapping at the system level means associating a functional behavior (e.g., an FFT [fast Fourier transform] or a filter) with an architectural element that can implement that behavior (e.g., a CPU, a DSP, or a piece of dedicated hardware). It can also associate a communication link (e.g., an abstract FIFO [first in first out]) with some communication services available in the architecture (e.g., a driver, a bus, and some interfaces). The mapping step may also need to specify parameters for these associations (e.g., the priority of a software task or the size of a FIFO), in order to describe them completely. The object that we obtain after mapping shows properties that were not directly exposed in the separate descriptions, such as the performance of the selected system implementation (a small sketch of the information recorded by such a mapping follows).
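To make the idea concrete, the following sketch shows the kind of information a system-level mapping might record: each functional object is bound to a platform resource, together with annotation parameters such as a task priority or a FIFO depth. The data structure, names, and values are invented for illustration and do not come from any particular tool.

```c
/* Hypothetical sketch of a system-level mapping table; all names and
 * values are invented. */
#include <stdio.h>

typedef enum { CPU0, DSP0, HW_ACCEL, BUS0 } resource_t;

typedef struct {
    const char *object;     /* functional behavior or communication link */
    resource_t  resource;   /* platform element implementing it          */
    int         priority;   /* task priority, if mapped to a processor   */
    int         fifo_depth; /* buffer size, if a communication link      */
} mapping_t;

static const mapping_t map[] = {
    { "fft",        DSP0, 2,  0 },   /* behavior mapped to the DSP      */
    { "controller", CPU0, 1,  0 },   /* behavior mapped to the CPU      */
    { "fft_out",    BUS0, 0, 16 },   /* abstract FIFO mapped to the bus */
};

int main(void)
{
    for (size_t i = 0; i < sizeof map / sizeof map[0]; i++)
        printf("%-10s -> resource %d (priority %d, FIFO depth %d)\n",
               map[i].object, (int)map[i].resource,
               map[i].priority, map[i].fifo_depth);
    return 0;
}
```

A real mapping environment would, of course, attach many more parameters (clock frequencies, bus widths, memory placement), but the principle of keeping function and architecture in separate descriptions joined by such a table is the same.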
Performance here is not just timing, but any other quantity that can be defined to characterize an embedded system, either physical (area, power consumption, etc.) or logical (quality of service [QOS], fault tolerance, etc.). Since the system-level mapping operates on heterogeneous objects, it also allows one to nicely separate different and orthogonal aspects, such as:

1. Computation and communication. This separation is important because the refinement of computation is generally done by hand, or by compilation and scheduling, while communication makes use of patterns.
2. Application and platform implementation (also called functionality and architecture, e.g., in Reference 1), because they are often defined and designed independently, by different groups or companies.
3. Behavior and performance, which should be kept separate because performance information can represent either nonfunctional requirements (e.g., the maximum response time of an embedded controller) or the result of an implementation choice (e.g., the worst-case execution time [WCET] of a task). Nonfunctional constraint verification can be performed traditionally, by simulation and prototyping, or with static formal checks, such as schedulability analysis.

All these separations result in better reuse, because they decouple independent aspects that would otherwise tie, for example, a given functional specification to low-level implementation details by modeling it as assembler or Verilog code. This in turn allows one to reduce design time, by increasing productivity and decreasing the time needed to verify the system.

A schematic representation of a methodology that can be derived from these abstraction and clustering steps is shown in Figure 3.3. At the functional level, a behavior for the system to be implemented is specified, designed, and analyzed, either through simulation or by proving that certain properties are satisfied (the algorithm always terminates, the computation performed satisfies a set of specifications, the complexity of the algorithm is polynomial, etc.). In parallel, a set of architectures is composed from a clustering of platform elements and selected as candidates for the implementation of the behavior. These components may come from an existing library, or they may be specifications of components that will be designed later. Functional operations are then assigned to the various architecture components, and patterns provided by the architecture are selected for the defined communications. At this level we are able to verify the performance of the selected implementation, in much richer detail than at the purely functional level.
FIGURE 3.3 Design methodology for embedded systems. (The function, verified against behavioral libraries, and the architecture, assembled from architecture libraries, are developed independently at the functional level; they are combined at the mapping level, where performance is verified; refinement then leads to the implementation level, where the refinements themselves are verified.)
Different mappings to the same architecture, or mappings to different architectures, allow one to explore the design space to find the best solutions to important design challenges. These kinds of analysis let the designer identify and correct possible problems early in the design cycle, thus drastically reducing the time needed to explore the design space and weed out potentially catastrophic mistakes and bugs.

At this stage it is also very important to define the organization of the data storage units of the system. The various kinds of memories (e.g., ROM, SRAM, DRAM, Flash) have different performance and data persistence characteristics, and must be used judiciously to balance cost and performance. Mapping data structures to different memories, and even changing the organization and layout of arrays, can have a dramatic impact on, for example, the satisfaction of a given latency in the execution of an algorithm. In particular, a System-on-Chip designer can afford to do a very fine tuning of the number and sizes of the embedded memories (especially SRAM, but now also Flash) to be connected to processors and dedicated hardware [2].

Finally, at the implementation level, the reverse transformation of abstraction and clustering occurs, that is, a lower-level specification of the embedded system is generated. This is obtained through a series of manual or automatic refinements and modifications that successively add more details, while checking their compliance with the higher-level requirements. This step does not need to directly generate a manufacturable final implementation; rather, it produces a new description that in turn constitutes the input for another (recursive) application of the same overall methodology at a lower level of abstraction (e.g., synthesis, placement, and routing for hardware, and compilation and linking for software). Moreover, the results obtained by these refinements can be back-annotated to the higher level, to perform a better and more accurate verification.
3.3 Functional Design

As discussed in the previous section, system-level design of embedded electronics requires two distinct phases. In the first phase, functional and nonfunctional constraints are the key aspects. In the second phase, the available architectural platforms are taken into account, and detailed implementation can proceed after a mapping phase that defines the architectural component on which every functional model is implemented. This second phase requires a careful analysis of the trade-offs between algorithmic complexity, functional flexibility, and implementation costs.

In this section we describe some of the tools that are used for requirements capture, focusing especially on those that permit executable specification. Such tools generally belong to two broad classes.

The first class is represented, for example, by Simulink [3], MATRIXx [4], Ascet-SD [5], SPW [6], SCADE [7], and SystemStudio [8]. It includes block-level editors and libraries with which the designer composes data-dominated digital signal processing and embedded control systems. The libraries include simple blocks, such as multiplication, addition, and multiplexing, as well as more complex ones, such as FIR filters, FFTs, and so on.

The second class is represented by tools such as Tau [9], StateMate [10], Esterel Studio [7], and StateFlow [3]. It is oriented toward control-dominated embedded systems. In this case, the emphasis is placed on the decisions that must be taken by the embedded system in response to environment and user inputs, rather than on numerical computations. The notation is generally some form of Harel's Statecharts [11].

The Unified Modeling Language (UML), as standardized by the Object Management Group [12], is in a class by itself, since it historically focused more on general-purpose software (e.g., enterprise and commercial software) than on embedded real-time software. Only recently have some embedded aspects, such as performance and time, been incorporated in UML 2.0 [12,13], and emphasis has been placed on model-based software generation. However, tool support for UML 2.0 is still limited (Tau [9], Real Time Studio [14], and Rose RealTime [15] provide some), and UML-based hardware design is still in its infancy. Furthermore, the UML is a collection of notations, some of which (especially Statecharts) are supported by several of the tools listed above in the control-dominated class.

Simulink, with its related tools and toolboxes from both MathWorks and third parties such as dSPACE [16], is the workhorse of modern model-based embedded system design. In model-based design,
a functional executable model is used for algorithm development. This is made easier in the case of Simulink by its tight integration with Matlab, the standard tool in DSP algorithm development. The same functional model, with added annotations such as bitwidths and execution priorities, is then used for algorithmic refinements, such as floating-point to fixed-point conversion and real-time task generation. Automated software generators, such as Real-Time Workshop, Embedded Coder [3], and TargetLink [16], are then used to generate task code and sometimes to customize a real-time operating system (RTOS) on which the tasks will run. Ascet-SD, for example, automatically generates a customization of the OSEK automotive RTOS [17] for the tasks that are generated from a functional model. In all these cases, a task is typically generated from a set of blocks that are executed at the same rate or triggered by the same event in the functional model. Task formation algorithms can use either direct user input (e.g., the execution rate of each block in the discrete-time portions of a Simulink or Ascet-SD design), or static scheduling algorithms for dataflow models (e.g., based on relative block-to-block rate specifications in SPW or SystemStudio [18,19]); a minimal sketch of the result of such rate-based grouping is shown at the end of this section.

Simulink is also tightly integrated with StateFlow, a design tool for control-dominated applications, in order to ease the integration of decision-making and computation code. It also allows one to smoothly generate both hardware and software from the very same specification. This capability, as well as the integration with some sort of Statechart-based finite state machine (FSM) editor, is available in most tools of the first class above. The difference in market share can be attributed to the availability of Simulink "toolboxes" for numerous embedded system design tasks (from fixed-point optimization to FPGA [Field Programmable Gate Array]-based implementation) and to Simulink's widespread adoption in undergraduate university courses, which makes it well known to most of today's engineers.

The second class of tools either plays an ancillary role in the design of embedded control systems (e.g., StateFlow and Esterel Studio), or is devoted to inherently control-dominated application areas, such as telecommunication protocols. In the latter market the clear dominator today is Tau. The underlying languages, such as the Specification and Description Language (SDL) and Message Sequence Charts, are standardized by the International Telecommunication Union (ITU). They are commonly used to describe protocol standards in a tool-independent way; thus modeling in SDL is quite natural in this application domain, since validation and refinement can proceed formally within a unified environment. Tau also has code generation capabilities, for both application code and the customization of real-time kernels on which the FSM-generated code will run. The use of Tau for embedded code generation (model-based design) significantly predates that of Simulink-based code generators, mostly due to the highly complex nature of telecom protocols and the less demanding memory and computing power constraints of switches and other networking equipment. Tau has links to the requirements capture tool Doors [9], also from Telelogic, which allows one to trace dependencies between multiple requirements written in English, and to connect them to the aspects of the embedded system design files that implement these requirements.
The state of the art of such requirement tracing, however, is far from satisfactory, since there is no formal means in Doors to automatically check for violations. Similar capabilities are provided by Reqtify [20]. Techniques for automated functional constraint validation, starting from formal languages, are described in several books, for example, References 21 and 22. Deadline, latency, and throughput constraints are special kinds of nonfunctional requirements that have received extensive treatment in the real-time scheduling community. They are also covered in several books, for example, References 23–25. While model-based functional verification is quite attractive, due to its high abstraction level, it ignores cost and performance implications of algorithmic decisions. These are taken into account by the tools described in the next section.
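To close this section, here is a minimal sketch (all names invented) of the rate-based task formation discussed earlier: blocks annotated with the same sample rate are grouped by a hypothetical code generator into a single periodic task, assumed here to be invoked by an RTOS timer every 10 msec.

```c
/* Minimal sketch of generated task code; a real generator emits far
 * more elaborate code, and block names here are invented. */
static int sample, filtered;                  /* inter-block signals   */

static void sensor_read_step(void)    { sample = 0;  /* read ADC here */ }
static void filter_step(void)         { filtered = sample / 2;          }
static void actuator_write_step(void) { (void)filtered; /* drive DAC */ }

/* Generated task body: one rate group of the block diagram = one task,
 * with the blocks called in dataflow order. */
void task_10ms(void)
{
    sensor_read_step();
    filter_step();
    actuator_write_step();
}

int main(void) { task_10ms(); return 0; }     /* stand-in for the RTOS */
```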
3.4 Function–Architecture and Hardware–Software Codesign

In this section, we describe some of the tools that are available to help embedded system designers to optimally architect the implementation of the system, and choose the best solution for each functional
component. After these decisions have been made, detailed design can proceed using the languages, tools, and methods described in the following chapters of this book.

This step of the design process, whose general structure has been outlined in Section 3.2 using the platform-based design paradigm, has received various names in the past. Early work [26,27] called it hardware–software codesign (or cosynthesis), because one of the key decisions at this level is which functionality has to be implemented in software and which in dedicated hardware, and how the two partitions of the design interact with minimum cost and maximum performance. Later on, people came to realize that hardware–software was too coarse a granularity, and that more implementation choices had to be taken into account. For example, one could trade off single versus multiple processors, general-purpose CPUs versus specialized DSPs and Application-Specific Instruction-set Processors (ASIPs), dedicated ASICs versus ASSPs (e.g., an MPEG coprocessor or an Ethernet Medium Access Controller), and standard cells versus FPGAs. Thus the term function–architecture codesign was coined [1], to refer to the more complex problem of partitioning a given functionality onto a heterogeneous architecture such as the one in Figure 3.1. The term system-level design has also had some popularity in the industry [6,28], to indicate "the level of design above Register Transfer, at which software and hardware interact." Other terms, such as timed functional model, have also been used [29].

The key problems that are tackled by tools acting as a bridge between the system-level application and the architectural platform are:

1. How to model the performance impact of making mapping decisions from a virtually "implementation-independent" functional specification to an architectural model.
2. How to efficiently drive downstream code generation, synthesis, and validation tools, to avoid redoing the modeling effort from scratch at the RTL, C, or assembly code levels, respectively.

The notion of automated implementation generation from a high-level functional model is called "model-based design" in the software world. In both cases, the notion of an "implementation-independent" functional specification, which can be retargeted indifferently to hardware and software implementations, must be carefully evaluated. Taken in its most literal terms, this idea has often been dismissed as a myth. However, current practice shows that it is already a reality, at least for some application domains (automotive electronics and telecommunication protocols). It is intuitively very appealing, since it can be considered a high-level application of the platform-based design principle, using a formal system-level platform. Such a platform, embodied in one of the several models of computation that are used in embedded system design, is a perfect candidate to maximize design reuse, and to optimally exploit different implementation options.

In particular, several of the tools mentioned in the previous section (e.g., Simulink, TargetLink, StateFlow, SPW, System Studio, Tau, Ascet-SD, StateMate, Esterel Studio) have code generation capabilities that are considered good enough for implementation and not just for rapid prototyping and simulation acceleration.
Moreover, several of them (e.g., Simulink, StateFlow, SPW, System Studio, StateMate, Esterel Studio) can generate both C for software implementation and synthesizable VHDL or Verilog for hardware implementation. Unfortunately, these code generation capabilities often require the laborious creation of implementation models for each target platform (e.g., software in C or assembler for a given DSP, a synthesizable VHDL or macroblock netlist for an ASIC or FPGA, etc.). However, since these careful implementations are instances of the system-level platform mentioned above, their development cost can be shared among a multitude of designs performed using the tool.

Most block-diagram or Statechart-based code generators work in a syntax-directed fashion: a piece of C or synthesizable VHDL code is generated for each block and connection, or for each hierarchical state and transition. Thus the designer has tight control over the complexity of the generated software or hardware. While this is a convenient means of bringing manual optimization capabilities into the model-based design flow, it has a potentially significant disadvantage in terms of cost and performance (comparable to disabling optimizations in the case of a C compiler). On the other hand, more recent tools, such as
Esterel Studio and System Studio, take a more radical approach to code generation, based on aggressive optimizations [30]. These optimizations, which use logic synthesis techniques even in the case of software implementation, destroy the original model structure, and thus make debugging and maintenance much harder. However, they can result in an order of magnitude improvement in terms of cost (memory size) and performance (execution speed) with respect to their syntax-directed counterparts [31].

Assuming that good automated code generation, or manual design, is available for each block in the functional model of the application, we are now faced with the function–architecture codesign problem. This essentially means tuning the functional decomposition, as well as the algorithms employed by the overall functional model and by each block within it, to the available architecture, and vice versa. Several design environments, for example:

• POLIS [1], COSYMA [26], Vulcan [27], COSMOS [32], and Roses [33] in the academic world
• Real Time Studio [14], Foresight [34], and CARDtools [35] in the commercial world

help the designer in this task by exploiting, in some form, the notion of independence between the functional specification on one side, and hardware–software partitioning or architecture mapping choices on the other.

The step of performance evaluation is performed in an abstract, approximate manner by the tools listed above. Some of them use estimators to evaluate the cost and performance of mapping a functional block to an architectural block. Others (e.g., POLIS) rely on cycle-approximate simulation to perform the same task in a manner that better reflects real-life effects, such as burstiness of resource occupation. Techniques for deriving both abstract static performance models (e.g., the WCET of a software task) and performance simulation models are discussed below. In all cases, both the cost of computation and that of communication must be taken into account. This is because the best implementation, especially in the case of multimedia systems that manipulate large amounts of image and sound data, is often one that reduces the amount of data transferred between multiple memory locations, rather than one that finds the absolute best trade-off between software flexibility and hardware efficiency.

In this area, the Atomium project at IMEC [2,36] has focused on finding the best memory architecture and schedule of memory transfers for data-dominated applications on mixed hardware–software platforms. By exploiting array access models based on polyhedra, it identifies the best reorganization of the inner loops of DSP kernels and the best embedded memory architecture. The goal is to reduce memory traffic due to register spills, and to maximize overall performance by accessing several memories in parallel (many DSPs offer this opportunity even in the embedded software domain). A very interesting aspect of Atomium, which distinguishes it from most other optimization tools for embedded systems, is its ability to return a set of Pareto-optimal solutions (i.e., solutions none of which is better than another in every aspect of the cost function), rather than a single solution. This allows the designer to pick the best point based on the various aspects of cost and performance (e.g., silicon area versus power and performance), rather than forcing the designer to "abstract" optimality into a single number; a minimal sketch of such Pareto filtering follows.
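The sketch below keeps a candidate implementation unless some other candidate is no worse in every metric and strictly better in at least one. The metrics and the simple quadratic scan are illustrative only, not Atomium's actual algorithm.

```c
/* Illustrative Pareto filter over candidate mappings, each described by
 * cost metrics to be minimized. */
#include <stdio.h>

typedef struct { double area, power, latency; } point_t;

static int dominates(point_t a, point_t b)   /* does a dominate b? */
{
    int no_worse = a.area <= b.area && a.power <= b.power &&
                   a.latency <= b.latency;
    int better   = a.area <  b.area || a.power <  b.power ||
                   a.latency < b.latency;
    return no_worse && better;
}

int pareto_filter(const point_t *p, int n, point_t *out)
{
    int m = 0;
    for (int i = 0; i < n; i++) {
        int dominated = 0;
        for (int j = 0; j < n && !dominated; j++)
            dominated = dominates(p[j], p[i]);
        if (!dominated)
            out[m++] = p[i];
    }
    return m;                                /* Pareto-optimal count */
}

int main(void)
{
    point_t cand[] = { {10, 5, 7}, {12, 4, 7}, {11, 6, 9} }; /* invented */
    point_t best[3];
    printf("%d Pareto-optimal mappings\n", pareto_filter(cand, 3, best));
    return 0;
}
```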
Performance analysis can be based on simulation, as mentioned above, or can rely on automatically constructed models that reflect the WCET of pieces of software (e.g., RTOS tasks) running on an embedded processor. Such models, which must be both provably conservative and reasonably accurate, can be constructed by using an execution model called abstract interpretation [37]. This technique traverses the software code while building a symbolic model, often in the form of linear inequalities [38,39], which represents the requests that the software makes to the underlying hardware (e.g., code fetches, data loads and stores, code execution). A solution to those inequalities then represents the total "cost" of one execution of the given task. This can then be combined with processor, bus, cache, and main memory models, which in turn compute the cost of each of these requests in terms of time (clock cycles) or energy. This finally results in a complete model for the cost of mapping that task to those architectural resources. One common concrete form of such inequality systems is sketched below.
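The following sketch, written in our own notation (an implicit-path-enumeration style formulation; the cited works may differ in detail), bounds the WCET of a task by an integer linear program over basic-block execution counts:

$$\mathrm{WCET} \;\le\; \max \sum_{i \in B} c_i\, x_i$$

subject to structural flow constraints and loop bounds,

$$x_i \;=\; \sum_{e \in \mathrm{in}(i)} f_e \;=\; \sum_{e \in \mathrm{out}(i)} f_e, \qquad x_{\mathrm{entry}} = 1, \qquad x_b \;\le\; n_b\, x_h,$$

where $B$ is the set of basic blocks, $c_i$ is the cycle cost of block $i$ obtained from the processor, cache, and memory models, $x_i$ and $f_e$ are block and edge execution counts, and $n_b$ is a user-supplied iteration bound for a loop with body $b$ and header $h$. Solving the program yields a conservative cost for one execution of the task.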
Another technique for software performance analysis, which does not require detailed models of the hardware, uses an approximate compilation step from the functional model to an executable model (rather than a set of inequalities as above), annotated with the same set of fetch, load, store, and execute requests. Then simulation is used, in a more traditional setting, to analyze the cost of implementing that functionality on a given processor, bus, cache, and memory configuration. Simulation is more effective than WCET analysis in handling multiprocessor implementations, in which bus conflicts and cache pollution can be difficult, if not utterly impossible, to predict statically in a manner that is not too conservative. However, its success in identifying the true worst case depends on the designer's ability to provide the appropriate simulation scenarios. Coverage enhancement techniques from the hardware verification world [40,41] can be extended to help in this case as well.

Similar abstract models can be constructed in the case of implementation as dedicated hardware, by using high-level synthesis techniques. Such techniques are not yet good enough to generate production-quality RTL code, but can be considered a reasonable estimator of area, timing, and energy costs for both ASIC and FPGA implementations [42–44]. SystemC [29] and SpecC [45,46], on the other hand, are more traditional modeling and simulation languages, for which the design flow is based on successive refinement rather than codesign or mapping. Finally, OPNET [47] and NS [48] are simulators with a rich modeling library specialized for wireline and wireless networking applications. They help the designer in the more abstract task of generic performance analysis, without the notion of function–architecture separation and codesign.

Communication performance analysis, on the other hand, is generally not done using approximate compilation or WCET analysis techniques like those outlined above. Communication is generally implemented not by synthesis but by refinement, using patterns and "recipes," such as interrupt-based, DMA-based, and so on. Thus several design environments and languages at the function–architecture level, such as POLIS, COSMOS, Roses, SystemC, and SpecC, as well as N2C [6], provide mechanisms to replace abstract communication, for example, FIFO-based or discrete-event-based, with detailed protocol stacks using buses, interrupt controllers, memories, drivers, and so on. These refinements can then be estimated either using a library-based approach (they are generally part of a library of implementation choices anyway), or sometimes using the approaches described above for computation. Their cost and performance can thus be combined in an overall system-level performance analysis.

However, approximate performance analysis is often not good enough, and a more detailed simulation step is required. This can be achieved by using tools such as Seamless [49], CoMET [50], MaxSim [51], and N2C [6]. They work at a lower abstraction level, by cosimulating software running on Instruction Set Simulators (ISSs) and hardware running in a Verilog or VHDL simulator. While the simulation is often slower than with more abstract models, and dramatically slower than with static estimators, the precision can now be at the cycle level. Thus it permits close investigation of detailed communication aspects, such as interrupt handling and cache behavior. These approaches are further discussed in the next section.

The key advantage of using the mapping-based approach over the traditional design–evaluate–redesign one is the speed with which design space exploration can be performed.
This is done by setting up experiments that change either mapping choices or parameters of the architecture (e.g., cache size, processor speed, or bus bandwidth). Key decisions, such as the number of processors and the organization of the bus hierarchy, can thus be based on quantitative, application-dependent data, rather than on past experience alone. If mapping can then be used to drive synthesis, in addition to simulation and formal verification, the advantages in terms of time-to-market and reduction of design effort are even more significant.

Model-based code generation, as mentioned in the previous section, is reasonably mature, especially for embedded software in application areas such as avionics, automotive electronics, and telecommunications. In these areas, considerations other than absolute minimum memory footprint and execution time, for example, safety, sheer complexity, and time-to-market, dominate the design criteria. At the very least, if some form of automated model-based synthesis is available, it can be used to rapidly generate FPGA- and processor-based prototypes of the embedded system. This significantly speeds up verification with respect to workstation-based simulation. It even permits some hardware-in-the-loop validation for cases (e.g., the notion of "driveability" of a car) in which no formalization or simulation is possible, but a real physical experiment is required.
3.5 Hardware–Software Coverification and Hardware Simulation

Traditionally, the term "hardware–software codesign" has been identified with the ability to execute a simulation of the hardware and the software at the same time. We prefer to use the term "hardware–software coverification" for this task, and to leave codesign for the synthesis- and mapping-oriented approaches outlined in the previous section. In the form of simultaneously running an ISS and a Hardware Description Language (HDL) simulator, while keeping the timing of the two synchronized, the area is not new [52]. In recent years, however, we have seen a number of approaches to speeding up the task, in order to tackle platforms with several processors, and the need, for example, to boot an operating system in order to coverify a platform with a processor and its peripherals. Recent techniques have addressed the three main ways in which cosimulation speed can be increased.

Accelerate the hardware simulator. Coverification generally works at the "clock cycle accurate" level, meaning that both the hardware simulator and the ISS view time as a sequence of discrete clock cycles, ignoring finer aspects of timing (sometimes clock phases are considered, e.g., for DSP systems, in which different memory banks are accessed in different phases of the same cycle). This allows one to speed up simulation with respect to traditional event-driven logic simulation, and yet retain enough precision to identify bottlenecks such as interrupt service latency or bus arbitration overhead. Native-code hardware simulation (e.g., NCSim [28]) and emulation (e.g., QuickTurn [28] and Mentor Emulation [49]) can be used to further speed up hardware simulation, at the expense of longer compilation times and much higher costs, respectively.

Accelerate the ISS. Compiled-code simulation has been a popular topic in this area as well [53]. The technique compiles a piece of assembler or C code for a target processor into object code that can be run on a host workstation; this code generally also contains annotations counting clock cycles by modeling the processor pipeline (a sketch of such generated code is shown after this list). The speed-up that can be achieved with this technique over a traditional ISS, which fetches, decodes, and executes each target instruction individually, is significant (at least one order of magnitude). Unfortunately, this technique is not suitable for self-modifying code, such as that of an RTOS. This means that it is difficult to adapt to modern embedded software, which almost invariably runs under RTOS control, rather than on the bare CPU. However, hybrid techniques involving partial compilation on the fly are reportedly used by companies selling fast ISSs [50,51].

Accelerate the interface between the two simulators. This is the area where the earliest work was performed. For example, Seamless [49] uses sophisticated filters to avoid sending requests for memory accesses over the CPU bus. This allows the bus to be used only for peripheral accesses, while memory data are provided to the processor directly by a "memory server," a simulation filter sitting between the ISS and the HDL simulator. The filter reduces stimulation of the HDL simulator, and can thus result in speed-ups of one or more orders of magnitude when most of the bus traffic consists of filtered memory accesses. Of course, the precision of the analysis also drops, since, for example, it becomes harder to identify an overload on the processor bus due to a combination of memory and peripheral accesses, because no simulator component sees both.
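As promised above, here is a purely illustrative sketch of the kind of host code a compiled-code ISS might emit for one target basic block; the instruction sequence, cycle counts, and memory helper are all invented, not taken from any real tool.

```c
/* Sketch of compiled-code ISS output: each target basic block becomes a
 * host function that updates architectural state and a cycle counter
 * modeling the pipeline; a dispatch loop chains the blocks. */
#include <stdint.h>
#include <stdio.h>

typedef struct { uint32_t r[16]; uint64_t cycles; } cpu_t;

static uint32_t mem_read32(uint32_t addr) { (void)addr; return 0; } /* stub */

/* Target block at 0x8000:  ADD r1,r2,r3 ; LDR r4,[r1]  */
static void block_8000(cpu_t *s)
{
    s->r[1] = s->r[2] + s->r[3];   s->cycles += 1;  /* ALU op: 1 cycle */
    s->r[4] = mem_read32(s->r[1]); s->cycles += 2;  /* load: 2 cycles  */
}

int main(void)
{
    cpu_t s = { { 0 }, 0 };
    block_8000(&s);               /* a real ISS chains blocks in a loop */
    printf("cycles: %llu\n", (unsigned long long)s.cycles);
    return 0;
}
```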
In the HDL domain, as mentioned above, progress in performance has been achieved essentially by raising the level of abstraction. A "cycle-based" simulator, that is, one that ignores the timing information within a clock cycle, can be dramatically faster than one that requires a timing queue to manage time-tagged events. This is mainly due to two reasons. The first is that most of the simulation can now be executed unconditionally, at every simulation clock cycle. This makes it much more parallelizable, whereas event-driven simulators do not map well onto parallel machines, due to the presence of the centralized timing queue. Of course, there is a penalty if most of the hardware is generally idle, since it has to be evaluated anyway, but clock gating techniques developed for low-power design can obviously be applied here. The second is that the overhead of managing the time queue, which often accounts for 50% to 90% of event-driven simulation time, can be completely eliminated. A minimal sketch of such a two-phase, cycle-based simulation loop follows.
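The sketch evaluates all combinational logic in every clock cycle and then commits all register values, with no time-ordered event queue; the "design" is an invented 2-bit counter with an enable input.

```c
/* Minimal sketch of a two-phase, cycle-based simulation kernel. */
#include <stdint.h>
#include <stdio.h>

static uint8_t q, q_next;                   /* register and its D input */

static void eval_combinational(uint8_t enable)
{
    q_next = enable ? (uint8_t)((q + 1) & 3) : q;
}

static void update_registers(void)          /* models the clock edge */
{
    q = q_next;
}

int main(void)
{
    for (int cycle = 0; cycle < 8; cycle++) {
        eval_combinational(1);   /* phase 1: evaluate everything        */
        update_registers();      /* phase 2: commit all register values */
        printf("cycle %d: q = %u\n", cycle, q);
    }
    return 0;
}
```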
Modern HDLs either are totally cycle-based (e.g., SystemC 1.0 [29]) or have a "synthesizable subset," which is fully synchronous and thus fully compilable to cycle-based simulation. The same synthesizable subset, incidentally, is also supported by hardware emulation techniques, for obvious reasons.

Another interesting area of cosimulation in embedded system design is analog–digital cosimulation. Such systems quite often include analog components (amplifiers, filters, A/D and D/A converters, demodulators, oscillators, phase-locked loops [PLLs], etc.), and models of the environment quite often involve only continuous variables (distance, time, voltage, etc.). Simulink includes a component for simulating continuous-time models, employing a variety of numerical integration methods, which can be freely mixed with discrete-time sampled-data subsystems. This is very useful when modeling and simulating, for example, a control algorithm for automotive electronics, in which the engine dynamics are modeled with differential equations, while the controller is described as a set of blocks implementing a sampled-time subsystem (a minimal sketch of this mixed style is shown at the end of this section). Simulink is still mostly used to drive software design, despite good toolkits implementing it in reconfigurable hardware [54,55].

Simulators in the hardware design domain, on the other hand, generally use HDLs as their input languages. Analog extensions of both VHDL [56] and Verilog [57] are available. In both cases, one can represent quantities that satisfy either of Kirchhoff's Laws (i.e., conserved over cycles or nodes). Thus one can easily build netlists of analog components interfacing with the digital portion, which is modeled using traditional Boolean or multivalued signals. The simulation environment then takes care of synchronizing the event-driven portion and the continuous-time portion.

A key problem here is to avoid causality errors, when an event that happens later in "host workstation" time (because the simulator takes care of it later) has an effect on events that preceded it in "simulated time." In this case, one of the simulators has to "roll back" in time, undoing any potential changes in the state of the simulation, and restart with the new information that something has happened in the past (generally the analog simulator does this, since it is easier to reverse time in that case). Also in this case, as we have seen for hardware–software cosimulation, execution is much slower than in the pure event-driven or cycle-based case, due to the need to take small simulation steps in the analog part.

There is only one case in which the performance of the interface between the two domains, or of the continuous-time simulator, is not problematic: when the continuous-time part is much slower in reality than the digital part. A classical example is automotive electronics, in which mechanical time constants are larger by several orders of magnitude than the clock period of a modern integrated circuit. Thus the performance of continuous-time electronics and mechanical cosimulation may not be the bottleneck, except in the case of extremely complex environment models with huge systems of differential equations (e.g., accurate combustion engine models). In that case, hardware emulation of the differential equation solver is the only option (e.g., see Reference 16).
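As a closing illustration of the mixed continuous/discrete style discussed above, the following sketch integrates an invented first-order plant, dx/dt = -x + u, with forward Euler at a small step, while a sampled-data controller runs only once per sampling period. All numbers are arbitrary.

```c
/* Minimal sketch of mixed continuous/discrete simulation. */
#include <stdio.h>

int main(void)
{
    const double h  = 1e-4;             /* integration step [s]         */
    const double Ts = 1e-2;             /* controller sample time [s]   */
    const int    n  = (int)(Ts / h);    /* integration steps per sample */
    double x = 1.0, u = 0.0;            /* plant state, control input   */

    for (int k = 0; k < 10000; k++) {   /* simulate 1 s of real time    */
        if (k % n == 0)
            u = -2.0 * x;               /* discrete controller: u = -Kx */
        x += h * (-x + u);              /* continuous plant step        */
    }
    printf("x(1 s) = %g\n", x);
    return 0;
}
```

A production simulator would use variable-step, higher-order integration and proper zero-crossing detection; the point here is only the interleaving of the two time bases.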
3.6 Software Implementation

The next two sections provide an overview of traditional design flows for embedded hardware and software. They are meant as a general introduction to the topics described in the rest of the book, and as a source of references to standard design practice.

The software components of an embedded system are generally implemented using the traditional design–code–test–debug cycle, which is often represented as a V-shaped diagram to illustrate the fact that every implementation level of a complex software system must have a corresponding verification level (Figure 3.4). The parts of the V-cycle that relate to system design and partitioning have been described in the previous sections. Here we outline the tools that are available to the embedded software developer.

FIGURE 3.4 V-cycle for software implementation. (The descending branch runs from requirements through function and system analysis, system design partitioning, and SW design specification down to implementation; the ascending branch runs back up through SW integration, subsystem and communication testing, and system validation to the delivered product.)
3.6.1 Compilation, Debugging, and Memory Model

Compilation of mathematical formulas into binary machine-executable code followed almost immediately after the invention of the electronic computer. The first Fortran compiler dates back to 1954, and subroutines
were introduced in 1958, resulting in the creation of the Fortran II language. Since then, languages have evolved a little, more structured programming methodologies have been developed, and compilers have improved quite a bit, but the basic method has remained the same. In particular, the C language, originally designed by Dennis Ritchie between 1969 and 1972, popularized by Kernighan and Ritchie's book [58], and used extensively for programming the Unix operating system, is now dominant in the embedded system world, having almost replaced the more flexible but much more cumbersome and less portable assembler. Its descendants Java and C++ are beginning to make some inroads, but are still viewed as requiring too much memory and computing power for widespread embedded use. Java, although originally designed for embedded applications [59,60], has a memory model based on garbage collection that still defies effective embedded real-time implementation [61].

The first compilation step from a high-level language is the conversion of the human-written or machine-generated code into an internal format, called an Abstract Syntax Tree [62], which is then translated into a representation that is closer to the final output (generally assembler code) and is suitable for a host of optimizations. This representation can take the form of a control/dataflow graph or a sequence of register transfers. The internal format is then mapped, generally via a graph-matching algorithm, to the set of available machine instructions, and written out to a file. The resulting set of assembler files, in which references to data variables and to subroutine names are still based on symbolic labels, is then converted to an absolute binary file, in which all addresses are explicit. This phase is called assembly and loading. Relocatable code generation techniques, which basically permit code and its data to be placed anywhere in memory without requiring recompilation, are now being used in the embedded system domain as well, thanks to the availability of index registers and relative addressing modes in modern microprocessors.

Debuggers for modern embedded systems are much more vital than for general-purpose programming, due to the more limited accessibility of the embedded CPU (often there is no file system, and only a limited display and keyboard, etc.). They must be able to show several concurrent threads of control as they interact with each other and with the underlying hardware. They must also do so while minimally disrupting normal operation of the system, since it often has to work in real time, interacting with its environment. Both hardware and operating system support are essential, and the main RTOS vendors, such as WindRiver, all provide powerful interactive multitask debuggers. Hardware support takes the form of breakpoint and watchpoint registers, which can be set to interrupt the CPU when a given address is used for fetching or
data load/store, without requiring one to change the code (which may be in ROM) or to continuously monitor data accesses, which would dramatically slow down execution.

A key difference between most embedded software and most general-purpose software is the memory model. In the latter case, memory is viewed as an essentially infinite uniform linear array, and the compiler provides a thin layer of abstraction on top of it, by means of arrays, pointers, and records (or structs). The operating system generally provides virtual memory capabilities, in the form of user functions to allocate and deallocate memory, and by swapping less frequently used pages of main memory to disk. This provides the illusion of a memory as large as the disk area allocated to paging, but with the same direct addressability characteristics as main memory.

In embedded systems, however, memory is an expensive resource, in terms of both size and speed. Cost, power, and physical size constraints generally forbid the use of virtual memory, and performance constraints force the designer to always carefully lay out data in memory, and to match the characteristics of each memory type (SRAM, DRAM, Flash, ROM) to those of the data and code. Scratchpads [63], that is, manually managed areas of small and fast memory, often on-chip SRAM, are still dominant in the embedded world (a minimal sketch of explicit scratchpad placement is shown below). Caches are frowned upon in the real-time application domain, since the time at which a computation is performed often matters much more than the accuracy of its result. Despite a large body of research devoted to the timing analysis of software code in the presence of caches (e.g., see References 64 and 65), cache performance must still be assumed to be worst-case, rather than average-case as in general-purpose and scientific computing, thus leading to poor performance at a high cost (large and power-hungry tag arrays). However, compilers that traditionally focused on code optimizations for various underlying architectural features of the processor [66] now offer more and more support for memory-oriented optimizations, in terms of scheduling data transfers, sizing memories of various types, and allocating data to memory, sometimes moving it back and forth between fast, expensive storage and slow, cheap storage [2,63]. (While this may seem similar to virtual memory techniques, it is generally done explicitly, always keeping cost, power, and performance under tight control.)
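A minimal sketch of such explicit placement, assuming a GCC-style toolchain whose linker script maps the (assumed) ".scratchpad" output section onto the on-chip SRAM; the names and sizes are invented.

```c
/* Explicit data placement: hot data in on-chip SRAM, cold data in ROM. */
#include <stdint.h>

/* Frequently accessed filter state: placed in fast on-chip SRAM via a
 * dedicated linker section. */
static int32_t fir_state[64] __attribute__((section(".scratchpad")));

/* Rarely touched calibration table: left in slow off-chip flash/ROM. */
static const int32_t cal_table[256] = { 0 /* ... */ };

int32_t fir_tap(unsigned i)
{
    return fir_state[i & 63] + cal_table[i & 255];
}
```

Unlike a cache, this layout is decided at design time, so the access latency of each array is known and can be used directly in timing analysis.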
3.6.2 Real-Time Scheduling

Another key difference with respect to general-purpose software is the real-time character of most embedded software, due to its continual interaction with an environment that seldom can wait. In hard real-time applications, results produced after the deadline are totally useless. In soft real-time applications, on the other hand, a merit function measures the QOS, allowing one to evaluate trade-offs between missing various deadlines and degrading the precision or resolution with which computations are performed. While the former are often associated with safety-critical (e.g., automotive or avionics) applications and the latter with multimedia and telecommunication applications, algorithm design can make a difference even within the very same domain.

Consider, for example, a frame decoding algorithm that generates its result at the end of each execution, and that is scheduled to execute once every fiftieth of a second. If the CPU load does not allow it to complete each execution before the deadline, the algorithm will not produce any results, and thus behaves as a hard real-time application, without being life-threatening. On the other hand, a smarter algorithm or a smarter scheduler would just reduce the frame size or the frame rate whenever the CPU load due to other tasks increases, and thus produce a result that has lower quality, but is still viewable.

A huge amount of research, summarized in excellent books such as References 23–25, has been devoted to solving the problems introduced by real-time constraints on embedded software. Most of this work models the system (application, environment, and platform) in very abstract terms, as a set of tasks, each with a release time (when the task becomes ready), a deadline (by which the task must complete), and a WCET. In most cases tasks are periodic, that is, the release times and deadlines of multiple instances of the same task are separated by a fixed period. The job of the scheduler is to find an execution order such that each task can complete by its deadline, if such an order exists. The scheduler may or may not, depending on the underlying hardware and software platform (CPU, peripherals, and RTOS), be able to preempt an executing
task in order to execute another one. Generally the scheduler bases its preemption decision, and the choice of which task to run next, on an integer rank assigned to each task, called its priority. Priorities may be assigned statically, at compile time, or dynamically, at runtime. The trade-off is between the use of precious CPU resources for runtime (also called online) priority assignment, based on an observation of the current execution conditions, and the waste of resources inherent in an a priori priority assignment. A scheduling algorithm is also generally expected to be able to tell conservatively whether a set of tasks is unschedulable on a given platform, under a given set of modeling assumptions (e.g., availability of preemption, fixed or stochastic execution times, and so on); a classical test of this kind is sketched at the end of this subsection. Unschedulability may occur, for example, because the CPU is not powerful enough and the WCETs are too long to satisfy some deadline. In this case the remedy could be the choice of a faster clock frequency, a change of CPU, the transfer of some functionality to a hardware coprocessor, or the relaxation of some of the constraints (periods, deadlines, etc.).

A key distinction in this domain is between time-triggered and event-triggered scheduling [67]. The former (also called Time-Division Multiple Access in telecommunications) relies on the fact that the start, preemption (if applicable), and end times of all instances of all tasks are decided a priori, based on worst-case analysis. The resulting system implementation is very predictable and easy to debug, and allows one to guarantee some service even under fault hypotheses [68]. The latter decides start and preemption times based on the actual time of occurrence of the release events, and possibly on the actual execution times (shorter than worst-case). It is more efficient than time-triggering in terms of CPU utilization, especially when release and execution times are not known precisely but are subject to jitter. It is, however, more difficult to use in practice, because it requires some form of conservative schedulability analysis a priori, and because the dynamic nature of event arrivals makes troubleshooting much harder. Some models and languages listed above, such as synchronous languages and dataflow networks, lend themselves well to time-triggered implementations.

Some form of time-triggered scheduling is being, or will most likely be, used for both CPUs and communication resources in safety-critical applications. This is already state of the art in avionics ("fly-by-wire," as used, e.g., in the Boeing 777 and in all Airbus models), and it is being seriously considered for automotive applications ("X-by-wire," where X can stand for brake, drive, or steer). Coupled with certified high-level language compilers and standardized code review and testing processes, it is considered to be the only mechanism able to comply with the rules imposed by various governmental certification agencies. Moving such control functions to embedded hardware and software, thus replacing older mechanical parts, is considered essential in order to both reduce costs and improve safety: embedded electronic systems can continuously analyze possible wear and faults in the sensors and actuators, and thus warn drivers or maintenance teams.
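For the basic periodic model just described (independent, preemptable tasks with deadlines equal to periods and static priorities ordered by rate), a classical sufficient schedulability test is the Liu and Layland utilization bound, U = sum of Ci/Ti <= n(2^(1/n) - 1). A minimal sketch, with an invented task set:

```c
/* Liu--Layland rate-monotonic schedulability test (sufficient only). */
#include <math.h>
#include <stdio.h>

int rm_schedulable(const double wcet[], const double period[], int n)
{
    double u = 0.0;
    for (int i = 0; i < n; i++)
        u += wcet[i] / period[i];              /* total CPU utilization */
    double bound = n * (pow(2.0, 1.0 / n) - 1.0);
    return u <= bound;                         /* 1 = provably schedulable */
}

int main(void)
{
    double c[] = {  1.0,  2.0,  4.0 };  /* WCETs, in msec (invented)  */
    double t[] = { 10.0, 20.0, 50.0 };  /* periods = deadlines, msec  */
    printf("schedulable by the RM bound: %s\n",
           rm_schedulable(c, t, 3) ? "yes" : "no");
    return 0;
}
```

The test is only sufficient: a task set that fails it may still be schedulable, and an exact response-time analysis can then be used instead.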
The simple task-based model outlined above can also be modified in various ways in order to take into account:

• The cost of various housekeeping operations, such as recomputing priorities, swapping tasks in and out (also called "context switch"), accessing memory, and so on.
• The availability of multiple resources (processors).
• The fact that a task may need more than one resource (e.g., the CPU, a peripheral, a lock on a given part of memory), and possibly may have different priorities and different preemptability characteristics on each such resource (e.g., CPU access may be preemptable, while disk or serial line access may not).
• Data or control dependencies between tasks.

Most of these refinements of the initial model can be taken into account by appropriately modifying the basic parameters of a task set (release time, execution time, priority, and so on). The only exception is the extension to multiple concurrent CPUs, which makes the problem substantially more complex. We refer the interested reader to References 23–25 for more information about this subject. This sort of real-time schedulability analysis is currently replacing manual trial-and-error and extensive simulation as a means to ensure satisfaction of deadlines or a given QOS requirement.
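As a concrete illustration of the kind of conservative schedulability test discussed above, the sketch below checks the classic Liu–Layland utilization bound for rate-monotonic scheduling of independent periodic tasks on one preemptive CPU. The test is sufficient but not necessary, matching the conservative flavor described in the text; the task set and its WCET/period values are invented for the example.

```c
#include <math.h>
#include <stdio.h>

/* A periodic task with worst-case execution time (WCET) C and period T.
   Deadlines equal periods, tasks are independent and fully preemptable
   (the classic Liu-Layland assumptions). */
typedef struct { double wcet; double period; } task_t;

/* Sufficient (conservative) test for rate-monotonic scheduling:
   the set is schedulable if U = sum(Ci/Ti) <= n * (2^(1/n) - 1).
   Failing the test does NOT prove unschedulability; an exact
   response-time analysis would then be needed. */
int rm_schedulable(const task_t *tasks, int n) {
    double u = 0.0;
    for (int i = 0; i < n; i++)
        u += tasks[i].wcet / tasks[i].period;
    double bound = n * (pow(2.0, 1.0 / n) - 1.0);
    printf("U = %.3f, bound = %.3f\n", u, bound);
    return u <= bound;
}

int main(void) {
    /* Hypothetical task set: WCETs and periods in milliseconds. */
    task_t set[] = { {1.0, 10.0}, {2.0, 20.0}, {4.0, 40.0} };
    printf("schedulable: %s\n", rm_schedulable(set, 3) ? "yes" : "no");
    return 0;
}
```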
3.7 Hardware Implementation

The modern hardware implementation process [69,70] in most cases starts from the so-called RTL. At this level of abstraction the required functionality of the circuit is modeled with the accuracy of a clock cycle; that is, it is known in which clock cycle each operation, such as an addition or a data transfer, occurs, but the actual delay of each operation, and hence the stabilization time of data on the inputs of the registers, is not known. At this level the number of registers and their bitwidths are also precisely known. The designer usually writes the model using an HDL, such as Verilog or VHDL, in which registers are represented using special kinds of "clock-triggered" assignments, and combinational logic operations are represented using the standard arithmetic, relational, and Boolean operators that are familiar to software programmers using high-level languages.

The target implementation generally is not in terms of individual transistors and wires, but uses the Boolean gate abstraction as a convenient hand-off point between logic designer and technology specialist. Such an abstraction can take the form of a standard cell, that is, an interconnection of transistors realized and well characterized on silicon, which implements a given Boolean function and exhibits a specific propagation delay from inputs to outputs under given supply, temperature, and load conditions. It can also be a Combinational Logic Block (CLB) in an FPGA. The former, which is the basis of the modern ASIC design flow, is much more efficient than the latter (the difference is about one order of magnitude in terms of area, power, and performance for the current fabrication technology, and the ratio is expected to remain constant over future technology generations); however, it requires a very significant investment in terms of EDA tools (the term EDA, which stands for Electronic Design Automation, is often used to distinguish this class of tools from the CAD tools used for mechanical and civil engineering design), mask production costs, and engineer training.

The advantage of ASICs over FPGAs in terms of area, power, and performance efficiency comes from two main factors. The first one is the broader choice of basic gates: an average standard cell library includes about 100 to 500 gates, with both different logic functions and different drive strengths, while a given FPGA contains only one type of CLB. The second one is the use of static interconnection techniques, that is, wires and contact vias, versus the transistor-based dynamic interconnects of FPGAs. The much higher nonrecurring engineering cost of ASICs comes first of all from the need to create at least one set of masks for each design (assuming it is correct the first time, that is, there is no need to respin), which can be up to about $1 million for current technologies and is growing very fast, and from the long fabrication times, which can be up to several weeks. Design costs are also higher, again in the million-dollar range, both due to the much greater flexibility, requiring skilled personnel and sophisticated implementation tools, and due to the very high cost of design failure, requiring sophisticated verification tools. Thus ASIC designs are the most economically viable solution only for very high volumes. The rising mask costs and manufacturing risks are making the FPGA option viable for larger and larger production counts as technology evolves. A third alternative, structured ASICs, has been proposed recently. It features fixed layout schemes, similar to FPGAs, but implements interconnect using contact vias. A comparison of the alternatives, for a given design complexity and varying production volumes, is shown in Figure 3.5 (the exact points at which each alternative is best are still subject to debate, and they are moving to the right over time).

FIGURE 3.5 Comparison between ASIC, FPGA, and Structured ASIC production costs. (Total cost versus production volume: FPGAs are cheapest in region A, roughly 1–10,000 units; structured ASICs, SA, in region B, roughly 10,000–100,000 units; standard cells in region C, above 100,000 units.)
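To make the volume trade-off of Figure 3.5 concrete, the following sketch computes break-even volumes between implementation styles from a simple linear cost model (total cost = NRE + unit cost × volume). All cost figures are invented placeholders, not data for any real technology; as noted above, the real crossover points are debated and keep moving.

```c
#include <stdio.h>

/* Total production cost model: cost = NRE + unit_cost * volume.
   The break-even volume between two styles is where the two lines
   of Figure 3.5 cross: NRE_a + ua*v = NRE_b + ub*v. */
typedef struct { const char *name; double nre; double unit; } style_t;

/* Assumes a.unit > b.unit (the cheaper-NRE style has the higher
   per-unit cost), so the crossover volume is positive. */
static double break_even(style_t a, style_t b) {
    return (b.nre - a.nre) / (a.unit - b.unit);
}

int main(void) {
    /* Hypothetical numbers: FPGAs have negligible NRE but high unit
       cost; standard-cell ASICs have high NRE (masks, design effort)
       but low unit cost; structured ASICs sit in between. */
    style_t fpga = { "FPGA",            10e3,  50.0 };
    style_t sa   = { "structured ASIC", 200e3, 10.0 };
    style_t asic = { "standard cell",   1.5e6,  2.0 };

    printf("%s -> %s crossover: %.0f units\n",
           fpga.name, sa.name, break_even(fpga, sa));
    printf("%s -> %s crossover: %.0f units\n",
           sa.name, asic.name, break_even(sa, asic));
    return 0;
}
```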
3.7.1 Logic Synthesis and Equivalence Checking

The semantics of HDLs and of languages such as C or Java are very different from each other. HDLs were born in the 1970s in order to model highly concurrent hardware systems, built from registers and Boolean gates. They, and the associated simulators that allow one to analyze the behavior of the modeled design in detail, are very efficient in handling fine-grained concurrency and synchronization, which is necessary when simulating huge Boolean netlists. However, they often lack constructs found in modern programming languages, such as recursive functions and complex data types (only recently introduced in Verilog), or objects, methods, and interfaces. An HDL model is essentially meant to be simulated under a variety of timing models (generally at the register transfer or gate level, even though cosimulation with analog components or continuous-time models is also supported, for example, in Verilog-AMS and AHDL).
Synthesis from an HDL into an interconnection of registers and gates normally consists of two substeps. The first one, called RTL synthesis and module generation, transforms high-level operators, such as adders, multiplexers, and so on, into Boolean gates using an appropriate architecture (e.g., ripple carry or carry lookahead). The second one, called logic synthesis, optimizes the combinational logic resulting from the above step, under a variety of cost and performance constraints [71,72]. It is well known that, given a function to be implemented (e.g., 32-bit two's-complement addition), one can use the properties of Boolean algebra in order to find alternative implementations with different characteristics in terms of:

1. Area, for example, estimated as the number of gates, or as the number of gate inputs, or as the number of literals in the Boolean expression representing each gate function, or using a specific value for each gate selected from the standard cell library, or even considering an estimate of interconnect area. This sequence of cost functions increases estimation precision, but is more and more expensive to compute.
2. Delay, for example, estimated as the number of levels, or more precisely as a combination of levels and fanout of each gate, or even more precisely as a table that takes into account gate type, transistor size, input transition slope, output capacitance, and so on.
3. Power, for example, estimated as transition activity times capacitance times voltage squared, using the well-known equation valid for Complementary MOS (CMOS) transistors.

It is also well known that Pareto-optimal solutions to this problem generally exhibit an area–delay product that is approximately constant for a given function. Modern EDA tools, such as Design Compiler from Synopsys [8], RTL Compiler from Cadence [28], Leonardo Spectrum from Mentor Graphics [49], Synplify from Synplicity [73], Blast Create from Magma Design Automation [74], and others, perform this task efficiently for designs that today may include a few million gates. Their widespread adoption has enabled designers to tackle huge designs in a matter of months, which would have been unthinkable or extremely inefficient using either manual or purely block-based design techniques. Such logic synthesis systems take into account the required functionality, the target clock cycle, and the set of physical gates that are available for implementation (the standard-cell library or the CLB characteristics, e.g., number of inputs), as well as some estimates of the capacitance and resistance of interconnection wires (some such tools also include rough placement and routing steps, which will be described below, in order to increase the precision of such interconnect estimates for current deep submicron, DSM, technologies), and generate efficient netlists of Boolean gates, which can be passed on to the following design steps.
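As a small illustration of the power estimate in item 3 of the list above, this sketch evaluates the standard CMOS dynamic power equation P = α·C·V²·f for a single node; the activity, capacitance, voltage, and frequency figures are made up for the example.

```c
#include <stdio.h>

/* Dynamic power of a CMOS node: P = alpha * C * V^2 * f, where
   alpha is the switching activity (transitions per clock cycle),
   C the switched capacitance, V the supply voltage, and f the
   clock frequency. Summing over all nodes estimates chip power. */
static double dynamic_power(double alpha, double cap_f,
                            double vdd, double freq_hz) {
    return alpha * cap_f * vdd * vdd * freq_hz;
}

int main(void) {
    /* Hypothetical gate output: 10 fF load, 1.2 V supply, 500 MHz
       clock, switching on average once every five cycles. */
    double p = dynamic_power(0.2, 10e-15, 1.2, 500e6);
    printf("P = %.2f uW\n", p * 1e6);
    return 0;
}
```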
While synthesis is performed using precise algebraic identities, bugs can creep into any program. Thus, in order to avoid extremely costly respins due to an EDA tool bug, it is essential to verify that the functionality of the synthesized gate netlist is the same as that of the original RTL model. This verification step was traditionally performed using a multilevel HDL simulator, comparing responses to designer-written stimuli in both representations. However, multimillion-gate circuits would require too many very slow simulation steps (a large circuit today can be simulated at the speed of a handful of clock cycles per second). Formal verification is thus used to prove, using algorithms that are based on the same laws as the synthesis techniques, but which have been written by different people and thus hopefully have different bugs, that the responses of the two circuit models are indeed identical under all legal input sequences.

This verification, however, solves only half of the problem. One must also check that all combinational logic computations complete within the required clock cycle. This second check can be performed using timing simulators; however, complexity considerations suggest the use of a more static approach here as well. Static Timing Analysis, based on worst-case longest-path search within combinational logic, is today a workhorse of any logic synthesis and verification framework. It can either be based on purely topological information, or consider only so-called true paths along which a transition can propagate [75], or even include the effects of crosstalk on path delay. Crosstalk may alter the delay of a "victim" wire, due to simultaneous transitions of temporally and spatially close "aggressor" wires, as analyzed by tools such as PrimeTime from Synopsys [8] and CeltIc from Cadence [28]. This kind of coupling between timing and geometry makes crosstalk-aware timing analysis very hard, and essentially contributes to the breaking of the traditional boundaries between synthesis, placement, and routing.

Tools performing these tasks are available from all major EDA vendors (e.g., Synopsys, Cadence) as well as from a host of startups. Synthesis has become more or less a commodity technology, while formal verification, even in its simplest form of equivalence checking, as well as in other emerging forms, such as the property checking described below, is still an emerging technology, for which disruptive innovation occurs mostly in smaller companies.
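A minimal sketch of the topological core of Static Timing Analysis as described above: arrival times are propagated through a combinational netlist in topological order, and the largest arrival time gives the critical-path delay. False paths, crosstalk, and slope effects are deliberately ignored, and the tiny netlist with its gate delays is hypothetical.

```c
#include <stdio.h>

#define MAX_GATES 8

/* Purely topological static timing analysis: process gates in
   topological order, setting each gate's arrival time to the
   maximum arrival time of its fanins plus the gate delay. The
   largest arrival time at an output is the critical-path delay. */
typedef struct {
    double delay;      /* gate propagation delay (ns)       */
    int nfanin;        /* number of driving gates           */
    int fanin[4];      /* indices of driving gates          */
} gate_t;

int main(void) {
    /* Hypothetical netlist, already in topological order;
       gates 0 and 1 are fed by primary inputs only (t = 0). */
    gate_t g[MAX_GATES] = {
        { 0.3, 0, {0} },        /* g0: input buffer   */
        { 0.5, 0, {0} },        /* g1: input inverter */
        { 0.8, 2, {0, 1} },     /* g2: NAND(g0, g1)   */
        { 0.6, 2, {1, 2} },     /* g3: NOR(g1, g2)    */
    };
    int n = 4;
    double arrival[MAX_GATES];

    for (int i = 0; i < n; i++) {
        double t = 0.0;         /* primary inputs arrive at t = 0 */
        for (int k = 0; k < g[i].nfanin; k++)
            if (arrival[g[i].fanin[k]] > t)
                t = arrival[g[i].fanin[k]];
        arrival[i] = t + g[i].delay;
    }
    printf("critical path delay: %.2f ns\n", arrival[n - 1]);
    return 0;
}
```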
3.7.2 Placement, Routing, and Extraction

After synthesis (and sometimes during synthesis) gates are placed on silicon, either at fixed locations (the positions of the CLBs) for FPGAs and structured ASICs, or with a row-based organization for standard cell ASICs. Placement must avoid overlaps between cells, while at the same time satisfying clock cycle time constraints by avoiding excessively long wires on critical paths. (Power density has recently become a prime concern for placement as well, implying the need to avoid "hot spots" of very active cells, where power dissipation through the silicon substrate would be too difficult to manage.) Placement, especially for multimillion-gate circuits, is an extremely difficult problem, which requires complex constrained combinatorial optimization. Modern algorithms [76] drastically simplify the model in order to ensure reasonable runtimes. For example, the quadratic placement model used in several modern EDA tools minimizes the sum of squares of net lengths. This permits very efficient derivation of the cost function and fast identification of a minimum-cost solution. However, this quadratic cost only approximately correlates with the true objective, which is the minimization of the clock period, due to parasitic capacitance. The true cost first of all also depends on the actual interconnect, which is designed only later by the routing step, and second depends on the maximum among a set of sums (one for each register-to-register path), rather than on the sum over all gate-to-gate interconnects. For this reason, modern placers alternate steps solved using fast but approximate algorithms with more precise analysis phases, often involving actual routing, in order to recompute the actual cost function at each step.
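The quadratic objective such placers minimize can be written down directly. For two-pin nets it is just the sum of squared Euclidean distances between connected cells, as in the sketch below; the placement and netlist are hypothetical, and real tools add legalization to remove the overlaps this relaxation ignores.

```c
#include <stdio.h>

/* Quadratic placement objective for two-pin nets: the sum, over all
   nets, of the squared Euclidean distance between the placed cells.
   Real placers minimize this analytically (it is differentiable and
   convex), then legalize the result to remove cell overlaps. */
typedef struct { double x, y; } point_t;
typedef struct { int a, b; } net_t;

static double quadratic_cost(const point_t *cells,
                             const net_t *nets, int nnets) {
    double cost = 0.0;
    for (int i = 0; i < nnets; i++) {
        double dx = cells[nets[i].a].x - cells[nets[i].b].x;
        double dy = cells[nets[i].a].y - cells[nets[i].b].y;
        cost += dx * dx + dy * dy;
    }
    return cost;
}

int main(void) {
    /* Hypothetical placement of four cells and three nets. */
    point_t cells[] = { {0, 0}, {2, 1}, {1, 3}, {4, 2} };
    net_t nets[] = { {0, 1}, {1, 2}, {1, 3} };
    printf("quadratic wirelength: %.1f\n",
           quadratic_cost(cells, nets, 3));
    return 0;
}
```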
Routing is the next step, and involves generating (or selecting from the available prelaid-out tracks in FPGAs) the metal and via geometries that will interconnect the placed cells. It is also extremely difficult in modern submicron technologies, not only due to the huge number of geometries involved (10 million gates can easily involve a billion wire segments and contacts), but also due to the complexity of modern interconnect modeling. A wire used to be modeled, in CMOS technology, essentially as a parasitic capacitance. This (or minor variations also considering resistance) is still the model used by several commercial logic synthesis tools. However, nowadays a realistic model of a wire, to be used when estimating the cost of a placement or routing solution, must take into account:

• Realistic resistance and capacitance, for example, using the Elmore model [77], considering each wire segment separately, due to the very different resistance and capacitance characteristics of different metal layers. (Layers that are farther away from silicon are best for long-distance wires, due to the smaller substrate and mutual capacitance, as well as the smaller sheet resistance [78].)
• Crosstalk noise due to capacitive coupling. (Inductance fortunately does not yet play a significant role, and many doubt that it ever will, for digital integrated circuits.)

This means that, exactly as in placement (and sometimes during placement), one needs to alternate between fast routing using approximate cost functions and detailed analysis steps that refine the value of the cost function. Again, all major EDA vendors offer solutions to the routing problem, which are generally tightly integrated with the placement tool, even though in principle the two perform separate functions. The reason for the tight coupling lies in the above-mentioned need for the placer to accurately estimate the detailed route taken by a given interconnect, rather than just approximating it with the square of the distance between its terminals.
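Returning to the Elmore model [77] cited in the first bullet above, here is a minimal sketch for the special case of an unbranched RC ladder: the delay at the far end is the sum, over each resistor, of its resistance times all downstream capacitance. The segment values are hypothetical.

```c
#include <stdio.h>

/* Elmore delay of an RC ladder: for segments 1..n with resistance
   r[i] and node capacitance c[i], the delay at the far end is
   sum_i r[i] * (sum of c[j] for j >= i), i.e., each resistor
   charges all capacitance downstream of it. */
static double elmore_ladder(const double *r, const double *c, int n) {
    double delay = 0.0, downstream = 0.0;
    for (int i = n - 1; i >= 0; i--) {
        downstream += c[i];     /* capacitance at or beyond node i */
        delay += r[i] * downstream;
    }
    return delay;
}

int main(void) {
    /* Hypothetical wire cut into three segments: ohms and farads. */
    double r[] = { 100.0, 100.0, 100.0 };
    double c[] = { 5e-15, 5e-15, 5e-15 };
    printf("Elmore delay: %.2f ps\n", elmore_ladder(r, c, 3) * 1e12);
    return 0;
}
```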
Exactly as in the case of synthesis, a verification step must be performed after placement and routing. This is required in order to verify that:

• All design rules are satisfied by the final layout.
• All and only the desired interconnects have been realized by placement and routing.

This step is done by extracting electrical and logic models from the layout masks, and comparing these models with the input netlist (already verified for equivalence with the RTL). Note that within each standard cell, design rules are verified independently, since the ASIC designer, for reasons of intellectual property protection, generally does not see the actual layout of the standard cells, but only an external "envelope" of active (transistor) and interconnect areas, which is sufficient to perform this kind of verification. The layout of each cell is known and used only at the foundry, when the masks are finally produced.

3.7.3 Simulation, Formal Verification, and Test Pattern Generation

The steps mentioned above create a layout implementation from RTL, while checking simultaneously that no errors are introduced, either due to programming errors or due to manual modifications, and that performance and power constraints are satisfied. However, they neither ensure that the original RTL model satisfies the customer-defined requirements, nor that the circuit after manufacturing is free of flaws compromising either its functionality or its performance.

The former problem is tackled by simulation, prototyping, and formal verification. None of these techniques is sufficient to ensure that an ill-defined problem has a solution: customer needs are inherently nonformalizable. (For example, what is the definition of "a correct phone call"? Does this refer to not dropping the communication? To transferring exactly a certain number of voice samples per second? To setting up a communication path quickly? Since all these desirable characteristics have a cost, what is the maximum price various classes of customers are willing to pay for them, and what is the maximum degree of violation that can be admitted by each class?) However, these techniques help build confidence that the final product will satisfy the requirements. Simulation and prototyping are both trial-and-error procedures, similar to the compile–debug cycle used for software. Simulation is generally cheaper, since it only requires a general-purpose workstation (nowadays often a PC running Linux), while prototyping is faster (it is based on synthesizing the RTL model into one or several FPGAs). Cost and performance of these options differ by several orders of magnitude.
Prototyping on multi-FPGA platforms, such as those offered by Quickturn, is thus limited to the most expensive designs, such as microprocessors. (Nowadays even microprocessors are mostly designed using a modified ASIC-like flow, except for memories, register files, and sometimes portions of the ALU, which are still designed by hand down to the polygon level, at least for leading-edge CPUs.)

Unfortunately, both simulation and prototyping suffer from a basic capacity problem. It is true that, over technology generations, cost decreases exponentially and performance increases exponentially for the simulation and prototyping platforms (CPUs and FPGAs). However, the complexity of the verification problem grows (approximately) as a double or even triple exponential with technology. The reason is that the number of potential states of a digital design grows exponentially with the number of memory-holding components (flip-flops and latches), and the complexity of the verification problem for a sequential entity (e.g., an FSM) grows even more than exponentially with its state space. For this reason, the number of input patterns required to prove, up to a given level of confidence, that a design is correct grows triply exponentially with each technology generation, while capacity and performance grow "only" as a single exponential. This is clearly an untenable situation, given that the number of engineers is finite, and the size of verification teams is already much larger than that of design teams.

Formal verification, defined as proving semiautomatically that, under a set of assumptions, a given property holds for a design, is a means of alleviating at least the human aspect of this "verification complexity explosion" problem. Formal verification allows one to state a property, such as "this protocol never deadlocks" or "the value of this register is never overwritten before being read," using relatively simple mathematical formulas. Then one can automatically check that the property holds over all possible input sequences. The problem, unfortunately, is inherently extremely complex (the triple exponential mentioned above affects this formulation as well). However, the complexity is now relegated to the automated portion of the flow; manual generation and checking of individual pattern sequences is no longer required. Several EDA companies on the market, such as Cadence, Mentor Graphics, and Synopsys, as well as several silicon vendors, such as Intel and IBM, currently offer or internally develop and use such tools. The key barriers to adoption are twofold:

1. The complexity of the task, as mentioned above, is just shifted. While a workstation costs much less than an engineer, exponential growth is never tenable in the long term, regardless of the constant factors. This means that significant human intervention is still required in order to keep the time required to check each individual property within acceptable limits. This involves both breaking properties into simpler subproperties and abstracting away aspects of the system that are not relevant for the property at hand. Abstraction, however, hides aspects of the real design from the automated prover, and thus implies the risk of "false positive" results, that is, of declaring a system correct even when it is not.
2. Specification of properties is much more difficult than identification of input patterns. A property must encompass a variety of possible scenarios and state explicitly all assumptions made (e.g., "there is no deadlock in the bus access protocol only if no master makes requests at every clock cycle").
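To make the automated part of property checking concrete, the sketch below does explicit-state reachability over a toy transition relation and reports any reachable "deadlock" state (one with no outgoing transition). Real model checkers use symbolic (BDD-based) or SAT-based techniques precisely because this explicit enumeration succumbs to the state explosion discussed above; the FSM here is invented.

```c
#include <stdio.h>

#define NSTATES 6

/* Explicit-state reachability: explore every state reachable from
   the initial state under all inputs, and report states that have
   no successor at all ("deadlocks"). next[s][i] is the successor
   of state s under input i; -1 means no transition exists. */
static const int next[NSTATES][2] = {
    { 1, 2 }, { 3, 2 }, { 2, 4 }, { 1, 5 }, { 0, 4 }, { -1, -1 }
};

int main(void) {
    int visited[NSTATES] = { 0 }, queue[NSTATES], head = 0, tail = 0;
    queue[tail++] = 0;          /* initial state */
    visited[0] = 1;

    while (head < tail) {       /* breadth-first search */
        int s = queue[head++], stuck = 1;
        for (int i = 0; i < 2; i++) {
            int t = next[s][i];
            if (t < 0) continue;
            stuck = 0;
            if (!visited[t]) { visited[t] = 1; queue[tail++] = t; }
        }
        if (stuck)
            printf("deadlock: state %d is reachable\n", s);
    }
    return 0;
}
```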
The language in which properties are specified is often a form of mathematical logic, and thus is even less familiar than software languages to a typical design engineer. However, significant progress is being made in this area every year by researchers, and adoption of such automated formal verification techniques in the specification verification domain is growing.

Testing a manufactured circuit to verify that it operates correctly according to the RTL model is a closely related problem. In principle, one would need to prove equivalent behavior under all possible input–output sequences, which is clearly impossible. In practice, test engineers either use a "naturally orthogonal" architecture, such as that of a microprocessor, in order to functionally test small sequences of instructions, or they decompose testing into that of combinational and sequential logic. Combinational logic testing is a relatively "easy" task, as compared to the formal verification described above. If one considers only Boolean functionality (i.e., delay is not tested), its complexity (assuming that no polynomial algorithm exists for NP-complete problems) is just a single exponential in the number of combinational circuit inputs.
While a priori there is no reason why testing only Boolean equivalence between the specification and the manufactured circuit should be enough to ensure correct functionality, empirically there is significant evidence that fully testing for a relatively small class of Boolean manufacturing faults, namely stuck-at faults, is sufficient to ensure satisfactory actual yield for ASICs. The stuck-at-fault model assumes that the only problem that can occur during manufacturing is that some gate inputs are fixed at logical 0 or 1. This may have been a physically realistic model in the early days of bipolar-based Transistor–Transistor Logic. However, in DSM CMOS a host of physical defects may short wires together, increase or decrease their resistance and capacitance, short a transistor gate to its source or drain, and so on. At the logic level, a combinational function may become sequential (or, even worse, may exhibit dynamic behavior, that is, slowly change output values over time without changing inputs), or it may become faster or slower. Still, full checking for stuck-at faults is an excellent way to ensure that none of these complex physical problems has occurred, or will affect the operation of the circuit.

For this reason, today testing is mostly accomplished by first of all reducing sequential testing to combinational testing, by using special memory elements, the so-called scan flip-flops and latches. Second, combinational test pattern generation is performed only at the Boolean level, using the above-mentioned stuck-at model. Test pattern generation is similar to equivalence checking, because it amounts to proving that two copies of the same circuit, one with and one without a given fault, are indeed not equivalent. The witness to this nonequivalence is the pattern to be applied to the circuit inputs to identify the fault. The problem of actually applying the pattern to the physical fragment of combinational logic, and then observing its outputs to verify whether the fault is present, is solved by converting all or most of the registers of the sequential circuit into one (or a handful of) giant shift registers, each including several hundred thousand bits. The pattern (and several others, used to test several CLBs in parallel) is first loaded serially through the shift register. Then a multiplexer at the input of each flip-flop is switched, transforming the serial loading mode into a parallel loading mode, using the outputs of each CLB as register inputs. Finally, serial conversion is performed again, and the outputs of the logic are checked for correctness by the test equipment. Figure 3.6 shows an example of this sort of arrangement, in which the flip-flop clock is also changed from normal operation (in which it can be gated) to test mode. The only drawback of this elegant solution, devised by IBM engineers in the 1970s, is the additional time that the circuit needs to spend on very expensive testing machines, in order to shift patterns in and out through very long flip-flop chains. Test pattern generation for combinational circuits is a very well-established area of research, and again the reader is referred to one of many books in the area for a more extensive description [79].

FIGURE 3.6 Two scan flip-flops with combinational logic. (Each flip-flop provides a serial scan output Sout in addition to its normal output Q, and is controlled by the Test_Data, Test_Mode, Test_Clk, and User_Clk signals.)
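As an illustration of the equivalence between test pattern generation and nonequivalence checking described above, a brute-force sketch: enumerate the input patterns of a tiny combinational circuit until the fault-free copy and a copy with an injected stuck-at-0 fault disagree; that pattern is the witness, that is, the test. Production ATPG tools use structural algorithms rather than enumeration, and the circuit here is invented.

```c
#include <stdio.h>

/* A tiny combinational circuit: y = (a AND b) OR c. The last
   argument injects a fault: when nonzero, the internal AND output
   is stuck-at-0. Test generation = finding an input pattern on
   which the two versions differ (here by brute-force enumeration). */
static int eval(int a, int b, int c, int stuck_at0_on_and) {
    int and_out = stuck_at0_on_and ? 0 : (a & b);
    return and_out | c;
}

int main(void) {
    for (int p = 0; p < 8; p++) {       /* all 2^3 input patterns */
        int a = (p >> 2) & 1, b = (p >> 1) & 1, c = p & 1;
        if (eval(a, b, c, 0) != eval(a, b, c, 1)) {
            printf("test pattern a=%d b=%d c=%d detects the fault\n",
                   a, b, c);
            return 0;
        }
    }
    printf("fault is undetectable (redundant logic)\n");
    return 0;
}
```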
Note that memories are not tested using this mechanism, both because it would be too expensive to convert each cell into a scan register, and because the stuck-at-fault model does not apply to this kind of circuit. Memories are tested using appropriate input–output pattern sequences, which are generated, applied, and verified on-chip, using either self-test software running on the embedded processor, or some form of Built-In Self-Test (BIST) logic circuitry. Modern RAM generators, which produce the layout directly in a given process based on the requested number of rows and columns, often directly produce the BIST circuitry as well.
3.8 Conclusions

This chapter discussed several aspects of embedded system design, including both methodologies that allow one to make judicious algorithmic and architectural decisions, and tools supporting various steps of these methodologies. One must not forget, however, that embedded systems are often complex compositions of parts that have been implemented by various parties, and thus the task of physical board or chip integration can be as difficult as, and much more expensive than, the initial architectural decisions. In order to support the integration and system testing tasks one must use formal models throughout the design process, and if possible perform early evaluation of the difficulties of integration, by virtual integration and rapid prototyping techniques. These allow one to find, or avoid completely, subtle bugs and inconsistencies earlier in the design cycle, and thus reduce overall design time and cost.

Thus the flow and tools that we described in this chapter help not only with the initial design, but also with the final integration. This is because they are based on executable specifications of the whole system (including models of its environment), early virtual integration, and systematic (often automated) refinement toward implementation. The last part of the chapter summarized the main characteristics of the current hardware and software implementation flows. While complete coverage of this huge topic is beyond its scope, a lightweight introduction can hopefully serve to direct the interested reader, who has only a general electrical engineering or computer science background, toward the most appropriate sources of information.
References

[1] F. Balarin, E. Sentovich, M. Chiodo, P. Giusto, H. Hsieh, B. Tabbara, A. Jurecska, L. Lavagno, C. Passerone, K. Suzuki, and A. Sangiovanni-Vincentelli. Hardware–Software Co-design of Embedded Systems — The POLIS Approach. Kluwer Academic Publishers, Dordrecht, 1997.
[2] F. Catthoor, S. Wuytack, E. De Greef, F. Balasa, L. Nachtergaele, and A. Vandecapelle. Custom Memory Management Methodology: Exploration of Memory Organisation for Embedded Multimedia System Design. Kluwer Academic Publishers, Dordrecht, 1998.
[3] The Mathworks Simulink and StateFlow. http://www.mathworks.com.
[4] National Instruments MATRIXx. http://www.ni.com/matrixx/.
[5] ETAS Ascet-SD. http://www.etas.de.
[6] CoWare N2C, SPW, and LISATek. http://www.coware.com.
[7] Esterel Technologies Esterel Studio. http://www.esterel-technologies.com.
[8] Synopsys Design Compiler, SystemStudio, and PrimeTime. http://www.synopsys.com.
[9] Telelogic Tau and Doors. http://www.telelogic.com.
[10] I-Logix Statemate and Rhapsody. http://www.ilogix.com.
[11] D. Harel, H. Lachover, A. Naamad, A. Pnueli, M. Politi, R. Sherman, A. Shtull-Trauring, and M.B. Trakhtenbrot. STATEMATE: a working environment for the development of complex reactive systems. IEEE Transactions on Software Engineering, 16:403–414, 1990.
[12] The Object Management Group UML. http://www.omg.org/uml/.
[13] L. Lavagno, G. Martin, and B. Selic, Eds. UML for Real: Design of Embedded Real-Time Systems. Kluwer Academic Publishers, Dordrecht, 2003.
[14] Artisan Software Real Time Studio. http://www.artisansw.com/.
[15] IBM Rational Rose RealTime. http://www.rational.com/products/rosert/.
[16] dSPACE TargetLink and Prototyper. http://www.dspace.de.
[17] OSEK/VDX. http://www.osek-vdx.org/.
[18] E.A. Lee and D.G. Messerschmitt. Synchronous data flow. IEEE Proceedings, 75(9):1235–1245, 1987.
[19] J. Buck and R. Vaidyanathan. Heterogeneous modeling and simulation of embedded systems in El Greco. In Proceedings of the International Conference on Hardware Software Codesign, May 2000.
[20] TNI Valiosys Reqtify. http://www.tni-valiosys.com.
[21] R.P. Kurshan. Automata-Theoretic Verification of Coordinating Processes. Princeton University Press, Princeton, NJ, 1994.
[22] K. McMillan. Symbolic Model Checking. Kluwer Academic Publishers, Dordrecht, 1993.
[23] G. Buttazzo. Hard Real-Time Computing Systems: Predictable Scheduling Algorithms and Applications. Kluwer Academic Publishers, Dordrecht, 1997.
[24] H. Gomaa. Software Design Methods for Concurrent and Real-Time Systems. Addison-Wesley, Reading, MA, 1993.
[25] W.A. Halang and A.D. Stoyenko. Constructing Predictable Real Time Systems. Kluwer Academic Publishers, Dordrecht, 1991.
[26] R. Ernst, J. Henkel, and T. Benner. Hardware–software codesign for micro-controllers. IEEE Design and Test of Computers, 10:64–75, 1993.
[27] R.K. Gupta and G. De Micheli. Hardware–software cosynthesis for digital systems. IEEE Design and Test of Computers, 10:29–41, 1993.
[28] Cadence Design Systems CeltIc, RTL Compiler, and Quickturn. http://www.cadence.com.
[29] Open SystemC Initiative. http://www.systemc.org.
[30] G. Berry. The foundations of Esterel. In Plotkin, Stirling, and Tofte, Eds., Proof, Language and Interaction: Essays in Honour of Robin Milner. MIT Press, 2000.
[31] S.A. Edwards. Compiling Esterel into sequential code. In International Workshop on Hardware/Software Codesign. ACM Press, May 1999.
[32] T.B. Ismail, M. Abid, and A.A. Jerraya. COSMOS: a codesign approach for communicating systems. In International Workshop on Hardware/Software Codesign. ACM Press, 1994.
[33] W. Cesario, A. Baghdadi, L. Gauthier, D. Lyonnard, G. Nicolescu, Y. Paviot, S. Yoo, A.A. Jerraya, and M. Diaz-Nava. Component-based design approach for multicore SoCs. In Proceedings of the Design Automation Conference, June 2002.
[34] Foresight Systems. http://www.foresight-systems.com.
[35] CARDtools. http://www.cardtools.com.
[36] IMEC ATOMIUM. http://www.imec.be/design/atomium/.
[37] P. Cousot and R. Cousot. Abstract interpretation: a unified lattice model for static analysis of programs by construction or approximation of fixpoints. In Proceedings of the ACM Symposium on Principles of Programming Languages. ACM Press, 1977.
[38] AbsInt Worst-Case Execution Time Analyzers. http://www.absint.com.
[39] Y.T.S. Li and S. Malik. Performance analysis of embedded software using implicit path enumeration. In Proceedings of the Design Automation Conference, June 1995.
[40] 0-In Design Automation. http://www.0-in.com/.
[41] C. Norris Ip. Simulation coverage enhancement using test stimulus transformation. In Proceedings of the International Conference on Computer Aided Design, November 2000.
[42] Forte Design Systems Cynthesizer. http://www.forteds.com.
[43] Celoxica DK Design Suite. http://www.celoxica.com.
[44] K. Wakabayashi. Cyber: high level synthesis system from software into ASIC. In R. Camposano and W. Wolf, Eds., High Level VLSI Synthesis. Kluwer Academic Publishers, Dordrecht, 1991.
[45] D. Gajski, J. Zhu, and R. Domer. The SpecC Language. Kluwer Academic Publishers, Dordrecht, 1997.
[46] D. Gajski, J. Zhu, R. Domer, A. Gerstlauer, and S. Zhao. SpecC: Specification Language and Methodology. Kluwer Academic Publishers, Dordrecht, 2000.
[47] OPNET. http://www.opnet.com.
[48] Network Simulator NS-2. http://www.isi.edu/nsnam/ns/.
[49] Mentor Graphics Leonardo Spectrum, Seamless, and Emulation. http://www.mentor.com.
[50] VAST Systems CoMET. http://www.vastsystems.com/.
[51] Axys Design Automation MaxSim and MaxCore. http://www.axysdesign.com/.
[52] J. Rowson. Hardware/software co-simulation. In Proceedings of the Design Automation Conference, 1994, pp. 439–440.
[53] V. Zivojnovic and H. Meyr. Compiled HW/SW co-simulation. In Proceedings of the Design Automation Conference, 1996.
[54] Altera DSP Builder. http://www.altera.com.
[55] Xilinx System Generator. http://www.xilinx.com.
[56] IEEE. Standard 1076.1, VHDL-AMS. http://www.eda.org/vhdl-ams.
[57] OVI. Verilog-A standard. http://www.ovi.org.
[58] B. Kernighan and D. Ritchie. The C Programming Language. Prentice-Hall, New York, 1988.
[59] K. Arnold and J. Gosling. The Java Programming Language. Addison-Wesley, Reading, MA, 1996.
[60] Sun Microsystems, Inc. Embedded Java Specification. Available at http://java.sun.com, 1998.
[61] Real-Time for Java Expert Group. The real time specification for Java. Available at http://rtsj.dev.java.net/, 1998.
[62] A.V. Aho, J.E. Hopcroft, and J.D. Ullman. The Design and Analysis of Computer Algorithms. Addison-Wesley, Reading, MA, 1974.
[63] P. Panda, N. Dutt, and A. Nicolau. Efficient utilization of scratch-pad memory in embedded processor applications. In Proceedings of Design Automation and Test in Europe (DATE), February 1997.
[64] Y.T.S. Li, S. Malik, and A. Wolfe. Performance estimation of embedded software with instruction cache modeling. In Proceedings of the International Conference on Computer-Aided Design, November 1995.
[65] F. Mueller and D.B. Whalley. Fast instruction cache analysis via static cache simulation. In Proceedings of the 28th Annual Simulation Symposium, April 1995.
[66] P. Marwedel and G. Goossens, Eds. Code Generation for Embedded Processors. Kluwer Academic Publishers, Dordrecht, 1995.
[67] H. Kopetz. Should responsive systems be event-triggered or time-triggered? IEICE Transactions on Information and Systems, E76-D:1325–1332, 1993.
[68] H. Kopetz and G. Grunsteidl. TTP — A protocol for fault-tolerant real-time systems. IEEE Computer, 27:14–23, 1994.
[69] G. De Micheli. Synthesis and Optimization of Digital Circuits. McGraw-Hill, New York, 1994.
[70] Jan M. Rabaey, Anantha Chandrakasan, and Borivoje Nikolic. Digital Integrated Circuits, 2nd ed. Prentice-Hall, New York, 2003.
[71] S. Devadas, A. Ghosh, and K. Keutzer. Logic Synthesis. McGraw-Hill, New York, 1994.
[72] G.D. Hachtel and F. Somenzi. Logic Synthesis and Verification Algorithms. Kluwer Academic Publishers, Dordrecht, 1996.
[73] Synplicity Synplify. http://www.synplicity.com.
[74] Magma Design Automation Blast Create. http://www.magma-da.com.
[75] P. McGeer. On the Interaction of Functional and Timing Behavior of Combinational Logic Circuits. Ph.D. thesis, U.C. Berkeley, 1989.
[76] Naveed A. Sherwani. Algorithms for VLSI Physical Design Automation, 3rd ed. Kluwer Academic Publishers, Dordrecht, 1999.
[77] W.C. Elmore. The transient response of damped linear networks with particular regard to wideband amplifiers. Journal of Applied Physics, 19:55–63, 1948.
[78] R.H.J.M. Otten and R.K. Brayton. Planning for performance. In Proceedings of the Design Automation Conference, June 1998.
[79] M. Abramovici, M.A. Breuer, and A.D. Friedman. Digital Systems Testing and Testable Design. Computer Science Press, Rockville, MD, 1990.
4 Models of Embedded Computation

Axel Jantsch
Royal Institute of Technology

4.1 Introduction
    Models of Sequential and Parallel Computation • Nonfunctional Properties • Heterogeneity • Component Interaction • Time • The Purpose of an MoC
4.2 The MoC Framework
    Processes and Signals • Signal Partitioning • Untimed MoCs • The Synchronous MoC • Discrete Timed MoCs
4.3 Integration of MoCs
    MoC Interfaces • Interface Refinement • MoC Refinement
4.4 Conclusion
References
4.1 Introduction

A model of computation (MoC) is an abstraction of a real computing device. Different computational models serve different objectives and purposes. Thus, they always suppress some properties and details that are irrelevant for the purpose at hand, and they focus on other properties that are essential. Consequently, MoCs have been evolving during the history of computing.

In the early decades, between 1920 and 1950, the main focus was on the question: "What is computable?" The Turing machine and the lambda calculus are prominent examples of computational models developed to investigate that question. (The term "model of computation" came into use only much later, in the 1970s, but conceptually the computational models of today can certainly be traced back to the models developed in the 1930s.) It turned out that several very different MoCs, such as the Turing machine, the lambda calculus, partial recursive functions, register machines, Markov algorithms, Post systems, etc. [1], are all equivalent in the sense that they all denote the same set of computable mathematical functions. Thus, today the so-called Church–Turing thesis is widely accepted:

Church–Turing thesis. If function f is effectively calculable, then f is Turing-computable. If function f is not Turing-computable, then f is not effectively calculable [1, p. 379].

It is the basis for our understanding today of what kinds of problems can be solved by computers, and what kinds of problems are principally beyond a computer's reach. A famous example of what cannot be solved by a computer is the halting problem for Turing machines. A practical consequence is that there cannot be an algorithm that, given a function f and a C++ program P (or a program in any other sufficiently complex programming language), could determine whether P computes f.
This illustrates the principal difficulty of programming language teachers in correcting exams, and of verification engineers in validating programs and circuits.

Later the focus changed to the question: "What can be computed in reasonable time and with reasonable resources?", which spun off the theories of algorithmic complexity based on computational models exposing timing behavior in a particular but abstract way. This resulted in a hierarchy of complexity classes for algorithms according to their asymptotic complexity. The computation time (or other resources) for an algorithm is expressed as a function of some characteristic figure of the input, for example, the size of the input. For instance, we can state that the function f(n) = 2n, for natural numbers n, can be computed in p(n) time steps by any computer for some polynomial function p(n). In contrast, the function g(n) = n! cannot be computed in p(n) time steps on any sequential computer for any polynomial function p(n) and arbitrary n. With growing n, the number of time steps required to compute g(n) grows faster than can be expressed by any polynomial function.

This notion of asymptotic complexity allows us to express properties about algorithms in general, disregarding details of the algorithms and the computer architecture. This comes at the cost of accuracy. We may only know that there exists some polynomial function p(n) for every computer, but we do not know p(n), since it may be very different for different computers. To be more accurate one needs to take into account more details of the computer architecture. As a matter of fact, the complexity theories rest on the assumption that one kind of computational model, or machine abstraction, can simulate another one with a bounded and well-defined overhead. This simulation capability has been expressed in the thesis given below:

Invariance thesis. "Reasonable" machines can simulate each other with a polynomially bounded overhead in time and a constant overhead in space [2].

This thesis establishes an equivalence between different machine models and makes results for a particular machine more generally useful. However, some machines are equipped with considerably more resources and cannot be simulated by a conventional Turing machine according to the invariance thesis. Parallel machines have been the subject of a huge research effort, and the question of how parallel resources increase the computational power of a machine has led to a refinement of computational models and an accuracy increase for estimating computation time. The fundamental relation between sequential and parallel machines has been captured by the following thesis:

Parallel computation thesis. Whatever can be solved in polynomially bounded space on a reasonable sequential machine model can be solved in polynomially bounded time on a reasonable parallel machine, and vice versa [2].

Parallel computers prompted researchers to refine computational models to include the delay of communication and memory access, which we review briefly in Section 4.1.1. Embedded systems require a further evolution of computational models due to new design and analysis objectives and constraints. The term "embedded" triggers two important associations.
First, an embedded component is squeezed into a bigger system, which implies constraints on size, the form factor, weight, power consumption, cost, etc. Second, it is surrounded by real-world components, which implies timing constraints and interfaces to various communication links, sensors, and actuators. As a consequence, the computational models that are used and are useful in embedded system design are different from those in general purpose sequential and parallel computing. The difference comes from the nonfunctional requirements and constraints and from the heterogeneity.
4.1.1 Models of Sequential and Parallel Computation

Arguably, general purpose sequential computing had for a long time a privileged position, in that it had a single, very simple, and effective MoC. Based on the von Neumann machine, the random access machine (RAM) model [3] is a sufficiently general model to express all important algorithms, and it reflects the salient nonfunctional characteristics of practical computing engines.
Thus, it can be used to analyze performance properties of algorithms in a hardware architecture and implementation independent way. This favorable situation for sequential computing has been eroded over the years as processor architectures and memory hierarchies became ever more complex and deviated from the ideal RAM model.

The parallel computation community has been searching in vain for a similarly simple and effective model [4]. Without a universal model of parallel computation, the foundations for the development of portable and efficient parallel applications and architectures were lacking. Consequently, parallel computing has not gained as wide acceptance as sequential computing and is still confined to niche markets and applications.

The parallel random access machine (PRAM) [5] is perhaps the most popular model of parallel computation and closest to its sequential counterpart with respect to simplicity. A number of processors execute in a lock-step way, that is, synchronized after each cycle governed by a global clock, and access global, shared memory simultaneously within one cycle. The PRAM model's main virtue is its simplicity, but it poorly captures the costs associated with computing. Although the RAM model has a similar cost model, there is a significant difference. In the RAM model the costs (execution time, program size) are in fact well reflected and grow linearly with the size of the program and the length of the execution path. This correlation is in principle correct for all sequential processors. The PRAM model does not exhibit this simple correlation, because in most parallel computers the costs of memory access, communication, and synchronization can be vastly different depending on which memory location is accessed and which processors communicate. Thus, the developer of parallel algorithms does not have sufficient information from the PRAM model alone to develop efficient algorithms. He or she has to consult the specific cost models of the target machine.

Many PRAM variants have been developed to reflect real costs more realistically. Some made the memory access more realistic. The exclusive read–exclusive write (EREW) and the concurrent read–exclusive write (CREW) models [5] serialize access to a given memory location by different processors but still maintain the unit cost model for memory access. The local memory PRAM (LPRAM) model [6] introduces a notion of memory hierarchy, while the queued read–queued write (QRQW) PRAM [7] models the latency and contention of memory access. A host of other PRAM variants have factored in the cost of synchronization, communication latency, and bandwidth. Other models of parallel computation, many of which are not directly derived from the PRAM machine, focus on memory; there, either the distributed nature of memory is the main concern [8] or the various cost factors of the memory hierarchy are captured [6,9,10]. An introductory survey of models of parallel computation has been written by Maggs et al. [4].
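As a small illustration of how the EREW restriction changes costs, the sketch below simulates broadcasting one value to N memory cells by recursive doubling: each lock-step round doubles the number of cells holding the value, with no two processors touching the same cell in a round, so ⌈log₂ N⌉ rounds suffice, whereas a CREW machine could let all processors read one location in a single cycle. The simulation is illustrative only.

```c
#include <stdio.h>

#define N 16

/* EREW broadcast by recursive doubling: in each round, processor i
   (for i = 0..have-1) reads cell i and writes cell have+i, so all
   reads and writes in a round touch distinct cells. */
int main(void) {
    int mem[N] = { 42 };        /* cell 0 initially holds the value */
    int have = 1, rounds = 0;

    while (have < N) {
        for (int i = 0; i < have && have + i < N; i++)
            mem[have + i] = mem[i];
        have = (2 * have < N) ? 2 * have : N;
        rounds++;
    }
    printf("broadcast to %d cells in %d EREW rounds\n", N, rounds);
    return 0;
}
```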
4.1.2 Nonfunctional Properties

A main difference between sequential computation and parallel computation comes from the role of time. In sequential computing, time is solely a performance issue, which is moreover captured fairly well by the simple and elegant RAM model. In parallel computing the execution time can only be captured by complex cost functions that depend heavily on various details of the parallel computer. In addition, the execution time can also alter the functional behavior, because changes in the relative timing of different processors and the communication network can alter the overall functional behavior. To counter this danger, different parts of a parallel program must be synchronized properly.

In embedded systems the situation is even more delicate if real-time deadlines have to be observed. A system that responds slightly too late may be as unacceptable as a system that responds with incorrect data. Even worse, it is entirely context dependent whether it is better to respond slightly too late, incorrectly, or not at all. For instance, when transmitting a video stream, incorrect data arriving on time may be preferable to correct data arriving too late. Moreover, it may be better not to send data that would arrive too late, to save resources. On the other hand, control signals to drive the engine or the brakes in a car must always arrive, and a tiny delay may be preferable to no signal at all. These observations lead to the distinction of different kinds of real-time systems, for example, hard versus soft real-time systems, depending on the requirements on the timing.
Since most embedded systems interact with real-world objects, they are subject to some kind of real-time requirements. Thus, time is an integral part of the functional behavior and cannot be abstracted away completely in many cases. So it should not come as a surprise that MoCs have been developed to allow the modeling of time in an abstract way, to meet the application requirements while at the same time avoiding the unnecessary burden of too detailed timing. We will discuss some of these models below. In fact, the timing abstraction of different MoCs is a main organizing principle in this chapter.

Designing for low power is a high priority for most, if not all, embedded systems. However, power has been treated in a limited way in computational models because of the difficulty of abstracting the power consumption from the details of architecture and implementation. For very large-scale integration (VLSI) circuits, computational models have been developed to derive lower and upper bounds with respect to complexity measures that usually include both circuit area and computation time for a given behavior. AT² has been found to be a relevant and interesting complexity measure, where A is the circuit area and T is the computation time, either in clock cycles or in physical time. These models have also been used to derive bounds on the energy consumption, usually by assuming that the consumed energy is proportional to the state changes of the switching elements. Such analysis shows, for instance, that AT²-optimal circuits, that is, circuits which are optimal up to a constant factor with respect to the AT² measure for a given Boolean function, utilize their resources to a high degree, which means that on average a constant fraction of the chip changes state. Intuitively this is obvious since, if large parts of a circuit are not active over a long period (do not change state), it can presumably be improved by making it either smaller or faster, thus utilizing the circuit resources to a higher degree on average. Or, to conclude the other way round, an AT²-optimal circuit is also optimal with respect to energy consumption for computing a given Boolean function. One can spread out the consumed energy over a larger area or a longer time period, but one cannot decrease the asymptotic energy consumption for computing a given function. Note that all these results are asymptotic complexity measures with respect to a particular size metric of the computation, for example, the length in bits of the input parameter of the function. For a detailed survey of this theory see Reference 11.

These models have several limitations. They make assumptions about the technology. For instance, in different technologies the correlation between state switching and energy consumption is different. In n-channel metal oxide semiconductor (NMOS) technologies the energy consumption is more correlated with the number of switching elements. The same is true for complementary metal oxide semiconductor (CMOS) technologies if leakage power dominates the overall energy consumption. Also, they provide asymptotic complexity measures for very regular and systematic implementation styles and technologies, under a number of assumptions and constraints.
However, they do not expose relevant properties for complex modern microprocessors, VLIW (Very Long Instruction Word) processors, DSPs (Digital Signal Processors), FPGAs (Field Programmable Gate Arrays), or ASIC (Application Specific Integrated Circuit) designs in a way useful for system-level design decisions. And we are again back at our original question about what exactly is the purpose of a computational model and how general or how specific it should be.

In principle, there are two alternatives for integrating nonfunctional properties such as power, reliability, and also time in an MoC; a sketch of the first alternative follows this list.

• First, we can include these properties in the computational model and associate every functional operation with a specific quantity of that property. For example, an add operation takes 2 nsec and consumes 60 pW. During simulation or some other analysis we can calculate the overall delay and power consumption.
• Second, we can allocate abstract budgets for all parts of the design. For instance, in synchronous design styles, we divide the time axis into slots or cycles and assign every part of the design to exactly one slot. Later on, during implementation, we have to find the physical time duration of each slot, which determines the clock frequency. We can optimize for high clock frequency by identifying the critical path and optimizing that design part aggressively. Alternatively, we can move some of the functionality from the slot with the critical part to a neighboring slot, thus balancing the different slots. This budget approach can also be used for managing power consumption, noise, and other properties.
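A minimal sketch of the first alternative, as promised above: every abstract operation is annotated with an assumed time and energy cost, and the simulation kernel simply accumulates them. The per-operation figures are invented placeholders that would in practice be back-annotated from a specific target library.

```c
#include <stdio.h>

/* First alternative: annotate each operation of the computational
   model with nonfunctional costs and accumulate them during
   simulation. The figures below are invented placeholders. */
typedef struct { const char *op; double ns; double pj; } cost_t;

static const cost_t costs[] = {
    { "add",  2.0,  5.0 },
    { "mul",  6.0, 20.0 },
    { "mem", 10.0, 30.0 },
};

static double total_ns = 0.0, total_pj = 0.0;

static void account(int op_index) {
    total_ns += costs[op_index].ns;    /* accumulate delay  */
    total_pj += costs[op_index].pj;    /* accumulate energy */
}

int main(void) {
    /* A toy operation trace: two memory accesses, a multiply,
       and an add. */
    account(2); account(2); account(1); account(0);
    printf("estimated: %.1f ns, %.1f pJ\n", total_ns, total_pj);
    return 0;
}
```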
The first approach suffers from inefficient modeling and simulation when all implementation details are included in a model. Also, it cannot be applied to abstract models, since there these implementation details are not available. Recall that a main idea of computational models is that they should be abstract and general enough to support analysis of a large variety of architectures. The inclusion of detailed timing and power consumption data would obstruct this objective. Even the approach of starting out with an abstract model and later back-annotating the detailed data from realistic architectural or implementation models does not help, because the abstract model does not allow one to draw concrete conclusions, and the detailed, back-annotated model is valid only for a specific architecture.

The second approach, with abstract budgets, is slightly more appealing to us. On the assumption that all implementations will be able to meet the budgeted constraints, we can draw general conclusions about performance or power consumption on an abstract level, valid for a large number of different architectures. One drawback is that we do not know exactly for which class of architectures our analysis is valid, since it is hard to predict which implementations will in the end be able to meet the budget constraints. Another complication is that we do not know the exact physical size of these budgets, and it may indeed be different for different architectures and implementations. For instance, an ASIC implementation of a given architecture may be able to meet a cycle constraint of 1 nsec and run at a 1 GHz clock frequency, while an FPGA implementation of exactly the same algorithms requires a cycle budget of 5 nsec. But still, the abstract budget approach is promising, because it divides the overall problem into more manageable pieces. At the higher level we make assumptions about abstract budgets and analyze a system based on these assumptions. Our analysis will then be valid for all architectures and implementations that meet the stated assumptions. At the lower level we have to ensure and verify that these assumptions are indeed met.
4.1.3 Heterogeneity

Another salient feature of many embedded systems is heterogeneity. It comes from various environmental constraints on the interfaces, from heterogeneous applications, and from the need to find different tradeoffs among performance, cost, power consumption, and flexibility for different parts of the system. Consequently, we see analog and mixed-signal parts, digital signal processing parts, image and video processing parts, control parts, and user interfaces coexisting in the same system or even on the same VLSI device. We also see irregular architectures with microprocessors, DSPs, VLIWs, custom hardware coprocessors, memories, and FPGAs connected via a number of different segmented and hierarchical interconnection schemes. It is a formidable task to develop a uniform MoC that exposes all relevant properties while nicely suppressing irrelevant details. Heterogeneous MoCs are one way to address heterogeneity at the application, architecture, and implementation level. Different computational models are connected and integrated into a hierarchical, heterogeneous MoC that represents the entire system. Many different approaches have been taken to either connect two different computational models or provide a general framework to integrate a number of different models. It turns out that issues of communication, synchronization, and time representation pose the most formidable challenges. The reason is that the communication, and in particular the synchronization semantics between different MoC domains, correlates the time representation between the two domains. As we will see below, connecting a timed MoC with an untimed model leads to the import of a time structure from the timed to the untimed model, resulting in a heterogeneous, timed MoC. Thus the integration cannot stop superficially at the interfaces, leaving the interior of the two computational domains unaffected.

Due to the inherent heterogeneity of embedded systems, different MoCs will continue to be used and thus different MoC domains will coexist within the same system. There are two main possible relations: one is due to refinement and the other due to partitioning. A more abstract MoC can be refined into a more detailed model. In our framework, time is the natural parameter that determines the abstraction level of a model. The untimed MoC is more abstract than the synchronous MoC, which in turn is more abstract than the timed MoC. It is in fact common practice that a signal processing algorithm is first
modeled as an untimed dataflow algorithm, which is then refined into a synchronous circuit description, which in turn is mapped onto a technology-dependent netlist of fully timed gates. However, this is not a natural flow for all applications. Control-dominated systems or subsystems require some notion of time already at the system level, and sensor and actuator subsystems may require a continuous time model right from the start. Thus, different subsystems should be modeled with different MoCs.
4.1.4 Component Interaction

A troubling issue in complex, heterogeneous systems is unexpected behavior of the system due to subtle and complex ways of interaction between different MoC parts. Eker et al. [12] call this phenomenon emergent behavior. Some examples shall illustrate this important point:

Priority inversion. Threads in a real-time operating system may use two different mechanisms of resource allocation [12]. One is based on priority and preemption to schedule the threads. The second is based on monitors. Both are well defined and predictable in isolation. For instance, priority- and preemption-based scheduling means that a higher priority thread cannot be blocked by a lower priority thread. However, if the two threads also use a monitor lock, the lower priority thread may block the higher priority thread via the monitor for an indefinite amount of time.

Performance inversion. Assume there are four CPUs on a bus. CPU1 sends data to CPU2, and CPU3 sends data to CPU4 over the bus [13]. We would expect that the overall system performance improves when we replace one CPU with a faster processor, or at least that the system performance does not decrease. However, replacing CPU1 with a faster CPU1' may mean that data is sent from CPU1' to CPU2 with a higher frequency, at least for a limited amount of time. This means that the bus is more loaded by this traffic, which may slow down the communication from CPU3 to CPU4. If this communication performance has a direct influence on the system performance, we will see a decreased overall system performance.

Over synchronization. Assume that the upper and lower branches in Figure 4.1 have no mutual functional dependence, as the dataflow arrows indicate. Assume further that process B is blocked when it tries to send data to C1 or D1 but the receiver is not ready to accept the data. Then, a delay or deadlock in branch D will propagate back through process B to both A and the entire C branch.

These examples are not limited to situations where different MoCs interact. They show that, when separate, seemingly unrelated subsystems interact via a nonobvious mechanism, which is often a shared resource, the effects can be hard to analyze. When the different subsystems are modeled in different MoCs, the problem is even more pronounced due to different communication semantics, synchronization mechanisms, and time representations.
4.1.5 Time

The treatment of time will serve for us as the most important dimension to distinguish MoCs. We can identify at least four levels of accuracy: continuous time, discrete time, clocked time, and causality. In the sequel, we only cover the last three levels. When time is not modeled explicitly, events are only partially ordered with respect to their causal dependences. In one approach, taken for instance in deterministic dataflow networks [14, 15], the system
FIGURE 4.1 Over synchronization between functionally independent subsystems.
behavior is independent of delays and timing behavior of computation elements and communication channels. These models are robust with respect to time variations in that any implementation, no matter how slow or fast, will exhibit the same behavior as the model. Alternatively, different delays may affect the system's behavior and we obtain an inherently nondeterministic model, since time behavior that is not modeled explicitly is allowed to influence the observable behavior. This approach has been taken both in the context of dataflow models [16–19] and process algebras [20, 21]. In this chapter we follow the deterministic approach, which can be generalized to approximate nondeterministic behavior by means of stochastic processes as shown in Reference 22.

To exploit the very regular timing of some applications, synchronous dataflow (SDF) [23] has been developed. Every process consumes and emits a statically fixed number of events in each evaluation cycle. The evaluation cycle is the reference time. The regularity of the application is translated into a restriction of the model, which in turn allows efficient analysis and synthesis techniques that are not applicable to more general models. Scheduling, buffer size optimization, and synthesis techniques have been successfully developed for the SDF.

One facet related to the representation of time is the dichotomy of dataflow-dominated and control-flow-dominated applications. Dataflow-dominated applications tend to have events that occur at very regular intervals. Thus, explicit representation of time is not necessary and in fact often inefficient. In contrast, control-dominated applications deal with events occurring at very irregular time instants. Consequently, explicit representation of time is a necessity because the timing of events cannot be inferred. Difficulties arise in systems that contain both elements. Unfortunately, these kinds of systems are becoming more common since the average system complexity steadily increases. As a consequence, several attempts to integrate dataflow and control-dominated modeling concepts have emerged. In the synchronous piggybacked dataflow model [24], control events are transported on dataflow streams to represent a global state without breaking the locality principle of dataflow models. The composite signal flow [25] distinguishes between control and dataflow processes and puts significant effort into maintaining the frame-oriented processing that is so common in dataflow and signal processing applications for efficiency reasons. However, conflicts occur when irregular control events must be synchronized with dataflow events inside frames. The composite signal flow addresses this problem by allowing an approximation of the synchronization and defines conditions under which approximations are safe and do not lead to erroneous behavior.

Time is divided up into time slots or clock cycles by various synchronous models. According to the perfect synchrony assumption [26, 27], neither communication nor computation takes any noticeable time, and the time slots or evaluation cycles are completely determined by the arrival of input events. This assumption is useful because designers and tools can concentrate solely on the functionality of the system without mixing this activity with timing considerations. Optimization of performance can be done in a separate step by means of static timing analysis and local retiming techniques.
Even though timing does not appear explicitly in synchronous models, the behavior is not independent of time. The model constrains all implementations such that they must be fast enough to process input events properly and to complete an evaluation cycle before the next events arrive. When no events occur in an evaluation cycle, a special token called the absent event is used to communicate the advance of time. In our framework we use the same technique in Sections 4.2.4 and 4.2.5 for both the synchronous MoC and the fully timed MoC.

Discrete timed models use a discrete set, usually the integers or natural numbers, to assign a time stamp to each event. Many discrete event models fall into this category [28–30], as do most popular hardware description languages, such as VHDL and Verilog. Timing behavior can be modeled most accurately, which makes this the most general model we consider here and makes it applicable to problems such as detailed performance simulation, where synchronous and untimed models cannot be used. The price for this is the intimate dependence of functional behavior on timing details and significantly higher computation costs for analysis, simulation, and synthesis problems. Discrete timed models may be nondeterministic, as mainly used in performance analysis and simulation (see, e.g., Reference 30), or deterministic, as is more desirable for hardware description languages such as VHDL.
The integration of these different timing models into a single framework is a difficult task. Many attempts have been made on a practical level with a concrete design task, mostly simulation, in mind [31–35]. On a conceptual level, Lee and Sangiovanni-Vincentelli [36] have proposed a tagged time model in which every event is assigned a time tag. Depending on the tag domain we obtain different MoCs. If the tag domain is a partially ordered set, it results in an untimed model according to our definition. Discrete, totally ordered sets lead to timed MoCs, and continuous sets result in continuous time MoCs. There are two main differences between the tagged time model and our proposed framework. First, in the tagged time model, processes do not know how much time has progressed when no events are received, since global time is only communicated via the time stamps of ordinary events. For instance, a process cannot trigger a time-out if it has not received events for a particular amount of time. Our timed model in Section 4.2.5 does not use time tags but absent events to globally order events. Since absent events are communicated between processes whenever no other event occurs, processes are always informed about the advance of global time. We chose this approach because it better resembles the situation in design languages, such as VHDL, C, or SDL (Specification and Description Language), where processes can always experience time-outs. Second, one of our main motivations was the separation of communication and synchronization issues from the computation part of processes. Hence, we strictly distinguish between process interfaces and process functionality. Only the interfaces determine to which MoC a process belongs, while the core functionality is independent of the MoC. This feature is absent from the tagged time model. This separation of concerns has been inspired by the concept of firing cycles in dataflow process networks [37]. Our mechanism for consuming and emitting events based on signal partitionings, as described in Sections 4.2.2 and 4.2.3.1, is only slightly more general than the firing rules described by Lee [37], but it allows a useful definition of process signatures based on the way processes consume and emit events.
4.1.6 The Purpose of an MoC

As mentioned several times, the purpose of a computational model determines how it is designed, what properties it exposes, and what properties it suppresses. We argue that MoCs for embedded systems should not address principal questions of computability or feasibility, but should rather aid the design and validation of concrete systems. How this is accomplished best remains a subject of debate, but for this chapter we assume that an MoC should support the following properties:

Implementation independence. An abstract model should not expose too many details of a possible implementation, for example, which kind of processor is used, how many parallel resources are available, what kind of hardware implementation technology is used, details of the memory architecture, etc. Since an MoC is a machine abstraction, it should, by definition, avoid unnecessary machine details. Practically speaking, the benefits of an abstract model include that analysis and processing are faster and more efficient, that analysis results are relevant for a larger set of implementations, and that the same abstract model can be directed to different architectures and implementations. On the downside we note diminished analysis accuracy and a lack of knowledge of the target architecture that could be exploited for modeling and design. Hence, the right abstraction level is a fine line that is also changing over time. While many embedded system designers could for a long time safely assume a purely sequential implementation, current and future computational models should avoid such an assumption. Resource sharing and scheduling strategies are becoming more complex, and an MoC should thus either allow the explicit modeling of such a strategy or restrict the implementations to follow a particular, well-defined strategy.

Composability. Since many parts and components are typically developed independently and integrated into a system, it is important to avoid unexpected interferences. Thus, some kind of composability property [38] is desirable. One step in this direction is to have a deterministic computational model, such as Kahn process networks, that guarantees a particular behavior independent of the timing of individual activities and independent of the amount of available resources in general.
This is of course only a first step since, as argued earlier, time behavior is often an integral part of the functional behavior. Thus, resource sharing strategies, which greatly influence timing, will still have a major impact on the system behavior even for fully deterministic models. We can reconcile good system composability with shared resources by allocating a minimum but guaranteed amount of resources for each subsystem or task. For instance, two tasks get a fixed share of the communication bandwidth of a bus. This approach allows for ideal composability but has to be based on worst-case behavior. It is very conservative and, hence, does not utilize resources efficiently. We can relax this approach by allocating abstract resource budgets as part of the computational model. Then we require the implementation to provide the requested resources and, at the same time, to minimize the abstract budgets and thus the required resources. As an example, consider two tasks that have a particular communication need per abstract time slot, where the communication need may be different for different slots. The implementation has to fulfill the communication requirements of all tasks by providing the necessary bandwidth in each time slot, by tuning the length of the individual time slots, or by moving communication from one slot to another. These optimizations will also have to consider global timing and resource constraints. In any case, in the abstract model we can deal with abstract budgets and assume that they will be provided by any valid implementation.

Analyzability. A general tradeoff exists between the expressiveness of a model and its analyzability. By restricting models in clever ways, one can apply powerful and efficient analysis and synthesis methods. For instance, the SDF model allows each actor only a constant number of input and output tokens in each activation cycle. While this restricts the expressiveness of the model, it allows static schedules to be computed efficiently when they exist (see the sketch at the end of this section). For general dataflow graphs this may not be possible, because it could be impossible to ensure that the amount of input and output is always constant for all actors, even if it is in a particular case. Since SDF covers a fairly large and important application domain, it has become a very useful MoC. The key is to understand what the important properties are (finding static schedules, finding memory bounds, finding maximum delays, etc.) and to devise an MoC that allows these properties to be handled efficiently and does not restrict the modeling power too much.

In the following sections we discuss a framework to study different MoCs. The idea is to use different types of process constructors to instantiate processes of different MoCs. Thus, one type of process constructor would yield only untimed processes, while another type results in timed processes. The elements for process construction are simple functions and are in principle independent of a particular MoC. However, the independence is not complete since some MoCs put specific constraints on the functions. But still, the separation of the process interfaces from the internal process behavior is fairly far reaching. The interfaces determine the time representation, synchronization, and communication, hence the MoC. In this chapter we will not elaborate all interesting and desirable properties of computational models. Rather, we will use the framework to introduce four different MoCs that differ only in their timing abstraction.
Since time plays a very prominent role in embedded systems, we focus on this aspect and show how different time abstractions can serve different purposes and needs. Another defining aspect of embedded systems is heterogeneity, which we address by allowing different MoCs to coexist in a model. The common framework makes this integration semantically clean and simple. We study two particular aspects of this coexistence, namely the interfaces between two different MoCs and the refinement of one MoC into another. Other central issues of embedded systems, such as power consumption and global analysis and optimization, are not covered, mostly because they are not very well understood in this context and few advanced proposals exist on how to deal with them from an MoC perspective.
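As promised above, here is a sketch of the kind of analysis the SDF restriction enables: computing the repetition vector of an SDF graph from its balance equations, q(producer) x tokens produced = q(consumer) x tokens consumed, one equation per edge. The example graph and its rates are made up, and the sketch assumes a connected graph; it raises an error for inconsistent rate assignments, for which no static schedule exists.

from fractions import Fraction
from math import lcm

def repetition_vector(actors, edges):
    # edges: list of (producer, tokens_produced, consumer, tokens_consumed).
    q = {actors[0]: Fraction(1)}
    changed = True
    while changed:                       # propagate rates along the edges
        changed = False
        for p, prod, c, cons in edges:
            if p in q and c not in q:
                q[c] = q[p] * prod / cons
                changed = True
            elif c in q and p not in q:
                q[p] = q[c] * cons / prod
                changed = True
    for p, prod, c, cons in edges:       # consistency check
        if q[p] * prod != q[c] * cons:
            raise ValueError("inconsistent SDF graph: no static schedule")
    m = lcm(*(f.denominator for f in q.values()))
    return {a: int(f * m) for a, f in q.items()}

# A produces 2 tokens consumed 3 at a time by B; B produces 3 consumed 2 by C.
print(repetition_vector(["A", "B", "C"],
                        [("A", 2, "B", 3), ("B", 3, "C", 2)]))
# {'A': 3, 'B': 2, 'C': 3}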
4.2 The MoC Framework

In the remainder of this chapter we discuss a framework that accommodates MoCs with different timing abstractions. It is based on the notion of a process constructor, which is a mechanism to instantiate processes. A process constructor takes one or more pure functions as arguments and creates a process. The functions represent
the process behavior and have no notion of time or concurrency. They simply take arguments and produce results. The process constructor is responsible for establishing communication with other processes. It defines the time representation and the communication and synchronization semantics. A set of process constructors determines a particular MoC. This leads to a systematic and clean separation of computation and communication. A function that defines the computation of a process can in principle be used to instantiate processes in different computational models. However, a computational model may put constraints on functions. For instance, the synchronous MoC requires a function to take exactly one event on each input and produce exactly one event for each output. The untimed MoC has no similar requirement. After some preliminary definitions in this section, we introduce the untimed processes, give a formal definition of an MoC, and define the untimed MoC (Section 4.2.3), the perfectly synchronous and the clocked synchronous MoC (Section 4.2.4), and the discrete time MoC (Section 4.2.5). Based on this we introduce interfaces between MoCs and present an interface refinement procedure in the next section. Furthermore, we discuss the refinement from an untimed MoC to a synchronous MoC and to a timed MoC.
4.2.1 Processes and Signals

Processes communicate with each other by writing to and reading from signals. Given is a set of values V, which represents the data communicated over the signals. Events, which are the basic elements of signals, are or contain values. We distinguish among three different kinds of events. Untimed events E˙ are just values without further information, E˙ = V. Synchronous events E¯ include a pseudo-value ⊥ in addition to the normal values, hence E¯ = V ∪ {⊥}. Timed events Eˆ are identical to synchronous events, Eˆ = E¯. However, since it is often useful to distinguish them, we use different symbols. Intuitively, timed events occur at a much finer granularity than synchronous events and they would usually represent physical time units, such as a nanosecond. In contrast, synchronous events represent abstract time slots or clock cycles. This model of events and time can only accommodate discrete time models. Continuous time would require a different representation of time and events. We use the symbols e˙, e¯, and eˆ to denote individual untimed, synchronous, and timed events, respectively. We use E = E˙ ∪ E¯ ∪ Eˆ and e ∈ E to denote any kind of event.

Signals are sequences of events. Sequences are ordered and we use subscripts, as in ei, to denote the ith event in a signal. For example, a signal may be written as ⟨e0, e1, e2⟩. In general, signals can be finite or infinite sequences of events, and S is the set of all signals. We also distinguish among three kinds of signals: S˙, S¯, and Sˆ denote the untimed, synchronous, and timed signal sets, respectively, and s˙, s¯, and sˆ designate individual untimed, synchronous, and timed signals.
⟨⟩ is the empty signal and ⊕ concatenates two signals. Concatenation is associative and has the empty signal as its neutral element: s1 ⊕ (s2 ⊕ s3) = (s1 ⊕ s2) ⊕ s3, ⟨⟩ ⊕ s = s ⊕ ⟨⟩ = s. To keep the notation simple we often treat individual events as one-event sequences; for example, we may write e ⊕ s to denote ⟨e⟩ ⊕ s. We use angle brackets, "⟨" and "⟩", not only to denote ordered sets or sequences of events, but also to denote sequences of signals if we impose an order on a set of signals. #s gives the length of signal s. Infinite signals have infinite length and #⟨⟩ = 0. [ ] is an index operation to extract an event at a particular position from a signal. For example, s[2] = e2 if s = ⟨e1, e2, e3⟩.

Processes are defined as functions on signals, p : S → S. Processes are functions in the sense that for a given input signal we always get the same output signal, that is, s = s′ ⇒ p(s) = p(s′). Note that this still allows processes to have an internal state. Thus, a process does not necessarily react identically to the same event applied at different times. But it will
s = ⟨r0, r1, . . .⟩ = ⟨⟨e0, e1, e2⟩, ⟨e3, e4, e5⟩, . . .⟩, π(ν, s) = ⟨ri⟩ with ν(i) = 3 for all i
s′ = ⟨r′0, r′1, . . .⟩ = ⟨⟨e′0, e′1⟩, ⟨e′2, e′3⟩, . . .⟩, π(ν′, s′) = ⟨r′i⟩ with ν′(i) = 2 for all i
FIGURE 4.2 The input signal of process p is partitioned into an infinite sequence of subsignals each of which contains three events, while the output signal is partitioned into subsignals of lengths 2.
produce the same, possibly infinite, output signal when confronted with identical, possibly infinite, input signals, provided it starts with the same initial state.
4.2.2 Signal Partitioning

We shall use the partitioning of signals into subsequences to define the portions of a signal that are consumed or emitted by a process in each evaluation cycle. A partition π(ν, s) of a signal s defines an ordered set of signals, ⟨ri⟩, which, when concatenated together, form "almost" the original signal s. The function ν : N0 → N0 defines the lengths of all elements in the partition: ν(0) = #r0 gives the length of the first element in the partition, ν(1) = #r1 gives the length of the second element, etc.

Example 4.1 Let s1 = ⟨1, 2, 3, 4, 5, 6, 7, 8, 9, 10⟩ and ν1(0) = ν1(1) = 3, ν1(2) = 4. Then we get the partition π(ν1, s1) = ⟨⟨1, 2, 3⟩, ⟨4, 5, 6⟩, ⟨7, 8, 9, 10⟩⟩. Let s2 = ⟨1, 2, 3, . . .⟩ be the infinite signal with ascending integers. Let ν2(i) = 2 for all i ≥ 0. The resulting partition is infinite: π(ν2, s2) = ⟨⟨1, 2⟩, ⟨3, 4⟩, . . .⟩.

The function ν(i) defines the length of the subsignals ri. If it is constant for all i, we usually omit the argument and write ν. Figure 4.2 illustrates a process with an input signal s and an output signal s′. s is partitioned into subsignals of length 3 and s′ into subsignals of length 2.
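The partitioning function can be transcribed almost literally. The following sketch assumes signals are finite Python lists and drops trailing events that do not fill a complete subsignal, which is the "almost" in the definition above; it reproduces Example 4.1.

def partition(nu, s):
    # pi(nu, s): split signal s into subsignals; nu(i) gives the length of
    # the i-th subsignal. Leftover events not filling a block are dropped.
    result, pos, i = [], 0, 0
    while pos + nu(i) <= len(s):
        result.append(s[pos:pos + nu(i)])
        pos += nu(i)
        i += 1
    return result

s1 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
nu1 = lambda i: 3 if i <= 1 else 4
print(partition(nu1, s1))   # [[1, 2, 3], [4, 5, 6], [7, 8, 9, 10]]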
4.2.3 Untimed MoCs

4.2.3.1 Process Constructors

Our aim is to relate functions on events to processes, which are functions on signals. Therefore we introduce process constructors, which can be considered higher-order functions. They take functions on events as arguments and return processes. We define only a few basic process constructors that can be used to compose more complex processes and process networks. All untimed process constructors and processes operate exclusively on untimed signals. Processes with arbitrary numbers of input and output signals are cumbersome to deal with in a formal way. To avoid this inconvenience we mostly deal with processes with one input and one output. To handle arbitrary processes as well, we introduce "zip" and "unzip" processes that merge two input signals into one and split one input signal into two output signals, respectively. These processes, together with appropriate process composition, allow us to express arbitrary behavior.

Processes instantiated with the mealyU constructor resemble Mealy state machines in that they have a next-state function and an output encoding function that depend on both the input and the current state.

Definition 4.1 Let V be an arbitrary set of values, let g, f : (V × S˙) → S˙ be the next-state and output encoding functions, let γ : V → N be a function defining the input partitioning, and let w0 ∈ V be an initial state. mealyU is a process constructor which, given γ, f, g, and w0 as arguments, instantiates a process p : S˙ → S˙.
The function γ determines the number of events consumed by the process in the current evaluation cycle. γ depends on the current state. p repeatedly applies g to the current state and the input events to compute the next state. Further, it repeatedly applies f to the current state and the input events to compute the output events. Processes instantiated by mealyU are general state machines with one input and one output. To create processes with arbitrary inputs and outputs, we also use the following constructors:

• zipU instantiates a process with two inputs and one output. In every evaluation cycle this process takes one event from the left input and one event from the right input and packs them into an event pair that is emitted at the output.

• unzipU instantiates a process with one input and two outputs. In every evaluation cycle this process takes one event from the input and requires it to be an event pair. The first event of this pair is emitted to the left output, the second to the right output.

For truly general process networks we would in fact need more complex zip processes, but for the purpose of this chapter the simple constructors are sufficient and we refer the reader to Reference 39 for details.

4.2.3.2 Composition Operators

We consider only three basic composition operators, namely sequential composition, parallel composition, and feedback. We give the definitions only for processes with one or two input and output signals, because the generalization to arbitrary numbers of inputs and outputs is straightforward.

Definition 4.2 Let p1, p2 : S˙ → S˙ be two processes with one input and one output each, and let s1, s2 ∈ S˙ be two signals. Their parallel composition, denoted as p1 ∥ p2, is defined as follows: (p1 ∥ p2)(⟨s1, s2⟩) = ⟨p1(s1), p2(s2)⟩.

Since processes are functions we can easily define sequential composition in terms of functional composition.

Definition 4.3 Let again p1, p2 : S˙ → S˙ be two processes and let s ∈ S˙ be a signal. Their sequential composition, denoted as p2 ◦ p1, is defined as follows: (p2 ◦ p1)(s) = p2(p1(s)).

Definition 4.4 Given a process p : (S × S) → (S × S) with two input signals and two output signals, we define the process µp : S → S by the equation (µp)(s1) = s2
where p(s1, s3) = (s2, s3).
The behavior of the process µp is defined by the least fixed-point semantics based on the prefix order of signals. The µ operator gives feedback loops (Figure 4.3) a well-defined semantics. Moreover, the value of the feedback signal can be constructed by repeatedly simulating the process network, starting with the empty signal, until the values on all feedback signals stabilize and do not change any more [39]. Now we are in a position to define precisely what we mean by an MoC.

Definition 4.5 An MoC is a 2-tuple MoC = (C, O), where C is a set of process constructors, each of which, when given constructor-specific parameters, instantiates a process. O is a set of process composition operators, each of which, when given processes as arguments, instantiates a new process.
FIGURE 4.3 Feedback composition of a process.
Definition 4.6 The untimed MoC is defined as untimed MoC = (C, O), where

C = {mealyU, zipU, unzipU}
O = {∥, ◦, µ}.

In other words, a process or a process network belongs to the untimed MoC domain iff all its processes and process compositions are constructed either by one of the named process constructors or by one of the composition operators. We call such processes U-MoC processes.

Because the process interface is separated from the functionality of the process, interesting transformations can be done. For instance, a process can be mechanically transformed into a process that consumes and produces a multiple of the number of events of the original process. Processes can easily be merged into more complex processes. Moreover, there may be the opportunity to move functionality from one process to another. For more details on these kinds of transformations see Reference 39.
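In an executable notation the separation of interface and functionality becomes tangible. The following Python sketch of mealyU (signals as finite lists; the running-sum example and all names are ours, not part of the formal definition) shows how the pure functions γ, f, and g are turned into a process on signals by a higher-order function:

def mealyU(gamma, f, g, w0):
    # Higher-order function: turn gamma (input partitioning), f (output
    # encoding), and g (next state) into a process on untimed signals.
    def process(s):
        out, w, pos = [], w0, 0
        while pos + gamma(w) <= len(s):      # one evaluation cycle per block
            block = s[pos:pos + gamma(w)]
            pos += gamma(w)
            out.extend(f(w, block))          # output events of this cycle
            w = g(w, block)                  # state for the next cycle
        return out
    return process

# A hypothetical running-sum process: gamma = 1, the state is the sum so far.
acc = mealyU(lambda w: 1,
             lambda w, es: [w + es[0]],      # f: emit the updated sum
             lambda w, es: w + es[0],        # g: keep the updated sum
             0)
print(acc([1, 2, 3, 4]))                     # [1, 3, 6, 10]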
4.2.4 The Synchronous MoC

The synchronous languages StateCharts [40], Esterel [41], Signal [42], Argos, Lustre [43], and some others have been developed on the basis of the perfect synchrony assumption.

Perfect synchrony hypothesis. Neither computation nor communication takes time. Timing is entirely determined by the arrival of input events because the system processes input samples in zero time and then waits until the next input arrives. If the implementation of the system is fast enough to process all inputs before the next sample arrives, it will behave exactly as the specification in the synchronous language.

4.2.4.1 Process Constructors

Formally, we develop synchronous processes as a special case of untimed processes. This will later allow us to easily connect different domains. Synchronous processes have two specific characteristics. First, all synchronous processes consume and produce exactly one event on each input or output in each evaluation cycle, that is, the signature is always ⟨{1, . . .}, {1, . . .}⟩. Second, in addition to the value set V, events can carry the special value ⊥, which denotes the absence of an event; this is the way we defined synchronous events E¯ and signals S¯ in Section 4.2.1. Both the processes and their contained functions must be able to deal with these events. All synchronous process constructors and processes operate exclusively on synchronous signals.

Definition 4.7 Let V be an arbitrary set of values, E¯ = V ∪ {⊥}, let g, f : (E¯ × S¯) → S¯, and let w0 ∈ V be an initial state. mealyS is a process constructor which, given f, g, and w0 as arguments, instantiates a process p : S¯ → S¯. p repeatedly applies g to the current state and the input event to compute the next state. Further, it
repeatedly applies f to the current state and the input event to compute the output event. p consumes exactly one input event in each evaluation cycle and emits exactly one output event. We only require that g and f are defined for absent input events and that the output signal partitioning is the constant 1.

When we merge two signals into one, we have to decide how to represent the absence of an event in one input signal in the compound signal. We choose to use the ⊥ symbol for this purpose as well, with the consequence that ⊥ also appears in tuples together with normal values. Thus, it is essentially used for two different purposes. Having clarified this, the definitions of zipS and unzipS are straightforward. zipS-based processes pack two events from the two inputs into an event pair at the output, while unzipS performs the inverse operation.

4.2.4.2 The Perfectly Synchronous MoC

Again, we can now make precise what we mean by the synchronous MoC.

Definition 4.8 The synchronous MoC is defined as synchronous MoC = (C, O), where

C = {mealyS, zipS, unzipS}
O = {∥, ◦, µS}.

In other words, a process or a process network belongs to the synchronous MoC domain iff all its processes and process compositions are constructed either by one of the named process constructors or by one of the composition operators. We call such processes S-MoC processes.

Note that we do not use the same feedback operator for the synchronous MoC. µS defines the semantics of the feedback loop based on the Scott order of the values in E¯. It is also based on a fixed-point semantics, but it is resolved for each event and not over a complete signal. We have adopted µS to be consistent with the zero-delay feedback loop semantics of most synchronous languages. For our purpose here this is not significant and we do not need to go into more details. For precise definitions and a thorough motivation the reader is referred to Reference 39.

Merging of processes and other related transformations are very simple in the synchronous MoC because all processes have essentially identical interfaces. For instance, the merge of two mealyS-based processes can be formulated as follows:

mealyS(g1, f1, v0) ◦ mealyS(g2, f2, w0) = mealyS(g, f, (v0, w0))

where

g((v, w), e¯) = (g1(v, f2(w, e¯)), g2(w, e¯))
f((v, w), e¯) = f1(v, f2(w, e¯)).
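The merge rule can be checked mechanically. In the sketch below, synchronous signals are lists whose elements are values or None (standing for ⊥), mealyS follows the (g, f, w0) argument order of the equation above, and the two component processes are made-up examples:

ABSENT = None   # stands for the absent event

def mealyS(g, f, w0):
    def process(s):
        out, w = [], w0
        for e in s:
            out.append(f(w, e))   # exactly one output event per input event
            w = g(w, e)
        return out
    return process

# Two hypothetical processes: an incrementer followed by an accumulator;
# both pass absent events through and leave their state unchanged on them.
g2 = lambda w, e: w
f2 = lambda w, e: ABSENT if e is ABSENT else e + 1
g1 = lambda v, e: v if e is ABSENT else v + e
f1 = lambda v, e: ABSENT if e is ABSENT else v + e

merged = mealyS(lambda vw, e: (g1(vw[0], f2(vw[1], e)), g2(vw[1], e)),
                lambda vw, e: f1(vw[0], f2(vw[1], e)),
                (0, 0))
s = [1, ABSENT, 2, 3]
assert mealyS(g1, f1, 0)(mealyS(g2, f2, 0)(s)) == merged(s)
print(merged(s))   # [2, None, 5, 9]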
4.2.4.3 The Clocked Synchronous MoC

It is useful to define a variant of the perfectly synchronous MoC, the clocked synchronous MoC, which is based on the following hypothesis.

Clocked synchronous hypothesis. There is a global clock signal controlling the start of each computation in the system. Communication takes no time and computation takes one clock cycle.

First, we define a delay process ∆ that delays all inputs by one evaluation cycle:

∆ = mealyS(g, f, ⊥) where g(w, e¯) = e¯, f(w, e¯) = w.

Based on this delay process we define the constructors for the clocked synchronous model.
Definition 4.9

mealyCS(g, f, w0) = mealyS(g, f, w0) ◦ ∆
zipCS()(s¯1, s¯2) = zipS()(∆(s¯1), ∆(s¯2))        (4.1)
unzipCS() = unzipS() ◦ ∆.

Thus, elementary processes are composed of a combinatorial function and a delay function that essentially represents a latch at the inputs.

Definition 4.10 The clocked synchronous MoC is defined as clocked synchronous MoC = (C, O), where

C = {mealyCS, zipCS, unzipCS}
O = {∥, ◦, µ}.

In other words, a process or a process network belongs to the clocked synchronous MoC domain iff all its processes and process compositions are constructed either by one of the named process constructors or by one of the composition operators. We call such processes CS-MoC processes.
4.2.5 Discrete Timed MoCs

Timed processes are a blend of untimed and synchronous processes in that they can consume and produce more than one event per cycle and they also deal with absent events. In addition, they have to comply with the constraint that output events cannot occur before the input events of the same evaluation cycle. This is achieved by enforcing an equal number of input and output events for each evaluation cycle, and by prepending an initial sequence of absent events. Since the signals also represent the progression of time, the prefix of absent events at the outputs corresponds to an initial delay of the process in reacting to the inputs. Moreover, the partitioning of input and output signals corresponds to the duration of each evaluation cycle.

Definition 4.11 mealyT is a process constructor which, given γ, f, g, and w0 as arguments, instantiates a process p : Sˆ → Sˆ. Again, γ is a function of the current state and determines the number of input events consumed in a particular evaluation cycle. Function g computes the next state and f computes the output events, with the constraint that the output events do not occur earlier than the input events on which they depend. This constraint is necessary because in the timed MoC each event corresponds to a time stamp and we have a global total order of time, relating all events in all signals to each other. To avoid causality flaws every process has to abide by this constraint. Similarly, zipT-based processes consume events from their two inputs and pack them into tuples of events emitted at the output. unzipT performs the inverse operation. Both also have to comply with the causality constraint.

Again, we can now make precise what we mean by the timed MoC.

Definition 4.12 The timed MoC is defined as timed MoC = (C, O), where

C = {mealyT, zipT, unzipT}
O = {∥, ◦, µ}.

In other words, a process or a process network belongs to the timed MoC domain iff all its processes and process compositions are constructed either by one of the named process constructors or by one of the composition operators. We call such processes T-MoC processes.
Merging, other transformations, and the analysis of timed process networks are more complicated than for synchronous or untimed MoCs, because the timing may interfere with the purely functional behavior. However, we can further restrict the functions used in constructing the processes to more or less separate behavior from timing also in the timed MoC. To illustrate this we discuss a few variants of the Mealy process constructor.

mealyPT. In mealyPT(γ, f, g, w0)-based processes the functions f and g are not exposed to absent events and are only defined on untimed sequences. The interface of the process strips off all absent events of the input signal, hands over the result to f and g, and inserts absent events at the output as appropriate to provide proper timing for the output signal. The function γ, which may depend on the process state as usual, defines how many events are consumed. Essentially, it represents a timer and determines when the input should be checked the next time.

mealyST. In mealyST(γ, f, g, w0)-based processes γ determines the number of nonabsent events that should be handed over to f and g for processing. Again, f and g never see or produce absent events, and the process interface is responsible for providing them with the appropriate input data and for synchronization and timing issues on inputs and outputs. Unlike in mealyPT processes, functions f and g in mealyST processes have no influence on when they are invoked. They only control how many nonabsent events have appeared before their invocation. f and g in mealyPT processes, on the other hand, determine the time instant of their next invocation independent of the number of nonabsent events.

mealyTT. A combination of these two process constructors is mealyTT, which allows control over the number of nonabsent input events as well as a maximum time period after which the process is activated in any case, independent of the number of nonabsent input events received. This allows us to model processes that wait for input events but can set internal timers to provide time-outs.

These examples illustrate that process constructors and MoCs can be defined that allow us to precisely define to what extent communication issues are separated from the purely functional behavior of the processes. Obviously, a stricter separation greatly facilitates verification and synthesis but may restrict expressiveness.
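The mealyST idea can be sketched as follows, with None again standing for the absent event. The interface collects γ(w) nonabsent events before invoking f and g, and emits absent events while waiting. Exactly where the outputs are placed in time is simplified here, so this illustrates the interface principle rather than the full causality bookkeeping; the pairwise adder is a made-up example.

def mealyST(gamma, f, g, w0):
    def process(s):
        out, w, buf = [], w0, []
        for e in s:
            if e is not None:
                buf.append(e)
            if len(buf) == gamma(w):       # enough nonabsent events collected
                out.extend(f(w, buf))      # f and g see only real events
                w = g(w, buf)
                buf = []
            else:
                out.append(None)           # absent output keeps time flowing
        return out
    return process

# Hypothetical pairwise adder: fires after every 2 nonabsent input events.
add2 = mealyST(lambda w: 2,
               lambda w, es: [es[0] + es[1]],
               lambda w, es: w,
               0)
print(add2([1, None, 2, 3, None, 4]))   # [None, None, 3, None, None, 7]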
4.3 Integration of MoCs

4.3.1 MoC Interfaces

Interfaces between different MoCs determine the relation of the time structure in the different domains and they influence the way a domain is triggered to evaluate inputs and produce outputs. If an MoC domain is time triggered, the time signal is made available through the interface. Other domains are triggered when input data is available. Again, the input data appears through the interfaces. We introduce a few simple interfaces for the MoCs of the previous sections, in order to be able to discuss concrete examples.

Definition 4.13 A stripS2U process constructor takes no arguments and instantiates a process p : S¯ → S˙, which takes a synchronous signal as input and generates an untimed signal as output. It reproduces all data from the input in the output in the same order, with the exception of the absent event, which is translated into the value 0.

Definition 4.14 An insertU2S process constructor takes no arguments and instantiates a process p : S˙ → S¯, which takes an untimed signal as input and generates a synchronous signal as output. It reproduces all data from the input in the output in the same order without any change.

These interface processes between the synchronous and the untimed MoCs are very simple. However, they establish a strict and explicit time relation between two connected domains. Connecting processes from different MoCs also requires a proper semantic basis, which we provide by defining a hierarchical MoC.
Definition 4.15 A hierarchical model of computation (HMoC) is a 3-tuple HMoC = (M, C, O), where M is a set of HMoCs or simple MoCs, each capable of instantiating processes or process networks; C is a set of process constructors; O is a set of process composition operators that governs the process composition at the highest hierarchy level but not inside process networks instantiated by any of the HMoCs of M.

In the following examples and discussion we will use a specific but rather simple HMoC.

Definition 4.16 H = (M, C, O) with

M = {U-MoC, S-MoC}
C = {stripS2U, insertU2S}
O = {∥, ◦, µ}.

Example 4.2 As an example, consider the equalizer system of Figure 4.4 [39]. The control part consists of two synchronous MoC processes, and the dataflow part, modeled as untimed MoC processes, filters and analyzes an audio stream. Depending on the analysis results of the Analyzer process, the Distortion control will modify the filter parameters. The Button control also takes user input into account to steer the filter. The purpose of Analyzer and Distortion control is to avoid dangerously strong signals that could jeopardize the loudspeakers. Control and dataflow parts are connected via two interface processes. The dataflow processes can be developed and verified separately in the untimed MoC domain, but as soon as they are connected to the synchronous MoC control part, the time structure of the synchronous MoC domain gets imposed on all the untimed MoC processes. With the simple interfaces of Figure 4.4, the Filter process consumes 4096 data tokens from the primary input and 1 token from the stripS2U process, and it emits 4096 tokens in every synchronous MoC time slot. Similarly, the activity of the Analyzer is precisely defined for every synchronous MoC time slot. Also, the activities of the two control processes are related precisely to the activities of the dataflow processes in every time slot. Moreover, the timings of the two primary inputs and the primary outputs are now related. Their timing must be consistent because the timing of the primary input data determines the timing of the entire system. For example, if the input signal to
FIGURE 4.4 A digital equalizer consisting of a dataflow part and control. The numbers annotating process inputs and outputs denote the number of tokens consumed and produced in each evaluation cycle. (From A. Jantsch. Modeling Embedded Systems and SoCs. Morgan Kaufmann Publishers, San Francisco, CA, 2004. With permission.)
the Button control process assumes that each time slot has the same time duration, the 4096 data samples of the Filter input in each evaluation cycle must correspond to the same constant time period. It is the responsibility of the domain interfaces to correctly relate the timing of the different domains to each other. It is required that the time relations established by all interfaces are consistent with each other and with the timing of the primary inputs. For instance, if stripS2U takes 1 token as input and emits 1 token as output in each evaluation cycle, the insertU2S process cannot take 1 token as input and produce 2 tokens as output.

The interfaces in Figure 4.4 are very simple and lead to a strict coupling between the two MoC domains. Could more sophisticated or nondeterministic interfaces avoid this coupling effect? The answer is no, because even if the input and output tokens of the interfaces vary from evaluation cycle to evaluation cycle in complex or nondeterministic ways, we still have a very precise timing relation in each and every time slot. Since in every evaluation cycle all interface processes must consume and produce a particular number of tokens, this determines the time relation in that particular cycle. Even though this relation may vary from cycle to cycle, it is still well defined for all cycles and hence for the entire execution of the system. The possibly nondeterministic communication delay between MoC domains, as well as between any other processes, can be modeled, but this should not be confused with establishing a time relation between two MoC domains.
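For reference, the two interface constructors are nearly one-liners in the list-based sketch used throughout, with None standing for the absent event; per Definition 4.13, absence is translated into the value 0.

def stripS2U(s_sync):
    # Synchronous -> untimed: same data, same order; absent becomes 0.
    return [0 if e is None else e for e in s_sync]

def insertU2S(s_untimed):
    # Untimed -> synchronous: same data, now read as one event per time slot.
    return list(s_untimed)

print(stripS2U([5, None, 7]))   # [5, 0, 7]
print(insertU2S([1, 2, 3]))     # [1, 2, 3]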
4.3.2 Interface Refinement

In order to show this difference and to illustrate how abstract interfaces can be gradually refined to accommodate channel delay information and detailed protocols, we propose the following interface refinement procedure:

1. Add a time interface. When we connect two different MoC domains, we always have to define the time relation between the two. This is the case even if the two domains are of the same type, for example, both synchronous MoC domains, because the basic time unit may or may not be identical in the two domains. In our MoC framework the occurrence of events also represents time in both the synchronous MoC and timed MoC domains. Thus, setting the time relation means determining the number of events in one domain that correspond to one event in the other domain. For example, in Figure 4.4 the interfaces establish a one-to-one relation, while the interface in Figure 4.5 represents a 3/2 relation.
FIGURE 4.5 Determining the time relation between two MoC domains. (From A. Jantsch. Modeling Embedded Systems and SoCs. Morgan Kaufmann Publishers, San Francisco, CA, 2004. With permission.)
In other frameworks, establishing a time relation will take a different form. For instance, if languages such as SystemC or VHDL are used, the times of the different domains have to be related to the common time base of the simulator.

2. Refine the protocol. When the time relation between the two domains is established, we have to provide a protocol that is able to communicate over the final interface at that point. The two domains may represent different clocking regimes on the same chip, or one may end up as software while the other is implemented as hardware, or both may be implemented as software on different chips or cores, etc. Depending on the final implementations we have to develop a protocol fulfilling the requirements of the interface, such as buffering and error control. In our example in Figure 4.6 we have selected a simple handshake protocol with limited buffering capability. Note, however, that this assumes that for every three events arriving from MoC A there are only two useful events to be delivered to MoC B. The interface processes I1 and I2, and the protocol processes P1, P2, Q1, and Q2, must be designed carefully to avoid both losing data and deadlock.

3. Model the channel delay. In order to have a realistic channel behavior, the delay can be modeled deterministically or stochastically. In Figure 4.7 we have added a stochastic delay varying between 2 and 5 MoC B cycles. The protocol will require more buffering to accommodate the varying delays. To dimension the buffers correctly we have to identify the average and the worst-case behavior that we should be able to handle.

The refinement procedure proposed here is consistent with and complementary to other techniques proposed, for example, in the context of SystemC [44]. We only want to emphasize here that the time relation between domains, the channel delay, and the protocol design have to be separated. Often these issues
FIGURE 4.6 A simple handshake protocol. (From A. Jantsch. Modeling Embedded Systems and SoCs. Morgan Kaufmann Publishers, San Francisco, CA, 2004. With permission.)
FIGURE 4.7 The channel delay can vary between 2 and 5 cycles measured in MoC B cycles. (From A. Jantsch. Modeling Embedded Systems and SoCs. Morgan Kaufmann Publishers, San Francisco, CA, 2004. With permission.)
are not separated clearly, making interface design more complicated than necessary. More details about this procedure and the example can be found in Reference 39.
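As a small illustration of step 1, the following sketch adapts a 3/2 time relation in the spirit of Figure 4.5: three events of MoC A correspond to two events of MoC B. How the three values are condensed into two is a design decision of the concrete interface; dropping every third event here is purely illustrative.

def interface_3_to_2(s_a):
    # Consume MoC A events in groups of three, emit two MoC B events each.
    out = []
    for i in range(0, len(s_a) - len(s_a) % 3, 3):
        a, b, _ = s_a[i:i + 3]   # the third value is discarded (illustrative)
        out.extend([a, b])
    return out

print(interface_3_to_2([1, 2, 3, 4, 5, 6]))   # [1, 2, 4, 5]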
4.3.3 MoC Refinement

The three introduced MoCs represent three time abstractions and, naturally, design often starts at higher time abstractions and gradually moves to lower abstractions. It is not always appropriate to start with an untimed MoC, because when timing properties are an inherent and crucial part of the functionality, a synchronous model is more appropriate to start with. But if we start with an untimed model, we need to map it onto an architecture with concrete timing properties. Frequently, resource sharing makes the consideration of time functionally relevant because of deadlock problems and complicated interaction patterns. All three phenomena discussed in Section 4.1.4, priority inversion, performance inversion, and over synchronization, emerged due to resource sharing.

Example 4.3 We therefore discuss an example of MoC refinement from the untimed through the synchronous to the timed MoC, which is driven by resource sharing. In Figure 4.8 we have two untimed MoC process pairs, which are functionally independent of each other. At this level, under the assumption of infinite buffers and unlimited resources, we can analyze and develop the core functionality embodied by the process-internal functions f and g.

In the first refinement step, shown in Figure 4.9, we introduce finite buffers between the processes. Bn,2 and Bm,2 represent buffers of size n and m, respectively. Since the untimed MoC implicitly assumes infinite buffers between two communicating processes, there is no point in modeling finite buffers in the untimed MoC domain; we just would not see any effect. In the synchronous MoC domain, however, we can analyze
FIGURE 4.8 Two independent process pairs.
FIGURE 4.9 Two independent process pairs with explicit buffers.
the consequences of finite buffers. The processes need to be refined: processes P2 and R2 have to be able to handle full buffers, while processes Q2 and S2 have to handle empty buffers. In the untimed MoC, processes always block on empty input buffers. This behavior can also be modeled easily in synchronous MoC processes. In addition, more complicated behavior such as time-outs can be modeled and analyzed. To find the minimum buffer sizes while avoiding deadlock and ensuring the original system behavior is by itself a challenging task. Basten and Hoogerbrugge [45] propose a technique to address this. More frequently, the buffer minimization problem is formulated as part of the process scheduling problem [46,47].

The communication infrastructure is typically shared among many communicating actors. In Figure 4.10 we map the communication links onto one bus, represented as process I3. It contains an arbiter that resolves conflicts when both processes Bn,3 and Bm,3 try to access the bus at the same time. It also implements a bus access protocol that has to be followed by connecting processes. The synchronous MoC model in Figure 4.10 is cycle true, and the effect of bus sharing on system behavior and performance can be analyzed. A model checker can prove the soundness and fairness of the arbitration algorithm, and performance requirements on the individual processes can be derived to achieve a desirable system performance.

Sometimes it is a feasible option to synthesize the model of Figure 4.10 directly into a hardware or software implementation, provided we can use standard templates for the process interfaces. Alternatively, we can refine the model into a fully timed model. However, we still have various options depending on what exactly we would like to model and analyze. For each process we can decide how much of the timing and synchronization details should be explicitly taken care of by the process and how much can be handled implicitly by the process interfaces. For instance, in Section 4.2.5 we introduced the constructors mealyST and mealyPT. The first provides a process interface that strips off all absent events and inserts absent events at the output as needed. The internal functions only have to deal with the functional events, but they have no access to timing information. This means that an untimed mealyU process can be directly refined into a timed mealyST process with exactly the same functions f and g. Alternatively, the constructor mealyPT provides an interface that invokes the internal functions at regular time intervals. If this interval corresponds to a synchronous time slot, a synchronous MoC process can be easily mapped onto a mealyPT type of process, with the only difference that the functions in a mealyPT process may receive several nonabsent events in each cycle. But in both cases the processes experience a notion of time based on cycles.

In Figure 4.11 we have chosen to refine processes P, Q, R, and S into mealyST-based processes to keep them as similar as possible to the original untimed processes. Thus, the original f and g functions can be used without major modification. The process interfaces are responsible for collecting the inputs, presenting them to the f and g functions, and emitting properly synchronized output. The buffer and bus processes, however, have been mapped onto mealyPT processes. The constants λ and λ/2 represent the cycle times of the processes. Process Bm,4 operates with half the cycle time of
FIGURE 4.10 Two independent process pairs with explicit buffers.
FIGURE 4.11 All processes are refined into the timed MoC but with different synchronization interfaces.
Process Bm,4 operates with half the cycle time of the other processes, which illustrates that the modeling accuracy can be selected arbitrarily. We can also choose other process constructors, and hence interfaces, if desirable. For instance, some processes can be mapped onto mealyT-type processes in a further refinement step, to expose them to even more timing information.
4.4 Conclusion

We have tried to motivate why MoCs for embedded systems should differ from the many computational models developed in the past. The purpose of a model of embedded computation should be to support the analysis and design of concrete systems. Thus, it needs to deal with the salient and critical features of embedded systems in a systematic way. These features include real-time requirements, power consumption, architecture heterogeneity, application heterogeneity, and real-world interaction.

We have proposed a framework for studying different MoCs that allows us to appropriately capture some, but unfortunately not all, of these features. In particular, power consumption and other nonfunctional properties are not covered. Time is the central focus of the framework, but continuous time models are not included, in spite of their relevance for the sensors and actuators in embedded systems. Despite the deficiencies of this framework, we hope that we were able to argue well for a few important points:

• Different computational models should and will continue to coexist, for a variety of technical and nontechnical reasons.
• Using the "right" computational model in a design, and for a particular design task, can greatly facilitate the design process and improve the quality of the result. What the "right" model is depends on the purpose and objectives of a design task.
• Time is of central importance, and computational models with different timing abstractions should be used during system development.
From an MoC perspective, several important issues are open research topics and should be addressed urgently to improve the design process for embedded systems:

• We need to identify efficient ways to capture a few important nonfunctional properties in MoCs. At least power and energy consumption, and perhaps signal noise issues, should be attended to.
• The effective integration of different MoCs will require (1) the systematic manipulation and refinement of MoC interfaces and interdomain protocols; (2) the cross-domain analysis of functionality, performance, and power consumption; and (3) global optimization and synthesis, including the migration of tasks and processes across MoC domain boundaries.
• To make the benefits and the potential of well-defined MoCs available in practical design work, we need to project MoCs into design languages, such as VHDL, Verilog, SystemC, C++, etc. This should be done by properly subsetting a language and by developing pragmatics that restrict the use of the language. If accompanied by tools to enforce the restrictions and to exploit the properties of the underlying MoC, this will be quickly accepted by designers.

In the future we foresee a continuous and steady further development of MoCs to match future theoretical objectives and practical design purposes. We also hope that they become better accepted as practically useful devices for supporting the design process, just like design languages, tools, and methodologies.
References

[1] Ralph Gregory Taylor. Models of Computation and Formal Language. Oxford University Press, New York, 1998.
[2] Peter van Emde Boas. Machine models and simulation. In J. van Leeuwen, Ed., Handbook of Theoretical Computer Science, Vol. A: Algorithms and Complexity. Elsevier Science Publishers B.V., Amsterdam, 1990, chap. 1, pp. 1–66.
[3] S. Cook and R. Reckhow. Time bounded random access machines. Journal of Computer and System Sciences, 7:354–375, 1973.
[4] B.M. Maggs, L.R. Matheson, and R.E. Tarjan. Models of parallel computation: a survey and synthesis. In Proceedings of the 28th Hawaii International Conference on System Sciences (HICSS), Vol. 2, 1995, pp. 61–70.
[5] S. Fortune and J. Wyllie. Parallelism in random access machines. In Proceedings of the 10th Annual Symposium on Theory of Computing, San Diego, CA, 1978.
[6] Alok Aggarwal, Ashok K. Chandra, and Marc Snir. Communication complexity of PRAMs. Theoretical Computer Science, 71:3–28, 1990.
[7] Phillip B. Gibbons, Yossi Matias, and Vijaya Ramachandran. The QRQW PRAM: accounting for contention in parallel algorithms. In Proceedings of the 5th Annual ACM-SIAM Symposium on Discrete Algorithms, Arlington, VA, January 1994, pp. 638–648.
[8] Eli Upfal. Efficient schemes for parallel communication. Journal of the ACM, 31:507–517, 1984.
[9] A. Aggarwal, B. Alpern, A.K. Chandra, and M. Snir. A model for hierarchical memory. In Proceedings of the 19th Annual ACM Symposium on Theory of Computing, May 1987, pp. 305–314.
[10] Bowen Alpern, Larry Carter, Ephraim Feig, and Ted Selker. The uniform memory hierarchy model of computation. Algorithmica, 12:72–109, 1994.
[11] Thomas Lengauer. VLSI theory. In J. van Leeuwen, Ed., Handbook of Theoretical Computer Science, Vol. A: Algorithms and Complexity, 2nd ed., Elsevier Science Publishers, Amsterdam, 1990, chap. 16, pp. 835–868.
[12] Johan Eker, Jörn W. Janneck, Edward A. Lee, Jie Liu, Xiaojun Liu, Jozsef Ludvig, Stephen Neuendorffer, Sonia Sachs, and Yuhong Xiong. Taming heterogeneity - the Ptolemy approach. Proceedings of the IEEE, 91:127–144, 2003.
[13] Rolf Ernst. MPSOC Performance Modeling and Analysis. Paper presented at the 3rd International Seminar on Application-Specific Multi-Processor SoC, Chamonix, France, 2003.
[14] Gilles Kahn. The semantics of a simple language for parallel programming. In Proceedings of the IFIP Congress 74. North-Holland, Amsterdam, 1974.
[15] Edward A. Lee and T.M. Parks. Dataflow process networks. Proceedings of the IEEE, 83:773–801, 1995.
[16] Jarvis Dean Brock. A Formal Model for Non-Deterministic Dataflow Computation. Ph.D. thesis, Massachusetts Institute of Technology, Cambridge, MA, 1983.
[17] J. Dean Brock and William B. Ackerman. Scenarios: a model of nondeterminate computation. In J. Diaz and I. Ramos, Eds., Formalization of Programming Concepts, Vol. 107 of Lecture Notes in Computer Science. Springer Verlag, Heidelberg, 1981, pp. 252–259.
[18] Paul R. Kosinski. A straightforward denotational semantics for nondeterminate data flow programs. In Proceedings of the 5th ACM Symposium on Principles of Programming Languages, 1978, pp. 214–219.
[19] David Park. The 'fairness' problem and nondeterministic computing networks. In J.W. de Bakker and J. van Leeuwen, Eds., Foundations of Computer Science IV, Part 2: Semantics and Logic. Mathematical Centre Tracts, Amsterdam, The Netherlands, 1983, Vol. 159, pp. 133–161.
[20] Robin Milner. Communication and Concurrency. International Series in Computer Science. Prentice Hall, New York, 1989.
[21] C.A.R. Hoare. Communicating sequential processes. Communications of the ACM, 21:666–676, 1978.
[22] Axel Jantsch, Ingo Sander, and Wenbiao Wu. The usage of stochastic processes in embedded system specifications. In Proceedings of the Ninth International Symposium on Hardware/Software Codesign, April 2001.
[23] Edward Ashford Lee and David G. Messerschmitt. Static scheduling of synchronous data flow programs for digital signal processing. IEEE Transactions on Computers, C-36:24–35, 1987.
[24] Chanik Park, Jaewoong Jung, and Soonhoi Ha. Extended synchronous dataflow for efficient DSP system prototyping. Design Automation for Embedded Systems, 6:295–322, 2002.
[25] Axel Jantsch and Per Bjuréus. Composite signal flow: a computational model combining events, sampled streams, and vectors. In Proceedings of the Design and Test Europe Conference (DATE), 2000.
[26] Nicolas Halbwachs. Synchronous programming of reactive systems. In Proceedings of Computer Aided Verification (CAV), 2000.
[27] Albert Benveniste and Gérard Berry. The synchronous approach to reactive and real-time systems. Proceedings of the IEEE, 79:1270–1282, 1991.
[28] Frank L. Severance. System Modeling and Simulation. John Wiley & Sons, New York, 2001.
[29] Averill M. Law and W. David Kelton. Simulation Modeling and Analysis, 3rd ed., Industrial Engineering Series. McGraw Hill, New York, 2000.
[30] Christos G. Cassandras. Discrete Event Systems. Aksen Associates, Boston, MA, 1993.
[31] Per Bjuréus and Axel Jantsch. Modeling of mixed control and dataflow systems in MASCOT. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 9:690–704, 2001.
[32] Peeter Ellervee, Shashi Kumar, Axel Jantsch, Bengt Svantesson, Thomas Meincke, and Ahmed Hemani. IRSYD: an internal representation for heterogeneous embedded systems. In Proceedings of the 16th NORCHIP Conference, 1998.
[33] P. Le Marrec, C.A. Valderrama, F. Hessel, A.A. Jerraya, M. Attia, and O. Cayrol. Hardware, software and mechanical cosimulation for automotive applications. In Proceedings of the Ninth International Workshop on Rapid System Prototyping, 1998, pp. 202–206.
[34] Ahmed A. Jerraya and K. O'Brien. Solar: an intermediate format for system-level modeling and synthesis. In Jerzy Rozenblit and Klaus Buchenrieder, Eds., Codesign: Computer-Aided Software/Hardware Engineering. IEEE Press, Piscataway, NJ, 1995, chap. 7, pp. 145–175.
[35] Edward A. Lee and David G. Messerschmitt. An Overview of the Ptolemy Project. Report from Department of Electrical Engineering and Computer Science, University of California, Berkeley, January 1993.
[36] Edward A. Lee and Alberto Sangiovanni-Vincentelli. A framework for comparing models of computation. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 17:1217–1229, 1998.
[37] Edward A. Lee. A Denotational Semantics for Dataflow with Firing. Technical report UCB/ERL M97/3, Department of Electrical Engineering and Computer Science, University of California, Berkeley, January 1997.
[38] Axel Jantsch and Hannu Tenhunen. Will networks on chip close the productivity gap? In Axel Jantsch and Hannu Tenhunen, Eds., Networks on Chip, Kluwer Academic Publishers, Dordrecht, 2003, chap. 1, pp. 3–18.
[39] Axel Jantsch. Modeling Embedded Systems and SoCs — Concurrency and Time in Models of Computation. Systems on Silicon. Morgan Kaufmann Publishers, San Francisco, CA, 2003.
[40] D. Harel. Statecharts: a visual formalism for complex systems. Science of Computer Programming, 8:231–274, 1987.
[41] G. Berry, P. Couronne, and G. Gonthier. Synchronous programming of reactive systems: an introduction to Esterel. In Kazuhiro Fuchi and M. Nivat, Eds., Programming of Future Generation Computers, Elsevier, New York, 1988, pp. 35–55.
[42] Paul le Guernic, Thierry Gautier, Michel le Borgne, and Claude le Maire. Programming real-time applications with SIGNAL. Proceedings of the IEEE, 79:1321–1336, 1991.
[43] N. Halbwachs, P. Caspi, P. Raymond, and D. Pilaud. The synchronous data flow programming language LUSTRE. Proceedings of the IEEE, 79:1305–1320, 1991.
[44] Thorsten Grötker, Stan Liao, Grant Martin, and Stuart Swan. System Design with SystemC. Kluwer Academic Publishers, Dordrecht, 2002.
[45] Twan Basten and Jan Hoogerbrugge. Efficient execution of process networks. In Alan Chalmers, Majid Mirmehdi, and Henk Muller, Eds., Communicating Process Architectures. IOS Press, Amsterdam, 2001.
[46] Sundararajan Sriram and Shuvra S. Bhattacharyya. Embedded Multiprocessors: Scheduling and Synchronization. Marcel Dekker, New York, 2000.
[47] Shuvra S. Bhattacharyya, Praveen K. Murthy, and Edward A. Lee. Software Synthesis from Dataflow Graphs. Kluwer Academic Publishers, Dordrecht, 1996.
5 Modeling Formalisms for Embedded System Design
Luís Gomes Universidade Nova de Lisboa and UNINOVA
João Paulo Barros Instituto Politécnico de Beja and UNINOVA
Anikó Costa Universidade Nova de Lisboa and UNINOVA
5.1 Introduction
5.2 Notions of Time
5.3 Communication Support
5.4 Common Modeling Formalisms
Finite State Machines • Finite State Machines with Datapath • Statecharts and Hierarchical/Concurrent Finite State Machines • Program-State Machines • Codesign Finite State Machines • Specification and Description Language • Message Sequence Charts • Petri Nets • Discrete Event • Synchronous/Reactive Models • Dataflow Models
5.5 Conclusions
References
5.1 Introduction

The importance of the system specification phase is directly proportional to the complexity of the respective system. Embedded systems have become more and more complex, not only due to the increasing dimension of the systems, but also due to the interactions among the different system design aspects. These include, among others, correctness, platform heterogeneity, performance, power consumption, costs, and time-to-market. Therefore, a multitude of modeling formalisms have been applied to embedded system design. Typically, these formalisms strive for a maximum of preciseness, as they rely on a mathematical (formal) model.

Modeling formalisms are often referred to as models of computation (MoC) [1–5]. An MoC is composed of a notation and of rules for the computation of behavior: the notation defines the syntax of the model, while the rules define the model semantics. The usage of formal models in embedded system design allows (at least) one of the following [2]:

• Unambiguous capture of the required system functionality.
• Verification of the correctness of the functional specification with respect to its desired properties.
• Support for synthesis onto a specific architecture and communication resources.
• Use of different tools based on the same model (supporting communication among the teams involved in designing, producing, and maintaining the system).

It has to be stressed that model-based verification of properties is a subject of major importance in embedded system design (and in system design in general), as it allows one to verify model correctness even if the system does not (physically) exist, or if it is difficult, dangerous, or costly to analyze the system directly. The construction of a system model brings several advantages, as it forces a more complete comprehension of the system and allows the comparison of distinct approaches. Hence, it becomes easier to identify desired and undesired system properties, as the requirements become more precise and complete.

Most modeling formalisms for embedded system design are based on a particular diagrammatic (or graphical) language. Despite known arguments against diagrammatic languages (e.g., Reference 6), they are presently widely acknowledged as extremely useful and popular for software development, and also for embedded system development in general. The history of the Unified Modeling Language (UML), the Specification and Description Language (SDL), and Message Sequence Charts (MSCs) certainly proves it. Even though diagrammatic languages are often seen as inherently less precise than textual languages, this is certainly not true (see, e.g., References 7 and 8).

These diagrammatic representations are usually graph-based. Finite state machines (FSMs), in their different forms (Moore, Mealy) and extensions (hierarchical and concurrent, Statecharts, etc.), are a well-known example. The same is true for dataflows and Petri nets. These formalisms offer a variety of semantics for the modeling of time, communication, and concurrency. Besides distinct graphical syntaxes and semantics, different formalisms also have different analysis and verification capabilities.

The plethora of MoCs ready to be used by embedded system designers means that choosing the "best" formalism is a very difficult task for the modeler. Different embedded systems can, and often do, emphasize different aspects, namely the reactive nature of their behavior, real-time constraints, or data processing capabilities. The same happens with the available MoCs. For example, some MoCs for embedded systems are control dominated (data processing and computation are minimal), emphasizing the reactiveness of the system's behavior. Others emphasize data processing, containing complex data transformations, normally described by dataflows. Reactive control systems are in the first group, and digital signal-processing applications are in the second; accordingly, digital signal-processing applications emphasize the usage of dataflow models, whereas FSMs explicitly emphasize reactiveness.

Unfortunately, other aspects also have to be considered when producing the model of a system; for example, the need to model specific notions of time or different modes of communication among components may further complicate the search for the "right" MoC. So, in some embedded system designs, heterogeneity in terms of the implementation platforms has to be faced, and it is not possible to find a unique formalism to model the whole system.
In those situations, the goal is to decompose the system model into submodels and to pick the right formalism for each of the different submodels, although, in the end, the designer has to be able to integrate all those models in a coherent way [9].

Several formalisms allow the modeler to partition the system model and describe it as a collection of communicating modules (components). In this sense, the modeling of behavior and of communication among components are often interdependent. Yet, separating behavior and communication is a sound approach, as it allows handling system design complexity and the reusability of components. In fact, it is very difficult to reuse components if behavior and communication are intertwined, as the behavior then depends on the communication mechanisms with the other components of the system design [2].

Modeling formalisms for embedded system design have been widely studied, and several reviews and textbooks about MoCs can be found in the literature [1–5]. This chapter surveys several modeling formalisms for embedded system design, taking Reference 5 as the main reference and expanding it to encompass a set of additional modeling formalisms widely used by embedded system designers in several application areas. The following sections address aspects of time representation and communication support. Afterwards, several selected modeling formalisms are presented.
5.2 Notions of Time

Embedded systems are often characterized as real-time systems. Thus, the notion of time is extremely important in many of the modeling formalisms for embedded system design. Generally speaking, we may identify three approaches to time modeling:

1. Continuous time and differential equations.
2. Discrete time and difference equations.
3. Discrete events.

The first approach (see Figure 5.1[a]) uses differential equations to model continuous time functions. This approach is mostly used for the modeling of specific interface components, where the continuous nature of signal evolution is present, such as the modeling of analog circuits and of physical systems in a broad sense.

In the second approach (Figure 5.1[b]), it is assumed that time is discrete; in this sense, difference equations replace differential equations. A global clock (the tick) defines the specific points in time where signals have values. For some applications, involving heterogeneous components, it is also useful to consider multirate difference equations (which means that several clock signals are available). Digital signal processing is one of the main application areas.

In the third approach, a signal is seen as a sequence of events (see Figure 5.1[c]). This concept of events can be associated with the evolution of physical signals, as presented in Figure 5.2 for a Boolean signal; there, event a is associated with the rising edge of signal x, while event x− is generated at all falling edges of signal x. The extension to other useful types of signals is straightforward, namely for signals that can hold multivalued, enumerated, or integer values.

Each event has a value and a time tag. The events are processed in chronological order, based on a predefined precedence. If the time tags are totally ordered [10], we are in the presence of a timed system: for any distinct t1 and t2, either t1 < t2 or t2 < t1 (this is called a total order). It is possible to define an associated metric, for instance f(t1, t2) = |t1 − t2|. If the metric is a continuum, we have a continuous time system. A discrete-event system is a timed system where the time tags are totally ordered.
FIGURE 5.1 Time representations. (From Luís Gomes and João Paulo Barros, Models of Computation for Embedded Systems. In The Industrial Information Technology Handbook, Richard Zurawski, Ed., Section VI — Real Time and Embedded Systems, chapter 83, CRC Press, Boca Raton, FL, 2005. With permission.)
FIGURE 5.2 From signals to events and conditions.
Two events are synchronous if they have the same time tag attached (they occur simultaneously). Similarly, two signals are synchronous if, for each event in one signal, there is a synchronous event in the other signal, and vice versa. A system is synchronous if every signal in the system is synchronous with every other signal in the system. In this sense, a discrete-time system is a synchronous discrete-event system. Totally ordered events are used in digital hardware simulators, namely those associated with the VHDL and Verilog hardware description languages. Any two events are either simultaneous, which means that they have the same time tag, or one of them precedes the other.

Events can be partially ordered, in the sense that the order does not include all the events in the system. When tags are partially ordered instead of totally ordered, the system is untimed. This means that we can build several event sequences that do not contain all the system events; the missing events are included in other completely ordered event sequences. It is known [11] that a total order of events cannot be maintained in distributed systems, where a partial order is sufficient to analyze system behavior. Partial orders have also been used to analyze Petri nets [12].

An asynchronous system is a system in which no two events can have the same tag [1]. The system is asynchronous interleaved if the tags are totally ordered, and asynchronous concurrent if the tags are partially ordered. As time is intrinsically continuous, real systems are asynchronous by nature. Yet, synchronicity is a very convenient abstraction, allowing efficient and robust implementations through the use of a reference "clock" signal.
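As a small illustration of tagged events, the following C sketch (with hypothetical names) represents events as value/tag pairs, tests two events for synchrony, and provides a comparator for processing an event set in chronological order, for example with qsort().

#include <stdbool.h>

/* A value stamped with a time tag. */
typedef struct { double tag; int value; } tagged_event_t;

/* Two events are synchronous iff they carry the same time tag. */
bool synchronous(tagged_event_t a, tagged_event_t b)
{
    return a.tag == b.tag;
}

/* In a timed system the tags are totally ordered, so any two events
   can be compared; this comparator supports chronological processing. */
int compare_tags(const void *pa, const void *pb)
{
    const tagged_event_t *a = pa, *b = pb;
    return (a->tag > b->tag) - (a->tag < b->tag);
}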
5.3 Communication Support

The complexity of embedded systems usually motivates their decomposition into several interacting components. These can be more or less independent (for example, they can execute in true concurrency or in an interleaved way), but probably all will have to communicate with some other components. Therefore, communication is of topmost importance. It can be classified as implicit or explicit [3]:

• Implicit communication generally requires totally ordered tag events, normally associated with physical time. To support this form of communication it is necessary to have a physically shared signal (for instance, a clock signal), whose availability may be difficult or unfeasible in a large number of embedded system applications.
• Explicit communication imposes an order on the events: the sender process guarantees that all the receiver processes are informed about some part of its internal state.

The following models of communication are normally considered:

• Handshake, using a synchronization mechanism; all intervening components are blocked, waiting for conclusion.
• Message passing, using a send–receive pattern where the receiver waits for the message.
• Shared variables, where blocking is decided by the control part of the memory in which the shared variable is stored.

These communication modes are supported by a set of communication primitives (or by some combination of them), namely [3]:

• Unsynchronized: producer and consumer(s) are not synchronized; there are no guarantees that the producer does not overwrite previously produced data, or that the consumer(s) will get all produced data.
• Read-modify-write: this is the common way to access shared data structures from different processes in software; access to the data structure is locked during a data access (either read–write or read-modify-write), making it an atomic action (indivisible, and thus uninterruptible).
• Unbounded FIFO (first-in first-out) buffered: the producer generates a sequence of data tokens and the consumer gets those tokens using a FIFO discipline.
• Bounded FIFO buffered: as in the previous case, but the buffer size is limited, so the difference between the number of writes and reads is bounded by some value. This means that writes can block if the buffer is full (a sketch of this discipline is given after this list).
• Petri net places: producers generate sequences of data tokens and consumers read those tokens.
• Rendezvous: the writing process and the reading process must simultaneously be at the point where the write and the read occur.
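As an illustration of the bounded FIFO discipline, here is a minimal C sketch using a ring buffer; the names, the buffer size, and the policy of reporting failure instead of actually blocking are our own simplification.

#include <stdbool.h>

#define FIFO_SIZE 8

/* A bounded FIFO channel: tokens are consumed in first-in first-out
   order; a write on a full buffer must block (here it simply reports
   failure, leaving the blocking policy to the caller).              */
typedef struct {
    int      buf[FIFO_SIZE];
    unsigned head, count;
} bounded_fifo_t;

bool fifo_write(bounded_fifo_t *f, int token)
{
    if (f->count == FIFO_SIZE)            /* full: producer blocked  */
        return false;
    f->buf[(f->head + f->count) % FIFO_SIZE] = token;
    f->count++;
    return true;
}

bool fifo_read(bounded_fifo_t *f, int *token)
{
    if (f->count == 0)                    /* empty: consumer blocked */
        return false;
    *token  = f->buf[f->head];
    f->head = (f->head + 1) % FIFO_SIZE;
    f->count--;
    return true;
}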
5.4 Common Modeling Formalisms

Most modeling formalisms are control dominated or data dominated. However, as already noted, embedded systems are composed of a mixture of reactive behavior, control functions, and data processing, especially those targeted at networking and multimedia applications. In the following sections, a set of selected formalisms is presented, taking FSMs as the starting point, as they have proved adequate for modeling control-dominated systems of low to medium complexity. Numerous proposals extending FSMs in several directions can be found in the literature. Each extension tries to overcome one or more intrinsic shortcomings of FSMs, from the inability to model concurrency and the associated state-space explosion problem, to the modeling of data processing and the absence of hierarchical structuring mechanisms (supporting specification at different levels of abstraction). After the control-dominated formalisms (emphasizing the reactive nature of embedded systems), dataflow-dominated formalisms are presented.
5.4.1 Finite State Machines

Finite state machines are common computational models that have been used by system designers for decades. It is common to represent FSMs in different ways, from graphical representations (like state diagrams and flowcharts) to textual representations. In this chapter, state diagrams are used. The modeling attitude is based on the characterization of the system in terms of the global states that the system can exhibit, and of the conditions that can cause a change in those states (transitions between states).

A basic FSM consists of a finite set of states S (with a specified initial state sis), a set of input signals I, a set of output signals O, an output function f, and a next-state function h. The next-state and output functions (h and f, respectively) map a crossproduct of S and I into S and O, respectively (h: S × I → S, f: S × I → O). Two basic models can be considered for output modeling: the Moore-type machine [13], also called a state-based FSM, where outputs are associated with state activation (and where the output function f only maps states S into outputs O), and the Mealy-type machine [14], also called a transition-based FSM, where outputs are associated with transitions between states. It is important to note that both models have the same modeling capabilities. The referred FSM model can be limited or extended to accommodate different needs (specific modeling capabilities or target architectures), as analyzed in some of the following sections.

Figure 5.3 illustrates a basic notation for a state diagram. Circles or ellipses represent states; transitions between states use a directed arc. Each arc has an attached expression, potentially containing a reference to the input event and/or to an external condition that will cause the change of state. Outputs can be modeled as Moore-type output actions (associated with states, such as z in state S2), or as Mealy-type output events (associated with transitions, such as x in the presented transition expression).

FSMs are a control-dominated MoC, and thus intrinsically adequate to model the reactive component of an embedded system. We will introduce a running example, adapted from Reference 15, and we will start by using an FSM to model the system controller. The system to be modeled is the controller of an electric car installed in an industrial plant. The electric car has to carry goods from one point to another and come back.
FIGURE 5.3 State diagram basic notation.
FIGURE 5.4 Electric car plant running example.
FIGURE 5.5 State diagram models of an electric car plant controller. (From Luís Gomes and João Paulo Barros, Models of Computation for Embedded Systems. In The Industrial Information Technology Handbook, Richard Zurawski, Ed., Section VI — Real Time and Embedded Systems, chapter 83, CRC Press, Boca Raton, FL, 2005. With permission.)
The controller receives commands from the operator: actuation of the key "GO" starts the movement from the home position, and actuation of the key "BACK" forces the car to return to the home position after the end position has been reached. After receiving an order, the car motor is activated accordingly while the initial, or final, position has not been reached. Two sensors are available for detecting that the home and end positions have been reached, A and B, respectively. Figure 5.4(a) represents the external view of the controller in terms of inputs and outputs, and Figure 5.4(b) illustrates the layout of the plant.

Figure 5.5(a) and (b) present two possible (and equivalent) models for the control of the referred system. The first relies on the evaluation of external conditions (signal values are explicitly checked), while the second relies on external events (obtained through the preprocessing of the external signals). It is clear that the usage of events produces a lighter model, with fewer arcs and inscriptions (it is assumed in this representation that the event associated with a signal is generated when the signal changes its state from "0" to "1"). A minimal C sketch of the event-based model follows.
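The following sketch implements the event-based Moore model of Figure 5.5(b); the type and function names are ours, and, as in the figure, events are assumed to be generated on the rising edges of the input signals.

typedef enum { S0, S1, S2, S3 } car_state_t;
typedef enum { EV_GO, EV_B, EV_BACK, EV_A } car_event_t;
typedef enum { DIR_LEFT, DIR_RIGHT } dir_t;

typedef struct { int m; dir_t dir; } car_outputs_t;

car_state_t car_step(car_state_t s, car_event_t ev, car_outputs_t *out)
{
    switch (s) {
    case S0: if (ev == EV_GO)   s = S1; break;   /* GO: leave home       */
    case S1: if (ev == EV_B)    s = S2; break;   /* B: end reached       */
    case S2: if (ev == EV_BACK) s = S3; break;   /* BACK: return ordered */
    case S3: if (ev == EV_A)    s = S0; break;   /* A: home reached      */
    }
    /* Moore outputs, associated with state activation only:             */
    out->m   = (s == S1 || s == S3);             /* motor on while moving */
    out->dir = (s == S1) ? DIR_RIGHT : DIR_LEFT; /* don't-care when m==0  */
    return s;
}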
FIGURE 5.6 FSM implementation reference model.
From the point of view of the implementation model, it is common to decompose the system into a set of functions that compute the next state and the outputs, and a set of state variables, as presented in Figure 5.6. From the execution semantics point of view, one of two reference approaches can be chosen (corresponding to different MoCs) [2]:

1. Synchronous FSMs.
2. Asynchronous FSMs.

In synchronous FSMs, both computation and communication happen instantaneously at discrete-time instants (under the control of clock ticks). In this sense, from the point of view of active state changes, each transition arc expression is implicitly "ANDed" with a rising (or falling) edge event of the clock signal. Referring to Figure 5.6, the clock signal is connected to the "State variables" block (for hardware implementations, a register will be used to implement this block, while for software implementations the clock will be used to trigger the execution cycle; a C sketch of this scheme is given below). One strong argument for the use of synchronous FSMs is their implementation robustness, especially when using synchronous hardware. However, when heterogeneous implementations are foreseen, some difficulties or inefficiencies may arise (namely, synchronous clock signal distribution). For distributed heterogeneous systems, it is also of interest to consider a globally asynchronous locally synchronous approach (GALS systems), where the interaction between components is asynchronous, although the implementation of each component is synchronous. So, within a synchronous implementation island (a component), it is possible to rely on robust compilation techniques, either to optimally map FSMs into Boolean and sequential circuits (hardware) or into software code (supported by specific tools).

In asynchronous FSMs, process behavior is similar to that of synchronous FSMs, but without the dependency on a clock tick. An asynchronous system is a system in which two events cannot have the same time tag. In this sense, two asynchronous FSMs never execute a transition at the same time (asynchronous interleaving). For heterogeneous architectures or for multirate specifications, implementation can be easier than in the synchronous case. The difficulties come from the need to synchronize communicating transitions and to assure that they occur at the same instant, which is essential for a correct implementation of rendezvous on a distributed architecture.

FSMs have well-known strengths and weaknesses. Among the strengths, we should mention that they are simple and intuitive to understand, and that they benefit from the availability of robust compilation tools. These are some of the reasons why designers have extensively used them in the past, and continue to use them. Unfortunately, several weaknesses prevent their usage for modeling complex systems.
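Returning to the decomposition of Figure 5.6, the condition-based model of Figure 5.5(a) can be written, under the synchronous semantics, as pure next-state and output functions plus a state register updated only on the clock tick. This minimal sketch reuses the types of the previous fragment; the tick-driven entry point is our assumption of how the clock would trigger the execution cycle in software.

typedef struct { int go, back, a, b; } car_inputs_t;

static car_state_t h(car_state_t s, car_inputs_t in)   /* next-state    */
{
    switch (s) {
    case S0: return in.go   ? S1 : S0;
    case S1: return in.b    ? S2 : S1;
    case S2: return in.back ? S3 : S2;
    case S3: return in.a    ? S0 : S3;
    }
    return s;
}

static car_outputs_t f(car_state_t s)                  /* Moore outputs */
{
    car_outputs_t o;
    o.m   = (s == S1 || s == S3);
    o.dir = (s == S1) ? DIR_RIGHT : DIR_LEFT;
    return o;
}

/* Invoked once per clock tick (e.g., from a periodic interrupt); every
   transition condition is thereby implicitly "ANDed" with the tick.    */
void on_tick(car_state_t *reg, car_inputs_t in, car_outputs_t *out)
{
    *out = f(*reg);        /* output function, from the current state   */
    *reg = h(*reg, in);    /* state register updated on the tick        */
}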
FIGURE 5.7 Control and datapath decomposition.
Namely, FSMs do not provide data processing capabilities, support for concurrency modeling, (practical) support for data memory, or hierarchical constructs. Several of the modeling formalisms to be presented next try to overcome one, some, or all of the referred weaknesses.
5.4.2 Finite State Machines with Datapath

One common extension to FSMs, trying to cope with the lack of support for data memory and data processing capabilities, is the finite state machine with datapath (FSMD) [16]. For instance, to model an 8-bit variable with 256 possible values through an FSM, it is necessary to use 256 states; the model loses its expressiveness and the designer cannot manage the specification. An FSMD adds to a basic FSM a set of variables and redefines the next-state and output functions. So, an FSMD consists of a finite set of states S (with a specified initial state sis), a set of input signals I, a set of output signals O, a set of variables V, an output function f, and a next-state function h. The next-state function h maps a crossproduct of S, I, and V into S (h: S × I × V → S). The output function f maps current states into outputs and variables (f: S → O + V). As defined, the output function f only supports Moore-type outputs; it can easily be extended to accommodate Mealy-type outputs as well.

From the implementation point of view, an FSMD model is decomposed as presented in Figure 5.7, where the control part can be represented by a simple FSM model, and the datapath part can be characterized through a register transfer architecture. So, the datapath is decomposed into a set of variables to store operands and results, and a set of processing blocks to perform computations on those values. It has to be stressed that this is the common reference architecture for single-purpose processor and simple microprocessor designs.

As a simple example, Figure 5.8 presents the decomposition associated with the modeling of a multiplier of two numbers, A and B, producing the result C through successive additions. Figure 5.8(a) presents the top-level decomposition and the interconnections of the control and data blocks, while Figure 5.8(c) presents a simple FSM to model the control part and Figure 5.8(d) shows the register transfer architecture that supports the required computations (the left-hand side is responsible for counting B times, while the right-hand side is responsible for the successive additions of A into C). A behavioral C sketch of this decomposition follows.
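This sketch mirrors the control/datapath decomposition; the register and control-signal roles follow Figure 5.8, while the state encoding and the GO/OK handshake details are our reading of the figure.

typedef enum { M_IDLE, M_INIT, M_ADD, M_DONE } mult_state_t;

typedef struct {
    mult_state_t state;            /* control part (FSM)             */
    unsigned ra, rb, cb, rc;       /* datapath registers             */
    int ok;                        /* completion output              */
} multiplier_t;

/* One clock tick of the FSMD computing C = A x B by successive
   additions of A, B times.                                          */
void mult_tick(multiplier_t *m, int go, unsigned a, unsigned b)
{
    switch (m->state) {
    case M_IDLE:                              /* wait for GO         */
        m->ok = 0;
        if (go) m->state = M_INIT;
        break;
    case M_INIT:                              /* LOAD_A, LOAD_B,     */
        m->ra = a; m->rb = b;                 /* CLEAR_B, CLEAR_C    */
        m->cb = 0; m->rc = 0;
        m->state = M_ADD;
        break;
    case M_ADD:                               /* LOAD_C, INC_B loop  */
        if (m->cb == m->rb) { m->state = M_DONE; break; }
        m->rc += m->ra;                       /* successive addition */
        m->cb++;
        break;
    case M_DONE:                              /* result in RC        */
        m->ok = 1;
        if (!go) m->state = M_IDLE;           /* wait for GO release */
        break;
    }
}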
5.4.3 Statecharts and Hierarchical/Concurrent Finite State Machines

A second common extension to FSMs tries to cope with the lack of support for concurrency and hierarchical structuring mechanisms of the model (still emphasizing the reactive part of the model).
FIGURE 5.8 Decomposition of a multiplier into control and datapath.
Several formalisms can be included in the group of hierarchical/concurrent finite state machines (HCFSMs), all of them including mechanisms for concurrency and hierarchy support, but having different execution semantics. Among them, Statecharts [7,17] are the most well-known modeling formalism providing an MoC to specify complex reactive systems. One main advantage of Statecharts over FSMs is the structuring of the specification, which improves legibility and system maintenance. Those characteristics were key points supporting their adoption as one of the specification formalisms within the UML [18–20].

Statecharts are based on state diagrams, plus the notions of hierarchy, parallelism, and communication between parallel components. Statecharts were informally defined in [17] as "Statecharts = state-diagrams + depth + orthogonality + broadcast-communication." The depth concept encapsulates the multilevel hierarchical structuring mechanism and is supported by the XOR refinement mechanism, while the orthogonality concept allows concurrency modeling and is supported by the AND refinement mechanism. Unfortunately, the semantics of the broadcast-communication mechanism is not the same in all Statecharts variants, as it was defined in different ways by several authors. This fact had a strong impact on the possible Statecharts operational semantics, as discussed later in this section.

Statecharts define three types of state instances: the set (implementing the AND refinement mechanism), the cluster (implementing the XOR refinement mechanism), and the simple state. The cluster supports the hierarchy concept, through the encapsulation of state machines. The set supports the concurrency concept, through the parallel execution of clusters. Figure 5.9 illustrates the usage of the cluster mechanism, adopting a bottom-up approach. Starting with the SYS_C model, the state diagram composed of states C and D, and associated arcs, can be encapsulated in state A, as represented in the SYS_B model. This provides a top-level view of the model composed only of states A and B, complemented by the inner level if one wants further details about the system behavior, as represented in the SYS_A model. In this sense, the designer has the possibility to describe the system at different levels of abstraction.
FIGURE 5.9 Usage of XOR refinement in Statecharts basic models.
FIGURE 5.10 Usage of AND refinement in Statecharts basic models. (From Luís Gomes and João Paulo Barros, Models of Computation for Embedded Systems. In The Industrial Information Technology Handbook, Richard Zurawski, Ed., Section VI — Real Time and Embedded Systems, chapter 83, CRC Press, Boca Raton, FL, 2005. With permission.)
The designer is free to follow a top-down or a bottom-up approach while producing the system's model, by applying the hierarchical decomposition constructs available through the XOR refinement mechanism. Figure 5.10 presents a simple model containing a set A composed of three AND components (B, C, and D); whenever A is activated/deactivated, the associated components B, C, and D will also be activated/deactivated.

Apart from the referred main characteristics, the Statecharts formalism presents some interesting features, such as:

• The default state, which defines the state that will take control when a transition reaches a cluster state. In the SYS_B model of Figure 5.9, the system initial state is state B, and, after the occurrence of w, states A and C will become active.
• The notion of history, simple or deep, which can be associated with cluster state instances. When the system enters a cluster with the history property, the state that will be active upon entrance is the one that was active upon the last exit from that cluster. On the first entrance into the cluster, the active state is the default one. This is the case for cluster C of Figure 5.10, which holds the "H" attribute inside a circle. The history property can also be deep history, meaning that all the clusters inside the cluster with the deep-history property also have that property; this is the case for cluster B in Figure 5.10, which holds the "H∗" attribute inside a circle.
• The activation expressions present in transitions can also have special events, special conditions, and special actions. Special events include the entered(state) and exited(state) events, which are generated whenever the system enters or exits some state, respectively. One special condition is in(state), which indicates whether the system is currently in the specified state. Examples of special actions are clear_history(state) and clear_deep_history(state), which initialize a cluster history state to the default state.

So far we have only characterized syntactic issues of the Statecharts formalism. Its semantic characterization includes, among other things, the characterization of the step algorithm and the definition of the criteria for solving conflicts (which set of triggering transitions to choose among a set of conflicting transitions), in order to obtain a deterministic behavior. The step algorithm is of fundamental importance, as it dictates the way the system evolves from one global state to another. The triggering of a finite set of transitions causes the evolution of the system. In Reference 21, the triggering of this set of transitions is called a microstep (conceptually, it can be seen as equivalent to the delta concept used in the discrete event formalism). Transitions fired in a microstep can generate internal events that, in turn, can fire other transitions. Thus, a step is defined as a finite sequence of microsteps. This means that, after all the microsteps have taken place, there will be no more active transitions and the system will remain in a stable state. External events are not considered during a step, so one must assume that a system can completely react to an external event before the occurrence of the next external event.

The initial Statecharts proposal left the semantics partially open [22]. As a consequence, different semantics for the handling of broadcast events have been considered possible. Statecharts semantics proposals are discussed in Reference 23. Yet, as pointed out by Harel and Naamad [22], that survey does not mention the semantics implemented in the STATEMATE tool (in the following citation, the original reference citations were replaced by the citation numbering used in this chapter): "the survey [23] does not mention the STATEMATE implementation of Statecharts or the semantics adopted for it at all, although this semantics is different from the ones surveyed therein (and was developed earlier than all of them except for [22])." Even so, the most common semantics consider that a generated event is only valid during the next microstep (the more used option) or for the remaining microsteps within the same step (persistent events).
Coming back to the example of the electric car controller: if one considers controlling two or three of those cars with the same controller, and tries to use a single state diagram to model the whole system, one faces the well-known problem of state explosion, due to the orthogonal composition of state diagrams. If we have two independent state diagrams with N1 and N2 states, the composed state diagram will contain N1 × N2 states, as for each state of machine 1, machine 2 may be in any of its states. This illustrates the already presented weakness resulting from the FSM's lack of support for concurrency modeling.

Now consider that we want to keep the cars synchronized at both ends, which means that we will wait for all cars at each end before launching them in the opposite direction. The associated FSM model is presented in Figure 5.11, where the state explosion phenomenon can be observed, together with the lack of intuitive legibility of the resulting model. Figure 5.12 presents a Statecharts model (TWO_CARS) enabling a compact model of the referred system. The system is composed of an AND set (TWO_CARS) of two XOR clusters (CAR1 and CAR2), each modeling one of the cars. It has also to be noted that the rounded rectangle representation for states takes better advantage of the drawing area, allowing a more readable graphical notation (compared with the circle representation of common state diagrams).
FIGURE 5.11 An FSM model for two cars controller system.
FIGURE 5.12 Simple Statecharts model for electric cars example.
Additional constraints associated with the global start and global return are explicitly modeled through specific arc inscriptions; for instance, the arc expression leaving state S01 is augmented with the condition [in(S02)], imposing that CAR1 only starts its travel if CAR2 is also in its original position (with an identical constraint for CAR2). A sketch of this model as code is given below. The reader should refer to References 7, 17, and 21 for further details concerning the Statecharts formalism.
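A minimal C rendering of the TWO_CARS model may help: each XOR cluster becomes a state variable, each in(state) guard becomes a test on the other component's state, and, following the step semantics discussed earlier, all guards are evaluated on a snapshot of the state taken at the beginning of the step, so that both orthogonal components can fire in the same step. The state names are ours.

typedef enum { HOME, GOING, AT_END, RETURNING } leg_t;
typedef struct { leg_t car1, car2; } two_cars_t;

void two_cars_step(two_cars_t *s, int go, int back,
                   int a1, int b1, int a2, int b2)
{
    const two_cars_t pre = *s;    /* snapshot: all guards see the state
                                     at the beginning of the step      */

    /* CAR1; GO[in(S02)]: start only if CAR2 is also at home.          */
    if      (pre.car1 == HOME && go && pre.car2 == HOME)
        s->car1 = GOING;
    else if (pre.car1 == GOING && b1)
        s->car1 = AT_END;
    /* BACK[in(S22)]: return only when CAR2 has reached the end.       */
    else if (pre.car1 == AT_END && back && pre.car2 == AT_END)
        s->car1 = RETURNING;
    else if (pre.car1 == RETURNING && a1)
        s->car1 = HOME;

    /* CAR2 is symmetric, guarded on CAR1's pre-step state.            */
    if      (pre.car2 == HOME && go && pre.car1 == HOME)
        s->car2 = GOING;
    else if (pre.car2 == GOING && b2)
        s->car2 = AT_END;
    else if (pre.car2 == AT_END && back && pre.car1 == AT_END)
        s->car2 = RETURNING;
    else if (pre.car2 == RETURNING && a2)
        s->car2 = HOME;
}

The Moore outputs (M1, DIR1, M2, DIR2) would be derived from the two state variables exactly as in the single-car sketch above.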
5.4.4 Program-State Machines

A third extension to FSMs is the program-state machine (PSM) model, which allows the use of sequential program code to define a state action.
FIGURE 5.13 A simple PSM model.
A PSM includes the hierarchy and concurrency extensions of HCFSMs, but also merges in the sequential program model, subsuming both models [4]. So, in a PSM, the concept of state is substituted by the concept of program-state, in the sense that, on top of the state concept, the state actions are defined using a sequential program. We can then consider two extreme situations:

1. A PSM model with only one state, where a sequential program is associated as the state action. In this case, the model is equivalent to a sequential program.
2. A PSM model with many states, hierarchically organized or not, within clusters and sets, but where all output actions are simple assignment statements (like the ones used in Figure 5.12); this is equivalent to an HCFSM model.

In common modeling applications, a mixture of the two extreme situations will be found. Figure 5.13 presents a simple PSM model skeleton. It is also possible, in some situations, to model a specific process either through a set of states (an FSM) or through a sequential program (associated with one program-state); in these situations, the designer can produce completely different models, although with equivalent behaviors. So, the designer may either emphasize the graphical expressiveness of an FSM, or take advantage of the conciseness associated with textual program representations.

Some new features of PSMs have to be highlighted. First of all, only well-formed hierarchical constructs are allowed. This means that it is not possible to define a transition arc between two states crossing the boundary of a cluster or of a set; only transitions between states in the same cluster are allowed, which means arcs between states with the same parent state. This is due to the fact that the PSM's model of hierarchy is similar to that of sequential programming languages, which use subroutines with only one entry point and do not specify the point of return. This is a big difference with respect to Statecharts, where no restrictions on arc connectivity are imposed.

From the computational point of view, three types of program-states are defined: leaf, sequential, and concurrent. The hierarchical composition of program-states can be sequential or concurrent (in the common sense used in HCFSMs: in concurrent constructs all the components are active/inactive at a specific point in time, while in sequential constructs only one component is active at a time); leaf program-states exist at the bottom of the behavioral hierarchy and can have a sequential program attached to them. From the execution point of view, a program-state can be in one of three statuses: inactive, executing, and complete. If the program-state is a sequential program, then reaching the end of the code means that the program-state is complete. Special complete states can be added to the model, represented by a dark square (as in Figure 5.13). Complete states only receive arc transitions from the substates at the same level (with the same parent state);
no transition can start at a complete state. When control is transferred to the complete state, the associated program-state is complete. As a consequence, PSM supports two types of transitions:

1. A transition-immediately (TI), as in Statecharts, which occurs as soon as the associated condition becomes true, regardless of the status of the source program-state. This is the case of the transition in Figure 5.13 starting at PS5 and having the condition x = 0 attached.
2. A transition-on-completion (TOC), which occurs when the associated condition is true and the source program-state is complete.

The graphical notation considers that a TI starts at the boundary of the program-state, while a TOC starts from a dark square inside the state; Figure 5.13 presents several examples. In this sense, the reactive nature of HCFSMs/Statecharts is preserved using TI transitions, while the data processing and transformational nature of sequential program models is supported using TOC transitions (a small C sketch of the two enabling rules is given below). The PSM modeling formalism was used by the SpecCharts [24] and SpecC [25] languages. In the SpecCharts language, VHDL was used for inscriptions and code generation, while in the SpecC language, C was initially used for inscriptions and code generation (currently it is also possible to generate HDL code).
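The two enabling rules can be summarized as follows (all names are ours):

typedef enum { INACTIVE, EXECUTING, COMPLETE } ps_status_t;

typedef struct {
    ps_status_t status;   /* maintained by the program-state's own code */
} program_state_t;

/* Transition-immediately (TI): fires on its condition alone,
   regardless of the status of the source program-state.         */
static int ti_enabled(const program_state_t *src, int condition)
{
    (void)src;            /* source status is irrelevant for a TI */
    return condition;
}

/* Transition-on-completion (TOC): additionally requires that the
   source program-state has reached the end of its code.          */
static int toc_enabled(const program_state_t *src, int condition)
{
    return condition && src->status == COMPLETE;
}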
5.4.5 Codesign Finite State Machines

Codesign Finite State Machines (CFSMs) are another modeling formalism for embedded system design [3,26,27]. A system is described by a network of CFSMs. They extend FSMs with support for data handling and asynchronous communication. Data handling is associated with transitions, and these are associated with external instantaneous functions. This implies that data manipulation should typically exhibit a low algorithmic complexity. In this sense, CFSMs support the specification of embedded systems involving both control and dataflow aspects, and have the following key characteristics:

• They are composed of FSM components.
• They have a data computation part, in the form of references in the transition relation to external, instantaneous (combinatorial) functions.
• They use a GALS reference model, where: (a) local behavior is synchronous (as seen inside each component); (b) global behavior is asynchronous (as seen by the rest of the system).

Each CFSM reads inputs, executes a transition, and produces outputs in an unbounded but finite amount of time. Communication between CFSMs is supported by signals, both control signals and data signals. Signals carry information in the form of signal events and/or signal values, and may function as input signals, output signals, or state signals. A connection, with an associated input buffer (a 1-place buffer), is used to support signal communication between two CFSMs. The buffer contains one memory element for the event (the event buffer) and one for the data (the data buffer). An event is produced by a sender CFSM by setting the event buffer to 1. Afterwards, the event can be detected (through its reading) and consumed (by setting the buffer to 0) by a receiver CFSM. Between being emitted by the sender CFSM and being consumed by the receiver CFSM, an event is present (asynchronous communication among CFSMs).

For a CFSM A to send a signal S to a CFSM B, it is necessary that CFSM A writes the data of S into the buffer and afterwards emits its event. CFSM B then detects the event, reads the data, and consumes the event. This order ensures that the data are transferred (a C sketch of this emit/detect/consume protocol follows). A multicast communication mechanism is also available through a net, which is a set of connections on the same output signal, in the sense that a sender communicates a signal to several receivers with a single emission, and each receiver has a private copy of the communicated signal. Afterwards, each receiver independently detects the event, reads the associated data, and consumes the event.
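This minimal sketch of a CFSM connection ignores the memory-ordering precautions a real shared-memory implementation would require; all names are ours.

typedef struct {
    volatile int event;    /* 1 = event emitted and not yet consumed */
    volatile int data;     /* data buffer                            */
} connection_t;

/* Sender CFSM: write the data buffer first, then emit the event, so
   a receiver that sees the event is guaranteed to read valid data.  */
void cfsm_emit(connection_t *c, int data)
{
    c->data  = data;
    c->event = 1;
}

/* Receiver CFSM: detect the event, read the data, consume the event. */
int cfsm_detect_and_consume(connection_t *c, int *data)
{
    if (!c->event)
        return 0;          /* no event present                        */
    *data    = c->data;
    c->event = 0;          /* consume                                 */
    return 1;
}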
FIGURE 5.14 Simple CFSM and associated SHIFT representation.
From the implementation point of view, a system described by a set of CFSMs can be decomposed into a set of CFSMs to be implemented in software, a set of CFSMs to be implemented in hardware, and a communication interface between them. Any high-level language with precise semantics based on extended FSMs can be used to model individual CFSMs; currently, Esterel is directly supported. The CFSM model is strongly connected with the POLIS approach [27], and can be represented using a textual notation named SHIFT (Software–Hardware Intermediate FormaT), which extends BLIF-MV [28], in turn a multivalued version and extension of the Berkeley Logic Interchange Format (BLIF). A CFSM representation in SHIFT is composed of a set of input signals, a set of output signals, a set of states or feedback signals, a set of initial values for output and state signals, and a transition relation describing the reactive behavior. Figure 5.14 presents a simple CFSM and its associated SHIFT textual representation.
5.4.6 Specification and Description Language
Specification and Description Language (SDL) is a modeling language standardized by the ITU (International Telecommunication Union) [29]. Basically, an SDL model is a set of FSMs running in parallel. These machines have their own input queues and communicate using discrete signals. The signals can be asynchronous or synchronous; the latter correspond to synchronous remote procedure calls. All signals can have associated parameters to interchange and synchronize information between SDL processes, and also between these and the environment. SDL has been designed for the development of concurrent, reactive, real-time, distributed, and heterogeneous systems. Yet, by and large, the major application area for SDL is telecommunications. It is usually the language chosen to define and standardize communication protocols. The specifications of many communication systems use SDL. These include the GSM second-generation mobile telephony system, the UMTS third-generation mobile telephony system, the ETSI HiperLAN 2 broadband radio access network, and the IEEE 802.11 wireless Ethernet local area network [30]. The language has been evolving since the first ITU Z.100 Recommendation in 1980, with updates in 1984, 1988, 1992, 1996, and 1999. The 1992 version added some object-oriented concepts. These were significantly expanded in the latest version (SDL-2000) toward a better integration with the UML [31]. This trend should continue, as SDL is increasingly being used together with UML. An article by Rick Reed [32] provides a quite comprehensive overview of SDL's history up to 1999. More recent developments, and all other types of SDL-related information, are available at the SDL Forum Society Internet site [33]. In particular, SDL-2000 [29] breaks compatibility with SDL-96 and adds numerous significant modifications. Here, we restrict ourselves to the main fundamental concepts, thus avoiding the main incompatibility issues.
SDL has four structuring concepts: system, blocks, processes, and procedures. Besides these structuring concepts, SDL has two communication-related constructions:
1. Channels (operate at system level, connecting blocks).
2. Signal routes (operate inside blocks, connecting processes).
The system is the entry point of an SDL specification. Its main components are blocks and channels. Channels connect blocks with each other and also with the environment. Communication is only possible along the specified channels. Channels can be unidirectional or bidirectional. By default, channels are free of errors and preserve the order of the transmitted signals. These properties can be modified through the definition of channel substructures. A block is a static structure defined as a set of interrelated blocks or as a set of interrelated processes. In SDL-2000, system, blocks, and processes were harmonized as agents. The state machines are addressed by the name of the enclosing agent. Agents may include their own state machine, as well as other agents and shared data declarations. In SDL-2000, blocks can contain other blocks together with processes. Yet, processes still cannot contain blocks or systems. The main difference between blocks and processes is in the implicit concurrency type: blocks are truly concurrent, while agents inside processes have an interleaved semantics. Each process has its own signal queue and usually contains only a state machine. In contrast, blocks are usually used as a structuring diagram showing other blocks and other processes. The system is the top-level block that interfaces with the environment. Figure 5.15 shows an SDL block named controller. This block, together with the process car in Figure 5.16, models the example system of the electric car controller already presented in Figure 5.4. The controller is defined as an SDL block. As it is the only block, it can be seen as the system (more precisely, the system contains a single block named controller). The numbers in the top-right corner specify the page number being shown and the total number of pages in the current level. The block defines the signals inside a text element (a rectangle with a folded top-right corner). Note that the two output signals M and DIR carry data (an Integer parameter). These are used to assign values to the outputs. Besides the text element, the controller block contains a single process named car. This process is connected to the environment through two input channels (passage and gb) and two output channels (motor and direction).
FIGURE 5.15 A controller for one electric car.
FIGURE 5.16 The process car.
The signals handled by the channels are specified as signal lists (inside square brackets). For example, the passage channel handles two signals: A and B. Notice the graphical similarity between the process car notation and the controller external view in terms of inputs and outputs, as in Figure 5.4(a). Figure 5.16 shows the car process. It starts with a start symbol in the top-left corner and immediately proceeds to state S0. The process waits for the reception of signal GO on its input queue and emits two output signals: one assigns 1 to M and the other assigns right (defined as a value of zero) to the DIR output signal. It stops in state S1. Afterward, the process waits for the B signal, assigns zero to the M output, and arrives at state S2. It then proceeds according to the presented example until it reaches the initial state S0 again. We now present a generalized version for N car processes. We consider N = 3. Figure 5.17 shows a block controller able to model three cars, which synchronize on the GO and BACK signals. Now, the process car has the notation car(3,3), which specifies that three instances are created on startup and a maximum of three can exist. Additionally, we define another process named Sync. This process forces the synchronization of the three parallel car processes on the GO and BACK events. A car process (see Figure 5.18) starts by registering itself with the Sync process by passing its process identifier (PId) in the signal carPId through channel register. When a car process is ready to receive the GOSync or the BACKSync signal from process Sync (which means that the car is ready at its home or end position), it sends a GOok or a BACKok signal through channel ok. Afterwards, each car process can receive the respective GOSync or BACKSync signal from process Sync through channel gbSync. The remaining signals are now indexed by the respective process identifiers (a PId type parameter). This provides a simple way to model an arbitrary number of car processes. In the presented example, the initial, and also the maximal, number of car process instances is three. The Sync process (see Figure 5.19) starts by registering the N car process instances. It receives N carPId signals carrying the PIds, each sent by a different car process. To this end, the process defines an Array data structure named cars, which acts as a list: when all N car processes have sent their GOok signal, the Sync process, after reception of the GO signal, sends a GOSync signal to each of the PIds in the cars array. The same happens for the BACKok and BACKSync signals.
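The synchronization logic just described is essentially a barrier: Sync counts GOok (or BACKok) signals until all N cars have reported, and then multicasts GOSync (or BACKSync) to every registered process identifier. The following minimal Python sketch is our own illustration and abstracts away the SDL signal routes and input queues.

class Sync:
    """Barrier-style synchronizer for N car processes (illustrative only)."""
    def __init__(self, n):
        self.n = n
        self.cars = []                  # registered process identifiers
        self.count = 0

    def register(self, pid):            # corresponds to the carPId signal
        self.cars.append(pid)

    def go_ok(self, send):              # one GOok received
        self.count += 1
        if self.count == self.n:        # the decision node: c = N ?
            for pid in self.cars:       # "GOSync to cars(c)" multicast
                send(pid, "GOSync")
            self.count = 0              # c := 0

sync = Sync(3)
for pid in ("car1", "car2", "car3"):
    sync.register(pid)
for pid in ("car1", "car2", "car3"):
    sync.go_ok(lambda p, s: print(p, "<-", s))   # fires on the third GOok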
FIGURE 5.17 A controller for three cars synchronizing in the GO and BACK events.
Unlike the previous processes, the Sync process uses tasks: rectangles with language instructions, for example, c := c + 1. It also uses decision nodes: lozenges with an associated condition (c = N) and two possible outcomes (true and false). A process can also invoke procedures. A procedure is executed sequentially while the caller process waits for it to terminate. When the procedure terminates, the process continues from the symbol following the procedure invocation. This mimics what happens in a typical procedural programming language (e.g., PASCAL). Accordingly, SDL procedures can also return a value and can have parameters passed by value as well as by reference. Specification and Description Language is usually used in combination with other languages, namely MSCs [34], ASN.1 (Abstract Syntax Notation One), and TTCN (Testing and Test Control Notation). ASN.1 is an international standard for describing the structure of data exchanged between communicating systems (ITU-T X.680 to X.683 | ISO/IEC 8824-1 to 4). TTCN is a language used to write detailed test specifications. The latest version of the language, TTCN version 3 (TTCN-3), is standardized by ETSI (ES 201 873 series) and the ITU-T (Z.140 series). The object model notation of SDL-2000, used in combination with MSC, traditional SDL state models, and ASN.1, covers most system engineering aspects. These languages have been studied in the same group within the ITU:
• The ITU Z.105(11/99) and Z.107(11/99) standards define the use of SDL with ASN.1.
• The ITU Z.109(11/99) standard defines a UML profile for SDL.
• The ITU Z.120(11/99) standard defines MSCs.
FIGURE 5.18 The new process car.
5.4.7 Message Sequence Charts
Message Sequence Chart (MSC) is a scenario language standardized by the ITU [34]. Although originally invented in the telecommunications domain, nowadays it is widely used in embedded, real-time, and distributed systems, among others; it is not tailored to any particular application domain. Before its approval at the ITU meeting in Geneva in 1992, the MSC recommendation was used merely as an informal language. After being approved as a standard, MSC has seen increasing usage in numerous application areas. After the first revision in 1996, the object-oriented community has shown an increasing interest in the MSC standard, namely for use case formalization [35]. Basically, MSC is a graphical and textual language used for the description and specification of the interaction between system components. It is used in the early stages of system specification to represent system behavior. In the present chapter, we will focus on the MSC graphical representation. It is a simple, two-dimensional diagram, giving an overview of the system behavior. The behavior is characterized through messages exchanged between the system entities. Usually, these entities are called instances or processes and can be any part of the system: a subsystem, a component, a process, an object, or a user, among others. The entities are represented by vertical lines, and the messages between them are oriented arrows from the sender to the receiver.
FIGURE 5.19 Sync process.
FIGURE 5.20 An elementary MSC.
Figure 5.20 shows an example of an elementary MSC in graphical form. Here, we have three entities, A, B, and C, represented by vertical lines (axes). Each axis starts in the "instance head symbol" (a nonfilled rectangle) and ends in the "instance end symbol" (a filled rectangle). The messages exchanged between the instances are represented by oriented labeled arrows (e.g., A sends the message m_ab to B). Time has no explicit representation in MSC diagrams, but we can assume that time flows along the vertical axis, in the top-down direction. Yet, it is not possible to represent the exact time when events or messages arrive; it is only possible to establish a partial order between messages. Within basic MSC models, we can also represent conditions and timers. Conditions are states that must be reached before the "execution" can continue. They are represented as informal text in a hexagonal box.
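The partial order induced by an elementary MSC can be made concrete: every send event precedes the corresponding receive event, and events on the same instance axis are ordered top-down. The sketch below is our own construction, not part of the MSC standard, and the directions assumed for m_b and m_c are a guess made for illustration; it derives the generating pairs of the order (the full partial order is their transitive closure).

# Messages of Figure 5.20, listed top-down as drawn: (sender, receiver, label).
messages = [("A", "B", "m_ab"), ("A", "C", "m_ac"),
            ("B", "C", "m_b"), ("C", "B", "m_c")]

precedes = set()        # generating pairs of the "happens before" relation
last = {}               # most recent event on each instance axis
for sender, receiver, label in messages:
    snd, rcv = "!" + label, "?" + label
    precedes.add((snd, rcv))                # a send precedes its receive
    for inst, ev in ((sender, snd), (receiver, rcv)):
        if inst in last:                    # events on one axis are ordered
            precedes.add((last[inst], ev))  # top-down along the axis
        last[inst] = ev

for pair in sorted(precedes):
    print(pair)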
FIGURE 5.21 Using a synchronizing condition in an MSC model.
FIGURE 5.22 Using timers in an MSC model.
It is also possible to define a shared condition for synchronization purposes. This means that two or more processes must reach a specific state before continuing with further procedures (see Figure 5.21). On the other hand, three special events can be associated with timers: timer set, timer reset, and time-out. In the graphical representation, the timer set symbol has the form of an hourglass connected to the instance axis. The time-out is a message arrow attached to the hourglass symbol and pointing to the instance axis. The timer reset is represented by a cross symbol connected to the instance axis by a line (see Figure 5.22). Message Sequence Charts also support structured design. This means that beyond the representation of simple scenarios using basic MSCs, it is possible to represent more complex specifications by means of high-level message sequence charts (HMSCs). HMSCs are used for the hierarchical top-down specification of scenarios as a directed graph. An HMSC model has a start node and an end node, and the vertices can represent other "lower-level" HMSCs, basic MSCs ("MSC references"), or conditions. All conditions at the HMSC level are considered global. Each "MSC reference" refers to a specific MSC, which defines it. Figure 5.23 presents a simple HMSC model with associated basic MSCs. A possible scenario description for our electric car controller running example, already presented before, is the following: Considering that the car is at its home position, the operator activates the car motor in order to move it to the end of the line. Detection of the car's presence by sensor_B means that the car has reached the end of the line, and the car must be stopped.
FIGURE 5.23 An HMSC model: (a) top-level model, (b) associated inner level MSCs.
FIGURE 5.24 An MSC model for the electric car controller system.
After that, the operator can activate the motor in the reverse direction, in order to take the car back to its home position. When the car's presence is detected by sensor_A, the car must be stopped, because it is again at its home position. This scenario can be represented graphically by the MSC of Figure 5.24. For the case when we want to consider more than one car in the system, it is necessary to replicate instances and messages. In order to activate each car, it is necessary to send a message to each of them. The basic MSC for a two electric cars controller system is presented in Figure 5.25. Alternatively, we could also benefit from the HMSC representation.
FIGURE 5.25 An MSC model for two electric cars controller system.
FIGURE 5.26 Transition t1 firing: (a) before and (b) after.
5.4.8 Petri Nets
Carl Adam Petri proposed Petri nets in 1962 [36]. Petri nets can be viewed as a generalization of a state machine where several states can be active and transitions can start at a set of "states" and end in another set of "states." More formally, Petri nets are directed graphs with two distinct sets of nodes: places, drawn as circles, are the passive entities; and transitions, drawn as bars, rectangles, or squares, are the active entities. Places and transitions are connected by arcs in an alternating way: places can only be connected to transitions, and transitions can only be connected to places. Places can contain tokens, also called marks, drawn as small circles inside places. Figure 5.26 shows two nets, each with one transition (t1), five places (p1 to p5), and five arcs. The nets have distinct markings: net (b) corresponds to the marking of net (a) after the transition firing. A transition can only fire if all its input places contain at least one token. When a transition fires, one token is taken from each input place and one token is put in each output place; these destructions and creations of tokens are an atomic operation, in the sense that it cannot be decomposed. Notice that a transition can have only input arcs (sink transition) or only output arcs (source transition). More exactly, this Petri net model is a Place/Transition net [12,37]. Place/Transition nets are the most well-known and best-studied class of Petri nets. In fact, this class is sometimes just called Petri nets. Generalized Petri nets simply add the possibility of weights on arcs, with the expected change in semantics: each arc must now take or deposit the specified number of tokens.
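The firing rule for generalized Place/Transition nets can be captured directly in code. The following sketch is ours, written only for illustration: a net is a marking plus, for each transition, its weighted input and output arcs; firing is the atomic subtract-then-add update of the marking. The example marking is loosely modeled on the three-cars net of Figure 5.27(b).

from collections import Counter

class Net:
    def __init__(self, marking):
        self.m = Counter(marking)           # tokens per place
        self.pre, self.post = {}, {}        # weighted input/output arcs

    def add_transition(self, t, pre, post):
        self.pre[t], self.post[t] = pre, post

    def enabled(self, t):
        return all(self.m[p] >= w for p, w in self.pre[t].items())

    def fire(self, t):                      # atomic: consume, then produce
        assert self.enabled(t)
        for p, w in self.pre[t].items():
            self.m[p] -= w
        for p, w in self.post[t].items():
            self.m[p] += w

# Three cars at home; GO consumes 3 tokens at once (arc weight 3).
net = Net({"home": 3})
net.add_transition("GO", {"home": 3}, {"moving": 3})
if net.enabled("GO"):
    net.fire("GO")
print(dict(net.m))                          # {'home': 0, 'moving': 3}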
FIGURE 5.27 Petri net models for our electric cars example. (From Luís Gomes and João Paulo Barros, Models of Computation for Embedded Systems. In The Industrial Information Technology Handbook, Richard Zurawski, Ed., Section VI — Real Time and Embedded Systems, chapter 83, CRC Press, Boca Raton, FL, 2005. With permission.)
Coming back to our electric cars example, Figure 5.27(a) presents a Petri net model suitable for modeling a system with three electric cars, maintaining the constraint of synchronized starts and returns (to improve legibility, annotations associated with outputs are omitted, but they can be associated with different places or transitions, as in state diagrams and Statecharts). It should be noted that the transitions have the dynamics of the model associated with them, while the places hold the current state of each car. Figure 5.27(b) uses a generalized Petri net model (arcs have weights associated with them), enabling a more compact representation of the system. When thinking about introducing more cars into our system (the scalability problem), it is clear that, with this last model, it is only necessary to change the initial marking of the net: the left-most place should contain as many tokens as there are electric cars in the system. As even this simple example shows, Petri nets give no special emphasis to states, as state diagrams do, nor to actions, as dataflows do: states and actions are both "first-class citizens." In other words, Petri nets provide a pure dual interpretation of the modeled world: each entity is either passive or active. As already stated, passive entities (e.g., states, resources, buffers, channels, conditions) are modeled by places; active entities (e.g., actions, events, signals, executions of procedures) are modeled as transitions. Another fundamental characteristic of Petri nets is their locality of cause and effect. More concretely, this means that transitions fire based only on their vicinity, namely their input places and, possibly (for some classes of Petri nets), their output places. Two other fundamental aspects of Petri nets are their concurrent and distributed execution, and the instantaneous firing of transitions. The concurrent and distributed execution results from the locality of cause and effect: all transitions with no input or output places in common fire independently. This implies that the model is intrinsically distributed, as all transitions with disjoint localities do not depend on other transitions. The instantaneous firing means that each transition fires instantaneously: there are no intermediate states between token destruction in input places and token creation in output places. Petri nets are well known for their inherent simplicity due to the small number of elementary concepts. Yet, the simplicity of Petri nets is a two-edged sword [38]: on one hand, it avoids the addition of further complications to the model resulting from language complexity; on the other hand, it invariably implies that some desired feature is missing. As such, this basic type of Petri net has been modified in numerous ways, to the point that Petri net is really a general name for a large group of different definitions and extensions, although all claim some close relation to the seminal work by Carl Petri [36]. A very readable and informal discussion about the existence of many different Petri net classes can be found in the paper by Desel and Juhás [39]. This situation has its origins in the fact, easily confirmed by experience, that conventional Petri nets, also known as low-level Petri nets, are sometimes difficult to use in practice, due not only to the problem of rapid model growth, but also to the lack of more specific constructions.
Namely, time modeling, interfacing with the environment, and structuring mechanisms are frequent causes for extensions. While the former two are specific to some types of systems, structuring mechanisms are useful for even the simplest system model.
It is important to note that, contrary to state machines, the number of places grows linearly with the size of the system model, as not all states have to be represented explicitly but can result from the combination of several already existing "states" (modeled by places). This implies that Petri net models "scale better" than state machines (including Statecharts). Even so, model growth is still a significant problem for Petri net models due to their graphical nature and low-level concepts. To minimize this, several abstraction and composition mechanisms exist. These can be classified into two groups:
1. Refinement/abstraction mechanisms.
2. Composition mechanisms.
Refinement/abstraction constructs correspond, roughly, to the programming language concepts of macros or procedure invocation [53]. Intuitively, macro expansion is an "internal" composition where we get nets inside nets through the use of nodes (places or transitions). The use of places and transitions as encapsulated Petri nets goes back to the works of Carl Adam Petri [36] and was extensively presented in Reference 15. Such constructs are usually named macros (or macronodes) [40]. They support the hierarchical specification of systems, in the traditional sense of module decomposition/aggregation and a "divide and conquer" attitude. Figure 5.28 presents a two-level model of a typical producer–consumer system. At the top level, both producer and consumer are modeled by transitions (associated with the dynamics of the system), while storage is associated with a place (the static part of the system). It has to be noted that the executable model is obtained through the "flat" model, the one resulting from the merging of the different subsystems, as also presented in Figure 5.28. Several proposals have even added object-oriented features to Petri nets (see Reference 41 for an up-to-date detailed survey of some approaches). On the other hand, a common compositional mechanism is the folding of nets found in high-level nets. High-level nets can be seen as an internal composition made possible by structural symmetries inside a low-level net model. This is achieved by the use of high-level tokens. These tokens are no longer indistinguishable, but have data structures attached to them. Figure 5.29 shows the high-level Petri net model associated with our three electric cars system. The tokens present at the left-most place (initial marking) now contain integer values, which enable the identification of a specific car's status inside the model.
FIGURE 5.28 Hierarchical decomposition of a producer–consumer Petri net model.
FIGURE 5.29 High-level Petri net models for our electric cars example. (From Luís Gomes and João Paulo Barros, Models of Computation for Embedded Systems. In The Industrial Information Technology Handbook, Richard Zurawski, Ed., Section VI — Real Time and Embedded Systems, chapter 83, CRC Press, Boca Raton, FL, 2005. With permission.)
FIGURE 5.30 Handling of simultaneous events in a discrete event system. (From Luís Gomes and João Paulo Barros, Models of Computation for Embedded Systems. In The Industrial Information Technology Handbook, Richard Zurawski, Ed., Section VI — Real Time and Embedded Systems, chapter 83, CRC Press, Boca Raton, FL, 2005. With permission.)
As tokens can carry data, it is necessary to add inscriptions to the arcs in order to properly select the right tokens to involve in a specific transition firing. Taking transition GO of Figure 5.29 as an example, we need three tokens to enable the transition to fire, obtained through an adequate binding of the variables i, j, and k with data from the tokens present at the incoming place. On transition A[i], the inscription on the incoming and outgoing arcs ([i]) simply means that no transformation is performed on the token attributes. For some classes of high-level nets, it is also possible to add, within those arc inscriptions, references to functions that transform token attributes. This is the case for Colored Petri nets [42]. In this way, data processing capabilities can also be embedded in the model.
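A high-level transition firing with variable binding, in the spirit of the GO transition of Figure 5.29, can be sketched as follows. This is our own simplified illustration; real Colored Petri net tools implement far more general binding and inscription evaluation.

place = [1, 2, 3]                     # colored tokens: the three car numbers

def fire_go(tokens):
    """Fire GO under one binding of i, j, k to three distinct tokens."""
    if len(tokens) < 3:
        return None                   # not enough tokens: GO is not enabled
    i, j, k = tokens[0], tokens[1], tokens[2]   # one possible binding
    remaining = tokens[3:]            # the three bound tokens are consumed
    produced = [i, j, k]              # [i]-style inscription: unchanged
    return remaining, produced

print(fire_go(place))                 # ([], [1, 2, 3])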
5.4.9 Discrete Event
In a discrete event MoC, events are sorted by a time stamp stating when they occurred and are analyzed in chronological order. A discrete event simulator usually maintains a global event queue. The operational semantics of a discrete event system consists of the sequential firing of processes according to the chronological time stamps of the queued events. Each queued event is removed from the queue, and the processes that have it as an input are fired. New events produced by these firings are added to the event queue according to their time tags. Like most digital hardware simulation engines, hardware description languages, such as VHDL and Verilog, use a discrete-event-based simulation approach. Simultaneous events, especially those originated by zero-time delays, can cause problems, due to the ambiguity resulting from equal time tags (such events cannot be ordered). Figure 5.30 illustrates one simple situation.
Consider that X, Y, and Z represent processing modules/processes/components, and that module Y has zero-time delay, so it produces output events with the same time stamp as its input events. If module X produces an event with time stamp t (Figure 5.30[a]), both modules Y and Z will receive that event; there is ambiguity about whether Y or Z should be executed first. If Z is executed first, the event present at its input will be consumed; afterwards Y will be executed and will produce an event that will be consumed in turn by Z. On the other hand, if Y is executed first, an event will be produced with the same time stamp, because module Y has zero-time delay (Figure 5.30[b]). At this point, the execution of Z can be accomplished in two ways: (1) taking both events in the same invocation step, or (2) taking one of the events first and the other afterwards (and it is also not clear which should be processed first). In order to enable a coherent system simulation (producing the same result whether Y or Z is executed first), the concept of delta delay is introduced. Events with the same time stamp are ordered according to their delta values. Now, the event produced by module Y will have a time stamp of t + ∆ (Figure 5.30[c]), resulting in an execution of Z that first consumes the event originated by X, with time stamp t, and afterwards (in the next delta step) consumes the event originated by Y, with time stamp t + ∆ (Figure 5.30[d]). If a different zero-time delay process receives the event with time stamp t + ∆, it will generate a further event with time stamp t + 2∆. All events with the same time tag and delta tag will be consumed at the same time. This means that a two-level model of time has been introduced: on top of the totally ordered events, each time instant can be further decomposed into (totally ordered) delta steps applicable to the generated events. However, in the simulated time reported to the user, no delta information is included. In this sense, delta steps do not relate to time as perceived by the observer. They are simply a mechanism to compute the behavior of the system at a point in time. Similar mechanisms can also be used for synchronous/reactive models. These are analyzed in the next section.
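A discrete-event kernel with delta steps fits in a few lines. The sketch below is ours (real simulators, such as VHDL kernels, are far more elaborate): events are ordered by the pair (time stamp, delta), so the zero-delay module Y schedules its output at the same time but in the next delta step, reproducing the behavior of Figure 5.30(c) and (d).

import heapq, itertools

queue, counter = [], itertools.count()       # counter breaks ties stably

def schedule(time, delta, module, value):
    heapq.heappush(queue, (time, delta, next(counter), module, value))

def Y(time, delta, value):                   # zero-time delay module
    schedule(time, delta + 1, "Z", "via Y")  # same time, next delta step

def Z(time, delta, value):
    print(f"Z consumes {value!r} at t={time}, delta={delta}")

handlers = {"Y": Y, "Z": Z}
schedule(0, 0, "Y", "from X")                # X emits an event at t = 0
schedule(0, 0, "Z", "from X")
while queue:
    time, delta, _, module, value = heapq.heappop(queue)
    handlers[module](time, delta, value)
# Z consumes X's event at delta 0, then Y's event at delta 1.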
5.4.10 Synchronous/Reactive Models
Reactive systems are those that interact continuously with their environment; David Harel and Amir Pnueli coined the name in the early 1980s. In the synchronous MoC, all events are synchronous: the events of one signal have tags equal to those of corresponding events in the other signals, and the tags are totally ordered. In this computational model, the output responses are produced simultaneously with the input stimuli. Conceptually, synchrony means that reactions take no time, so there is no observable time delay in a reaction (outputs become available as inputs become available). Unlike in the discrete event model, all signals have events at all clock ticks, which simplifies the simulator, as no sorting is needed. Simulators that exploit this characteristic are called cycle based (as opposed to the so-called discrete-event simulators). A cycle is the processing of all the events at a given clock tick. Cycle-based simulation is excellent for clocked synchronous circuits. Some applications use a multirate cycle-based model, where every nth event in one signal aligns with the events in another signal. Synchronous languages [43,44], such as Esterel [45], Lustre [46], and Signal [47], use the synchronous/reactive MoC. The Statecharts graphical formalism belongs to the same language family and is sometimes referred to as quasi-synchronous; Statecharts are presented in a separate section. Esterel is a language for describing a collection of interacting synchronous FSMs, supporting the description of concurrent and sequential statements:
• S1;S2 represents the sequential execution of S1 followed by S2.
• S1||S2 represents the parallel execution of S1 and S2.
• [S1||S2] executes until both S1 and S2 terminate.
The synchronized modules communicate through signals, using an instantaneous broadcast mechanism. As an introductory example, we consider an FSM that implements the following behavior [48]: "Emit an output O as soon as two inputs A and B have occurred. Reset this behavior each time the input R occurs."
FIGURE 5.31 A simple Mealy state machine.
Figure 5.31 presents the associated state diagram. The respective Esterel representation is presented below:

module SMSM:
input A, B, R;
output O;
loop
  [ await A || await B ];
  emit O
each R
end module
Waiting for the occurrences of A and B is accomplished through "[await A || await B]." The "emit O" statement emits the signal O and terminates at the instant it starts. Thus, O is emitted exactly at the instant at which the last of A and B occurs. The handling of the reset condition is accomplished by the loop "loop p each R." When the loop starts, its body p runs freely until the occurrence of R. At that time, the body is aborted and immediately restarted. If the body terminates before the occurrence of R, one simply waits for R to restart the body. It has to be noted that Esterel code grows linearly with the size of the specification (while FSM complexity grows exponentially). For instance, if we want to wait for the occurrence of three events, A, B, and C, the code change is minimal: "[await A || await B || await C]." Esterel is a control-flow-oriented language. Lustre, on the other hand, is a language that adopts a data-oriented flavor: it follows a synchronous dataflow approach and supports a multirate clocking mechanism. In addition to the ability to translate these languages into finite state descriptions, it is possible to compile them directly into hardware (execution of these languages in software is made possible through simulation of the generated circuit).
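For comparison, the same behavior can be emulated in a cycle-based style, where one reaction function is evaluated per clock tick over the set of signals present at that tick. The Python rendering below is only our illustration of the intended semantics (R preempts the body; after O the body has terminated and waits for R); it is not how Esterel compilers work.

def make_smsm():
    state = {"A": False, "B": False, "done": False}
    def react(inputs):                      # one synchronous reaction (tick)
        if "R" in inputs:                   # R aborts and restarts the body
            state.update(A=False, B=False, done=False)
            return set()
        if state["done"]:                   # body terminated: wait for R
            return set()
        state["A"] |= "A" in inputs
        state["B"] |= "B" in inputs
        if state["A"] and state["B"]:
            state["done"] = True
            return {"O"}                    # O emitted in the same instant
        return set()
    return react

react = make_smsm()
for tick in [{"A"}, {"B"}, {"A"}, {"R"}, {"A", "B"}]:
    print(react(tick))   # set(), {'O'}, set(), set(), {'O'}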
5.4.11 Dataflow Models
Dataflow models are graphs where nodes represent operations (also called actors) and arcs represent datapaths (also called channels). These datapaths contain totally ordered sequences of data (also called tokens). Dataflow is a distributed modeling formalism, in the sense that there is no single locus of control. Yet, the conventional dataflow model is not suitable for representing the control part of a system (beyond what is implied by the graph structure). Regarding time handling, dataflow systems are asynchronous and concurrent. Dataflow graphs have been extensively used in modeling data-dominated systems, namely digital signal-processing applications and applications dealing with streams of periodic/regular data samples. Computationally intensive systems, carrying out complex data transformations, can be conveniently represented by a directed graph where the nodes represent computations (to be coded in a suitable programming language) and the arcs represent the order in which the computations are performed. This is the case for signal-processing algorithms, for example, encoding/decoding, filtering, convolution, compression, etc., which are expressed as block diagrams and coded in a specific programming language, such as C. Several dataflow graph models have been described in the literature, each one with its specific semantics, namely:
• Kahn Process Networks [49].
• Dataflow Process Networks [50].
• Synchronous dataflow graphs [51].
In Kahn Process Networks, and also in Dataflow Process Networks (which constitute a special case of Kahn Process Networks), processes communicate by passing data tokens through unidirectional unbounded FIFO channels. A process can fire according to a set of firing rules. During each atomic firing, the actor consumes tokens from input channels, executes some behavior written in a selected programming language, and produces tokens that are put on output channels. Writing to a channel is nonblocking, while reading from a channel blocks the process until there is sufficient data in it. Kahn Process Networks have guaranteed deterministic behavior: for a certain sequence of inputs, there is only one possible sequence of outputs. This is independent of the computation and communication durations, and also of the actors' firing order. The analysis of a dataflow process network can only be based on graph inspection. The possibility of choosing the actors' firing order allows very efficient implementations. Hence, it is frequent for signal-processing applications to aim at scheduling the actor firings at compile time. This results in an interleaved implementation of the concurrent model, represented by a list of processes to be executed, allowing its implementation on a single-processor architecture. Accordingly, for multiprocessor architectures, a per-processor list is obtained. Kahn Process Networks cannot be scheduled statically, as their firing rules do not allow us to build, at compile time, a firing sequence such that the system does not block under any circumstances. For this modeling formalism, one must use dynamic scheduling (with the associated implementation overhead). Another disadvantage of Kahn Process Networks is associated with the unbounded FIFO buffers and the potential growth of memory needs. Therefore, Kahn Process Networks cannot be efficiently implemented without imposing some limitations.
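The Kahn semantics (blocking reads and nonblocking writes on unbounded FIFOs) map naturally onto threads and queues. The sketch below is our own illustration: Python's queue.Queue is unbounded by default, so put never blocks, while get blocks until a token is available; None is used as an end-of-stream marker purely by our own convention.

import threading, queue

def producer(out):
    for i in range(3):
        out.put(i)                          # nonblocking write to the channel
    out.put(None)                           # end-of-stream marker

def doubler(inp, out):
    while (tok := inp.get()) is not None:   # blocking read from the channel
        out.put(2 * tok)
    out.put(None)

def consumer(inp):
    while (tok := inp.get()) is not None:
        print(tok)

c1, c2 = queue.Queue(), queue.Queue()       # unbounded FIFO channels
threads = [threading.Thread(target=producer, args=(c1,)),
           threading.Thread(target=doubler, args=(c1, c2)),
           threading.Thread(target=consumer, args=(c2,))]
for t in threads: t.start()
for t in threads: t.join()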
Synchronous dataflow graphs are a kind of Kahn Process Network with additional restrictions, namely:
• At each firing, a process consumes and produces a fixed amount of tokens on its incoming and outgoing channels, respectively.
• For a process to fire, it must have at least as many tokens on its input channels as it has to consume.
• Arcs are marked with the number of tokens produced or consumed.
Figure 5.32 presents the basic firing rule. Figure 5.33 shows a simple dataflow model and an associated static scheduling list. This schedule is a list of firings that can be repeated indefinitely. One cycle through the schedule should return the graph to its original state (by state, we mean the number of tokens on every arc).
FIGURE 5.32 Basic firing rule.
FIGURE 5.33 Simple dataflow model and associated static schedule list. (From Luís Gomes and João Paulo Barros, Models of Computation for Embedded Systems. In The Industrial Information Technology Handbook, Richard Zurawski, Ed., Section VI — Real Time and Embedded Systems, chapter 83, CRC Press, Boca Raton, FL, 2005. With permission.)
This static schedule can be determined through an analysis of the graph, in order to find firing sequences that produce and consume, on each arc, the same amount of tokens. One balance equation is built per arc, considering that

outgoing_weight ∗ origin_node = incoming_weight ∗ destination_node

For the referred example, the following balance equations are obtained (one per arc):

2a − 4b = 0
b − 2c = 0
2c − d = 0
2b − 2d = 0
2d − a = 0

The smallest nonzero integer solution of these equations is (a = 4, b = 2, c = 1, d = 2), which gives the firing vector associated with the static scheduling list. In other cases, it is possible that the balance equations have no nonzero solution, implying that no static schedule is possible.
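The firing (repetition) vector can be computed mechanically by propagating token rates over the arcs with rational arithmetic and then scaling to the smallest integers. The sketch below is ours and solves exactly the system given above; a full implementation would also verify that every arc's balance equation holds for the result, reporting that no static schedule exists when it does not. It requires Python 3.9 or later for math.lcm.

from fractions import Fraction
from math import lcm

# Arcs as (producer, tokens_out, consumer, tokens_in), one per equation above.
arcs = [("A", 2, "B", 4), ("B", 1, "C", 2), ("C", 2, "D", 1),
        ("B", 2, "D", 2), ("D", 2, "A", 1)]

rate = {"A": Fraction(1)}               # fix one actor's rate, then propagate
changed = True
while changed:
    changed = False
    for src, w_out, dst, w_in in arcs:
        for known, other, ratio in ((src, dst, Fraction(w_out, w_in)),
                                    (dst, src, Fraction(w_in, w_out))):
            if known in rate and other not in rate:
                rate[other] = rate[known] * ratio
                changed = True

scale = lcm(*(r.denominator for r in rate.values()))
print({a: int(r * scale) for a, r in rate.items()})
# {'A': 4, 'B': 2, 'C': 1, 'D': 2}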
FIGURE 5.34 Dynamic dataflow actors and dataflow firings. (From Luís Gomes and João Paulo Barros, Models of Computation for Embedded Systems. In The Industrial Information Technology Handbook, Richard Zurawski, Ed., Section VI — Real Time and Embedded Systems, chapter 83, CRC Press, Boca Raton, FL, 2005. With permission.)
Many variants of dataflow models have been explored. Especially interesting among them is an extension to synchronous dataflow networks allowing dynamic dataflow actors [52]; dynamic actors are actors with at least one conditional input or output port. The canonical dynamic actors are "switch" and "select," enabling the testing of tokens on one specific incoming arc, and the consumption and production of tokens in a data-dependent way. A "switch" actor, for example, has one regular incoming arc, one control incoming arc, and two outgoing arcs; whenever tokens are present on the incoming arcs, the value carried by the token on the control arc dynamically selects the active outgoing arc: the actor's firing produces a single token on the selected outgoing arc. Figure 5.34 illustrates dynamic dataflow actor behavior before and after firing. It has to be noted that control ports are never conditional ports. More complex dynamic actors can also be used, namely dynamic actors with integer control (instead of Boolean control), originating the CASE and ENDCASE actors (generalizations of the SWITCH and SELECT actors, respectively). Aggregations of the referred actors using specific patterns are commonly applied. For instance, integer-controlled dataflow graphs can result from cascading a CASE actor, a set of mutually exclusive processing nodes, and an ENDCASE actor. The control ports of the CASE and ENDCASE actors receive the same value; only one of the processing nodes is executed at a time, depending on the value presented at the control port. Most simulation environments targeted at signal processing, such as Matlab-Simulink or Khoros (for image processing), use dataflow models.
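The two canonical dynamic actors are easy to state operationally. In this sketch (our own, for illustration), a SWITCH routes its data token to one of two outputs depending on the Boolean control token, and a SELECT consumes one token from the branch chosen by its control token:

def switch(data_token, control_token):
    """Route the data token to the T or F output (a pair of lists)."""
    t_out, f_out = [], []
    (t_out if control_token else f_out).append(data_token)
    return t_out, f_out

def select(t_in, f_in, control_token):
    """Consume one token from the branch chosen by the control token."""
    return (t_in if control_token else f_in).pop(0)

t_out, f_out = switch("x", True)      # token "x" leaves on the T output
print(t_out, f_out)                   # ['x'] []
print(select(["a"], ["b"], False))    # 'b'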
5.5 Conclusions
As system complexity increases, the modeling formalism used for a system's specification becomes more important. It is through the usage of a formal MoC that the designer can simulate and analyze the behavior
of the system, and anticipate, impose, or avoid several properties and features. This chapter presented a set of selected modeling formalisms, also called MoCs, addressing embedded system design. Due to the different types of embedded systems applications, some of the formalisms emphasize the reactive (control) part of the system, while others emphasize the data processing modeling capability. The final goal, and the ultimate challenge for the designer, is to choose the right formalism for the system at hand, allowing the design of more robust and reliable systems, obtained in less development time, with easier maintainability and lower cost.
References
[1] L. Lavagno, A. Sangiovanni-Vincentelli, and E. Sentovich, Models of Computation for Embedded System Design. NATO ASI Proceedings on System Synthesis, Il Ciocco, Italy, 1998.
[2] M. Sgroi, L. Lavagno, and A. Sangiovanni-Vincentelli, Formal Models for Embedded System Design. IEEE Design and Test of Computers, 17: 14–17, 2000.
[3] Stephen Edwards, Luciano Lavagno, Edward A. Lee, and A. Sangiovanni-Vincentelli, Design of Embedded Systems: Formal Models, Validation, and Synthesis. Proceedings of the IEEE, 85: 366–390, 1997.
[4] Frank Vahid and Tony Givargis, Embedded System Design — A Unified Hardware/Software Introduction. John Wiley & Sons, Inc., New York, 2002.
[5] Luís Gomes and João Paulo Barros, Models of Computation for Embedded Systems. In The Industrial Information Technology Handbook, Richard Zurawski, Ed., Section VI — Real Time and Embedded Systems, chapter 83, CRC Press, Boca Raton, FL, 2005.
[6] Edsger W. Dijkstra, On the Economy of Doing Mathematics. In Mathematics of Program Construction, Second International Conference, Oxford, UK. Lecture Notes in Computer Science, Vol. 669, R.S. Bird, C.C. Morgan, and J.C.P. Woodcock, Eds., Springer-Verlag, Heidelberg, 1993, pp. 2–10.
[7] David Harel, On Visual Formalisms. Communications of the ACM, 31: 514–530, 1988.
[8] D. Harel and B. Rumpe, Meaningful Modeling: What's the Semantics of "Semantics"? Computer, 37(10): 64–72, October 2004.
[9] J. Buck, S. Ha, E.A. Lee, and D.G. Messerschmitt, Ptolemy: A Framework for Simulating and Prototyping Heterogeneous Systems. International Journal of Computer Simulation, Special Issue on Simulation Software Development, 4: 155–182, 1994.
[10] Edward A. Lee and A. Sangiovanni-Vincentelli, Comparing Models of Computation. In Proceedings of ICCAD'96, pp. 234–241, 1996.
[11] L. Lamport, Time, Clocks, and the Ordering of Events in a Distributed System. Communications of the ACM, 21: 558–565, 1978.
[12] Wolfgang Reisig, Petri Nets: An Introduction. Springer-Verlag, New York, 1985.
[13] E.F. Moore, Gedanken-Experiments on Sequential Machines. In Automata Studies, Princeton University Press, Princeton, NJ, 1956.
[14] G.H. Mealy, A Method for Synthesizing Sequential Circuits. Bell System Technical Journal, 34: 1045–1079, 1955.
[15] Manuel Silva, Las Redes de Petri: en la Automática y la Informática. Editorial AC, Madrid, 1985.
[16] Daniel Gajski, Nikil Dutt, Allen Wu, and Steve Lin, High-Level Synthesis — Introduction to Chip and System Design. Kluwer Academic Publishers, Dordrecht, 1992.
[17] David Harel, Statecharts: A Visual Formalism for Complex Systems. Science of Computer Programming, 8: 231–274, 1987.
[18] Bruce Powel Douglass, Doing Hard Time — Developing Real-Time Systems with UML, Objects, Frameworks, and Patterns. Object Technology Series, Addison-Wesley, Reading, MA, 1999.
[19] Bruce Powel Douglass, Real-Time UML — Developing Efficient Objects for Embedded Systems. Object Technology Series, Addison-Wesley, Reading, MA, 1998.
[20] Grady Booch, James Rumbaugh, and Ivar Jacobson, The Unified Modeling Language User Guide. Object Technology Series, Addison-Wesley, Reading, MA, 1999.
[21] David Harel and Michal Politi, Modeling Reactive Systems with Statecharts — The STATEMATE Approach. McGraw-Hill, New York, 1998.
[22] David Harel and Amnon Naamad, The STATEMATE Semantics of Statecharts. ACM Transactions on Software Engineering and Methodology (TOSEM), 5: 293–333, 1996.
[23] Michael von der Beeck, A Comparison of Statecharts Variants. In Formal Techniques in Real-Time and Fault-Tolerant Systems, Lecture Notes in Computer Science, Vol. 863, Hans Langmaack, Willem P. de Roever, and Jan Vytopil, Eds., Springer-Verlag, Heidelberg, 1994, pp. 128–148.
[24] F. Vahid, S. Narayan, and D.D. Gajski, SpecCharts: A VHDL Frontend for Embedded Systems. IEEE Transactions on Computer-Aided Design, 14: 694–706, 1995.
[25] http://www.specc.org, 2004.
[26] M. Chiodo, P. Giusto, H. Hsieh, A. Jurecska, L. Lavagno, and A. Sangiovanni-Vincentelli, Hardware/Software Codesign of Embedded Systems. IEEE Micro, 14: 26–36, 1994.
[27] F. Balarin, M. Chiodo, P. Giusto, H. Hsieh, A. Jurecska, L. Lavagno, C. Passerone, A. Sangiovanni-Vincentelli, E. Sentovich, K. Suzuki, and B. Tabbara, Hardware–Software Co-Design of Embedded Systems — The POLIS Approach. Kluwer Academic Publishers, Dordrecht, 1997.
[28] R.K. Brayton, M. Chiodo, R. Hojati, T. Kam, K. Kodandapani, R.P. Kurshan, S. Malik, A. Sangiovanni-Vincentelli, E.M. Sentovich, T. Shiple, and H.Y. Wang, BLIF-MV: An Interchange Format for Design Verification and Synthesis. Technical Report UCB/ERL M91/97, U.C. Berkeley, November 1991.
[29] ITU-T, Z.100 Specification and Description Language (SDL). ITU, 2002.
[30] Laurent Doldi, Validation of Communications Systems with SDL: The Art of SDL Simulation and Reachability Analysis. John Wiley & Sons, New York, 2003.
[31] OMG, UML™ Resource Page, http://www.uml.org/, 2004.
[32] Rick Reed, Notes on SDL-2000 for the New Millennium. Computer Networks, 35: 709–720, 2001.
[33] SDL Forum Society, http://www.sdl-forum.org/, 2004.
[34] ITU-T, Z.120 Message Sequence Chart (MSC). ITU, 1999.
[35] E. Rudolph, J. Grabowski, and P. Graubmann, Tutorial on Message Sequence Charts (MSC'96). In Tutorials of the First Joint International Conference on Formal Description Techniques for Distributed Systems and Communication Protocols, and Protocol Specification, Testing, and Verification (FORTE/PSTV'96), Kaiserslautern, Germany, October 1996.
[36] Carl Adam Petri, Kommunikation mit Automaten. Ph.D. thesis, University of Bonn, Bonn, West Germany, 1962.
[37] Jörg Desel and Wolfgang Reisig, Place/Transition Petri Nets. In Lectures on Petri Nets I: Basic Models, Lecture Notes in Computer Science, Vol. 1491, Springer-Verlag, 1998, pp. 122–173.
[38] Eike Best, Design Methods Based on Nets: Esprit Basic Research Action DEMON. In Advances in Petri Nets 1989, Lecture Notes in Computer Science, Vol. 424, G. Rozenberg, Ed., Springer-Verlag, Berlin, Germany, 1990, pp. 487–506.
[39] J. Desel and G. Juhás, What is a Petri Net? Informal Answers for the Informed Reader. In Unifying Petri Nets, Lecture Notes in Computer Science, Vol. 2128, H. Ehrig, G. Juhás, J. Padberg, and G. Rozenberg, Eds., Springer-Verlag, Heidelberg, 2001, pp. 122–173.
[40] Luís Gomes and João-Paulo Barros, On Structuring Mechanisms for Petri Nets Based System Design. In IEEE Conference on Emerging Technologies and Factory Automation Proceedings (ETFA'2003), Lisbon, Portugal, September 16–19, 2003.
[41] Gul Agha, Fiorella de Cindio, and Grzegorz Rozenberg, Eds., Concurrent Object-Oriented Programming and Petri Nets, Advances in Petri Nets. Lecture Notes in Computer Science, Vol. 2001, Springer-Verlag, Heidelberg, 2001.
[42] Kurt Jensen, Coloured Petri Nets. Basic Concepts, Analysis Methods and Practical Use, Vol. 1–3. Monographs in Theoretical Computer Science, An EATCS Series, Springer-Verlag, Berlin, Germany, 1992–1997.
[43] N. Halbwachs, Synchronous Programming of Reactive Systems. Kluwer Academic Publishers, Dordrecht, 1993.
[44] A. Benveniste and G. Berry, The Synchronous Approach to Reactive and Real-Time Systems. Proceedings of the IEEE, 79: 1270–1282, 1991.
[45] F. Boussinot and R. De Simone, The ESTEREL Language. Proceedings of the IEEE, 79: 1293–1304, 1991.
[46] N. Halbwachs, P. Caspi, P. Raymond, and D. Pilaud, The Synchronous Data Flow Programming Language LUSTRE. Proceedings of the IEEE, 79: 1305–1319, 1991.
[47] A. Benveniste and P. Le Guernic, Hybrid Dynamical Systems Theory and the SIGNAL Language. IEEE Transactions on Automatic Control, 35: 525–546, 1990.
[48] G. Berry, The Esterel v5 Language Primer, 2000. Available from http://www-sop.inria.fr/esterel.org.
[49] G. Kahn, The Semantics of a Simple Language for Parallel Programming. In Proceedings of the IFIP Congress 74, North-Holland Publishing Co., Amsterdam, 1974.
[50] E.A. Lee and T.M. Parks, Dataflow Process Networks. Proceedings of the IEEE, 83: 773–801, 1995.
[51] E.A. Lee and D.G. Messerschmitt, Synchronous Data Flow. Proceedings of the IEEE, 75: 1235–1245, 1987.
[52] E.A. Lee, Consistency in Dataflow Graphs. IEEE Transactions on Parallel and Distributed Systems, 2: 223–235, 1991.
[53] P. Huber, K. Jensen, and R.M. Shapiro, Hierarchies in Coloured Petri Nets. In Proceedings of the 10th International Conference on Applications and Theory of Petri Nets, Bonn, Springer-Verlag, London, 1991, pp. 313–341.
6 System Validation

J.V. Kapitonova, A.A. Letichevsky, and V.A. Volkov
National Academy of Science of Ukraine

Thomas Weigert
Motorola

6.1 Introduction
6.2 Mathematical Models of Embedded Systems
    Transition Systems • Agents • Environments • Classical Theories of Concurrency
6.3 Requirements Capture and Validation
    Approaches to Requirements Validation • Tools for Requirements Validation
6.4 Specifying and Verifying Embedded Systems
    System Descriptions and Initial Requirements • Static Requirements • Dynamic Requirements • Example: Railroad Crossing Problem • Requirement Specifications • Reasoning about Embedded Systems • Consistency and Completeness
6.5 Examples and Results
    Example: Embedded Operating System • Experimental Results in Various Domains
6.6 Conclusions and Perspectives
References
6.1 Introduction
Toward the end of the 1960s, system designers and software engineers faced what was then termed the "software crisis." This crisis was the direct outcome of the introduction of a new generation of computer hardware. The new machines were substantially more powerful than the hardware available until then, making large applications and software systems feasible. The strategies and skills employed in building software for the new systems did not match the new capabilities provided by the enhanced hardware. The results were delayed projects, sometimes for years, considerable cost overruns, and unreliable applications with poor performance. The need arose for new techniques and methodologies to implement large software systems. The now classic "waterfall" software life-cycle model was then introduced to meet these needs. The 1960s have long gone by, but the software crisis still remains. In fact, the situation has worsened — the implementation disasters of the 1960s are being succeeded by design disasters. Software systems have reached levels of complexity at which it is extremely difficult to arrive at complete, or even consistent, specifications, and it is nearly impossible to know all the implications of one's requirement decisions. Further, the availability of hardware and software components may change during the course of the development of a system, forcing a change in the requirements. The customer may be unsure of the requirements altogether. The situation is even worse for embedded systems: the real-time and distributed aspects of such systems impose additional design difficulties and introduce the possibility of concurrency
pathologies such as deadlock or livelock, resulting from unforeseen interactions of independently executing system components. A number of new methodologies, such as rapid prototyping, executable specifications, and transformational implementation, have been introduced to address these problems in order to arrive at shorter cycle times and increased quality of the developed systems. Although each of these methodologies addresses different concerns, they share the underlying assumption that verification and validation be performed as close to the customer requirements as possible. While verification tries to ensure that the system is built "right," that is, without defects, validation attempts to ensure that the "right" system is developed, that is, a system that matches what the customer actually wants. The customer's needs are captured in the system requirements. Many studies have demonstrated that errors in system requirements are the most expensive, as they are typically discovered late, when one first interacts with the system; in the worst case, such errors can force complete redevelopment of the system. In this chapter, we examine techniques aimed at discovering the unforeseen consequences of requirements as well as omissions in requirements. Requirements should be consistent and complete. Roughly speaking, consistency means the existence of an implementation that meets the requirements; completeness means that the implementation (its function or behavior) is defined uniquely by the requirements. To validate a system is to establish that the system requirements are consistent and complete. Embedded systems [1–3] consist of several components that are designed to interact with one another and with their environment. In contrast to functional systems, which are specified as functions from input to output values, an embedded system is defined by its properties. A property is a set of desired behaviors that the system should possess. In Section 6.2, we present a mathematical model of embedded systems. Labeled transition systems, representing the environment and agents inserted into this environment by means of a continuous insertion function, are used for representing system requirements at all levels of detail. Section 6.3 presents a survey of freely available systems that can be used to validate embedded systems, as well as references to commercially available systems. In Section 6.4, we present a notation to describe the requirements of embedded systems. Two kinds of requirements are distinguished: static requirements define the properties of system and environment states and the insertion function of the environment; dynamic requirements define the properties of histories and the behavior of system and environment. Hoare-style triples are used to formulate static requirements; logical formulae with temporal constraints are used to formulate dynamic requirements. To define transition systems with a complex structure of states, we rely on attributed transition systems, which allow us to split the definition of a transition relation into a definition of transitions on a set of attributes and general transition rules for the entire environment states. We also present a tool for reasoning about embedded systems and discuss more formally the consistency and completeness conditions for a set of requirements. Our approach does not require the developers to build software prototypes, which are traditionally used for checking the consistency of a system under development.
Instead, one develops formal specifications and uses proofs to determine the consistency of the specification. Finally, Section 6.5 presents the specification of a simple scheduler as a case study and reports the results of applying these techniques to systems in various application domains.
We have observed the following time distribution in the software development cycle: 40% of the cycle time is spent on requirements capture, 20% on coding, and 40% on testing. Requirements capture includes not only the development of requirements but also their correction and refinement during the entire development cycle. According to Brooks [4], 15% of the development effort is spent on validation, that is, on ensuring that the system requirements are correct. Therefore, improving validation has a significant impact on development time, even for successful requirements specifications. For failed requirements that force major system redevelopment, the impact is obviously much higher.
6.2 Mathematical Models of Embedded Systems

In the embedded domain, the main properties of the systems concern the interaction of components with each other and with the environment. The primary mathematical notion used to formally represent interacting and communicating systems is that of a labeled transition system. When formulating requirements or developing high-level specifications we are not interested in the internal structure of the states of a system and consider these states, and therefore also the systems, as identical up to some equivalence. The abstraction afforded by this equivalence leads to the general notion of an agent and its behavior. Agents exist in some environment, and an explicit definition of the interaction of agents and environments, in terms of a function that embeds the agent into this environment, helps to specialize the mathematical models to the particular characteristics of the subject domain.
6.2.1 Transition Systems

The most general abstract model of software and hardware systems, which evolve in time and change states in a discrete way, is that of a discrete dynamic system. It is defined as a set of states and a set of histories, describing the evolution of a system in time (either discrete or continuous). As a special case, a labeled transition system over the set of actions A is a set S of states together with a transition relation T ⊆ S × A × S. If (s, a, s′) ∈ T, we usually write this as s −a→ s′ and say that the system moves from the state s to the state s′ while performing the action a. (Sometimes the term "event" is used instead of "action.") An automaton is a more special case, where the set of actions is the set of input/output values. Continuity of time, if necessary, can be introduced by supplying actions with a duration, that is, by considering complex actions (a, t), where a is a discrete component of an action (its content) and t is a real number representing the duration of a. In timed automata, duration is defined nondeterministically, and intervals of possible durations are used instead of specific moments in time.

Transition systems separate the observable part of a system, which is represented by actions, from the hidden part, which is represented by states. Actions performed by a system are observable by an external observer and by other systems, which can communicate with the given system, synchronizing their actions and combining their behaviors. The internal states of a system are not observable; they are hidden. Therefore, the representation of states can be ignored when considering the external behavior of a system. The activity of a system can be described by its history, which is a sequence of transitions beginning from an initial state:

s0 −a1→ s1 −a2→ · · · sn −an+1→ sn+1 · · ·

A history can be finite or infinite. Each history has an observable part (the sequence of actions a1, a2, . . . , an, . . .) and a hidden part (the sequence of states). The former is called a trace generated by the initial state s0 (in Reference 5, the term behavior is used instead of trace). Two states are called trace-equivalent if the sets of all traces generated by these states coincide.

A final history cannot be continued: either it is infinite, or for the last state sn in the sequence there are no transitions sn −a→ sn+1 from this state; such a state is called a final state. We distinguish final states representing successful termination from deadlock states (states where one part of a system is waiting for an event caused by another part and the latter is waiting for an event caused by the former) and from divergent or undefined states. Such states can be defined later or constitute livelocks (states that contain hidden infinite loops or infinite recursive unfolding without observable actions).

Transition systems can be nondeterministic: a system can move from a given state s into different states while performing the same action a. A labeled transition system (without hidden transitions) is deterministic if for arbitrary transitions s −a→ s′ and s −a→ s″ it follows that s′ = s″, and there are no states representing both successful termination and divergence.

To define transition systems with a complex structure of states we rely on attributed transition systems. If e is a state of an environment and f is an attribute of this environment, then the value of this attribute is denoted by e · f. We represent a state of an environment with attributes f1, . . . , fn as an object with public (observable) attributes f1 : t1, . . . , fn : tn, where t1, . . . , tn are types, and some hidden private part.
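The following fragment is a minimal sketch, in Python, of a labeled transition system and the traces it generates. The state and action names are illustrative inventions, not part of the chapter's formal notation.

```python
# A minimal labeled transition system: T is a set of triples (s, a, s'),
# meaning the system moves from s to s' while performing action a.
T = {
    ("s0", "start", "s1"),
    ("s1", "tick",  "s1"),   # the system may tick forever
    ("s1", "stop",  "s2"),   # s2 has no outgoing transitions: a final state
}

def successors(state):
    """All (action, next_state) pairs enabled in `state`."""
    return [(a, t) for (s, a, t) in T if s == state]

def traces(state, depth):
    """Observable parts (traces) of all histories of length <= depth."""
    yield ()
    if depth == 0:
        return
    for a, t in successors(state):
        for rest in traces(t, depth - 1):
            yield (a,) + rest

print(sorted(traces("s0", 3)))
# (), ('start',), ('start', 'stop'), ('start', 'tick'), ('start', 'tick', 'stop'), ...
```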
6.2.2 Agents

Agents are objects that can be recognized as separate from the "rest of the world," that is, from other agents or the environment. They change their internal state and can interact with other agents and the environment, performing observable actions. The notion of an agent formalizes such diverse objects as software components, programs, users, clients, servers, active components of distributed systems, and so on. In mathematical terms, agents are labeled transition systems with states considered up to bisimilarity. We are not interested in the structure of the internal states of an agent but only in its observable behavior.

The notion of an agent as a transition system considered up to some equivalence has been studied extensively in concurrency theory; van Glabbeek [6] presents a survey of the different equivalence relations that have been proposed to describe concurrent systems. These theories use an algebraic representation of agent states and develop a corresponding algebra so that equivalent expressions define equivalent states. The transition relation is defined on the set of algebraic expressions by means of rewriting rules and recursive definitions. Some representations avoid the notion of a state: instead, if for some agent E a transition for action a is defined, it is said that the agent performs the action a and thus becomes another agent E′.

6.2.2.1 Behaviors

Agents with the same behavior (i.e., agents which cannot be distinguished by observing their interaction with other agents and environments) are considered equivalent. We characterize the equivalence of agents in terms of the complete continuous algebra of behaviors F(A). This algebra has two sorts of elements, behaviors u ∈ F(A), represented as finite or infinite trees, and actions a ∈ A, and two operations, prefixing and nondeterministic choice. If a is an action and u is a behavior, prefixing yields a new behavior denoted a · u. Nondeterministic choice is an associative, commutative, and idempotent binary operation over behaviors denoted u + v, where u, v ∈ F(A). The neutral element of nondeterministic choice is the deadlock element (impossible behavior) 0. The empty behavior Δ performs no actions and denotes the successful termination of an agent. The generating relations for the algebra of behaviors are as follows:

u + v = v + u
(u + v) + w = u + (v + w)
u + u = u
u + 0 = u
∅ · u = 0

where ∅ is the impossible action. Both operations are continuous functions on the set of all behaviors over A. The approximation relation ⊆ is a partial order with minimal element ⊥. Both prefixing and nondeterministic choice are monotonic with respect to this approximation:

⊥ ⊆ u
u ⊆ v ⇒ u + w ⊆ v + w
u ⊆ v ⇒ a · u ⊆ a · v

The algebra F(A) is constructed so that prefixing and nondeterministic choice are also continuous with respect to the approximation, and it is closed relative to the limits (least upper bounds) of directed sets of finite behaviors. Thus, we can use the fixed point theorem to give a recursive definition of behaviors starting from the given behaviors.
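As a minimal illustration, finite behaviors can be encoded in Python as sets of summands, which makes the idempotence, commutativity, and neutrality laws above hold by construction. This is a sketch only; the string markers "delta" and "bot" stand in for the termination constants and are not the chapter's notation.

```python
# A behavior is a frozenset of summands; a summand is either a termination
# marker ("delta" = successful termination, "bot" = divergence) or a pair
# (action, subbehavior) standing for the prefix a.u.  Choice is set union,
# so it is associative, commutative, and idempotent, and 0 (deadlock) is
# the empty set -- exactly the generating relations above.
ZERO = frozenset()
DELTA = frozenset({"delta"})
BOT = frozenset({"bot"})

def prefix(action, u):
    """a . u  (prefixing)."""
    return frozenset({(action, u)})

def choice(*behaviors):
    """u + v + ...  (nondeterministic choice as union of summands)."""
    return frozenset().union(*behaviors)

# tick.(tick.BOT + stop.DELTA) + stop.DELTA
u2 = choice(prefix("tick", choice(prefix("tick", BOT), prefix("stop", DELTA))),
            prefix("stop", DELTA))

# u + u = u and u + 0 = u hold by construction:
assert choice(u2, u2) == u2 and choice(u2, ZERO) == u2
```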
Finite elements are generated by three termination constants: Δ (successful termination), ⊥ (the minimal element of the approximation relation), and 0 (deadlock).

F(A) can be considered as a transition system with the transition relation defined by u −a→ v if u can be represented in the form u = a · v + u′. The terminal states are those that can be represented in the form u + Δ; divergent states are those that can be represented in the form u + ⊥. In algebraic terms we can say that u is terminal (divergent) iff u = u + Δ (u = u + ⊥), which follows from the idempotence of nondeterministic choice. Thus, behaviors can be considered as states of a transition system. Let beh(s) denote the behavior of an agent in a state s; then the behavior of an agent in state s can be represented as the solution us ∈ F(A) of the system

us = Σ_{s −a→ t} a · ut + εs        (6.1)

where εs = 0 if s is neither terminal nor divergent, εs = Δ if s is terminal but not divergent, εs = ⊥ for divergent but not terminal states, and εs = Δ + ⊥ for states which are both terminal and divergent. If all summands in the representation (6.1) are different, then this representation is unique up to associativity and commutativity of nondeterministic choice.

As an example, consider the behavior u defined as u = tick · u. This behavior models a clock that never terminates. It can be represented by a transition system with only one state u, which generates the infinite history

u −tick→ u −tick→ · · ·

The infinite tree with only one path representing this behavior can be obtained as the limit of the sequence of finite approximations u(0) = ⊥, u(1) = tick · ⊥, u(2) = tick · tick · ⊥, . . . .

Now consider u = tick · u + stop · Δ. This is a model of a clock which can terminate by performing the action stop, but where the number of steps to be taken before terminating is not known in advance. The transition system representing this clock has two states, one of which is a terminal state. The first two approximations of this behavior are

u(1) = tick · ⊥ + stop · Δ
u(2) = tick · (tick · ⊥ + stop · Δ) + stop · Δ

Note that the second approximation cannot be written in the form tick · tick · ⊥ + tick · stop · Δ + stop · Δ, because distributivity of choice does not hold in the behavior algebra. The behavior u = tick · u + tick · 0 describes a similar clock, but one that is terminated by deadlock rather than successfully.

6.2.2.2 Bisimilarity

Trace equivalence is too weak to capture the notion of the behavior of a transition system. Consider the systems shown in Figure 6.1. Both systems in Figure 6.1 start by performing the action a. But the system on the left-hand side has a choice at the second step to perform either action b or c. The system on the right can only perform action b and never c, or only perform c and never b, depending on what decision was made at the first step. The notion of bisimilarity [7] captures the difference between these two systems.
FIGURE 6.1 Two systems which are trace equivalent but have different behaviors.
A binary relation R ⊆ S × S on the set of states S of a transition system without terminal and divergent states is called a bisimulation if for each s and t such that (s, t) ∈ R and for each a ∈ A:

1. If s −a→ s′ then there exists t′ ∈ S such that t −a→ t′ and (s′, t′) ∈ R.
2. If t −a→ t′ then there exists s′ ∈ S such that s −a→ s′ and (s′, t′) ∈ R.

Two states s and t are called bisimilar if there exists a bisimulation relation R such that (s, t) ∈ R. Bisimilarity is an equivalence relation, and its definition is easily extended to the case when R is defined as a relation between the states of two different systems, by considering the disjoint union of their sets of states. Two transition systems are bisimilar if each state of one of them is bisimilar to some state of the other.

For systems with nontrivial sets of terminal states SΔ and divergent states S⊥, partial bisimulation is considered instead of bisimulation. A binary relation R ⊆ S × S is a partial bisimulation if for all s and t such that (s, t) ∈ R and for all a ∈ A:

1. If s ∈ SΔ then t ∈ SΔ, and if s ∉ S⊥ then t ∉ S⊥.
2. If s −a→ s′ then there exists t′ such that t −a→ t′ and (s′, t′) ∈ R.
3. If t −a→ t′ then there exists s′ such that s −a→ s′ and (s′, t′) ∈ R.

A state s of a transition system S is called a bisimilar approximation of t, denoted by s ⊆B t, if there exists a partial bisimulation R such that (s, t) ∈ R. Bisimilarity s ∼B t can then be introduced as the relation s ⊆B t ∧ t ⊆B s. For attributed transition systems, the additional requirement is that if (s, t) ∈ R, then s and t have the same attributes. A divergent state without transitions approximates arbitrary other states that are not terminal. If s approximates t and s is convergent (not divergent), then t is also convergent, s and t have transitions for the same sets of actions, and they satisfy the same conditions as for bisimulation without divergence. Otherwise, if s is divergent, the set of actions for which s has transitions is only included in the set of actions for which t has transitions, that is, s is less defined than t.

For the states of a transition system it can be proved that

s ⊆B t ⇔ beh(s) ⊆ beh(t)
s ∼B t ⇔ beh(s) = beh(t)

and, therefore, the states of an agent considered up to bisimilarity can be identified with the corresponding behaviors. If S is a set of states of an agent, then the set U = {beh(s) | s ∈ S} is the set of all its behaviors. This set is transition closed, which means that u ∈ U and u −a→ v implies v ∈ U. Therefore, U is also a transition system, equivalent to S, and can be used as a standard behavior representation of an agent.
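On a finite transition system the largest bisimulation can be computed as a greatest fixed point. The sketch below, assuming the two systems of Figure 6.1 encoded as transition triples, deletes pairs violating the transfer conditions until nothing changes; practical tools use partition refinement instead, but the result is the same relation.

```python
def bisimilar_pairs(states, trans):
    """trans: set of (s, a, t). Returns the largest bisimulation on `states`."""
    def post(s, a):
        return {t for (x, b, t) in trans if x == s and b == a}

    actions = {a for (_, a, _) in trans}
    R = {(s, t) for s in states for t in states}
    changed = True
    while changed:
        changed = False
        for (s, t) in list(R):
            ok = all(
                # every a-move of s is matched by some a-move of t, and vice versa
                all(any((s1, t1) in R for t1 in post(t, a)) for s1 in post(s, a))
                and
                all(any((s1, t1) in R for s1 in post(s, a)) for t1 in post(t, a))
                for a in actions
            )
            if not ok:
                R.discard((s, t))
                changed = True
    return R

# Figure 6.1: the left system does a, then chooses b or c; the right system
# chooses at the first step.  p0 and q0 are trace equivalent but not bisimilar.
trans = {("p0", "a", "p1"), ("p1", "b", "p2"), ("p1", "c", "p3"),
         ("q0", "a", "q1"), ("q1", "b", "q3"),
         ("q0", "a", "q2"), ("q2", "c", "q4")}
states = {"p0", "p1", "p2", "p3", "q0", "q1", "q2", "q3", "q4"}
assert ("p0", "q0") not in bisimilar_pairs(states, trans)
```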
For many applications a weaker equivalence, such as the weak bisimilarity introduced by Milner [8] or the insertion equivalence discussed in Section 6.2.3, has been considered. Note that, for deterministic systems, if two systems are trace-equivalent, they are also bisimilar.

6.2.2.3 Composition of Behaviors

Composition of behaviors is defined as an operation over agents and is expected to preserve equivalence; it can, therefore, also be defined as an operation on behaviors. The sequential composition of behaviors u and v is a new behavior denoted as (u; v) and defined by means of the following inference rule and equations:

u −a→ u′ ⟹ (u; v) −a→ (u′; v)        (6.2)
((u + Δ); v) = (u; v) + v        (6.3)
((u + ⊥); v) = (u; v) + ⊥        (6.4)
(0; u) = 0        (6.5)

We consider a transition system with states built from arbitrary behaviors over the set of actions A by means of the operations of the behavior algebra F(A) and the new operation (u; v). Expressions are considered up to the equivalence defined by the above equations (thus, the extension of the behavior algebra by this operation is conservative). The inference rule (6.2) defines a transition relation on the set of equivalence classes. From rule (6.2) and equation (6.4) it follows that (Δ; v) = v and (⊥; v) = ⊥. One can prove that (u; Δ) = u and that sequential composition is associative and distributes on the left:

((u + v); w) = (u; w) + (v; w)

Sequential composition can also be defined explicitly by the following recursive definition:

(u; v) = Σ_{u −a→ u′} a · (u′; v) + Σ_{u = u + ε} (ε; v)
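The recursive definition translates almost directly into code. The sketch below reuses the frozenset encoding of behaviors from the earlier fragment (repeated here so that it runs on its own); as before, the markers "delta" and "bot" are illustrative stand-ins for Δ and ⊥.

```python
DELTA, BOT = frozenset({"delta"}), frozenset({"bot"})
prefix = lambda a, u: frozenset({(a, u)})
choice = lambda *us: frozenset().union(*us)

def seq(u, v):
    """Sequential composition (u; v) on finite behaviors."""
    out = set()
    for summand in u:
        if summand == "delta":      # ((u + Delta); v) contributes v   -- cf. (6.3)
            out |= v
        elif summand == "bot":      # ((u + bot); v) stays divergent   -- cf. (6.4)
            out.add("bot")
        else:
            a, u1 = summand         # u -a-> u' gives (u; v) -a-> (u'; v)  -- cf. (6.2)
            out.add((a, seq(u1, v)))
    return frozenset(out)           # (0; v) = 0 falls out: the empty set stays empty

# ((a.DELTA + DELTA); v) = a.(DELTA; v) + v
v = prefix("b", DELTA)
assert seq(choice(prefix("a", DELTA), DELTA), v) == choice(prefix("a", v), v)
```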
6.2.2.3.1 Parallel Composition of Behaviors

We define an algebraic structure on the set of actions A by introducing the combination a × b of actions a and b. This operation is commutative and associative, with the impossible action ∅ as the zero element (a × ∅ = ∅). As ∅ · u = 0, there are no transitions labeled ∅. The inference rules and equations defining the parallel composition u ‖ v of behaviors u and v are

u −a→ u′, v −b→ v′, a × b ≠ ∅ ⟹ u ‖ v −a×b→ u′ ‖ v′
u −a→ u′ ⟹ u ‖ v −a→ u′ ‖ v
u −a→ u′ ⟹ u ‖ (v + Δ) −a→ u′
v −a→ v′ ⟹ u ‖ v −a→ u ‖ v′
v −a→ v′ ⟹ (u + Δ) ‖ v −a→ v′
(u + Δ) ‖ (v + Δ) = (u + Δ) ‖ (v + Δ) + Δ
(u + ⊥) ‖ v = (u + ⊥) ‖ v + ⊥
u ‖ (v + ⊥) = u ‖ (v + ⊥) + ⊥

The following equations for the termination constants are direct consequences of these definitions:

Δ ‖ ε = ε ‖ Δ = ε
⊥ ‖ ε = ε ‖ ⊥ = ⊥
0 ‖ ε = ε ‖ 0 = 0 if ε ≠ ε + ⊥
0 ‖ ε = ε ‖ 0 = ⊥ if ε = ε + ⊥

Parallel composition is commutative and associative, and it is the primary means for describing the interaction of agents. The simplest interaction is interleaving, which trivially defines the combination as a × b = ∅ for arbitrary actions. Agents in a parallel composition interact with each other and can synchronize via combined actions. Parallel composition can also be defined explicitly by the following recursive definition:

(u ‖ v) = Σ_{u −a→ u′, v −b→ v′, a×b ≠ ∅} (a × b) · (u′ ‖ v′) + Σ_{u −a→ u′} a · (u′ ‖ v) + Σ_{v −b→ v′} b · (u ‖ v′) + εu ‖ εv

where εu is the termination constant in the equational representation of the behavior u.
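A sketch of this recursive definition on the same finite-behavior encoding follows; the combinator comb(a, b) plays the role of a × b, returning None for the impossible action, and the treatment of the termination constants is deliberately simplified. The send/receive combination anticipates the message-passing example of Section 6.2.3.2.3; all names are illustrative.

```python
DELTA, BOT = frozenset({"delta"}), frozenset({"bot"})
prefix = lambda a, u: frozenset({(a, u)})

def par(u, v, comb):
    mu = [s for s in u if isinstance(s, tuple)]
    mv = [s for s in v if isinstance(s, tuple)]
    out = set()
    for a, u1 in mu:                       # interleaving: u moves alone
        out.add((a, par(u1, v, comb)))
    for b, v1 in mv:                       # interleaving: v moves alone
        out.add((b, par(u, v1, comb)))
    for a, u1 in mu:                       # synchronization on combined actions
        for b, v1 in mv:
            c = comb(a, b)
            if c is not None:
                out.add((c, par(u1, v1, comb)))
    if "bot" in u or "bot" in v:           # divergence dominates
        out.add("bot")
    elif "delta" in u and "delta" in v:    # Delta || Delta = Delta
        out.add("delta")
    return frozenset(out)

# send/receive pairs combine into an exchange action:
comb = lambda a, b: "exch" if {a, b} == {"send", "recv"} else None
print(par(prefix("send", DELTA), prefix("recv", DELTA), comb))
# contains the 'exch' synchronization plus the two interleavings
```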
6.2.3 Environments

An environment E is an agent over an action algebra C with an insertion function. All states of the environment are initial states. The insertion function, denoted by e[u], takes as arguments the behavior e of an environment and the behavior u of an agent over an action algebra A (the action algebra of agents may be a parameter of the environment) and yields a new behavior of the same environment. The insertion function is continuous in both of its arguments.

We consider agents up to an equivalence weaker than bisimilarity. Consider the example in Figure 6.2. Clearly, these systems are not bisimilar. However, if a represents the transmission of a message and b represents the reception of that message, the second trace of the left-hand system would not be possible within an environment that supports asynchronous message passing. Consequently, both systems would always behave the same. Insertion equivalence captures this: the environment can impose constraints on the inserted agent, such as disallowing the behavior b · a in this example. In such an environment, both behaviors shown in Figure 6.2 are considered equivalent.

Insertion equivalence depends on the environment and its insertion function. Two agents u and v are insertion equivalent with respect to an environment E, written u ∼E v, if for all e ∈ E, e[u] = e[v]. Each agent u defines a transformation on the set of environment states; two agents are equivalent with respect to a given environment if they define the same transformation of the environment.
FIGURE 6.2 Two systems which are not bisimilar, but may be insertion equivalent.
FIGURE 6.3 Agents in environment.
After insertion of an agent into an environment, the new environment is ready to accept further agents. Since insertion of several agents is a common operation, we use the notation e[u1, . . . , un] = e[u1] · · · [un] as a convenient shortcut for the insertion of several agents. In this expression, u1, . . . , un are agents inserted into the environment simultaneously, but the order of insertion may be essential for some environments. If we want an agent v to be inserted after an agent u, we must find some transition e[u] −a→ s and consider the expression s[v]. Some environments can move independently, suspending the actions of an agent inserted into them. In this case, if e[u] −a→ e′[u], then e′[u, v] describes the simultaneous insertion of u and v into the environment in state e, as well as the insertion of u when the environment is in state e followed by the insertion of v.

An agent can be inserted into the environment e[u1, u2, . . . , un], or that environment can itself be considered as an agent which can be inserted into a new external environment e′ with a different insertion function. An environment with inserted agents, as a transition system, is considered up to bisimilarity, but after insertion into a higher level environment it is considered up to insertion equivalence (Figure 6.3).

Some example environments arising in real-life situations are:

• A vehicle with sensors is an environment for a computer system.
• A computer system is an environment for programs.
• The operating system is an environment for application programs.
• A program is an environment for data, especially when considering interpreters or higher-order functional programs.
• The web is an environment for applets.

6.2.3.1 Insertion Functions

Each environment is defined by its insertion function. The restriction that the insertion function be continuous is too weak, and in practice more restricted types of insertion functions are considered. The states of environments and agents can be represented in algebraic form as expressions of a behavior algebra. To define an insertion function it is sufficient to define transitions on the set of expressions of the type e[u]. We use rules in the form of rewriting logic to define these transitions. The typical forms of such rules are:

F(x)[G(y)] → d · F′(z)[G′(z)]
F(x)[G(y)] → F′(z)[G′(z)]

where x = (x1, . . . , xn), y = (y1, . . . , yn), z = (x1, x2, . . . , y1, y2, . . .); x1, x2, . . . , y1, y2, . . . are action or behavior variables; and F, G, F′, G′ are expressions in the behavior algebra, that is, expressions built by nondeterministic choice and prefixing. More complex rules allow arbitrary expressions on the right-hand side in the behavior algebra extended by insertion as a two-sorted operation. The first type of rule defines observable transitions

F(x)[G(y)] −d→ F′(z)[G′(z)]

The second type of rule defines unlabeled transitions, which can be used as auxiliary rules. They are not observable outside the environment and can be reduced by the rule

e[u] −∗→ e′[u′], e′[u′] −d→ e″[u″] ⟹ e[u] −d→ e″[u″]

where −∗→ denotes the transitive closure of unlabeled transitions. Special rules or equations must be added for the termination constants. Rewriting rules must be left linear with respect to the behavior variables, that is, none of the behavior variables can occur more than once in the left-hand side. Additional completeness conditions must be present to ensure that all possible states of the environment are covered by the left-hand sides of the rules. Under these conditions, the insertion function will be continuous even if there are infinitely many rules. This is because, to compute the function e[u], one needs to know only some finite approximations of e and u. If e and u are defined by means of a system of fixed point equations, these approximations can easily be constructed by unfolding the equations sufficiently many times.

Insertion functions that are defined by means of rewriting rules can be classified on the basis of the height of the terms F(x) and G(y) in the left-hand sides of the rules. The simplest case is when this height is no more than 1, that is, terms are sums of variables and expressions of the form c · z, where c is an action and z is a variable. Such insertion functions are called one-step insertions; other important classes are head insertion and look-ahead insertion functions. For head insertion, the restriction that the height should not exceed 1 refers only to the agent behavior term G(y); the term F(x) can be of arbitrary height. Head insertion can be reduced to one-step insertion by changing the structure of the environment while preserving the insertion equivalence of agents. In head insertion, the interaction between the environment and the agent is similar to the interaction between a server and a client: the server has information only about the next step in the behavior of the client but knows everything about its own behavior. In a look-ahead insertion environment, the behavior of an agent can be analyzed for arbitrarily long (but finite) future steps. We can liken such an environment to the interaction between an interpreter and a program.
We now consider one-step insertion, which applies in many practical cases, restricting ourselves to purely additive insertion functions, that is, functions satisfying the following conditions:

(Σi ei)[u] = Σi ei[u]
e[Σi ui] = Σi e[ui]

Given two functions D1 : A × C → 2^C and D2 : C → 2^C, the transition rules for insertion functions are

u −a→ u′, e −c→ e′, d ∈ D1(a, c) ⟹ e[u] −d→ e′[u′]
e −c→ e′, d ∈ D2(c) ⟹ e[u] −d→ e′[u]

We refer to D1 and D2 as residual functions. The first rule (the interaction rule) defines the interaction between the agent and the environment, which consists of choosing a matching pair of actions a ∈ A and c ∈ C. Note that the environment and the agent move independently. If the choice of action is made first by the environment, then the choice of action c by the environment defines a set of actions that the agent may take: a can be chosen only so that D1(a, c) ≠ ∅. The observable action d must be selected from the set D1(a, c). This selection can be restricted by the external environment if e[u], considered as an agent, is inserted into another environment, or by other agents inserted into the environment e[u] after u. This rule can be combined with rules for unobservable transitions if some action, say τ (as in Milner's CCS), is selected in C to hide the transition. For this case we formulate the interaction rule so as to account for hidden interactions:

u −a→ u′, e −c→ e′, τ ∈ D1(a, c) ⟹ e[u] −→ e′[u′]

The second rule (the environment move rule) describes the case when the environment transitions independently of the inserted agent, and the agent waits until the environment allows it to move. Unobservable transitions can also be combined with environment moves. Some equations should be added for the case when e or u is a termination constant. We shall assume that ⊥[u] = ⊥, 0[u] = 0, e[Δ] = e, e[⊥] = ⊥, and e[0] = 0. There are no specific assumptions about Δ[u], but usually neither Δ nor 0 belongs to E. Note that in the case when Δ ∈ E and Δ[u] = u, insertion equivalence coincides with bisimulation. The definition of the insertion function for one-step insertion discussed earlier is complete if we assume that there are no transitions other than those defined by the rules. The definition above can be expressed in the form of rewriting rules as follows:

d ∈ D1(a, c) ⇒ (c · x)[a · y] → d · x[y]
d ∈ D2(c) ⇒ (c · x)[y] → d · x[y]

and in the form of an explicit recursive definition as

e[u] = Σ_{e −c→ e′, u −a→ u′, d ∈ D1(a,c)} d · e′[u′] + Σ_{e −c→ e′, d ∈ D2(c)} d · e′[u] + εe[u]

To compute transitions for the multiagent environment e[u1, u2, . . . , un] we recursively compute transitions for e[u1], then for e[u1, u2] = (e[u1])[u2], and eventually for e[u1, u2, . . . , un] = (e[u1, u2, . . . , un−1])[un].

Important special cases of one-step insertion functions are parallel and sequential insertion. An insertion function is called a parallel insertion if

e[u, v] = e[u ‖ v]

This means that the subsequent insertion of two agents can be replaced by the insertion of their parallel composition. The simplest example of a parallel insertion is defined as e[u] = e ‖ u. This special case holds when the sets of actions of environment and agents are the same (A = C), D1(a, a × b) = {b}, and D2(a) = A. In the case when Δ ∈ E, this environment is the set of all other agents interacting with a given agent in parallel, and insertion equivalence coincides with bisimilarity. Sequential insertion is introduced in a similar way:

e[u, v] = e[u; v]

This situation holds, for example, when D1(a, c) = ∅, D2(c) = C, and Δ[u] = u.

6.2.3.2 Example: Agents over a Shared and Distributed Store

As an example, consider a store, which generalizes the notions of memory, databases, and other information environments used by programs and agents to hold data. An abstract store environment E is an environment over an action algebra C which contains a set of actions A used by agents inserted into this environment. We shall distinguish between local and shared store environments. The former can interact with an agent inserted into it while this agent is not in a final state; if another agent is inserted into this environment, the activity of the latter is suspended until the former completes its work. The shared store admits interleaving of the activity of the agents inserted into it, and they can interact concurrently through this shared store.

6.2.3.2.1 Local and Shared Store

The residual functions for a local store are defined as

D1(a, c) = {d | c = a × d}, where d ≠ ∅ and either d ∈ C\A or d = δ, and D2(c) = C

and for a shared store as

D1(a, c) = {d | c = a × d}, where d ≠ ∅ and d ∈ C, and D2(c) = C

It can be proved that the one-step insertion function for a local store is a sequential insertion and that the one-step insertion for a shared store is a parallel insertion. In other words, e[u1, u2, . . .] = e[u1; u2; . . .] for a local store and e[u1, u2, . . .] = e[u1 ‖ u2 ‖ . . .] for a shared store. The interaction move for the local store is defined as

u −a→ u′, e −a×d→ e′ ⟹ e[u] −d→ e′[u′]

When the store moves according to this rule, an agent inserted into it plays the role of control for this store. A store in a state e[u] can only perform actions which are allowed by the agent u. An action of the store can be combined only with an action d which is not from the action set A and therefore cannot be used by another agent in a transition. The actions returned by the residual function are external actions and can be observed and used only from outside the store environment. In contrast to a local store, in a shared store environment several agents can perform their actions in parallel, according to the rule

u1 −a1→ u1′, . . . , un −an→ un′, e −a1×···×an×d→ e′ ⟹ e[u1 ‖ · · · ‖ un ‖ v] −d→ e′[u1′ ‖ · · · ‖ un′ ‖ v]

An important special case of the store environment E is a memory over a set of names R and a data domain D. The memory can be represented by an attributed transition system with attributes R and states e : R → D. Agent actions are assignments and conditions, and combinations of actions are possible if they can be performed simultaneously. If a is a set of assignments, then in a transition e −a→ e′ the state e′ results from applying a to e. A conjunction of conditions c enables a transition e −c×a→ e′ if c is valid on e and e −a→ e′.

6.2.3.2.2 Multilevel Store

For a shared memory store, the residual action d in the transition

e[u1 ‖ · · · ‖ un ‖ v] −d→ e′[u1′ ‖ · · · ‖ un′ ‖ v]

is intended to be used by external agents inserted later, but in a multilevel store it is convenient to restrict the interaction with the environment to a given set of agents which have already been inserted. For this purpose, a shared memory can be inserted into a higher level closure environment with an insertion function defined by the equation

g[e[u]][v] = g[e[u ‖ v]]

where g is a state of this environment and e is a shared memory environment, and only the following two rules are used for transitions in the closure environment:

e[u] −c→ e′[u′], c ∈ Cext ∧ c ≠ δ ⟹ g[e[u]] −ϕext(c,e)→ g[e′[u′]]
e[u] −δ→ e′[u′] ⟹ g[e[u]] → g[e′[u′]]

Here Cext is a distinguished set of external actions. Some external actions can contain occurrences of names from e. The function ϕext substitutes the values of these names into c and performs other transformations to make an action observable for the external environment.

Two-level insertion can be described in the following way. Let R = R1 ∪ R2 be divided into two nonintersecting parts, the external and the internal memory. Let A1 be the set of actions which change only the values of R1 but can use the values of R2 (external output actions), let A2 be the set of actions which change only the values of R2 but can use the values of R1 (external input actions), and let A3 be the set of actions which change and use only the values of R2 (internal actions). These sets are assumed to be defined on the syntactic level. Redefine the residual function D1 and the transitions of E as follows: let a ∈ A and split a into a combination of actions ϕ1(a) × ϕ2(a) × ϕ3(a) so that ϕ1(a) ∈ A1, ϕ2(a) ∈ A2, and ϕ3(a) ∈ A3 (some of these actions may be equal to δ). Define the interaction rule in the following way:

u −a→ u′, e −(ϕ2(a)σ)×ϕ3(a)→ e′ ⟹ e[u] −cσ×ϕ1(a)→ e′[u′]

where σ is an arbitrary substitution of the names used in the conditions and in the right-hand sides of the assignments of ϕ2(a) by elements of the set of their values, bσ is the application of the substitution σ to b, and cσ is the substitution written in the form of the condition r1 = σ(r1) ∧ r2 = σ(r2) ∧ · · · . Define ϕext(b, e) = be, that is, the substitution of the values of R2 into b. Consider a two-level structure of a store state

t[g[e1[u1]] ‖ g[e2[u2]] ‖ · · · ]

where t ∈ D^R1 is a shared store and e1, e2, . . . ∈ D^R2 represent the distributed store (memory). When a component g[ei[ui]] performs internal actions, these are hidden and do not affect the shared memory. Performing external output actions changes the names of the shared memory, and external input actions receive values from the shared memory to change components of the distributed memory. This construction is easily iterated, as the components of a distributed memory can themselves have a multilevel structure.

6.2.3.2.3 Message Passing

Distributed components can interact via shared memory. We now introduce direct interaction via message passing. Synchronous communication can be organized by extending the set of actions with a combination of actions in parallel composition, independently of the insertion function. To describe synchronous data exchange in the most general abstract schema, let

u = Σ_{d∈D} a(d) · F(d)
u′ = Σ_{d′∈D} a′(d′) · F′(d′)

be two agents which use the data domain D for the exchange of information. The functions a and a′ map the data domain into the action algebra, and the functions F and F′ map elements of the data domain to behaviors. The parallel composition of u and u′ is

u ‖ u′ = Σ_{a(d)×a′(d′) ≠ ∅} (a(d) × a′(d′)) · (F(d) ‖ F′(d′)) + Σ_{d∈D} a(d) · (F(d) ‖ u′) + Σ_{d′∈D} a′(d′) · (F′(d′) ‖ u)

(note that εu = εu′ = 0, i.e., this is a special case of parallel composition in which there are no termination constants). The first summand corresponds to the interaction of the two agents. The other two summands reflect the possibility of interleaving. The interaction can be deterministic, even if u and u′ are nondeterministic, if a(d) × a′(d′) ≠ ∅ has only one solution. Interleaving makes it possible to select other actions if u ‖ u′ is embedded into another parallel composition. Interleaving actions can also be hidden by a closure environment (similar to restriction in the Calculus of Communicating Systems, CCS). The exchange of information through combination is bidirectional. An important special case of information exchange is the use of send/receive pairs. For example, consider the following combination rule:

send(addr, d) × receive(addr′, d′) = exch(addr) if addr = addr′ and d = d′, and ∅ otherwise

In the latter case, if

u = send(addr, d) · v

and

u′ = Σ_{d′∈D} receive(addr, d′) · F′(d′)

the interaction summand of the parallel composition will be exch(addr) · (v ‖ F′(d)). Asynchronous message passing via channels can be described by introducing a special communication environment. The attributes of this environment are channels, and their values are sequences (queues) of stored messages. It is organized similarly to the memory environment, but queue operations are used instead of storing. In addition, send and receive actions are separated in time. This environment is a special case of a store environment and can be combined with a store environment, keeping the different types of attributes and actions separate.
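To make the shared-store environment of this subsection concrete, the following sketch interleaves agents whose actions are guarded assignments over a store of named attributes. The representation and all names are illustrative assumptions, not the chapter's formal machinery.

```python
def step(store, agents):
    """One interleaving step over a shared store: yield (label, store', agents')."""
    for i, agent in enumerate(agents):
        for (guard, update, rest) in agent:   # an agent is a list of summands
            if guard(store):                  # condition part of the action
                new_store = dict(store)
                new_store.update(update)      # assignment part of the action
                new_agents = agents[:i] + (rest,) + agents[i + 1:]
                yield (f"agent{i}:{update}", new_store, new_agents)

# Two agents communicating through the shared name "x"; the empty list []
# is the deadlock behavior 0 (no further summands).
producer = [(lambda e: e["x"] == 0, {"x": 1}, [])]
consumer = [(lambda e: e["x"] == 1, {"x": 0, "got": 1}, [])]

store, agents = {"x": 0, "got": 0}, (producer, consumer)
for label, s1, a1 in step(store, agents):
    print(label, s1)
    for label2, s2, _ in step(s1, a1):
        print(" ", label2, s2)
```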
6.2.4 Classical Theories of Concurrency

The theory of interaction of agents and environments [9–11] focuses on the description of multiagent systems comprised of agents cooperatively working within a distributed information environment. Other mathematical models for the specification of dynamic and real time systems interacting with environments have been developed based on process algebras (CSP, CCS, ACP, etc.), automata models (timed Büchi and Muller automata, abstract state machines [ASM]), and temporal logics (LPTL, LTL, CTL, CTL∗). New models are being developed to support the peculiarities of particular application areas, such as Milner's π-calculus [12] for mobility and its recent extensions to object-oriented descriptions.

The environment may change the predefined behavior of an agent. For example, it may contain other agents, designed independently and intended to interact and communicate with the agent during its execution. The classical theories of communication consider this interaction as part of the parallel composition of agents. The influence of the environment can be expressed as an explicit language operation such as restriction (CCS) or hiding (CSP). In contrast to the classical theories of interaction, which are based on an implicit and hence not formalized notion of an environment, the theory of interaction of agents and environments studies agents and environments as objects of different types. In our approach the environment is considered as a semantic notion and is not explicitly included in the agent. Instead, the meaning of an agent is defined as a transformation of an environment, which corresponds to inserting the agent into its environment. When the agent is inserted into the environment, the environment changes, and this change is considered to be a property of the agent described.

6.2.4.1 Process Algebras

An algebraic theory of concurrency and communication that deals with the occurrence of events rather than with updates of stored values is called a process algebra. The main variants of process algebra are generally known by their acronyms: CCS [8], Milner's Calculus of Communicating Systems; CSP [13], Hoare's Communicating Sequential Processes; and ACP, the Algebra of Communicating Processes of Bergstra and Klop [14]. These theories are based on transition systems and bisimulation, and consider the interaction of composed agents. They employ nondeterministic choice as well as parallel and sequential composition as primitive constructs. The influence of the environment on the system may be expressed as an explicit language operation, such as restriction in CCS or hiding in CSP. These theories consider communicating agents as objects of the same type (this type may be parameterized by the alphabets of events or actions) and define operations on these types.

The CCS model specifies sets of states of systems (processes) and transitions between these states. The states of a process are terms, and the transitions are defined by the operational semantics of the computation, which indicates how and under which conditions a term transforms itself into another term. Processes are represented by synchronization trees (or process graphs). Two processes are identified through bisimulation.
CCS introduces a special action τ, called the silent action, which represents an internal and invisible transition within a process. Other actions are split into two classes: output actions, which are indicated by an overbar, and input actions, which are not decorated. Synchronization only takes place between a single input and a single output, and the result is always the silent action τ. Thus, a × ā = τ for all actions a. Consequently, communication serves only as synchronization; its result is not visible.

The π-calculus [12] is an enhancement of CCS that models concurrent computation by processes exchanging messages over named channels. A distributed interpretation of the π-calculus provides for synchronous message passing and nondeterministic choice. The π-calculus focuses on the specification of the behavior of mobile concurrent processes, where "mobility" refers to variable communication topology via named channels, which are the main entities of the π-calculus. Synchronization takes place only between two channel agents when they are available for interchange (a named output channel is indicated by an overbar, while an input channel with the same name is not decorated). The influence of the environment is expressed in the π-calculus as an explicit operation of the language (hiding); as a result of this operation, a channel is declared inaccessible to the environment.

CSP explicitly differentiates the sets of atomic actions that are allowed in each of the parallel processes. The parallel combinator is indexed by these sets: in (P{A} ‖ Q{B}), P engages only in events from the set A, and Q only in events from the set B. Each event in the intersection of A and B requires the synchronous participation of both processes, whereas other events require the participation of only the relevant single process. As a result, a × a = a for all actions a. The associative and commutative binary operator × describes how the output data supplied by two processes are combined before transmission to their common environment. In CSP, a process is considered to run in an environment which can veto the performance of certain atomic actions. If, at some moment during the execution, no action in which the process is prepared to engage is allowed by the environment, then a deadlock occurs, which is considered to be observable. Since in CSP a process is fully determined by the observations obtainable from all possible finite interactions, a process is represented by its failure set. To define the meaning of a CSP program, we determine the set of states corresponding to normal termination of the program and the set of states corresponding to its failures. Thus, the CSP semantics is presented in model-theoretic terms: two CSP processes are identified if they have the same failure set (failure equivalence).

The main operations of ACP are prefixing and nondeterministic choice. This algebra allows an event to occur with the participation of only a subset of the concurrently active processes, perhaps omitting any that are not ready. As a result, the parallel composition of processes is a mixture of synchronization and interleaving, where each of the processes either moves independently or is combined by × with a corresponding event of another process. The merge operator is defined as Merge(a, b) = (a × b) + (a; b) + (b; a). ACP defines its semantics algebraically; processes are identified through bisimulation.
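The contrast between the CCS and CSP synchronization disciplines can be phrased as two different action combinators, sketched below; this is an illustration of the combination laws only, not the calculi themselves, and the "~" prefix is an ad hoc encoding of the CCS overbar.

```python
def ccs_comb(a, b):
    """CCS: an action synchronizes with its complement and yields tau."""
    if a == "~" + b or b == "~" + a:
        return "tau"          # the result of communication is invisible
    return None               # otherwise the actions cannot be combined

def csp_comb(a, b):
    """CSP: processes synchronize on identical events, a x a = a."""
    return a if a == b else None

print(ccs_comb("~send", "send"))   # 'tau'
print(csp_comb("send", "send"))    # 'send'
```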
Most differences between CCS, ACP, and CSP can be attributed to differences in the chosen style of presentation of the semantics: the CSP theory provides a model, illustrated with algebraic laws; CCS is a calculus, but the rules and axioms in this calculus are presented as laws valid in a given model; ACP is a calculus that forms the core of a family of axiomatic systems, each describing some features of concurrency.

6.2.4.2 Temporal Logic

Temporal logic is a formal specification language for the description of various properties of systems. A temporal logic is a logic augmented with temporal modalities that allow the specification of the order of events in time, without introducing time explicitly as a concept. Whereas traditional logics can specify properties relating to the initial and final states of terminating systems, a temporal logic is better suited to describing the ongoing behavior of nonterminating and interacting (reactive) systems.

As an example, Lamport's TLA (Temporal Logic of Actions) [5,15] is based on Pnueli's temporal logic [16] with assignment and an enriched signature. It supports syntactic elements taken from programming languages to ease the maintenance of large specifications.
TLA uses formulae on behaviors, where a behavior is considered as a sequence of states. States in TLA are assignments of values to variables. A system satisfies a formula iff the formula is true in all behaviors of the system. Formulae whose arguments are only the old and the new states are called actions.

Here, we distinguish between linear and branching temporal logics. In a linear temporal logic, each moment of time has a unique possible future, while in a branching temporal logic, each moment of time may have several possible futures. Linear temporal logic formulae are interpreted over linear sequences of points in time and specify the behavior of a single computation of a system. Formulae of a branching temporal logic, on the other hand, are interpreted over tree-like structures, each describing the behavior of the possible computations of a nondeterministic system.

Many temporal logics are decidable, and corresponding decision procedures exist for linear and branching time logics [17], propositional modal logic [18], and some variants of CTL∗ [19]. These decision procedures proceed by building a canonical model for a set of temporal formulae representing properties of the system to be verified, using techniques from automata theory, semantic tableaux, or binary decision diagrams [20]. Determining whether such properties hold for a system amounts to establishing that the corresponding formulae are true in a model of the system. Model checking based on these decision procedures has been successfully applied to find subtle errors in industrial-size specifications of sequential circuits, communication protocols, and digital controllers [21]. Typically, the system to be verified is modeled as a (finite) state transition graph, and its properties are formulated in an appropriate propositional temporal logic. An efficient search procedure is then used to determine whether the state transition graph satisfies the temporal formulae or not. This technique was first developed in the 1980s by Clarke and Emerson [22] and by Quielle and Sifakis [23], and was later extended by Burch et al. [21].

Examples of temporal properties (properties of the interaction between processes in a reactive system) are as diverse as their applications (this classification was introduced in References 2 and 24); a small sketch of the first two classes follows this list:

• Safety properties state that "something bad never happens" (a program never enters an unacceptable state).
• Liveness properties state that "something good will eventually happen" (a program eventually enters a desirable state).
• Guarantees specify that an event will eventually happen but do not promise repetitions.
• Obligations are disjunctions of safety and guarantee formulae.
• Responses specify that an event will happen infinitely many times.
• Persistence specifies the eventual stabilization of a system condition after an arbitrary delay.
• Reactivity is the maximal class formed from the disjunction of response and persistence properties.
• Unconditional fairness states that a property p holds infinitely often.
• Weak fairness states that if a property p is continuously true, then the property q must be true infinitely often.
• Strong fairness states that if a property p is true infinitely often, then the property q must be true infinitely often.
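The following sketch approximates the first two property classes as predicates over a finite prefix of a behavior; it is for intuition only, since temporal logics are interpreted over complete (often infinite) behaviors, and the state fields are invented for the example.

```python
def always(p, trace):
    """Safety on a finite prefix: p holds in every observed state."""
    return all(p(s) for s in trace)

def eventually(p, trace):
    """Liveness/guarantee on a finite prefix: p holds in some observed state."""
    return any(p(s) for s in trace)

# A request is eventually acknowledged, and we never acknowledge
# without a pending request:
trace = [{"req": True, "ack": False},
         {"req": True, "ack": True},
         {"req": False, "ack": False}]
print(always(lambda s: s["req"] or not s["ack"], trace))  # safety: True
print(eventually(lambda s: s["ack"], trace))              # liveness: True
```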
6.2.4.3 Timed Automata

Timed automata accept timed words — infinite sequences in which a real-valued time of occurrence is associated with each symbol. A timed automaton is a finite automaton with a finite set of real-valued clocks. The clocks can be reset to 0 (independently of each other) by the transitions of the automaton, and they keep track of the time elapsed since the last reset. Transitions of the automaton put constraints on the clock values, so that a transition may be taken only if the current values of the clocks satisfy the associated constraints. Timed automata can capture qualitative features of real time systems, such as liveness, fairness, and nondeterminism, as well as quantitative features, such as periodicity, bounded response, and timing delays.
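A single guarded, clock-resetting transition can be sketched as follows; the edge structure and all names are assumptions made for the illustration.

```python
def take_edge(location, clocks, edge, now):
    """edge = (src, guard, resets, dst); clocks maps clock name -> last reset time."""
    src, guard, resets, dst = edge
    values = {c: now - t0 for c, t0 in clocks.items()}   # time elapsed since reset
    if location == src and guard(values):
        new_clocks = dict(clocks)
        for c in resets:
            new_clocks[c] = now                          # reset clock c to 0
        return dst, new_clocks
    return None

# "the alarm may fire only if at least 5 time units passed since arming":
edge = ("armed", lambda v: v["x"] >= 5.0, ["x"], "fired")
print(take_edge("armed", {"x": 0.0}, edge, now=6.2))     # ('fired', {'x': 6.2})
print(take_edge("armed", {"x": 0.0}, edge, now=3.0))     # None: guard fails
```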
Timed automata are a generalization of finite ω-automata (Büchi and Muller automata [25,26]). When Büchi automata are used for modeling finite-state concurrent processes, the verification problem is reduced to one of language inclusion [27,28]. While the inclusion problem for ω-regular languages is decidable [26], for timed automata the inclusion problem is undecidable, which constitutes a serious obstacle to using timed automata as a specification language for the validation of finite-state real time systems [29].

6.2.4.4 Abstract State Machine

Gurevich's ASM project [30,31] attempts to apply formal models of computation to practical specification methods. ASM assumes the "Implicit Turing Thesis," according to which every algorithm can be modeled, at its appropriate abstraction level, by a corresponding ASM. ASM descriptions are based on the concept of evolving algebras, which are transition systems on static algebras. Each static algebra represents a state of the modeled system, and transition rules correspond to transitions in the modeled system. To simplify the semantics and ease proofs, transitions are limited: they can change only functions, but not sorts, and cannot directly change the universe.

A single-agent ASM is defined over a vocabulary (a set of functions, predicates, and domains). Its states are defined by assigning an interpretation to the elements of the vocabulary. An ASM program describes the rules for transitioning between states. It is defined by basic transition rules, such as updates (changes to the interpretation of the vocabulary), conditions (apply only if some specific condition holds), or choice (extract from a state elements with given properties), or by combinations of transition rules into complex rules. Multiagent ASMs consist of a number of agents that execute their ASM programs concurrently and interact through globally shared locations of a state. Concurrency between agents is modeled by partially ordered runs. The program steps executed by each agent are linearly ordered; in addition, program steps in different programs are ordered if they represent causality relations. Multiagent ASMs rely on a continuous global system time to model time-related aspects.

6.2.4.5 Rewriting Logic

Rewriting logic [32] makes it possible to prove assertions about concurrent systems whose states change under transitions. Rewriting logic extends equational logic and constitutes a logical framework in which many logics and semantic formalisms can be represented naturally (i.e., without distorting encodings). Just as algebras give a semantic interpretation to equational logic, the models of rewriting logic are concurrent systems. Moreover, models of concurrent computation, object-oriented design languages, architectural description languages, and languages for distributed components also have natural semantics in rewriting logic [33]. In rewriting logic, system states are in a bijective correspondence with formulae (modulo whatever structural axioms are satisfied by such formulae, e.g., modulo associativity or commutativity of connectives), and concurrent computations in a system are in a bijective correspondence with proofs (modulo appropriate notions of equivalence). Given this equivalence between computation and logic, a rewriting logic axiom of the form t → t′ has two readings. Computationally, it means that a fragment of a system state that is an instance of the pattern t can change to the corresponding instance of t′, concurrently with any other state changes.
The computational meaning is that of a local concurrent transition. Logically, the axiom just means that we can derive the formula t′ from the formula t; that is, the logical reading is that of an inference rule. Computation consists of rewriting to a normal form, that is, to an expression that can be rewritten no further; when the normal form is unique, it is taken as the value of the initial expression. When rewriting equal terms always leads to the same normal form, the set of rules is said to be confluent, and rewriting can be used to check for equality.
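A tiny term rewriter illustrates the computational reading of t → t′: left-hand sides are matched against (sub)terms and replaced until a normal form is reached. This is a sketch under our own term encoding (tuples for applications, "?"-prefixed strings for rule variables), not Maude syntax.

```python
def match(pattern, term, env):
    """Try to extend `env` so that pattern[env] == term; None on failure."""
    if isinstance(pattern, str) and pattern.startswith("?"):
        if pattern in env:
            return env if env[pattern] == term else None
        return {**env, pattern: term}
    if isinstance(pattern, tuple) and isinstance(term, tuple) \
            and len(pattern) == len(term) and pattern[0] == term[0]:
        for p, t in zip(pattern[1:], term[1:]):
            env = match(p, t, env)
            if env is None:
                return None
        return env
    return env if pattern == term else None

def subst(term, env):
    if isinstance(term, str) and term.startswith("?"):
        return env[term]
    if isinstance(term, tuple):
        return (term[0],) + tuple(subst(t, env) for t in term[1:])
    return term

def normalize(term, rules):
    if isinstance(term, tuple):  # rewrite subterms first (innermost strategy)
        term = (term[0],) + tuple(normalize(t, rules) for t in term[1:])
    for lhs, rhs in rules:
        env = match(lhs, term, {})
        if env is not None:
            return normalize(subst(rhs, env), rules)
    return term

# Peano addition:  0 + y -> y,  s(x) + y -> s(x + y)
rules = [(("plus", "0", "?y"), "?y"),
         (("plus", ("s", "?x"), "?y"), ("s", ("plus", "?x", "?y")))]
two, one = ("s", ("s", "0")), ("s", "0")
print(normalize(("plus", two, one), rules))   # ('s', ('s', ('s', '0')))
```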
Rewriting logic is reflective [34,35], and thus important aspects of its meta-theory can be represented at the object level in a consistent way. The language Maude [36,37] has been developed at SRI to implement a framework for rewriting logic. The language design and implementation of Maude systematically leverage the reflection of rewriting logic and make its meta-theory accessible to the user, allowing one to create within the logic a formal environment for the logic, with tools for formal analysis, transformation, and theorem proving.
6.3 Requirements Capture and Validation

In Reference 38, requirements capture is defined as an engineering process for determining what artifacts are to be produced as the result of the development effort. The process involves the following steps:

• Requirements identification
• Requirements analysis
• Requirements representation
• Requirements communication
• Development of acceptance criteria and procedures
Requirements can be considered as an agreement between the customer and the developer. As agreements, they must be understandable to the customer as well as to the developer, and the level of formalization depends on the common understanding and the previous experience of those involved in the process of requirements identification. The main properties of the system to be developed, including its purpose, functionality, conditions of use, efficiency, and safety, liveness, or fairness properties, are specified in the requirements along with the main goals of the development project. The requirements also include an explanation of the terms referenced, information about other already developed systems which should be reused in the development process, and possible decisions on the implementation and structuring of the system.

It is well recognized that identifying and correcting problems in the requirements and early design phases avoids far more expensive repairs later. Boehm quotes late life-cycle fixes to be a hundred times more expensive than corrections made during the early phases of system development [39]. In Reference 40, Boehm documents that the relative cost of repairing errors increases exponentially with the life-cycle phase at which the error is detected. Kelly et al. [41] document a significantly higher density of defects found during the requirements phase as compared with later life-cycle phases. Early life-cycle defects are also very prevalent: in Reference 42, it was shown that of 197 critical faults found during integration testing of spacecraft software, only 3 were programming mistakes. The other 194 were introduced at earlier stages. Fifty percent of these faults were owing to flawed requirements (mainly omissions) for individual components, 25% were owing to flawed designs for these components, and the remaining 25% were owing to flawed interfaces between components and incorrect interactions between them. A number of other studies [43,44] reveal that most errors in software systems originate in the requirements and functional specifications. The earlier such errors are detected, the less expensive they are to repair.

A requirements specification must describe the external behavior of a system in terms of observable events and actions. The latter describe the interaction of system components with their environments, including other components and the acceptable parts of the external physical world, that is, those aspects of the world which can influence the behavior of the system. We consider requirements to be correct if they are consistent and complete. Checking the consistency and completeness of requirements is the final task of requirements representation. The informal understanding of the consistency of requirements is the existence of an implementation which satisfies the requirements. In other words, if requirements are inconsistent, then an implementation free of errors (bugs) that satisfies all the requirements cannot be created. Unfortunately, most ways of describing requirements and preliminary designs (natural language, diagrams, pseudocode) do not offer mechanized means of establishing correctness, so the primary means
to deduce their properties and consequences is through inspections and reviews. To be amenable to analysis, the requirements must first be formalized, that is, rewritten in some formal language. We call this formalized description a requirements specification. It should be free of implementation details and can be used for the development of an executable model of a system, which is called an executable specification. Note that any description that is formal enough to be amenable to operational interpretation will also provide some method of implementation, albeit usually a rather inefficient one.

The existence of an executable specification which satisfies the formalized requirements is a sufficient condition of consistency. It is sufficient because a final implementation that is free of errors can be extracted from executable specifications using formal methods such as stepwise refinement. Completeness of requirements is understood as the uniqueness of the executable specification, considered up to some equivalence. The intuitive understanding of equivalence is as follows: two executable specifications are equivalent if they demonstrate the same behaviors in the same environments. Completeness can also be expressed in terms of determinism of the executable specification (the requirements are sufficient for constructing a unique deterministic model). In some cases, incompleteness of requirements is not harmful, because it can be motivated by the necessity to suspend implementation decisions until later stages of development.

The correspondence between the original requirements and the requirements specification is not formal. Experience has shown that special skills are required to check the correspondence between informal and formal requirements. Incompleteness and inconsistencies discovered at this stage drive the improvement and correction of the requirements and, in turn, of the requirements specification.
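For finite deterministic models, the equivalence just described can be checked mechanically. The following Python sketch is purely illustrative (the encoding of transition systems as dictionaries is an assumption of the sketch, not a notation from this chapter): it decides whether two deterministic labeled transition systems exhibit the same behaviors by a synchronized breadth-first traversal.

from collections import deque

def equivalent(init1, delta1, init2, delta2):
    """Trace equivalence of two *deterministic* labeled transition systems.
    delta maps (state, action) -> next state. Paired states must enable
    the same actions, and all their successors must pair up as well."""
    seen, queue = set(), deque([(init1, init2)])
    while queue:
        s1, s2 = queue.popleft()
        if (s1, s2) in seen:
            continue
        seen.add((s1, s2))
        acts1 = {a for (s, a) in delta1 if s == s1}
        acts2 = {a for (s, a) in delta2 if s == s2}
        if acts1 != acts2:  # one model allows a behavior the other forbids
            return False
        queue.extend((delta1[(s1, a)], delta2[(s2, a)]) for a in acts1)
    return True

# Two renderings of the same two-state toggle are equivalent:
d1 = {("off", "press"): "on", ("on", "press"): "off"}
d2 = {(0, "press"): 1, (1, "press"): 0}
print(equivalent("off", d1, 0, d2))  # True

For deterministic systems, trace equivalence coincides with bisimulation, so this simple traversal suffices; nondeterministic models require the more expensive bisimulation algorithms mentioned later in connection with CADP.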
6.3.1 Approaches to Requirements Validation

Standard approaches to requirements analysis and validation typically involve manual processes such as "walk-throughs" or Fagan-style inspections [45,46]. The term walk-through refers to a range of activities that can vary from cursory peer reviews to formal inspections, although walk-throughs usually do not involve the replicable processes and methodical data collection that characterize Fagan-style inspections. Fagan's highly structured inspection process was originally developed for hardware logic, later applied to software design and code, and eventually extended to all life-cycle phases, including requirements development and high-level design [45].

A Fagan inspection involves a review team with the following roles: a Moderator, an Author, a Reader, and a Tester. The Reader presents the design or code to the other team members, systematically walking through every piece of logic and every branch at least once. The Author represents the viewpoint of the designer or coder, and the test perspective is represented by the Tester. The Moderator is trained to facilitate intensive but constructive discussion. When the functionality of the system is well understood, the focus shifts to a search for faults, possibly using a checklist of likely errors to guide the process. The inspection process includes highly structured rework. One of the main advantages of Fagan-style inspections over other conventional forms of verification and validation is that inspections can be applied early in the life cycle. Thus, potential anomalies can be detected before they become entrenched in low-level design and implementation.

Rushby [47] gives an overview of techniques of mechanized formal methods: decision procedures for specialized, but ubiquitous, theories such as arithmetic, equality, and the propositional calculus are helpful in discovering false theorems (especially if they can be extended to provide counterexamples) as well as in proving true ones, and their automation dramatically improves the efficiency of proof. Rewriting is essential to the efficient mechanization of formal methods. Unrestricted rewriting provides a decision procedure for theories axiomatized by terminating and confluent sets of rewrite rules, but few such theories arise in practice. Integration of various techniques increases the efficiency of these methods. For example, theorem proving attempts to show that a formula follows from given premises, while model checking attempts to show that a given system description is a model for the formula. An advantage of model checking is that, for certain finite-state systems and temporal logic formulae, it can be automated and is more efficient than
theorem proving. Benefits of an integrated system (providing both theorem proving and model checking) are that model checking can be used to discharge some cases in a larger proof and theorem proving can be used to justify the reduction to a finite state space that is required for automated model checking. Integration of these techniques can provide a further benefit: before undertaking a potentially difficult and costly proof, we may be able to use model checking to examine special cases. Any errors that can be discovered and eliminated in this way will save time and effort during theorem proving. Model checking determines whether a given formula stating a property of the specification is satisfied in a Kripke model (i.e., in the specification represented as a Kripke model). In the worst case, these algorithms must traverse the whole of the model, that is, visit all states of the corresponding transition system; consequently, model checking applies mainly to finite-state systems even in the presence of sophisticated means of representing the set of states, such as binary decision diagram (BDD) methods [20], while theorems about infinite-state systems can be proven only by means of deductive methods.
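To make this traversal concrete, the toy sketch below (illustrative only, and not any of the tools surveyed here; the successor-map encoding and all names are made up for the example) checks the two simplest temporal patterns, "every reachable state satisfies P" and "some reachable state satisfies P", by exhaustively visiting the reachable states of a finite transition system:

def reachable(init, succ):
    """Enumerate all states reachable from init; succ maps a state to its
    list of successor states. This exhaustive visit is what restricts
    explicit-state model checking to finite-state systems."""
    seen, stack = set(), [init]
    while stack:
        s = stack.pop()
        if s not in seen:
            seen.add(s)
            stack.extend(succ[s])
    return seen

def always_p(init, succ, prop):     # AG prop, in CTL terms
    return all(prop(s) for s in reachable(init, succ))

def sometimes_p(init, succ, prop):  # EF prop, in CTL terms
    return any(prop(s) for s in reachable(init, succ))

# A three-state mutual-exclusion toy: states are (flag1, flag2).
succ = {(0, 0): [(1, 0), (0, 1)], (1, 0): [(0, 0)], (0, 1): [(0, 0)]}
print(always_p((0, 0), succ, lambda s: s != (1, 1)))  # True: never both critical

Symbolic methods such as the BDD techniques mentioned above replace the explicit set "seen" by a compact representation of whole sets of states, which is what allows practical tools to go far beyond what this naive enumeration can handle.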
6.3.2 Tools for Requirements Validation

Formal methods may be classified according to their primary purpose as descriptive or analytic. Descriptive methods focus largely on specifications as a medium for review and discussion, whereas analytic methods focus on the utility of a specification as a mathematical model for analyzing and predicting the behavior of systems. Not surprisingly, the different emphasis is reflected in the type of formal language favored by each class of methods.

Descriptive formal methods emphasize the expressive power of the underlying language and provide a rich type system, often leveraging the notations of conventional mathematics or set theory. These choices in language elements do not readily support automation; instead, descriptive methods typically offer attractive user interfaces and little in the way of deductive machinery. These methods assume that the specification process itself serves as verification, as expressing the requirements in mathematical form leads to the detection of inconsistencies that are typically overlooked in natural language descriptions. Examples of primarily descriptive formal methods are VDM [48], Z [49], B [50], and LOTOS [51].

Analytic formal methods place emphasis on mechanization and favor specification languages that are less expressive but capable of supporting efficient automated deduction. These methods vary in the degree of automation provided by the theorem prover or, conversely, in the amount of user interaction in the proof process. They range from automatic theorem proving without user interaction to proof checking without automatic proof steps. The former typically have restricted specification languages and powerful provers that can be difficult to control and offer little feedback on failed proofs, but perform impressively in the hands of experts, for example, Nqthm [52]. Proof checkers generally offer more expressive languages, but require significant manual input for theorem proving, for example, higher-order logic (HOL) [53]. Many tools fall somewhere in between, depending on language characteristics and proof methodology, for example, Eves [54,55] or PVS [56]. The goal of mechanized analysis may be either to prove the equivalence between different representations of the initial requirements or to establish properties that are considered critical for correct system behavior (safety, liveness, etc.).

Tools for analytic formal methods fall into two main categories: state exploration tools (model checkers) and deductive verifiers (automated theorem provers). Model checking [57] is an approach for formally verifying finite-state systems. Formalized requirements are expressed as temporal logic formulas, and efficient symbolic algorithms are used to process a model of the system and check whether the specification holds in that model. Widely known tools are VeriSoft [58], SMV [59], and SPIN [60]. Some deductive verifiers support inference in first-order logic (Larch [61]); others, such as PVS [62], are based on higher-order languages integrated with supporting tools and interactive theorem provers.

Today, descriptive methods are often augmented by facilities for mechanized analysis. Also, automated theorem proving and model checking approaches may be integrated in a single environment. For example, a BDD-based model checker can be used as a decision procedure in the PVS [63] theorem prover. In addition to the prover or proof checker, a key feature of analytic tools is the typechecker, which checks
specifications for semantic consistency, possibly adding semantic information to the internal representation built by the parser. If the type system of the specification language is not decidable, theorem proving may be required to establish the type consistency of a specification. An overview of verification tools and underlying methods can be found in Reference 64.

The most severe limitation to the deployment of formal methods is the need for mathematically sophisticated users, for example, with respect to the logical notation these methods use, which is often an insurmountable obstacle to the adoption of these methods in engineering disciplines. General-purpose decision procedures, as implemented in most of these methods, are not scalable to the size of industrial projects. Another obstacle to the application of deductive tools like PVS is the necessity to develop a mathematical theory formalized at a very detailed level to implement even very simple predicates.

Recently, formal methods have been successfully applied to specification and design languages widely accepted in the engineering community, such as MSC [65], SDL [51], or UML [66]. Several tool vendors participate in the OMEGA project aimed at the development of formal tools for the analysis and verification of design steps based on UML specifications [67]. SDL specifications can be checked by model checkers as well as by automated theorem provers: for example, the IF system from Verimag converts SDL to PROMELA as input to the SPIN model checker [68]. At Siemens, verification of the GSM protocol stack was conducted using the BDD-based model checker SVE [69]. An integrated framework for processing SDL specifications has been implemented based on the automated theorem prover ACL2 [70]. Ptk [71] provides semantic analysis of MSC diagrams and generates test scripts from such diagrams in a number of different languages including SDL, TTCN, and C. FatCat [72] locates situations of nondeterminacy in a set of MSC diagrams.

In the following, we give a cursory overview of some of the wider known tools supporting the application of formal methods to specifications. This survey reviews only tools that are freely available, at least for research use. A number of powerful commercial verification technologies have been developed, for example: ACL2 (Computational Logic, USA), ASCE (Adelard, UK), Atelier B (STERIA Méditerranée, France), B-Toolkit (B-Core, UK), CADENCE (Cadence, USA), Escher (Escher Technologies, UK), FDR (Formal Systems, UK), ProofPower (ICL, UK), Prover (Prover Technology, Sweden), TAU (Telelogic), Valiosys (Valiosys, France), and Zola (Imperial Software, UK). A more detailed survey of commercial tools aimed at formal verification is given in Reference 73.

The tools and environments surveyed provide only a sample of the wide variety of tools available. In particular, in the area of model checking a large number of implementations support the verification of specifications written in different notations and supporting different temporal logics. For example, Kronos [74] allows the modeling of real-time systems by timed automata, that is, automata extended with a finite set of real-valued clocks used to express timing constraints. It supports TCTL, an extension of temporal logic that allows quantitative temporal claims. UPPAAL [75] represents systems as networks of automata extended with clocks and data variables. These networks are compiled from a nondeterministic guarded command language with data types.
The VeriSoft [58] model checker explores the state space of systems composed of concurrent processes executing arbitrary code (written in any language) and searches for concurrency pathologies such as deadlock, livelock, and divergence, and for violations of user-specified assertions.

6.3.2.1 Descriptive Tools

6.3.2.1.1 Vienna Development Method [76]
The Vienna Development Method (VDM) is a model-oriented formal specification and design method based on discrete mathematics, originally developed at IBM's Vienna Laboratory. Tools to support formalization using VDM include parsers, typecheckers, proof support, animation, and test case generators. In VDM, a system is developed by first specifying it formally and proving that the specification is consistent, then iteratively refining and decomposing the specification, provided that each refinement satisfies the previous specification. This process continues until the implementation level is reached.

The VDM Specification Language (VDM-SL) was standardized by ISO in 1996 and is based on first-order logic with abstract data types. Specifications are written as constructive specifications of an abstract
data type, by defining a class of objects and a set of operations that act upon these objects. The model of a system or subsystem is then based on such an abstract data type. A number of primitive data types are provided in the language, along with facilities for user-defined types. VDM has been used extensively in Europe [48,77,78]. Tools have been developed to support formalization using VDM, for example, Mural [79], which aids formal reasoning via a proof assistant.

6.3.2.1.2 Z [49,80]
Z evolved from a loose notation for formal specifications to a standardized language with tool support provided by a variety of third parties. The formal specification notation was developed by the Programming Research Group at Oxford University. It is based on Zermelo–Fraenkel set theory and first-order predicate logic. Z is supported by graphical representations, parsers, typecheckers, pretty-printers, and a proof assistant implemented in HOL providing proof checkers as well as a full-fledged theorem prover. The standardization of Z through ISO solidified the tool base and enhanced interest in mechanized support.

The basic Z form is called a schema, which is used to introduce functions. Models are constructed by specifying a series of schemata using a state transition style. Several object-oriented extensions to Z have been proposed. Z has been used extensively in Europe (primarily in the United Kingdom) to write formal specifications for various industrial software development efforts and has resulted in two awards for technological achievement: for the IBM CICS project and for a specification of the IEEE standard for floating-point arithmetic. To leverage Z in embedded systems design, Reference 81 extended Z with temporal interval logic and automated reasoning support through Isabelle and the SMV model checker.

6.3.2.1.3 B [50]
Following the B method, initial requirements are represented as a set of abstract machines, for which an object-based approach is employed at all stages of development. B relies on a wide-spectrum notation to represent all levels of description, from specification through design to implementation. Once the requirements are specified, they can be checked for consistency, which for B means preservation of invariants. In addition, B supports checking the correctness of the refinement steps to design and implementation. B is supported through toolkits providing syntax checkers, type checkers, a specification animator, a proof-obligation generator, provers allowing different degrees of mechanization, and a rich set of coding tools. It also includes convenient facilities for documenting, cross-referencing, and reviewing specifications. The B method is popular in industry as well as in the academic community. Several international conferences on B have been conducted. An example of the use of B in embedded systems design is reported in Reference 82, which promotes the development of correct software for smart cards through translation of B specifications into embedded C code.

6.3.2.1.4 Rigorous Approach to Industrial Software Engineering [83,84]
The Rigorous Approach to Industrial Software Engineering (RAISE) is based on a development methodology that evolved from the VDM approach. Under the RAISE methodology, development steps are carefully organized and formally annotated using the RAISE specification language, a powerful wide-spectrum language for specifying operations and processes that allows derivations between levels of specifications.
It provides different styles of specification: model-oriented, algebraic, functional, imperative, and concurrent. The CORE requirements method is also provided as an approach for front-end analysis. Supporting tools provide a window-based editor, parser, typechecker, proof tools, a database, and translators to C and Ada. Derivations from one level to the next generate proof obligations. These obligations may be discharged using the proof tools, which are also used to perform validation (establishing system properties). Detailed descriptions of the development steps and the overall process are available for each tool. The final implementation step has been partially mechanized for common implementation languages.
6.3.2.1.5 Common Algebraic Specification Language [85]
The Common Algebraic Specification Language (CASL) was developed as a language for the formal specification of functional requirements and modular software design that subsumes many algebraic specification frameworks and also provides tool interoperability. CASL is a complex language with a complete formal semantics, comprising a family of formally defined specification languages meant to constitute a common framework for algebraic specification and development [86]. To make tool construction manageable, it allows for the reuse of existing tools, for the interoperability of tools developed at different sites, and for the construction of generic tools that can be used for several languages. The CASL Tool Set, CATS, combines a parser, a static checker, a pretty printer, and facilities for translation of CASL to a number of different theorem provers. Encoding eliminates subsorting and partiality, and thus allows reuse of existing theorem proving tools and term rewriting engines for CASL. Typical applications of a theorem prover in the context of CASL are checking semantic correctness (according to the model semantics) by discharging proof obligations that have been generated during the checking of static semantic constraints, and validating intended consequences, which can be added to a specification using annotations. This allows a check for consistency with informal requirements.

In the scope of embedded systems verification, the Universal Formal Methods (UniForM) Workbench has been deployed in the development of railway control and space systems [87]. This system aims to provide a basis for the interoperability of tools and the combination of languages, logics, and methodologies. It supports verification of basic CASL specifications encoded in Isabelle and the subsequent implementation of transformation rules for CASL to support correct development by transformation.

6.3.2.1.6 Software Cost Reduction [88]
Software Cost Reduction (SCR) is a formal method for modeling and validating system requirements. SCR models a system as a black box computing output data from the input data. System behavior is represented as a finite-state machine. SCR is based on tabular notations that are relatively easy to understand. To develop correct requirements with SCR, the user performs four types of activities supported by the SCR tools. First, a specification is developed in the SCR tabular notation using the specification editor. Second, the specification is automatically analyzed for violations of application-independent properties, such as nondeterminism and missing cases, using an extension of the semantic tableaux algorithm. Third, to validate the specification, the user may run scenarios (sequences of observable events) through the SCR simulator. Fourth, to check application-dependent properties, the user can apply the Spin model checker by translating the specification into Promela.

A toolset has been developed by the Naval Research Laboratory, including a specification editor, a simulator for symbolically executing the specification, and formal analysis tools for testing the specification for selected properties. SCR has been applied primarily to the development of embedded control systems, including the A-7E aircraft's operational flight program, a submarine communications system, and safety-critical components of two nuclear power plants [89].

6.3.2.1.7 EVES [55]
EVES is an integrated environment supporting the formal development of systems from requirements to code.
Additionally, it may be used for formal modeling and mathematical analysis. To date, EVES applications have primarily been in the realm of security-critical systems. EVES relies on the wide-spectrum language Verdi, ranging from a variant of classical set theory with a library mechanism for information hiding and abstraction to an imperative programming language. The EVES mathematics is based on ZFC set theory without the conventional distinction between terms and formulae. Supporting tools are a well-formedness checker, the integrated automated deduction system NEVER, a proof checker, reusable library framework, interpreter, and compiler. Development is treated as theory extension: each declaration extends the current theory with a set of symbols and axioms pertaining to those symbols. Proof obligations are associated with every declaration
to guarantee conservative extension. The EVES library is a repository of reusable concepts (e.g., a variant of the Z mathematical toolkit is included with EVES) and is the main support for scaling, information hiding, and abstraction. Library units are either specification units (axiomatic descriptions), model units (models or implementations of specifications), or freeze units (for saving work in progress).

6.3.2.2 Deductive Verifiers

6.3.2.2.1 Higher-Order Logic [53]
HOL is an environment for interactive theorem proving in higher-order logic, that is, predicate calculus with terms from the typed lambda calculus. HOL provides a parser, pretty-printer, and typechecker, as well as forward and goal-oriented theorem provers. HOL is interfaced to the MetaLanguage (ML), which allows the representation of terms and theorems of the logic, of proof strategies, and of logical theories.

The HOL system is an interactive mechanized proof assistant. It supports both forward and backward proofs. The forward proof style applies inference rules to existing theorems in order to obtain new theorems and eventually the desired goal. Backward or goal-oriented proofs start with the goal to be proven. Tactics are applied to the goal and subgoals until the goal is decomposed into simpler existing theorems.

HOL provides a general and expressive vehicle for reasoning about various classes of systems. Some of the applications of HOL include the specification and verification of compilers, microprocessors, interface units, and algorithms, and the formalization of process algebras, program refinement tools, and distributed algorithms. Initially, HOL was aimed at hardware specification and verification, but its application was later extended to many other domains. Since 1988, the annual meeting of the HOL community has evolved into a large international conference.

6.3.2.2.2 Isabelle [90]
Isabelle, developed at Cambridge University, is a generic theorem prover providing a high degree of automation and supporting a wide variety of built-in logics: many-sorted first-order logic (constructive and classical versions), higher-order logic, Zermelo–Fraenkel set theory, an extensional version of Martin-Löf's type theory, two versions of the Logic for Computable Functions, the classical first-order sequent calculus, and modal logic. New logics are introduced by specifying their syntax and inference rules. Proof procedures can be expressed using tactics. A generic simplifier performs rewriting by equality relations and handles conditional and permutative rewrite rules, performs automatic case splits, and extracts rewrite rules from context. A generic package supports classical reasoning in first-order logic, set theory, etc. The proof process is automated to allow long chains of proof steps, reasoning with and about equations, and proofs about facts of linear arithmetic.

Isabelle aims at the formalization of mathematical proofs. Some large mathematical theories have been formally verified and are available to the user. These include elementary number theory, analysis, and set theory. For example, Isabelle's Zermelo–Fraenkel set theory derives general theories of recursive functions and data structures (including mutually recursive trees and forests, infinite lists, and infinitely branching trees). Isabelle has been applied to formal verification as well as to reasoning about the correctness of computer hardware, software, and computer protocols.
Reference 91 applied Isabelle to prove the correctness of safety-critical embedded software: a higher-order logic implemented in Isabelle was used to model both the specification and the implementation of the initial requirements, and the problem of implementation correctness was reduced to a mathematical theorem to be proven.

6.3.2.2.3 PVS [56,62]
PVS provides an integrated environment for the development and analysis of formal specifications and is intended primarily for the formalization of requirements and design-level specifications, and for the rigorous analysis of difficult problems. It has been designed to benefit from the synergistic use of different formalisms in its unified architecture.
The PVS specification language is based on classical, typed higher-order logic with predicate subtypes, dependent typing, and abstract data types. The highly expressive language is tightly integrated with its proof system and allows automated reasoning about type dependencies. PVS offers a rich type system, strict typechecking, powerful automated deduction with integrated decision procedures for linear arithmetic and other useful domains, and a comprehensive support environment. PVS specifications are organized into parameterized theories that may contain assumptions, definitions, axioms, and theorems. Definitions are guaranteed to provide conservative extension. Libraries of proved specifications from a variety of domains are available.

The PVS prover supports a fully automated mode as well as an interactive mode. In the latter, the user chooses among various inference primitives (induction, quantifier reasoning, conditional rewriting, simplification using specific decision procedures, etc.). Automated proofs are based on user-defined strategies composed from inference primitives. Proofs yield scripts that can be edited and reused. Model-checking capabilities are integrated with the verification system and can be applied for the automated checking of temporal properties.

PVS has been applied to algorithms and architectures for fault-tolerant flight control systems, to problems in real-time system design, and to hardware verification. Reference 92 combined PVS with industrial, UML-based development. Similarly, the Boderc project at the Embedded Systems Institute aims to integrate UML-based software design for embedded systems into a common framework that is suitable for multidisciplinary system engineering.

6.3.2.2.4 Larch [61]
Larch is a first-order specification language supporting equational theories embedded in first-order logic. The Larch Prover (LP) is designed to treat equations as rewrite rules and to carry out other inference steps such as induction and proof by cases. The user may introduce operators and assertions about operators as part of the formalization process. The system comprises a parser, a typechecker, and a user-directed prover. LP works midway between proof checking and fully automatic theorem proving. Users may direct the proof process at a fairly high level. LP attempts to carry out routine steps in a proof automatically and to provide useful information about why proofs fail, but it is not designed to find difficult proofs automatically.

6.3.2.2.5 Nqthm [52]
Nqthm is a toolset based on the powerful heuristic Boyer–Moore theorem prover for a restricted logic (a variant of pure applicative Lisp). There is no explicit specification language; rather, one writes specifications directly in the Lisp-like language that encodes the quantifier-free, untyped logic. Recursion is the main technique for defining functions and, consequently, mathematical induction is the main technique for proving theorems. The system consists of a parser, a pretty-printer, a limited typechecker (the language is largely untyped), a theorem prover, and an animator. The highly automated prover can be driven by large databases of previously supplied (and proven) lemmas. The tool distribution comes with many examples of formalized and proved applications. For over a decade, the Nqthm series of provers has been used to formalize a wide variety of computing problems, including safety-critical algorithms, operating systems, compilers, security devices, microprocessors, and pure mathematics.
Two well-known industrial applications are a model of a Motorola digital signal processing (DSP) chip and the proof of correctness of the floating-point division algorithm for the AMD5K86 microprocessor.

6.3.2.2.6 Nuprl [93]
Nuprl was originally designed by Bates and Constable at Cornell University and has been expanded and improved over the past 15 years by a large group of students and research associates. Nuprl is a highly extensible open system that provides for the interactive creation of proofs, formulae, and terms in a typed language, a constructive type theory with extensible syntax. The Nuprl system supports higher-order logics and rich type theories. The logic and the proof systems are built on a highly regular untyped term structure,
a generalization of the lambda calculus, with mechanisms for the reduction of such terms. The style of the Nuprl logic is based on the stepwise refinement paradigm for problem solving, in which the system encourages the user to work backwards from goals to subgoals until one reaches what is known.

Nuprl provides a window-based interactive environment for editing, proof generation, and function evaluation. The system incorporates a sophisticated display mechanism that allows users to customize the display of terms. Based on structure editing, the system is free to display terms without regard to the parsing of syntax. The system also includes the functional programming language ML as its metalanguage; users extend the proof system by writing their own proof-generating programs (tactics) in ML. Since tactics invoke the primitive Nuprl inference rules, user extensions via tactics cannot corrupt system soundness. The system includes a library mechanism and is provided with a set of libraries supporting the basic types, including integers, lists, and Booleans. The system also provides an extensive collection of tactics.

The Nuprl system has been used as a research tool to solve open problems in constructive mathematics. It has been used in formal hardware verification, as a research tool in software engineering, and to teach mathematical logic to Cornell undergraduates. It is now being used to support parts of computer algebra and is linked to the Weyl computer algebra system.

6.3.2.3 State Exploration Tools

6.3.2.3.1 Symbolic Model Verifier (SMV) [59]
The SMV system is a tool for checking finite-state systems against specifications of properties. Its high-level description language supports modular hierarchical descriptions and the definition of reusable components. Properties are described in Computation Tree Logic (CTL), a propositional, branching-time temporal logic. It covers a rich class of properties including safety, liveness, fairness, and deadlock freedom. The SMV input language offers a set of basic data types consisting of bounded integer subranges and symbolic enumerated types, which can be used to construct static, structured types. SMV can handle both synchronous and asynchronous systems, and arbitrary safety and liveness properties. SMV uses a BDD-based symbolic model-checking algorithm to avoid explicitly enumerating the states of the model. With carefully tuned variable ordering, the BDD algorithm yields a system capable of verifying circuits with extremely large numbers of states.

The SMV system has been distributed widely and has been used to verify industrial-scale circuits and protocols, including the cache coherence protocol described in the IEEE Futurebus+ standard and the cache consistency protocol developed at Encore Computer Corporation for their Gigamax distributed multiprocessor. Formal verification of embedded systems using symbolic model checking with SMV has been demonstrated in Reference 94: a Petri net-based system model is translated into the SMV input language along with the specification of timing properties.

6.3.2.3.2 Spin [60,95,96]
Spin is a widely distributed software package that supports the formal verification of distributed systems. It was developed by the formal methods and verification group at Bell Laboratories. Spin relies on the high-level specification language PROMELA (Process MetaLanguage), a nondeterministic language based on Dijkstra's guarded command language notation and CSP.
PROMELA contains primitives for specifying asynchronous (buffered) message passing via channels with an arbitrary number of message parameters. It also allows for the specification of synchronous message passing systems (rendezvous) and mixed systems, using both synchronous and asynchronous communications. The language can model dynamically expanding and shrinking systems, as new processes and message channels can be created and deleted on the fly. Message channel identifiers can be passed in messages from one process to another. Correctness properties can be specified as standard system or process invariants (using assertions), or as general Linear Temporal Logic (LTL) requirements, either directly in the syntax of next time free LTL, or indirectly as Büchi Automata (expressed in PROMELA syntax as never claims). Spin can be used in three modes: for rapid prototyping with random, guided, or interactive simulation; as an exhaustive verifier,
capable of rigorously proving the validity of user-specified correctness requirements (using partial order reduction to optimize the search); and as a proof approximation system that can validate very large protocol systems with maximal coverage of the state space.

Spin has been applied to the verification of data transfer and bus protocols, controllers for reactive systems, distributed process scheduling algorithms, fault-tolerant systems, multiprocessor designs, local area network controllers, microkernel design, and many other applications. The tool checks the logical consistency of a specification. It reports deadlocks and unspecified receptions, and flags incompleteness, race conditions, and unwarranted assumptions about the relative speeds of processes.

6.3.2.3.3 COordination SPecification ANalysis (COSPAN) [97]
COSPAN is a general-purpose, rapid prototyping tool developed at AT&T that provides a theoretically seamless interface between an abstract model and its target implementation, thereby supporting top-down system development and analysis. It includes facilities for documentation, conformance testing, software maintenance, debugging, and statistical analysis, as well as libraries of abstract data types and reusable pretested components. The COSPAN input language, S/R (selection/resolution), belongs to the omega-regular languages, which are expressible as finite-state automata on infinite strings or behavioral sequences. COSPAN is based on homomorphic reduction and refinement of omega-automata, that is, the use of homomorphisms to relate two automata in a process based on successive refinement that guarantees that properties verified at one level of abstraction hold at all successive levels. Reduction of the state space is achieved by exploiting symmetries and modularity inherent in large, coordinating systems. Verification is framed as a language-containment problem: checking consists of determining whether the language of the system automaton is contained in the language of the specification automaton. Omega-automata are particularly well suited to expressing liveness properties, that is, events that must occur at some finite, but unbounded, time.

COSPAN has been used in the commercial development of both software and hardware systems: high-level models of several communications protocols, for example, the X.25 packet switching link layer protocol, the ITU file transfer and management protocol (FTAM), and AT&T's Datakit universal receiver protocol (URP) level C; verification of a custom VLSI chip implementing a packet layer protocol controller; and analysis and implementation of AT&T's Trunk Operations Provisioning Administration System (TOPAS).

6.3.2.3.4 MEIJE [98]
The MEIJE project at INRIA and the Ecole des Mines de Paris has long investigated concurrency theory and has implemented a wide range of tools to specify and verify both synchronous and asynchronous reactive systems. It uses Esterel, a language designed to specify and program synchronous reactive systems, and a graphical notation to describe labeled transition systems. The tools (graphical editors, model checkers, observer generation) operate on the internal structure of automata combined by synchronized product, which are generated either from the Esterel programs or from the graphical representations of these automata.
MEIJE supports both explicit representation of the automata, supporting model checking and compositional reduction of systems using bisimulation or hiding, and implicit representation of the automata, favoring verification through observers and forward search for the properties to verify. To deal with the large state spaces induced by realistic-sized specifications, the MEIJE tools provide various abstraction techniques, such as behavioral abstraction (which replaces a sequence of actions by a single abstract behavior), state compression and encoding in BDDs, and on-the-fly model checking. Observers can be either written directly in Esterel or generated automatically from temporal logic formulae.

6.3.2.3.5 CADP [99]
CADP, developed at INRIA and Verimag, is a toolbox for designing and verifying concurrent protocols specified in the ISO language LOTOS and for studying formal verification techniques. Systems are specified as networks of communicating automata synchronizing through rendezvous, or as labeled transition systems.
The CAESAR compiler translates the behavioral part of LOTOS specifications into a C program that can be simulated and tested, or into a labeled transition system to be verified by ALDEBARAN. The latter allows the comparison and reduction of labeled transition systems by equivalences of various strength (such as bisimulation, weak bisimulation, branching bisimulation, observational equivalence, tau bisimulation, or delay bisimulation). Diagnostic facilities provide the user with explanations when the tools fail to establish equivalence between two labeled transition systems. The OPEN environment provides a framework for developing verification algorithms in a modular way, and various tools are included: interactive simulators, deadlock detection, reachability analysis, path searching, an on-the-fly model checker to search for safety, liveness, and fairness properties, and a tool for generating test suites.

6.3.2.3.6 Murphi [100]
Murphi is a complete finite-state verification system that has been tested on extensive industrial-scale examples, including cache coherence protocols and memory models for commercially designed multiprocessors. The Murphi verification system consists of the Murphi compiler and the Murphi description language for finite-state asynchronous concurrent systems, which is loosely based on Chandy and Misra's Unity model and includes user-defined data types, procedures, and parameterized descriptions. A version for synchronous concurrent systems is under development. A Murphi description consists of constant and type declarations, variable declarations, rule definitions, start states, and a collection of invariants. The Murphi compiler takes a Murphi description and generates a C++ program that is compiled into a special-purpose verifier that checks for invariant violations, error statements, assertion violations, deadlock, and liveness. The verifier attempts to enumerate all possible states of the system, while the simulator explores a single path through the state space. Efficient encodings, including symmetry-based techniques, and effective hash-table strategies are used to alleviate state explosion.
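The rule-based style of such descriptions is easy to mimic. The sketch below is a toy illustration only (the encoding of rules as guard/action pairs and the example system are assumptions of the sketch, not Murphi's actual input language or semantics): rules are fired from every reachable state, and an invariant and deadlock freedom are checked along the way.

from collections import deque

def verify(start_states, rules, invariant):
    """Breadth-first enumeration of all reachable states.
    rules: list of (name, guard, action), with guard(s) -> bool and
    action(s) -> successor state; states are immutable (e.g., tuples).
    Returns a diagnostic pair, or None if the invariant always holds."""
    seen, queue = set(start_states), deque(start_states)
    while queue:
        s = queue.popleft()
        if not invariant(s):
            return ("invariant violated", s)
        fired = False
        for name, guard, action in rules:
            if guard(s):
                fired = True
                t = action(s)
                if t not in seen:
                    seen.add(t)
                    queue.append(t)
        if not fired:
            return ("deadlock", s)
    return None

# Toy two-counter system: states (x, y); the invariant is x + y <= 3.
rules = [
    ("inc_x", lambda s: s[0] < 2, lambda s: (s[0] + 1, s[1])),
    ("inc_y", lambda s: s[1] < 1, lambda s: (s[0], s[1] + 1)),
    ("reset", lambda s: s != (0, 0), lambda s: (0, 0)),
]
print(verify([(0, 0)], rules, lambda s: s[0] + s[1] <= 3))  # None: holds

A real verifier differs mainly in scale: Murphi adds the efficient state encodings, symmetry reductions, and hash-table strategies mentioned above precisely because this naive enumeration explodes on realistic models.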
6.4 Specifying and Verifying Embedded Systems

The problems of consistency and completeness of requirements, viewed as mathematical problems, are well known to be algorithmically unsolvable even for notations based solely on the first-order predicate calculus. To overcome these difficulties, we have studied the general form of the requirements used in specific subject domains and developed methods for proving sufficient conditions of consistency and completeness.

Each requirements specification defines a class of systems compatible with the requirements. All systems in this class are defined at least up to bisimulation, that is, systems with the same observable behavior are considered equal. However, systems often operate in the context of some environment; when the requirements describe the properties of the environment into which a system is inserted, the weaker notion of insertion equivalence is used to distinguish different systems. If the class of systems compatible with the requirements is not empty, we consider the requirements to be consistent. If the class contains only one system (up to bisimulation or insertion equivalence, respectively), the requirements are said to be complete.

To represent requirements, we distinguish between a class of logical requirement languages, representing behavior in logical form, a class of trace languages, representing behavior in the form of traces, and a class of automata network languages, representing behavior in terms of states and transitions. The latter are model-oriented [101], in that desired properties or behaviors are specified by giving a mathematical model that exhibits those properties. The disadvantage of model-oriented specifications is that they state what should be done or how something should be implemented, rather than the properties that are required. For property-oriented [101] languages, each description defines not a single model, but a class of models which satisfy the properties defined by the description. The properties expressed are properties of attributed transition systems representing environments and agents inserted into these environments. Only actions and the values of attributes are observable for inserted agents.
6.4.1 System Descriptions and Initial Requirements

The descriptions of the observable entities (attributes) of a system and the external environment, their types, and the parameters they depend on are captured in first-order predicate logic extended by types and certain predefined predicates such as equality or arithmetic inequality. The signature of the language contains a set of attribute names, where each attribute has a type which is defined using direct or axiomatic definitions. Operations defined on the sets of values of different types are used to construct value expressions (terms) from attribute symbols, variables, and constants. If an expression contains attributes, the value of this expression depends not only on the values of its variables, if any, but also on the current state of the environment. Consequently, it defines a function on the set of states of an environment. The language includes the temporal modalities always and sometimes. Logical statements define properties of the environment, characterize agent actions, or define abstract data types.

Initial requirements describe the initial state as a logical statement. If initial requirements are present, the requirements refer only to the states reachable from initial states, which are those states satisfying the initial requirements. To describe initial requirements, we use a temporal modality initially. Axioms about the system are introduced in the form

let <name>: <statement>;
6.4.2 Static Requirements

Static requirements define the change of attributes at any given moment or interval of time depending on the occurrence of events and the previous history of system behavior. Static requirements describe the local behavior of a system, that is, all possible transitions which may be taken from the current state if it satisfies the precondition, after the event forcing these transitions has happened. The general form of a static requirement is

req <name>: [<prefix>]([<precondition>] -> [after <event description>] <postcondition>);

The precondition is a predicate formula of first-order logic, true before the transition; the postcondition is a predicate formula of first-order logic, true after the transition. Both precondition and postcondition may refer to a set of attributes used in the system. Variables are typed and may occur in the precondition, event, and postcondition; they link the values of attributes before and after the transition. Only universal quantifiers are allowed in the quantifier prefix. All attributes occur free in the requirements, but if an attribute depends on parameters, the algebraic expressions substituted for the parameters may contain bound variables. If the current state of the environment satisfies the precondition, and allows the event, then after performing this action the new state of the system will satisfy the property expressed by the postcondition. This notation corresponds to Hoare-style triples.

Predicates are defined on sets of values, and predicate expressions can be constructed using predicate symbols and value expressions to express properties of states of environments (if they include attributes). The quantifiers forall and exists can be used, as well as the usual propositional connectives. The precondition is a logical statement without explicit temporal modalities and describes observable properties of a system state. It may include predefined temporal functionals that depend on the past behavior of a system or its attributes, for example, the duration functional dur: if P is a statement, then dur P denotes the time passed since the last moment when P became true; its value is a number (real for continuous time and an integer for discrete time).

The event denotes the cause of a transition or a change of attribute values. The simplest case of an event is an action. More complex examples are sequences of actions or finite behaviors (i.e., the event is an algebraic expression generated by actions, prefixing, nondeterministic choice, and sequential and parallel compositions). To describe histories we use the product P ∗ Q and the iteration It(P) over logical statements.
The postcondition is a logical statement denoting the property of attribute values after the transition defined by the static requirement has been completed. It can include neither temporal modalities nor functionals depending on the past.

As an example, consider a system which counts the number of people entering a room. The requirement for an action enter could be written as:

req Enter: Forall(n: int)((num = n) -> after(enter)(num = n + 1));

where num is a system attribute representing the number of people in the room. The variable n links the value of the attribute num before and after the transition caused by the enter action.
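Read operationally, such a requirement is a check on every transition of a candidate model. The following Python sketch is purely illustrative (the encoding of a model as a list of transitions is an assumption of the sketch): it validates the Enter requirement against an execution trace of the counting system.

def check_enter(transitions):
    """transitions: list of (state_before, action, state_after), where a
    state is a dict of attribute values. Encodes the Hoare-style triple
    Forall n: (num = n) -> after(enter) (num = n + 1)."""
    for before, action, after in transitions:
        if action == "enter":          # the event forcing the transition
            n = before["num"]          # the precondition binds n
            if after["num"] != n + 1:  # the postcondition must hold
                return False, (before, after)
    return True, None

# A short trace of the counting system:
trace = [({"num": 0}, "enter", {"num": 1}),
         ({"num": 1}, "enter", {"num": 2})]
print(check_enter(trace))  # (True, None)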
6.4.3 Dynamic Requirements

Dynamic requirements are arbitrary logical statements including temporal modalities and functionals. They should be consequences of the static requirements and, therefore, dynamic requirements are formulated as calls to the prover to establish a logical statement:

prove <statement>;

We use the temporal modalities always and sometimes, as well as the temporal functional dur and the Kleene operations (product and iteration) over logical statements. Temporal statements refer to properties of attributed transition systems with initial states. A logical statement without modalities describes a property of a state of a system, while temporal statements describe properties of all states reachable from admissible initial states: if P is a predicate formula, the formula always P means that any state reachable from an admissible initial state possesses property P. The formula sometimes P means that there exists a reachable state that possesses property P. The notion of reachability can be expressed in set-theoretical terms. Temporal modalities can be translated to first-order logic by introducing quantifiers on states and, consequently, it is possible to use first-order provers for proving properties of environments.

For synchronous systems (systems with explicit time assignments at each state) we introduce the functional (E time t) for an arbitrary expression E (a term or a statement) and an integer expression t, which denotes the value of E at time t. Then always and sometimes are defined as:

always P ⇔ (∀t ≥ 0)(P time t)
sometimes P ⇔ (∃t ≥ 0)(P time t)

The functional dur is defined in the following way [102]:

(dur P = s) time t ⇔ (t ≥ s) ∧ (∼(P) time (t − s)) ∧ (∀s′)((t ≥ s′ > t − s) → P time s′)

Statements which do not explicitly mention state or time are considered as referring to an arbitrary current state or an arbitrary current moment of time. Statements with Kleene operations refer to discrete time and are reduced to logical statements as follows:

(P1 ∗ P2 ∗ · · · ∗ Pn) time t ⇔ (P1 time (t − n + 1)) ∧ (P2 time (t − n + 2)) ∧ · · · ∧ (Pn time t)
It(P) time t ⇔ (∃s ≤ t)(∀s′)((t − s ≤ s′ ≤ t) → P time s′)
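For discrete time and finite traces, these definitions translate almost literally into executable checks. The small sketch below is illustrative only (it assumes a trace is simply a list of states indexed by time):

def always(trace, p):
    """always P over a finite discrete trace: P holds at every time t."""
    return all(p(s) for s in trace)

def sometimes(trace, p):
    """sometimes P: P holds at some time t."""
    return any(p(s) for s in trace)

def dur(trace, p, t):
    """dur P at time t: how long P has held continuously up to time t,
    mirroring the definition above. (If P has held since time 0, this
    returns t + 1, since there is no earlier instant at which P failed.)"""
    s = 0
    while s <= t and p(trace[t - s]):
        s += 1
    return s

# Cmg becomes true at time 1 and stays true:
trace = [{"Cmg": False}, {"Cmg": True}, {"Cmg": True}, {"Cmg": True}]
print(dur(trace, lambda st: st["Cmg"], 3))  # 3: Cmg has held for three ticks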
6.4.4 Example: Railroad Crossing Problem

The railroad crossing problem is a well-known benchmark to assess the expressiveness of development techniques for interactive systems. We illustrate the description of a synchronous system (in discrete time)
relying on duration functionals. The problem is to develop a control device for a railroad crossing so that safety and liveness conditions are satisfied. The system has three components, as shown in Figure 6.4. The n-track railroad has the following observable attributes: InCr is a Boolean variable equal to 1 if a train is at the crossing; Cmg(i) is a Boolean variable equal to 1 if a train is coming on track number i. At the moment this attribute becomes equal to 1, the time left until the train reaches the crossing is not less than d_min, and the attribute remains 1 until the train reaches the crossing. Cmg(i) is an input signal to the controller, which has a single output signal DirOp. When DirOp equals 1, the gate starts opening, and when it becomes 0, the gate starts closing. The attribute gate shows the position of the gate: it is equal to opened when the gate is completely open and to closed when it is completely closed. The time taken for the gate to open is d_open; the time taken to close is d_close.

The requirements text below omits the straightforward static requirements. The dynamic properties of the system are safety and liveness. Safety means that when the train is at the crossing, the gate is closed. Liveness means that the gate will open when the train is at a safe distance (Code 6.1).
FIGURE 6.4 Railroad crossing problem. (The n-track railroad supplies the signals Cmg and InCr; the controller drives the gate through the signal DirOp.)
Code 6.1

parameters( d_min, d_close, d_open, WT );
attributes(n:int)( InCr:bool, Cmg(n):bool, DirOp:bool, gate );

let C1:(d_min>d_close);
let C2:(d_close>0);
let Duration Theorem:
    Forall(x,d)( always(dur Cmg(x) > d -> ˜(DirOp)) ->
                 always(dur ˜(DirOp) > dur Cmg(x)+(-1)*(d+1)) );

/* ------------- Environment spec ------------------------ */
let CrCm:    always(InCr -> Exist x (dur Cmg(x) > d_min));
let OpnOpnd: always( dur DirOp > d_open -> (gate=opened));
let ClsClsd: always( dur ˜(DirOp) > d_close -> (gate=closed));

/* ------------- Controller spec ------------------------- */
let Contr1: always(Exist x (dur Cmg(x) > WT) -> ˜(DirOp));
let Contr2: always(Forall x (WT >= dur Cmg(x)) -> DirOp );

/* ------------- Safety and Liveness ---------------------- */
let(WT=d_min+(-1)*d_close);
prove Safety:   always(InCr -> (gate=closed));
prove Liveness: always( Forall x ((WT > dur Cmg x) -> (gate=opened)));
Note that the Duration Theorem is assumed in the requirements in order to shorten the proofs of safety and liveness.
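The controller logic itself is tiny: with WT = d_min - d_close, the close order ˜(DirOp) is issued as soon as some train has been coming for more than WT ticks, which still leaves on the order of d_close ticks to shut the gate before the earliest possible arrival. A minimal discrete-time simulation of this behavior (purely illustrative; the parameter values and the simplified gate dynamics are assumptions of the sketch) exercises the safety property on a sample scenario:

D_MIN, D_CLOSE, D_OPEN = 10, 4, 3
WT = D_MIN - D_CLOSE                   # as fixed in Code 6.1

def run(arrivals, horizon):
    """arrivals: ticks at which trains start 'coming'; each train reaches
    the crossing exactly D_MIN ticks later and occupies it for 2 ticks."""
    dur_cmg = 0      # how long some train has been coming (dur Cmg)
    dur_close = 0    # how long the close order has held (dur ~DirOp)
    gate = "opened"
    for t in range(horizon):
        coming = any(a <= t < a + D_MIN + 2 for a in arrivals)
        in_cr = any(a + D_MIN <= t < a + D_MIN + 2 for a in arrivals)
        dur_cmg = dur_cmg + 1 if coming else 0
        dur_close = 0 if dur_cmg <= WT else dur_close + 1   # Contr1/Contr2
        if dur_close > D_CLOSE:
            gate = "closed"                                 # ClsClsd
        if dur_close == 0 and not coming:
            gate = "opened"                                 # simplified OpnOpnd
        assert not in_cr or gate == "closed", f"safety violated at t={t}"
    return "safety holds on this scenario"

print(run(arrivals=[5], horizon=30))

Unlike the prove directives in Code 6.1, such a simulation only tests particular scenarios; establishing Safety and Liveness for all behaviors is exactly what the deductive treatment of the duration formulas above provides.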
6.4.5 Requirement Specifications

The example in Section 6.4.4 is rather simple in its number of requirements. Requirement specifications used in practice to describe embedded systems are typically much more complex. They may consist of hundreds or thousands of static requirements and a large domain description given through attributes and parameters. Each requirement is usually simple, but taken together they may describe complex behavior and contain inconsistencies or be incomplete.

We use attributed transition systems to describe the requirements for embedded systems. The formal specification of requirements consists of the environment description, the description of common system properties in the form of axioms, the insertion function defined by static requirements, and the intended properties of the system as a whole defined as dynamic requirements.

A typed list of system parameters and a typed list of system attributes are used to describe the structure of the environment. The parameters of the system are variables that influence the behavior of the environment; they can change their values from one configuration of the system to another, but they never change their values during the execution of the system. Examples of system parameters are the set of tasks for an embedded operating system, the bus threshold for a device controller, etc. System attributes are variables that differ between the observable states of the environment. Attributes may change their values during runtime. Examples of attributes are the queue of tasks which are ready to be executed by the operating system, or the current data packet for a device controller.

As an example, we consider (in simplified form) several fragments of the formalized requirements for an embedded operating system for automotive electronics, OSEK [103]. The structure of its environment is described by a typed list of system parameters and a typed list of system attributes (Code 6.2).
Code 6.2

parameters (
    tasks: Set of name,
    resources: Set of name );
attributes (
    suspended: Set of name,
    ready: Set of name,
    running: name );
The operating system (environment) and the executing tasks (agents) interact via service calls. The list of actions contains the names of the services provided by the system, together with their parameters, if any (Code 6.3). Common system properties are defined as propositions in first-order logic extended with temporal modalities. For example, consider the following requirement: "the length of the queue of suspended tasks can never be greater than the number of defined tasks." We formalize this requirement as shown in Code 6.4. To define the transitions of the system when processing the request for a service, we use the Hoare-style triple notation defined above (Code 6.5).
Code 6.3

actions(a: name) ( Activate a, Terminate, Schedule );
Code 6.4

Let SuspendedLengthReq:
  Always ((length(suspended) < |tasks|) |/ (length(suspended) = |tasks|));
Code 6.5

req Activate1:
  Forall (a:name, s: Set of name, r: Set of name) (
    ( (suspended = s) & (ready = r) & (a in s) )
    -> after (Activate a)
       ( (suspended = (s setminus a)) & (ready = (r union a)) ));
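Static consistency and completeness (discussed in Section 6.4.5.1 below) would be checked against the full set of such rules for Activate. For illustration only, a hypothetical companion requirement covering the complementary precondition, in which activating a task that is not suspended raises an error, could be written in the same notation (the attribute error and this exact error behavior are our assumption, not part of the OSEK fragment above):

req Activate2:
  Forall (a:name, s: Set of name) (
    ( (suspended = s) & ~(a in s) )
    -> after (Activate a) (error = 1) );

Together, the preconditions of Activate1 and Activate2 are nonintersecting and their disjunction is valid, which is exactly the consistency and completeness condition for the action Activate a.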
The insertion function expressed by the rule in Code 6.5 is sequential, in that only one running task can execute at a time; all other tasks are in a suspended or ready state. A task becomes running as a result of performing a Schedule action; it is selected from a queue of ready tasks ordered by priorities. Agents can change the behavior of the environment by service requests. The interaction between the environment and the agents is defined by an insertion function, which computes the new behavior of the environment with inserted agents. The part of the description of requirements specific to sequential environments is the definition of the interaction of agents and environments, where this interaction is described by the insertion function. The most straightforward way to define this function is through interactive requirements: an action is allowed to be processed if and only if the current state of the environment matches one of the preconditions for service requests. This is denoted as E-(act)->E', intuitively meaning that the environment E allows the action act and, if it is processed, the environment will be equal to E'. The agent (the composition of all agents) interacting with the environment requests the service act if and only if it transits from its current state u into a state u' with action act, written u-(act)->u'. The composition of environment and agent is denoted env(E,u). To define the transitions of env(E,u) we use interactive rules (Code 6.6).
Code 6.6

req ActivateInteract1:
  Forall (E: env, E':env, u:agent, u':agent, a:name)(
    ( (E -(Activate a)-> E') & (u -(Activate a)-> u') )
    -> ( env(E,u) -(Activate a)-> env(E', (Schedule;u')) ));
req ActivateInteract2:
  Forall (E: env, E':env, u:agent, u':agent, a:name)(
    ( ~(E -(Activate a)-> E') & (u -(Activate a)-> u') )
    -> ( (E'.error = 1) & (env(E,u) -(Activate a)-> env(E', bot)) ));
Intuitively, the first rule means that when the environment allows the action Activate a and the agent requests the action Activate a, then the composition env(E,u) of agent and environment transitions with the action Activate a to the state env(E',(Schedule;u')); that is, at the next step the environment will be equal to E' and the current agent will be (Schedule;u').

6.4.5.1 Requirements for Sequential Environments

This class of models includes such products as single-processor operating systems and single-client devices. The definitive characteristic of such systems is that at any moment of time only one service request can be processed by the environment. Agents request services from the environment; they are defined by their behavior. The environment and the agents interact only through service requests, which determines the level of abstraction used in the formal definition of the behavior of agents. The class of insertion functions used for the description of sequential systems is broader than the insertion functions discussed earlier. An inserted agent can start its activity before agents inserted earlier terminate. The active agent can be selected by the environment using various criteria, such as priority or other static or dynamic characteristics. To compare agent behaviors, in some cases a look-ahead insertion may be used. Usually, sequential environments are deterministic systems, and the static requirements should be consistent so as to define deterministic transitions. The consistency requirement reduces to the following condition: the preconditions of each pair of static requirements referring to the same action must be nonintersecting. In other words, for arbitrary values of attributes, at least one of the two requirements must have a false precondition. Completeness can also be checked for the set of all static requirements that refer to the same action: every such set must satisfy the condition that, for arbitrary values of attributes, at least one of the requirements is applicable with a true precondition (these conditions are stated formally after the next paragraph).

6.4.5.2 Requirements for Parallel Environments

A parallel environment is an environment whose inserted agents work in parallel and independently. Agents are considered as transition systems. An agent can transition from one state to another by performing an action at any time when the environment accepts this action. Once the agent has completed the action, it transitions into a new state and, in addition, causes a transition of the environment. As an example, consider the modeling of device interaction protocols. Devices are independent and connected through the environment; they interact by sending and receiving data packets. The protocol is the algorithm used by a device to interact with the other components. Such a device is an agent in the parallel environment. It is represented as a transition system that can cause transitions between states by one of two actions: sending or receiving a packet, the packet being a parameter of these actions. We formalize such requirements using the notation of Hoare-style triples. Asynchronous parallel environments are highly nondeterministic. Such specifications are easily expressed in sequence diagrams or message sequence charts, as shown in Figure 6.5. The preconditions and postconditions are the conditions and states on the message sequence diagram, while the actions represent the message arrows and tasks shown on the diagram.
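The consistency and completeness conditions stated above for asynchronous environments can be written compactly. If the static requirements referring to an action $a$ have the form $P_i \rightarrow \mathrm{after}(a)\,Q_i$, $i = 1, \ldots, n$, then over all attribute valuations $\bar v$:

$$\text{consistency:}\;\; \models \neg\big(P_i(\bar v) \wedge P_j(\bar v)\big)\ \text{for all}\ i \neq j, \qquad \text{completeness:}\;\; \models \bigvee_{i=1}^{n} P_i(\bar v).$$

(A weaker variant of the consistency condition, used in Section 6.4.7, also accepts intersecting preconditions whose postconditions are equivalent.)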
6.4.5.3 Requirements for Synchronous Agents

As an example of a synchronous system, consider a processor with a bus and its time-dependent behavior. The processor interacts with the bus through signals which appear at every bus cycle (discrete time step). Interaction protocols in the processor-bus system are defined by signal behavior. Every signal can have a value of either 0 or 1. After some event, every signal switches to one of these values. For every signal of the system there is a history describing the conditions of signal switching. Such conditions are called the assertion condition (when the signal switches to 1) and the deassertion condition (when the signal switches to 0). Formally, a history of signals is a sequence of conjunctions of signals. The situation where signal S1 equals 1 and signal S2 equals 0 at moment tn, and signal S1 equals 0 and signal S2 equals 1 at moment tn+1, can be described in the following way: (S1 & ∼S2) ∗ (∼S1 & S2)
FIGURE 6.5 Sample MSC diagram (an SDGC group-call paging exchange between ACG and DAP instances over CCCH(k), with SDGC_Call, SDGC_Page_Request, SDGC_Paging_Request, SDGC_Page_Request_Type_1, SDGC_Page_Response_Type_1, and SDGC_Complete messages).
Let signal S3 have as its assertion condition that it will be equal to 1 after the above history of signals S1 and S2. In the triple notation, this fact can be described as (S1 & ∼S2) ∗ (∼S1 & S2) -> after(1) S3. This condition can be reflected on a wave (or timing) diagram (Figure 6.6). The consistency condition is fulfilled if no signal q is switched to both 1 and 0 in the same cycle. In other words, if there are two requirements P -> q and P' -> ∼q, then the preconditions P and P' cannot be true simultaneously. Static requirements for synchronous systems can use Kleene expressions over conditions and duration functions with numeric inequalities in preconditions. These requirements are converted into a standard form with logic statements relating adjacent time intervals.
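For illustration, a hypothetical pair of assertion/deassertion rules for S3 in this style (our own schematic example in the notation of the requirements language, not taken from a real protocol) might read:

req S3 assert:   ((S1 & ~(S2)) * (~(S1) & S2)) -> after(1) S3;
req S3 deassert: (~(S1) & ~(S2)) -> after(1) ~(S3);

This pair is statically consistent, because no history can end in a cycle satisfying both (∼S1 & S2) and (∼S1 & ∼S2) at once.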
FIGURE 6.6 Sample wave diagram (signals S1, S2, and S3 over cycles t1, t2, and t3).
6.4.6 Reasoning about Embedded Systems

The theory of agents and environments has been implemented in the system 3CR [104]. The kernel of our system [105] consists of a simulator for a generic Action Language (AL) [10,11] used for the description of system behaviors, services for the automatic exploration of the behavior tree of a system, and a theorem prover for first-order predicate logic enriched with a theory of linear equations and inequalities. It provides the following facilities supporting the development, verification, and validation of requirements for embedded systems:

• Prove the internal consistency and completeness of the static requirements of a system.
• Prove dynamic properties of the system defined by static requirements, including safety, liveness, and integrity conditions.
• Translate systems described in standard engineering languages (e.g., MSC, SDL, or wave diagrams) into the first-order format described earlier, and simulate these models in user-defined environments.
• Generate test suites for a system defined by verified requirements specifications, and validate implementations of the system against these test cases.

These facilities can be used in automated as well as in interactive mode. To determine the consistency and completeness of requirements for interactive systems, we rely on the theory of interaction of agents and environments as the underlying formal machinery.

6.4.6.1 Algebraic Programming

The mathematical models described in Section 6.2 can be made more concrete by imposing structure on the state space of transition systems. A universal approach is to consider an algebraic structure on the set of states of a system. Then states are represented by algebraic expressions, and transitions can conveniently be defined by (conditional) rewriting rules. A combination of conditional rewriting rules with a congruence on the set of algebraic expressions can be defined in terms of rewriting logic [32]. Most modern rewriting techniques are considered primarily in the context of equational theories but could also be applied to first-order or higher-order, clausal or nonclausal theorem proving. The main disadvantage of computations with such systems is their relatively weak performance. For instance, rewriting modulo associativity and commutativity (AC-matching) is NP-complete. Consequently, these systems are usually not powerful enough when "real-life" problems are considered. Our environment [105] supports reasoning in noncanonical rewriting systems. It is possible to combine arbitrary systems of rewriting rules with different rewrite strategies. The equivalence relation (basic congruence) on a set of algebraic expressions is introduced by means of interpreters for operations
which define a canonical form. The primary strategy of rewriting is one-step syntactic rewriting with post-canonization, that is, reducing the rewritten node to this canonical form. All other strategies are combinations of the primary strategy with different traversals of the tree representing a term's structure. Rewrite strategies can be chosen from a library of strategies or written as procedures or functions. The generic AL [11,106] is used for the syntactic representation of agents as programs and is based on the behavior algebra defined in Section 6.2. The main syntactic constructs of AL are prefixing, nondeterministic choice, sequential composition, and parallel composition. Actions and procedure calls are primitive statements. The language provides the standard termination constants (successful termination, divergence, deadlock). The semantics of this language is parameterized by an intensional semantics, defined through an unfolding function for procedure calls, and an interaction semantics, defined by the insertion function of the environment into which the program will be inserted. The intensional semantics and the interaction semantics are defined as systems of rewriting rules. The intensional semantics of an AL program is an agent which is obtained by unfolding procedure calls in the program and defining transitions on the set of program states. It is defined independently of the environment, up to bisimulation, by means of rewriting rules for the unfolding function (unfolding rules). The left-hand side of an unfolding rule is an expression representing a procedure call. The right-hand side of an unfolding rule is an AL program which may be unfolded further, generating more and more exact approximations of the behavior under recursive computation. The only built-in compositions of AL are prefixing and nondeterministic choice. The unfolding of parallel and sequential compositions is flexible and can be adjusted by the user. Alternatives for parallel composition are defined by the choice of the combination operator. For example, when the combination of arbitrary actions is the impossible action, parallel composition reduces to interleaving. On the other hand, excluding interleaving from the unfolding rules defines parallel composition as synchronization at each step (similar to handshaking in Milner's π-calculus). The interaction semantics of AL programs is defined through the insertion function. Programs are considered up to insertion equivalence. Rewriting rules which define the insertion function (insertion rules) have the following structure: the left-hand side of an insertion rule is the state or behavior of the environment with a sequence of agents inserted into this environment (represented as AL programs). The right-hand side is a program in AL augmented by "calls" to the insertion function, denoted env(E, u), where E is an environment state expression and u is an AL program. To compute the interaction semantics of an AL program, one uses both the unfolding rules for procedure calls and the insertion rules to unfold calls to the insertion function. In this approach, the environment is considered a semantic notion and is not explicitly included in the agent. Instead, the meaning of an agent is defined as a transformation of an environment, which corresponds to inserting the agent into its environment. When the agent is inserted into the environment, the environment changes, and this change is considered to be a property of the agent described.
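For example, with a combination operator $\times$ on actions, a one-step unfolding of parallel composition might take the form of the familiar expansion law (a sketch only; the exact rules are user-adjustable, as described above):

$$a.u \parallel b.v \;=\; a.(u \parallel b.v) \;+\; b.(a.u \parallel v) \;+\; (a \times b).(u \parallel v)$$

When $a \times b$ is the impossible action for all $a$ and $b$, the last summand disappears and parallel composition reduces to interleaving; dropping the first two summands instead yields stepwise synchronization.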
6.4.6.2 Simulating Transition Systems

AL has been implemented by means of a simulator [10,106,107], an interactive program which generates all histories of an environment with inserted agents and which can explore the behavior of this environment step by step, starting from any possible initial state, with branching at nondeterministic points and backtracking to previous states. The simulator permits forward and backward moves along histories; in automatic mode it can search for states satisfying predefined properties (deadlock, successful termination, etc.) or properties defined by the user. The generation of histories may be user-guided and thus permits the examination of different histories. The user can retrieve information about the current state of a system and change this state by inserting new agents using different insertion functions. Arbitrary data structures can be used for the representation of the states of an environment and the environment actions. The set of states of an environment is closed under the insertion function e[u], which is denoted in the simulator as env(e, u). The agent u is represented by an AL expression. Arbitrary algebraic data structures can be used for the representation of agent actions and procedure calls.
The core of the simulator is specified as a nondeterministic transition system that functions as an environment for the system model. Actions of the simulating environment are expressed by means of calls for services of the simulator. Local services define one-step transitions of the simulated system. Global services permit the user to compute different properties of the behavior of a simulated system. The user can formulate a property of a state by means of a rewriting rule system or some other predicate function, and the simulator will search for the existence of a state satisfying the property among the states reachable from the current state. Examples of such properties are deadlock, successful termination, undefined states, and so on.

6.4.6.3 Theorem Proving

The proof system [108] is based on the interactive evidence algorithm [109–111], a Gentzen-style calculus with unification used for first-order reasoning. The interactive evidence algorithm is a sequent calculus and relies on the construction of an auxiliary goal as the main inference step, which allows easy control of the direction of the proof search at each step through the choice of auxiliary goals. This algorithm can be represented as a combination of two calculi: inference in the calculus of auxiliary goals is used as a single-step inference in the calculus of conditional sequents. In a sense, the interactive evidence algorithm generalizes logic programming: in the latter, auxiliary goals are extracted from Horn clauses, while in the interactive evidence algorithm they are extracted from arbitrary formulae with quantifiers (which need not be skolemized). The interactive evidence algorithm is implemented as a nondeterministic algebraic program extracted from the calculus, based on the simulator for AL. This program is inserted as an agent into a control environment which searches for a proof, organizes interaction with the user and the knowledge bases, and implements strategies and heuristics to speed up the proof search. The control environment contains the assumptions of a conditional sequent, so local information can be combined with other information taken from knowledge base agents and used in search strategies. The prover is invoked by the function prove, implemented as a simple recursive procedure with backtracking which takes an initial conditional sequent as its argument and searches for a path from the initial statement to axioms; this path is then converted into a proof. The inference search is nondeterministic owing to the disjunction rules. Predicates are considered up to the equivalence defined by all Boolean equations except distributivity. A function Can, defined by a system of rewriting rules, reduces predicate formulae as well as propositional formulae to a normal form. Predicate formulae are considered up to renaming of bound variables and the equations ¬(∃x)p = (∀x)¬p and ¬(∀x)p = (∃x)¬p. Associativity, commutativity, and idempotence of conjunction and disjunction, as well as the laws of contradiction, excluded middle, and the laws for propositional constants, are used implicitly in these equations.
6.4.7 Consistency and Completeness

The notion of consistency of requirements is, in general, equivalent to the existence of an implementation or model of a system that satisfies these requirements. Completeness means that this model is unique up to some predefined equivalence. The traditional way of proving consistency is to develop a model coded in some programming or simulation language and to prove that this code is correct with respect to the requirements. However, direct proving of correctness is difficult because it demands computing the necessary invariant conditions for the states of a program. Another method is generating the space of all possible states of a system reachable from the initial states and checking whether the dynamic requirements are satisfied in each state. This approach is known as model checking, and many systems which support model checking have been developed. Unfortunately, model checking is realistic only if the state space is finite and all reachable states can be generated in a reasonable amount of time. Our approach proves consistency and completeness of requirements directly, without developing a model or implementation of the system. We prove that the static requirements define the system completely and that the dynamic properties of consistent requirements are all logical consequences of the static
requirements. Based on this assumption, one can define an executable specification using only the static requirements and then execute it using a simulator. We distinguish between the consistency and completeness of static requirements and dynamic consistency. The first is defined in terms of static requirements only and reflects the property that the system responds deterministically to actions of the environment. For example, a query from a client to a server, as the action of an inserted agent, can be selected nondeterministically, but the response must be defined by static requirements selected in a deterministic manner. When all dynamic requirements are consequences of the static requirements, we say the system is dynamically consistent. Sufficient conditions for the consistency of static requirements depend on the subject domain and on implicit assumptions about the change of observable attributes. For example, for the classes of asynchronous systems considered previously, the condition for internal consistency is simply that the conjunction of two preconditions corresponding to different rules with the same action is not satisfiable. Completeness means that the disjunction of all preconditions of all rules corresponding to the same action is generally valid. For synchronous systems, on the other hand, it is the nonsatisfiability of two preconditions corresponding to rules which define conflicting changes to the same (usually binary) attribute. The incompleteness of static requirements usually is not harmful; it merely postpones design decisions to the implementation stage. However, it is harmful if there exists an implementation which meets the static requirements but does not meet the dynamic requirements. Dynamic consistency of requirements (the invariance of dynamic conditions expressed using the temporal modality "always") can be proven inductively using the structure of the static requirements. Consistency checking proceeds by formulating and proving consistency conditions for every pair of static requirements with the same starting event. Every such pair of requirements must satisfy the condition that, for arbitrary values of attributes, either at least one of the two requirements has a false precondition or the postconditions are equivalent. Completeness of requirements means that there exists exactly one model of the requirements up to some equivalence. We distinguish two main cases, depending on the focus of the requirements specification. If the specification defines the environment, the equivalence of environments needs to be considered. Otherwise, if an agent is defined by the requirements, the equivalence of agents needs to be examined. Let e and e' be two environment states (of the same or different environments). We say that e and e' are equivalent if for an arbitrary agent u the states e[u] and e'[u] are bisimilar (from the equivalence of two environment states it follows that for insertion-equivalent agents u and u', e[u] and e'[u'] are also bisimilar). If there are restrictions on the possible behaviors of the agents, we consider admissible agents rather than arbitrary agents. Let E and E' be two environments (each being a set of environment states and an insertion function). These environments are equivalent if each state of one of the environments is equivalent to some state of the other.
If the set of requirements defines an agent for a given environment E, logical completeness (with respect to agent definition) means that all agents satisfying these requirements are insertion equivalent with respect to the environment E; that is, if u and u' satisfy the requirements, then for all e ∈ E, e[u] ∼E e[u']. We check completeness for the set of all static requirements that refer to the same starting event. Every such set of requirements must satisfy the condition that, for arbitrary values of attributes, at least one among the requirements is applicable with a true precondition.
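In symbols, writing $\sim$ for bisimilarity, the two equivalences used in this section are

$$e \sim e' \;\iff\; \forall u.\; e[u] \sim e'[u], \qquad\qquad u \sim_E u' \;\iff\; \forall e \in E.\; e[u] \sim e[u'],$$

with $u$ ranging over admissible agents when the agents' behaviors are restricted.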
6.5 Examples and Results

Figure 6.7 exhibits the design process using the 3CR [104] tool set. The requirements for a system are represented as input text written in the formal requirements language or translated from engineering notations such as SDL or MSC. Static requirements are sent to the checker, which establishes their consistency and completeness. The checker analyzes a requirement statement and generates a logical
FIGURE 6.7 Design process (static and dynamic requirements pass through the checker and the prover; an executable specification is generated from the verified requirements and, together with the behavior, environment, and structure models, is used by the simulator for test generation and validation).
statement expressing the consistency of the given requirement with the other requirements already accepted, as well as a statement expressing completeness after all static requirements have been accepted. This statement is then submitted to the prover in order to search for a proof. The prover may return one of three answers: proved, not proved, or unknown. In the case where consistency could not be proven, one of the following types of inconsistency is considered.

• Inconsistent formalization. This type of inconsistency can be eliminated through improved formalization if the postconditions are consistent for the states where all preconditions are true. Splitting the requirements can help.
• Inconsistency resulting from incompleteness. This is the case when two requirements are consistent, but nonintersection of the preconditions cannot be proved because complete knowledge of the subject domain is not available. A discussion with experts or with the authors of the requirements is recommended.
• Inconsistency. The preconditions intersect, but the postconditions are inconsistent after performing an action. This is a sign of a possible error, which can be corrected only by changing the requirements. If the intersection is not reachable, the inconsistency will not actually arise; in this case, a dynamic property can be formulated and proven.

Dynamic properties are checked after accepting all static requirements. These are logical statements expressing properties of a system in terms of first-order predicate calculus, extended by temporal modalities as well as higher-order functions and types. If an inductive proof is needed, all static requirements are used for generating lemmas to prove the inductive step. After checking the consistency and completeness of the static requirements, the requirements are used for the automatic generation of an executable specification of a system satisfying the static requirements. At this point, the dynamic requirements have already been proven to be consequences of the static requirements, so the system also satisfies the dynamic requirements. The next step of system design would be the use of
the obtained information in the next stages of development. For example, executable specifications can be used for generating complete test cases for system test.
6.5.1 Example: Embedded Operating System

In this section, we shall describe a general model which could be used for developing formal requirements for embedded operating systems such as OSEK [103]. The requirements for the OSEK operating system can serve as an example of the application of the general methodology of checking consistency. These requirements comprise two documents: OSEK Concept and OSEK API. The first document contains an informal description of the conformance classes (BCC1, BCC2, ECC1, ECC2, ECC3) and requirements on the main services of the system. The second document refines the requirements in terms of C function headers and the types of service calls. Two kinds of requirements can be distinguished in these documents. Static requirements define permanent properties of the operating system, which must be true for arbitrary states and any single-step transition. These requirements refer to the structure of operating system states and their changes in response to the performance of services. Dynamic requirements state global system properties such as the absence of deadlocks or priority inversions. Using the theory of interaction of agents and environments as the formalism for the description of OSEK, an environment consists of a processor (or processor network), an operating system, and the external world, which interacts with the environment via some kind of communication network; agents are tasks interacting with the operating system and communication network via services. We use nondeterministic agents over a set of actions representing operating system services as models of tasks. The states of the environment are characterized by a set of observable attributes, with actions corresponding to the actions of task agents. Each attribute defines a partial function from the set E of environment states to the set of values D. E is considered as an algebra with a set of (internal or external) operations defined on it. The domain D should be defined as abstractly as possible, for example, by means of set-theoretic constructions (functions, relations, powersets) over abstract data types represented as initial algebras, in order to be as independent as possible of the details of implementation when formulating the requirements specifications. In monoprocessor systems only one agent is in the active state, that is, holding the processor resource. If e is a state of the environment with no active agents, then in the representation e[u] of the environment the state u is the state of an active agent. All other agents are in nonactive states (suspended and ready states for OSEK) and are included into the state e as parts of the values of attributes. The properties of an environment can be divided into static and dynamic properties. Static properties define one-step transitions of a system; dynamic properties define properties of the total system. The general form of a rule for transitions is:

$$\frac{e \xrightarrow{c} e' \qquad u \xrightarrow{a} u'}{e[u] \xrightarrow{d} e'[u']}$$

In this rule d, e', and u' depend on parameters appearing in the premises; usually a = c = d (synchronization), albeit there can be special cases such as hiding, scheduling points, or interrupt routines.
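In particular, in the common synchronization case $a = c = d$ the rule specializes to

$$\frac{e \xrightarrow{a} e' \qquad u \xrightarrow{a} u'}{e[u] \xrightarrow{a} e'[u']}$$

so that the environment and the inserted agent move together on the shared action.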
To define the properties of a transition $e \xrightarrow{c} e'$ for the environment, we first define transition rules for the attributes, associating a transition system with each attribute. The states of the system associated with an attribute p are pairs p : v, where v is a value of the type associated with p. All transition systems are defined jointly; that is, the transitions of one attribute can depend on the current values or transitions of the others. After defining the transitions for the attributes, the transitions for environment states must be defined in such a way that the following general rule holds. Let $p_1, \ldots, p_n$ be the attributes of a state $e$ of the environment, with $e.p_1 = v_1, \ldots, e.p_n = v_n$. Let $e \xrightarrow{c} e'$, where $e'$ is not a terminal state ($\Delta$, $\perp$, or $0$), and let $p_i : v_i \xrightarrow{c} p_i : v_i'$ for all $i \in I \subseteq [1:n]$, where $I$ is the set of all indices for which such transitions are defined. Then $e'.p_i = v_i'$ for $i \in I$ and $e'.p_i = e.p_i$ for $i \notin I$. From this definition it follows that if $I = \emptyset$ and $e \xrightarrow{c} e'$, then $e'.p_i = e.p_i$ for all $i \in [1:n]$.

When two environment states with the same attribute values are bisimilar, this rule is sufficient to define the transitions of the environment. Otherwise we can introduce a hidden part of the environment state and consider transitions of attributes jointly with this hidden component. For space considerations, in Section 6.5.1.1 we show only the example of a simple scheduler applicable to this class of operating systems.

6.5.1.1 Requirements Specification for a Simple Scheduler

This example of a simplified operating system providing initial loading and scheduling of tasks and interrupt processing is used as a benchmark to demonstrate the approach for formalizing and checking the consistency of requirements. We use the terminology of OSEK [103]. The attributes of the scheduler are:

• active, a name
• priority, a partial function from names to natural numbers
• ready, a list of name/agent pairs
• call, a partial function from names to agents
The empty list and the everywhere-undefined function are denoted as Nil. These attributes are defined only for nonterminal and deterministic states. The actions of task agents are calls for services:

• new_task(a, i), a is the name of an agent, i is an integer
• activate a, a is a name
• terminate
• schedule
In the following requirements we assume that the current state of the environment is e[u] and that $u \xrightarrow{c} u'$ for a given service c. The values of attributes are their values in the state e. We define the transitions $e[u] \xrightarrow{d} e'[u']$. The actions of the environment include all task actions and, in addition, the following actions, which are specific to the environment and are addressed to an external observer of scheduler activity:

• loaded a, a is a name
• activated a, a is a name
• activate_error
• schedule_error
• terminated a, a is a name
• schedule u, u is an agent
• scheduled a, a is a name
• wait
• start_interrupt
• end_interrupt
6.5.1.1.1 Requirements for new_task
This action replaces the old task with the same name if one was previously defined in the scheduler, or otherwise adds the task to the environment as a new task. Transitions for the attributes:

$$\mathit{priority} : f \xrightarrow{\ \mathit{new\_task}(a:v,\,i)\ } \mathit{priority} : f[a := i]$$
We use the following notation for the redefinition of functions: if $f : X \to Y$ and $x \in X$, then $f[x := y]$ is a new function $g$ such that $g(x) = y$ and $g(x') = f(x')$ for $x' \neq x$ (assignment for functions). If $x \notin X$, it is added to the domain of the function and then the assignment is performed.

$$\mathit{call} : f \xrightarrow{\ \mathit{new\_task}(a:v,\,i)\ } \mathit{call} : f[a := v]$$

Now the task agent v becomes the initial state of a task named a. new_task is defined by the following rule:

$$\frac{e \xrightarrow{\mathit{new\_task}(a:v,\,i)} e' \qquad u \xrightarrow{\mathit{new\_task}(a:v,\,i)} u'}{e[u] \xrightarrow{\mathit{loaded}\ a} e'[u']}$$

6.5.1.1.2 Requirements for Activate
We use the following notation: if p is an attribute whose value is a function and x is in the domain of this function, then p(x) denotes the current value of this function on x.

$$\frac{\mathit{call}\ a = v}{\mathit{ready} : r \xrightarrow{\ \mathit{activate}\ a\ } \mathit{ready} : \mathit{ord}(a : v,\, r)}$$

The function ord is defined on the set of lists of pairs (a : u), where a is a name and u is an agent; it must satisfy the following system of axioms, in which all parameters are assumed to be universally quantified:

$$\mathit{ord}(a : \Delta,\, r) = r$$
$$\mathit{priority}\ b \leq \mathit{priority}\ a \;\Rightarrow\; \mathit{ord}(a : u,\, b : v,\, r) = (b : v,\, \mathit{ord}(a : u,\, r))$$

Hence ready is a queue of task agents ordered by priorities, and adding a pair (a : u) puts this pair last among all pairs of the same priority as a. The rules are:

$$\frac{e \xrightarrow{\mathit{activate}\ a} e' \qquad u \xrightarrow{\mathit{activate}\ a} u' \qquad a \in \mathrm{Dom}(\mathit{call})}{e[u] \xrightarrow{\mathit{activated}\ a} e'[u']}$$

$$\frac{u \xrightarrow{\mathit{activate}\ a} u' \qquad a \notin \mathrm{Dom}(\mathit{call})}{e[u] \xrightarrow{\mathit{activate\_error}} \perp}$$

An undefined state of the environment only means that a decision about the behavior of the environment in this case is left to the implementation stage. For instance, the definition can be extended so that the environment sends an error message and calls error processing programs, or continues functioning, ignoring the incorrect action.

6.5.1.1.3 Requirements for Terminate

$$\frac{u \xrightarrow{\mathit{terminate}} u'}{e[u] \xrightarrow{\mathit{terminated}\ (e.\mathit{active})} e[\mathit{schedule}]}$$

6.5.1.1.4 Requirements for Schedule
Let $P(u, b, v, s) = P_1 \vee P_2$, where

$$P_1 = (e.\mathit{active} \neq \mathit{Nil}) \wedge \mathit{ord}(e.\mathit{active} : u,\, e.\mathit{ready}) = (b : v,\, s)$$
$$P_2 = (e.\mathit{active} = \mathit{Nil}) \wedge (u = \Delta) \wedge e.\mathit{ready} = (b : v,\, s)$$
Let $r = e.\mathit{ready}$ and $a = e.\mathit{active}$; then the rules for the attributes are:

$$\frac{P(u, b, v, s)}{\mathit{ready} : r \xrightarrow{\ \mathit{schedule}\ u\ } \mathit{ready} : s} \qquad \frac{P(u, b, v, s)}{\mathit{active} : a \xrightarrow{\ \mathit{schedule}\ u\ } \mathit{active} : b}$$

Note that the transitions for attributes, and therefore for the environment, are highly nondeterministic, because the parameter u is an arbitrary agent behavior. But this nondeterminism disappears in the rule for scheduling, which restricts the possible values for u to at most one value. The rules are:

$$\frac{P(u', b, v, s) \qquad e \xrightarrow{\mathit{schedule}\ u'} e' \qquad u \xrightarrow{\mathit{schedule}} u'}{e[u] \xrightarrow{\mathit{scheduled}\ b} e'[v]}$$

$$\frac{u \xrightarrow{\mathit{schedule}} u' \qquad (e.\mathit{active} = \mathit{Nil}) \wedge (u' \neq \Delta)}{e[u] \xrightarrow{\mathit{schedule\_error}} \perp}$$

$$\frac{u \xrightarrow{\mathit{schedule}} \Delta \qquad e.\mathit{ready} = \mathit{Nil}}{e[u] \xrightarrow{\mathit{wait}} e'[\Delta]}$$

Therefore, if a task has no name (this can happen if a task is initially inserted into an environment), it can use scheduling only as its last action; otherwise it is an error. And if there is nothing to schedule, the scheduling action is ignored.

6.5.1.1.5 Interrupts
The simplest way to introduce interrupts into our model is to hide the occurrence of interrupts and the choice of the start of interrupt processing. Only actions which show the start and the end of interrupt processing are observable. The rule is:

$$\frac{e \xrightarrow{\mathit{start\_interrupt}} e'}{e[u] \xrightarrow{\mathit{start\_interrupt}} e'[v;\ \mathit{end\_interrupt};\ u]}$$

We have no transitions for attributes labeled by the interrupt action, so in this transition e and e' have the same values for all attributes. The program v is an interrupt processing routine.

$$\frac{u \xrightarrow{\mathit{end\_interrupt}} u'}{e[u] \xrightarrow{\mathit{end\_interrupt}} e[u']}$$

Nesting of interrupts can be of arbitrary depth. The action end_interrupt is an environment action, but it is used by the inserted agent after an interrupt has started, to signal the end of interrupt processing. Therefore, the set of actions for an inserted agent is extended, but end_interrupt is still not an action of an agent before its insertion into the environment.
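To see how these rules compose, consider a small hypothetical run (our own illustration, not part of the original specification), starting from a state $e_0$ with $\mathit{active} = \mathit{ready} = \mathit{Nil}$ and empty $\mathit{priority}$ and $\mathit{call}$:

$$e_0[\mathit{new\_task}(a{:}v,1).\,\mathit{activate}\ a.\,\mathit{schedule}.\Delta] \xrightarrow{\mathit{loaded}\ a} e_1[\mathit{activate}\ a.\,\mathit{schedule}.\Delta] \xrightarrow{\mathit{activated}\ a} e_2[\mathit{schedule}.\Delta] \xrightarrow{\mathit{scheduled}\ a} e_3[v]$$

Here $e_1$ records $\mathit{priority}\ a = 1$ and $\mathit{call}\ a = v$; $e_2$ has $\mathit{ready} = (a : v)$; and the last step uses case $P_2$ (active is Nil and the continuation after schedule is $\Delta$), so $e_3$ has $\mathit{active} = a$ and $\mathit{ready} = \mathit{Nil}$, and the task agent $v$ now runs.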
6.5.1.1.6 Termination
When all tasks have successfully terminated, the scheduler reaches the waiting state:

$$\mathit{active} : a \xrightarrow{\ \mathit{wait}\ } \mathit{active} : \mathit{Nil}$$

$$\frac{\mathit{ready} = \mathit{Nil} \qquad e \xrightarrow{\mathit{wait}} e'}{e[\Delta] \xrightarrow{\mathit{wait}} e'[\Delta]}$$

6.5.1.1.7 Dynamic Requirements
A state e of an environment is called initial if e.ready = e.active = Nil and the domains of the functions e.priority and e.call are empty. Let E0 be the set of all states reachable from the initial states. Define En+1, n = 0, 1, ..., as the set of all states reachable from the states e[u], where e ∈ En and u is an arbitrary task agent. The set E of admissible states is defined as the union E = E0 ∪ E1 ∪ .... Multiple insertion rules show that the insertion function is sequential. The dynamic requirements for environment states are as follows:

• E does not contain the deadlock state 0.
• There are no undefined states in E except for those which result from error actions.
• Tasks of the same priority are scheduled in FIFO discipline, tasks of higher priority are scheduled first, and interrupt actions are nested like brackets.

6.5.1.1.8 Consistency
The only nonconstructive transition in the requirements specification of the simple scheduler is the insertion of an arbitrary agent as an interrupt processing routine. If we restrict the corresponding transitions to a selection from some finite set (even a nondeterministic one), the requirements become executable. To prove the dynamic properties, some invariant properties of E (always statements) must first be proved. After their formalization, the dynamic properties are inferred from these invariants:
• Dom(e.priority) = Dom(e.call)
• (a : u) ∈ e.ready ⇒ a ∈ Dom(e.priority)
• e.active ≠ Nil ⇒ e.active ∈ Dom(e.priority)
• e.ready is ordered by priority
In the invariants formulated above, e is assumed to be nonterminal.

6.5.1.2 Input Text to the Consistency Checker

The consistency checker accepts static requirements represented in the form of Hoare-style triples and dynamic requirements in the form of logical formulae. Requirements include the description of typed attributes and actions. The following input text is obtained from the description of the simple scheduler considered above. It is statically consistent and can be used for proving dynamic properties of the scheduler. Each requirement describes the change of a state of the environment with the inserted agent represented as the value of the attribute active_task. The value of this attribute is the behavior of the previously inserted agent which is currently active. The predicate active_task --> a.u is used to represent a transition of the active task: the agent performs the action a and its remaining behavior is u. The action axiom is needed to prove consistency for the action wait (Code 6.7).

Code 6.7

attributes(
  active: name,
  priority: name -> Nat,
  ready: list of (name:agent),
  call: name -> agent,
  active_task: agent );
actions(a:name, u:agent, i:int)(
  new_task(a:u,i), activate a, terminate, schedule,
  loaded a, activated a, activate_error, schedule_error,
  terminated a, schedule u, scheduled a, wait,
  start_interrupt, end_interrupt );

Let action axiom: Forall x(~(x.Delta = Delta));
Let ord Delta: Forall(a,r)(ord(a:Delta,r) = r);
Let ord: Forall(a,b,u,v,r)(
  (priority b <= priority a) & ~(a = Delta)
  -> (ord(a:u,b:v,r) = (b:v,ord(a:u,r))));

/* ------------- new_task ------------------------------ */
req new_task: Forall(a:name, (u,v):agent, i:int)(
  (active_task --> new_task(a:v,i).u)
  -> after(loaded a)
     ((active_task = u) & (priority a = i) & (call a = v)));

/* ------------- activate ------------------------------ */
req activate success: Forall(a:name, (u,v):agent, r:list of(name:agent))(
  ((active_task --> activate a.u) & (ready = r)
   & (call a = v) & ~(v = Nil))
  -> after(activated a)
     (active_task = u & ready = ord(a:v,r)));
req activate error: Forall(a:name, u:agent)(
  ((active_task --> activate a.u) & (call a = Nil))
  -> after activate_error bot);

/* ------------- terminate ----------------------------- */
req terminate: Forall(a:name, u:agent)(
  ((active_task --> terminate.u) & (active = a))
  -> after(terminated a) (active_task = schedule));

/* ------------- schedule ------------------------------ */
req schedule success active: Forall((u,v):agent, a:name, s:list of(name:agent))(
  ((active_task --> schedule.u) & ~(active = Nil)
   & (ord(active:u,ready) = (a:v,s)))
  -> after(scheduled a)
     ((active_task = u) & (active = a) & (ready = s)));
req schedule success not active: Forall(v:agent, a:name, s:list of(name:agent))(
  ((active_task = schedule) & (active = Nil) & (ready = (a:v,s)))
  -> after(scheduled a)
     ((active_task = v) & (active = a) & (ready = s)));
req schedule error: Forall(u:agent)(
  ((active_task --> schedule.u) & (active = Nil) & ~(u = Delta))
  -> after schedule_error bot);
req schedule final: Forall(v:agent, b:name, s:list of(name:agent))(
  ((active_task --> schedule.Delta) & (ready = Nil))
  -> after wait (active_task = Delta));

/* ------------- interrupt ------------------------------ */
req start interrupt: Forall((u,v):agent)(
  ((active_task = u) & (interrupt_process = v))
  -> after start_interrupt (active_task = (v;end_interrupt;u)));
req end interrupt: Forall(u:agent)(
  (active_task --> end_interrupt.u)
  -> after end_interrupt (active_task = u));

/* ------------- termination --------------------------- */
req termination: Forall(u:agent)(
  (active_task = Delta) & (ready = Nil)
  -> after wait (active_task = Delta));

/* ------------- dynamic properties -------------------- */
prove always Forall(a:name)(
  a in_set Dom(priority) <=> a in_set Dom(call));
prove always Forall(a:name, u:agent)(
  (a:u) in_list(ready) -> a in_set Dom(priority));
prove always ~(active = Nil) -> active in_set Dom(priority);
prove always is_ord ready
6.5.2 Experimental Results in Various Domains

We have developed specializations for the following subject domains: sequential asynchronous environments, parallel asynchronous environments, and sequential synchronous agents. We have conducted a number of projects in each domain to determine the effectiveness of formal requirements verification. Figure 6.8 exhibits the performance of our provers. We show the measurements in terms of MSC diagrams, a familiar engineering notation often used to describe embedded systems. The chart on the left shows performance in terms of "arrows," that is, communications between instances on an MSC diagram. We can see that the performance is roughly linear in the number of arrows, up to roughly 800 arrows per diagram. Note that a typical diagram has far fewer arrows, no more than a hundred in most cases. The chart on the right shows that performance is linear in the number of MSC diagrams (of typical size). Jointly, these charts indicate that the system scales to realistically sized applications.

6.5.2.1 OSEK

OSEK [103] is a representative example of an asynchronous sequential environment. The OSEK standard defines an open embedded operating system for automotive electronics. The OSEK formal model has been described as an environment for application tasks of different types, considered as agents inserted into this environment. The actions common to agents and environment are the services of the operating system. The system is multitasking, but it has only one processor, only one task is running at any given moment, and, therefore, the system is considered to be sequential.
FIGURE 6.8 Performance of the prover in terms of MSC diagrams. Left: proving time per arrow (sec) against the number of arrows (up to 1200). Right: total proving time (sec) against the number of MSC diagrams (25 to 150).
The system is asynchronous because all actions performed by tasks independently of the operating system are not observable, and so the time between two services cannot be taken into account. Static requirements are represented by transition rules with preconditions and postconditions. The reachable states for OSEK can be characterized by integrity conditions. After developing the formal requirements for OSEK, the proof system was used to prove static consistency and completeness of the requirements. Several interesting dynamic properties of the requirements were also proven. The formalization of the OSEK requirements led to the discovery of 12 errors in the informal OSEK standard. For example, Section 6.7.5 of the OSEK/VDX specification [103] defines a transition related to the current priority of a task in the case when it has a priority less than the ceiling priority of the resource; however, no transition is defined in the case when the current priority of the task is equal to the ceiling priority. All these errors were documented, and the corrections have been integrated into the OSEK standard. In the formal specification, we have covered 10 services defined by the OSEK standard and have proven the consistency and completeness of this specification. This covers approximately 40% of the complete OSEK standard. Moreover, we have found a number of mistakes in the other parts of the OSEK standard, which prevented formalization of the rest of the standards document. Consistency and completeness of the covered parts of the standard (49 requirements) were proven after making corrections for the above-mentioned defects. The proof of consistency took approximately 7 min on a Pentium III computer with 256 MB of RAM running the Red Hat Linux operating system.

6.5.2.2 RIO

The RapidIO Interconnect Protocol [112] is an example of a parallel asynchronous environment. This is a protocol for a set of processor elements to communicate among each other. Three layers of abstraction are developed: the logical, transport, and physical layers. The static requirements for RIO are standard (pre- and postconditions referring to adjacent moments of time). But while in OSEK an observable action is uniquely defined by the running task, in RIO it is generated by a nondeterministic choice of one of the processor elements. The formal requirements description of RIO for the logical (14 requirements) and transport layers (6 requirements) was obtained from the documentation and proved to be consistent and complete (46 sec); 46 requirements for the physical layer were proven consistent in 8.5 min.

6.5.2.3 V'ger

The formal requirements for the protocol used by the SC-V'ger processor [113] for communicating with other processor elements of a system via the MAXbus bus device were extracted from the documentation of the MAXbus and from discussions with experts. V'ger is a representative example of a synchronous
sequential agent inserted into a parallel environment. V'ger is a deterministic automaton with binary input-output signals and shared data available from the bus. The attributes of the system are its input-output signals and its shared data. Originally, there are no actions, and we can consider the clock signal synchronizing the system as the only observable action. Static requirements are written using assertion/deassertion conditions for output signals. Each requirement is a rule for setting a signal to a given value (0 or 1). The precondition is a history of conditions represented in a Kleene-like algebra with time. Several rules can be applied at the same moment. For the static consistency conditions, the preconditions of two rules which set the same attribute to different values can never be true at the same clock interval. There are no static completeness conditions, because we define the semantics of the requirements text so that if there are no rules to change an output value, it remains in the same state as at the previous moment of time. We use binary attribute symbols as predicates, and as long as there are no other predicate symbols the system reduces to a propositional calculus. To prove statements with Kleene algebra expressions, these must first be reduced to first-order logic, that is, to requirements with preconditions referring to one moment of time (without histories). A converter has been developed for the automatic translation of subject domains relying on Kleene algebra and the interval calculus notation. The set of reachable states of V'ger is not defined in first-order logic, and the proof of the consistency condition is only a sufficient condition for consistency. A more powerful yet still sufficient condition is the provability of consistency conditions by standard induction from static requirements. There exists a sequence of increasingly powerful conditions which converge to the results obtained by model checking. All 26 V'ger requirements have been proven to be consistent (192 sec).
6.6 Conclusions and Perspectives

In this chapter, we reviewed tools and methods to ensure that the "right" system is developed, by which we mean a system that matches what the customer really wants. Systems that do not match customer requirements result, at best, in cost overruns owing to later changes of the system and, in the worst case, may never be deployed. Based on the mathematical model of the theory of agents and interactions, we developed a set of tools capable of establishing the consistency and completeness of system requirements. Roughly speaking, if the requirements are consistent, an implementation which meets the requirements is possible; if the requirements are complete, this implementation is defined uniquely by the requirements. We discussed how to represent requirements specifications for formal validation and exhibited experimental results of deploying these tools to establish the correctness of embedded software systems. This chapter also reviewed other models of system behavior and other tools for system validation and verification. Our experience has shown that dramatic quality improvements are possible through formal validation and verification of systems under development. In practice, deployment of these techniques will require increased upstream development effort: thorough analysis of requirements and their capture in specification languages results in a longer design phase. In addition, significant training and experience are needed before significant benefits can be achieved. Nevertheless, the improvements in quality and the reduction in effort in later development phases warrant this investment, as the application of these methods in pilot projects has demonstrated.
References

[1] D. Harel and A. Pnueli. On the development of reactive systems. In K. Apt, Ed., Logics and Models of Concurrent Systems. NATO ASI Series, vol. 13. Springer-Verlag, Heidelberg, 1985, pp. 477–498.
[2] Z. Manna and A. Pnueli. The Temporal Logic of Reactive and Concurrent Systems. Springer-Verlag, Heidelberg, 1992.
[3] Z. Manna and A. Pnueli. Temporal Verification of Reactive Systems: Safety. Springer-Verlag, Heidelberg, 1995.
[4] F.P. Brooks. The Mythical Man-Month: Essays on Software Engineering. Addison-Wesley, Reading, MA, 1995.
[5] L. Lamport. Introduction to TLA. SRC Technical Note 1994-001, 1994.
[6] R.J. van Glabbeek. Notes on the methodology of CCS and CSP. Theoretical Computer Science, 177: 329–349, 1997.
[7] D.M.R. Park. Concurrency and automata on infinite sequences. In Proceedings of the 5th GI Conference. Lecture Notes in Computer Science, vol. 104. Springer-Verlag, Heidelberg, 1981.
[8] R. Milner. Communication and Concurrency. Prentice Hall, New York, 1989.
[9] J.V. Kapitonova and A.A. Letichevsky. On constructive mathematical descriptions of subject domains. Cybernetics, 4: 408–418, 1988.
[10] A.A. Letichevsky and D.R. Gilbert. Towards an implementation theory of nondeterministic concurrent languages. Second Workshop of the INTAS-93-1702 Project: Efficient Symbolic Computing, St. Petersburg, October 1996.
[11] A.A. Letichevsky and D.R. Gilbert. A general theory of action languages. Cybernetics and System Analysis, 1: 12–31, 1998.
[12] R. Milner. The polyadic π-calculus: a tutorial. In F.L. Bauer, W. Brauer, and H. Schwichtenberg, Eds., Logic and Algebra of Specification. Springer-Verlag, Heidelberg, 1993, pp. 203–246.
[13] C.A.R. Hoare. Communicating Sequential Processes. Prentice Hall, New York, 1985.
[14] J.A. Bergstra and J.W. Klop. Process algebra for synchronous communication. Information and Control, 60: 109–137, 1984.
[15] L. Lamport. The temporal logic of actions. ACM Transactions on Programming Languages and Systems, 16(3): 872–923, 1994.
[16] A. Pnueli. The temporal logic of programs. In Proceedings of the 18th Annual Symposium on the Foundations of Computer Science, November 1977, pp. 46–52.
[17] E. Emerson and J. Halpern. Decision procedures and expressiveness in the temporal logic of branching time. Journal of Computer and System Science, 30: 1–24, 1985.
[18] M.J. Fischer and R.E. Ladner. Propositional modal logic of programs. In Proceedings of the 9th ACM Annual Symposium on Theory of Computing, 1977, pp. 286–294.
[19] E. Emerson. Temporal and modal logic. In J. van Leeuwen, Ed., Handbook of Theoretical Computer Science. MIT Press, Cambridge, MA, 1991, pp. 997–1072.
[20] R. Bryant. Graph-based algorithms for Boolean function manipulation. IEEE Transactions on Computers, 35: 677–691, 1986.
[21] J. Burch, E. Clarke, K. McMillan, D. Dill, and L. Hwang. Symbolic model checking: 10^20 states and beyond. Information and Computation, 98: 142–170, 1992.
[22] E. Clarke and E. Emerson. Synthesis of synchronization skeletons for branching time temporal logic. In The Workshop on Logic of Programs. Lecture Notes in Computer Science, vol. 131. Springer-Verlag, Heidelberg, 1981, pp. 128–143.
[23] J. Queille and J. Sifakis. Specification and verification of concurrent systems in CESAR. In Proceedings of the 5th International Symposium on Programming, 1982, pp. 142–158.
[24] L. Lamport. What good is temporal logic? In R. Mason, Ed., Information Processing-83: Proceedings of the 9th IFIP World Computer Congress, Elsevier, 1983, pp. 657–668.
[25] M. Abadi and L. Lamport. Composing specifications. ACM Transactions on Programming Languages and Systems, 15: 73–132, 1993.
[26] W. Thomas. Automata on infinite objects. In J. van Leeuwen, Ed., Handbook of Theoretical Computer Science. MIT Press, Cambridge, MA, 1991, pp. 131–191.
[27] A.P. Sistla, M. Vardi, and P. Wolper. The complementation problem for Büchi automata with application to temporal logic. Theoretical Computer Science, 49: 217–237, 1987.
[28] M. Vardi and P. Wolper. An automata-theoretic approach to automatic program verification. In Proceedings of the 1st IEEE Symposium on Logic in Computer Science, 1986, pp. 332–344.
[29] H. Rogers. Theory of Recursive Functions and Effective Computability. McGraw-Hill, New York, 1967.
© 2006 by Taylor & Francis Group, LLC
6-52
Embedded Systems Handbook
[30] Y. Gurevich. Evolving algebras: an attempt to discover semantics. In G. Rozenberg and A. Salomaa, Eds., Current Trends in Theoretical Computer Science, World Scientific, River Edge, NJ, 1993, pp. 266–292. [31] Y. Gurevich. Evolving algebras 1993: Lipari guide. In E. Börger, Ed., Specification and Validation Methods. University Press, 1995, pp. 9–36. [32] J. Meseguer. Conditional rewriting logic as a unified model of concurrency. Theoretical Computer Science, 96: 73–155, 1992. [33] P. Lincoln, N. Marti-Oliet, and J. Meseguer. Specification, transformation and programming of concurrent systems in rewriting logic. In G. Bleloch et al., Eds., Proceedings of the DIMACS Workshop on Specification of Parallel Algorithms American Mathematical Society, Providence, 1994. [34] M. Clavel. Reflection in General Logics and Rewriting Logic with Application to the Maude Language. Ph.D. thesis, University of Navarra, 1998. [35] M. Clavel and J. Meseguer. Axiomatizing reflective logics and languages. In G. Kicrales, Ed., Reflection’96. 1996, pp. 263–288. [36] M. Clavel, F. Durán, S. Eker, P. Lincoln, N. Martí-Oliet, J. Meseguer, and J. Quesada. Towards Maude 2.0. In F. Futatsugi, Ed., Proceedings of the 3rd International Workshop on Rewriting Logic and its Applications. Notes in Theoretical Computer Science, vol. 36, Elsevier, 2000. [37] J. Meseguer and P. Lincoln. Introduction in Maude. Technical report, SRI International, 1998. [38] J. Brackett. Software Requirements. Technical report SEI-CM-19-1.2, Software Engineering Institute, 1990. [39] B. Boehm. Industrial software metrics top 10 list. IEEE Software, 4: 84–85, 1987. [40] B. Boehm. Software Engineering Economics. Prentice Hall, New York, 1981. [41] J.C. Kelly, S.S. Joseph, and H. Jonathan. An analysis of defect densities found during software inspections. Journal of Systems Software, 17: 111–117, 1992. [42] R. Lutz. Analyzing requirements errors in safety-critical embedded sytems. In IEEE International Symposium Requirements Engineering, San Diego, 1993, pp. 126–133. [43] T. DeMarco. Structured Analysis and System Specification. Yourdon Press, New York, 1979. [44] C.V. Ramamoorthy, A. Prakash, W. Tsai, and Y. Usuda. Software engineering: problems and perspectives. Computer, 17: 191–209, 1984. [45] M.E. Fagan. Design and code inspections to reduce errors in program evelopment. IBM Systems Journal, 15: 182–211, 1976. [46] M.E. Fagan. Advances in software inspection. IEEE Transactions on Software Engineering, 12: 744–751, 1986. [47] J. Rushby. Formal Methods and their Role in the Certification of Critical Systems. Technical report CSL-95-1, March 1995. [48] C.B. Jones. Systematic Software Development Using VDM. Prentice Hall, New York, 1990. [49] J.M. Spivey. Understanding Z: A Specification Language and its Formal Semantics. Cambridge University Press, London, 1988. [50] J.-R. Abrial. The B-Book: Assigning Programs to Meanings. Cambridge University Press, London, 1996. [51] International Organization for Standardization — Information Processing Systems — Open Systems Interconnection. Lotos — A Formal Description Technique Based on the Temporal Ordering of Observational Behavior. ISO Standard 8807. Geneva, 1988. [52] R.S. Boyer and J.S. Moore. A Computational Logic Handbook. Academic Press, New York, 1988. [53] M.J.C. Gordon and T.F. Melham, Eds., Introduction to HOL. Cambridge University Press, London, 1993. [54] D. Craigen, S. Kromodimoeljo, I. Meisels, B. Pase, and M. Saaltink. EVES: an overview. In VDM’91: Formal Software Development Methods. 
Lecture Notes in Computer Science, vol. 551. Springer-Verlag, Heidelberg, 1991, pp. 389–405. [55] M. Saaltink, S. Kromodimoeljo, B. Pase, D. Craigen, and I. Meisels. Data abstraction in EVES. In Formal Methods Europe’93, Odense, April 1993.
© 2006 by Taylor & Francis Group, LLC
System Validation
6-53
[56] S. Owre, N. Shankar, and J.M. Rushby. User Guide for the PVS Specification and Verification System. Technical report, SRI International, 1996. [57] E. Clarke, O. Grumberg, and D. Peled. Model Checking. MIT Press, Cambridge, MA, 2000. [58] P. Godefroid. VeriSoft: A tool for the automatic analysis of concurrent reactive software. In Proceedings of the 9th Conference on Computer Aided Verification. Lecture Notes in Computer Science, vol. 1254. Springer-Verlag, Heidelberg, 1997, pp. 476–479. [59] J. Burch, E. Clarke, D. Long, K. McMillan, and D. Dill. Symbolic model checking for sequential circuit verification. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 13(4): 401–424, 1994. [60] G. Holzmann. The SPIN Model Checker, Primer and Reference Manual. Addison-Wesley, Reading, MA, 2004. [61] S.J. Garland and J.V. Guttag. A Guide to LP, the Larch Prover. Technical report, DEC Systems Research Center Report 82, 1991. [62] J. Crow, S. Owre, J. Rushby, N. Shankar, and M. Srivas. A tutorial introduction to PVS. In WIFT ’95: Workshop on Industrial-Strength Formal Specification Techniques. Boca Raton, FL, April 1995. [63] S. Rajan, N. Shankar, and M. Srivas. An integration of model checking with automated proof checking. In Proceedings of the 7th International Conference on Computer Aided Verification — CAV ’95. Lecture Notes in Computer Science, vol. 939. Springer-Verlag, Heidelberg, 1995, pp. 84–97. [64] B. Berard, Ed., Systems and Software Verification: Model-Checking Techniques and Tools. SpringerVerlag, Heidelberg, 2001. [65] International Telecommunications Union. Recommendation Z.120 — Message Sequence Charts. Geneva, 2000. [66] Object Management Group. Unified Modeling Language Specification, 2.0. 2003. [67] J. Hooman. Towards formal support for UML-based development of embedded systems. In Proceedings of the 3rd PROGRESS Workshop on Embedded Systems, 2002, pp. 71–76. [68] M. Bozga, J. Fernandez, L. Ghirvth, S. Graf, J.P. Krimm, L. Mounier, and J. Sifakis. IF: an intermediate representation for SDL and its applications. In Proceedings of the 9th SDL Forum, Montreal, June 1999. [69] F. Regensburger and A. Barnard. Formal verification of SDL systems at the Siemens mobile phone department. In Tools and Algorithms for the Construction and Analysis of Systems — ACAS’98. Lecture Notes in Computer Science, vol. 1384. Springer-Verlag, Heidelberg, 1998, pp. 439–455. [70] O. Shumsky and L. J. Henschen. Developing a framework for verification, simulation and testing of SDL specifications. In M. Kaufmann and J.S. Moore, Eds., Proceedings of the ACL2 Workshop 2000, Austin, 2000. [71] P. Baker, P. Bristow, C. Jervis, D. King, and B. Mitchell. Automatic generation of conformance tests from message sequence charts. In Proceedings of the 3rd SAM (SDL And MSC) Workshop, Telecommunication and Beyond, Aberystwyth. Lecture Notes in Computer Science, 2003, p. 2599. [72] B. Mitchell, R. Thomson, and C. Jervis. Phase automaton for requirements scenarios. In Proceedings of the Feature Interactions in Telecommunications and Software Systems, vol. VII, 2003, pp. 77–87. [73] L. Philipson and L. Hogskola. Survey compares formal verification tools. EETIMES, 2001. http://www.eetimes.com/story/OEG20011128S0037 [74] S. Yovine. Kronos: A verification tool for real-time systems. International Journal of Software Tools for Technology Transfer, 1: 123–133, 1997. [75] P. Pettersson and K. Larsen. UPPAAL2k. Bulletin of the European Association for Theoretical Computer Science, 70: 40–44, 2000. 
[76] D. Bjorner and C.B. Jones, Eds., The Vienna development method: the meta-language. In Logic Programming. Lecture Notes in Computer Science, vol. 61. Springer-Verlag, Heidelberg, 1978. [77] Y. Ledru and P.-Y. Schobbens. Applying VDM to large developments. ACM SIGSOFT Software Engineering Notes, 15: 55–58, 1990.
© 2006 by Taylor & Francis Group, LLC
6-54
Embedded Systems Handbook
[78] A. Puccetti and J.Y. Tixadou. Application of VDM-SL to the development of the SPOT4 programming messages generator. FM ’99: World Congress on Formal Methods, VDM Workshop, Toulouse, 1999. [79] J.C. Bicarregui and B. Ritchie. Reasoning about VDM developments using the VDM support tool in Mural. In VDM 91: Formal Software Development Methods. Lecture Notes in Computer Science, vol. 551. Springer-Verlag, Heidelberg, 1991, pp. 371–388. [80] A. Diller. Z: An Introduction to Formal Methods. John Wiley & Sons, New York, 1990. [81] W. Grieskamp, M. Heisel, and H. Dorr. Specifying embedded systems with statecharts and Z: an agenda for cyclic software components. In Proceedings of the Formal Aspects of Software Engineering — FASE ’98. Lecture Notes in Computer Science, vol. 1382. Springer-Verlag, Heidelberg, 1998. [82] D. Bert, S. Boulmé, M.-L. Potet, A. Requet, and L. Voisin. Adaptable translator of B specifications to embedded C programs. In Formal Methods 2003. Lecture Notes in Computer Science, vol. 2805. Springer-Verlag, Heidelberg, 2003, pp. 94–113. [83] R. Milne. The Semantic Foundations of the RAISE Specification Language. RAISE report REM/11, STC Technology, 1990. [84] M. Nielsen, K. Havelund, K. Wagner, and C. George. The RAISE language, methods, and tools. Formal Aspects of Computing, 1: 85–114, 1989. [85] T. Mossakowski, Kolyang, and B. Krieg-Bruckner. Static semantic analysis and theorem proving for CASL. In F. Parisi Presicce, Ed., Proceedings of the 12th Workshop on Algebraic Development Techniques. Lecture Notes in Computer Science, vol. 1376. Springer-Verlag, Heidelberg, 1998, pp. 333–348. [86] P.D. Mosses. COFI: the common framework initiative for algebraic specification and development. In TAPSOFT’97: Theory and Practice of Software Development. Lecture Notes in Computer Science. vol. 1214. Springer-Verlag, Heidelberg, 1997, pp. 115–137. [87] B. Krieg-Brückner, J. Peleska, E. Olderog, and A. Baer. The UniForM workbench, a universal development environment for formal methods. In J. Wing, J. Woodcock, and J. Davies, Eds., FM’99, Formal Methods. Lecture Notes in Computer Science, vol. 1709. Springer-Verlag, Heidelberg, 1999, pp. 1186–1205. [88] C.L. Heitmeyer, J. Kirby, and B. Labaw. Tools for formal specification, verification and validation of requirements. In Proceedings of the 12th Annual Conference on Computer Assurance, Gaithersburg, June 1997. [89] S. Easterbrook, R. Lutz, R. Covington, Y. Ampo, and D. Hamilton. Experiences using lightweight formal methods for requirements modeling. IEEE Transactions on Software Engineering, 24: 4–14, 1998. [90] L.C. Paulson. Isabelle: A Generic Theorem Prover. Lecture Notes in Computer Science, vol. 828. Springer-Verlag, Heidelberg, 1994, pp. 23–34. [91] B.J. Krämer and N. Völker. a highly dependable computer architecture for safety-critical control applications. Real-Time Systems Journal, 13: 237–251, 1997. [92] D. Muthiayen. Real-Time Reactive System Development — A Formal Approach Based on UML and PVS. Technical report, Concordia University, 2000. [93] P.B. Jackson. The Nuprl Proof Development System, Reference Manual and User Guide. Cornell University, Ithaca, NY, 1994. [94] L. Cortes, P. Eles, and Z. Peng. Formal coverification of embedded systems using model checking. In Proceedings of the 26th EUROMICRO Conference, Maastricht, September 2000, pp. 106–113. [95] G. Holzmann. Design and Validation of Computer Protocols. Prentice Hall, New York, 1991. [96] G. Holzmann. The model checker SPIN. 
IEEE Transactions on Software Enginering, 23: 3–20, 1997. [97] R. Kurshan. Automata-Theoretic Verification of Coordinating Processes. Princeton University Press, Princeton, NJ, 1993. [98] R. de Simone and M. Lara de Souza. Using partial-order methods for the verification of behavioural equivalences. In G. von Bochmann, R. Dssouli, and O. Rafiq, Eds., Formal Description Techniques VIII, 1995. © 2006 by Taylor & Francis Group, LLC
System Validation
6-55
[99] J. Fernandez, H. Garavel, A. Kerbrat, R. Mateescu, L. Mounier, and M. Sighireanu. CADP: a protocol validation and verification toolbox. In Proceedings of the 8th Conference on ComputerAided Verification. New Brunswick, August 1996, pp. 437–440. [100] D. Dill, A. Drexler, A. Hu, and C. Yang. Protocol verification as a hardware design aid. In IEEE International Conference on Computer Design: VLSI in Computers and Processors. October 1992, pp. 522–525. [101] E. Astegiano and G. Reggio. Formalism and method. Theoretical Computer Science, 236: 3–34, 2000. [102] Z. Chaochen, C.A.R. Hoare, and A.P. Ravn. A calculus of durations. Information Processing Letter, 40: 269–276, 1991. [103] OSEK Group. OSEK/VDX. Operating System.Version 2.1. May 2000. [104] S.N. Baranov, V. Kotlyarov, J. Kapitonova, A. Letichevsky, and V. Volkov. Requirement capturing and 3CR approach. In Proceedings of the 26th International Computer Software and Applications Conference, Oxford, 2002, pp. 279–283. [105] J.V. Kapitonova, A.A. Letichevsky, and S.V. Konozenko. Computations in APS. Theoretical Computer Science, 119: 145–171, 1993. [106] D.R. Gilbert and A.A. Letichevsky. A universal interpreter for nondeterministic concurrent programming languages. In M. Gabbrielli, Ed., Fifth Compulog Network Area Meeting on Language Design and Semantic Analysis Methods, September 1996. [107] T. Valkevych, D.R. Gilbert, and A.A. Letichevsky. A generic workbench for modelling the behaviour of concurrent and probabilistic systems. In Workshop on Tool Support for System Specification, Development and Verification, TOOLS98, Malente, June 1998. [108] A.A. Letichevsky, J.V. Kapitonova, and V.A. Volkov. Deductive tools in algebraic programming system. Cybernetics and System Analysis, 1: 12–27, 2000. [109] A. Degtyarev, A. Lyaletski, and M. Morokhovets. Evidence algorithm and sequent logical inference search. In H. Ganzinger, D. McAllester, and A. Voronkov, Eds., Logic for Programming and Automated Reasoning (LPAR’99). Lecture Notes in Computer Science, vol. 1705. Springer-Verlag, 1999, pp. 44–61. [110] V.M. Glushkov, J.V. Kapitonova, A.A. Letichevsky, K.P. Vershinin, and N.P. Malevanyi. Construction of a practical formal language for mathematical theories. Cybernetics, 5: 730–739, 1972. [111] V.M. Glushkov. On problems of automata theory and artificial intelligence. Cybernetics, 5: 3–13, 1970. [112] Motorola. RIO Interconnect Globally Shared Memory Logical Specification. Motorola, 1999. [113] Motorola. SC-V’ger Microprocessor Implementation Definition. Motorola, 1997. [114] S. Abramsky. A domain equation for bisimulation. Information and Computation, 92: 161–218, 1991. [115] R. Alur and D. Dill. A theory of timed automata. Theoretical Computer Science, 126: 183–235, 1994. [116] S.N. Baranov, C. Jervis, V. Kotlyarov, A. Letichevsky, and T. Weigert. Leveraging UML to deliver correct telecom applications. In L. Lavagno, G. Martin, and B. Selic, Eds., UML for Real: Design of Embedded Real-Time Systems. Kluwer Academic Publishers, Amsterdam, 2003. [117] J. Bicarregui, T. Dimitrakos, B. Matthews, T. Maibaum, K. Lano, and B. Ritchie. The VDM+B project: objectives and progress. In World Congress on Formal Methods in the Development of Computing Systems. Toulouse, September 1999. [118] G. Booch, J. Rumbaugh, and I. Jacobson. Unified Modeling Language User Guide. Addison-Wesley, Reading, MA, 1997. [119] S. Chandra, P. Godefroid, and C. Palm. Software model checking in practice: an industrial case study. 
In Proceedings of the International Conference on Software Engineering, Orlando, May 2002.
© 2006 by Taylor & Francis Group, LLC
6-56
Embedded Systems Handbook
[120] E. Clarke, I. Draghicescu, and R. Kurshan. A Unified Approach for Showing Language Containment and Equivalence between Various Types of Omega-Automata. Technical report, Carnegie-Mellon University, 1989. [121] F. Van Dewerker and S. Booth. Requirements Consistency — A Basis for Design Quality. Technical report, Ascent Logic, 1998. [122] E. Felt, G. York, R. Brayton, and A. Vincentelli. Dynamic variable reordering for BDD minimization. In Proceedings of the EuroDAC, 1993, pp. 130–135. [123] M. Fitting. A Kripke-Kleene semantics for logic programs. Journal of Logic Programming, 2: 295–312, 1985. [124] I. Graham. Migrating to Object Technology. Addison-Wesley, Reading, MA, 1995. [125] Green Mountain Computing Systems. Green Mountain VHDL Tutorial, 1995. [126] International Telecommunications Union. Recommendation Z.100 — Specification and Description Language. Geneva, 1999. [127] B. Jacobs. Objects and classes, coalgebraically. In B. Freitag, C.B. Jones, C. Lengauer, and H.-J. Schek, Eds., Object-Orientation with Parallelism and Persistence. Kluwer Academic Publishers, 1996, pp. 83–101. [128] I. Jacobson. Object-Oriented Software Engineering, A Use Case Driven Approach. Addison-Wesley, Reading, MA, 1992. [129] N.D. Jones, C. Gomard, and P. Sestoft. Partial Evaluation and Automatic Program Generation. Prentice Hall, New York, 1993. [130] J.V. Kapitonova, T.P. Marianovich, and A.A. Mishchenko. Automated design and simulation of computer systems components. Cybernetics and System Analysis, 6: 828–840, 1997. [131] M. Kaufmann and J.S. Moore. ACL2: an industrial strength version of NQTHM. In Proceedings of the 11th Annual Conference on Computer Assurance (COMPASS96), June 1996, pp. 23–34. [132] S. Kripke. Semantical considerations on modal logic. Acta Philosophica Fennica, 16: 83–94, 1963. [133] J. van Leeuwen, Ed., Handbook of Theoretical Computer Science. MIT Press, Cambridge, MA, 1991. [134] A.A. Letichevsky, and J.V. Kapitonova. Mathematical information environment. In Proceedings of the 2nd International THEOREMA Workshop, Linz, June 1998, pp. 151–157. [135] A.A. Letichevsky and D.R. Gilbert. Agents and environments. In Proceedings of the 1st International Scientific and Practical Conference on Programming, Kiev, 1998. [136] A.A. Letichevsky and D.R. Gilbert. A model for interaction of agents and environments. In Selected Papers from the 14th International Workshop on Recent Trends in Algebraic Development Techniques. Lecture Notes in Computer Science. vol. 1827, 2004, pp. 311–328. [137] P. Lindsay. On transferring VDM verification techniques to Z. In Proceedings of Formal Methods Europe — FME’94, Barcelona, October 1994. [138] W. McCune. Otter 3.0 Reference Manual and Guide. Technical report, Argonne National Laboratory Report ANL-94, 1994. [139] K. McMillan. Symbolic Model Checking. Kluwer Academic Publishers, Dordrecht, 1993. [140] M. Morockovets and A. Luzhnykh. Representing mathematical texts in a formalized natural like language. In Proceedings of the 2nd International THEOREMA Workshop, Linz, June 1998, pp. 157–160. [141] T. Nipkow, L. Paulson, and Markus Wenzel. Isabelle/HOL — A Proof Assistant for Higher-Order Logic. Lecture Notes in Computer Science, vol. 2283. Springer-Verlag, Heidelberg, 2002. [142] S. Owre, J.M. Rushby, and N. Shankar. A prototype verification system. In D. Kapur, Ed., Proceedings of the 11th International Conference on Automated Deduction (CADE). Lecture Notes in Artificial Intelligence, vol. 601. Springer-Verlag, Heidelberg, 1992, pp. 748–752. [143] G. 
Plotkin. A Structured Approach to Operational Semantics. Technical report, DAIMI FN-19, Aarhus University, 1981. [144] K.S. Rubin and A. Goldberg. Object behavior analysis. Communications of the ACM, 35: 48–62, 1992.
© 2006 by Taylor & Francis Group, LLC
System Validation
6-57
[145] R. Rudell. Dynamic variable reordering for ordered binary decision diagrams. In Proceedings of the IEEE/ACM ICCAD’93, 1993, pp. 42–47. [146] J. Rushby. Mechanized formal methods: where next? In J. Wing and J. Woodcock, Eds., FM99: The World Congress in Formal Methods. Lecture Notes in Computer Science, vol. 1708. Springer-Verlag, Heiderberg, 1999, pp. 48–51. [147] J. Rushby, S. Owre, and N. Shankar. Subtypes for specifications: predicate subtypes in PVS. IEEE Transactions on Software Engineering, 24: 709–720, 1998. [148] M. Saeki, H. Horai, and H. Enomoto. Software development process from natural language specification. In International Conference on Software Engineering. Pittsburgh, March 1989, pp. 64–73. [149] J. Tsai and T. Weigert. Knowledge-Based Software Development for Real-Time Distributed Systems. World Scientific Publishers, Singapore, 1993. [150] M. Vardi. Verification of concurrent programs — the automata-theoretic framework. In Proceedings of the 2nd IEEE Symposium on Logic in Computer Science, pp. 167–176. [151] T. Weigert and J. Tsai. A logic-based requirements language for the specification and analysis of real-time systems. In Proceedings of the 2nd Conference on Object-Oriented Real-Time Dependable Systems, Laguna Beach, 1996, pp. 8–16.
© 2006 by Taylor & Francis Group, LLC
Design and Verification Languages

7 Languages for Embedded Systems
Stephen A. Edwards

8 The Synchronous Hypothesis and Synchronous Languages
Dumitru Potop-Butucaru, Robert de Simone, and Jean-Pierre Talpin

9 Introduction to UML and the Modeling of Embedded Systems
Øystein Haugen, Birger Møller-Pedersen, and Thomas Weigert

10 Verification Languages
Aarti Gupta, Ali Alphan Bayazit, and Yogesh Mahajan
7
Languages for Embedded Systems

Stephen A. Edwards
Columbia University

7.1 Introduction
7.2 Software Languages
    Assembly Languages • The C Language • C++ • Java • Real-Time Operating Systems
7.3 Hardware Languages
    Verilog • VHDL
7.4 Dataflow Languages
    Kahn Process Networks • Synchronous Dataflow
7.5 Hybrid Languages
    Esterel • SDL • SystemC
7.6 Summary
References
7.1 Introduction

An embedded system is a computer masquerading as a non-computer that must perform a small set of tasks cheaply and efficiently. A typical system might have communication, signal processing, and user interface tasks to perform. Because the tasks must solve diverse problems, a language general-purpose enough to solve them all would be difficult to write, analyze, and compile. Instead, a variety of languages have evolved, each best suited to a particular problem domain. The most obvious divide is between languages for software and languages for hardware, but there are others. For example, a signal-processing language is often more convenient for a filtering problem than, say, assembly, but may be poor for control-dominated behavior. This chapter describes popular hardware, software, dataflow, and hybrid languages, each of which excels at certain problems. Dataflow languages are good for signal processing, and hybrid languages combine ideas from the other classes. Due to space constraints, this chapter describes only the main features of each language. The author's book on the subject [1] provides many more details on all of these languages.
Some of this chapter originally appeared in the Online Symposium for Electrical Engineers (OSEE).
7.2 Software Languages

Software languages describe sequences of instructions for a processor to execute (Table 7.1). As such, most consist of sequences of imperative instructions that communicate through memory: an array of numbers that hold their values until changed. Each machine instruction typically does little more than, say, add two numbers, so high-level languages aim to specify many instructions concisely and intuitively. Arithmetic expressions are typical: coding an expression such as ax² + bx + c in machine code is straightforward, tedious, and best done by a compiler. The C language provides such expressions, control-flow constructs such as loops and conditionals, and recursive functions. The C++ language adds classes as a way to build new data types, templates for polymorphic code, exceptions for error handling, and a standard library of common data structures. Java is a still higher-level language that provides automatic garbage collection, threads, and monitors to synchronize them.
7.2.1 Assembly Languages

An assembly language program (Figure 7.1) is a list of processor instructions written in a symbolic, human-readable form. Each instruction consists of an operation such as addition along with some operands. For example, add r5,r2,r4 might add the contents of registers r2 and r4 and write the result to r5. Such arithmetic instructions are executed in order, but branch instructions can perform conditionals and loops by changing the processor's program counter — the address of the instruction being executed.

A processor's assembly language is defined by its opcodes, addressing modes, registers, and memories. The opcode distinguishes, say, addition from conditional branch, and an addressing mode defines how and where data is gathered and stored (e.g., from a register or from a particular memory location). Registers can be thought of as small, fast, easy-to-access pieces of memory.

There are roughly four categories of modern assembly languages (Table 7.2). The oldest are those for the so-called complex instruction set computers, or CISC. These are characterized by a rich set of instructions and addressing modes. For example, a single instruction in Intel's x86 family, a typical CISC processor, can add the contents of a register to a memory location whose address is the sum of two other registers and a constant offset. Such instruction sets are usually convenient for human programmers, who are generally fairly skilled at using a heterogeneous set of tools, and the code itself is usually quite compact. Figure 7.1(a) illustrates a small program in x86 assembly.

By contrast, reduced instruction set computers (RISC) tend to have fewer instructions and much simpler addressing modes. The philosophy is that while you generally need more RISC instructions to accomplish something, it is easier for a processor to execute them because it does not need to deal with the complex cases, and easier for a compiler to produce them because they are simpler and more uniform. Figure 7.1(b) illustrates a small program in SPARC assembly.

TABLE 7.1 Software Language Features Compared
                          C     C++   Java
Expressions               •     •     •
Control-flow              •     •     •
Recursive functions       •     •     •
Exceptions                ◦     •     •
Classes and inheritance         •     •
Templates                       •     ◦
Namespaces                      •     •
Multiple inheritance            •     ◦
Threads and locks                     •
Garbage collection                    •

Note: •, full support; ◦, partial support.
(a)
        jmp   L2
L1:     movl  %ebx, %eax
        movl  %ecx, %ebx
L2:     xorl  %edx, %edx
        divl  %ebx
        movl  %edx, %ecx
        testl %ecx, %ecx
        jne   L1

(b)
        mov   %i0, %o1
        b     .LL3
        mov   %i1, %i0
        mov   %i0, %o1
        b     .LL3
        mov   %i1, %i0
.LL5:   mov   %o0, %i0
.LL3:   mov   %o1, %o0
        call  .rem, 0
        mov   %i0, %o1
        cmp   %o0, 0
        bne   .LL5
        mov   %i0, %o1

FIGURE 7.1 Euclid's algorithm in (a) i386 assembly (CISC) and (b) SPARC assembly (RISC). SPARC has more registers and must call a routine to compute the remainder (the i386 has a division instruction). The complex addressing modes of the i386 are not shown in this example.
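The handbook does not show the C source behind Figure 7.1, but Euclid's algorithm is short enough to reconstruct; the sketch below is one plausible version of the routine such compilers would translate.

/* Euclid's algorithm: repeatedly replace (a, b) with (b, a mod b)
   until b is zero; the a % b step is the divl/.rem call above. */
unsigned gcd(unsigned a, unsigned b)
{
    while (b != 0) {
        unsigned r = a % b;
        a = b;
        b = r;
    }
    return a;
}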
TABLE 7.2 Typical Modern Processor Architectures

CISC     RISC      DSP           Microcontroller
x86      SPARC     TMS320        8051
68000    MIPS      DSP56000      PIC
         ARM       ADSP-21xx     AVR

(a)
        move    #samples, r0
        move    #coeffs, r4
        move    #n-1, m0
        move    m0, m4
        movep   y:input, x:(r0)
        clr     a               x:(r0)+, x0     y:(r4)+, y0
        rep     #n-1
        mac     x0,y0,a         x:(r0)+, x0     y:(r4)+, y0
        macr    x0,y0,a         (r0)-
        movep   a, y:output

(b)
START:  MOV     SP, #030H
        ACALL   INITIALIZE
        ORL     P1, #0FFH
        SETB    P3.5
LOOP:   CLR     P3.4
        SETB    P3.3
        SETB    P3.4
WAIT:   JB      P3.5, WAIT
        CLR     P3.3
        MOV     A, P1
        ACALL   SEND
        SETB    P3.3
        AJMP    LOOP

FIGURE 7.2 (a) A finite impulse response filter in DSP56001 assembly. The mac instruction (multiply and accumulate) does most of the work, multiplying registers X0 and Y0, adding the result to accumulator A, fetching the next sample and coefficient from memory, and updating circular buffer pointers R0 and R4. The rep instruction repeats the mac instruction in a zero-overhead loop. (b) Writing to a parallel port in 8051 microcontroller assembly. This code takes advantage of the 8051's ability to operate on single bits.
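For comparison with Figure 7.2(a), here is a hedged C sketch of the same n-tap finite impulse response computation; the array names are invented, and a real implementation would use a circular buffer just as the mac instruction's post-increment addressing does.

/* One output sample of an n-tap FIR filter: the multiply-accumulate
   loop that a DSP collapses into a single repeated mac instruction. */
double fir(const double *samples, const double *coeffs, int n)
{
    double acc = 0.0;
    for (int i = 0; i < n; i++)
        acc += samples[i] * coeffs[i];
    return acc;
}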
The third category of assembly languages arises from more specialized processor architectures such as digital signal processors (DSPs) and very long instruction word (VLIW) processors. The operations in these instruction sets are simple, like those in RISC processors (e.g., add two registers), but they tend to be very irregular (only certain registers may be used with certain operations) and support a much higher degree of instruction-level parallelism. For example, Motorola's DSP56001 can, in a single instruction, multiply two registers, add the result to a third, load two registers from memory, and update two circular buffer pointers. However, the instruction severely limits which registers (and even which memory) it may use. Figure 7.2(a) shows a filter implemented in 56001 assembly.
The fourth category includes the instruction sets of small (4- and 8-bit) microcontrollers. In some sense, these combine the worst of all worlds: there are few instructions, each of which does little, much like a RISC processor, yet there are also significant restrictions on which registers can be used when, much like a CISC processor. The main advantage of such instruction sets is that they can be implemented very cheaply. Figure 7.2(b) shows a routine that writes to a parallel port in 8051 assembly.
7.2.2 The C Language

C is currently the most popular language for embedded system programming. C compilers exist for virtually every general-purpose processor, from the lowliest 4-bit microcontroller to the most powerful 64-bit processor for compute servers.

C was originally designed by Dennis Ritchie [2] as an implementation language for the Unix operating system being developed at Bell Labs for a 24K DEC PDP-11. Because the language was designed for systems programming, it provides very direct access to the processor through such constructs as untyped pointers and bit-manipulation operators, things appreciated today by embedded systems programmers. Unfortunately, the language also has many awkward aspects, such as the need to define everything before it is used, that are holdovers from the cramped execution environment in which it was first implemented.

A C program (Figure 7.3) contains functions built from arithmetic expressions structured with loops and conditionals. Instructions in a C program run sequentially, but control-flow constructs such as loops and conditionals can affect the order in which instructions execute. When control reaches a function call in an expression, control is passed to the called function, which runs until it produces a result, and control returns to continue evaluating the expression that called the function.

C derives its types from those a processor manipulates directly: signed and unsigned integers ranging from bytes to words, floating-point numbers, and pointers. These can be further aggregated into arrays and structures — groups of named fields.

C programs use three types of memory. Space for global data is allocated when the program is compiled, the stack stores automatic variables allocated and released when their function is called and returns, and the heap supplies arbitrarily sized regions of memory that can be deallocated in any order.

The C language is an ISO standard, but most people consult the book by Kernighan and Ritchie [3].

C succeeds because it can be compiled into very efficient code and because it allows the programmer almost arbitrarily low-level access to the processor when necessary. As a result, virtually every function can be written in C (exceptions include those that must manipulate specific processor registers) and can be expected to be fairly efficient. C's simple execution model also makes it fairly easy to estimate the efficiency of a piece of code and improve it if necessary.
#include <stdio.h>
#include <string.h>   /* for strlen */

int main(int argc, char *argv[])
{
    char *c;
    while (++argv, --argc > 0) {
        c = argv[0] + strlen(argv[0]);
        while (--c >= argv[0])
            putchar(*c);
        putchar('\n');
    }
    return 0;
}

FIGURE 7.3 A C program that prints each of its arguments backwards. The outermost while loop iterates through the arguments (count in argc, array of strings in argv), while the inner loop starts a pointer at the end of the current argument and walks it backwards, printing each character along the way. The ++ and -- prefixes increment or decrement the variable they are attached to before returning its value.
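The direct processor access mentioned above usually takes the form of casts and bit-manipulation operators applied to device registers. The sketch below is illustrative only: the addresses and bit positions are invented rather than taken from any real device.

#include <stdint.h>

/* Hypothetical memory-mapped control and data registers; volatile
   forces the compiler to perform every read and write. */
#define PORT_CTRL (*(volatile uint8_t *)0x4000u)
#define PORT_DATA (*(volatile uint8_t *)0x4001u)

#define PORT_ENABLE (1u << 0)  /* invented bit assignments */
#define PORT_READY  (1u << 7)

void port_send(uint8_t b)
{
    PORT_CTRL |= PORT_ENABLE;          /* set the enable bit */
    while (!(PORT_CTRL & PORT_READY))  /* busy-wait for the device */
        ;
    PORT_DATA = b;                     /* write the byte */
}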
While C compilers for workstation-class machines usually conform closely to the ANSI/ISO C standard, C compilers for microcontrollers are often much less standard. For example, they often omit support for floating-point arithmetic and certain library functions. Many also provide language extensions that, while often very convenient for the hardware for which they were designed, can make porting the code to a different environment very difficult.
7.2.3 C++

C++ (Figure 7.4) [4] extends C with structuring mechanisms for big programs: user-defined data types, a way to reuse code with different types, namespaces to group objects and avoid accidental name collisions when program pieces are assembled, and exceptions to handle errors. The C++ standard library includes a collection of efficient polymorphic data types, such as arrays, trees, and strings, for which the compiler generates custom implementations.

A class defines a new data type by specifying its representation and the operations that may access and modify it. Classes may be defined by inheritance, which extends and modifies existing classes. For example, a rectangle class might add length and width fields and an area method to a shape class.

A template is a function or class that can work with multiple types. The compiler generates custom code for each different use of the template. For example, the same min template could be used for both integers and floating-point numbers.

C++ also provides exceptions, a mechanism intended for error recovery. Normally, each method or function can only return directly to its immediate caller. Throwing an exception, however, allows control to return to an arbitrary caller, usually an error-handling mechanism in the main function or similar. Exceptions can be used, for example, to gracefully recover from out-of-memory conditions wherever they occur, without the tedium of checking whether every function encountered an out-of-memory condition.

Memory consumption is a disadvantage of C++'s exception mechanism. While most C++ compilers do not generate slower code when exceptions are enabled, they do generate larger executables by including tables that record the location of the nearest exception handler. For this reason, many compilers, such as GNU's gcc, have a flag that completely disables exceptions.
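C itself has no exceptions; its closest standard analogue is the setjmp/longjmp pair, which the sketch below uses to show the control transfer that C++ exceptions automate. The function names are invented for illustration.

#include <setjmp.h>
#include <stdio.h>

static jmp_buf recovery;  /* records where control should resume */

static void deeply_nested(void)
{
    /* on failure, jump straight back to the setjmp site,
       bypassing every intermediate caller */
    longjmp(recovery, 1);
}

int main(void)
{
    if (setjmp(recovery) == 0)
        deeply_nested();            /* normal path */
    else
        printf("recovered\n");      /* error path */
    return 0;
}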
#include <cmath>
#include <iostream>
using namespace std;

class Cplx {
    double re, im;
public:
    Cplx(double v) : re(v), im(0) {}
    Cplx(double r, double i) : re(r), im(i) {}
    double abs() const { return sqrt(re*re + im*im); }
    void operator+= (const Cplx& a) { re += a.re; im += a.im; }
};

int main()
{
    Cplx a(5), b(3,4);
    b += a;
    cout << b.abs() << '\n';
    return 0;
}

FIGURE 7.4 A C++ fragment illustrating a partial complex number type and how it can be used (the C++ library has a complete version). This class defines how to create a new complex number from either a scalar or by specifying the real and imaginary components, how to compute the absolute value of a complex number, and how to add a complex number to an existing one.
C++ is being used more and more within embedded systems, but it is sometimes a less suitable choice than C for a number of reasons. First, C++ is a much more complicated language that demands a much larger compiler, so C++ has been ported to fewer architectures than C. Second, certain language features such as dynamic dispatch (virtual function calls) and exceptions can be too costly to implement in very small embedded systems. Third, it is a more difficult language to learn and use properly, meaning there may be fewer qualified C++ programmers. Finally, it is often more difficult to estimate the cost of a particular construct in C++ because the object-oriented programming style encourages many more function calls than the procedural style of C, and their cost is harder to predict.
7.2.4 Java

Sun's Java language [5–7] resembles C++ but is not a superset of it. Like C++, Java is object-oriented, providing classes and inheritance. It is a higher-level language than C++ since it uses object references, arrays, and strings instead of pointers. Java's automatic garbage collection frees the programmer from memory management.

Java omits a number of C++'s more complicated features. Templates are absent, although there are plans to include them in a future release of the language because they make it possible to write type-safe container classes. Java also omits operator overloading, which can be a boon to readability (e.g., when performing operations on complex numbers) or a powerful obfuscating force. Java does not fully support C++'s complex multiple inheritance mechanism, but it does provide the notion of an interface — a set of methods provided by a class — that is equivalent to one of the most common uses of multiple inheritance.

Java provides concurrent threads (Figure 7.5). Creating a thread involves extending the Thread class, creating instances of these objects, and calling their start methods to start a new thread of control that executes the objects' run methods. Synchronizing a method or block uses a per-object lock to resolve contention when two or more threads attempt to access the same object simultaneously. A thread that attempts to gain a lock owned by another thread will block until the lock is released, which can be used to grant a thread exclusive access to a particular object.

For embedded systems, Java holds promise but also carries many caveats. On the positive side, it is a simple, powerful language that provides the programmer a convenient set of abstractions. For example, unlike C, Java provides true strings and variable-sized arrays. On the negative side, Java is a heavyweight language, even more so than C++. Its runtime system is large, consisting of either a bytecode interpreter, a just-in-time compiler, or perhaps both, and its libraries are vast. While work has been done on paring down these things, Java still requires a much larger footprint than C.

Unpredictable runtimes are a more serious problem for Java. For time-critical embedded systems, Java's automatic garbage collector, bytecode interpreter, or just-in-time compiler make runtimes both unpredictable and variable, making it difficult to assess efficiency both beforehand and in simulation.

The real-time Java specification [8] attempts to address many of these concerns. It introduces mechanisms for more precise control over the scheduling policy for concurrent threads (the standard Java specification is deliberately vague on this point to improve portability), memory regions for which automatic garbage collection can be disabled, synchronization mechanisms for avoiding priority inversion, and various other real-time features such as timers. It remains to be seen, however, whether this specification addresses enough real-time concerns and is sufficiently efficient to be practical. For example, a naive implementation of the memory management policies would be very inefficient.
7.2.5 Real-Time Operating Systems

Many embedded systems use a real-time operating system (RTOS) to simulate concurrency on a single processor. An RTOS manages multiple running processes, each written in a sequential language such as C. The processes perform the system's computation, and the RTOS schedules them.
import java.io.*;

class Counter {
    int value = 0;
    boolean present = false;
    public synchronized void count() {
        try { while (present) wait(); }
        catch (InterruptedException e) {}
        value++;
        present = true;
        notifyAll();
    }
    public synchronized int read() {
        try { while (!present) wait(); }
        catch (InterruptedException e) {}
        present = false;
        notifyAll();
        return value;
    }
}

class Count extends Thread {
    Counter cnt;
    public Count(Counter c) { cnt = c; start(); }
    public void run() { for (;;) cnt.count(); }
}

class Mod5 {
    public static void main(String args[]) {
        Counter c = new Counter();
        Count count = new Count(c);
        int v;
        for (;;)
            if ( (v = c.read()) % 5 == 0 )
                System.out.println(v);
    }
}

FIGURE 7.5 A contrived Java program that spawns a counting thread to print all numbers divisible by 5. The main method in the Mod5 class creates a new Counter, then a new Count object. The Count class extends the Thread class and spawns a new thread in its constructor by executing start. This invokes its run method, which calls the method count. Both count and read are synchronized, meaning at most one may run on a particular Counter object at once, here guaranteeing the counter is either counting or waiting for its value to be read.
[Figure 7.6 is a timing diagram of three processes under fixed-priority preemptive scheduling. Reading its annotations in order: A preempts B; A completes, allowing B to resume; B completes, allowing C to run; C completes, and A takes priority over B; A completes, allowing B to run; B completes.]
FIGURE 7.6 The behavior of an RTOS with fixed-priority preemptive scheduling. Rate-monotonic analysis gives process A the highest priority since it has the shortest period; C has the lowest.
An RTOS attempts to meet deadlines by deciding which process runs when. Labrosse [9] describes the implementation of a particular RTOS. Most RTOSes use fixed-priority preemptive scheduling, in which each process is given a particular priority (a small integer) when the system is designed (Figure 7.6). At any time, the RTOS runs the highest-priority runnable process, which is expected to run for a short period of time before suspending itself to wait for more data. Priorities are usually assigned using rate-monotonic analysis [10] (due to Liu and Layland [11]), which assigns higher priorities to processes that must meet more frequent deadlines.
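Liu and Layland's result also provides a quick schedulability check to accompany the priority assignment: n processes with execution times Ci and periods Ti are guaranteed to meet their deadlines under rate-monotonic priorities if the total utilization satisfies sum(Ci/Ti) <= n(2^(1/n) - 1). A minimal C sketch of this sufficient (not necessary) test, with invented task parameters:

#include <math.h>
#include <stdio.h>

/* Sufficient rate-monotonic schedulability test:
   sum(Ci/Ti) <= n * (2^(1/n) - 1), from Liu and Layland [11]. */
int rm_schedulable(const double *c, const double *t, int n)
{
    double u = 0.0;
    for (int i = 0; i < n; i++)
        u += c[i] / t[i];                     /* total utilization */
    return u <= n * (pow(2.0, 1.0 / n) - 1.0);
}

int main(void)
{
    /* invented example: three tasks (execution time, period) */
    double c[] = { 1.0, 2.0, 3.0 };
    double t[] = { 5.0, 10.0, 30.0 };
    printf("schedulable: %s\n", rm_schedulable(c, t, 3) ? "yes" : "no");
    return 0;
}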
[Figure 7.7 is a timing diagram of the priority-inversion scenario. Reading its annotations in order: L begins running; L acquires a lock on the resource; M preempts L; H preempts M; H blocks waiting for the lock, and M runs; M delays the execution of L; H misses its deadline.]
FIGURE 7.7 Priority inversion illustrated. When low-priority process L acquires a lock on a resource needed by process H, it effectively blocks process H, but then intermediate-priority process M preempts L, preventing it from running and releasing the resource needed by H. Priority inheritance, the common solution, temporarily raises the priority of L to that of H when H requests the resource held by L.
Priority inversion is a fundamental problem in fixed-priority preemptive scheduling that can lead to missed deadlines by enabling a lower-priority process to delay indefinitely the execution of a higher-priority one. Figure 7.7 illustrates the typical scenario: a low-priority process L runs and acquires a resource. Shortly thereafter, a high-priority process H preempts L, attempts to acquire the same resource, and blocks waiting for L to release it. This can cause H to miss its deadline even though it is at a higher priority than L. Even worse, if a process M with priority between L and H now starts, it can delay the execution of H indefinitely. Process M does not allow L to run since M is at a higher priority, so L cannot execute and release the lock, and H will continue to block. Priority inversion is usually solved with priority inheritance: when a lower-priority process holds a lock that a higher-priority process is waiting for, the holder's priority is temporarily raised to that of the waiter until it releases the lock, so intermediate-priority processes cannot preempt it. Many RTOSes provide a mechanism for doing this automatically.
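On systems with POSIX real-time threads, priority inheritance is typically requested when the lock is created rather than coded by hand. A minimal sketch, assuming the platform supports the PTHREAD_PRIO_INHERIT protocol:

#include <pthread.h>

pthread_mutex_t resource_lock;

/* Create a mutex whose owner temporarily inherits the priority of
   the highest-priority thread blocked waiting for it. */
int init_resource_lock(void)
{
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
    return pthread_mutex_init(&resource_lock, &attr);
}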
7.3 Hardware Languages

Concurrency and the notion of control are the fundamental differences between hardware and software. In hardware, every part of the "program" is always running, but in software, exactly one part of the program is running at any one time. Software languages naturally focus on sequential algorithms, while hardware languages enable concurrent function evaluation and speculation. Ironically, efficient simulation in software is a main focus of the hardware languages presented here, so their discrete-event semantics are a compromise between what would be ideal for hardware and what simulates efficiently.

Verilog [12,13] and VHDL [14–17] are the most popular languages for hardware description and modeling (Figure 7.8 and Figure 7.9). Both model systems with discrete-event semantics that ignore idle portions of the design for efficient simulation. Both describe systems with structural hierarchy: a system consists of blocks that contain instances of primitives, other blocks, or concurrent processes. Connections are listed explicitly.

Verilog provides more primitives geared specifically toward hardware simulation. VHDL's primitives are assignments such as a = b + c or procedural code. Verilog adds transistor and logic gate primitives, and allows new ones to be defined with truth tables.

Both languages allow concurrent processes to be described procedurally. Such processes sleep until awakened by an event that causes them to run, read and write variables, and suspend. Processes may wait for a period of time (e.g., #10 in Verilog, wait for 10ns in VHDL), a value change (@(a or b), wait on a,b), or an event (@(posedge clk), wait on clk until clk='1').
VHDL communication is more disciplined and flexible. Verilog communicates through wires or regs: shared memory locations that can cause race conditions. VHDL's signals behave like wires, but the resolution function may be user-defined. VHDL's variables are local to a single process unless declared shared. Verilog's type system models hardware with four-valued bit vectors and arrays for modeling memory. VHDL does not include four-valued vectors, but its type system allows them to be added. Furthermore, composite types such as C structs can be defined. Overall, Verilog is the leaner language, more directly geared toward simulating digital integrated circuits. VHDL is a much larger, more verbose language capable of handling a wider class of simulation and modeling tasks.
7.3.1 Verilog

Verilog was first devised in 1984 as an input language for a discrete-event simulator for digital hardware design. It was one of the first hardware description languages able to specify both the circuit and a test bench in the same language, which remains one of its strengths. Verilog has since been pressed into use as both a modeling language and a specification language. Although Verilog is still simulated frequently, it is also frequently fed to a logic synthesis system that translates it into an actual circuit. This is a technically challenging process, and not all Verilog constructs can be translated into hardware, since Verilog's semantics are nondeterministic and effectively defined by the behavior of an event-driven simulator.

Verilog provides both structural and behavioral modeling styles, and allows them to be combined at will. Consider the simple multiplexer circuit shown in Figure 7.8(a). It can be modeled in Verilog as a schematic composed of logic gates (Figure 7.8[b]), with a continuous assignment statement that represents logic using an expression (Figure 7.8[c]), with a truth table as a "user-defined primitive" (Figure 7.8[d]), or with imperative, event-driven code (Figure 7.8[e]).

The imperative modeling style is particularly useful for creating testbenches: models of an environment that stimulate a particular circuit and check its behavior. Figure 7.8(f) illustrates such a testbench, which instantiates a multiplexer (the instance is called "dut" — device under test) and starts a simple process (the initial block) to apply inputs and monitor outputs. Running Figure 7.8(f) in a Verilog simulator produces a partial truth table for the multiplexer.

As these examples illustrate, a Verilog program is composed of modules. Each module has an interface with named input and output ports and contains one or more instances of other modules, continuous assignments, and imperative code in initial and always blocks. Modules perform the same information-hiding function as functions in imperative languages: a module's contents are not visible from outside, and names for instances, wires, and the like inside a module do not have to differ from those in other modules.

Verilog programs manipulate four-valued bit vectors intended to model digital hardware. Each bit is 0, 1, X (unknown), or Z (used to represent an undriven tri-state bus). While such vectors are very convenient for modeling circuitry, one of Verilog's shortcomings is the lack of a more complicated type system. It does provide arrays of bit vectors but no other aggregate types.

The plumbing within a module comes in two varieties, one for structural modeling, the other for behavioral. Structural components, such as instances of primitive logic gates and other modules, communicate through wires, each of which may be connected to drivers such as gates or continuous assignments. Conceptually, the value of a wire is computed constantly from whatever drives it. Practically, the simulator evaluates the expression in a continuous assignment whenever any of its inputs changes. Behavioral components communicate through regs, which behave like memory in traditional programming languages. The value of a reg is set by an assignment statement executed within an initial or always block, and that value persists until the next time the reg is assigned. While a reg can be used to model a state-holding element such as a latch or flip-flop, it is important to remember that regs are really just memory.
Figure 7.8(e) illustrates this: a reg is used to store the output of the mux, even though it is not a state-holding element. This is because imperative code can only change the value of regs, not wires.
(a) [Gate-level schematic of the multiplexer: inverter g4 produces nsel from sel; AND gate g1 computes f1 from a and nsel; AND gate g2 computes f2 from b and sel; OR gate g3 combines f1 and f2 to drive the output f.]

(b)
module mux(f,a,b,sel);
  output f;
  input  a, b, sel;
  and g1(f1, a, nsel),
      g2(f2, b, sel);
  or  g3(f, f1, f2);
  not g4(nsel, sel);
endmodule

(c)
module mux(f,a,b,sel);
  output f;
  input  a, b, sel;
  assign f = sel ? a : b;
endmodule

(d)
primitive mux(f,a,b,sel);
  output f;
  input  a, b, sel;
  table
    1?0 : 1;
    0?0 : 0;
    ?11 : 1;
    ?01 : 0;
    11? : 1;
    00? : 0;
  endtable
endprimitive

(e)
module mux(f,a,b,sel);
  output f;
  input  a, b, sel;
  reg f;
  always @(a or b or sel)
    if (sel) f = a;
    else f = b;
endmodule

(f)
module testbench;
  reg a, b, sel;
  wire f;
  mux dut(f, a, b, sel);
  initial begin
    $display("a,b,sel -> f");
    $monitor($time,, "%b%b%b -> ", a, b, sel, f);
    a = 0; b = 0; sel = 0;
    #10 a = 1;
    #10 sel = 1;
    #10 b = 1;
    #10 sel = 0;
  end
endmodule

FIGURE 7.8 Verilog examples. (a) A multiplexer circuit, (b) the multiplexer described as a Verilog structural model, (c) the multiplexer described using a continuous assignment, (d) a user-defined primitive for the multiplexer, (e) the multiplexer described with imperative code, (f) a testbench for the multiplexer.
Verilog is a large language that contains many now-little-used features such as switch-level transistor models, pure event handling, and complicated delay specifications, all remnants of previous design methodologies. Today, switch-level modeling is rarely used because Verilog's precision is too low for circuits that take advantage of this behavior (a continuous simulator such as SPICE is preferred). Delays are rarely used because static timing analysis has replaced event-driven simulation as the timing analysis method of choice because of its speed and precision. Nevertheless, Verilog remains one of the most commonly used languages for hardware design. SystemVerilog, a recently introduced standard (2002), is an extension to the Verilog language designed to aid in the creation of large specifications. It adds a richer set of datatypes, including C-like structures, unions, and multidimensional arrays, a richer set of processes (e.g., an always_comb block has an
implied sensitivity to all variables it references), the concept of an interface to encapsulate communication and function between blocks, and many other features. Whether SystemVerilog supplants Verilog as a standard language for hardware specification remains to be seen, but it does have the advantage of being an obvious evolutionary improvement over previous versions of Verilog.
7.3.2 VHDL

The VHDL language (VHDL is a two-level acronym, standing for VHSIC [Very High Speed Integrated Circuit] Hardware Description Language) was designed to be a flexible modeling language for digital systems. It has fewer built-in features than Verilog, such as four-valued bit vectors and gate- and transistor-level models. Instead, it has very flexible type and package systems that allow such things to be specified in the language.

Unlike Verilog, VHDL draws a strong distinction between the interface to a hierarchical object and its implementation. VHDL interfaces are called entities and their implementations are called architectures. Figure 7.9 illustrates how these are used in a simple model: the entities are essentially named lists of ports, and the architectures consist of named lists of component instances. While this increases the verbosity of the language, it makes it possible to use different implementations, perhaps at differing levels of abstraction.

Like Verilog, VHDL supports structural, dataflow, and behavioral modeling styles, illustrated in Figure 7.9. As in Verilog, they can be mixed. In the three styles, an architecture is specified by listing components and their connections (structural), as a series of equations (dataflow, like Verilog's assign declarations), or as a sequence of imperative instructions (behavioral, like Verilog's always blocks).

In general, a process runs until it reaches a wait statement. This suspends the process until a particular event occurs, which may be an event on a signal, a condition on a signal, a timeout, or any combination of these. By itself, wait terminates a process. At the other extreme, wait on Clk until Clk = '1' for 5ns; waits for the clock to rise or for 5 ns, whichever comes first. Combinational processes, which always run in response to a change on any of their inputs, are common enough to warrant a shorthand. Thus, process(A, B, C) effectively executes a wait on A, B, C statement at the end.

VHDL's type system is much more elaborate than Verilog's. It provides integers, floating-point numbers, enumerations, and physical quantities. Integer and floating-point types include a range specification. For example, a 16-bit integer might be declared as

type address is range 16#0000# to 16#FFFF#;
Enumerated literals may be single characters or identifiers. Identifiers are useful for FSM states and single characters are useful for Boolean wire values. Typical declarations:

type Bit is ('0', '1');
type FourV is ('0', '1', 'X', 'Z');
type State is (Reset, Running, Halted);
Objects in VHDL, such as types, variables, and signals, have attributes such as size, base, and range. Such information can be useful for, say, iterating over all elements in an array. For example, if type Index is range 31 downto 0, then Index'LOW is 0. Access to information about signals can be used for collecting simulation statistics. For example, if Count is a signal, then Count'EVENT is true when there is an event on the signal.

VHDL has a powerful library and package facility for encapsulating and reusing definitions. For example, the standard logic library for VHDL includes types for representing wire states and standard functions such as AND and OR that operate on these types. Verilog has such facilities built in, but is not powerful enough to allow such functionality to be written as a library.
(a)
entity NAND is
  port (a: in Bit;
        b: in Bit;
        y: out Bit);
end NAND;

(b)
architecture arch1 of mux2 is
  signal cc, ai, bi : Bit;  -- internal signals
  component Inverter        -- component interface
    port (a: in Bit; y: out Bit);
  end component;
  component AndGate
    port (a1, a2: in Bit; y: out Bit);
  end component;
  component OrGate
    port (a1, a2: in Bit; y: out Bit);
  end component;
begin
  I1: Inverter port map(a => c, y => cc);              -- by name
  A1: AndGate  port map(a, c, ai);                     -- by position
  A2: AndGate  port map(a1 => b, a2 => cc, y => bi);
  O1: OrGate   port map(a1 => ai, a2 => bi, y => d);
end;

(c)
architecture arch2 of mux2 is
  signal cc, ai, bi : Bit;
begin
  cc <= not c;
  ai <= a and c;
  bi <= b and cc;
  d  <= ai or bi;
end;

(d)
architecture arch3 of mux2 is
begin
  process(a, b, c)  -- sensitivity list
  begin
    if c = '1' then d <= a;
    else d <= b;
    end if;
  end process;
end;

FIGURE 7.9 VHDL examples. Compare with Figure 7.8. (a) The entity declaration for the multiplexer, which defines its interface, (b) a structural description of the multiplexer from Figure 7.8(a), (c) a dataflow description with one equation per gate, (d) an imperative behavioral description.
7.4 Dataflow Languages

The hardware and software languages described earlier have semantics very close to that of their implementations (e.g., as instructions on a sequential processor or as digital logic gates), which makes for efficient realizations, but some problems are better described using different models of computation. Many embedded systems perform signal processing tasks such as reconstructing a compressed audio signal. While such tasks can be described and implemented using the hardware and software languages described earlier, signal processing tasks are more conveniently represented with systems of processes
that communicate through queues. Although clumsy for general applications, dataflow languages are a perfect fit for signal-processing algorithms, which use vast quantities of arithmetic derived from linear system theory to decode, compress, or filter data streams that represent periodic samples of continuously changing values such as sound or video. Dataflow semantics are natural for expressing the block diagrams typically used to describe signal-processing algorithms, and their regularity makes dataflow implementations very efficient because otherwise costly run-time scheduling decisions can be made at compile time, even in systems containing multiple sampling rates.
7.4.1 Kahn Process Networks

Kahn Process Networks [18] form a formal basis for dataflow computation. Kahn's systems consist of processes that communicate exclusively through unbounded point-to-point first-in, first-out queues (Figure 7.10). Reading from a port makes a process wait until data is available, but writing to a port always completes immediately.

Determinism is the most distinctive aspect of Kahn's networks. The blocking-read behavior of processes guarantees that the overall system behavior (specifically, the sequence of data tokens that flow through each queue) is the same regardless of the relative execution rates of the processes, that is, regardless of the scheduling policy. This is generally a very desirable property because it provides a guarantee about the behavior of the system, ensures that simulation and reality will match, and greatly simplifies the design task, since a designer is not obligated to ensure determinism herself.

The challenge in scheduling a Kahn network is balancing the processes' relative execution rates to avoid an unbounded accumulation of tokens.

process f(in int u, in int v, out int w)
{
  int i;
  bool b = true;
  for (;;) {
    i = b ? wait(u) : wait(v);   /* alternate between the two inputs */
    printf("%i\n", i);
    send(i, w);
    b = !b;
  }
}

process g(in int u, out int v, out int w)
{
  for (;;) {
    send(wait(u), v);
    send(wait(u), w);
  }
}

process h(in int u, out int v, int init)
{
  send(init, v);                 /* emit an initial token */
  for (;;) send(wait(u), v);
}

channel int X, Y, Z, T1, T2;
f(Y, Z, X);
g(X, T1, T2);
h(T1, Y, 0);
h(T2, Z, 1);
FIGURE 7.10 A Kahn Process Network written in a C-like dialect. Here, processes are functions that run continuously, may be attached to communication channels, and may call wait to wait for data on a particular port and send to write data to a particular port. The f process alternately copies from its u and v ports to its w port; the g process does the opposite, copying its u port alternately to v and w; and h simply copies its input to its output after first emitting an initial token.
FIGURE 7.11 A modem in SDF. Each node represents a process. The labels on each arc indicate the number of tokens sent or received by a process each time it fires.
One general approach, proposed in Parks' thesis [19], places artificial limits on the size of each buffer. Any process that writes to a full buffer blocks until space is available, but if the system deadlocks because all buffers are full, the scheduler increases the capacity of the smallest buffer. In practice, Kahn networks are rarely used in their pure form since they are fairly costly to schedule, and their completely deterministic behavior is sometimes overly restrictive: they cannot easily handle sporadic events (e.g., an occasional change of volume level in a digital volume control) or server-like behavior where the environment may make requests in an unpredictable order. Nevertheless, Kahn's model still has useful properties and forms a starting point for other dataflow models.
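Parks' resolution rule is simple enough to sketch directly. The following C fragment illustrates just that rule, under the assumption that the scheduler has already detected a global deadlock in which every blocked process is waiting to write to a full buffer; the buffer sizes and the doubling policy are invented for the example.

#include <stdio.h>

#define NBUF 4

int capacity[NBUF] = {2, 4, 1, 8};
int fill[NBUF]     = {2, 4, 1, 3};   /* buffers 0, 1, and 2 are full */

/* Return the index of the smallest full buffer, or -1 if none is full. */
int smallest_full_buffer(void)
{
    int best = -1;
    for (int i = 0; i < NBUF; i++)
        if (fill[i] == capacity[i] &&
            (best < 0 || capacity[i] < capacity[best]))
            best = i;
    return best;
}

int main(void)
{
    /* On an artificial deadlock, grow the smallest full buffer and
       resume execution; doubling is one possible growth policy. */
    int b = smallest_full_buffer();
    if (b >= 0) {
        capacity[b] *= 2;
        printf("deadlock: grew buffer %d to capacity %d\n", b, capacity[b]);
    }
    return 0;
}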
7.4.2 Synchronous Dataflow

Lee and Messerschmitt's [20] Synchronous Dataflow (SDF) fixes the communication patterns of the blocks in a Kahn network (Figure 7.11 is an example after Bhattacharyya et al. [21]). Each time a block runs, it consumes and produces a fixed number of data tokens on each of its ports. Although more restrictive than Kahn networks, SDF's predictability allows it to be scheduled completely at compile time, producing very efficient code.

Scheduling operates in two steps. First, the rate at which each block fires is established by considering the production and consumption rates at the source and sink of each queue. For example, the arc between the Hil and Eq nodes in Figure 7.11 implies that Hil runs twice as frequently as Eq. Once the rates are established, any algorithm that simulates the execution of the network without buffer underflow will produce a correct schedule, if one exists. More sophisticated techniques, however, reduce generated code and buffer sizes by better ordering the execution of the blocks (see Bhattacharyya et al. [22]).

Synchronous dataflow specifications are built by assembling blocks typically written in an imperative language such as C. The SDF block interface is specific enough to make it easy to create libraries of general-purpose blocks such as adders, multipliers, and even FIR filters. While SDF is often used as a simulation language, it is also well-suited to code generation. It enables a practical technique for generating code for digital signal processors, for which C compilers often cannot generate efficient code: assembly code is handcrafted for each block in a library, and code synthesis consists of assembling these handwritten blocks, sometimes generating extra code that handles the interblock buffers. For large, specialized blocks such as fast Fourier transforms, this can be very effective because most of the generated code has been carefully optimized by hand.
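The first scheduling step described above amounts to solving the balance equations: for every arc, the source's firing rate times the tokens it produces per firing must equal the sink's rate times the tokens it consumes. A minimal C sketch of this rate computation follows; the three-node chain it solves is a made-up example, not the modem of Figure 7.11, and the sketch assumes a connected, consistent graph (a real scheduler would also check for rate conflicts).

#include <stdio.h>

typedef struct { int src, dst, prod, cons; } Arc;

int gcd(int a, int b) { return b ? gcd(b, a % b) : a; }

enum { NNODES = 3, NARCS = 2 };

int main(void)
{
    /* Chain A -> B -> C: A produces 2 tokens per firing, B consumes 1;
       B produces 1 token per firing, C consumes 3. */
    Arc arcs[NARCS] = { {0, 1, 2, 1}, {1, 2, 1, 3} };
    int num[NNODES] = {1, 0, 0};     /* firing rates as fractions;   */
    int den[NNODES] = {1, 0, 0};     /* node 0 is pinned to rate 1/1 */

    /* Propagate rates along arcs: rate[dst] = rate[src] * prod / cons.
       For a connected graph, NNODES passes reach every node. */
    for (int pass = 0; pass < NNODES; pass++)
        for (int i = 0; i < NARCS; i++) {
            Arc a = arcs[i];
            if (den[a.src] && !den[a.dst]) {
                num[a.dst] = num[a.src] * a.prod;
                den[a.dst] = den[a.src] * a.cons;
                int g = gcd(num[a.dst], den[a.dst]);
                num[a.dst] /= g;
                den[a.dst] /= g;
            }
        }

    /* Scale by the least common multiple of the denominators to get
       the smallest integer repetition vector. */
    int l = 1;
    for (int i = 0; i < NNODES; i++) l = l / gcd(l, den[i]) * den[i];
    for (int i = 0; i < NNODES; i++)
        printf("node %d fires %d times per iteration\n",
               i, num[i] * (l / den[i]));
    return 0;       /* prints 3, 6, and 2 firings, respectively */
}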
7.5 Hybrid Languages

The languages in this section use even more novel models of computation than the hardware, software, or dataflow languages presented earlier (Table 7.3). While such languages are more restrictive than general-purpose ones, they are much better suited to certain applications. Esterel excels at discrete control by blending software-like control flow with the synchrony and concurrency of hardware.
TABLE 7.3 Hybrid Language Features Compared. The table rates Esterel, SDL, and SystemC on the following features: Concurrency, Hierarchy, Preemption, Determinism, Synchronous communication, Buffered communication, FIFO communication, Procedural, Finite-state machines, Dataflow, Multi-rate dataflow, Software implementation, and Hardware implementation. Note: •, full support; ◦, partial support.
Communication protocols are SDL's forte; it uses extended finite-state machines with single input queues. SystemC provides a very flexible discrete-event simulation environment built on C++.
7.5.1 Esterel

Intended for specifying control-dominated reactive systems, Esterel [23] combines the control constructs of an imperative software language with concurrency, preemption, and a synchronous model of time like that used in synchronous digital circuits. In each clock cycle, the program awakens, reads its inputs, produces outputs, and suspends. An Esterel program communicates through signals that are either present or absent each cycle. In each cycle, each signal is absent unless an emit statement for the signal runs and makes the signal present for that cycle only. Esterel guarantees determinism by requiring each emitter of a signal to run before any statement that tests the signal.

Esterel is strongest at specifying hierarchical state machines. In addition to sequentially composing statements (separated by a semicolon), it has the ability to compose arbitrary blocks of code in parallel (the double vertical bars) and abort or suspend a block of code when a condition is true. For example, the every-do construct in Figure 7.12 effectively wraps a reset statement around two state machines running in parallel.
7.5.2 SDL

SDL is a graphical specification language developed for describing telecommunication protocols defined by the ITU [24] (Ellsberger [25] is more readable). A system consists of concurrently running FSMs, each with a single input queue, connected by channels that define which messages they carry. Each FSM consumes the oldest message in its queue, reacts to it by changing internal state or sending messages to other FSMs, changes to its next state, and repeats the process. Each FSM is deterministic, but because messages from other FSMs may arrive in any order, owing to varying execution speeds and communication delays, an SDL system may behave nondeterministically.

In addition to a fairly standard textual format, SDL has a formalized graphical notation with three types of diagrams. Flowcharts define the behavior of state machines at the lowest level (Figure 7.13). Block diagrams illustrating the communication among state machines local to a single processor are at the next level up; each communication channel is labeled with the set of messages that it conveys. The top level is another block diagram that depicts the communication among processors.
module Example:
input S, I;
output O;
signal R, A in
  every S do
    await I;
    weak abort
      sustain R
    when immediate A;
    emit O
  ||
    loop
      pause; pause;
      present R then emit A end;
    end
  end
end
end module
FIGURE 7.12 An Esterel program modeling a shared resource. This implements two parallel threads (separated by ||): one waits for an I signal, then asserts R until it receives an A from the other thread, and finally emits an O. Meanwhile, the second thread emits an A in response to an R in alternate cycles.
FIGURE 7.13 A fragment of an SDL flowchart specification for a TCP protocol. The rounded boxes denote states (Estab, wait1, and Closed). Immediately below Estab are inward-pointing boxes that receive signals (Close, Packet). The square and diamond boxes below these are actions and decisions. The outward-pointed boxes (e.g., Packet) emit signals.
The communication channels in these diagrams are also labeled with the signals they convey, but are assumed to have significant delay, unlike the channels among FSMs in a single processor.

The behavior of an SDL state machine is straightforward. At the beginning of each cycle, it gets the next signal in its input queue and sees if there is a "receive" block for that signal off the current state. If there is, the associated code is executed, possibly emitting other signals and moving to a next state. Otherwise, the signal is simply discarded and the cycle repeats from the same state. By itself, such semantics have a hard time dealing with signals that arrive out of order, but SDL has an additional construct for handling this condition. The save construct is like the receive construct, appearing immediately below a state and matching a signal, but when it matches it stores the signal in a buffer that holds it until another state has a matching rule.
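The consume-dispatch-discard cycle is easy to picture in code. Here is a minimal C sketch of one such state machine, loosely modeled after the TCP fragment of Figure 7.13; the state and signal names are illustrative, the save construct is omitted, and no claim is made that this is what an SDL tool would generate.

#include <stdio.h>

typedef enum { ESTAB, WAIT1, CLOSED } State;
typedef enum { CLOSE, PACKET, ACK } Signal;

/* One reaction: consume the signal at the head of the input queue;
   if the current state has a receive rule for it, run the action and
   move to the next state; otherwise the signal is simply discarded. */
State step(State s, Signal in)
{
    switch (s) {
    case ESTAB:
        if (in == CLOSE)  return WAIT1;   /* start closing handshake */
        if (in == PACKET) return ESTAB;   /* process data, stay here */
        break;
    case WAIT1:
        if (in == ACK)    return CLOSED;
        break;
    case CLOSED:
        break;
    }
    return s;   /* no matching receive: discard, keep the same state */
}

int main(void)
{
    Signal queue[] = { PACKET, CLOSE, PACKET, ACK };
    State s = ESTAB;
    for (unsigned i = 0; i < sizeof queue / sizeof queue[0]; i++)
        s = step(s, queue[i]);            /* third PACKET is dropped */
    printf("final state: %d\n", s);       /* prints 2, i.e., CLOSED  */
    return 0;
}

A save rule would instead park the unexpected PACKET in a buffer for a later state to consume.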
7.5.3 SystemC

The SystemC language (Figure 7.14) models systems in C++. A SystemC specification is simulated by compiling it with a standard C++ compiler and linking in freely distributed class libraries from www.systemc.org. SystemC builds systems from Verilog- and VHDL-like modules. Each has a collection of I/O ports and may contain instances of other modules or processes defined by a block of C++ code.

SystemC uses a discrete-event simulation model. The SystemC scheduler executes the code in a process in response to an event, such as a clock edge or the expiration of a delay. This model resembles that used in Verilog and VHDL, but has the flexibility of operating within a general-purpose programming language.

SystemC began life aiming to replace Verilog or VHDL as a hardware description language (it did not offer designers a sufficiently compelling reason to switch), but has since moved beyond that. Very often in system design, it is desirable to run simulations to estimate such high-level behavior as bus activity or memory accesses. Historically, designers wrote custom simulators in a general-purpose language such as C, but this was time-consuming because of the need to write a new simulation kernel (i.e., something that provided concurrency) for each new simulator. SystemC is emerging as a standard for writing system-level simulations. While not perfect, it works well enough and makes it fairly easy to glue large pieces of existing software together. Although Verilog has a PLI (programming language interface) that allows arbitrary C/C++ code to be linked and run simultaneously with a simulation, the higher integration of the SystemC approach is more efficient.
#include "systemc.h"

struct complex_mult : sc_module {
  sc_in<int> a, b;      // first operand: real and imaginary parts
  sc_in<int> c, d;      // second operand: real and imaginary parts
  sc_out<int> x, y;     // product: real and imaginary parts
  sc_in_clk clock;

  void do_mult() {
    for (;;) {
      x = a * c - b * d;   // real part
      wait();
      y = a * d + b * c;   // imaginary part
      wait();
    }
  }

  SC_CTOR(complex_mult) {
    SC_CTHREAD(do_mult, clock.pos());
  }
};
FIGURE 7.14 A SystemC model for a complex multiplier.
SystemC supports transaction-level modeling, in which bus transactions, rather than being modeled on a per-cycle basis as would be done in a language such as Verilog, are modeled as function calls. For example, a burst-mode bus transfer would be modeled with a function that marks the bus as in use, advances simulation time according to the number of bytes to be transferred, actually copies the data in the simulator, and marks the bus as unused. Nowhere in the simulation would the actual sequence of signals and bits transferred over the bus appear.
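The burst-mode transfer just described might be sketched as follows. This C fragment is purely illustrative: the bus_t type, the per-byte timing model, and the advance_time() helper are stand-ins for whatever a real simulator provides, not part of SystemC's actual API.

#include <stdio.h>
#include <string.h>

typedef struct { int busy; double ns_per_byte; } bus_t;

static double sim_time_ns = 0.0;
static void advance_time(double ns) { sim_time_ns += ns; }

/* A whole burst transfer is one function call: mark the bus in use,
   charge simulated time for the bytes transferred, copy the data,
   and release the bus. No individual bus signals are modeled. */
int burst_write(bus_t *bus, void *dst, const void *src, size_t n)
{
    if (bus->busy) return -1;            /* caller must arbitrate/retry */
    bus->busy = 1;
    advance_time(n * bus->ns_per_byte);
    memcpy(dst, src, n);
    bus->busy = 0;
    return 0;
}

int main(void)
{
    bus_t bus = { 0, 2.5 };              /* 2.5 ns per byte, made up */
    char src[64] = "payload", dst[64];
    burst_write(&bus, dst, src, sizeof src);
    printf("t = %.1f ns, dst = \"%s\"\n", sim_time_ns, dst);
    return 0;                            /* t = 160.0 ns */
}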
7.6 Summary

Currently, most embedded systems are programmed using C for software and Verilog, or possibly VHDL, for hardware components such as FPGAs or ASICs, but this will probably change. The increasing complexity of such designs makes a compelling case for different, higher-level languages. Years ago, designers made the jump from assembly to C, and the higher-level constructs of Java are growing more attractive despite its performance loss. Domain-specific languages, especially for signal-processing problems, already have a significant beachhead and will continue to make inroads. Most signal processing algorithms are already prototyped in a higher-level language (Matlab), but it remains to be seen whether synthesis from Matlab will ever be practical.

For hardware, the direction is less clear. While modeling languages such as SystemC will continue to grow in importance, there is currently no clear winner for the successor to VHDL and Verilog. Roughly a decade ago, a different, higher-level subset of VHDL and Verilog was proposed as the new "behavioral" synthesis subset, but it did not catch on because it was too limiting, largely owing to restrictions placed on it by the synthesis algorithms. Additions such as SystemVerilog are incremental, if helpful, improvements, but will not provide the quantum leap forward that synthesis from the RTL (register-transfer level) subsets of Verilog and VHDL provided. Future hardware languages may well contain constructs such as Esterel's.
References [1] Stephen A. Edwards. Languages for Digital Embedded Systems. Kluwer, Boston, MA, September 2000. [2] Dennis M. Ritchie. The Development of the C Language. In History of Programming Languages II. Thomas J. Bergin, Jr. and Richard G. Gibson, Jr., Eds. ACM Press, New York and Addison-Wesley, Reading, MA, 1996. [3] Brian W. Kernighan and Dennis M. Ritchie. The C Programming Language, 2nd ed. Prentice Hall, Upper Saddle River, NJ, 1988. [4] Bjarne Stroustrup. The C++ Programming Language, 3rd ed. Addison-Wesley, Reading, MA, 1997. [5] Ken Arnold, James Gosling, and David Holmes. The Java Programming Language, 3rd ed. Addison-Wesley, Reading, MA, 2000. [6] James Gosling, Bill Joy, Guy Steele, and Gilad Bracha. The Java Language Specification, 2nd ed. Addison-Wesley, Reading, MA, 2000. [7] Tim Lindholm and Frank Yellin. The Java Virtual Machine Specification. Addison-Wesley, Reading, MA, 1999. [8] Greg Bollella, Ben Brosgol, Peter Dibble, Steve Furr, James Gosling, David Hardin, Mark Turnbull, Rudy Belliardi, Doug Locke, Scott Robbins, Pratik Solanki, and Dionisio de Niz. The Real-Time Specification for Java. Addison-Wesley, Reading, MA, 2000. [9] Jean Labrosse. MicroC/OS-II. CMP Books, Lawrence, Kansas, 1998. [10] Loic P. Briand and Daniel M. Roy. Meeting Deadlines in Hard Real-Time Systems: The Rate Monotonic Approach. IEEE Computer Society Press, New York, 1999.
[11] C. L. Liu and James W. Layland. Scheduling Algorithms for Multiprogramming in a Hard Real-Time Environment. Journal of the Association for Computing Machinery, 20: 46–61, 1973. [12] IEEE Computer Society. IEEE Standard Hardware Description Language Based on the Verilog Hardware Description Language (1364–1995). IEEE Computer Society Press, New York, 1996. [13] Donald E. Thomas and Philip R. Moorby. The Verilog Hardware Description Language, 4th ed. Kluwer, Boston, MA, 1998. [14] IEEE Computer Society. IEEE Standard VHDL Language Reference Manual (1076–1993). IEEE Computer Society Press, New York, 1994. [15] Douglas L. Perry. VHDL, 3rd ed. McGraw-Hill, New York, 1998. [16] Ben Cohen. VHDL Coding Styles and Methodologies, 2nd ed. Kluwer, Boston, MA, 1999. [17] Peter J. Ashenden. The Designer's Guide to VHDL. Morgan Kaufmann, San Francisco, CA, 1996. [18] Gilles Kahn. The Semantics of a Simple Language for Parallel Programming. In Information Processing 74: Proceedings of IFIP Congress 74. North-Holland, Stockholm, Sweden, August 1974, pp. 471–475. [19] Thomas M. Parks. Bounded Scheduling of Process Networks. PhD thesis, University of California, Berkeley, 1995. Available as UCB/ERL M95/105. [20] Edward A. Lee and David G. Messerschmitt. Synchronous Data Flow. Proceedings of the IEEE, 75: 1235–1245, 1987. [21] Shuvra S. Bhattacharyya, Praveen K. Murthy, and Edward A. Lee. Synthesis of Embedded Software from Synchronous Dataflow Specifications. Journal of VLSI Signal Processing Systems, 21: 151–166, 1999. [22] Shuvra S. Bhattacharyya, Rainer Leupers, and Peter Marwedel. Software Synthesis and Code Generation for Signal Processing Systems. IEEE Transactions on Circuits and Systems — II: Analog and Digital Signal Processing, 47: 849–875, 2000. [23] Gérard Berry and Georges Gonthier. The Esterel Synchronous Programming Language: Design, Semantics, Implementation. Science of Computer Programming, 19: 87–152, 1992. [24] International Telecommunication Union. ITU-T Recommendation Z.100: Specification and Description Language. International Telecommunication Union, Geneva, 1999. [25] Jan Ellsberger, Dieter Hogrefe, and Amardeo Sarma. SDL: Formal Object-Oriented Language for Communicating Systems, 2nd ed. Prentice Hall, Upper Saddle River, NJ, 1997.
8
The Synchronous Hypothesis and Synchronous Languages

Dumitru Potop-Butucaru, IRISA
Robert de Simone, INRIA
Jean-Pierre Talpin, IRISA

Partly supported by the ARTIST IST European project.

8.1 Introduction
8.2 The Synchronous Hypothesis
    What For? • Basic Notions • Mathematical Models • Implementation Issues
8.3 Imperative Style: Esterel and SyncCharts
    Syntax and Structure • Semantics • Compilation and Compilers • Analysis/Verification/Test Generation: Benefits from Formal Approaches
8.4 The Declarative Style: Lustre and Signal
    A Synchronous Model of Computation • Declarative Design Languages • Compilation of Declarative Formalisms
8.5 Success Stories — A Viable Approach for System Design
8.6 Into the Future: Perspectives and Extensions
    Asynchronous Implementation of Synchronous Specifications
References
8.1 Introduction

Electronic Embedded Systems are not new, but their pervasive introduction in ordinary-life objects (cars, phones, home appliances) brought a new focus onto design methods for such systems. New development techniques are needed to meet the challenges of productivity in a competitive environment. This handbook reports on a number of such innovative approaches to the matter. Here, we shall concentrate on Synchronous Reactive (S/R) languages [1–4].

S/R languages rely on the synchronous hypothesis, which divides computations and behaviors into a discrete sequence of computation steps that are equivalently called reactions or execution instants. In itself,
this assumption is rather common in practical embedded system design. But the synchronous hypothesis adds the fact that, inside each instant, the behavioral propagation is well-behaved (causal), so that the status of every signal or variable is established and defined prior to being tested or used. This criterion, which may be seen at first as an isolated technical requirement, is in fact the key point of the approach. It ensures strong semantic soundness by allowing universally recognized mathematical models such as Mealy machines and digital circuits to be used as supporting foundations. In turn, these models give access to a large corpus of efficient optimization, compilation, and formal verification techniques. The synchronous hypothesis also guarantees full equivalence between various levels of representation, thereby avoiding altogether the pitfalls of nonsynthesizability of other similar formalisms. In that sense, the synchronous hypothesis is, in our view, a major contribution to the goal of model-based design of embedded systems.

Structured languages have been introduced for the modeling and programming of S/R applications. They are roughly classified into two families:

Imperative languages, such as Esterel [5–7] and SyncCharts [8], provide constructs to shape control-dominated programs such as hierarchical synchronous automata, in the wake of the StateCharts formalism, but with a full-fledged treatment of simultaneity, priority, and absence notification of signals in a given reaction. Thanks to this, signals assume a consistent status for all parallel components in the system at any given instant.

Declarative languages, such as Lustre [9] and Signal [10], shape applications based on intensive data computation and data-flow organization, with the control-flow part operating in the form of (internally generated) activation clocks. These clocks prescribe which data computation blocks are to be performed as part of the current reaction. Here again, the semantics of the languages deal with the issue of behavior consistency, so that every value needed in a computation is indeed available at that instant.

Here, we shall describe the synchronous hypothesis and its mathematical background, together with a range of design techniques empowered by the approach and a short comparison with neighboring formalisms; then, we introduce both classes of S/R languages, with their special features and a couple of programming examples; finally, we comment on the benefits and shortcomings of S/R modeling, concluding with a look at future perspectives and extensions.
8.2 The Synchronous Hypothesis

8.2.1 What For?

Program correctness (the process performs as intended) and program efficiency (it performs as fast as possible) are major concerns in computer science, but they are even more stringent in the embedded area, as no online debugging is feasible and time budgets are often imperative (for instance, in multimedia applications).

Program correctness is sought by introducing appropriate syntactic constructs and dedicated languages, making programs more easily understandable by humans, as well as allowing high-level modeling and associated verification techniques. Provided semantic preservation is ensured down to the actual implementation code, this gives reasonable guarantees on functional correctness. However, while this might sound obvious for traditional software compilation schemes, the hardware synthesis process is often not "seamless," as it includes manual rewriting.

Program efficiency is traditionally handled in the software world by algorithmic complexity analysis, expressed in terms of individual operations. But in modern systems, owing to a number of phenomena, this "high-level" complexity reflects rather imperfectly the "low-level" complexity in numbers of clock cycles spent. In the hardware domain, one considers various levels of modeling, corresponding to more abstract (or, conversely, more precise) timing accounts: transaction-level, cycle-accurate, and time-accurate.
One possible way (amongst many) to view synchronous languages is to take up the analogy of cycle-accurate programming in a more general setting, including (reactive) software as well. This analogy is supported by the fact that simulation environments in many domains (from scientific engineering to Hardware Description Language [HDL] simulators) often use lockstep computation paradigms, very close to the synchronous cycle-based computation. In these settings, cycles represent logical steps, not physical time. Of course, timing analysis is still possible afterwards, and is in fact often simplified by the previous division into cycles.

The focus of synchronous languages is thus to allow modeling and programming of systems where cycle (computation step) precision is needed. The objective is to provide domain-specific structured languages for their description, and to study matching techniques for efficient design, including compilation/synthesis, optimization, and analysis/verification. The strong condition ensuring the feasibility of these design activities is the synchronous hypothesis, described in Section 8.2.2.
8.2.2 Basic Notions

What has come to be known as the synchronous hypothesis, laying the foundations for S/R systems, is really a collection of assumptions of a common nature, sometimes adapted to the framework considered. We shall avoid heavy mathematical formalization in this presentation, and defer the interested reader to the existing literature, such as References 3 and 4. The basics are:

Instants and reactions. Behavioral activities are divided according to (logical, abstract) discrete time. In other words, computations are divided into a succession of nonoverlapping execution instants. In each instant, input signals possibly occur (for instance, by being sampled), internal computations take place, and control and data are propagated until output values are computed and a new global system state is reached. This execution cycle is called the reaction of the system to the input signals. Although we used the word "time" just before, there is no real physical time involved, and instant durations need not be uniform (or even considered!). All that is required is that reactions converge and computations are entirely performed before the current execution instant ends and a new one begins. This supports the obvious conceptual abstraction that computations are infinitely fast ("instantaneous," "zero-time") and take place only at discrete points in (physical) time, with no duration. When presented without sufficient explanation, this strong formulation of the synchronous hypothesis is often discarded by newcomers as unrealistic (while, again, it is only an abstraction, amply used in other domains where "all-or-nothing" transaction operations take place).

Signals. Broadcast signals are used to propagate information. At each execution instant, a signal can be either present or absent. If present, it also carries some value of a prescribed type ("pure" signals exist as well, which carry only their presence status). The key rule is that a signal must be consistent (same present/absent status, same data) for all read operations during any given instant. In particular, reads from parallel components must be consistent, meaning that signals act as controlled shared variables.

Causality. The crucial task of deciding when a signal can be declared absent is of utmost importance in the theory of S/R systems, and an important part of the theoretical body behind the synchronous hypothesis. This is of course especially true for local signals, which are both generated and tested inside the system. The fundamental rule is that the present status and value of a signal should be defined before they are read (and tested). This requirement takes various practical forms depending on the actual language or formalism considered, and we shall come back to this later. Here, note that "before" refers to causal dependency in the computation of the instant, and not to physical or even logical time between successive instants [11]. The synchronous hypothesis ensures that all possible schedules of operations amount to the same result (convergence); it also leads to the definition of "correct" programs, as opposed to ill-behaved ones where no causal scheduling can be found.

Activation conditions and clocks. Each signal can be seen as defining (or generating) a new clock, ticking when it occurs; in hardware design, this is called gated clocks. Clocks and sub-clocks, either externally or internally generated, can be used as control entities to activate (or not) component blocks of the system.
We shall also call them activation conditions.
8.2.3 Mathematical Models

If one temporarily forgets about data values, and one accepts the duality of present/absent signals mapped onto true/false values, then there is a natural interpretation of synchronous formalisms as synchronous digital circuits at schematic gate level, or "netlists" (roughly the RTL level with only Boolean variables and registers). In turn, such circuits have a straightforward behavioral expansion into Mealy Finite State Machines (FSMs). The two slight restrictions given here are not essential: the adjunction of types and values into digital circuit models has been successfully attempted in a number of contexts, and S/R systems can also be seen as contributing to this goal. Meanwhile, the introduction of clocks and present/absent signal status in S/R languages departs drastically from the prominent notion of sensitivity list generally used to define the simulation semantics of HDLs. We now comment on the opportunities made available through the interpretation of S/R systems into Mealy machines or netlists.

Netlists. Here, we consider netlists in a simple form, as Boolean equation systems defining the values of wires and Boolean registers as Boolean functions of other wires and previous register values. Some wires represent input and output signals (with the value true indicating signal presence); others are internal variables. This type of representation is of special interest because it can provide exact dependency relations between variables, and thus a good representation level at which to study causality issues with accurate analysis. Notions of "constructive" causality have been the subject of much attention here. They attempt to refine the usual crude criterion for synthesizability, which forbids cyclic dependencies between nonregister variables (so that a variable seems to depend upon itself in the same instant), but takes into account neither the Boolean interpretation nor the potentially reachable configurations. Consider the equation x = y ∨ z in a context where it has been established that y is the constant true. Then x does not really depend on z, since its (constant) value is forced by y's. Constructive causality seeks the best possible faithful notion of true combinatorial dependency, taking the Boolean interpretation of functions into account. For details, see Reference 12.

Another equally important aspect of the mathematical model is that a number of combinatorial and sequential optimization techniques have been developed over the years in the context of hardware synthesis approaches. The main ones are now embedded in the SIS and MVSIS optimization suites from UC Berkeley [13,14]. They come as a great help in allowing programs written in high-level S/R formalisms to compile into efficient code, either software or hardware targeted [15].

Mealy machines. Mealy machines are finite-state automata corresponding strictly to the synchronous assumption. In a given state, given a certain input valuation (a subset of present signals), the machine reacts by immediately producing a set of output signals before entering a new state. Mealy machines can be generated from netlists (and, by extension, from any S/R system). The Mealy machine construction can then be seen as a symbolic expansion of all possible behaviors, computing the space of reachable states (RSS) on the way. But while the precise RSS is gained, the precise causal dependency relations are lost, which is why both the Mealy FSM and netlist models are useful in the course of S/R design [16].
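Constructive analysis of netlists can be pictured as evaluation over three values (0, 1, and "not yet known"), where a gate's output may become defined even while some of its inputs are still unknown. The following C sketch applies this idea to the x = y ∨ z example above; it illustrates three-valued OR only, not a complete causality analyzer.

#include <stdio.h>

typedef enum { B0, B1, BOT } tv;   /* ternary value: 0, 1, or unknown */

/* Three-valued OR: since 1 is dominant, the output is defined as soon
   as one input is known to be 1, regardless of the other input. */
tv or3(tv a, tv b)
{
    if (a == B1 || b == B1) return B1;
    if (a == B0 && b == B0) return B0;
    return BOT;
}

int main(void)
{
    tv y = B1;          /* established to be the constant true */
    tv z = BOT;         /* not yet computed */
    tv x = or3(y, z);
    printf("x = %s\n", x == B1 ? "1" : x == B0 ? "0" : "unknown");
    return 0;           /* prints "x = 1": x does not depend on z */
}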
When the RSS is extracted, often in symbolic Binary Decision Diagram (BDD) form, it can be used in a number of ways: we already mentioned that constructive causality only considers dependencies inside the RSS; similarly, the activities of model-checking formal verification and of test coverage analysis are strongly linked to the RSS construction [17–20].

The modeling style of netlists can be extrapolated to block-diagram networks, often used in multimedia digital signal processing, by adding more types and arithmetic operators, as well as activation conditions to introduce some amount of control flow. The declarative synchronous languages can be seen as attempts to provide structured programming to compose large systems modularly in this class of applications, as described in Section 8.4. Similarly, the imperative languages provide ways to program, in a structured way, hierarchical systems of interacting Mealy FSMs, as described in Section 8.3.
reaction() {
  decode state;
  read input;
  compute;
  write output;
  encode state;
}
FIGURE 8.1 The reaction function is called at each instant to perform the computation of the current step.
8.2.3.1 Synchronous Hypothesis versus Neighboring Models

Many quasi-synchronous formalisms exist in the field of embedded system (co)simulation: the simulation semantics of SystemC and of regular HDLs at the RTL level, the discrete-step Simulink/Stateflow simulation, or the official StateCharts semantics, for instance. Such formalisms generally employ a notion of physical time in order to establish when to start the next execution instant. Inside the current execution instant, however, delta-cycles allow zero-delay activity propagation, and potentially complex behaviors occur inside a given single reaction. The main difference here is that no causality analysis (based on the synchronous hypothesis) is performed at compilation time, so that an efficient ordering/scheduling cannot be precomputed before simulation. Instead, each variable change recursively triggers further recomputations of all depending variables in the same reaction.
8.2.4 Implementation Issues

The problem of implementing a synchronous specification mainly consists in defining the step reaction function that implements the behavior of an instant, as shown in Figure 8.1. The global behavior is then computed by iterating this function over successive instants and successive input signal valuations.

Following the basic mathematical interpretations, the compilation of an S/R program may consist either in the expansion into a flat Mealy FSM, or in the translation into a flat netlist (with more types and arithmetic operators, but without activation conditions). Here, the run-time implementation consists in the execution of the resulting Mealy machine or netlist. In the first case, the automaton structure is implemented as a big top-level switch between states. In the second case, the netlist is totally ordered in a way compatible with causality, and all the equations in the ordered list are evaluated at each execution instant. These basic techniques are at the heart of the first compilers, and of some industrial ones.

In the last decade, fancier implementation schemes have been sought, relying on the use of activation conditions: during each reaction, execution starts by identifying the "truly useful" program blocks, which are marked as "active." Then only the actual execution of the active blocks is scheduled (a bit more dynamically) and performed, in an order that respects the causality of the program. In the case of declarative languages, the activation conditions come in the form of a hierarchy of clock under-samplings — the clock tree — obtained through a "clock calculus" computation performed at compile time (see Section 8.4.3). In the case of imperative formalisms, activation conditions are based on the halting points (where the control flow can stop between execution instants) and on the signal-generated (sub-)clocks (see Section 8.3.3).
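As a concrete picture of the first scheme, here is a minimal C sketch of a reaction function compiled as a top-level switch. The two-state machine it encodes (emit O on every other occurrence of input I) is invented purely for illustration.

#include <stdio.h>
#include <stdbool.h>

static int state = 0;            /* encoded automaton state */

/* One execution instant: decode the state, react to the input,
   produce the output, and encode the next state. */
bool reaction(bool I)
{
    bool O = false;
    switch (state) {
    case 0:
        if (I) { O = true; state = 1; }   /* first I: emit O */
        break;
    case 1:
        if (I) { state = 0; }             /* second I: stay silent */
        break;
    }
    return O;
}

int main(void)
{
    bool inputs[] = { true, false, true, true };
    for (unsigned i = 0; i < 4; i++)
        printf("instant %u: O = %d\n", i, reaction(inputs[i]));
    return 0;    /* O is emitted at instants 0 and 3 */
}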
8.3 Imperative Style: Esterel and SyncCharts

For control-dominated systems, comprising a fair number of (sub-)modes and macro-states with activity swapping between them, it is natural to employ a description style that is algorithmic and imperative, describing the changes and progression of control in an explicit flow. In essence, one seeks to represent hierarchical (Mealy) FSMs, but with some data computation and communication performed inside states and transitions. Esterel provides this in a textual fashion, while SyncCharts provides a graphical counterpart, with visual macro-states. It should be noted that systems here remain finite-state (at least in their control structure).
8.3.1 Syntax and Structure

Esterel introduces a specific pause construct, used to divide behaviors into successive instants (reactions). The pause statement excepted, control flows through sequential, parallel, and if-then-else constructs, performing data operations and interprocess signaling. But it stops at pause, memorizing the activity of that location point for the next execution instant. This provides the needed atomicity mechanism, since the instant is over when all currently active parallel components have reached a pause statement.

The full Esterel language contains a large number of constructs that facilitate modeling, but there exists a reduced kernel of primitive statements (corresponding to the natural structuring paradigms) from which all the other constructs can be derived. This is of special interest for model-based approaches, because only the primitives need to be assigned semantics as transformations in the model space. The semantics of the primitives are then combined to obtain the semantics of composed statements. Figure 8.2 provides the list of primitive operators for the data-less subset of Esterel (also called Pure Esterel). A few comments are in order:

• In p; q the reaction where p terminates is the same as the reaction where q starts (control can be split into reactions only by pause statements inside p or q).

• The loop constructs do not terminate, unless aborted from above. This abortion can be owing to an external signal received by an abort statement, or to an internal exception raised through the trap/exit mechanism, or to any of the two (as for the weak abort statement). The body of a loop statement should not terminate instantly, or else the loop would unroll endlessly in the same instant, leading to divergence. This is checked by static analysis techniques. Finally, loops are the only means of defining iterating behaviors (there is no general recursion), so that the system remains finite-state.

• The present signal-testing primitive allows an else part. This is essential to the expressive power of the language, and has strong semantic implications pertaining to the synchronous hypothesis. It is enough to note that, according to the synchronous hypothesis, signal absence can effectively be asserted.

• The difference between "abort p when S" and "weak abort p when S" is that in the first case the signal S can only come from outside p, and its occurrence prevents p from executing during the execution instant where S arrives. In the second case, S can also be emitted by p, and the preemption occurs only after p has completed its execution for the instant.
[p]                           Enforces precedence by parenthesis
pause                         Suspends the execution until the next instant
p; q                          Executes p, then q as soon as p terminates
loop p end                    Iterates p forever in sequence
[p || q]                      Executes p and q in parallel, synchronously
signal S in p end             Declares local signal S in p
emit S                        Emits signal S
present S then p else q end   Executes p or q upon S being present or absent
abort p when S                Executes p until S occurs (exclusive)
weak abort p when S           Executes p until S occurs (inclusive)
suspend p when S              Executes p unless S occurs
trap T in p end               Declares/catches exception T in p
exit T                        Raises exception T

FIGURE 8.2 Pure Esterel statements.
• Technically speaking, the trap/exit mechanism can emulate the abort statements. But we feel that the ease of understanding makes the latter worth including in the set of primitives. Similarly, we shall sometimes use "await S" as a shorthand for "abort loop pause end when S," and "sustain S" for "loop emit S end."

Most of the data-handling part of the language is deferred to a general-purpose host language (C, C++, Java, ...). Esterel only declares type names, variable types, and function signatures (which are used as mere abstract instructions). The actual type specifications and function implementations must be provided and linked later, at compilation time. In addition to the structuring primitives of Figure 8.2, the language contains (and requires) interface declarations (for signals, most notably), and modular division with submodule invocation. Submodule instantiation allows signal renaming, that is, binding formal name parameters to actual ones (again, mostly for signals).

Rather than providing a full user manual for the language, we shall illustrate most of these features with an example. The small example of Figure 8.3 has four input signals and one output signal. Meant to model a cyclic computation like a communication protocol, the core of our example is the loop that awaits the input I, emits O, and then awaits J before instantly restarting. The local signal END signals the completion of loop cycles. When started, an await statement waits for the next clock cycle where its signal is present. The computation of all the other statements present in our example is performed during a single clock cycle, so that the await statements are the only places where control can be suspended between reactions (they preserve the state of the program between cycles). A direct consequence is that the signals I and J must come in different clock cycles in order not to be discarded.

The loop is preempted by the exception-handling statement trap when "exit T" is executed. In this case, trap instantly terminates, control is given in sequence, and the program terminates. The preemption protocol is triggered by the input signal KILL, but the exception T is raised only when END is emitted. The program is suspended — no computation is performed and the state is kept unchanged — in clock cycles where the SUSP signal is received. A possible execution trace for our program is given in Figure 8.4.
module Example:
input I, J, KILL, SUSP;
output O;
suspend
  trap T in                 % exception handler, performs the preemption
    signal END in
      loop                  % basic computation loop
        await I; emit O; await J; emit END
      end
    ||                      % preemption protocol, triggered by KILL
      await KILL; await END; exit T
    end
  end
when SUSP                   % suspend signal
end module
FIGURE 8.3 A simple Esterel program modeling a cyclic computation (such as a communication protocol) that can be interrupted between cycles and can be suspended.
Clock   Inputs    Outputs   Comments
0       any                 All inputs discarded
1       I         O
2       KILL                Preemption protocol triggered
3                           Nothing happens
4       J, SUSP             Suspended, J discarded
5       J                   END emitted, T raised, program terminates
FIGURE 8.4 A possible execution trace for our example.
8.3.2 Semantics

Esterel enjoys a full-fledged formal semantics, in the form of Structural Operational Semantics (SOS) rules [12]. In fact, there are two main levels of such rules: the coarser describes all potential, logically consistent behaviors, while the more precise one selects only those that can be obtained in a constructive way (thereby discarding some programs as "unnatural" in this respect). This issue can be introduced with two small examples:

present S then emit S end
present S else emit S end
In the first case the signal S can logically be assumed to be either present or absent: if assumed present, it will be emitted, so it will indeed become present; if assumed absent, it will not be emitted. In the second case, by a similar reasoning, the signal can be neither present nor absent. In both cases, however, the analysis works by "guessing" the status of the signal before reaching the emission that would justify it. While more complex causality paradoxes can be built using the full language, these two examples already show that the problem stems from the existence of causality dependencies inside a reaction, prompted by instantaneous sequential control propagation and signal exchanges. The so-called constructive causality semantics of Esterel checks precisely that control and signal propagation are well-behaved, so that no "guess" is required. Programs that pass this requirement are deemed "correct," and they provide deterministic behaviors for whatever input is presented to the program (a desirable feature in embedded system design).
8.3.3 Compilation and Compilers

Following the pattern presented in Section 8.2.4, the first compilers for Esterel were based on the translation of the source into (Mealy) finite automata or into digital synchronous circuits at the netlist level. The generated sequential code was then a compiled automaton or netlist simulator.

The automata-based compilation [7] was used in the first Esterel compilers (known as Esterel V3). Automaton generation was done by exhaustive expansion of all reachable states using symbolic execution (all data is kept uninterpreted). Execution time was then theoretically optimal, but code size could blow up (with the number of states), and huge code duplication was mandatory for actions performed in several different states.

The netlist-based compilation (Esterel V5) is based on a quasi-linear, structural Esterel-to-circuits translation scheme [21] that ensures the tractability of compilation even for the largest examples. The drawback of the method is the reaction time (the simulation time for the generated netlist), which increases linearly with the size of the program.

Apart from these two compilation schemes, which have matured into full industrial-strength compilers, several attempts have been made to develop a more efficient, basically event-based type of compilation that follows more readily the naive execution path and control propagation inside each reaction, and in particular executes "as much as possible" only the truly active parts of the program.1 Here, we mention three
1 Recall that this is a real issue in Esterel, since programs may contain reactions to the absence of signals, and determining this absence may require checking that no emission remains possible in the potential behaviors, whatever feasible test branches could be taken. To achieve this goal at a reasonable computational price, current compilers require, in fact, additional restrictions — in essence, the acyclicity of the dependency/causality graph at some representation level. Acyclicity ensures constructiveness, because any topological order of the operations in the graph gives an execution order which is correct for all instants.
FIGURE 8.5 GRC intermediate representation for our Esterel example: (a) the hierarchical state representation; (b) the concurrent control-flow graph.
such approaches: the Saxo compiler of Closse and co-workers [22], the EC compiler of Edwards [23], and the GRC2C compiler of Potop-Butucaru and de Simone [24]. All of them are structured around flowgraph-based intermediate representations that are easily translated into well-structured sequential code. The intermediate representations also account for the differences between the approaches, by determining which Esterel programs can be represented and what optimization and code generation techniques can be applied.

We take as our example the GRC2C compiler [24], which is structured around the GRC intermediate form. The GRC representation of our example, given in Figure 8.5, uses two graph-based structures — a hierarchical state representation (HSR) and a concurrent control-flow graph (CCFG) — to preserve most of the structural information of the Esterel program while making the control flow explicit with few graph-building primitive nodes.

The HSR is an abstraction of the syntax tree of the initial Esterel program. It can be seen as a structured data memory that preserves state information across reactions. During each instant, a set of activation conditions (clocks) is computed from this memory state, to drive the execution toward the active instructions. The CCFG represents, in an operational fashion, the computation of an instant (the transition function). During each reaction, the dynamic CCFG operates on the static HSR by marking/unmarking component nodes (subtrees) with "active" tags as they are activated or deactivated by the semantics. For instance, when we start our small example (Figure 8.3 and Figure 8.5), the "program start (1)" and "program (0)" HSR nodes are active, while all the statements of the program (and the associated HSR nodes) are not. As in any instant, control enters the CCFG by the topmost node and uses the first state decoding node (labeled 0) to read the state of the HSR and branch to the start behavior, which sets the "program start (1)" indicator to inactive (with "exit 1") and activates "await I" and "await KILL" (with "enter 8" and "enter 11").
The HSR also serves as a repository for tags, which record redundancies between various activation clocks and are used by the optimization and code generation algorithms. One such tag is #, which tells that at most one child of the tagged node can retain control between reactions at a time (the activation clocks of the branches are exclusive). Other tags (not figured here) are computed through complex static analysis of both the HSR and the CCFG. The tags allow efficient optimization and sequential code generation.

The CCFG is obtained by making the control flow of the Esterel program explicit (a structural, quasi-linear translation process).2 Usually, it can be highly optimized using classical compiler techniques and some methods derived from circuit optimization, both driven by the HSR tags computed by static analysis. Code generation from a GRC representation is done by encoding the state on sequential variables and by scheduling the CCFG operators using classical compilation techniques [25].

The Saxo compiler of Closse and co-workers [22] uses a discrete-event interpretation of Esterel to generate a compiled event-driven simulator. The compiler flow is similar to that of VeriSUIF [26], but Esterel's synchronous semantics are used to greatly simplify the approach. An event-graph intermediate representation is used to split the program into a list of guarded procedures. The guards intuitively correspond to events that trigger computation. At each clock cycle, the simulation engine traverses the list once, from the beginning to the end, and executes the procedures with an active guard. The execution of a procedure may modify the guards for the current cycle and for the next cycle. The resulting code is slower than its GRC2C-generated counterpart for two reasons: first, it does not exploit the hierarchy of exclusion relations determined by switching statements like the tests. Second, optimization is less effective because the program hierarchy is lost when the state is (very redundantly) encoded using guards.

The EC compiler of Edwards [23] treats Esterel as having control-flow semantics (in the spirit of [25,27]) in order to take advantage of the initial program hierarchy and produce efficient, well-structured C code. The Esterel program is first translated into a CCFG representing the computation of a reaction. The translation makes the control flow explicit and encodes the state access operations using tests and assignments of integer variables. Its static scheduling algorithm takes advantage of the mutual exclusions between parts of the program and generates code that uses program-counter variables instead of simple Boolean guards. The result is therefore faster than its Saxo-generated counterpart. However, it is usually slower than the GRC2C-generated code, because the GRC representation preserves the state structure of the initial Esterel program and uses static analysis techniques to determine redundancies in the activation pattern. Thus, it is able to better simplify the final state representation and the CCFG.
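To make the guarded-procedure scheme concrete, here is a small C sketch of such a simulation loop. It is a hand-written illustration of the execution model described above, with two invented procedures, not output of the Saxo compiler.

#include <stdio.h>
#include <string.h>
#include <stdbool.h>

#define NPROC 2

/* Guards for the current cycle and for the next one; a procedure may
   arm either, as described in the text (arming a current-cycle guard
   only affects procedures appearing later in the list). */
static bool now[NPROC], next[NPROC];

static void p0(void) { printf("  p0 runs\n"); next[1] = true; }
static void p1(void) { printf("  p1 runs\n"); }

static void (*proc[NPROC])(void) = { p0, p1 };

static void cycle(void)
{
    for (int i = 0; i < NPROC; i++)     /* one pass over the list */
        if (now[i]) { now[i] = false; proc[i](); }
    memcpy(now, next, sizeof now);      /* guards armed for next cycle */
    memset(next, 0, sizeof next);
}

int main(void)
{
    now[0] = true;                      /* initial activation */
    printf("cycle 1:\n"); cycle();      /* p0 runs, arms p1 */
    printf("cycle 2:\n"); cycle();      /* p1 runs */
    return 0;
}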
8.3.4 Analysis/Verification/Test Generation: Benefits from Formal Approaches

We claimed that the introduction of well-chosen structuring primitives, endowed with formal mathematical semantics and interpretations as well-defined transformations in the realms of Mealy machines and synchronous circuits, was instrumental in allowing powerful analysis and synthesis techniques as part of the design of synchronous programs. What are they, and how do they appear in practice to enhance confidence in the correctness of safety-critical embedded applications?

Maybe the most obvious benefit is that synchronous formalisms can fully profit from the model-checking and automatic verification usually associated with the netlist and Mealy machine representations, now widely popular in the hardware design community with the PSL/SuGaR and assertion-based design approaches. Symbolic BDD- and SAT-based model-checking techniques are thus available on all S/R systems. Moreover, the structured syntax allows in many cases the introduction of modular approaches, or can guide abstraction techniques with the goal of reducing the complexity of analysis. Formal methods akin to model-checking can also be used to automatically produce test sequences that seek to reach the best possible coverage in terms of visited states or exercised transitions. Here again, specific techniques were developed to match the S/R models.
one for the execution instants where they are started, and one for instants where control is resumed from inside.
Also, symbolic representations of the reachable state space (or abstracted over-approximations), which can effectively be produced and certified correct thanks to the formal semantics, can be used in the course of compilation and optimization. For Esterel in particular, the RSS computation allows more programs to be accepted as "correct" with respect to constructiveness: causal dependencies may indeed vary in direction depending on the state. If all dependencies are put together regardless of the states, a causality cycle may appear even though not all components of the cycle can be active at the same instant, so that no real cycle exists (but it takes a dynamic analysis to establish this). Similarly, the RSS may exhibit combinatorial relations between the registers encoding the local states, so that register elimination is possible to further simplify the state space structure.

Finally, the domain-specific structuring primitives empowering dedicated programming can also be seen as an important criterion. Readable, easily understandable programs are a big step toward correct programs. And when issues of correctness are not so plain and easy, as, for instance, when regarding the proper scheduling of behaviors inside a reaction to respect causal effects, powerful abstract hypotheses are defined in the S/R domain that characterize admissible orderings (and build them for correct programs). A graphical version of Esterel, named SyncCharts for synchronous StateCharts, has been defined to provide a visual formalism with a truly synchronous semantics.
8.4 The Declarative Style: Lustre and Signal

The presentation of declarative formalisms implementing the synchronous hypothesis as defined in Section 8.2 can be cast into a model of computation (proposed in Reference 28) consisting of a domain of traces/behaviors and of a semilattice structure that renders the synchronous hypothesis using a timing equivalence relation: clock equivalence. Asynchrony can be superimposed on this model by considering a flow equivalence relation. Heterogeneous systems [29] can also be modeled by parameterizing the composition operator with arbitrary timing relations.
8.4.1 A Synchronous Model of Computation

We consider a partially ordered set of tags t to denote instants (which are seen, in the sense of Section 8.2.2, as symbolic periods in time during which one reaction takes place). The relation t1 ≤ t2 says that t1 occurs before t2. A minimum tag exists, denoted by 0. A totally ordered set of tags C is called a chain and denotes the sampling of a possibly continuous or dense signal over a countable series of causally related tags. Events, signals, behaviors, and processes are defined as follows:

• An event e is a pair consisting of a value v and a tag t.
• A signal s is a function from a chain of tags to a set of values.
• A behavior b is a function from a set of names x to signals.
• A process p is a set of behaviors that have the same domain.
In the remainder, we write tags(s) for the tags of a signal s, vars(b) for the domain of b, b|X for the projection of a behavior b on a set of names X, and b/X for its complementary. Figure 8.6 depicts a behavior (b) over three signals named x, y, and z. Two frames depict timing domains formalized by chains of tags. Signals x and y belong to the same timing domain: x is a down-sampling of y. Its events are synchronous to odd occurrences of events along y and share the same tags, for example, t1. Even tags of y, for example, t2, are ordered along its chain, for example, t1 < t2, but absent from x. Signal z belongs to a different timing domain. Its tags, for example, t3, are not ordered with respect to the chain of y: neither t1 ≤ t3 nor t3 ≤ t1. The synchronous composition of the processes p and q is denoted by p | q. It is defined by the union b ∪ c of all behaviors b (from p) and c (from q) that hold the same values at the same tags, b|I = c|I, on the set of names I = vars(b) ∩ vars(c) they share. Figure 8.7 depicts the synchronous composition (right) of the behavior b (left) and the behavior c (middle). The signal y, shared by b and c, carries the same tags and the same values in both b and c. Hence, b ∪ c defines the synchronous composition of b and c.
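These definitions can be rendered directly as data structures. The following sketch is our own illustration in Python (with integers standing in for abstract tags and invented helper names): a signal is a finite map from tags to values, a behavior maps names to signals, and synchronous composition unites behaviors that agree on their shared names.

def project(b, names):
    # b|X: the restriction of behavior b to the names in X
    return {x: s for x, s in b.items() if x in names}

def compose(p, q):
    # p | q: the union b ∪ c of all behaviors that hold the same
    # values at the same tags on the signal names they share
    result = []
    for b in p:
        for c in q:
            shared = set(b) & set(c)        # I = vars(b) ∩ vars(c)
            if project(b, shared) == project(c, shared):
                result.append({**b, **c})
    return result

# x is a down-sampling of y: it shares only y's odd tags.
b = {"x": {1: 10, 3: 30}, "y": {1: 10, 2: 20, 3: 30}}
c = {"y": {1: 10, 2: 20, 3: 30}, "z": {5: 99}}
print(compose([b], [c]))   # b and c agree on y, so b ∪ c belongs to p | q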
FIGURE 8.6 A behavior (named b) over three signals (x, y, and z) belonging to two clock domains.
FIGURE 8.7 Synchronous composition of b ∈ p and c ∈ q.

FIGURE 8.8 Scheduling relations between simultaneous events.

FIGURE 8.9 Relating synchronous behaviors by stretching.
A scheduling structure is defined to schedule the occurrence of events along signals during an instant t. A scheduling is a preorder relation → between dates xt, where t represents the time and x the location of the event. Figure 8.8 depicts such a relation, superimposed on the signals x and y of Figure 8.6. The relation yt1 → xt1, for instance, requires y to be calculated before x at the instant t1. Naturally, scheduling is contained in time: if t < t′ then xt →b xt′ for any x and b, and if xt →b xt′ then t ≤ t′.

A synchronous structure is defined by a semilattice structure to denote behaviors that have the same timing structure. The intuition behind this relation (depicted in Figure 8.9) is to consider a signal as an elastic with ordered marks on it (tags). If the elastic is stretched, the marks remain in the same relative and partial order but have more space (time) between each other. The same holds for a set of elastics: a behavior. If the elastics are equally stretched, the order between marks is unchanged. In Figure 8.9, the timescale of x and y changes but the partial timing and scheduling relations are preserved. Stretching is a partial-order relation which defines clock equivalence. Formally, a behavior c is a stretching of b of the same domain, written b ≤ c, if there exists an increasing bijection on tags f that preserves the timing and scheduling relations. If so, c is the image of b by f. Last, the behaviors b and c are called clock-equivalent, written b ∼ c, iff there exists a behavior d such that d ≤ b and d ≤ c.
8.4.2 Declarative Design Languages

The declarative design languages Lustre [9] and Signal [10] share the core syntax of Figure 8.10 and can both be expressed within the synchronous model of computation of Section 8.4.1.
FIGURE 8.10 A common syntactic core for Lustre and Signal.
FIGURE 8.11 The if–then–else condition in Lustre.

node counter (tick, reset: bool) returns (count: int);
let
  count = if true -> reset then 0
          else if tick then pre count + 1
          else pre count;
tel

FIGURE 8.12 A resettable counter in Lustre.
In both languages, a process P is an infinite loop that consists of the synchronous composition P | Q of simultaneous equations x = y f z over signals named x, y, and z. Both Lustre and Signal support the restriction of a signal name x to a process P, noted P/x. The analogy stops here, as Lustre and Signal differ in fundamental ways: Lustre is a single-clocked programming language, while Signal is a multi-clocked (polychronous) specification formalism. This difference originates in the choice of different primitive combinators (named f in Figure 8.10) and results in orthogonal system design methodologies.

8.4.2.1 Combinators for Lustre

In a Lustre process, each equation processes the nth event of each input signal during the nth reaction (to possibly produce an output event). As an equation synchronizes upon the availability of all its inputs, the timing structure of a Lustre program is easily captured within a single clock domain: all input events are related to a master clock, and the clock of the output signals is defined by sampling the master. There are three fundamental combinators in Lustre:

Delay. "x = pre y" initially leaves x undefined and then defines it by the previous value of y.

Followed-by. "x = y -> z" initially defines x by the first value of y, and then by z. The pre and -> operators are usually used together, as in "x = v -> pre (y)," to define a signal x initialized to v and defined by the previous value of y. Scade, the commercial version of Lustre, uses a one-bit analysis to check that each signal defined by a pre is effectively initialized by an ->.

Conditional. "x = if b then y else z" defines x by y if b is true and by z if b is false. It can be used without the alternative, "x = if b then y," to sample y at the clock b, as shown in Figure 8.11.

Lustre programs are structured as data-flow functions, also called nodes. A node takes a number of input signals and defines a number of output signals upon the presence of an activation condition. If that condition matches an edge of the input signal clock, then the node is activated and possibly produces output. Otherwise, outputs are undetermined or defaulted. As an example, Figure 8.12 defines a resettable counter. It takes an input signal tick and returns the count of its occurrences. A boolean reset signal can be triggered to reset the count to 0. We observe that the boolean input signals tick and reset are synchronous to the output signal count and define a data-flow function.
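Read over finite streams (lists of per-instant values), the three combinators have a direct functional reading, and the counter of Figure 8.12 can be executed by unfolding its recursive definition instant by instant. The Python sketch below is our own rendering of that reading, given for intuition only; it is not how a Lustre compiler executes programs.

def pre(y):                 # "pre y": previous value, undefined at first
    return [None] + y[:-1]

def followed_by(y, z):      # "y -> z": first instant from y, then from z
    return y[:1] + z[1:]

def ite(b, y, z):           # "if b then y else z", pointwise
    return [yv if bv else zv for bv, yv, zv in zip(b, y, z)]

def counter(tick, reset):
    # count = if (true -> reset) then 0
    #         else if tick then pre count + 1 else pre count
    # The recursion through "pre count" is unfolded step by step.
    count, prev = [], None
    for n, (t, r) in enumerate(zip(tick, reset)):
        if n == 0 or r:     # "true -> reset" holds at the first instant
            c = 0
        elif t:
            c = prev + 1
        else:
            c = prev
        count.append(c)
        prev = c
    return count

print(counter([True, True, False, True], [False, False, True, False]))
# prints [0, 1, 0, 1]: count, increment, reset, increment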
FIGURE 8.13 The delay operator in Signal.

FIGURE 8.14 The merge operator in Signal.

process counter = (? event tick, reset ! integer value)
  (| value := (0 when reset)
              default ((value$ init 0 + 1) when tick)
              default (value$ init 0)
   |);

FIGURE 8.15 A resettable counter in Signal.
8.4.2.2 Combinators for Signal

As opposed to nodes in Lustre, equations x := y f z in Signal more generally denote processes that define timing relations between input and output signals. There are three primitive combinators in Signal:

Delay. "x := y$1 init v" initially defines the signal x by the value v and then by the previous value of the signal y. The signal y and its delayed copy "x := y$1 init v" are synchronous: they share the same set of tags t1, t2, . . . . Initially (at t1), the signal x takes the declared value v. At tag tn, x takes the value of y at tag tn−1. This is displayed in Figure 8.13.

Sampling. "x := y when z" defines x by y when z is true (and both y and z are present); x is present with the value v2 at t2 only if y is present with v2 at t2 and if z is present at t2 with the value true. When this is the case, one needs to schedule the calculation of y and z before x, as depicted by yt2 → xt2 ← zt2.

Merge. "x := y default z" defines x by y when y is present and by z otherwise. If y is absent and z is present with v1 at t1, then x holds (t1, v1). If y is present (at t2 or t3), then x holds its value whether z is present (at t2) or not (at t3). This is depicted in Figure 8.14.

The structuring element of a Signal specification is a process. A process accepts input signals originating from possibly different clock domains to produce output signals when needed. Recalling the example of the resettable counter (Figure 8.12), this makes it possible, for instance, to specify a counter (pictured in Figure 8.15) where the inputs tick and reset and the output value have independent clocks. The body of counter consists of one equation that defines the output signal value. Upon the event reset, it sets the count to 0. Otherwise, upon a tick event, it increments the count by referring to the previous value of value and adding 1 to it. Otherwise, if the count is solicited in the context of the counter process (meaning that its clock is active), the counter just returns the previous count without having to obtain a value from the tick and reset signals.

A Signal process is a structuring element akin to a hierarchical block diagram. A process may structurally contain sub-processes. A process is a generic structuring element that can be specialized to the timing context of its call. For instance, a definition of the Lustre counter (Figure 8.12) starting from the specification of Figure 8.15 consists of the refinement depicted in Figure 8.16. The input tick and reset clocks expected by the process counter are sampled from the boolean input signals tick and reset
process synccounter = (? boolean tick, reset ! integer value)
  (| value := counter (when tick, when reset)
   | reset ˆ= tick ˆ= value
   |);

FIGURE 8.16 Synchronization of the counter interface.

FIGURE 8.17 The syntax of clock expressions and clock relations (equations).

FIGURE 8.18 The clock inference system of Signal.
by using the "when tick" and "when reset" expressions. The count is then synchronized to the inputs by the equation reset ˆ= tick ˆ= value.
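Because Signal signals carry explicit tags, the three primitives of Section 8.4.2.2 can be prototyped as operations on finite maps from tags to values. This sketch is our own rendering for intuition (tags shown as integers, names invented); it is not part of any Signal toolset.

def delay(y, v, chain):
    # "x := y$1 init v": along y's chain of tags, x holds v initially
    # and thereafter the value y held at the previous tag.
    x, prev = {}, v
    for t in chain:          # chain: y's tags in increasing order
        x[t] = prev
        prev = y[t]
    return x

def when(y, z):
    # "x := y when z": y sampled at the instants where z is present and true.
    return {t: v for t, v in y.items() if z.get(t) is True}

def default(y, z):
    # "x := y default z": y where y is present, z at z's remaining instants.
    return {**z, **y}

y = {1: "a", 2: "b", 3: "c"}
z = {1: False, 2: True, 4: True}
print(delay(y, "v0", [1, 2, 3]))     # {1: 'v0', 2: 'a', 3: 'b'}
print(when(y, z))                    # {2: 'b'}
print(default(y, {2: "d", 4: "e"}))  # y wins at tag 2; z supplies tag 4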
8.4.3 Compilation of Declarative Formalisms

The analysis and code generation techniques of Lustre and Signal are necessarily different, tailored to handle the specific challenges posed by their different models of computation and programming paradigms.

8.4.3.1 Compilation of Signal

Sequential code generation starting from a Signal specification begins with an analysis of its implicit synchronization and scheduling relations. This analysis yields the control- and data-flow graphs that define the class of sequentially executable specifications and allow code to be generated.

8.4.3.1.1 Synchronization and Scheduling Analysis

In Signal, the clock ˆx of a signal x denotes the set of instants at which the signal x is present. It is represented by a signal that is true when x is present and that is absent otherwise. Clock expressions (see Figure 8.17) represent control. The clock "when x" (respectively "when not x") represents the time tags at which a boolean signal x is present and true (respectively false). The empty clock is denoted by 0. Clock expressions are obtained using conjunction, disjunction, and symmetric difference over other clocks. Clock equations (also called clock relations) are Signal processes: the equation "eˆ= e′" synchronizes the clocks e and e′, while "eˆ< e′" specifies the containment of e in e′. Explicit scheduling relations "x → y when e" allow the representation of causality in the computation of signals (e.g., x after y at the clock e). A system of clock relations E can easily be associated (using the inference system P : E of Figure 8.18) with any Signal process P, to represent its timing and scheduling structure.

8.4.3.1.2 Hierarchization

The clock and scheduling relations E of a process P define the control- and data-flow graphs that hold all the information necessary to compile a Signal specification, upon satisfaction of the property of endochrony, as illustrated in Figure 8.19. A process is said to be endochronous iff, given a set of input signals (x and y in Figure 8.19) and flow-equivalent input behaviors (datagrams on the left of Figure 8.19), it has the capability to reconstruct a unique synchronous behavior up to clock-equivalence: the datagrams of the
input signals in the middle of Figure 8.19 and of the output signal on the right of Figure 8.19 are ordered in clock-equivalent ways.

FIGURE 8.19 Endochrony: from flow-equivalent inputs to clock-equivalent outputs.

To determine the order ⪯ in which signals are processed during the period of a reaction, the clock relations E play an essential role. The process of determining this order is called hierarchization and consists of an insertion algorithm that proceeds in three easy steps:

1. First, equivalence classes are defined between signals of the same clock: if E ⇒ ˆxˆ= ˆy then x ∼ y (we write E ⇒ E′ iff E implies E′).
2. Second, elementary partial order relations are constructed between sampled signals: if E ⇒ ˆxˆ= when y or E ⇒ ˆxˆ= when not y, then y ⪯ x.
3. Last, assume a partial order of maximum z such that E ⇒ ˆzˆ= ˆy f ˆw (for some f ∈ {ˆ+, ˆ*, ˆ-}) and a signal x such that y ⪯ x ⪯ w; then insertion consists of attaching z to x by x ⪯ z.

The insertion algorithm proposed in Reference 30 yields a canonical representation of the partial order by observing that there exists a unique minimum clock x below z such that rule 3 holds. Based on the order ⪯, one can decide whether E is hierarchical by checking that its clock relation has a minimum, written min E ∈ vars(E), so that min E ⪯ x for all x ∈ vars(E). If E is furthermore acyclic (i.e., E ⇒ x → x when e implies E ⇒ eˆ= 0, for all x ∈ vars(E)), then the analyzed process is endochronous, as shown in Reference 28.

Example 8.1

The implications of hierarchization for code generation can be outlined by considering the specification of a one-place buffer in Signal (Figure 8.20, left). Process buffer implements two functionalities. One is the process alternate, which desynchronizes the signals i and o by synchronizing them to the true and false values of an alternating boolean signal b. The other functionality is the process current. It defines a cell in which values are stored at the input clock ˆi and loaded at the output clock ˆo. cell is a predefined Signal operation defined by:

x := y cell z init v =def (| m := x$1 init v | x := y default m | ˆxˆ= ˆyˆ+ˆz |)/m

Clock inference (Figure 8.20, middle) applies the clock inference system of Figure 8.18 to the process buffer to determine three synchronization classes. We observe that b, c_b, zb, and zo are synchronous and define the master clock synchronization class of buffer. There are two other synchronization classes, c_i and c_o, that correspond to the true and false values of the boolean flip-flop variable b, respectively:

b ≺ c_b ≺ zb ≺ zo,  b ⪯ c_i ≺ i,  and  b ⪯ c_o ≺ o

This defines three nodes in the control-flow graph of the generated code (Figure 8.20, right). At the main clock c_b, b and c_o are calculated from zb. At the sub-clock b, the input signal i is read. At the sub-clock c_o, the output signal o is written. Finally, zb is determined. Notice that the sequence of instructions follows the scheduling relations determined during clock inference.
Specification:

process buffer = (? i ! o)
  (| alternate (i, o)
   | o := current (i)
   |) where
  process alternate = (? i, o !)
    (| zb := b$1 init true
     | b := not zb
     | o ˆ= when not b
     | i ˆ= when b
     |) / b, zb;
  process current = (? i ! o)
    (| zo := i cell ˆo init false
     | o := zo when ˆo
     |) / zo;

Clock analysis:

(| c_b ˆ= b
 | b ˆ= zb
 | zb ˆ= zo
 | c_i := when b
 | c_i ˆ= i
 | c_o := when not b
 | c_o ˆ= o
 | i -> zo when ˆi
 | zb -> b
 | zo -> o when ˆo
 |) / zb, zo, c_b, c_o, c_i, b;

Generated C code:

buffer_iterate () {
  b = !zb;
  c_o = !b;
  if (b) {
    if (!r_buffer_i(&i))
      return FALSE;
  }
  if (c_o) {
    o = i;
    w_buffer_o(o);
  }
  zb = b;
  return TRUE;
}

FIGURE 8.20 Specification, clock analysis, and code generation in Signal.
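Since a clock denotes a set of instants, the clock expressions of Figure 8.17 can be prototyped as set operations, and clock relations checked as set predicates. The Python sketch below is our own illustrative reading (with ˆ- taken as set difference), not the Signal compiler's internal representation.

def clock(s):           # ˆx: the set of tags at which x is present
    return set(s)

def when_true(b):       # when b: tags where b is present and true
    return {t for t, v in b.items() if v is True}

def when_not(b):        # when not b: tags where b is present and false
    return {t for t, v in b.items() if v is False}

def c_plus(k1, k2):     # k1 ˆ+ k2
    return k1 | k2

def c_star(k1, k2):     # k1 ˆ* k2
    return k1 & k2

def c_minus(k1, k2):    # k1 ˆ- k2 (read here as set difference)
    return k1 - k2

def synchronized(k1, k2):   # k1 ˆ= k2: the same set of instants
    return k1 == k2

# The alternating flip-flop b of the buffer example: c_i and c_o
# partition the master clock ˆb into the instants where i and o occur.
b = {1: True, 2: False, 3: True, 4: False}
c_i, c_o = when_true(b), when_not(b)
print(synchronized(clock(b), c_plus(c_i, c_o)))   # True
print(c_star(c_i, c_o))                           # set(): clocks are exclusive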
8.4.3.2 Compilation of Lustre

Whereas Signal uses a hierarchization algorithm to find a sequential execution path starting from a system of clock relations, Lustre leaves this task to engineers, who must provide a sound, fully synchronized program in the first place: well-synchronized Lustre programs correspond to hierarchized Signal specifications. The classic compilation of Lustre starts with a static program analysis that checks the correct synchronization and cycle freedom of signals defined within the program. It then essentially partitions the program into elementary blocks activated upon boolean conditions [9] and focuses on generating efficient code for high-level constructs, such as iterators for array processing [31]. Recent efforts have been conducted to enhance this compilation scheme by introducing effective activation clocks, whose soundness is checked by typing techniques. In particular, this was applied to the industrial SCADE version, with extensions [32,33].

8.4.3.3 Certification

The simplicity of the single-clocked model of Lustre eases program analysis and code generation. Therefore, its commercial implementation, Scade by Esterel Technologies, provides a certified C code generator. Its combination with Sildex (the commercial implementation of Signal by TNI-Valiosys) as a frontend for architecture mapping and early requirement specification is the methodology advocated in the IST project Safeair (URL: http://www.safeair.org).

The formal validation and certification of synchronous program properties has been the subject of numerous studies. In Reference 34, a coinductive axiomatization of Signal in the proof assistant Coq [35], based on the calculus of constructions [36], is proposed. The application of this model is twofold. First, it allows for the exhaustive verification of formal properties of infinite-state systems. Two case studies have been developed. In Reference 37, a faithful model of the steam-boiler problem was given in Signal and its properties proved with Signal's Coq model. In Reference 38, it is applied to proving the correctness of real-time properties of a protocol for loosely time-triggered architectures (TTAs), extending previous work proving the correctness of its finite-state approximation [39]. Another important application of modeling Signal in the proof assistant Coq is being explored: the development of a reference compiler translating Signal programs into Coq assertions. This translation
makes it possible to represent model transformations performed by the Signal compiler as correctness-preserving transformations of Coq assertions, yielding a costly yet correct-by-construction synthesis of the target code. Other approaches to the certification of generated code have been investigated. In Reference 40, validation is achieved by checking a model of the C code generated by the Signal compiler in the theorem prover PVS against a model of its source specification (translation validation). Related work on modeling Lustre has been equally numerous and started in Reference 41 with the verification of a sequential multiplier using a model of stream functions in Coq. In Reference 42, the verification of Lustre programs is considered under the concept of generating proof obligations and by using PVS. In Reference 43, a semantics of Lucid-Synchrone, an extension of Lustre with higher-order stream functions, is given in Coq.
8.5 Success Stories — A Viable Approach for System Design

Synchronous and reactive formalisms appeared in the early nineties, and the theory has matured and expanded since then to cover all the topics presented in this chapter. Research groups were active mostly in France, but also notably in Germany and the United States. Several large academic projects were completed, including the IST Syrf, Sacres, and Safeair projects, as well as industrial early-adopter ones. S/R modeling and programming environments are today marketed by two French software houses: Esterel Technologies for Esterel and SCADE/Lustre, and TNI-Valiosys for Sildex/Signal. The influence of S/R systems has tentatively pervaded hardware CAD products such as Synopsys CoCentric Studio and Cadence VCC, despite the dominance of classical HDLs there. The Ptolemy cosimulation environment from UC Berkeley comprises an S/R domain based on the synchronous hypothesis.

There have been a number of industrial take-ups of S/R formalisms, most of them in the aeronautics industry. Airbus Industries is now using Scade for the real design of parts of the new Airbus A-380 aircraft. S/R languages are also used by Dassault Aviation (for the next-generation Rafale fighter jet) and Snecma ([4] gives an in-depth coverage of these prominent collaborations). Car and phone manufacturers are also paying increasing attention (for instance, at Texas Instruments), as are advanced development teams in embedded hardware divisions of prominent companies (such as Intel).
8.6 Into the Future: Perspectives and Extensions

Future advances in and around synchronous languages can be predicted in several directions:

Certified compilers. As already mentioned, this is the case for the basic SCADE compiler. As the demand becomes higher, owing to the safety-critical aspects of applications (notably in the transportation fields), the impact of full-fledged operational semantics backing the actual compilers should increase.

Formal models and embedded code targets. Following the trend of exploiting formal models and semantic properties to help define efficient compilation and optimization techniques, one can consider the case of targeting distributed platforms (but still with a global reaction time). Then, the issues of spatial mapping and temporal scheduling of the elementary operations composing the reaction inside a given interconnect topology become a fascinating (and NP-complete) problem. Heuristics for user guidance and semiautomatic approaches are the main topic of the SynDEx environment [44,45]. Of course, this requires an estimation of the time budgets for the elementary operations and communications.

Desynchronized systems. In larger designs, the fully global synchronous assumption is hard to maintain, especially if long propagation chains occur inside a single reaction (in hardware, for instance, the clock tree cannot be distributed to the whole chip). Several types of answers are currently being brought to this issue, trying to instill a looser coupling of synchronous modules in a desynchronized network (one then talks of "Globally Asynchronous Locally Synchronous," GALS, systems). In the theory of latency-insensitive design, all processes are supposed to be able to stall until the full information is synchronously available.
The exact latency durations needed to recover a (slower) synchronous model are computed afterwards, once functional correctness at the more abstract level has been achieved [46,47]. Fancier approaches, trying to save on communications and synchronizations, are introduced in Section 8.6.1.

Relations between transactional and cycle-accurate levels. If synchronous formalisms can be seen as a global attempt at transferring the notion of cycle-accurate modeling to the design of SW/HW embedded systems, then the existing gap between these levels must also be reconsidered in the light of formal semantics and mathematical models. Currently, there exists virtually no automation for the synthesis of RTL from TLM levels. The previous item, with its well-defined relaxation of the synchronous hypothesis at specification time, could be a definite step in this direction (of formally linking two distinct levels of modeling).

Relations between cycle-accurate and timed models. Physical timing is of course a big concern in synchronous formalisms, if only to validate the synchronous hypothesis and establish converging stabilization of all values across the system before the next clock tick. While in traditional software implementations one can decide that the instant is over when all treatments were effectively completed, in hardware or other real-time distributed settings a true compile-time timing analysis is in order. Several attempts have been made in this direction [48,49].
8.6.1 Asynchronous Implementation of Synchronous Specifications

The relations between synchronous and asynchronous models have long remained unclear, but investigations in this direction have recently received a boost owing to demands coming from the engineering world. The problem is that many classes of embedded applications are best modeled (at least in part) under the cycle-based synchronous paradigm, while their desired implementation is not. This covers implementation classes that are becoming increasingly popular (such as distributed software, or complex digital circuits such as Systems-on-a-Chip), hence the practical importance of the problem. Such implementations are formed of components that are only loosely connected through communication lines that are best modeled as asynchronous. At the same time, the existing synchronous tools for specification, verification, and synthesis are very efficient and popular, meaning that they should be used for most of the design process.

In distributed software, the need for global synchronization mechanisms has always existed. However, in order to be used in aerospace and automotive applications, an embedded system must also satisfy very high requirements in the areas of safety, availability, and fault tolerance. These needs prompted the development of integrated platforms, such as TTA [50], which offer higher-level, proven synchronization primitives, more adapted to specification, verification, and certification. The same correctness and safety goals are pursued in a purely synchronous framework by two approaches: the AAA methodology and the SynDEx software of Sorel and co-workers [45], and the Ocrep tool of Girault and co-workers [51]. Both approaches take as input a synchronous specification, an architecture model, and some real-time and embedding constraints, and produce a distributed implementation that satisfies the constraints and the synchrony hypothesis (supplementary signals simulate at runtime the global clock of the initial specification). The difference is that Ocrep is rather tailored for control-dominated synchronous programs, while SynDEx works best on data-flow specifications with simple control.

In the (synchronous) hardware world, problems appear when the clock speed and circuit size become large enough to make global synchrony unfeasible (or at least very expensive), most notably in what concerns the distribution of the clock and the transmission of data over long wires between functional components. The problem is to ensure that no communication error occurs owing to the clock skew or the interconnect delay between the emitter and the receiver. Given the high cost (in area and power consumption) of precise clock distribution, it appears in fact that the only long-term solution is the division of large systems into several clocking domains, accompanied by the use of novel on-chip communication and synchronization techniques. When the multiple clocks are strongly correlated, we talk about mesochronous or plesiochronous systems. However, when the different clocks are unrelated (e.g., for power-saving reasons), the resulting circuit is best modeled as a GALS system where the synchronous domains are connected through asynchronous
communication lines (e.g., FIFOs). Such approaches include pausible clocking by Yun and Donohue [52] or, in a framework where a global, reference clock is still preserved, latency-insensitive design by Carloni et al. [46]. A multi-clock extension of the Esterel language [53] has been proposed for the description of such systems. A more radical approach to the hardware implementation of a synchronous specification is desynchronization [54], where the clock subsystem is entirely removed and replaced with asynchronous handshake logic. The advantages of such implementations are those of asynchronous logic: smaller power consumption, average-case performance, and smaller electromagnetic interference.

At an implementation-independent level, several approaches propose solutions to various aspects of the GALS implementation problem. The loosely time-triggered architectures of Benveniste et al. [39] define a sampling-based approach to (inter-process) FIFO construction. More importantly, Benveniste et al. [55] define semantics preservation, an abstract notion of correct GALS implementation of a synchronous specification (asynchronous communication is modeled here as message passing). Latency insensitivity ensures semantics preservation in a very simple, highly constrained way. Less constraining and higher-level conditions are the compositional criteria of finite flow-preservation of Talpin et al. [56,57] and of weak endo-isochrony of Potop-Butucaru et al. [58]. While finite-flow preservation focuses on checking equivalence through finite desynchronization protocols, weak endo-isochrony makes it possible to exploit the internal concurrency of synchronous systems in order to minimize signalization, and to handle infinite behaviors.
References

[1] Nicolas Halbwachs. Synchronous programming of reactive systems. In Computer Aided Verification (CAV'98). Kluwer Academic Publishers, 1998, pp. 1–16.
[2] Gérard Berry. Real-time programming: general-purpose or special-purpose languages. In Information Processing 89, G. Ritter, Ed. Elsevier Science Publishers B.V. (North Holland), Amsterdam, 1989, pp. 11–17.
[3] Albert Benveniste and Gérard Berry. The synchronous approach to reactive and real-time systems. Proceedings of the IEEE, 79: 1270–1282, 1991.
[4] Albert Benveniste, Paul Caspi, Stephen Edwards, Nicolas Halbwachs, Paul Le Guernic, and Robert de Simone. Synchronous languages twelve years later. Proceedings of the IEEE. Special Issue on Embedded Systems, January 2003.
[5] Gérard Berry and Laurent Cosserat. The Synchronous Programming Language Esterel and its Mathematical Semantics, Vol. 197 of Lecture Notes in Computer Science, 1984.
[6] Frédéric Boussinot and Robert de Simone. The Esterel language. Proceedings of the IEEE, September 1991.
[7] Gérard Berry and Georges Gonthier. The Esterel synchronous programming language: design, semantics, implementation. Science of Computer Programming, 19: 87–152, 1992.
[8] Charles André. Representation and analysis of reactive behavior: a synchronous approach. In Computational Engineering in Systems Applications (CESA'96). IEEE-SMC, Lille, France, 1996, pp. 19–29.
[9] Nicolas Halbwachs, Paul Caspi, and Pascal Raymond. The synchronous data-flow programming language Lustre. Proceedings of the IEEE, 79: 1991.
[10] Albert Benveniste, Paul Le Guernic, and Christian Jacquemot. Synchronous programming with events and relations: the Signal language and its semantics. Science of Computer Programming, 16: 103–149, 1991.
[11] Gérard Berry. The Constructive Semantics of Pure Esterel. Esterel Technologies, 1999. Electronic version available at http://www.esterel-technologies.com.
[12] Tom Shiple, Gérard Berry, and Hervé Touati. Constructive analysis of cyclic circuits. In Proceedings of the International Design and Testing Conference (ITDC). Paris, 1996.
[13] Ellen Sentovich, Kanwar Jit Singh, Luciano Lavagno, Cho Moon, Rajeev Murgai, Alexander Saldanha, Hamid Savoj, Paul Stephan, Robert Brayton, and Alberto Sangiovanni-Vincentelli. SIS: a system for sequential circuit synthesis. Memorandum UCB/ERL M92/41, UCB, ERL, 1992.
[14] Minxi Gao, Jie-Hong Jiang, Yunjian Jiang, Yinghua Li, Subarna Sinha, and Robert Brayton. MVSIS. In Proceedings of the International Workshop on Logic Synthesis (IWLS'01). Tahoe City, June 2001.
[15] Ellen Sentovich, Horia Toma, and Gérard Berry. Latch optimization in circuits generated from high-level descriptions. In Proceedings of the International Conference on Computer-Aided Design (ICCAD'96), 1996.
[16] Hervé Touati and Gérard Berry. Optimized controller synthesis using Esterel. In Proceedings of the International Workshop on Logic Synthesis (IWLS'93). Lake Tahoe, 1993.
[17] Amar Bouali, Jean-Paul Marmorat, Robert de Simone, and Horia Toma. Verifying synchronous reactive systems programmed in Esterel. In Proceedings of the FTRTFT'96, Vol. 1135 of Lecture Notes in Computer Science, 1996, pp. 463–466.
[18] Amar Bouali. Xeve, an Esterel verification environment. In Proceedings of the Tenth International Conference on Computer Aided Verification (CAV'98), Vol. 1427 of Lecture Notes in Computer Science. UBC, Vancouver, Canada, June 1998.
[19] Robert de Simone and Annie Ressouche. Compositional semantics of Esterel and verification by compositional reductions. In Proceedings of the CAV'94, Vol. 818 of Lecture Notes in Computer Science, 1994.
[20] Laurent Arditi, Hédi Boufaïed, Arnaud Cavanié, and Vincent Stehlé. Coverage-Directed Generation of System-Level Test Cases for the Validation of a DSP System, Vol. 2021 of Lecture Notes in Computer Science. Springer-Verlag, Heidelberg, 2001.
[21] Gérard Berry. Esterel on hardware. Philosophical Transactions of the Royal Society of London, Series A, 19: 87–152, 1992.
[22] Daniel Weil, Valérie Bertin, Etienne Closse, Michel Poize, Patrick Vernier, and Jacques Pulou. Efficient compilation of Esterel for real-time embedded systems. In Proceedings of the CASES'00. San Jose, CA, 2000.
[23] Stephen Edwards. An Esterel compiler for large control-dominated systems. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 21: 169–183, February 2002.
[24] Dumitru Potop-Butucaru and Robert de Simone. Optimizations for faster execution of Esterel programs. In Formal Methods and Models for System Design, Rajesh Gupta, Paul Le Guernic, Sandeep Shukla, and Jean-Pierre Talpin, Eds. Kluwer, Dordrecht, 2004.
[25] Steven Muchnick. Advanced Compiler Design and Implementation. Morgan Kaufmann Publishers, San Francisco, 1997.
[26] Robert French, Monica Lam, Jeremy Levitt, and Kunle Olukotun. A general method for compiling event-driven simulations. In Proceedings of the 32nd Design Automation Conference (DAC'95). San Francisco, CA, 1995.
[27] Jaejin Lee, David Padua, and Samuel Midkiff. Basic compiler algorithms for parallel programs. In Proceedings of the 7th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. Atlanta, GA, 1999.
[28] Paul Le Guernic, Jean-Pierre Talpin, and Jean-Christophe Le Lann. Polychrony for system design. Journal of Circuits, Systems and Computers. Special Issue on Application-Specific Hardware Design, 12(3): 261–304, 2003.
[29] Albert Benveniste, Paul Caspi, Luca Carloni, and Alberto Sangiovanni-Vincentelli. Heterogeneous reactive systems modeling and correct-by-construction deployment. In Embedded Software Conference (EMSOFT'03). Springer-Verlag, Heidelberg, October 2003.
[30] Pascalin Amagbegnon, Loïc Besnard, and Paul Le Guernic. Implementation of the data-flow synchronous language Signal. In Conference on Programming Language Design and Implementation (PLDI'95). ACM Press, New York, 1995.
[31] Florence Maraninchi and Lionel Morel. Arrays and contracts for the specification and analysis of regular systems. In Proceedings of the International Conference on Applications of Concurrency to System Design (ACSD'04). IEEE Press, 2004.
[32] Jean-Louis Colaço and Marc Pouzet. Clocks as first class abstract types. In Proceedings of the EMSOFT'03, 2003.
[33] Jean-Louis Colaço, Alain Girault, Grégoire Hamon, and Marc Pouzet. Towards a higher-order synchronous data-flow language. In Proceedings of the EMSOFT'04, 2004.
[34] David Nowak, Jean-Rene Beauvais, and Jean-Pierre Talpin. Co-inductive axiomatization of a synchronous language. In Proceedings of the International Conference on Theorem Proving in Higher-Order Logics, Vol. 1479 of Lecture Notes in Computer Science. Springer-Verlag, Heidelberg, 1998.
[35] Eduardo Giménez. Un Calcul de Constructions Infinies et son Application à la Vérification des Systèmes Communicants. Ph.D. thesis, Laboratoire de l'Informatique du Parallélisme, Ecole Normale Supérieure de Lyon, December 1996.
[36] Benjamin Werner. Une Théorie des Constructions Inductives. Ph.D. thesis, Université Paris VII, May 1994.
[37] Mickael Kerboeuf, David Nowak, and Jean-Pierre Talpin. Specification and verification of a steam-boiler with Signal-Coq. In International Conference on Theorem Proving in Higher-Order Logics, Vol. 1869 of Lecture Notes in Computer Science. Springer-Verlag, Heidelberg, 2000.
[38] Mickael Kerboeuf, David Nowak, and Jean-Pierre Talpin. Formal proof of a polychronous protocol for loosely time-triggered architectures. In International Conference on Formal Engineering Methods, Vol. 2885 of Lecture Notes in Computer Science. Springer-Verlag, Heidelberg, 2003.
[39] Albert Benveniste, Paul Caspi, Paul Le Guernic, Hervé Marchand, Jean-Pierre Talpin, and Stavros Tripakis. A protocol for loosely time-triggered architectures. In Embedded Software Conference (EMSOFT'02), Vol. 2491 of Lecture Notes in Computer Science. Springer-Verlag, Heidelberg, October 2002.
[40] Amir Pnueli, O. Shtrichman, and M. Siegel. Translation validation: from Signal to C. In Correct System Design: Recent Insights and Advances, Vol. 1710 of Lecture Notes in Computer Science. Springer-Verlag, Heidelberg, 2000.
[41] Christine Paulin-Mohring. Circuits as streams in Coq: verification of a sequential multiplier. In Types for Proofs and Programs, TYPES'95, Vol. 1158 of Lecture Notes in Computer Science, S. Berardi and M. Coppo, Eds. Springer-Verlag, Heidelberg, 1996.
[42] Cécile Dumas Canovas and Paul Caspi. A PVS proof obligation generator for Lustre programs. In International Conference on Logic for Programming and Reasoning, Vol. 1955 of Lecture Notes in Artificial Intelligence. Springer-Verlag, Heidelberg, 2000.
[43] Sylvain Boulme and Grégoire Hamon. Certifying synchrony for free. In Logic for Programming, Artificial Intelligence and Reasoning, Vol. 2250 of Lecture Notes in Artificial Intelligence. Springer-Verlag, Heidelberg, 2001.
[44] Christophe Lavarenne, Omar Seghrouchni, Yves Sorel, and Michel Sorine. The SynDEx software environment for real-time distributed systems design and implementation. In Proceedings of the ECC'91. France, 1991.
[45] Thierry Grandpierre, Christophe Lavarenne, and Yves Sorel. Optimized rapid prototyping for real-time embedded heterogeneous multiprocessors. In Proceedings of the 7th International Workshop on Hardware/Software Co-Design (CODES'99). Rome, 1999.
[46] Luca Carloni, Ken McMillan, and Alberto Sangiovanni-Vincentelli. The theory of latency-insensitive design. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 20(9): 1059–1076, 2001.
[47] Alberto Sangiovanni-Vincentelli, Luca Carloni, Fernando De Bernardinis, and Marco Sgroi. Benefits and challenges of platform-based design. In Proceedings of the Design Automation Conference (DAC'04), 2004.
[48] George Logothetis and Klaus Schneider. Exact high-level WCET analysis of synchronous programs by symbolic state space exploration. In Proceedings of the DATE 2003. IEEE Computer Society, Germany, 2003.
[49] Etienne Closse, Michel Poize, Jacques Pulou, Joseph Sifakis, Patrick Venier, Daniel Weil, and Sergio Yovine. TAXYS: a tool for the development and verification of real-time embedded systems. In Proceedings of the CAV'01, Vol. 2102 of Lecture Notes in Computer Science. Springer-Verlag, Heidelberg, 2001.
[50] Hermann Kopetz. Real-Time Systems: Design Principles for Distributed Embedded Applications. Kluwer Academic Publishers, Dordrecht, 1997.
[51] Paul Caspi, Alain Girault, and Daniel Pilaud. Automatic distribution of reactive systems for asynchronous networks of processors. IEEE Transactions on Software Engineering, 25: 416–427, 1999.
[52] Kenneth Yun and Ryan Donohue. Pausible clocking: a first step toward heterogeneous systems. In Proceedings of the International Conference on Computer Design (ICCD'96), 1996.
[53] Gérard Berry and Ellen Sentovich. Multiclock Esterel. In Proceedings of the CHARME'01, Vol. 2144 of Lecture Notes in Computer Science, 2001.
[54] Ivan Blunno, Jordi Cortadella, Alex Kondratyev, Luciano Lavagno, Kelvin Lwin, and Christos Sotiriou. Handshake protocols for de-synchronization. In Proceedings of the International Symposium on Asynchronous Circuits and Systems (ASYNC'04). Crete, Greece, 2004.
[55] Albert Benveniste, Benoît Caillaud, and Paul Le Guernic. Compositionality in dataflow synchronous languages: specification and distributed code generation. Information and Computation, 163: 125–171, 2000.
[56] Jean-Pierre Talpin, Paul Le Guernic, Sandeep Kumar Shukla, Frédéric Doucet, and Rajesh Gupta. Formal refinement checking in a system-level design methodology. Fundamenta Informaticae. IOS Press, Amsterdam, 2004.
[57] Jean-Pierre Talpin and Paul Le Guernic. Algebraic theory for behavioral type inference. Formal Methods and Models for System Design, Chap. VIII, Kluwer Academic Press, Dordrecht, 2004.
[58] Dumitru Potop-Butucaru, Benoît Caillaud, and Albert Benveniste. Concurrency in synchronous systems. In Proceedings of the International Conference on Applications of Concurrency to System Design (ACSD'04). IEEE Press, 2004.
9 Introduction to UML and the Modeling of Embedded Systems

Thomas Weigert
Motorola

Øystein Haugen and Birger Møller-Pedersen
University of Oslo

9.1 Introduction
9.2 UML Overview
    Static System Structure • System Behavior • Execution Architecture • Embedded Systems and UML
9.3 Example — Automatic Teller Machine
    Domain Statement and Domain Model • Behavior Overview through a Use Case Diagram • Context Description by a Collaboration Diagram • Behavior Modeling with Interactions • Behavioral Modeling with State Machines • Validation • Generalizing Behavior • Hierarchical Decomposition • The Difference between the UML System and the Final System
9.4 A UML Profile for the Modeling of Embedded Systems
References
9.1 Introduction

Embedded systems have the following characteristics: most often, their software runs on a hardware platform dedicated to a specific task (e.g., a telephony switch). Naturally, the hardware imposes limits on what the software can do. Most embedded systems can be considered reactive: the system is sent a message and is supposed to give a response, which may involve a change in the state of the controlled hardware. Such software is usually real-time, but the performance requirements are more often statistical than absolute. Embedded systems are often required to handle many independent requests at the same time. Typically, this software is parallel and distributed. These characteristics result in embedded systems being expensive to develop.

In response to the constraints imposed by the underlying hardware, as well as the concerns of managing parallel and distributed systems, developers of embedded systems embraced modeling techniques early. As a consequence, software and system modeling for embedded system development has matured considerably. For example, telecommunications equipment manufacturers and operators have standardized notations to describe systems as early as the late seventies: the first standardized notation for state diagrams was the ITU (then CCITT) Specification and Description Language (SDL), originally approved in 1976 [1–3]. It recognized
that a telecom system would be built of blocks (we might say "objects" today) with a well-defined boundary (today we might say an "interface") across which explicitly indicated signals (or messages) pass. Proprietary variations of sequence diagrams (these were often referred to as "message flow graphs" or "bounce diagrams") had long been in use in the telecom industry. The standardization of Message Sequence Charts (MSC) was triggered by a paper by Grabowski and Rudolph [4], leading to the first MSC recommendation in 1992 (the current version is [23]).

As specification techniques matured, notations to capture software and system models proliferated [5–8]. With the adoption of the UML specification [9] by the OMG, this situation changed. Users and tool vendors alike began to adopt the emerging UML notation as the primary means of visualizing, specifying, constructing, and documenting software artifacts. A major revision [10] has addressed shortcomings and made UML amenable to systems modeling as well as to the representation of embedded systems.

This chapter is meant to give an overview of how UML can be used for modeling embedded systems. Supplementary introductions can be found in Reference 11.¹ We begin with a terse overview of UML and discuss those features of UML suited to represent the characteristics of embedded systems outlined above. We will illustrate these key features using examples. We will also show how to use UML when describing a simple, easily understood embedded system: we begin by describing the domain statement and domain model of the example system, give an overview of the system behavior through a use case diagram, and use collaboration diagrams to establish the system context. We then show how to model various aspects of the system behavior using interactions and state machines. Interactions describe the system behavior as a set of partially-ordered sequences of events that the system needs to exhibit. State machines describe the system behavior as an automaton that must induce these event sequences. We illustrate a simple process of validating that the state machine descriptions indeed specify the same behavior as the interactions. We then continue the example by exhibiting UML features useful for representing real-life systems: generalization allows variation in system behavior to be expressed abstractly, and hierarchical decomposition makes large system descriptions more scalable and understandable. Finally, we outline characteristics of embedded systems that are abstracted away by a UML specification but need to be considered when deploying such systems. This chapter concludes by presenting a standardized UML profile (a specification language instantiated from the UML language family) suitable for the modeling of embedded systems.
9.2 UML Overview

The UML defines a number of diagrams to describe various aspects of a system:

• Class diagrams, composite structure diagrams, component diagrams, and object diagrams specify its static structure.
• Activity diagrams, interaction diagrams, state machine diagrams, and use case diagrams provide different mechanisms of describing its behavior.
• Deployment diagrams depict its implementation and execution architecture.
• Package diagrams define how a system specification is grouped into modular units as the basis for configuration control, storage, access control, and deployment.

9.2.1 Static System Structure

The static structure of a system describes the entities that exist in the system, their structural properties, and their relationships to other entities. Entities of a system are classified according to their features. Instances with common features are specified by a classifier. Classifiers may participate in generalization relationships to other classifiers; features and constraints specified for instances of the general classifier are implicitly specified for instances
9.2.1 Static System Structure The static structure of a system describes the entities that exist in the system, their structural properties, and their relationships to other entities. Entities of a system are classified according to their features. Instances with common features are specified by a classifier. Classifiers may participate in the generalization relationships to other classifiers; features and constraints specified for instances of the general classifier are implicitly specified for instances 1 Part
of the work reported in this chapter has been sponsored by the SARDAS (Securing Availability by Robust Design, Assessment and Specification) project funded by the Norwegian research council.
of a more specific classifier. The structural features of a classifier describe relationships that an instance of this classifier has to other instances: for example, an association declares that there may be runtime links between instances of the associated classifiers (when a structural feature at an end of an association is owned by one of the associated classifiers, this structural feature is referred to as an attribute). The dynamic nature of those links can be characterized as reference or composition. Taken jointly, the structural features of a classifier define the state of the specified instances.

Structured classifiers specify the behavior of an instance as the result of the cooperation of a set of instances (its parts) which interact with each other and the environment of their container instance through well-defined communication channels (connectors). Ports serve to isolate a classifier from its environment by providing a well-defined point of interaction between the classifier and its environment (or the classifier and its internal parts). A port specifies the services an instance provides (offers) to its environment as well as the services that an instance expects (requires) of its environment.

The runtime nature of an instance is described by the particular kind of classifier that specifies it: a class specifies an instance that exists at runtime (an object) and has similar structure, behavior, and relationships. An association class specifies a relationship between instances of connected classifiers which has features of its own. An interface specifies a set of public features (a contract) that an instance of a classifier that implements the interface must possess (fulfill). A collaboration describes only the relevant aspects of the cooperation of a set of instances by abstracting those properties into roles that those instances may play (a role specifies the required set of features a participating instance must have). A component is a structured class satisfying constraints such that its instances can be deployed on popular component technologies.

Values may be specified through a computation that yields the value (either composed from other values as an expression or textually as an opaque expression), as literals of various types (such as Boolean, Integer, String, unlimited Naturals), as literals of enumerations, or as a reference to a specific instance. Instance specifications may describe an instance partially, by listing its salient characteristics only. Entities in a modeled system are described by instance specifications which show the values that correspond to each of the structural features of the classifier that specifies the entity.

Dependencies describe relationships between model elements such that the semantics of a set of client elements is dependent on one or more supplier elements. Examples of dependencies are abstraction, realization, substitution, implementation, usage, and permission relationships. For example, one model element implements another if an instance of the former has all the properties specified by the latter; a model element uses another if the former requires the latter to function.
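The provided/required contract of a port can be mimicked in ordinary object-oriented code: a port relays requests for provided services inward to a part, and delegates required services outward to whatever a connector attaches. The sketch below is our own analogy in Python (all names invented); UML itself does not prescribe such a realization.

from abc import ABC, abstractmethod

class Provided(ABC):            # the contract a port offers its environment
    @abstractmethod
    def request(self, msg): ...

class Required(ABC):            # the contract a port expects of its environment
    @abstractmethod
    def notify(self, msg): ...

class Port(Provided):
    """A well-defined point of interaction: the owner's internals stay
    hidden behind the port, and the environment is reached only through
    the connector wired to it."""

    def __init__(self, inside: Provided):
        self.inside = inside    # internal part realizing the provided contract
        self.outside = None     # peer supplying the required contract

    def connect(self, peer: Required):
        self.outside = peer     # a connector attaches the environment

    def request(self, msg):     # environment -> classifier (provided service)
        return self.inside.request(msg)

    def notify(self, msg):      # classifier -> environment (required service)
        if self.outside is not None:
            self.outside.notify(msg)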
9.2.2 System Behavior

System behavior is the direct consequence of the actions of objects. Models of behavior describe how the states of these objects, as reflected by their structural features, change over time. The execution of a behavior is caused by events, such as being directly invoked by an action or triggered indirectly. An executing behavior is performed by an object, while emergent behavior results from the interaction of one or more participant objects.

A variety of specification mechanisms of behaviors are supported by the UML, such as automata, Petri-net like graphs, informal descriptions, or partially-ordered sequences of events. The styles of behavioral specification differ in their expressive power, albeit the choice of specification style is often one of convenience and purpose; typically, the same kind of behavior could be described by any of the different mechanisms. They may specify behaviors either explicitly, by describing the observable events resulting from the execution of the behavior, or implicitly, by describing a machine that would induce these events.

When a behavior is invoked, argument values corresponding to its formal parameters are made available to the execution. A behavior executes within a context object and independently of any other behavior executions. When a behavior completes its execution, a value or set of values is returned corresponding to its result parameters. Behavior specifications may either describe the overall behavior of an instance which is executed when the instance is created, or they may describe the behaviors executed when a behavioral feature is invoked.
In addition to structural features, classifiers may have behavioral features which specify that an instance of this classifier will respond to a designated request (such as an operation call or a sent signal) by invoking a behavior.

Actions are the fundamental unit of behavior specification. An action takes a set of inputs and converts them into a set of outputs, though either or both sets may be empty. In addition, some actions modify the state of the system in which the action executes. Actions may perform operation calls and signal sends, receive events and reply to invocations, access (read and write) structural features of objects and temporary variables, create and destroy objects, and perform computations.

A state machine performs actions based on the current state of the machine and an event that triggers a transition to another state, where a state represents a situation during which some implicit, invariant condition holds. Each transition specifies a trigger upon which the transition will fire, provided that its guard conditions hold and the event is not deferred in this state. Possible triggers are the reception of a signal or the invocation of an operation, the change of a Boolean value to true, or the expiration of a predetermined deadline. Any action that may be associated with that transition will be performed; the next event is only processed when all such actions have been completely processed (run-to-completion semantics). In addition, actions may be performed upon entry to a state, while in a state, or upon exit from a state. Transitions may be broken into segments connecting so-called pseudo states: history, join, fork, junction, and choice vertices, as well as entry and exit points. State machines support two structuring mechanisms: submachine states allow factoring and reuse of common aspects of a state machine, while composite states partition the set of states into disjunct regions.

An interaction describes system behavior as a set of partially-ordered sequences of events. The parts of the underlying classifier instance are the participants in an interaction; they are represented by lifelines. Events are depicted along the lifelines, ordered from top to bottom. Typical events are the sending and receiving of signals or operation calls, the invocation and termination of behaviors, as well as the creation or destruction of instances. Events on the same lifeline retain their order, while events on different lifelines may occur in any order (they are interleaved). However, coregions (and parallel merge) make the order of events on a lifeline irrelevant, and an order may be imposed on events on different lifelines through constraints and a general ordering mechanism. Communication between instances is shown by messages which may specify both the content of the communication as well as the nature of the communication (synchronous versus asynchronous, signal versus call, etc.). The compactness and conciseness of interactions are enhanced by providing combinators for fragments of event sequences: alternatives and options describe a choice in behavior; parallel merge, weak sequencing, and strict sequencing define the ordering of events, while critical regions prevent interleaving of events in a fragment altogether; loops indicate that a fragment is repeated a number of times. Sequence fragments may be prefixed by guards which prevent the execution of the fragment when false. Interaction references allow factoring and reuse of common interaction fragments.
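The run-to-completion rule for state machines, mentioned above, is easy to picture as an event queue drained one event at a time: the actions of a transition finish before the next event is even examined. The following sketch is our own minimal illustration in Python (guards and actions as callables); it is not code produced by a UML tool, and it omits deferral, hierarchy, and pseudo states.

from collections import deque

class StateMachine:
    def __init__(self, initial, transitions):
        # transitions: (state, event) -> (guard, action, target state)
        self.state = initial
        self.transitions = transitions
        self.queue = deque()

    def post(self, event):
        self.queue.append(event)    # events may arrive at any time

    def run(self):
        # Run-to-completion: one event is fully processed (guard checked,
        # action executed, state changed) before the next is dequeued.
        while self.queue:
            event = self.queue.popleft()
            entry = self.transitions.get((self.state, event))
            if entry is not None:
                guard, action, target = entry
                if guard():
                    action()
                    self.state = target
            # events with no enabled transition are simply discarded here

sm = StateMachine("idle", {
    ("idle", "start"): (lambda: True, lambda: print("starting"), "busy"),
    ("busy", "done"):  (lambda: True, lambda: print("finished"), "idle"),
})
sm.post("start"); sm.post("done"); sm.run()   # prints "starting", "finished"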
An interaction describes system behavior as a set of partially ordered sequences of events. The parts of the underlying classifier instance are the participants in an interaction; they are represented by lifelines. Events are depicted along the lifelines, ordered from top to bottom. Typical events are the sending and receiving of signals or operation calls, the invocation and termination of behaviors, as well as the creation or destruction of instances. Events on the same lifeline retain their order, while events on different lifelines may occur in any order (they are interleaved). However, coregions (and parallel merge) make the order of events on a lifeline irrelevant, and an order may be imposed on events on different lifelines through constraints and a general ordering mechanism. Communication between instances is shown by messages, which may specify both the content of the communication and the nature of the communication (synchronous versus asynchronous, signal versus call, etc.). The compactness and conciseness of interactions are enhanced by providing combinators for fragments of event sequences: alternatives and options describe a choice in behavior; parallel merge, weak sequencing, and strict sequencing define the ordering of events, while critical regions prevent interleaving of events in a fragment altogether; loops indicate that a fragment is repeated a number of times. Sequence fragments may be prefixed by guards, which prevent the execution of the fragment when false. Interaction references allow factoring and reuse of common interaction fragments.
In addition, the high-level structure of interactions can be depicted in a flow-graph-like structure, showing how interaction fragments may follow each other and making choices, loops, or parallelism explicit in the flow graph. Interactions may also be used to assert behavioral properties of a system.
Activities emphasize the sequence of actions, where these actions are initiated because other actions finish executing, because objects and data become available, or because triggering events occur. Activities are actions connected by flows: a data flow routes values between actions and between parameters and actions; a control flow enables actions without the transmission of data. Activities support simple unstructured sequences of actions involving branching and joining; concurrent control and data flow; structured control flows through loops and conditionals; and exception handling. The semantics of activities is gleaned from Petri nets: activities are interpreted as graphs of nodes (actions, flow-of-control operators, etc.) and edges (data and control flows). A node begins execution when specified conditions on its input tokens are satisfied; when a node begins execution, tokens are taken from the input edges; upon completion, tokens are offered on its output edges. Execution of the actions comprising an activity is solely constrained by flow relationships: if two actions are not ordered through flow relationships, they may execute concurrently, unless otherwise constrained. In addition, there are various loosely defined mechanisms, such as streaming outputs or token weights, to impose or vary the manner according to which tokens flow through nodes and along edges (with the intent to support the modeling of processes and workflows).
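To make the token rule concrete, here is a toy rendering in Java; Edge and Node are invented names for this sketch, not UML metaclasses, and real tool semantics are considerably richer.

    import java.util.List;

    // A toy rendering of the token rule sketched above.
    class Edge {
        int tokens;
    }

    class Node {
        private final List<Edge> inputs;
        private final List<Edge> outputs;

        Node(List<Edge> inputs, List<Edge> outputs) {
            this.inputs = inputs;
            this.outputs = outputs;
        }

        // A node may begin execution only when every incoming edge offers a
        // token; firing takes those tokens and offers one on each output.
        boolean tryFire() {
            for (Edge e : inputs) {
                if (e.tokens == 0) return false;  // not yet enabled
            }
            for (Edge e : inputs)  e.tokens--;    // take tokens from input edges
            for (Edge e : outputs) e.tokens++;    // offer tokens on output edges
            return true;
        }
    }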
Activities may be hierarchically structured in that actions may invoke behaviors, which may in turn be activities. Further, elements of an activity may be grouped into partitions. These do not affect the token flow but may, for example, be used to allocate each partition to a separate part of a structured classifier.
A use case is the specification of a set of actions performed by the system. Each use case represents a behavior that the system can perform in collaboration with one or more actors, without reference to its internal structure. Use cases are typically leveraged for the description of system requirements, for the specification of requirements that the system imposes on its environment, or for the description of functionality offered by the system.
9.2.3 Execution Architecture
The execution architecture of a system (the configuration of runtime processing resources and the processes and objects that execute on them) is specified in terms of a topology of nodes connected through communication paths, where nodes represent hardware devices or software execution environments. Deployment is the allocation of software artifacts (physical elements that are the result of a development process) to nodes.
9.2.4 Embedded Systems and UML
In Section 9.1, we discussed that embedded systems often run on dedicated hardware, are reactive, respond in real-time with soft time constraints, and are parallel and distributed. Various aspects of UML cater well to modeling these system characteristics.
The UML describes system behavior independent of the constraints of a particular platform; a UML system can be viewed as being executed by a virtual UML machine. While this UML machine is not subjected to the limitations of the actual hardware (see Section 9.3.9), it makes it possible to describe system behavior independent of these limitations or, for that matter, of special capabilities that the hardware may offer. It is, therefore, possible to choose the allocation of various aspects of the system functionality to different underlying processing elements based on system needs and architectural considerations. Functionality can be moved from one underlying system component to another, should this be required by the performance requirements of the system. The allocation of functional elements to the underlying hardware can be shown in deployment diagrams. Any ancillary information required to deploy a functional element on a processing element can be expressed as part of the deployment description; the description of the functionality remains clean of these implementation details.
A system is reactive when it responds to stimuli received from its environment during the lifetime of the system. The UML communication model is particularly well suited for describing system behavior as a result of the interaction with the environment. The objects described by a UML specification may have behavioral features, referred to as operations and receptions, that are triggered by environmental stimuli (the receipt of operation calls or signals, respectively). State machines describe how the system responds to such triggers by executing actions in a state-dependent manner. Interactions describe system behavior as a set of (partially ordered) event sequences which typically exhibit how the system responds to stimuli received from its environment.
Systems that are required to respond to environmental stimuli in real-time need to be able to process multiple requests concurrently. The environment may present the system with multiple, often concurrent, requests. Handling one request may potentially take a long time but must not delay responding to a different stimulus. The UML supports describing the system as composed of multiple, concurrently executing internal components (its parts). The overall system behavior is the result of the cooperation of these parts with each other and the environment. Note that the decomposition into independently executing architectural components is necessary even when assuming infinitely fast processing speeds, as the handling of a given stimulus may require an unbounded amount of time, yet must not delay the handling of other requests. All behavioral descriptions afforded by UML are sensitive to the internal architecture
of a system as described by its composite structure. In addition, time constraints can be expressed for all behavioral elements. A UML system specification can state hard real-time constraints, but it cannot guarantee that the system will meet these constraints.
As described above, the system behavior is represented as the result of the cooperation of the internal parts of the system. The UML assumes that the parts of the system execute concurrently, and it makes no assumptions about the allocation to processing elements. Consequently, UML describes a system as fully parallel and distributed. However, the system functionality can be allocated to underlying processing elements using deployment diagrams, thus constraining the actual system implementation. Of course, these features may also be useful when modeling other systems, but they are essential in representing embedded systems.
The UML is a very rich specification language, as it aims to be applicable to any software or system application domain. Consequently, not each of its constructs is equally amenable to deployment in every possible application domain. Practitioners have proposed methodologies that focus on deploying UML in the embedded system domain. For example, the ITU service description methodology (see Figure 9.1), as adapted to UML [12], identifies the following development stages: Stage 1 gives the description of the services expected of the system to be developed, from the user point of view; Stage 2 describes the user–system interface and the interfaces between different service access points within the system; and Stage 3 gives the description of the detailed system components, as well as of the protocols and message formats. This methodology leverages the following UML diagrams:
FIGURE 9.1 ITU method for service description. [Flow diagram: Stage 1 (service aspects) covers service definition and description with static and dynamic descriptions of the service; Stage 2 (functional network aspects) covers derivation of a functional model, information flow diagrams, dynamic descriptions and actions of functional entities, and allocation of functional entities; Stage 3 (network implementation aspects) covers protocols and formats as well as switching and service nodes.]
• The Stage 1 descriptions treat the system as a single entity that provides services to the users. Use cases or use case maps [13] are used to describe the services provided by the system in terms of the perception of the users as well as the users involved in a service. The dynamic information that is sent and received by the user is described by simple state machines.
• Stage 2 identifies the functional capabilities and the information flows needed to support the service as described in Stage 1. Composite structure diagrams describe the architecture of the functional entities comprising the system. Interaction diagrams specify the information flows between functional entities for both successful operation and error conditions. State machines are used to give the dynamic description of each functional entity. The actions performed by each functional entity are characterized in terms of pseudo-code or a formalized action language. In the final step at this stage, functional entities are allocated to underlying system components in composite structure diagrams.
• At Stage 3, the functional requirements of each system component are defined using class diagrams, composite structure diagrams (for hierarchical decomposition), and state machines (for specification of behavior). In addition, any relationship between two functional entities located in different system components must be realized by protocols supported between those system components. These protocols are typically described in the protocol definition language ASN.1 [14].
In the following, we shall leverage some of the key features of UML identified above (composite structure diagrams, sequence diagrams, and state machine diagrams) to illustrate, by example, the modeling of embedded systems using UML.
9.3 Example — Automatic Teller Machine
This chapter will use one example — an automated teller machine (ATM) — to illustrate how UML can be used to model an embedded system in a compact but readable way. While an automated teller machine is not the most exciting example of an embedded system, its functionality and requirements are well understood.
9.3.1 Domain Statement and Domain Model
An ATM is a system with mechanical as well as electronic parts. Its purpose is to provide a bank user with cash, provided that the user can be authenticated and has adequate funds in the bank account.
• In order to be authenticated, the user presents a card to the ATM card reader and provides a Personal Identification Number (PIN) through the ATM keyboard (which may be a physical keypad, a touch-screen pad, or similar).
• The ATM is connected electronically, possibly through a network, to the bank such that the account status may be checked online.
• The ATM is refilled with cash notes regularly or when the number of specific notes falls below a predetermined limit.
• The ATM may provide foreign currency to the customer.
We begin by constructing a simple class model describing the domain of the ATM system, see Figure 9.2 and Figure 9.3. The domain model does not yet describe the classes that will be developed as part of the system; rather, it captures the essence of the system as described by the domain statement. We see a number of classes that we expect to appear in the final system as well as high-level relationships between these classes, based on the domain statement: an ATM is associated with users and a bank, and it consists of a keyboard, a card reader, a screen, and a cash dispenser. Users will have a number of cards that give them access to accounts maintained by the bank. We may use multiplicities and role names on the relationships to document important characteristics of the domain. Note that according to the domain model, each card gives access to exactly one account.
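As a rough illustration of how such a domain model might eventually map to code, the following Java fragment renders the associations of Figure 9.2 as fields. This is a sketch only; all Java names are invented here, and the chapter itself stays at the model level.

    import java.util.ArrayList;
    import java.util.List;

    // Association ends become fields; a "*" multiplicity becomes a list,
    // a "1" multiplicity a single reference.
    class Account { }

    class Card {
        Account myAccount;                      // each card accesses exactly one account
    }

    class User {
        List<Card> cards = new ArrayList<>();   // a user may hold several cards
    }

    class Bank {
        List<Account> accounts = new ArrayList<>();
    }

    class ATM {
        Bank bank;                              // the ATM is connected to one bank
        // its parts (card reader, keyboard, screen, cash dispenser) appear in
        // the composite structure discussed in Section 9.3.8
    }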
FIGURE 9.2 Domain class model-I. [Class diagram relating User, ATM, Bank, Card, and Account, with multiplicities on the associations and the role name myAccount on the Card–Account end.]
FIGURE 9.3 Domain model-II. [Composition: the ATM consists of CardReader, Keyboard, Screen, and CashDispenser.]
FIGURE 9.4 Use cases of the ATM. [Use case diagram with actors User and Bank; use cases Withdrawal, Currency, and CashRepository, with «include» relationships to Authentication.]
9.3.2 Behavior Overview through a Use Case Diagram
We rely on use cases in a very restricted way, simply to give an overview of the different services that external entities (users and other systems) expect from the ATM system. As we represent the system by means of a class (here ATM), we define use cases as part of a class, see Figure 9.4. This provides a direct and simple way of specifying what the subject of the use cases is, namely, the ATM. An alternative would be to define use cases as part of packages, combined with some informal indication of what the subject of the use cases is.
FIGURE 9.5 The context of an ATM. [Collaboration BankContext with parts :User, :ATM, and :Bank; the ports User-reader, User-screen, User-keyboard, User-cash, and ATM-bank connect the parts.]
9.3.3 Context Description by a Collaboration Diagram
We use collaboration diagrams mainly for two purposes:
• To define the context for interactions between the system and entities in its environment.
• To tell which other entities the system is dependent upon. In this case only the ATM part will be developed, but it depends upon agreed interfaces to users and banks. While the interaction with the user is elaborated as part of the development of the ATM, the interaction with the bank may be given in advance.
Figure 9.5 shows the bank context of the ATM class. The context in which the ATM operates is comprised of two other parts: User and Bank. The parts are linked by connectors, which specify communication paths between the objects playing these parts. Note the difference between this model and the domain model: here we only describe the parts that interact, while classes such as Card and Account are not involved. In addition, we do not use the fact that ATMs consist of parts such as, for example, CardReader; instead, we use the port User-reader to indicate the connection to this internal part of the ATM, thus isolating the internal structure of the ATM from the environment.
This collaboration is used to specify a view of the cooperating entities only. It specifies the required features of the parts as well as the required communications between them. Any object that plays these parts must at least have the properties specified by the classifiers that type the parts of the collaboration. The system specified by the collaboration may have additional parts not shown, and the objects playing the parts may have additional properties. In addition, we can specify constraints on the number of objects that may play each part. In Figure 9.6 it is specified that there may be up to 10,000 users, up to 100 ATMs, and just one bank. The collaboration in Figure 9.6 specifies that the context in which the ATM works is comprised of three sets of objects. Each part of a collaboration represents a set of objects, in this case a set of User objects, a set of ATM objects, and a set of Bank objects. The specified system will be made up of objects corresponding to each of these parts, as specified by the multiplicities of the parts, but there will not be any object corresponding to the BankContext as such.
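Continuing the illustrative Java mapping from Section 9.3.1 (again with invented names), the parts and multiplicities of the collaboration might be rendered as plain fields. Note that the holder class below exists only for the sketch: as stated above, no runtime object corresponds to BankContext itself.

    import java.util.List;

    // Parts of the collaboration become object sets; the multiplicity bounds
    // of Figure 9.6 are recorded as comments, since Java types do not carry them.
    class BankContextWiring {
        List<User> users;   // 1..10,000 User objects
        List<ATM> atms;     // 1..100 ATM objects
        Bank bank;          // exactly one Bank
    }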
9.3.4 Behavior Modeling with Interactions
In this section, we will describe the services of the ATM based on the domain statement and the concepts and structure described above. We shall use sequence diagrams to describe the services of the system. A formalization of the approach to sequence diagram modeling shown here has been presented in Reference 15.
FIGURE 9.6 Bank context with multiplicities on parts. [As Figure 9.5, with part multiplicities :User [1..10,000] and :ATM [1..100] and a single :Bank.]
FIGURE 9.7 EnterPIN — a very simple sequence diagram. [Lifelines :User, :ATM, :Bank; msg(“Enter PIN”), four Digit messages, Code(cid, pin) to the bank, reply OK, continuation PIN OK.]
9.3.4.1 A Simple Sequence Diagram for a Simple Situation
We first present a very simple sequence diagram describing how to enter a PIN on the ATM. We are later going to apply this simple behavior in more elaborate scenarios. EnterPIN is a sequence diagram that requires no structuring mechanisms. In Figure 9.7, we see that the communicating objects refer to the parts of the collaboration depicted in Figure 9.5 and Figure 9.6. The objects in an interaction are called lifelines, as their most significant asset is the vertical line to which the different messages are attached. Each lifeline represents exactly one object and not a set of objects such as in the collaboration of Figure 9.6. The single object may be identified by a selector expression, but typically the interactions refer to an arbitrary object of a set, since the objects of the set are indistinguishable for most purposes.
Most people, regardless of their competence in UML and computer science, will intuitively interpret EnterPIN correctly. First, the ATM will give a message to the user that a PIN needs to be provided.
We rely on a message type msg with a plain string literal as actual parameter. This is a very simplistic approach; in practice, such systems will use strings from a language-specific repository to accommodate different locales (the language may actually be selectable by the user during the interaction). The user knows that this number consists of four digits and presses those digits. The ATM will combine those four digits into the PIN code and transmit that code, as well as the earlier collected card identification cid, to the bank for authentication. In this scenario, the bank returns a message ok to indicate that the PIN and the card identification were accepted. At the bottom of the sequence diagram we have placed a continuation with the label PIN OK to indicate that further continuations could start with the assumption that card and PIN were accepted. We shall look at continuations shortly.
The scenario in Figure 9.7 is obviously not the complete story, even regarding the simple matter of entering the PIN. We have given only one of many possible scenarios that a real system would have to take into account. We have only considered a case where the card and PIN are actually accepted. But cards and PINs are not always accepted, which is the whole purpose of authentication. To define a situation where the card and/or the PIN are rejected, we could construct another scenario that would start out identically to that of Figure 9.7, but would end up with a rejection of the user. Then, we could add yet another scenario where the user keyed in only three digits, or five, or the user entered a letter instead of a digit, and so on. We would quickly arrive at a large number of scenarios for the same small detail. The need for organizing these scenarios and categorizing them is apparent. (For brevity, in this section we limit ourselves to describing that card and PIN may be rejected as well as accepted.)
In Figure 9.8, we have introduced a combined fragment with the operator alt (alternative) to indicate that there are alternative scenarios. The upper operand ends in a continuation PIN NOK and the lower is identical to that of Figure 9.7. Since the first messages were identical, it is more compact and more transparent to describe the difference only where it appears. Messages are represented by the arrows attached to either side of a lifeline. In Figure 9.8, all the arrows have an open stick arrow head, which means that these messages are asynchronous and have an associated signal that is transferred from the sender to the receiver. Real-time embedded systems typically rely on asynchronous messages; for simplicity we shall only use this kind of message in this chapter. For other
FIGURE 9.8 EnterPIN — with alternatives. [As Figure 9.7, but after Code(cid, pin) an alt combined fragment follows: either NOK leading to continuation PIN NOK, or OK leading to continuation PIN OK.]
purposes, messages may also describe synchronous or asynchronous (remote) calls requiring a return message. A call is indicated by a filled arrow head. The message is attached to either side of a lifeline. Events are represented by the point where a message meets a lifeline. In other words, a message is the pair of a sending event and a receiving event. There are two important invariants about (simple) sequence diagrams:
• Message invariant: the sending event of a message must precede the receiving event of the same message.
• Lifeline invariant: events are ordered from top to bottom on a lifeline (with exceptions for events on different combined fragment operands).
The formal meaning of a sequence diagram is the set of event traces that it describes in the presence of the two invariants. Even simple diagrams often represent more traces than first meets the eye, especially because the different lifelines are independent. For example, the diagram shown in Figure 9.8 describes several traces, owing to the fact that the user keys in digits independently of the ATM consuming them. Thus, the sending events and receiving events may be arbitrarily interleaved, resulting in many different traces. Sometimes different traces represent significant behavioral differences, while in other cases, such as in our example diagram, there is little significance to the relative order of sending and receiving digits as long as the digits are consumed in the order they are sent, which is precisely described by the diagram.
9.3.4.2 Factoring out General/Common Interactions
EnterPIN is a simple scenario that is not even mentioned in the use cases of Figure 9.4. We could have introduced it as being used by the Authentication use case. We decided that the card holder has three tries to enter the PIN code correctly. This means that the interaction representing the entering of the PIN code will occur several times in the full interaction for authentication, as shown in Figure 9.9. The scenario described by Authenticate is also quite intuitive. First, the user will insert a card and thus transfer the card identification cid to the ATM. Then the EnterPIN scenario will take place. The final loop shows a situation that only occurs if the card and PIN are not accepted. This is shown by the loop starting with the continuation PIN NOK. Within the loop the EnterPIN scenario will again occur. The loop may execute from 0 to 2 times, as shown by the parameters in the operator tab.
FIGURE 9.9 Interaction for authentication. [sd Authenticate with lifelines :User, :ATM, :Bank; continuation Idle, message Cardid(cid), a ref to EnterPIN, then loop(0,2) starting with continuation PIN NOK, msg(“Try again!”), and another ref to EnterPIN.]
The reference to EnterPIN is given by an interaction occurrence, as indicated by the symbol ref in a corner tab. Interaction occurrences allow structuring of sequence diagrams, but must not be confused with operation calls. They are sub-scenarios; they are not invoked by an executing entity.
9.3.4.3 Describing the Full Withdrawal Service
Now, we are ready to describe the full Withdrawal service using the Authenticate scenario as a sub-scenario, as shown in Figure 9.10. There are no new language features in this scenario apart from the obvious fact that combined fragments (in this case, alternatives) may be nested. We have assumed that the ATM will keep the user's card during the execution of the withdrawal service. The card is returned at the end of the sequence. Again, we do not show all possible scenarios that may occur when withdrawing money, but only some of the scenarios that are important and must be considered by the implementers.
When sequence diagrams become complex, consisting of several lifelines and nested combined fragments, a designer may choose to utilize an interaction overview diagram. Such diagrams also describe interactions, but the detailed messages and lifelines are abstracted away, while the control structure is shown not as nested rectangles but as a branching graph.
FIGURE 9.10 Withdrawal of native money. [sd Withdrawal with lifelines :User, :ATM, :Bank; ref Authenticate; alt: on PIN OK, msg(“Select service”), Withdrawal, msg(“Enter amount!”), amount(v), checkaccount(v), then either ok with money(v) and receipt(v), or nok with msg(“Amount too large”); on PIN NOK, msg(“Illegal entry”); finally card and card taken.]
FIGURE 9.11 Withdrawal as interaction overview diagram. [Branching graph whose nodes are a ref to Authenticate (splitting on PIN OK/PIN NOK), a ref to SpecifyAmount, and small inline sequence diagrams for delivering money and receipt, rejecting an amount that is too large, signaling an illegal entry, and returning the card.]
Figure 9.11 shows Withdrawal as an interaction overview diagram. Nodes in the graph are either references to other interaction diagrams or inlined sequence diagrams. The graph notation is inherited from activity diagrams, but this diagram defines an interaction. The interaction overview diagram provides a high-level overview when the nodes are interaction occurrences; on the other hand, isolating every small piece of an interaction in a separate diagram leads to a loss of understanding, as there may be too many concepts to keep track of. The designer needs to find the appropriate balance of when to apply interaction overview diagrams instead of sequence diagrams. In Figure 9.11, we have chosen to isolate the interaction of specifying the withdrawal amount in a separate sequence diagram (not shown here), and kept the other trivial sequences as inline diagrams.
9.3.5 Behavioral Modeling with State Machines
Interactions describe behavior as a sequence of events and may focus on a subset of possible behaviors only. Alternatively, behavior can be described by a state machine that induces those sequences of events.
FIGURE 9.12 The state machine EnterPIN. [Initial transition: send(msg(“Give PIN”)); n=1; PIN=0, entering state enterDigit; self-transition [n<4]digit / n++; PIN=PIN+digit*10^(3-n); transition [n=4]digit / PIN=PIN+digit; send(Code(cid,PIN)) to state waitOK; from waitOK, the signals ok and nok lead to the correspondingly named exit points.]
9.3.5.1 Simple State Machine for a Simple Situation
The state machine for the behavior of entering the PIN code in terms of four digits is similarly simple, see Figure 9.12. The EnterPIN state machine obtains four digits, computes the PIN from these, and sends a signal carrying both the card identification (cid) and the PIN code (PIN) to the bank in order to check whether this is a valid combination of card identification and PIN code, before it enters the state waitOK, where it waits for the reply from the bank. The state machine is simple, but we have chosen to apply structuring mechanisms to enable its reuse in different situations where entering a PIN code may be needed. This is done by defining the interface of the state machine by means of two exit points: in case the bank sends the signal ok, EnterPIN exits via the exit point ok, whereas in case the bank sends the signal nok, the EnterPIN state is exited via the exit point nok.
9.3.5.2 High-Level ATM with Submachine States
The ATM state machine in Figure 9.13 is at a high level of abstraction in the sense that it describes the overall states and transitions, and that most of the behavior of the state machine is part of submachine states. Submachine states are states that are defined by means of a separate state machine. The state machine EnterPIN defined above is one such submachine. The syntax for a submachine state is a state symbol with an optional name followed by a colon (“:”) and the name of a state machine (e.g., EnterPIN). Entering a submachine state means entering the corresponding state machine via its initial (pseudo) state. When the submachine completes its behavior, it exits via one of the exit points.
Each state in the ATM state machine corresponds to one of the major modes of the ATM and also to the different screens presented to the user. As an example, the Service state corresponds to the screen where the user is asked to choose between two services (withdrawal or status). The reason for representing the detailed (partial) behaviors for entering a PIN code, withdrawing funds, and requesting account status by the corresponding submachine states EnterPIN, Withdrawal, and Status is that for the overall ATM behavior it is not important how these detailed behaviors are specified.
After having received a CardId, the ATM will enter a state in which a PIN code is entered. The two different outcomes are represented by two exits of the state EnterPIN. In case a valid PIN code is received, it will provide the services withdrawal and status, as represented by the state Service. The selection of a specific service is represented by its two exit points, Withdrawal and Status; the behaviors of these services are represented by the two states Withdrawal and Status.
FIGURE 9.13 High-level state machine for ATM. [States Idle, :EnterPIN, :Service, :Withdrawal, :Status, Cancelled, and CardOut (entry: send(card)); CardId(cid)/authN=0 leads from Idle to :EnterPIN; the nok exit of :EnterPIN branches on [authN<3]/authN++; send(msg(“Try again”)) back to :EnterPIN, or [authN==3]/authN=0; send(msg(“illegal entry”)) toward CardOut; the ok exit leads to :Service, whose exit points Withdrawal and Status lead to the respective submachine states; cardTaken returns the machine to Idle.]
The fact that the card shall be released in three situations (if the PIN is not accepted, when the withdrawal service is completed, and when the status service is completed) is represented by the entry action of the state CardOut. In addition, this state represents the fact that the ATM has to know that the card has been taken by the user (event cardTaken) before a new card can be accepted in state Idle.
The use of submachine states and exit points in the ATM state machine implies that we do not (at this level) specify which events trigger the detailed behavior. As an example, the submachine state EnterPIN has two exit points with the names ok and nok (these are thus not names of events). The events that trigger the EnterPIN submachine to be exited via the ok exit point are described in the state machine EnterPIN, see Figure 9.12. In this example, the overall ATM only has to know the names of the exit points ok and nok; it does not have to know about the events that trigger the EnterPIN state machine. We use submachine states because:
• The overall state machine becomes manageable and concise.
• The overall state machine is independent of the type of signals used within the sub-state machine.
The state machine ATM needs the following attributes:
• authN in order to count the number of attempts at authentication.
• cid to store the card identification passed with the CardId signal. (The signal event specification CardId(cid) means that the value carried by the CardId signal is assigned to the local attribute cid.)
We represent this information by a classifier symbol stereotyped as state machine, see Figure 9.14. Similarly, the state machine EnterPIN needs two attributes, as shown in Figure 9.15:
• n to keep track of the number of digits that have been entered.
• PIN to hold the PIN code computed from the entered digits.
FIGURE 9.14 Attributes of the ATM state machine. [«statemachine» ATM with attributes authN: integer and cid: integer.]
FIGURE 9.15 Attributes of EnterPIN. [«statemachine» EnterPIN with attributes n: integer and PIN: integer.]
The attribute cid used when sending the signal Code(cid,PIN) is the one defined in the ATM state machine.
9.3.5.3 Withdrawal
Withdrawal of funds requires that the user provides an amount and that the transaction is verified, see Figure 9.16. The attribute sa represents the selected amount that is to be verified against the account. This attribute is set by the GetAmount submachine. If the amount is too large, a message is given to the user, who may enter another amount. The operation sendMoney represents the actions needed in order to deliver the requested amount of funds (Figure 9.17).
9.3.5.4 Flexibility through Specialization
To describe the state machine GetAmount, we start out with a simple version that merely lets the user select among predefined amounts (such as 100, 200, 300, 400, etc.). At any time during this selection, the user may cancel the selection, see Figure 9.18. We omit the details of selecting an amount, which is covered by the state machine SelectAmount.
A more flexible ATM would give the user the opportunity to enter the desired amount, in addition to selecting among predefined amounts. We define FlexibleATM as a subclass of ATM and extend GetAmount in the subclass, as shown in Figure 9.19. By extending the state GetAmount, the GetAmount submachine of FlexibleATM maintains some of the behavior of the more general GetAmount submachine, see Figure 9.20. The extended state machine GetAmount adds the state EnterAmount and also adds transitions to and from this new state. In addition, it redefines the transition from the entry point again so that the target state is the added state EnterAmount. The inherited state SelectAmount is drawn with a dashed line. Although inherited, this state is shown in the extended state diagram in order to add the transition with the trigger otherAmount, leading to the new state EnterAmount. In general, states and transitions that are not shown are inherited from the general state machine. Inherited states and transitions may be drawn using dashed lines in order to establish the context for extension and redefinition. The resulting state machine defined implicitly for the extended GetAmount state machine is depicted in Figure 9.21.
This example illustrates the difference from state machines in earlier versions of UML. Without the structuring mechanisms recently introduced to UML, this state machine would have to be specified as shown in Figure 9.22.
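The effect of such an extension can be approximated in an ordinary object-oriented language. The Java fragment below is a loose analogue with invented names, not a UML-prescribed mapping: the subclass redefines only the amount-selection behavior, just as FlexibleATM redefines only GetAmount.

    // Sketch: behavior specialization approximated by method redefinition.
    class GetAmountBehavior {
        // Basic ATM: the user can only pick from predefined amounts.
        void onOtherAmount() { /* ignored: no free-form entry here */ }
    }

    class FlexibleGetAmount extends GetAmountBehavior {
        @Override
        void onOtherAmount() {
            // Added EnterAmount state: prompt the user to key in an amount.
        }
    }

    class BasicAtm {
        GetAmountBehavior getAmount() { return new GetAmountBehavior(); }
    }

    class FlexibleAtm extends BasicAtm {
        @Override
        GetAmountBehavior getAmount() { return new FlexibleGetAmount(); }
    }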
FIGURE 9.16 The withdrawal sub-state machine. [Submachine state :GetAmount with exit point Cancelled and re-entry point Again; once an amount is obtained, send(CheckAccount(sa)) leads to state VerifyTransaction; nok / send(msg(“Amount too large”)) re-enters :GetAmount via Again, while ok / sendMoney(sa); send(Receit(sa)) leads to the exit point ok.]
FIGURE 9.17 Properties of withdrawal. [«statemachine» Withdrawal with attribute sa: Amount and operation sendMoney(a: Amount).]
FIGURE 9.18 Simple version of GetAmount. [Initial transition send(msg(“select amount”)) into state :SelectAmount; cancel leads to the exit point Cancelled; amount(sa) completes the selection; the entry point Again performs send(msg(“select another amount”)).]
FIGURE 9.19 Flexible ATM as a specialization of ATM. [«statemachine» FlexibleATM specializes «statemachine» ATM; its GetAmount submachine is marked {extended}.]
FIGURE 9.20 GetAmount extended. [The inherited state :SelectAmount (dashed) gains the transition otherAmount / send(msg(“enter amount”)) to the added state :EnterAmount; the exit points Cancelled and ok and the entry point Again (with send(msg(“select another amount”))) are as before.]
This example also illustrates that the use of exit points gives us the ability to define actions on transitions where they really belong. Consider the transitions triggered by cancel (with state Cancelled as target). In Figure 9.22, this transition is a single transition crossing the boundary of the state GetAmount and having a single action. However, in Figure 9.21, where GetAmount is defined as a submachine state with entry and exit points, these transitions will be composed of two (partial) actions: one inside GetAmount (and therefore with access to whatever attributes it may have) and one outside, in the scope of Withdrawal (and therefore with access to attributes of Withdrawal only). The inner action is performed before the outer action.
FIGURE 9.21 Resulting GetAmount. [The implicitly defined machine: send(msg(“select amount”)) into :SelectAmount; otherAmount / send(msg(“enter amount”)) leads to :EnterAmount; cancel transitions from both states lead to Cancelled; amount(sa) leads to the exit point ok; the entry point Again performs send(msg(“select another amount”)) and leads to :EnterAmount.]
FIGURE 9.22 Withdrawal with a composite state GetAmount. [The same behavior expressed without submachine states: a composite state GetAmount containing SelectAmount and EnterAmount with the transitions of Figure 9.21, followed by send(CheckAccount(sa)) to VerifyTransaction with ok / sendMoney(sa); send(Receit(sa)) and nok / send(msg(“Amount too large”)); the cancel transitions cross the boundary of GetAmount directly to Cancelled.]
9.3.6 Validation
We have now presented (part of) the requirements on the ATM in the form of sequence diagrams and (part of) its design in the form of a structured state machine. The interesting question now is, of course, whether the design satisfies the requirements. We take the specification of the withdrawal service from Figure 9.10 and walk through the described behavior while checking the state machine design of the ATM, applying the following principles:
• Establish an initial alignment between the interaction and the state machine. What is the state of the state machine?
• Assume that the messages to the state machine are determined by the interaction.
• Check that the actions, and especially the output messages, resulting from the transitions of the state machine correspond with the event occurrences on the lifeline.
• Perform this test for all traces of the interaction.
This procedure of consistency checking can be automated provided that the model is sufficiently precise. In our example, informal text is used on several occasions to simplify the illustration, and this would obstruct automatic checking. It is, however, possible to define the interactions and the state machines such that automatic checking is feasible.
To align Withdrawal, we assume that the beginning of the Withdrawal sequence corresponds to the Idle state in Figure 9.13. Withdrawal starts with the reference to Authenticate (Figure 9.9). Authenticate begins with the continuation Idle, and even though it is not a state invariant, it does correspond well with our chosen alignment to the state Idle in the state machine. Then the ATM receives the Cardid message, and we look into the state machine to see whether this will trigger a transition to EnterPIN. Hopefully the submachine (see Figure 9.12) corresponds directly to the sub-scenario EnterPIN (see Figure 9.8). The transition triggered by the Cardid message is continued inside the EnterPIN submachine and transmits msg(“Give PIN”) before it enters the state enterDigit. We check the sequence diagram and find that this corresponds well. Then the sequence diagram continues to describe that the ATM receives four digits. In the state machine the PIN is built up, and the loop will terminate before the fourth digit. At the fourth digit another transition will be triggered, a message will be sent to the bank, and the waitOK state will be entered. Depending on the answer from the bank, exit from the state will be either through the ok exit point or the nok exit point.
Let us first assume that the sequence diagram entered the nok branch. In this case, the sequence diagram will end in the PIN NOK continuation, and it will continue into the loop that starts with the same continuation. There we can see that the ATM transmits the msg(“Try again!”) message to the user. In the state machine, we have now left the EnterPIN submachine through the nok exit point and returned to the top-level ATM state machine. Since the loop has just been entered, the transition will send the appropriate message to the user and return to the EnterPIN state. This corresponds well with the situation within the sequence diagram, where we find a reference within the loop to EnterPIN. If the next time around we follow the ok branch, the state machine will transit through the ok exit point and enter the Service state. The sequence diagram has finished the Authenticate scenario and continues in the Withdrawal diagram within the alternative starting with the PIN OK continuation.
Having prompted the user with the “Select service” message, the ATM will now receive a Withdrawal message, which in the state machine Service triggers a transition to the Withdrawal submachine. Within Withdrawal the transition continues into GetAmount, where the user is prompted with msg(“Select Amount”), and then the state SelectAmount is entered. Without going into further detail, we conclude that the user is given the chance to provide an amount and that this is input to the ATM. The state machine leaves GetAmount on this trigger and continues to send CheckAccount on the transition to VerifyTransaction within Withdrawal. An ok will cause the state machine to exit through the exit point ok after having transmitted money and a receipt. Finally, the card is ejected, and the user takes
the card to return the state machine to the Idle state. This matches the sequence diagram, and we can conclude that at least for these few traces the ATM state machine corresponds to the requirements. There are more traces described in the sequence diagrams; an examination of these traces is left to the reader. On the other hand, there are also more traces induced by the state machine execution than captured in the sequence diagrams. Sequence diagrams, as is the case for most property descriptions, are only partial and concentrate on the most important aspects of the system behavior.
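For models precise enough to be executed, this walkthrough can be mechanized. The following Java fragment hints at the idea; it is a hand-rolled sketch reusing the hypothetical EnterPin class from Section 9.3.5, not the output of any tool. The accepted trace of Figure 9.8 is replayed against the machine, and the output message is checked.

    // Replay a sequence-diagram trace against the state machine sketch.
    public class TraceCheck {
        public static void main(String[] args) {
            final int[] sent = { -1, -1 };
            EnterPin machine =
                new EnterPin(42, (cid, pin) -> { sent[0] = cid; sent[1] = pin; });

            // Messages from the user lifeline: four digits.
            machine.digit(1); machine.digit(2); machine.digit(3); machine.digit(4);

            // The diagram requires Code(cid, pin) toward the bank at this point.
            if (sent[0] != 42 || sent[1] != 1234) {
                throw new AssertionError("state machine diverges from the diagram");
            }
            machine.ok();   // the bank replies OK; the machine exits via exit point ok
            System.out.println("trace accepted");
        }
    }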
9.3.7 Generalizing Behavior
To show how UML diagrams can be used to define generalization of behavior, we start by describing an ATM service where foreign currency can be obtained.
9.3.7.1 Generalizing with Interactions
A scenario describing the withdrawal of foreign currency is shown in Figure 9.23; the similarities to the withdrawal of local currency as shown in Figure 9.10 are apparent. The challenge is to highlight the differences and to describe the commonalities only once. In Figure 9.23, we have also shown guards (interaction constraints) on the operands of the inner alternative combined fragment. A guard refers to data values accessible to the instance represented by the lifeline at the first event of the operand. We have stated the guards in informal text, since UML has not defined a concrete data language and the evaluation of the guard expressions is not a focus of this chapter.
The retrieval of local and foreign currencies can be seen as two specializations of an abstract withdrawal service. The hierarchy of withdrawal services expressed through interactions as classifiers is shown in Figure 9.24. The utility interactions getAmount and giveMoney are defined local to GenWithdrawal so that they can be redefined in specializations. The sequence diagram for GenWithdrawal is shown in Figure 9.25. The interactions for getting the amount and delivery of money have been separated out and replaced by references to local sub-scenarios. To achieve the same behavioral description as that of Figure 9.10, we need to redefine getAmount and giveMoney as shown in Figure 9.26.
This figure depicts message gates, a feature that is new in UML sequence diagrams. In the general withdrawal service (Figure 9.25), the ok message enters the giveMoney interaction occurrence; that connection point is called an actual gate. The corresponding formal gate can be found in the definition of giveMoney in Figure 9.26, with the meaning that the message is routed as shown. No event occurs when a message passes through a gate. Gates may be named explicitly, but implicit naming is often sufficient, such as when the identity of the gate is determined by the message name and the direction of the message. Gates are visible from the immediate outside of the interaction occurrence and from the inside of the definition diagram. In order to describe the behavior of currency purchase as first shown in Figure 9.23, we redefine getAmount and giveMoney as shown in Figure 9.27.
9.3.7.2 Generalizing with State Machines
In order to express the interaction generalization given earlier, we similarly generalize the ATM state machine to express only the general behavior, that is, we only include behavior that is specific to obtaining the PIN code and to authorization, see Figure 9.28. We assume this state machine to be the state machine of the ATM class, and introduce two subclasses of ATM as shown in Figure 9.29. The state machines for the two subclasses of ATM, as shown in Figure 9.30, add the states Withdrawal and Currency, respectively, together with their transitions, and extend the Service state by the additional exit points Withdrawal and Currency, respectively.
FIGURE 9.23 Withdrawing foreign currency. [sd Currency with lifelines :User, :ATM, :Bank; ref Authenticate; alt: on PIN OK, msg(“Select service!”), Currency, msg(“Enter currency!”), CurrencySelected(curtyp), msg(“Enter amount!”), amount(e), checkaccount(v(e)); inner alt with guards [enough on account]: money(curtyp,e), ok, receipt(v), or [inadequate funds]: nok, msg(“Amount too large”); on PIN NOK, msg(“Illegal entry”); finally card and card taken.]
We will not delve into the details of the states Withdrawal and Currency, but it can readily be seen that the submachines SelectAmount and EnterAmount may in fact be the same for both, as they merely model the behavior of obtaining the amount independently of the chosen currency. While the ordinary withdrawal service assumes that the amount obtained is in local currency, the CurrencyATM will convert it from the foreign currency to local currency. Verification against the account is performed in local currency.
9.3.8 Hierarchical Decomposition
Until now we have considered the ATM as one atomic entity even though we earlier hinted in the domain model that it was internally structured (Figure 9.3). We shall further explain the behavior of the ATM as the result of the interaction between the parts of the composite structure of the ATM.
FIGURE 9.24 Generalized withdrawal inheritance hierarchy. [GenWithdrawal owns the local interactions sd getAmount and sd giveMoney; Withdrawal and Currency specialize GenWithdrawal, each with redefined getAmount and redefined giveMoney.]
9.3.8.1 Decomposition of Classes
While the bank context was specified as a collaboration with a composite structure in terms of User, ATM, and Bank (Figure 9.6), the ATM itself is further specified as a class with a composite structure representing the fact that an ATM consists of a screen, a card reader, a keyboard, and a cash dispenser. The domain model had simply specified that an ATM consists of these parts; we now need to impose additional architectural constraints. For example, the user does not interact directly with the card reader. Instead, the user interacts with the card reader as a part of the ATM. This is expressed by defining the ATM as a composite class with ports for each kind of interaction, and with parts that internally are connected to these ports, as shown in Figure 9.31. The oval shapes indicate behavior ports, which are ports through which the behavior of the ATM class (in this case, the ATM state machine) communicates. Behavior directed at a behavior port is directed at the behavior of the containing class. Note that the types of the parts are not owned by the ATM class; they are simply used. We have not yet examined how the collection of classes of the ATM is organized, but typically one would define a package of classes related to the ATM, see Figure 9.32.
9.3.8.2 Decomposition of Lifelines
We indicate that a lifeline is detailed in another sequence diagram by a reference in the lifeline header to that sequence diagram (see Figure 9.33). The reference clause indicates which sequence diagram defines the decomposition of the lifeline. Decomposition is relative to both the lifeline and the behavior in which the lifeline exists. Thus, for example, ATM_Withdrawal details the withdrawal service within the ATM (see Figure 9.34). There are strong syntactic requirements ensuring the correspondence between the constructs on the decomposed lifeline (in this case, the ATM in Withdrawal) and the referenced detail diagram (here, ATM_Withdrawal). The series of fragments on the lifeline should have an exactly corresponding global fragment in the referenced diagram. In this example, this condition holds: for every interaction occurrence covering ATM there is a corresponding interaction occurrence that covers all the lifelines of ATM in ATM_Withdrawal. The same holds for combined fragments — even nested fragments. The combined fragments in Figure 9.34 extend beyond the boundaries of the decomposition diagram to show that the semantics of these fragments takes the enclosing level into account.
There are interaction references of two kinds. Interaction occurrences refer to sub-scenarios at the same level of hierarchical abstraction (such as when Authenticate is referenced from Withdrawal), while decomposition refers from one hierarchy level to the next (such as when ATM_Withdrawal is referenced from ATM in Withdrawal).
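One conventional, if rough, way to realize such ports in code is to give each port its own interface. The Java fragment below is such a sketch; all names are invented for illustration, and UML itself prescribes no particular mapping.

    // Ports rendered as interfaces; the composite class wires its parts.
    interface UserReaderPort { void cardInserted(int cid); }
    interface UserKeyboardPort { void keyPressed(int digit); }
    interface AtmBankPort { void code(int cid, int pin); }

    class CompositeAtm implements UserReaderPort, UserKeyboardPort {
        private final AtmBankPort bank;   // connector toward the bank

        CompositeAtm(AtmBankPort bank) { this.bank = bank; }

        @Override public void cardInserted(int cid) {
            // behavior port: forwarded to the ATM's own state machine
        }

        @Override public void keyPressed(int digit) {
            // forwarded to the internal part behind the User-keyboard port
        }
    }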
FIGURE 9.25 General withdrawal service. [sd GenWithdrawal with lifelines :User, :ATM, :Bank; ref Authentication; alt: on PIN OK, ref getAmount, checkaccount(v(e)), then an inner alt with [enough on account]: ok into ref giveMoney and receipt(v), or [inadequate funds]: nok and msg(“Amount too large”); on PIN NOK, msg(“Illegal entry”); finally card and card taken.]
These two kinds of reference must be consistent: if we start with Withdrawal (Figure 9.10) and follow the reference Authenticate (Figure 9.9) and then decompose ATM in Authenticate, referencing ATM_Authenticate (Figure 9.35), we should reach the same diagram as if we decomposed ATM in Withdrawal to ATM_Withdrawal (Figure 9.34) and then followed the reference to ATM_Authenticate. Decomposition commutes; that is, the final diagram in such a chain of references represents the description of the behavioral intersection of the lifeline (here, the ATM) and the reference (in this case, Authenticate).
9.3.9 The Difference between the UML System and the Final System
A UML system defined by a set of interacting state machines can be viewed as being executed by a UML virtual machine. Most often the UML machine (or runtime system) has properties that the real system
FIGURE 9.26 getAmount and giveMoney for withdrawing native money. [sd getAmount: msg(“Select service!”), Withdrawal, msg(“Enter amount!”), amount(e); sd giveMoney: the ok message enters through a gate, followed by money(e).]
FIGURE 9.27 getAmount and giveMoney for purchasing currency. [sd getAmount adds msg(“Enter currency!”) and CurrencySelected(curtyp) before msg(“Enter amount!”) and amount(e); sd giveMoney delivers money(curtyp,e) after the ok gate.]
will not have. The following differences between the ideal UML machine and the final system have been discussed extensively by Bræk and Haugen [16].
9.3.9.1 Processing Time
In the definition of the ATM, we have not given any time constraints or made assumptions about the processing time of the system functionality. Such time constraints can be introduced by using the simple time model that is part of the common behavior of UML. The ideal UML machine has no limitations in processing capacity; neither is it constrained by finite processor speed or clock frequency. Thus we cannot know whether our ATM will actually fulfill the real requirements, which would include adequate response times for the user trying to obtain cash.
9.3.9.2 Errors and Noise
UML system specifications may have logical errors, but they do not suffer from physical errors. It is assumed that whenever a message is sent, it will arrive at its destination. The UML machine does not stop without reason, and the contents of signals will not change. In the real world, however, such malfunctions do occur, and a good designer will have to cope with them during system development.
FIGURE 9.28 Generalized ATM state machine. [As Figure 9.13, but offering only the Status service: states Idle, :EnterPIN, :Service (exit point Status), :Status, and CardOut (entry: send(card)); the withdrawal behavior is left for subclasses to add.]
FIGURE 9.29 Subclasses of ATM. [WithdrawalATM and CurrencyATM specialize ATM.]
Physical errors will, of course, depend on the physical entities involved. To guard the system against such defects, we need to answer questions such as: What are the transmission media? What low-level protocols are used? Are there hardware redundancies?
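A designer coping with a lossy link will typically add acknowledgements, timeouts, and retransmission below the modeled communication. As a small, hypothetical illustration in Java (all names invented; the chapter itself prescribes no mechanism):

    // Sketch: retransmission over a lossy ATM-bank link.
    class ReliableSender {
        interface Link { boolean sendAndAwaitAck(byte[] frame, long timeoutMillis); }

        private final Link link;

        ReliableSender(Link link) { this.link = link; }

        boolean send(byte[] frame) {
            for (int attempt = 0; attempt < 3; attempt++) {  // bounded retries
                if (link.sendAndAwaitAck(frame, 500)) return true;
            }
            return false;  // give up: e.g., take the ATM out of service and alert
        }
    }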
9.3.9.3 Physical Distribution
We have defined an ATM in the context of a bank system. We have said little or nothing about how this system is physically realized: are there physical cables that connect the ATM and the bank or is there a wireless link? The safety of the real system is dependent on how the distribution is implemented. If there are separate cables, any particular error will only affect a small part of the system, while if there is a central wireless communication center, a failure of this center will have global impact.
FIGURE 9.30 The two specialized state machines. [sm WithdrawalATM extends :Service with the exit point Withdrawal leading to the submachine state :Withdrawal; sm CurrencyATM extends :Service with the exit point Currency leading to :Currency; in both, ok and Cancelled lead on to CardOut.]
FIGURE 9.31 The architecture of the ATM by means of a composite class. [Composite class ATM with parts :CardReader, :Screen, :Keyboard, and :CashDispenser connected internally to the ports User-reader, User-screen, User-keyboard, and User-cash; the port ATM-bank connects to the bank.]
Even within the ATM itself there are different kinds of physical errors that may occur. The card reader is an electromechanical device that will need regular maintenance, as will the cash dispenser. The screen is probably longer-lasting, but can also be subject to damage. The internal communication is short and probably stable, while the external communication between the ATM and the central bank may be more at risk. The physical system must be protected against both unintended malfunction and vandalism or sabotage. The design must be able to recover from physical malfunction or, at least, it should not react by surrendering cash to the person destroying it.
FIGURE 9.32 Package basic ATM classes. [Package BasicATMclasses containing the domain classes of Figure 9.2 (User, ATM, Bank, Card, and Account with their associations) together with the composite ATM of Figure 9.3.]
FIGURE 9.33 Decomposition reference. [Lifeline :ATM whose header carries the reference ref ATM_Withdrawal.]
9.3.9.4 Finite Resources
The UML machine has infinite resources. Its queues have no limit; the number of parallel processes can be increased at will. Not so in the real world. Message queues have a definite limit — if by no other restriction than that the memory space is finite. Also, the processor word length is finite, limiting the calculation accuracy and how large numbers can be. What should the system do when the finite resources are exhausted? There must be a “plan B” which lets the system recover or introduce emergency measures when these limits are reached.
9.3.9.5 Concurrency and Independence
We started out by considering the ATM as one atomic unit, but later described the ATM as a composite of a number of parts. Whether those parts are implemented as fully concurrent processes or merely as properties of the ATM state machine will make a difference. Communication is modeled as asynchronous, but hardware such as the card reader, screen, and cash dispenser often has synchronous direct interfaces.
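Both points can be illustrated together. The sketch below (plain Java, invented names) gives one concurrently executing part its own thread body and a bounded inbox, forcing senders to confront the “plan B” question explicitly.

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    // A part with a bounded inbox: unlike the ideal UML machine, the
    // queue can fill up, so the sender must decide what to do then.
    class BoundedPart implements Runnable {
        private final BlockingQueue<Object> inbox = new ArrayBlockingQueue<>(64);

        // Non-blocking send: returns false when the inbox is full, letting
        // the caller shed load, raise an alarm, or degrade gracefully.
        boolean offer(Object signal) {
            return inbox.offer(signal);
        }

        @Override
        public void run() {
            try {
                while (true) {
                    Object signal = inbox.take();   // wait for the next stimulus
                    handle(signal);                 // one signal at a time
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // the part is being shut down
            }
        }

        void handle(Object signal) { /* the part's state machine goes here */ }
    }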
FIGURE 9.34 Decomposition of ATM within withdrawal. [sd ATM_Withdrawal with lifelines :CardReader, :CashDispenser, :Screen, :Keyboard, :Controller; ref ATM_Authenticate; alt: on ATM_PIN OK, msg(“Select service!”) via the screen, Withdrawal via the keyboard, msg(“Enter amount!”), amount(v), checkaccount(v), then an inner alt with money(v) through the cash dispenser, ok, and receipt(v), or nok and msg(“Amount too large”); on ATM_PIN NOK, msg(“Illegal entry”); finally card through the card reader.]
9.4 A UML Profile for the Modeling of Embedded Systems
UML aims to be applicable to a wide range of application domains, ranging from health and finance to aerospace and e-commerce. In order to subsume the possible variations among application domains, UML does not define all language concepts (such as its concurrency semantics) to the level of detail necessary to allow unambiguous interpretation. As such, UML defines not a language per se, but a family of languages from which a specific language must first be instantiated by possibly selecting a subset of the modeling concepts, providing a dynamic semantics for these concepts suitable to the application domain, and possibly adding certain concepts unique to the application domain. The mechanism to instantiate a particular language from the UML language family is referred to as a “profile.” In addition to giving detailed semantics where the UML definition is intentionally vague, a profile can also provide notations suitable for the instantiated language.
FIGURE 9.35 ATM_Authenticate. (The sequence diagram sd ATM_Authenticate begins in ATM_Idle; Cardid(cid) from the card reader and Code(cid, pin) lead to a reference to ATM_EnterPIN, which returns OK or NOK. A loop(0,2) fragment handles the ATM_PIN NOK case by displaying "Try again!" and referencing ATM_EnterPIN again with a new Code(cid, pin); the interaction ends in NOK if all attempts fail.)
The SDL UML profile [17] focuses on the modeling of reactive, state/event-driven systems typically found in embedded applications. It gives precise, formal semantics for all concepts, and constitutes a language for specifying executable models independently of an implementation language. While inheriting the traditional strength of UML in object-oriented data modeling, the SDL UML profile provides the following concepts that are of particular importance in the embedded systems domain:

• Modeling of active objects executing concurrently or by interleaving, the hierarchical structure of active objects, and their connection by means of well-defined interfaces.
• A complete action language that is independent of implementation languages. In practice, actions may be translated into target languages, but (correct) translation does not change the behavior. Actions are specified in an imperative style and may be mixed with graphical notation.
• Object-oriented data based on single inheritance, with both polymorphic references (objects) and values, even in the same inheritance hierarchy. Type safety is preserved in the presence of covariance through multiple dispatch.
• Mapping of the logical layout of data to a transfer syntax per the encoding rules of a protocol.
• An exception handling mechanism for behavior, specified either through state machines or through constructs of the action language, that makes it suitable as a design/high-level implementation language.
• Composite states that are defined by separate state diagrams (for scalability); entry/exit points are used instead of state boundary crossing (for encapsulation), any composite state can be of a state type (for reuse), and state types can be parameterized (for even more reuse). Composite states may have a sequential or interleaving interpretation.
• Roadmaps (high-level MSCs) for improved structuring of sequence diagrams; inline expressions for compactness of description, and references for better reuse of sequence diagrams.
• Object orientation applied to active objects, including inheritance of behavior specified through state machines, and inheritance of the (hierarchical) structure and connection of active objects.
• Constraints on redefinitions in subclasses and on actual parameters in parameterization that afford strong error checking at modeling time.
In contrast to the UML, the SDL profile provides an operational semantics for all of its constructs. Developers can rely on the precise description of these constructs to understand the meaning of their specifications unambiguously. In addition, a number of tools have been developed that leverage the formal operational semantics to enable the verification and validation of specifications, and to generate applications from the specifications.

Sequence diagrams are an excellent starting point for the development of test cases for the final system. Techniques have been developed for deriving test suites from the message sequences captured at the requirements stage [18], together with the definition of the information content of each message. By selecting different system components to serve as the instance under test, test suites for each component can be derived. While limited attention is paid to the possibility of services interfering with each other during the development of individual service descriptions, any potential interaction between services must be eliminated when synthesizing the system behavior from requirements specifications. (Such feature interactions, e.g., multiple services being triggered by the same message primitive, or the same message primitive having different meanings in concurrently active services, have been recognized as the source of the most costly defects in telecommunication systems.) Tools can be deployed to pinpoint mutual interactions between services executing on system components [19–22].

Finally, the operational semantics of the SDL profile allows the derivation of production code for applications implementing the various software system components. SDL provides sufficient expressive power to specify such systems fully, and many commercial systems have been delivered to customers with code generated by tools directly from component-level design specifications. Auxiliary notations such as ASN.1 provide the expressive power needed to specify messages between system components abstractly, and to allow the automated generation of data marshaling code imposing the encoding rules of the intra-system messaging. A tool chain leveraging the SDL profile, supporting the verification and validation steps between stages of the software development life cycle as well as the generation of complete application code from design specifications, is available. A number of vendors have shipped embedded systems (e.g., network elements for telecommunication systems) implemented with high levels of automation from such models.
References

[1] International Telecommunications Union. Specification and Description Language (SDL). Recommendation Z.100, ITU-T, Geneva, 2002.
[2] J. Ellsberger, D. Hogrefe, and A. Sarma. SDL. Prentice-Hall, New York, 1997.
[3] A. Olsen, O. Færgemand, B. Møller-Pedersen, R. Reed, and J.R.W. Smith. Systems Engineering Using SDL-92. North-Holland, Amsterdam, 1994.
[4] J. Grabowski and E. Rudolph. Putting Extended Sequence Charts to Practice. In Proceedings of the 4th SDL Forum. North-Holland, Amsterdam, 1989.
[5] D. Hatley and I. Pirbhai. Strategies for Real-Time System Specification. Dorset House, New York, 1987.
[6] D. Harel. Statecharts: A Visual Approach to Complex Systems. Technical Report CS84-05, Weizmann Institute of Science, 1984.
[7] P.T. Ward and S. Mellor. Structured Development for Real-Time Systems. Yourdon, Englewood Cliffs, 1985.
[8] B. Selic, G. Gullekson, and P.T. Ward. Real-Time Object Oriented Modeling. John Wiley & Sons, New York, 1994.
[9] J. Rumbaugh, I. Jacobson, and G. Booch. Unified Modeling Language Reference Manual. Addison Wesley, Reading, MA, 1999.
[10] Object Management Group. UML 2.0 Language Specification. ptc/03-08-02, 2003.
[11] Ø. Haugen, B. Møller-Pedersen, and T. Weigert. Structural Modeling with UML 2.0. In UML for Real, L. Lavagno, G. Martin, and B. Selic, Eds. Kluwer Academic Publishers, Boston, MA, 2003.
[12] T. Weigert and R. Reed. Specifying Telecommunications Systems with UML. In UML for Real: Design of Embedded Real-Time Systems, L. Lavagno, G. Martin, and B. Selic, Eds. Kluwer Academic Publishers, Dordrecht, 2003.
[13] R.J.A. Buhr and R.S. Casselman. Use Case Maps for Object-Oriented Systems. Prentice-Hall, New York, 1996.
[14] International Telecommunications Union. SDL Combined with ASN.1 Modules (SDL/ASN.1). Recommendation Z.105, ITU-T, Geneva, 2001.
[15] Ø. Haugen and K. Stølen. STAIRS — Steps to Analyze Interactions with Refinement Semantics. In UML, Vol. 2863 of Lecture Notes in Computer Science. Springer, 2003.
[16] R. Bræk and Ø. Haugen. Engineering Real Time Systems. Prentice-Hall, New York, 1994.
[17] International Telecommunications Union. SDL Combined with UML. Recommendation Z.109, ITU-T, Geneva, 1999.
[18] P. Baker, P. Bristow, C. Jervis, D. King, and B. Mitchell. Automatic Generation of Conformance Tests From Message Sequence Charts. In Proceedings of the 3rd SAM Workshop, Vol. 2599 of Lecture Notes in Computer Science. Springer, 2003, pp. 170–198.
[19] L. Helouet. Some Pathological Message Sequence Charts, and How to Detect Them. In Proceedings of the 10th SDL Forum, Vol. 2078 of Lecture Notes in Computer Science. Springer, 2001, pp. 348–364.
[20] G. Holzmann. Formal Methods for Early Fault Detection. In Proceedings of the Formal Techniques for Real-Time and Fault Tolerant Systems, Vol. 1135 of Lecture Notes in Computer Science. Springer, 1995, pp. 40–54.
[21] R. Alur, G. Holzmann, and D. Peled. An Analyzer for Message Sequence Charts. Software — Concepts and Tools, 17: 70–77, 1996.
[22] G. Holzmann. Early Fault Detection Tools. Software — Concepts and Tools, 17: 63–69, 1996.
[23] International Telecommunications Union. Message Sequence Charts (MSC). Recommendation Z.120, ITU-T, Geneva, 1999.
10
Verification Languages

Aarti Gupta
NEC Laboratories America

Ali Alphan Bayazit and Yogesh Mahajan
Princeton University

10.1 Introduction
    Overview
10.2 Verification Methods: Background
    Simulation-Based Verification • Formal Verification • Assertion-Based Verification • Formal Specification of Properties
10.3 Languages for Hardware Verification
    HDLs and Interfaces to Programming Languages • Open Verification Library • Temporal e • OpenVera and OVA • ForSpec • Property Specification Language
10.4 Languages for Software Verification
    Programming Languages • Software Modeling Languages
10.5 Languages for SoCs and Embedded Systems Verification
    System-Level Modeling Languages • Domain-Specific System Languages
10.6 Conclusions
Acknowledgments
References
10.1 Introduction

Verification is the process of checking whether a given design is correct with respect to its specification. The specification itself can take different forms: it can be a nontangible entity such as a designer's intent, a document written in a natural language, or expressions written in a formal language. In this chapter, we consider a verification language to be a language with a formal syntax that supports the expression of verification-related tasks. It is very useful for a verification language to have a precise, unambiguous semantics; however, we also consider languages where ambiguity in the semantics is resolved by accepted practice. Broadly speaking, we focus on the support a verification language provides for dynamic verification based on simulation, as well as for static verification based on formal techniques. Though we also discuss some high-level behavioral modeling languages, we highlight their verification-related features; other details on how they are used for representing higher-level models can be found in other recent articles [1,2].

Verification is increasingly becoming a bottleneck in the design of embedded systems and Systems on Chip (SoCs). The primary reason is the inability of design automation methodologies to keep up with
growing design complexity, driven by the increasing number of transistors achievable on a chip in accordance with the well-known Moore's Law. The emerging trend toward systematic reuse of design components and platforms can help amortize verification costs across multiple designs. At the same time, it places an even greater responsibility on the verification methodology, because the correctness of reusable, parameterized components must be verified in potentially multiple contexts. The cost of detecting bugs late in the design cycle is very high, both in terms of design respins and in terms of lost time-to-market.

A verification methodology for system-level design needs to address issues related to component-level verification, as well as the system-level integration of these components. For embedded systems, this requires verifying the effects of distributed concurrent computation, not only for the hardware but also for the embedded software. Additional complexity arises due to the heterogeneous nature of the components. For example, many embedded applications include digital controllers along with analog sensors and actuators. This requires verifying hybrid systems, whose dynamics comprise discrete as well as continuous behaviors. Another crucial problem for embedded systems is verifying the interfaces of parameterized components and third-party IP components, mostly without access to the design details. Finally, real-time constraints play an important role in hardware-software partitioning and scheduling (typically implemented in an on-chip real-time operating system kernel). It is important to specify and verify these real-time assumptions and requirements.
10.1.1 Overview

Though there has been progress in the verification of embedded systems, there has been relatively little effort at language standardization targeted at the verification features described above. At the same time, much of the recent standardization activity for hardware and system-level verification languages (e.g., by Accellera [3] and the IEEE Design Automation Standards Committee [DASC] [4]) is very much applicable to embedded systems. Therefore, in this chapter, we describe verification languages for hardware, software, and embedded systems. A comprehensive survey in any of these categories is beyond the scope of this chapter. Instead, we focus on representative languages, including language standards where available. Note that many of these languages and standards are still evolving, and are likely to undergo further changes from the details described in this chapter.

For hardware designs, the popularity of Hardware Description Languages (HDLs), such as Verilog [5] and VHDL [6], combined with the success of logic synthesis technology for Register-Transfer Level (RTL) designs, has led to the emergence of Hardware Verification Languages (HVLs). Examples of these languages, described in Section 10.3, include e, OpenVera, Sugar/PSL, and ForSpec. In the area of software verification related to embedded applications, described in Section 10.4, we focus on the use of standard programming languages, such as C/C++ and Java, and on software modeling languages, such as UML, SDL, and Alloy. For embedded systems, described in Section 10.5, we focus on languages for system-level verification, such as SystemC, SpecC, and SystemVerilog. We also describe domain-specific verification efforts, such as those based on Esterel, and work on hybrid systems.
10.2 Verification Methods: Background

We start by providing some background on verification methods, in order to establish sufficient context and terminology for our discussion of verification languages. More details can be found in the references cited.
10.2.1 Simulation-Based Verification

Simulation has been, and continues to be, the primary method for the functional verification of hardware and system-level designs. It consists of providing input stimuli to the Design Under Verification (DUV) and checking the correctness of the output response. In this sense, it is a dynamic verification method. The practical success of simulation has largely been due to automation through the development of testbenches,
which provide the verification context for a DUV. (Details of testbench development are described in a recent book [7].) A typical testbench consists of a generator of testcases (input stimuli), a checker or monitor for checking the output response, and a coverage analyzer for reporting how much of the design functionality has been covered by the testcases. Depending on how much of the internal design state is controllable and observable by the generators and checkers, verification can be classified as black-box (none), white-box (full), or gray-box (partial visibility, added to aid verification).

It is impossible to simulate all testcases for designs of even modest size. Traditionally, directed testbenches are used to generate specific scenarios of interest, especially to nail down the designer's intent in the early phase of the design cycle. In contrast, constrained random testbenches are used to generate random stimuli, subject to certain constraints added to the testbench. These are very useful for exercising unexpected scenarios, which helps in finding bugs in later phases of the design cycle. In all cases, progress is evaluated in terms of coverage metrics, which have been used with varying degrees of success. These include code coverage metrics, such as statement/branch/toggle/expression coverage, and functional coverage metrics, such as state/transition coverage of the FSMs (Finite State Machines) in the design description.

As systems become increasingly complex, designs need to be specified at higher levels of abstraction. A transaction is a sequence of lower-level tasks that implements a logical operation at the higher level. For example, a read transaction by an agent on a shared bus typically consists of transferring the address to the bus, waiting for the data to be ready, and transferring the data from the bus. In complex systems, it is more natural to model intermodule communication at the transaction level rather than at the signal level. This has led to the development of transaction-based testbenches for system-level verification. These testbenches use adaptors called transactors or BFMs (Bus Functional Models) to translate between the higher and lower levels.

The development of testbenches at RTL and system level can be quite tedious. It is useful to have language support for abstraction and for object-oriented features, such as abstract data types, data encapsulation, dynamic object creation/destruction, inheritance, and polymorphism. The testbench language must also be able to specify and launch many tasks concurrently, and provide features for synchronization, for monitoring multiple in-flight tasks, and for reentrancy of tasks. Most of these features were absent from the HDLs, which motivated the emergence of testbench languages, HVLs, and system-level modeling languages, described in Sections 10.3 and 10.5, respectively. A minimal sketch of the testbench structure described above is shown below.
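The following Verilog fragment is our own illustration of the generator/checker/coverage structure, not an example taken from the literature cited above; the module duv and its req/ack handshake are hypothetical placeholders for a real design under verification.

module testbench;
  reg clk = 0, req = 0;
  wire ack;
  integer handshakes = 0;  // crude functional coverage: handshakes exercised
  integer wait_cnt   = 0;  // cycles spent waiting for an acknowledge

  duv dut (.clk(clk), .req(req), .ack(ack));  // hypothetical DUV

  always #5 clk = ~clk;  // free-running clock

  // generator: drive a (pseudo)random request pattern
  always @(posedge clk)
    if (!req)        req <= $random;  // raise req at random times
    else if (ack)    req <= 0;        // drop req once acknowledged

  // checker: every request must be acknowledged within 3 cycles
  always @(posedge clk)
    if (req && !ack) begin
      wait_cnt = wait_cnt + 1;
      if (wait_cnt > 3) begin
        $display("error: req not acknowledged within 3 cycles");
        $finish;
      end
    end else begin
      if (req && ack) handshakes = handshakes + 1;
      wait_cnt = 0;
    end

  initial begin
    #2000;
    $display("coverage: %0d handshakes observed", handshakes);
    $finish;
  end
endmodule

A directed testbench would replace the random generator with a fixed scenario; a constrained random testbench would restrict $random through explicit constraints, as the HVLs described later support directly.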
10.2.2 Formal Verification

In contrast to simulation, formal verification methods do not rely upon the dynamic response of a DUV to particular testcases. Rather, they perform static analysis on a formal mathematical model of the given DUV, to check its correctness with respect to a given specification under all possible input scenarios. In this section, we briefly describe some of these methods. Successful instances of industrial application have been described in a survey [8].

A popular formal method is equivalence checking, where a given DUV is checked for equivalence against a given reference design. Equivalence checking has been applied very successfully in automated hardware design, to check that no errors are introduced during the logic synthesis flow from RTL to gate-level design. Since both the reference design and the DUV are adequately represented in standard HDLs, there has been little motivation to develop a special language for equivalence checking.

When reference models are unavailable, or where the correctness of the highest-level model itself needs to be checked, design requirements are expressed in terms of correctness properties. Property checking is used to check whether a given design satisfies a given correctness property, expressed as a formal specification. Most languages for specifying properties are derived from formal logics or automata theory (described in more detail in Section 10.2.4). The two main techniques for property checking are model checking and theorem proving. In model checking [9], the DUV is typically modeled as a finite state transition system, the property is specified as a temporal logic formula, and verification consists of checking whether the formula is true in that
model. (A related method is called language containment, where the specification and the DUV are represented as automata, and verification consists of checking that the language of the DUV is contained in the language of the specification.) In theorem proving, both the DUV and the specification are modeled as logic formulas, and the satisfaction relation between them is proved as a theorem, using the deductive proof calculus of a theorem prover. Model-checking techniques have found better acceptance in the industry so far [10,11], primarily because they are easily automated and they provide counterexamples, which are useful for debugging. Though theorem-proving techniques [12–14] can handle more general problems, including infinite-state systems, they are not as highly automated, and the learning curve is steeper for new users.

The practical application of model-checking techniques is limited by the state explosion problem; that is, the state space to be searched grows exponentially with the number of state components. Symbolic model-checking techniques [9,10] use methods based on Binary Decision Diagrams (BDDs) [15] to manipulate sets of states symbolically, without explicit enumeration. Though this improves scalability to some degree, these techniques can fail due to memory explosion in the BDDs. As an alternative, bounded model-checking techniques use Boolean satisfiability (SAT) procedures to find bounded-length counterexamples [16]. These techniques scale better for finding bugs, but they need additional reasoning for obtaining conclusive proofs.
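To make the bounded model-checking formulation concrete: writing I for the initial-state predicate, T for the transition relation, and P for the property to be checked (this notation is introduced here for illustration; it is not used elsewhere in the chapter), the length-k bounded model-checking problem is the satisfiability of

    I(s0) ∧ T(s0, s1) ∧ · · · ∧ T(sk−1, sk) ∧ (¬P(s0) ∨ ¬P(s1) ∨ · · · ∨ ¬P(sk))

A satisfying assignment to this Boolean formula is precisely a counterexample trace of length at most k, which is why SAT solvers can be applied directly.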
10.2.3 Assertion-Based Verification

Another emerging methodology is assertion-based verification. Though its basic elements have been part of industry practice for a while, it has gained attention more recently as a systematic means of enhancing the benefits of simulation and formal verification, and of combining them effectively. In particular, the Accellera organization [3] has been actively involved in developing and promoting language standards both for top-down specification using a formal property language and for bottom-up implementation assertions. (The chosen standards are described in Sections 10.3.6 and 10.5.1.3; more details can be found in a recent book [17].)

The key ingredient of assertion-based verification is the formal specification of properties to capture designer intent at all levels of the design. Properties can be used as assertions, to check for violations of correct behavior or functionality. The checking can be done dynamically during simulation, statically using formal verification techniques, or by a combination of the two. Properties that specify interfaces between modules can also be used as constraints, that is, as assumptions on the input interface of a given module. For simulation, these constraints can be translated into stimulus generators, or added to the constraint solvers for constrained random simulation. For formal verification, these constraints serve the role of an environment, which can be handled explicitly as an abstract module or implicitly by the verification method.

Assertions and constraints are crucial in exploiting an assume-guarantee verification paradigm, whereby assertions (guarantees) on the output interface of a given module serve as constraints (assumptions) on the input interface of the downstream module. (Care must be exercised in defining interfaces, in order to avoid circular reasoning.) This allows the verification of a large design to be handled modularly, by verifying each of its components. Assertions and constraints also help to assess the functional coverage achieved by simulation testbenches, for example, how many assertions have been checked, how much of the constraint space has been covered, etc.
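The simplest, non-circular form of this reasoning can be summarized as follows, writing ⟨A⟩ M ⟨G⟩ for "module M guarantees G whenever its environment satisfies A" (this compact notation is ours, not the chapter's): if ⟨true⟩ M1 ⟨A⟩ holds and ⟨A⟩ M2 ⟨G⟩ holds, then ⟨true⟩ M1 ‖ M2 ⟨G⟩ holds for the composition. In other words, once M1 is shown to establish A unconditionally, A may be assumed when verifying M2, and the guarantee G then carries over to the composed system.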
10.2.4 Formal Specification of Properties

Property specification languages are derived mostly from formal logics and automata theory. There are many related theoretical issues, such as expressiveness (i.e., what kinds of properties a logic can capture) and the complexity of the associated property-checking problem. In general, the more expressive a logic, the higher the complexity of its property-checking problem. There are also many practical issues, such as ease of use and available tool support. In this section, we briefly describe some logics that form the basis of the languages described in the rest of this chapter.
10.2.4.1 Propositional Logic

This is essentially the Boolean logic familiar to most hardware designers. Formulas consist of Boolean-valued variables connected using the standard Boolean operators (NOT/AND/OR). The complexity of checking the satisfiability of a Boolean formula (SAT) is known to be NP-complete [18]. When both universal and existential quantifiers are added, the complexity of checking the satisfiability of a Quantified Boolean Formula (QBF) is PSPACE-complete [18]. The expressiveness of these logics is fairly limited, but they are well suited for handling bit-level hardware designs.

10.2.4.2 First-Order Predicate Logic

The expressiveness of Boolean logic is extended here by allowing variables to range over elements of an arbitrary set, not necessarily finite. This is very useful for handling higher-level systems and software, for example, to reason directly about integers or real numbers. Though the problem of checking the validity of many interesting classes of formulas over these sets is undecidable, automated decision procedures exist for useful subsets of the logic [19,20], and these have been utilized successfully for the verification of microprocessors and word-level hardware designs. Higher-order logics, that is, logics where quantifiers are allowed to range over subsets of sets, have also been used, but mainly by specialists [12].

10.2.4.3 Temporal Logics

Temporal logics are used for the specification of concurrent reactive systems, which are modeled as state transition systems labeled by propositions [21]. Here, the expressiveness of predicate logic is further extended, such that the interpretation of a proposition can change dynamically over the states. In addition to the standard Boolean operators, there are typically four temporal operators, with the intended semantics shown below:

• Gp: p is Globally (always) true
• Fp: p is true some time in the Future (eventually)
• Xp: p is true at the neXt instant
• pUq: p is true Until q becomes true
Temporal logics are well suited for expressing qualitative temporal behavior. The correctness properties are typically categorized as follows:

• Safety properties: nothing bad happens. For example, G(!p1_critical + !p2_critical) expresses the mutual exclusion property that processes p1 and p2 should never be in the critical section simultaneously.
• Liveness properties: something good eventually happens. For example, G(request → F grant) captures absence of starvation; that is, every request for a resource should eventually be granted.
• Precedence properties: ordering of events. For example, G(!p2_req U p1_req → !p2_grant U p1_grant) expresses an ordering requirement; that is, if process p1 requests a resource before process p2 does, then it should be granted the resource earlier.

Different kinds of temporal logics have been proposed, depending upon the view of time. Linear Temporal Logic (LTL) [21] takes a linear view, where formulas are interpreted over linear sequences of states. In contrast, Computation Tree Logic (CTL) [9] takes a branching view of time, where formulas are interpreted over a tree of possible computations starting from a state (with additional path quantifiers
to denote all/some paths). In terms of expressiveness, LTL and CTL are incomparable; that is, each can express a property that cannot be expressed by the other. In terms of model-checking complexity, LTL is linear in the size of the model but exponential in the size of the specification, while CTL is linear in both. Despite the relative stability of these logics, there is an ongoing debate regarding their suitability for various verification applications [22].

10.2.4.4 Regular and ω-Regular Languages

Though temporal logics are quite useful, they cannot express all regular (or ω-regular) properties of sequences. Such properties are easily expressed as languages recognized by finite state automata on finite (respectively, infinite) words [23]. Regular expressions are formed by using three basic operations: concatenation ("·"), choice ("|"), and finite repetition ("∗"). For example, a sequence alternating a request signal with grant or error signals can be expressed as (request · (grant|error))∗. An ω-regular expression has the form U · V^ω, where U and V are ∗-free regular expressions, and V^ω denotes an infinite repetition of the expression V. These are used for specifying properties of systems that do not terminate.
10.3 Languages for Hardware Verification

In this section, we focus on verification languages for hardware RTL designs. (Languages for system-level hardware designs are covered in Section 10.5.)
10.3.1 HDLs and Interfaces to Programming Languages

Register-Transfer Level (RTL) designs are typically implemented in standard HDLs, such as Verilog and VHDL. However, it is not practical to implement simulation testbenches in HDLs alone. Indeed, the testbench can be purely behavioral, that is, it need not be synthesizable into hardware; furthermore, it can be implemented at higher levels of abstraction than RTL. Historically, a popular testbench development approach has been to implement some of its parts in a software programming language, such as C/C++ or Perl. These parts are integrated with HDL simulators through standard programming language interfaces. Unfortunately, this approach not only slows down the simulator, but also requires significant development effort, for example, for defining new data types (e.g., a 128-bit bus), handling concurrency, and managing dynamic memory objects [7].

For property specification, VHDL has some support for static assertions, with different severity levels. Although Verilog lacks such explicit constructs, it is straightforward to use the "if" and "$display" constructs to implement a similar effect.

Example 10.1 Suppose we need to check that two signals A and B cannot be high at the same time. The fragment below shows the assertion template and an instance in VHDL. The keyword assert specifies a condition that must hold during simulation:

[label] assert expression [report message] [severity level]

assert (A NAND B)
  report "error: A & B cannot both be 1"
  severity error;
A similar effect can be achieved by using Verilog's if/$display combination, which tests for the undesired situation, as shown below:

always @(A or B) begin
  if (A & B) begin
    $display("error: A = B = 1");
    $finish; // end simulation
  end
end
10.3.2 Open Verification Library

Though simple assertions are quite useful, they do not provide a practical way of specifying temporal properties. Temporal properties can be specified using checkers or monitors in HDLs, but this involves a significant development effort. Therefore, there has been a great deal of interest in developing libraries of reusable monitors. An example is the Open Verification Library (OVL), available from Accellera [3]. It is not a standalone language, but a set of modules that can be used to check common temporal specifications within Verilog or VHDL design descriptions.

Example 10.2 As an example [17], consider the following PCI Local Bus Specification requirement: to prevent the AD, C/BE#, and PAR signals from floating during reset, the central resource may drive these lines during reset (bus parking), but only to a logic low level; they may not be driven high. Suppose ad, cbe_, par, and rst_ are the Verilog signal names corresponding to AD, C/BE#, PAR, and RST#, respectively. (In the PCI Local Bus Specification, a signal name ending with "#" indicates that the signal is active low; in these examples, a trailing "_" is used for the same purpose.) The given property can be specified within a Verilog implementation of OVL as follows:

assert_always master_reset (clk, !rst_, !(|{ad, cbe_, par}));
Here, assert_always is the name of the library monitor, and master_reset is an assertion instance. On every rising edge of the clk signal, whenever !rst_ is high, the monitor asserts that the last parameter should evaluate to true. Here the last parameter is the negation (!) of the bitwise-or (|) over the given bits. Note that this is a simple safety property, which is expected to always hold during simulation.

Example 10.3 Consider another example [17] from the PCI Local Bus Specification: the assertion of IRDY# and deassertion of FRAME# should occur as soon as possible after STOP# is asserted, preferably within one to three cycles. For simplicity, consider the FRAME# and STOP# signals only; that is, check whether frame_ will be de-asserted within one to three clock cycles after stop_ is asserted, as shown below:

assert_frame #(1,1,3) check_frame_da (clk, true, !stop_, frame_);
Again, check_frame_da is an instance of the assert_frame module defined in the library. Three optional parameters #(1, 1, 3) are used, corresponding to the severity level, the minimum number of clocks, and the maximum number of clocks, respectively. The monitor will check whether frame_ goes high within one to three clock cycles after !stop_. Here, the severity level is set to 1 to continue the simulation even if the assertion is violated, and the reset parameter is set to true, as an example of a case where it is not needed.

OVL has many advantages. First, it can be used with any Verilog, VHDL, or mixed simulator, with no need for additional verification tools. Second, it is open; that is, the library can be modified easily, for example, for assessing functional coverage [17]. Another useful feature of OVL is that it does not slow down simulation appreciably, primarily because it is hard to specify very complex assertions with it. Unfortunately, OVL is used mainly for checking safety properties during simulation, and is not very useful for checking liveness or for formal verification. In some sense, OVL provides a transition from a traditional simulation-based methodology to an assertion-based methodology.
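To give a feel for how such a library monitor is typically realized internally, the following is a simplified Verilog sketch in the spirit of assert_always. This is our illustration only, not OVL source code; the actual library module additionally handles severity levels, reset behavior, configurable failure actions, and coverage hooks.

// Simplified assert_always-style monitor (illustrative sketch, not OVL source)
module assert_always_sketch (clk, enable, test_expr);
  input clk, enable, test_expr;
  always @(posedge clk)
    if (enable && !test_expr)
      $display("OVL_ERROR: assert_always violated at time %0t", $time);
endmodule

An instance such as master_reset in Example 10.2 then simply binds clk, the enabling condition !rst_, and the checked expression to these ports.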
10.3.3 Temporal e

Verisity's e language is an advanced verification language that is intended to cover many verification aspects. It has recently been chosen by the IEEE DASC [4] for standardization as a verification language. Like many high-level languages, it has constructs for object-oriented programming, such as class definitions, inheritance, and polymorphism [7]. It also provides elements of Aspect-Oriented Programming (AOP). AOP allows modifying the functionality of the environment without duplicating or modifying the original code, in a manner more advanced than simple inheritance (see References 7 and 24 for more details).
As a testbench language, e provides many constructs related to stimulus generation, such as the specification of input constraints and facilities for data packing, as well as for assessing simulation coverage. It also provides support for property specification, and has been used widely in both simulation-based and formal verification.

Example 10.4 As an example of stimulus generation, suppose we have a struct type frame for modeling an Ethernet frame, with one of the data fields defined as %payload. (A struct type in e basically corresponds to a class type in C++; that is, it allows method definitions along with data definitions. Since it is conceptually similar to other object-oriented languages, we omit the details of its syntax.) The type of payload can be defined in e as follows [24]:

struct payload {
  %id   : byte;
  %data : list of byte;
  keep soft data.size() in [45..1499];
};
In this example, the "%" character in front of a field name means that the corresponding field is physical and represents data to be sent to the DUV. The keep soft keywords are used for bounding the values of a variable, in this case the size of the data field; the construct also allows the specification of weighted ranges or constraints. In this example, the size will be varied automatically within the given range. (Using the "!" character along with "%" would have indicated that the field is not to be generated automatically.) Typically, a user-defined function is used for driving stimuli into the DUV. For example, suppose my_frame is an instance of the struct frame. The following e code can be used to input the frame data serially into the DUV:

Example 10.5
In this example, the keyword pack provides the mechanism to pack all data fields into a single list of bits, which is then fed serially to the Verilog signal testbench.duv.transmit_stream0. After each bit transfer, the function waits for one clock, denoted by the wait cycle keywords. Support for specification of temporal properties is provided in e through use of Temporal Expressions (TEs). A TE is defined as a combination of events and temporal operators. The language also supports the keyword sync, which is used as a point of synchronization for TEs. Example 10.6 Returning to PCI specifications, consider the following requirement, and its corresponding specification in e: Once a master has asserted IRDY#, it cannot change IRDY# or FRAME # until the current data phase completes regardless of the state of TRDY#. expect @new_irdy => { [..]*((not @irdy_rise) and (not change (frame_))); @ data_phase_complete} @sys.pci_clk; else dut_error("Error, IRDY# or FRAME# changed", "before current data phase completed.");
Here, suppose that the events (shown as @event) have been defined already. The shown expression specifies that whenever IRDY# is asserted (@new_irdy), de-assertion of IRDY# (@irdy_rise) or a change in FRAME# should not occur, until the data phase completes (@data_phase_complete). The use of 3A
struct type basically corresponds to a class type in C++, that is, it allows method definitions along with data definitions. Since it is conceptually similar to other object-oriented languages, we omit the actual syntax.
© 2006 by Taylor & Francis Group, LLC
Verification Languages
10-9
@sys.pci_clk denotes that the event pci_clk is used for sampling signals in evaluating the given TE. This feature is also useful for verifying multiple clocked designs.
10.3.4 OpenVera and OVA OpenVera from Synopsys is another testbench language similar to e in terms of functionality and similar to C++ in terms of syntax. Since conceptually OpenVera is very similar to e, we do not include here testbench examples for OpenVera. It has similar constructs for coverage, random stimuli generation, data packing, etc. OpenVera Assertions (OVA) is a standalone language, which is also part of the OpenVera suite [25]. OpenVera comes with a checker library (OVA IP), which is similar to OVL. OVA and OpenVera also have event definitions, repetition operators (∗ [..]), and sequencing, where different sequences can be combined to create more complex sequences using logical and repetition operators. Example 10.7 The following example shows the OVA description of the PCI specification from Example 10.3: clock posedge clk { event chk: if (negedge stop_) then #[1..3]posedge frame_; } assert frame_chk : check (chk)
Note the specification of the sampling clock, the event, and the corresponding action. The implication operation (if–then) is similar to conditionals in programming languages, and # is the cycle delay operator.
10.3.5 ForSpec ForSpec is a specification and modeling language developed at Intel [26]. The underlying temporal logic ForSpec is ForSpec Temporal Logic (FTL), which is composed of regular language operators, and LTL-style temporal operators. ForSpec is aimed at an assume-guarantee verification paradigm, which is suitable for modular verification of large designs. The language also provides explicit support for multiple clocks, reset signals, and past-time temporal operators. Although none of these additional features increases the expressiveness of the basic language, they clearly ease specification of properties in practice. Example templates for asynchronous set/reset using FTL are shown below: accept(boolean_expr & clock)in formula_expr reject(boolean_expr & clock)in formula_expr
Recently, some constructs of ForSpec have also been added to OVA Version 2.3.4 In particular, the concept of “temporal formulas” is added, which can be composed by applying temporal operators on sequences. The supported temporal operators are: followed_by, triggers, until, wuntil, next, wnext [25,27], which can be expressed in terms of standard temporal operators (described in Section 10.2.4). Asynchronous set/reset have also been added.
10.3.6 Property Specification Language Property Specification Language (PSL) is the language standard established by Accellera for formal property specification [3]. It originated from the language Sugar, developed at IBM. The first version of Sugar was based on CTL, and was aimed at easing expression of CTL properties by users for the RuleBase model checker [28]. PSL is based on Sugar Version 2.0, where the default temporal logic is based on LTL, called PSL/Sugar Foundation Language. The main advantage of LTL in assertion-based verification is that its 4 Not
yet supported by Synopsys at the time of this writing.
© 2006 by Taylor & Francis Group, LLC
10-10
Embedded Systems Handbook
semantics, that is, evaluation on a single execution path, is more natural for simulation-based methods where a single path is traced at a time. Though CTL is more efficient in model-checking complexity, it is harder to support in simulation-based methods. For formal verification, CTL continues to be supported in PSL through an Optional Branching Extension (OBE). According to the Accellera standard, PSL assertions are declarative. Since many users prefer procedural assertions, nonstandard pragma-based assertions5 are also supported. PSL properties consist of four layers: Boolean layer, temporal layer, verification layer, and modeling layer. The bottom-most is the Boolean layer, which specifies the Boolean expressions that are combined using operators and constructs from the upper layers. The syntax of this layer depends on the hardware modeling language used to represent the design. Currently, PSL supports both Verilog and VHDL. (Sugar also supports the internal language EDL of RuleBase.) The temporal layer specifies the temporal properties, through the use of temporal operators and regular expression operators. The temporal operators (eventually, until, before, next), and some regular expression operators have two different types. One is called the strong suffix, indicated with an “!” at the end (e.g., eventually!), which specifies that the operator must hold before the end of the simulation. The other one, called the weak suffix (without the “!”), denotes that it does not hold only when the chance of it being true vanishes completely. PSL/Sugar also has some predefined temporal constructs regarded as syntactic sugar, targeted at easing usage without adding expressiveness. The verification layer specifies the role of the property for the purpose of verification. This layer supports keywords — assert, assume, assume-guarantee, restrict, restrict-guarantee, cover, and fairness. The keyword assert indicates that the property should be checked as an assertion. The keywords assume and assume-guarantee denote that the property should be used as an assumption, rather than checked as an assertion. The keywords restrict and restrict-guarantee can be used to force the design into a specified state. The keyword cover is used for coverage analysis, and fairness to specify fairness constraints for the DUV. This layer also provides support for organizing properties into modular units. Finally, the modeling layer provides support for modeling the environment of the DUV, in terms of the behavior of the design inputs, and the auxiliary variables and signals required for verification. Example 10.8 Going back to the same PCI specification, as shown in Examples 10.3 and 10.7, the property is written in PSL as follows: assert always (!stop_-> eventually! frame_);
Note that the above specification is qualitative in terms of when the frame_ signal should be de-asserted, that is, the eventually! operator does not specify that it should do so within one to three cycles after stop_ is asserted. Example 10.9 Consider a property that states that a request-acknowledge sequence should be followed by exactly eight data transmissions, not necessarily consecutive. This can be expressed using a SERE (Sugar Extended Regular Expression) as follows: always {req;ack} |=> {start_trans;data[=8];end_trans}
In addition to the number of data transmissions, the property also specifies that start_trans is asserted exactly one cycle after ack is asserted (|=>). The notation [=8] signifies a (not necessarily consecutive) repetition operator, with parameter 8. Another SERE that is equivalent to data[=8] is shown as follows: {!data[*];data;!data[*]}[*8]
PSL/Sugar also allows specification of a sampling clock for each property, or the use of a default clock (as shown in the previous examples). The abort operator can be used to specify reset-related properties, inspired by ForSpec. To support modularity and reusability, SEREs can be named and used in other SEREs or properties. (This feature is similar to events in e and OVA.) 5 These
assertions are escaped from simulators by defining them inside comments.
Example 10.10 Another PCI specification states that the least significant address bit AD[0] should never be 1 during a data transfer. This can be expressed in PSL as follows [17]: sequence SERE_MEM_ADDR_PHASE = {frame_; !frame_ && mem_cmd}; property PCI_VALID_MEM_BURST_ENCODING = always {SERE_MEM_ADDR_PHASE} |-> {!ad[0]} abort !rst_ @(posedge clk); assert PCI_VALID_MEM_BURST_ENCODING;
In this example, “|->” is a weak suffix implication operator, which denotes that whenever the first sequence holds we should expect to see the second sequence. Note the definition of the named SERE using keyword sequence, and its use within a property specification. The property is used as assertion, to be checked by the verification tool.
10.4 Languages for Software Verification

In this section, we focus on verification efforts for software relevant to embedded system applications. (For more general software systems and software analysis techniques, see the overview articles [29,30] for useful pointers.) In comparison to the hardware design industry, which has been very active in standardization efforts for verification languages, there has been little such activity for software systems. Instead, the efforts are targeted at taming the complexity of verification, largely by focusing on specific forms of correctness or on specific applications. The correctness requirements for software are typically expressed in the form of Floyd-Hoare style assertions [31] (invariants and pre/postconditions of actions, represented in predicate logic) for functional verification, or temporal logic formulas for behavioral verification.
10.4.1 Programming Languages

Given the popularity of C/C++ and Java, verifying programs written directly in these languages is very attractive in principle. However, there are many challenging issues: the handling of integer and floating-point data variables, the handling of pointers (in C/C++), function/procedure calls, and object-oriented features such as classes, dynamic objects, and polymorphism.

10.4.1.1 Verification of C/C++ Programs

The VeriSoft project [32] at Lucent Technologies focused on guided search for deadlocks and violations of assertions expressed directly in C/C++, and has been used for verifying many communication protocols. The SLAM project [33] has been successfully used at Microsoft for proving the correctness of device drivers written in C. Designers use a special language called SLIC [34] to provide correctness requirements. These requirements are translated automatically into special procedures, which essentially implement automata for checking safety properties. These procedures are added to the given program, such that a bug exists if a statement labeled "error" can be reached in the modified program. Verification is performed by obtaining a finite-state abstract model, performing model checking, and refining the abstract model if needed. Other similar efforts [35,36] have so far focused largely on the abstraction and refinement techniques. There has been relatively little effort in the development of a separate verification language.

10.4.1.2 Verification of Java Programs

There have been many efforts for the verification of Java programs as well. The Extended Static Checker for Java (ESC/Java) [37] performs static checks that go beyond type checking. It automatically generates verification conditions for catching common programming errors (such as null dereferences and array bound errors) and synchronization errors (race conditions, deadlocks). These conditions are transparently checked by a backend theorem prover. It also uses a simple annotation language for the user to provide object invariants and requirements, which aid the theorem prover. The annotation language, called JML
(Java Modeling Language), is part of a broader effort [38,39]. JML is used to specify the behavior and syntactic interfaces of Java programs, through pre/postconditions and invariants for classes and methods. Its syntax is quite close to that of Java, thereby making it easy for programmers to learn. JML is used by a host of verification tools, which span the spectrum from dynamic runtime assertion checkers and unit testers to static theorem provers and invariant generators. It has been applied successfully in smartcard applications implemented using a dialect of Java called Java Card [40].

Example 10.11 Consider the following code [37], which shows Java code for a class definition of integer multisets:
As shown, the bag class provides two operations (details are not shown) — one for constructing a bag from an array of integers, and another for extracting the smallest element. The example also shows user annotations on Lines 2 and 5, to be used with ESC/Java. The annotation on Line 2 specifies an object invariant, which should hold for every initialized instance of the class. The annotation on Line 5 specifies a precondition, which is assumed to hold by the operation that follows. Model-checking techniques have also been used successfully for Java programs. Bandera [41] and Java PathFinder (JPF) [42,43] use various program analysis methods and abstractions to obtain a finite state model of the given Java program. This model is verified against temporal specifications provided by the user. In many cases, model checking is performed by translation of the (abstract) Java program into Promela, the input language of the model checker SPIN [11]. Promela focuses on system descriptions of asynchronous concurrent processes, and the correctness properties are specified in LTL. SPIN has been used successfully to verify the correctness of numerous protocols, distributed algorithms, controllers for reactive systems, etc. The Bandera project has also focused on development of the Bandera Specification Language (BSL) [44]. BSL provides support for defining general assertions, pre/postconditions on methods, and atomic predicates for specifying temporal properties. There has also been related work [45] on developing a pattern-based approach for specifying complicated temporal properties, and translating them into input for model-checking tools.
10.4.2 Software Modeling Languages The Unified Modeling Language (UML) [46] is emerging as a popular standard for designing objectoriented software systems. There has been recent interest in its use for specification and modeling of real-time and embedded systems [47,48]. It provides different kinds of diagrams to specify the system structural and behavioral aspects, including Statecharts [49], and Message Sequence Charts [50,51]. A language called Object Constraint Language (OCL) [52] has been developed as part of UML, for specifying requirements, such as invariants, and pre/postconditions on operations in class diagrams. Tools for simulation of UML models, and for automatic code generation from UML into C/C++ programs, have become popular for embedded software development in automotive and avionics applications. However,
formal verification efforts have been hindered by the lack of a precise formal semantics for UML and OCL, which is still under development. The Specification and Description Language (SDL) [53] is another popular language, especially targeted at communication systems. It provides both graphical and textual syntax for specifying systems at different levels of hierarchy. Tool support for SDL is widely available, especially for automatic code generation to C, C++, and Java. Verification is supported by methods based on Message Sequence Charts, as well as by model checking [54]. Another set of efforts is based on the use of declarative specification languages, such as Z [55] and, more recently, Alloy [56]. Their main drawback is that there is little support for the automatic extraction of such models from existing programs, or for the automatic generation of programs from these models. Furthermore, Z is not very amenable to automatic analysis. Alloy, however, does provide automated verification support, and has been used for some embedded applications. The analysis tool in Alloy attempts to find a finite-scope model that satisfies all given constraints, by translation to a Boolean SAT problem.
10.5 Languages for SoCs and Embedded Systems Verification

In this section, we describe verification languages for SoCs and embedded systems. We start by describing system-level behavioral modeling languages that have been heavily influenced by existing HDLs. Next, we describe verification support for domain-specific languages, such as synchronous languages like Esterel, and efforts in the area of hybrid systems.
10.5.1 System-Level Modeling Languages

As systems become increasingly complex, it becomes useful to specify designs at higher levels of abstraction. This allows early design exploration for optimizing system requirements, such as performance, power consumption, and chip area. Many system-level behavioral modeling languages focus on hardware-software partitioning issues and on the development of system-level verification platforms.

10.5.1.1 SystemC and the SystemC Verification Library

SystemC is a C++-based modeling platform that supports hardware module abstractions at the RTL, behavioral, and system levels, as well as software modules. Both the DUV and the testbench are written in the SystemC language. The platform provides a class library and a simulation kernel. The SystemC Verification Library adds features for transaction-based testbench development. A technique called data introspection is exploited to allow arbitrary data types to be used in constraints, assertions, transaction recording, and other high-level activities. Randomization of data generation for arbitrary data types is supported by defining a distribution through Boolean constraints and probability weights. The verification library also provides a callback mechanism for observing activity at the transaction level during simulation, and a minimal set of HDL connection APIs to permit interfacing with VHDL or Verilog. So far, there has not been much effort to add support for the formal specification of properties within SystemC, though the standards could potentially be supported.

Example 10.12 A typical example application using the SystemC Verification Library is shown in Figure 10.1 [57], where a transactor bridges the gap between the testbench and the DUV.

10.5.1.2 SpecC

SpecC is an executable modeling language based on C, targeted at hardware-software codesign [58]. The SpecC methodology advocates a clear separation between communication and computation in system-level designs. It provides support for creating an executable behavioral specification, design exploration including hardware-software partitioning, communication synthesis, and the automatic generation of the software as well as the hardware components. A significant effort has been made in providing clean formal
[Figure 10.1 here: a test class is bound through a port–transactor binding (rw_port, rw_task_if) to a rw_pipelined_transactor, which drives the design (clk, data, ...) through pipelined bus ports.]
FIGURE 10.1 Transaction-level testbench in SystemC [57].
semantics, and in the development of an associated simulation kernel. However, so far, there has not been much effort focused on its use in property specification or formal verification.

10.5.1.3 SystemVerilog and SystemVerilog Assertions

SystemVerilog is an extension of the Verilog HDL, with the latest version, SystemVerilog 3.1, standardized by Accellera [59]. It adds many system-level modeling and verification capabilities to the older Verilog (IEEE Standard 1364). It also provides a Direct Programming Interface, which allows C functions to be called from Verilog code and vice versa. For transaction-level testbench development, SystemVerilog has incorporated many features from C/C++, such as structures, pointers, classes, and inheritance. It also provides support for multiple threads; events and semaphores for interprocess synchronization; mailboxes for interprocess communication; and random number classes to help generate random test vectors for simulation.

SystemVerilog also provides support for an assertion-based verification methodology, using the SystemVerilog Assertions (SVA) language, described below. The SystemVerilog standard has enhanced the scheduling scheme of Verilog by defining new scheduling phases in which to sample stable signals, to evaluate the assertions, and to execute testbench code. This scheduling scheme avoids races between the DUV and the testbench, guarantees that assertions and testbenches see stable values, and ensures common semantics across simulation, verification, synthesis, and emulation tools.

SystemVerilog Assertions combines many ideas from the languages described in Section 10.3. Properties are specified in terms of signal sequences, each of which is associated with a clock for sampling signals. Sequences are first-class objects and can be declared, parameterized, or combined to build other sequences. SVA also provides limited support for combining sequences with multiple clocks.

Example 10.13
The following example defines a request–acknowledge sequence called req_ack. It is defined in terms of parameter signals, and can be reused in multiple properties. The notation ##[1:3] indicates 1 to 3 cycles of the associated clock:

    sequence req_ack (req, del, ack);
      // ack occurs within 1 to 3 cycles after req
      (req ##[1:3] ack);
    endsequence
SystemVerilog Assertions also allows dynamic local variables to be declared within a sequence, which remain valid during the current invocation of the sequence. These can be used to check regular properties.
Example 10.14
Consider the following sequence, which captures the I/O behavior of a pipeline register with depth 16 [17]:

    sequence pipe_operation;
      int x;
      (write_en, (x = data_in)) |-> ##16 (data_out == x);
    endsequence
Note that the input value is saved in the local variable x at the beginning of the sequence, and is compared with the value of data_out after 16 cycles. (Here, |-> denotes an implication operator, which matches the ending of the previous subsequence with the beginning of the next.)

Different directives act on properties to indicate their role: assert (check for violations), cover (observe and track when the property is exercised), and bind (attach an externally specified assertion to the code). Assertions are classified as immediate or concurrent. Immediate assertions are like assert statements in an imperative programming language, such as C/C++. They are executed immediately when encountered in the SystemVerilog code, following its event-based simulation semantics. On the other hand, concurrent assertions usually describe behavior that spans time, and they are executed in the special scheduling phase using sampled values. In both cases, the action taken after evaluating an assertion may include system tasks to control severity, such as $error, $fatal, and $warning [17].

Immediate assertions can be either declarative or procedural. Declarative assertions require enabling conditions to be explicitly provided by the user, which are monitored continuously on clock edges. In the case of procedural assertions, the enabling conditions are partially inferred from the code context. This makes procedural assertions easier to maintain when the code changes.

Example 10.15
Consider a simple parameterized property declaration, and its use as a concurrent assertion [17]:

    property mutex (clk, reset_n, a, b);
      @(posedge clk) disable iff (reset_n) (!(a & b));
    endproperty

    assert_mutex: assert property (mutex (clk_a, master_reset_n, write_en, read_en));
The disable iff clause in the mutex property indicates that the property is disabled, that is, treated as if it holds, when the reset_n signal is active. The assertion checks that write_en and read_en cannot occur at the same time, provided master_reset_n is not active.
10.5.2 Domain-Specific System Languages

In this section, we describe some domain-specific languages for embedded system applications, and highlight the support they provide for verification. Unlike the system-level languages influenced by HDLs, there has been relatively little standardization effort aimed at their verification features.

10.5.2.1 Synchronous Languages

Esterel is a programming language used for the design of synchronous reactive systems [60]. The language provides features for describing the control aspects of parallel compositions of multiple processes, including elaborate clocking and exception-handling mechanisms. Typically, the control part of a reactive application is written in Esterel, and is combined with a functional part written in C. Esterel has a well-defined semantics, which is used by the associated compilers to automatically generate sequential C code for software modules, or Boolean gate-level netlists implementing FSMs for hardware modules. Standard simulation or model checking can be performed on the synthesized FSMs.
A central assumption in Esterel and other synchronous languages is the synchrony hypothesis, which assumes that a process can react infinitely fast to its environment. In practice, it is important to validate this assumption for the target machine. An example is the tool TAXYS [61], which checks this assumption by modeling the Esterel application and its environment as a real-time system. Two kinds of real-time constraints are specified as annotations in Esterel code: throughput constraints, which express the requirement that the system react fast enough for the given environment model, and deadline constraints, which express the maximum delay between a given input and output of the system. These constraints are checked by Kronos [62], a model checker for real-time systems.

Example 10.16
An example of annotated application code [61] in Esterel with deadline constraints is shown below:

    1   loop
    2     await A; %{# Y = clock(last A) %}
    3     call F( ); %{# Fmin < CPU < Fmax %}
    4     %{# 0 < clock(last A) < d1 %}
    5   end loop
    6   ||
    7   loop
    8     await B;
    9     call G( ); %{# Gmin < CPU < Gmax %}
    10    %{# 0 < Y < d2 %}
    11  end loop
In this example, it is required that the function F terminate within d1 time units after the arrival of event A, and also that the function G terminate within d2 time units after the arrival of the event A that was consumed by F. These constraints are specified as shown in Lines 4 and 10 (with Line 2), respectively. Note also that the estimated runtimes for functions F and G are provided in Lines 3 and 9, respectively.

10.5.2.2 Languages for Hybrid Systems

Many embedded system applications require interaction between digital and analog components, for example, automotive controllers, avionics systems, robotic systems, and manufacturing plant controllers. Since many such applications are also safety-critical, there has been a great deal of interest in the verification of such systems, called hybrid systems. The verification efforts can be broadly classified into those using classical control-theoretic methods, and others using automata-based methods.

A popular paradigm for control-theoretic methods is the MATLAB-based toolset, with the Simulink/Stateflow modeling language [63]. It provides standard control engineering components to model the continuous domain, and a Statechart-like language to model the discrete controller. MATLAB-based tools are used for analysis, optimization, and simulation of the continuous behavior specified in terms of differential equations. These tools have been used very successfully for applications dominated by continuous dynamics. However, complex interaction between the continuous and discrete control components has not received much attention. Furthermore, the simulation semantics has not been related to a formal semantics in any standard way.

The automata-based methods typically use discrete abstractions of the hybrid system in a way that preserves the properties of interest, typically expressed in temporal logic [64]. Many model checkers for handling these abstractions have been applied in industrial settings, for example, HyTech [65], Uppaal [66], Kronos [62], and Charon [67]. The details of the discrete abstractions, and of the resulting automata models, are beyond the scope of this chapter. In most cases, the correctness requirement is to avoid a set of "bad" states, typically specified using the syntax of the modeling language itself. The model checkers perform an exact (or approximate) reachability analysis to (conservatively) check that all reachable states are safe.
There has also been work on translating subsets of the popular Simulink/Stateflow-based models to the formal automata-based models, using abstraction and model-checking techniques [68,69]. Recently, there has been some standardization effort in this domain — the Hybrid Systems Interchange Format (HSIF) standard is aimed at defining an interchange format for hybrid system models that can be shared between modeling and analysis tools. So far, the focus has been on representation of the system dynamics, which include both continuous and discrete behaviors. Support for verification-related features will potentially follow its wider adoption.
10.6 Conclusions

We have presented a tutorial on the verification languages used in industrial practice for embedded systems and SoCs. We have described features that aid the development of testbenches and the specification of correctness properties, to be used by simulation-based as well as formal verification methods. With verification becoming a critical activity in the design cycle of such systems, these languages are receiving a great deal of attention, and several standardization efforts are underway.
Acknowledgments

The authors would like to thank Sharad Malik for valuable comments on the manuscript, and Rajiv Alur and Franjo Ivancic for helpful discussions on verification of hybrid systems.
References

[1] S. Edwards, Languages for Embedded Systems. In The Industrial Information Technology Handbook, R. Zurawski, Ed. CRC Press, Boca Raton, FL, 2004.
[2] A. Bunker, G. Gopalakrishnan, and S. McKee, Formal Hardware Specification Languages for Protocol Compliance Verification. ACM Transactions on Design Automation of Electronic Systems, 9: 1–32, 2004.
[3] Accellera Organization. http://www.accellera.org
[4] IEEE, Design Automation Standards Committee (DASC). http://www.dasc.org
[5] D.E. Thomas and P.R. Moorby, The Verilog Hardware Description Language. Kluwer Academic Publishers, Norwell, MA, 1991.
[6] D.R. Coelho, The VHDL Handbook. Kluwer Academic Publishers, Norwell, MA, 1989.
[7] J. Bergeron, Writing Testbenches: Functional Verification of HDL Models. Kluwer Academic Publishers, Dordrecht, 2003.
[8] E.M. Clarke, J. Wing et al., Formal Methods: State of the Art and Future Directions. ACM Computing Surveys, 28: 626–643, 1997.
[9] E.M. Clarke, O. Grumberg, and D. Peled, Model Checking. MIT Press, Cambridge, MA, 1999.
[10] K.L. McMillan, Symbolic Model Checking: An Approach to the State Explosion Problem. Kluwer Academic Publishers, Dordrecht, 1993.
[11] G.J. Holzmann, The Model Checker SPIN. IEEE Transactions on Software Engineering, 23: 279–295, 1997.
[12] M.J.C. Gordon, R. Milner, and C.P. Wadsworth, Edinburgh LCF: A Mechanized Logic of Computation, Vol. 78. Springer-Verlag, Heidelberg, 1979.
[13] R.S. Boyer and J.S. Moore, A Computational Logic Handbook. Academic Press, New York, 1988.
[14] S. Owre, J.M. Rushby, and N. Shankar, PVS: A Prototype Verification System. In Proceedings of the International Conference on Automated Deduction (CADE), Vol. 607 of Lecture Notes in Computer Science. Springer-Verlag, Saratoga, NY, 1992.
[15] R.E. Bryant, Graph-Based Algorithms for Boolean Function Manipulation. IEEE Transactions on Computers, C-35: 677–691, 1986.
[16] A. Biere, A. Cimatti, E.M. Clarke, and Y. Zhu, Symbolic Model Checking without BDDs. In Proceedings of the Workshop on Tools and Algorithms for the Construction and Analysis of Systems (TACAS), Vol. 1579 of Lecture Notes in Computer Science. Springer-Verlag, London, UK, 1999.
[17] H. Foster, A. Krolnik, and D. Lacey, Assertion-Based Design. Kluwer Academic Publishers, Dordrecht, 2003.
[18] M.R. Garey and D.S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness. W.H. Freeman and Co., San Francisco, CA, 1979.
[19] C. Barrett, D. Dill, and J. Levitt, Validity Checking for Combinations of Theories with Equality. In Proceedings of Formal Methods in Computer-Aided Design (FMCAD), Vol. 1166 of Lecture Notes in Computer Science. Springer-Verlag, 1996.
[20] R.E. Bryant, S. Lahiri, and S. Seshia, Modeling and Verifying Systems Using a Logic of Counter Arithmetic with Lambda Expressions and Uninterpreted Functions. In Proceedings of the Conference on Computer Aided Verification, Vol. 2404 of Lecture Notes in Computer Science. Springer-Verlag, 2002.
[21] A. Pnueli, The Temporal Logic of Programs. In Proceedings of the 18th IEEE Symposium on Foundations of Computer Science. IEEE Press, 1977, pp. 46–57.
[22] M. Vardi, Branching vs. Linear Time: Final Showdown. In Proceedings of Tools and Algorithms for the Construction and Analysis of Systems (TACAS), Vol. 2031 of Lecture Notes in Computer Science. Springer-Verlag, London, UK, 2001.
[23] W. Thomas, Automata on Infinite Objects. In Handbook of Theoretical Computer Science, Vol. B. Elsevier and MIT Press, Cambridge, MA, 1990, pp. 133–191.
[24] Z. Kirshenbaum, Understanding the "e" Verification Languages, 2003. http://www.eetimes.com/story/OEG20030529S0072
[25] OpenVera Language Reference Manual: Assertions, Version 2.3, 2003. http://www.openvera.org
[26] R. Armoni, L. Fix, A. Flaisher, R. Gerth, B. Ginsburg, T. Kanza, and A. Landver, The ForSpec Temporal Logic: A New Temporal Property-Specification Language. In Proceedings of Tools and Algorithms for the Construction and Analysis of Systems (TACAS), Vol. 2280 of Lecture Notes in Computer Science. Springer-Verlag, London, UK, 2002.
[27] Synopsys, OVA White Paper, 2003. http://www.openvera.org
[28] I. Beer, S. Ben-David, C. Eisner, D. Fisman, A. Gringauze, and Y. Rodeh, The Temporal Logic Sugar. In Proceedings of the International Conference on Computer Aided Verification, Vol. 2102 of Lecture Notes in Computer Science. Springer-Verlag, 2001.
[29] D. Craigen, S. Gerhart, and T. Ralston, Formal Methods Reality Check: Industrial Usage. IEEE Transactions on Software Engineering, 21: 90–98, 1995.
[30] D. Jackson and M. Rinard, Software Analysis: A Roadmap. In The Future of Software Engineering, A. Finkelstein, Ed. ACM Press, 2000.
[31] C.A.R. Hoare, An Axiomatic Basis for Computer Programming. Communications of the ACM, 12: 576–580, 1969.
[32] P. Godefroid, Model Checking for Programming Languages Using VeriSoft. In Proceedings of the ACM Symposium on Principles of Programming Languages. ACM Press, 1997.
[33] T. Ball and S. Rajamani, The SLAM Toolkit. In Proceedings of the Conference on Computer Aided Verification, Vol. 2102 of Lecture Notes in Computer Science. Springer-Verlag, London, UK, 2001.
[34] T. Ball and S. Rajamani, SLIC: A Specification Language for Interface Checking (of C). Microsoft Research Technical Report MSR-TR-2001-21, 2001.
[35] T.A. Henzinger, R. Jhala, R. Majumdar, and G. Sutre, Software Verification with Blast. In Proceedings of the 10th SPIN Workshop on Model Checking, Vol. 2648 of Lecture Notes in Computer Science. Springer-Verlag, Heidelberg, 2003.
[36] D. Kroening, E.M. Clarke, and K. Yorav, Behavioral Consistency of C and Verilog Programs Using Bounded Model Checking. In Proceedings of the ACM/IEEE Design Automation Conference. ACM Press, 2003.
[37] C. Flanagan, K.R.M. Leino, M. Lillibridge, G. Nelson, J. Saxe, and R. Stata, Extended Static Checking for Java. In Proceedings of the ACM Conference on Programming Language Design and Implementation (PLDI). ACM Press, 2002.
[38] G.T. Leavens, A.L. Baker, and C. Ruby, Preliminary Design of JML: A Behavioral Interface Specification Language for Java. Technical Report 98-06t, Iowa State University, Department of Computer Science, 2002.
[39] G.T. Leavens, K.R.M. Leino, E. Poll, C. Ruby, and B. Jacobs, JML: Notations and Tools Supporting Detailed Design in Java. In Proceedings of the Conference on Object-Oriented Programming, Systems, Languages, and Applications. ACM Press, 2000.
[40] E. Poll, J.v.d. Berg, and B. Jacobs, Specification of the JavaCard API in JML. In Proceedings of the Smart Card Research and Advanced Application Conference, 2000.
[41] J.C. Corbett, M.B. Dwyer, J. Hatcliff, S. Laubach, C.S. Pasareanu, Robby, and H. Zheng, Bandera: Extracting Finite-State Models from Java Source Code. In Proceedings of the International Conference on Software Engineering. IEEE Press, 2000.
[42] K. Havelund and T. Pressburger, Model Checking Java Programs Using Java PathFinder. International Journal on Software Tools for Technology Transfer (STTT), 2: 366–381, 2000.
[43] W. Visser, K. Havelund, G. Brat, and S. Park, Model Checking Programs. In Proceedings of the IEEE International Conference on Automated Software Engineering. IEEE Press, 2000.
[44] J.C. Corbett, M.B. Dwyer, J. Hatcliff, and Robby, A Language Framework for Expressing Checkable Properties of Dynamic Software. In Proceedings of the SPIN Software Model Checking Workshop, 2000.
[45] M.B. Dwyer, G. Avrunin, and J.C. Corbett, Patterns in Property Specifications for Finite-State Verification. In Proceedings of the International Conference on Software Engineering. IEEE Press, 1999.
[46] J. Rumbaugh, I. Jacobson, and G. Booch, The Unified Modeling Language User's Guide. Addison-Wesley, Reading, MA, 1999.
[47] G. Martin, UML for Embedded Systems Specification and Design: Motivation and Overview. In Proceedings of Design, Automation and Test in Europe (DATE), 2002.
[48] B. Selic, The Real-Time UML Standard: Definition and Application. In Proceedings of Design, Automation and Test in Europe (DATE), 2002.
[49] D. Harel, Statecharts: A Visual Formalism for Complex Systems. Science of Computer Programming, 8: 231–274, 1987.
[50] ITU, Message Sequence Chart: ITU-T Recommendation, 1999.
[51] A. Muscholl and D. Peled, From Finite State Communication Protocols to High-Level Message Sequence Charts. In Proceedings of the International Symposium on Mathematical Foundations of Computer Science, Vol. 2076 of Lecture Notes in Computer Science. Springer-Verlag, 2001.
[52] J. Warmer and A. Kleppe, The Object Constraint Language: Precise Modeling with UML. Addison-Wesley, Reading, MA, 2000.
[53] J. Ellsberger, D. Hogrefe, and A. Sarma, SDL: Formal Object-Oriented Language for Communication Systems. Prentice Hall, New York, 1997.
[54] V. Levin and H. Yenigun, SDLcheck: A Model Checking Tool. In Proceedings of the Conference on Computer Aided Verification, Vol. 2102 of Lecture Notes in Computer Science. Springer-Verlag, 2001.
[55] J.M. Spivey, The Z Notation: A Reference Manual. Prentice-Hall, New York, 1992.
[56] D. Jackson, Alloy: A Lightweight Object Modelling Notation. ACM Transactions on Software Engineering and Methodology (TOSEM), 11: 256–290, 2002.
[57] N. Ip and S. Swan, Using Transaction-Based Verification in SystemC, 2002. http://www.systemc.org/
[58] D.D. Gajski, J. Zhu, R. Doemer, A. Gerstlauer, and S. Zhao, SpecC: Specification Language and Methodology. Kluwer Academic Publishers, Dordrecht, 2000.
[59] Accellera, SystemVerilog 3.1: Accellera's Extensions to Verilog, 2003. http://www.eda.org/sv-ec/SystemVerilog_3.1_final.pdf
[60] G. Berry and G. Gonthier, The Esterel Synchronous Programming Language: Design, Semantics, Implementation. Science of Computer Programming, 19: 87–152, 1992.
[61] E. Closse, M. Poize, J. Pulou, J. Sifakis, P. Venier, D. Weill, and S. Yovine, TAXYS: A Tool for the Development and Verification of Real-Time Embedded Systems. In Proceedings of the Conference on Computer Aided Verification (CAV), 2001.
[62] C. Daws, A. Olivero, S. Tripakis, and S. Yovine, The Tool Kronos. In Proceedings of Hybrid Systems III: Verification and Control, Vol. 1066 of Lecture Notes in Computer Science. Springer-Verlag, New York, 1996.
[63] MATLAB/Simulink. http://www.mathworks.com
[64] R. Alur, T.A. Henzinger, G. Lafferriere, and G. Pappas, Discrete Abstractions of Hybrid Systems. Proceedings of the IEEE, 2000.
[65] T.A. Henzinger, P.-H. Ho, and H. Wong-Toi, HyTech: A Model Checker for Hybrid Systems. International Journal on Software Tools for Technology Transfer, 1: 110–122, 1997.
[66] K. Larsen, P. Pettersson, and W. Yi, UPPAAL in a Nutshell. International Journal on Software Tools for Technology Transfer, 1: 134–152, 1997.
[67] R. Alur, T. Dang, J. Esposito, R. Fierro, Y. Hur, F. Ivancic, V. Kumar, I. Lee, P. Mishra, G. Pappas, and O. Sokolsky, Hierarchical Hybrid Modeling of Embedded Systems. In Proceedings of EMSOFT '01: First Workshop on Embedded Software, Vol. 2211 of Lecture Notes in Computer Science. Springer-Verlag, 2001.
[68] B.I. Silva, K. Richeson, B.H. Krogh, and A. Chutinan, Modeling and Verification of Hybrid Dynamical Systems Using CheckMate. In Proceedings of Automation of Mixed Processes: Hybrid Dynamic Systems (ADPM), 2000.
[69] A. Tiwari, N. Shankar, and J.M. Rushby, Invisible Formal Methods for Embedded Control Systems. Proceedings of the IEEE, 91: 29–39, 2003.
Operating Systems and Quasi-Static Scheduling

11 Real-Time Embedded Operating Systems: Standards and Perspectives
Ivan Cibrario Bertolotti
12 Real-Time Operating Systems: The Scheduling and Resource Management Aspects Giorgio C. Buttazzo
13 Quasi-Static Scheduling of Concurrent Specifications Alex Kondratyev, Luciano Lavagno, Claudio Passerone, and Yosinori Watanabe
11 Real-Time Embedded Operating Systems: Standards and Perspectives

Ivan Cibrario Bertolotti
IEIIT-CNR – Istituto di Elettronica e di Ingegneria dell'Informazione e delle Telecomunicazioni

11.1 Introduction . . . 11-1
11.2 Operating System Architecture and Functions . . . 11-3
  Overall System Architecture • Process and Thread Model • Processor Scheduling • Interprocess Synchronization and Communication • Network Support • Additional Functions
11.3 The POSIX Standard . . . 11-9
  Attribute Objects • Multithreading • Process and Thread Scheduling • Real-Time Signals and Asynchronous Events • Interprocess Synchronization and Communication • Thread-Specific Data • Memory Management • Asynchronous and List Directed Input and Output • Clocks and Timers • Cancellation
11.4 Real-Time, Open-Source Operating Systems . . . 11-23
11.5 Virtual-Machine Operating Systems . . . 11-25
  Related Work • Views of Processor State • Operating Principle • Virtualization by Instruction Emulation • Processor-Mode Change • Privileged Instruction Emulation • Exception Handling • Interrupt Handling • Trap Redirection • VMM Processor Scheduler
References . . . 11-33
11.1 Introduction

Informally speaking, a real-time computer system is a system where a computer senses events from the outside world and reacts to them; in such an environment, the timely availability of computation results is as important as their correctness. By contrast, the exact definition of embedded system is somewhat less clear. In general, an embedded system is a special-purpose computer system built into a larger device, and is usually not programmable by the end user.
The major areas of difference between general-purpose and embedded computer systems are cost, performance, and power consumption: often, embedded systems are mass produced, so reducing their unit cost is an important design goal; in addition, mobile battery-powered embedded systems, such as cellular phones, have severe power budget constraints to enhance their battery life. Both these constraints have a profound impact on system performance, because they entail simplifying the overall hardware architecture, reducing clock speed, and keeping memory requirements to a minimum. Moreover, embedded computer systems often lack traditional peripheral devices, such as a disk drive, and interface with application-specific hardware instead.

Another common requirement for embedded computer systems is some kind of real-time behavior; the strictness of this requirement varies with the application, but it is so common that, for example, many operating system vendors often use the two terms interchangeably, and refer to their products either as "embedded operating systems" or "real-time operating systems for embedded applications." In general, the term "embedded" is preferred when referring to smaller, uniprocessor computer systems, and "real-time" is generally used when referring to larger appliances, but today's rapid increase of available computing power and hardware features in embedded systems contributes to blur this distinction. Recent examples of real-time systems include many kinds of widespread computer systems, from large appliances like phone switches to mass-market consumer products such as printers and digital cameras.

Therefore, a real-time operating system must not only manage system resources and offer a well-defined set of services to application programs, like any other operating system does, but must also provide guarantees about the timeliness of such services and honor them; that is, its behavior must be predictable. Thus, for example, the maximum time the operating system will take to perform any service it offers must be known in advance. This proves to be a tight constraint, and implies that real-time does not have the same meaning as "real fast," because it often conflicts with other operating system goals, such as good resource utilization and coexistence of real-time and non-real-time jobs, and adds further complexity to the operating system's duties.

Also, it is highly desirable that a real-time operating system optimize some operating system parameters, mainly context switch time and interrupt latency, which have a profound influence on the overall response time of the system to external events; moreover, in embedded systems, the operating system footprint, that is, its memory requirements, must be kept to a minimum to reduce costs. Last, but not least, due to the increasing importance of open system architectures in software design, the operating system services should be made available to the real-time application through a standard application programming interface. This approach promotes code reuse, interoperability, and portability, and reduces the software maintenance cost.

The chapter is organized as follows: Section 11.2 gives a brief refresher on the main design and architectural issues of operating systems and on how these concepts have been put in practice.
Section 11.3 discusses the main set of international standards concerning real-time operating systems and their application programming interface; this section also includes some notes on mechanisms seldom mentioned in operating system theory but of considerable practical relevance, namely real-time signals, asynchronous I/O operations, and timers. Next, Section 11.4 gives a short description of some widespread open-source real-time operating systems, another recent and promising source of novelty in embedded system software design. In particular, since open-source operating systems have no purchasing cost and are inherently royalty free, their adoption can easily cut down the cost of an application. At the end of the chapter, Section 11.5 presents an overview of the operating principle and goals of a seldom-mentioned class of operating systems, namely operating systems based on virtual machines. Although they are perceived to be very difficult to implement, these operating systems look very promising for embedded applications, in which distinct sets of applications, each with its own requirements in term of real-time behavior and security, are executed on the same physical processor; hence, they are an active area of research.
11.2 Operating System Architecture and Functions

The main goal of this section is to give a brief overview of the architecture of operating systems of interest to real-time application developers, and of the functions they accomplish on behalf of the applications that run on them. See, for example, References 1 and 2 for more general information, and Reference 3 for an in-depth discussion of the internal architecture of the influential Unix operating system.
11.2.1 Overall System Architecture

An operating system is a very complex piece of software; accordingly, its internal architecture can be built around several different designs. Some designs that have been tried in practice and are in common use are:

Monolithic systems. This is the oldest design, but it is still popular for very small real-time executives intended for deeply embedded applications, and for the real-time portion of more complex systems, due to its simplicity and very low processor and memory overhead. In monolithic systems, the operating system as a whole runs in privileged mode, and the only internal structure is usually induced by the way operating system services are invoked: applications, running in user mode, request operating system services by executing a special trapping instruction, usually known as the system call instruction. This instruction brings the processor into privileged mode and transfers control to the system call dispatcher of the operating system. The system call dispatcher then determines which service must be carried out, and transfers control to the appropriate service procedure. Service procedures share a set of utility procedures, which implement generally useful functions on their behalf.

Interrupt handling is done directly in the kernel for the most part, and interrupt handlers are not full-fledged processes. As a consequence, the interrupt handling overhead is very small, because there is no full task switch at interrupt arrival, but the interrupt handling code cannot invoke most system services, notably blocking synchronization primitives. Moreover, the operating system scheduler is disabled while interrupt handling is in progress, and only hardware prioritization of interrupt requests is in effect; hence, the interrupt handling code is implicitly executed at a priority higher than the priority of all other tasks in the system.

To further reduce processor overhead on small systems, it is also possible to run the application as a whole in supervisor mode. In this case, the application code can be bound with the operating system at link time, and system calls become regular function calls. The interface between application code and operating system becomes much faster, because no user-mode state must be saved on system call invocation and no trap handling is needed. On the other hand, the overall control that the operating system can exercise on bad application behavior is greatly reduced, and debugging may become harder.

In systems of this kind, it is usually impossible to upgrade individual software components, for example, an application module, without replacing the executable image as a whole and then rebooting the system. This constraint can be of concern in applications where software complexity demands the frequent replacement of modules, and no system down time is allowed.

Layered systems. A refinement and generalization of the monolithic system design consists of organizing the operating system as a hierarchy of layers at system design time. Each layer is built upon the services offered by the one below it and, in turn, offers a well-defined and usually richer set of services to the layer above it. Operating system interface and interrupt handling are implemented as in monolithic systems; hence, the corresponding overheads are very similar.
Better structure and modularity make maintenance easier, both because the operating system code is easier to read and understand, and because the inner structure of a layer can be changed at will without interfering with other layers, provided the interlayer interface does not change. Moreover, the modular structure of the operating system enables the fine-grained configuration of its capabilities, to tailor the operating system itself to its target platform and avoid wasting valuable memory space for operating system functions that are never used by the application. As a consequence, it is possible to enrich the operating system with many capabilities, for example, network support, without sacrificing
its ability to run on very small platforms when these features are not needed. A number of operating systems in use today evolved into this structure, often starting from a monolithic approach, and offer sophisticated build-time or link-time configuration tools.

Microkernel systems. This design moves many operating system functions from the kernel up into operating system server processes running in user mode, leaving a minimal microkernel and reducing to an absolute minimum the amount of privileged operating system code. Applications request operating system services by sending a message to the appropriate operating system server and waiting for a reply. The main purpose of the microkernel is to handle the communication between applications and servers, to enforce an appropriate security policy on such communication, and to perform some critical operating system functions, such as accessing I/O device registers, that would be difficult, or inefficient, to do from user-mode processes.

This kind of design makes the operating system easier to manage and maintain. Also, the message-passing interface between user processes and operating system components encourages modularity and enforces a clear and well-understood structure on operating system components. Moreover, the reliability of the operating system is increased: since the operating system servers run in user mode, if one of them fails some operating system functions will no longer be available, but the system will not crash. Moreover, the failed component can be restarted and replaced without shutting down the whole system. Last, the design is easily extended to distributed systems, where operating system functions are split across a set of distinct machines connected by a communication network. Systems of this kind are very promising in terms of performance, scalability, and fault tolerance, especially for large and complex real-time applications.

By contrast, making the message-passing communication mechanism efficient can be a critical issue, especially for distributed systems, and the system call invocation mechanism induces more overhead than in monolithic and layered systems. Interrupt requests are handled by transforming them into messages directed to the appropriate interrupt handling task as soon as possible: the interrupt handler proper runs in interrupt service mode and performs the minimum amount of work strictly required by the hardware, then synthesizes a message and sends it to an interrupt service task. In turn, the interrupt service task concludes interrupt handling running in user mode. Being an ordinary task, the interrupt service task can, at least in principle, invoke the full range of operating system services, including blocking synchronization primitives, and must not concern itself with excessive usage of the interrupt service processor mode. On the other hand, the overhead related to interrupt handling increases, because the activation of the interrupt service task requires a task switch.

Virtual machines. The internal architecture of operating systems based on virtual machines revolves around the basic observation that an operating system must perform two essential functions: multiprogramming and system services.
Accordingly, those operating systems fully separate these functions and implement them as two distinct operating system components: a virtual machine monitor that runs in privileged mode, implements multiprogramming, and provides many virtual processors identical in all respects to the real processor it runs on, and one or more guest operating systems that run on the virtual processors, and implement system services. Different virtual processors can run different operating systems, and they must not necessarily be aware of being run in a virtual machine. In the oldest approach to virtual machine implementation, guest operating systems are given the illusion of running in privileged mode, but are instead constrained to operate in user mode; in this way, the virtual machine monitor is able to intercept all privileged instructions issued by the guest operating systems, check them against the security policy of the system, and then perform them on behalf of the guest operating system itself. Interrupt handling is implemented in a similar way: the virtual machine monitor catches all interrupt requests and then redirects them to the appropriate guest operating system handler, reverting to user mode in the process; thus, the virtual machine monitor can intercept all privileged instructions issued by the guest interrupt handler, and again check and perform them as appropriate.
The full separation of roles and the presence of a relatively small, centralized arbiter of all interactions between virtual machines has the advantage of making the enforcement of security policies easier. The isolation of virtual machines from each other also enhances reliability because, even if one virtual machine fails, it does not bring down the system as a whole. In addition, it is possible to run a distinct operating system in each virtual machine thus supporting, for example, the orderly coexistence between a real-time and a general-purpose operating system. By contrast, the perfect implementation of virtual machines requires hardware assistance, both to make it feasible and to be able to emulate privileged instructions with a reasonable degree of efficiency. A variant of this design adopts an interpretive approach, and allows the virtual machines to be different from the physical machine. For example, Java programs are compiled into byte-code instructions suitable for execution by an abstract Java virtual machine. On the target platform an interpreter executes the byte code on the physical processor, thus implementing the virtual machine. More sophisticated approaches to virtual machine implementation are also possible, the most common one being on-the-fly code generation, also known as just-in-time compilation.
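To make the trap-and-emulate principle described above more concrete, the following self-contained C fragment is a toy sketch of a virtual machine monitor emulating one privileged instruction on purely virtual processor state. All names, types, and opcodes here are invented for illustration and do not belong to any real VMM.

    #include <stdio.h>

    /* Toy sketch only: emulate privileged instructions against the
       *virtual* state of a guest processor. */
    typedef enum { USER, PRIVILEGED } vmode_t;

    typedef struct {
        vmode_t  mode;        /* mode the guest *believes* it is in */
        unsigned int_enable;  /* virtual interrupt-enable flag      */
    } vcpu_t;

    /* Privileged opcodes; on real hardware, executing one of these in
       user mode raises a trap that the VMM catches. */
    typedef enum { OP_CLI, OP_STI, OP_HALT } opcode_t;

    /* The guest always runs in (real) user mode, so every privileged
       opcode it issues ends up in this VMM handler. */
    static void vmm_emulate(vcpu_t *vcpu, opcode_t op)
    {
        if (vcpu->mode != PRIVILEGED) {
            /* Guest was in virtual user mode: reflect the trap back
               to the guest kernel instead of emulating. */
            printf("trap reflected to guest kernel\n");
            return;
        }
        switch (op) {             /* emulate on virtual state only */
        case OP_CLI:  vcpu->int_enable = 0; break;
        case OP_STI:  vcpu->int_enable = 1; break;
        case OP_HALT: printf("guest halted\n"); break;
        }
    }

    int main(void)
    {
        vcpu_t vcpu = { PRIVILEGED, 1 };  /* guest kernel "boots" privileged */
        vmm_emulate(&vcpu, OP_CLI);       /* guest disables virtual interrupts */
        printf("virtual int_enable = %u\n", vcpu.int_enable);
        return 0;
    }

Note how the privileged instruction never touches the real processor state: the VMM checks it against the guest's virtual mode and either emulates it or reflects it back, which is the essence of the scheme described above.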
11.2.2 Process and Thread Model

A convenient and easy-to-understand way to design real-time software applications is to organize them as a set of cooperating sequential processes. A process is an activity, namely the activity of executing a program, and encompasses the program being executed, the data it operates on, and the current processor state, including the program counter and registers. In particular, each process has its own address space.

In addition, it is often convenient to support multiple threads of control within the same process, sharing the same address space. Threads can be implemented for the most part in user mode, without the intervention of the operating system's kernel; moreover, when the processor is switched between threads, the address space remains the same and must not be switched. Both these facts make processor switching between threads very fast compared with switching between processes. On the other hand, since all threads within a process share the same address space, there can be only a very limited amount of protection among them with respect to memory access; hence, for example, a thread is allowed to pollute by mistake another thread's data, and the operating system has no way to detect errors of this kind. As a consequence, many small operating systems for embedded applications only support threads, to keep overheads and hardware requirements to a minimum, while larger operating systems for more complex real-time applications offer the user a choice between a single- or multiple-process model to enhance the reliability of complex systems.

Another important operating system design issue is the choice between static and dynamic creation of processes and threads: some operating systems, usually oriented toward relatively simple embedded applications, only support static tasks, that is, all tasks in the system are known in advance and it is not possible to create and destroy tasks while the system is running; thus, the total number of tasks in the system stays constant for all its life. Other operating systems allow tasks to be created and destroyed at runtime, by means of a system call. Dynamic task creation has the obvious advantage of making the application more flexible, but it increases the complexity of the operating system, because many operating system data structures, first of all the process table, must be allocated dynamically and their exact size cannot be known in advance. In addition, the application code requires a more sophisticated error-handling strategy, with its associated overheads, because it must be prepared to cope with the inability of the operating system to create a new task, due to lack of resources.
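For instance, under the POSIX interface described in Section 11.3, dynamic thread creation can fail with the EAGAIN error code when the system lacks the necessary resources. The following minimal sketch shows the kind of check a well-written application must perform; the trivial thread body is a placeholder.

    #include <pthread.h>
    #include <errno.h>
    #include <stdio.h>

    static void *thread_main(void *arg)   /* placeholder thread body */
    {
        (void)arg;
        return NULL;
    }

    int spawn_worker(void)
    {
        pthread_t tid;
        int err = pthread_create(&tid, NULL, thread_main, NULL);

        if (err == EAGAIN) {
            /* The system lacks the resources (or has hit a limit on
               the number of threads): degrade gracefully instead of
               crashing. */
            fprintf(stderr, "thread creation failed: out of resources\n");
            return -1;
        }
        return err;   /* 0 on success, another error code otherwise */
    }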
11.2.3 Processor Scheduling

The scheduler is one of the most important components of a real-time operating system, as it is responsible for deciding to which runnable threads the available processors must be assigned, and for how long. Among dynamic schedulers, that is, schedulers that perform scheduling computations at runtime, while the
application is running, several algorithms are in use and offer different tradeoffs between real-time predictability, implementation complexity, and overhead. Since the optimum compromise often depends on the application's characteristics, most real-time operating systems support multiple scheduling policies simultaneously, and the responsibility of a correct choice falls on the application programmer. The most common scheduling algorithms supported by real-time operating systems and specified by international standards are:

First in, first out with priority classes. Under this algorithm, also known as fixed-priority scheduling, there is a list of runnable threads for each priority level. When a processor is idle, the scheduler takes the runnable thread at the head of the highest-priority, nonempty thread list and runs it. When the scheduler preempts a running thread, because a higher-priority task has become runnable, the preempted thread becomes the head of the thread list for its priority; when a blocked thread becomes runnable again, it becomes the tail of the thread list for its priority. The first in, first out scheduler never changes thread priorities at runtime; hence, the priority assignment is fully static. A well-known approach to static priority assignment for periodic tasks is the rate monotonic policy, in which task priorities are inversely proportional to their periods (a POSIX sketch of this policy appears at the end of this section).

In order to ensure that none of the threads can monopolize the processor when multiple runnable threads share the same priority level, the basic algorithm is often enhanced with the additional constraint that, when a running thread has been executing for more than a maximum amount of time, the quantum, that thread is forcibly returned to the tail of its thread list and a new thread is selected for execution; this approach is known as round-robin scheduling.

Earliest deadline first. The earliest deadline first scheduler assigns thread priorities dynamically. In particular, this scheduler always executes the thread with the nearest deadline. It can be shown that this algorithm is optimal for uniprocessor systems, and supports full processor utilization in all situations. However, its performance under overload can be poor, and dynamically updating thread priorities on the basis of their deadlines may be computationally expensive, especially when this scheduler is layered on a fixed-priority lower-level scheduler.

Sporadic server. This scheduling algorithm was first introduced in Reference 4, where a thorough description of the algorithm can be found. The sporadic server algorithm is suitable for aperiodic event handling where, for timeliness, events must be handled at a certain, usually high, priority level, but lower-priority threads with real-time requirements could suffer from excessive preemption if that priority level were maintained indefinitely. It acts on the basis of two main scheduling parameters associated with each thread: the execution capacity and the replenishment period. Informally, the execution capacity of a thread represents the maximum amount of processor time that the thread is allowed to consume at high priority in a replenishment period. The execution capacity of a thread is preserved until an aperiodic request for that task occurs, thus making it runnable; then, thread execution depletes its execution capacity.
The sporadic server algorithm replenishes the thread’s execution capacity after some or all of its capacity is consumed by thread execution; the schedule for replenishing the execution capacity is based on the thread’s replenishment period. Should the thread reach its processor usage upper limit, its execution capacity becomes zero and it is demoted to a lower-priority level, thus avoiding excessive preemption against other threads. When replenishments have restored the execution capacity of the thread above a certain threshold level, the scheduler promotes the thread to its original priority again.
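As anticipated above, the following fragment is a minimal sketch of a rate monotonic priority assignment expressed through the POSIX interfaces of Section 11.3. It assumes an implementation that supports the SCHED_FIFO fixed-priority policy; prio_of_period() is an invented, purely illustrative mapping from periods to priorities.

    #include <pthread.h>
    #include <sched.h>
    #include <string.h>

    /* Illustrative mapping: clamp 1..100 ms periods into the available
       SCHED_FIFO priority range, shorter period => higher priority. */
    static int prio_of_period(int period_ms)
    {
        int min  = sched_get_priority_min(SCHED_FIFO);
        int max  = sched_get_priority_max(SCHED_FIFO);
        int prio = max - (period_ms * (max - min)) / 100;
        return (prio < min) ? min : prio;
    }

    int create_rm_thread(pthread_t *tid, void *(*body)(void *), int period_ms)
    {
        pthread_attr_t     attr;
        struct sched_param sp;
        int                err;

        pthread_attr_init(&attr);
        /* Do not inherit the creator's policy; use what we set here. */
        pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED);
        pthread_attr_setschedpolicy(&attr, SCHED_FIFO);
        memset(&sp, 0, sizeof sp);
        sp.sched_priority = prio_of_period(period_ms);
        pthread_attr_setschedparam(&attr, &sp);

        err = pthread_create(tid, &attr, body, NULL);
        pthread_attr_destroy(&attr);
        return err;
    }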
11.2.4 Interprocess Synchronization and Communication

An essential function of a multiprogrammed operating system is to allow processes to synchronize and exchange information; as a whole, these functions are known as InterProcess Communication (IPC). Many interprocess synchronization and communication mechanisms have been proposed and were objects of
extensive theoretical study in the scientific literature. Among them, we recall:

Semaphores. A semaphore, first introduced by Dijkstra in 1965, is a synchronization device with an integer value, on which the following two primitive, atomic operations are defined:

• The P operation, often called DOWN or WAIT, checks if the current value of the semaphore is greater than zero. If so, it decrements the value and returns to the caller; otherwise, the invoking process goes into the blocked state until another process performs a V on the same semaphore.
• The V operation, also called UP, POST, or SIGNAL, checks whether there is any process currently blocked on the semaphore. In this case, it wakes exactly one of them up, allowing it to complete its P; otherwise, it increments the value of the semaphore. The V operation never blocks.

Semaphores are a very low-level IPC mechanism; therefore, they have the obvious advantage of being simple to implement, at least on uniprocessor systems, and of having a very low overhead. By contrast, they are difficult to use, especially in complex applications. Also related to mutual exclusion with semaphores is the problem of priority inversion. Priority inversion occurs when a high-priority process is forced to wait for a lower-priority process to exit a critical region, a situation in contrast with the concept of relative task priorities. Most real-time operating systems take this kind of blocking into account and implement several protocols to bound it; among these, we recall the priority inheritance and the priority ceiling protocols (a sketch of how POSIX exposes the former appears at the end of this section).

Monitors. To overcome the difficulties of semaphores, in 1974/1975 Hoare and Brinch Hansen introduced a higher-level synchronization mechanism, the monitor. A monitor is a set of data structures and procedures that operate on them; the data structures are shared among all processes that can use the monitor, and the monitor effectively hides them; hence, the only way to access the data associated with the monitor is through the procedures in the monitor itself. In addition, all procedures in the same monitor are implicitly executed in mutual exclusion. Unlike with semaphores, the responsibility of ensuring mutual exclusion falls on the compiler, not on the programmer, because monitors are a programming language construct, and the language compiler knows about them. Inside a monitor, condition variables can be used to wait for events. Two atomic operations are defined on a condition variable:

• The WAIT operation releases the monitor and blocks the invoking process until another process performs a SIGNAL on the same condition variable.
• The SIGNAL operation unblocks exactly one process waiting on the condition variable. Then, to ensure that mutual exclusion is preserved, the invoking process is either forced to leave the monitor immediately (Brinch Hansen's approach), or is blocked until the monitor becomes free again (Hoare's approach).

Message passing. Unlike all other IPC mechanisms described so far, message passing supports explicit data transfer between processes; hence, it does not require shared memory and lends itself well to being extended to distributed systems. This IPC method provides two primitives:

• The SEND primitive sends a message to a given destination.
• Symmetrically, RECEIVE receives a message from a given source.

Many variants are possible on the exact semantics of these primitives; they mainly differ in the way messages are addressed and buffered.
A commonplace addressing scheme is to give to each process/thread in the system a unique address, and to send messages directly to processes. Otherwise, a message can be addressed to a mailbox, a message container whose maximum capacity is usually specified upon creation; in this case, message source and destination addresses are mailbox addresses. When mailboxes are used, they also provide some amount of message buffering, that is, they hold messages that have been sent but have not been received yet. Moreover, a single task can own multiple mailboxes and use them to classify messages depending on their source or priority. A somewhat contrary
approach to message buffering, simpler to implement but less flexible, is the rendezvous strategy: the system performs no buffering. Hence, when using this scheme the sender and the receiver are forced to run in lockstep because the SEND does not complete until another process executes a matching RECEIVE and, conversely, the RECEIVE waits until a matching SEND is executed.
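As anticipated above, the following fragment is a minimal sketch of how the priority inheritance protocol can be requested through the POSIX interface of Section 11.3; it assumes an implementation that supports the corresponding POSIX option.

    #include <pthread.h>

    /* Sketch: a mutual exclusion device configured for the priority
       inheritance protocol, which bounds priority inversion. */
    int init_pi_mutex(pthread_mutex_t *m)
    {
        pthread_mutexattr_t attr;
        int err;

        pthread_mutexattr_init(&attr);
        /* A low-priority thread holding this mutex temporarily inherits
           the priority of the highest-priority thread blocked on it. */
        err = pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
        if (err == 0)
            err = pthread_mutex_init(m, &attr);
        pthread_mutexattr_destroy(&attr);
        return err;
    }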
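POSIX message queues, part of the standard discussed in Section 11.3, behave much like the mailboxes described above, buffering a bounded number of messages between senders and receivers. The following fragment is a minimal sketch; the queue name and capacities are arbitrary, illustrative choices.

    #include <mqueue.h>
    #include <fcntl.h>
    #include <stddef.h>

    /* Sketch: mailbox-style buffered message passing with a POSIX
       message queue. "/sensor_mbox" and the sizes are illustrative. */
    int send_sample(const char *buf, size_t len)
    {
        struct mq_attr attr = { 0 };
        mqd_t mq;
        int   err;

        attr.mq_maxmsg  = 8;    /* buffering capacity of the mailbox */
        attr.mq_msgsize = 64;   /* maximum size of a single message  */

        mq = mq_open("/sensor_mbox", O_WRONLY | O_CREAT, 0600, &attr);
        if (mq == (mqd_t)-1)
            return -1;

        /* Priority 0: plain FIFO ordering within the mailbox. */
        err = mq_send(mq, buf, len, 0);
        mq_close(mq);
        return err;
    }

Unlike the rendezvous strategy, a send on such a queue completes as soon as there is room in the mailbox, so sender and receiver are not forced to run in lockstep.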
11.2.5 Network Support

There are two basic approaches to implement network support in a real-time operating system and to offer it to applications:

• The POSIX standard (Portable Operating System Interface for Computing Environments) [5] specifies the socket paradigm for uniform access to any kind of network support, and many real-time operating systems provide it. Sockets, fully described in Reference 3, were first introduced in the "Berkeley Unix" operating system and are now available on virtually all general-purpose operating systems; as a consequence, most programmers are likely to be proficient with them. The main advantage of sockets is that they support in a uniform way any kind of communication network, protocol, naming convention, hardware, and so on. The semantics of communication and naming are captured by communication domains and socket types, both specified upon socket creation. For example, communication domains are used to distinguish between IPv4 and X.25 network environments, whereas the socket type determines whether communication will be stream-based or datagram-based, and also implicitly selects which network protocol a socket will use. Additional socket characteristics can be set up after creation through abstract socket options; for example, socket options provide a uniform, implementation-independent way to set the amount of receive buffer space associated with a socket (a sketch appears after this list).

• Some operating systems, mostly focused on a specific class of embedded applications, offer network support through a less general, but richer and more efficient, application programming interface. For example, Reference 6 is an operating system specification oriented to automotive applications; it specifies a communication environment (OSEK/VDX COM) less general than sockets and oriented to real-time message-passing networks, such as the Controller Area Network (CAN). In this case, for example, the application programming interface allows applications to easily set message filters and perform out-of-order receives, thus enhancing their timing behavior; both these functions are supported with difficulty by sockets, because they do not fit well with the general socket paradigm.

In both cases, network device drivers are usually supplied by third-party hardware vendors and conform to a well-defined interface defined by the operating system vendor. The network software itself, although often bundled with the operating system and provided by the same vendor, can be obtained from third-party software houses, too. Often, these products are designed to run on a wide variety of hardware platforms and operating systems, and come with source code; hence, it is also possible to port them to custom operating systems developed in-house, and they can be extended and enhanced by the end user.
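As anticipated in the list above, the following fragment is a minimal sketch of the abstract socket-option mechanism: it creates an IPv4 datagram socket and sets its receive buffer size. The buffer size chosen here is arbitrary.

    #include <sys/socket.h>
    #include <netinet/in.h>

    /* Sketch: an IPv4 datagram socket whose receive buffer is sized
       through the uniform socket-option interface. */
    int make_datagram_socket(void)
    {
        int fd = socket(AF_INET, SOCK_DGRAM, 0);   /* domain + type */
        int rcvbuf = 16 * 1024;                    /* illustrative size */

        if (fd < 0)
            return -1;

        if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF,
                       &rcvbuf, sizeof rcvbuf) < 0) {
            /* The option is advisory on some systems; this sketch
               treats failure as nonfatal. */
        }
        return fd;
    }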
11.2.6 Additional Functions

Even if real-time operating systems sometimes do not implement several major functions that are now commonplace in general-purpose operating systems, such as demand paging, swapping, and filesystem access, they must be concerned with other, less well-known functions that ensure or enhance system predictability, for example:

• Asynchronous, real-time signals and cancellation requests, to deal with unexpected events, such as software and hardware failures, and to gracefully degrade the system's performance should a processor overload occur.
• High-resolution clocks and timers, to give real-time processes an accurate notion of elapsed time.
• Asynchronous I/O operations, to decouple real-time processes from the inherent unpredictability of many I/O devices.
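For instance, the POSIX timer interface described in Section 11.3 can be used to obtain a periodic, high-resolution time reference. The following fragment is a minimal sketch; the choice of signal and the 10 ms period are arbitrary, and it assumes an implementation that supports the POSIX timers option.

    #include <signal.h>
    #include <time.h>
    #include <string.h>

    /* Sketch: a periodic 10 ms timer that delivers a real-time signal
       (SIGRTMIN) every time it expires. */
    int start_periodic_timer(timer_t *tid)
    {
        struct sigevent   sev;
        struct itimerspec its;

        memset(&sev, 0, sizeof sev);
        sev.sigev_notify          = SIGEV_SIGNAL;
        sev.sigev_signo           = SIGRTMIN;    /* a real-time signal */
        sev.sigev_value.sival_ptr = tid;

        if (timer_create(CLOCK_REALTIME, &sev, tid) != 0)
            return -1;

        its.it_value.tv_sec  = 0;                /* first expiration   */
        its.it_value.tv_nsec = 10 * 1000 * 1000;
        its.it_interval      = its.it_value;     /* then every 10 ms   */

        return timer_settime(*tid, 0, &its, NULL);
    }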
11.3 The POSIX Standard

The original version of the Portable Operating System Interface for Computing Environments, better known as "the POSIX standard," was first published between 1988 and 1990, and defines a standard way for applications to interface with the operating system. The set now includes over 30 individual standards, and covers a wide range of topics, from the definition of basic operating system services, such as process management, to specifications for testing the conformance of an operating system to the standard itself.

Among these, of particular interest is the System Interfaces (XSH) volume of IEEE Std 1003.1-2001 [5], which defines a standard operating system interface and environment, including real-time extensions. The standard contains definitions for system service functions and subroutines, language-specific system services for the C programming language, and notes on portability, error handling, and error recovery.

This standard has been constantly evolving since it was first published in 1988; the latest developments have been crafted by a joint working group of members of the IEEE Portable Applications Standards Committee, members of The Open Group, and members of ISO/IEC Joint Technical Committee 1. The joint working group is known as the Austin Group, after the location of the inaugural meeting held at the IBM facility in Austin, Texas, in September 1998. The Austin Group formally began its work in July 1999, after the subscription of a formal agreement between The Open Group and the IEEE, with the main goal of revising, updating, and combining into a single document the following standards: ISO/IEC 9945-1, ISO/IEC 9945-2, IEEE Std 1003.1, IEEE Std 1003.2, and the Base Specifications of The Open Group Single UNIX Specification Version 2.

For real-time systems, the latest version of IEEE Std 1003.1 [5] incorporates the real-time extension standards listed in Table 11.1. Since embedded systems can have strong resource limitations, the IEEE Std 1003.13-1998 [7] profile standard groups functions from the standards mentioned above into units of functionality. Implementations can then choose the profile most suited to their needs and to the computing resources of their target platforms.

Operating systems invariably experience a delay between the adoption of a standard and its implementation; hence, functions defined earlier in time are usually supported across a wider range of operating systems. For this reason, in this section we concentrate only on the functions that are both related to real-time software development and actually available on most real-time operating systems at the date of writing, including multithreading support. In addition, we assume that the set of functions common to both the POSIX and the ISO C [8] standards is well known to readers; hence, we will not describe it. Table 11.2 summarizes the functional groups of IEEE Std 1003.1-2001 that will be discussed next.
TABLE 11.1 Real-Time Extensions Incorporated into IEEE Std 1003.1-2001

Standard   Description
1003.1b    Basic real-time extensions; first published in 1993
1003.1c    Threads extensions; published in 1995
1003.1d    Additional real-time extensions; published in 1999
1003.1j    Advanced real-time extensions; published in 2000
TABLE 11.2 Basic Functional Groups of IEEE Std 1003.1-2001

Multiple threads:
pthread_create, pthread_exit, pthread_join, pthread_detach, pthread_equal, pthread_self

Process and thread scheduling:
sched_setscheduler, sched_getscheduler, sched_setparam, sched_getparam, pthread_setschedparam, pthread_getschedparam, pthread_setschedprio, pthread_attr_setschedpolicy, pthread_attr_getschedpolicy, pthread_attr_setschedparam, pthread_attr_getschedparam, sched_yield, sched_get_priority_max, sched_get_priority_min, sched_rr_get_interval

Real-time signals:
sigqueue, pthread_kill, sigaction, sigaltstack, sigemptyset, sigfillset, sigaddset, sigdelset, sigismember, sigwait, sigwaitinfo, sigtimedwait

Interprocess synchronization and communication:
mq_open, mq_close, mq_unlink, mq_send, mq_receive, mq_timedsend, mq_timedreceive, mq_notify, mq_getattr, mq_setattr, sem_init, sem_destroy, sem_open, sem_close, sem_unlink, sem_wait, sem_trywait, sem_timedwait, sem_post, sem_getvalue, pthread_mutex_destroy, pthread_mutex_init, pthread_mutex_lock, pthread_mutex_trylock, pthread_mutex_timedlock, pthread_mutex_unlock, pthread_cond_init, pthread_cond_destroy, pthread_cond_wait, pthread_cond_timedwait, pthread_cond_signal, pthread_cond_broadcast, shm_open, close, shm_unlink, mmap, munmap

Thread-specific data:
pthread_key_create, pthread_getspecific, pthread_setspecific, pthread_key_delete

Memory management:
mlock, mlockall, munlock, munlockall, mprotect

Asynchronous and list directed I/O:
aio_read, aio_write, lio_listio, aio_error, aio_return, aio_fsync, aio_suspend, aio_cancel

Clocks and timers:
clock_gettime, clock_settime, clock_getres, timer_create, timer_delete, timer_getoverrun, timer_gettime, timer_settime

Cancellation:
pthread_cancel, pthread_setcancelstate, pthread_setcanceltype, pthread_testcancel, pthread_cleanup_push, pthread_cleanup_pop

11.3.1 Attribute Objects

Attribute objects are a mechanism devised to support future standardization and portable extension of some entities specified by the POSIX standard, such as threads, mutual exclusion devices, and condition variables, without requiring that the functions operating on them be changed.
In addition, they provide for a clean isolation of the configurable aspects of said entities. For example, the stack address (i.e., the location in memory of the storage to be used for the thread's stack) is an important attribute of a thread, but it cannot be expressed portably and must be adjusted when the program is ported to a different architecture. The use of attribute objects allows the programmer to specify a thread's attributes in a single place, rather than spreading them across every instance of thread creation; moreover, the same set of attributes can be shared by multiple objects of the same kind so as, for example, to set up classes of threads with similar attributes. Figure 11.1 shows how attribute objects are created, manipulated, used to configure objects when creating them, and finally destroyed. As an example, the function and attribute names given in the figure are those used for threads, but the same general architecture is also used for the attributes of mutual exclusion devices (Section 11.3.5.3) and condition variables (Section 11.3.5.4). In order to be used, an attribute object must first be initialized by means of the pthread_attr_init function; this function also fills the attribute object with a default value for all attributes defined by the implementation.
FIGURE 11.1 Attribute objects in POSIX: an attribute object holding named attributes (e.g., stackaddr) is initialized by pthread_attr_init(), its individual attributes are read and written by functions such as pthread_attr_getstackaddr() and pthread_attr_setstackaddr(), it is then used to create one or more threads with pthread_create(), and it is finally destroyed by pthread_attr_destroy().
When it is no longer needed, an attribute object should be destroyed by invoking the pthread_attr_destroy function on it. An attribute object holds one or more named attributes; for each attribute, the standard specifies a pair of functions that the user can call to get and set the value of that attribute. For example, the pthread_attr_getstackaddr and pthread_attr_setstackaddr functions get and set the stack address attribute, respectively. After setting the individual attributes of an attribute object to the desired values, the attribute object can be used to configure one or more entities specified by the standard. Hence, for example, the thread attribute object just described can be used as an argument to the pthread_create function which, in turn, creates a thread with the given set of attributes. Last, it should be noted that attribute objects are defined as opaque types, so they shall be accessed only through the functions just presented, and not by manipulating their representation directly, even if it is known, because doing so is not guaranteed to be portable across different implementations of the standard.
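As a concrete illustration, the following minimal sketch, in C, creates a thread with an explicitly sized stack through an attribute object; the 256 KiB stack size and the worker function are arbitrary illustrative choices, not values suggested by the standard.

    /* A minimal sketch of attribute-object usage: a thread is created with
       an explicitly sized stack. Error checking is omitted for brevity. */
    #include <pthread.h>
    #include <stdio.h>

    static void *worker(void *arg)
    {
        printf("worker started with argument %ld\n", (long)arg);
        return NULL;
    }

    int main(void)
    {
        pthread_attr_t attr;
        pthread_t tid;

        pthread_attr_init(&attr);                     /* fill with default values */
        pthread_attr_setstacksize(&attr, 256 * 1024); /* set one named attribute  */

        /* The same attribute object could be reused for several threads. */
        pthread_create(&tid, &attr, worker, (void *)42L);

        pthread_attr_destroy(&attr); /* safe: the thread keeps its own copy */
        pthread_join(tid, NULL);
        return 0;
    }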
11.3.2 Multithreading

The multithreading capability specified by the POSIX standard includes functions to populate a process with new threads. In particular, the pthread_create function creates a new thread within a process and sets up a thread identifier for it, to be used to operate on the thread in the future. After creation, the thread immediately starts executing a function passed to pthread_create as an argument; moreover, it is also possible to pass an argument to that function in order, for example, to share the same function among multiple threads and nevertheless be able to distinguish between them. The pthread_create function also takes an optional reference to an attribute object as an argument. The attributes of a thread determine, for example, the size and location of the thread's stack and its scheduling parameters; the latter will be described further in Section 11.3.3. A thread may terminate in three different ways:

• By returning from its main function
• By explicitly calling the pthread_exit function
• By accepting a cancellation request (Section 11.3.10)

In any case, the pthread_join function allows the calling thread to synchronize with, that is, wait for, the termination of another thread. When the thread finally terminates, this function also returns to the caller summary information about the reason for the termination. For example, if the target thread terminated itself by means of the pthread_exit function, pthread_join returns the status code passed to pthread_exit in the first place.
If this information is not needed, it is possible to save system resources by detaching a thread, either dynamically by means of the pthread_detach function, or statically by means of a thread attribute. In this way, the storage associated with that thread can be reclaimed immediately when the thread terminates. Additional utility functions are provided to operate on thread identifiers; for example, the pthread_equal function checks whether two thread identifiers are equal (i.e., whether they refer to the same thread), and the pthread_self function returns the identifier of the calling thread.
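The following minimal sketch shows several threads sharing a single thread function, told apart by the argument passed at creation time; pthread_join then collects the status each thread handed to pthread_exit. The thread count and status values are arbitrary illustrative choices.

    /* A minimal sketch: per-thread arguments and termination status. */
    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 3

    static void *body(void *arg)
    {
        long id = (long)arg;             /* per-thread argument */
        printf("thread %ld running\n", id);
        pthread_exit((void *)(id * 10)); /* termination status  */
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];
        void *status;

        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&tid[i], NULL, body, (void *)i);

        for (long i = 0; i < NTHREADS; i++) {
            pthread_join(tid[i], &status); /* wait for and collect status */
            printf("thread %ld returned %ld\n", i, (long)status);
        }
        return 0;
    }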
11.3.3 Process and Thread Scheduling

Functions in this group allow the application to select a specific policy that the operating system must follow to schedule a particular process or thread, and to get and set the scheduling parameters associated with that process or thread. In particular, the sched_setscheduler function sets both the scheduling policy and the parameters associated with a process, and sched_getscheduler reads them back for examination. The sched_setparam and sched_getparam functions are somewhat more limited, because they set and get the scheduling parameters but not the policy. All these functions take a process identifier as argument, to uniquely identify a process. For threads, the pthread_setschedparam and pthread_getschedparam functions set and get the scheduling policy and parameters associated with a thread; pthread_setschedprio directly sets the scheduling priority of the given thread. All these functions take a thread identifier as argument and perform a dynamic access to the thread scheduling parameters; in other words, they can be used when the thread already exists in the system. On the other hand, the functions pthread_attr_setschedpolicy, pthread_attr_getschedpolicy, pthread_attr_setschedparam, and pthread_attr_getschedparam store and retrieve the scheduling policy and parameters of a thread into and from an attribute object, respectively; in turn, the attribute object can subsequently be used to create one or more threads. The general mechanism of attribute objects in the POSIX standard has been discussed in Section 11.3.1. An interesting and useful side effect of pthread_setschedprio is that, when the effect of the function is to lower the priority of the target thread, the thread is inserted at the head of the thread list of the new priority, instead of the tail. Hence, this function provides a way for an application to temporarily raise its priority and then lower it again to its original value, without the undesired side effect of yielding to other threads of the same priority. This is necessary, for example, if the application is to implement its own strategies for bounding priority inversion. Last, the sched_yield function allows the invoking thread to voluntarily relinquish the CPU in favor of other threads of the same priority; the invoking thread is linked to the end of the thread list for its priority. In order to support the orderly coexistence of multiple scheduling policies, the conceptual scheduling model defined by the standard and depicted in Figure 11.2 assigns a global priority to all threads in the system and contains one ordered thread list for each priority; any runnable thread will be on the thread list for that thread's priority. When appropriate, the scheduler shall select the thread at the head of the highest-priority, nonempty thread list to become a running thread, regardless of its associated policy; this thread is then removed from its thread list. When a running thread yields the CPU, either voluntarily or by preemption, it is returned to the thread list it belongs to. The purpose of a scheduling policy is then to determine how the operating system scheduler manages the thread lists, that is, how threads are moved between and within lists when they gain or lose access to the CPU.
Associated with each scheduling policy is a priority range, which must span at least 32 distinct priority levels; all threads scheduled according to that policy must lie within that priority range, and priority ranges belonging to different policies can overlap in whole or in part.
FIGURE 11.2 Processor scheduling in POSIX: multiple scheduling policies, each with its own local priority range, are mapped onto a single global priority range at system configuration time; there is one thread list for each global priority, each policy controls the placement of threads within its own priority range, the scheduler selects the thread at the head of the highest-priority nonempty list, and threads are returned to their thread list on yield or preemption.
The sched_get_priority_min and sched_get_priority_max functions return the allowable range of priority for a given scheduling policy. The mapping between the multiple local priority ranges, one for each scheduling policy active in the system, and the single global priority range is usually performed by a simple relocation, and is either fixed or programmable at system configuration time, depending on the operating system. In addition, operating systems may reserve some global priority levels, usually the higher ones, for interrupt handling. The standard defines the following three scheduling policies, whose algorithms were briefly presented in Section 11.2.3:

• First in, first out (SCHED_FIFO)
• Round robin (SCHED_RR)
• Optionally, a variant of the sporadic server scheduler (SCHED_SPORADIC)

Most operating systems set the execution time limit of the round-robin scheduling policy statically, at system configuration time; the sched_rr_get_interval function returns the execution time limit set for a given process, and the standard provides no portable way to set the execution time limit dynamically. A fourth scheduling policy, SCHED_OTHER, can be selected to denote that a thread no longer needs a specific real-time scheduling policy: general-purpose operating systems with real-time extensions usually revert to the default, nonreal-time scheduler when this policy is selected. Moreover, each implementation is free to redefine the exact meaning of the SCHED_OTHER policy, and can provide additional scheduling policies besides those required by the standard, but any application using them will no longer be fully portable.
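As an illustration of these interfaces, the following minimal sketch creates a SCHED_FIFO thread at a mid-range priority through an attribute object; note that real-time policies typically require elevated privileges, and that PTHREAD_EXPLICIT_SCHED must be selected so that the attributes are not silently inherited from the creating thread.

    /* A minimal sketch: creating a SCHED_FIFO thread at a mid-range
       priority chosen portably from the policy's allowable range. */
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    static void *rt_body(void *arg)
    {
        (void)arg;
        /* time-critical work would go here */
        return NULL;
    }

    int main(void)
    {
        pthread_attr_t attr;
        struct sched_param sp;
        pthread_t tid;
        int err;

        pthread_attr_init(&attr);
        pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED);
        pthread_attr_setschedpolicy(&attr, SCHED_FIFO);

        /* pick a priority halfway into the policy's allowable range */
        sp.sched_priority = (sched_get_priority_min(SCHED_FIFO) +
                             sched_get_priority_max(SCHED_FIFO)) / 2;
        pthread_attr_setschedparam(&attr, &sp);

        err = pthread_create(&tid, &attr, rt_body, NULL);
        if (err != 0)
            fprintf(stderr, "pthread_create failed, error %d\n", err);
        else
            pthread_join(tid, NULL);

        pthread_attr_destroy(&attr);
        return 0;
    }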
11.3.4 Real-Time Signals and Asynchronous Events

Signals are a facility specified by the ISO C standard and widely available on most operating systems; they provide a mechanism to convey information to a process or thread when it is not necessarily waiting for input. The IEEE Std 1003.1-2001 further extends the signal mechanism to make it suitable for real-time handling of exceptional conditions and events that may occur asynchronously with respect to the notified process, for example:

• An error occurring during the execution of the process, for example, a memory reference through an invalid pointer.
• Various system and hardware failures, such as a power failure.
• Explicit generation of an event by another process; as a special case, a process can also trigger a signal directed to itself.
• Completion of an I/O operation started by the process in the past, and for which the process did not perform an explicit wait.
• Availability of data to be read from a message queue.

The signal mechanism has a significant historical heritage; in particular, it was first designed when multithreading was not yet in widespread use, and its interface and semantics have undergone many changes since their inception. Therefore, it owes most of its complexity to the need to maintain compatibility with the historical implementations of the mechanism made, for example, by the various flavors of the influential Unix operating system; for the sake of clarity and conciseness, however, the compatibility interfaces will not be discussed in this section. With respect to the ISO C signal behavior, the IEEE Std 1003.1-2001 specifies two main enhancements of interest to real-time programmers:

1. In the ISO C standard, the various kinds of signals are identified by an integer number (often denoted by a symbolic constant in application code) and, when multiple signals of different kinds are pending, they are serviced in an unspecified order; the IEEE Std 1003.1-2001 continues to use signal numbers but specifies that, for a subset of their allowable range, between SIGRTMIN and SIGRTMAX, a priority hierarchy among signals is in effect, so that the lowest-numbered signal has the highest priority of service.
2. In the ISO C standard, there is no provision for signal queues; hence, when multiple signals of the same kind are raised before the target process has had a chance to handle them, all signals but the first are lost. The IEEE Std 1003.1-2001 specifies that the system must be able to keep track of multiple signals with the same number by enqueuing and servicing them in order. Moreover, it also adds the capability of conveying a limited amount of information (a union sigval, capable of holding either an integer or a pointer) with each signal request, so that multiple signals with the same signal number can be distinguished from each other. The queueing policy is always FIFO, and cannot be changed by the user.

As outlined above, each signal has a signal number associated with it, to identify its kind; for example, the signal associated with memory access violations has the number SIGSEGV. Figure 11.3 depicts the life of a signal from its generation up to its delivery. Depending on their kind and source, signals may be directed either to a specific thread in the process, or to the process as a whole; in the latter case, every thread belonging to the process is a candidate for the delivery of the signal, by the rules described later.
It should also be noted that for some kinds of events, the POSIX standard specifies that the notification can also be carried out by the execution of a handling function in a separate thread, if the application so chooses; this mechanism is simpler and clearer than the signal-based notification, but requires multithreading support on the system side.
FIGURE 11.3 Signal generation and delivery in POSIX: a signal is generated and directed either to a specific thread or to the process as a whole (sigqueue(), pthread_kill(), event notification); the process-level action set by sigaction() may ignore it completely; otherwise a single target thread is chosen, subject to per-thread signal masks and explicit waits (pthread_sigmask()); the signal stays pending if no thread is suitable for immediate delivery, and conveyance finally results in the return of sigwait(), the default action (with process-level side effects), or the execution of a signal handler.
11.3.4.1 Generation of a Signal

As outlined above, most signals are generated by the system rather than by an explicit action performed by a process. For these, the POSIX standard specifies that the decision of whether the signal must be directed to the process as a whole or to a specific thread within the process must be made at the time of generation, and must represent the source of the signal as closely as possible. In particular, if a signal is attributable to an action carried out by a specific thread, for example, a memory access violation, the signal shall be directed to that thread and not to the process. If such an attribution is either not possible or not meaningful, as is the case, for example, with the power failure signal, the signal shall be directed to the process. Besides various error conditions, an important source of signals generated by the system relates to asynchronous event notification; such signals are always directed to the process. As an example, Section 11.3.8 will describe the mechanism behind the notification of completion for asynchronous I/O operations. On the other hand, processes have the ability to synthesize signals by means of two main interfaces, depending on the target of the signal:

• The sigqueue function, given a process identifier and a signal number, generates a signal directed to that process; an additional argument, a union sigval, allows the caller to associate a limited amount of information with the signal, provided that the SA_SIGINFO flag is set for that signal number. Additional interfaces exist to generate a signal directed to a group of processes, for example, the killpg function; however, they have not been extended for real time, and hence they do not have the ability to associate any additional information with the signal.
• The pthread_kill function generates a signal directed to a specific thread within the calling process, identified by its thread identifier. It is not possible to generate a signal directed to a specific thread of another process.
11.3.4.2 Process-Level Action

For each kind of signal defined in the system, that is, for each valid signal number, processes may set up an action by means of the sigaction function; the action may consist of one of the following:

• Ignoring the signal completely
• A default action performed by the operating system on behalf of the process, possibly with process-level side effects, such as the termination of the process itself
• The execution of a signal handling function specified by the programmer

In addition, the same function allows the caller to set zero or more flags associated with the signal number. Of the rather large set of flags specified by the POSIX standard, the following are of particular interest to real-time programmers:

• The SA_SIGINFO flag, when set, enables the association of a limited amount of information with each signal; this information will then be conveyed to the signaled process or thread. In addition, if the action associated with the signal is the execution of a user-specified signal handler, setting this flag extends the arguments passed to the signal handler to include additional information about the reason why the signal was generated and about the receiving thread's context that was interrupted when the signal was delivered.
• The SA_RESTART flag, when set, enables the automatic, transparent restart of interruptible system calls when a system call is interrupted by the signal. If this flag is clear, system calls that were interrupted by a signal fail with an error indication and must be explicitly restarted by the application, if appropriate.
• The SA_ONSTACK flag, when set, causes the process or thread to which the signal is delivered to switch to an alternate stack for the execution of the signal handler; the sigaltstack function can be used to set up the alternate stack. If this flag is not set, the signal handler executes on the regular stack of the process or thread.

It should be noted that the setting of the action associated with each kind of signal takes place at the process level, that is, all threads within a process share the same set of actions; hence, for example, it is impossible to set two different signal handling functions (for two different threads) to be executed in response to the same kind of signal. Immediately after generation, the system checks the process-level action associated with the signal in the target process, and immediately discards the signal if that action is set to ignore it; otherwise, it proceeds to check whether the signal can be acted on immediately.

11.3.4.3 Signal Delivery and Acceptance

Provided that the action associated with the signal at the process level does not specify that the signal be ignored in the first place, a signal can be either delivered to or accepted by a thread within the process. Unlike the action associated with each kind of signal discussed above, each thread has its own signal mask; by means of the signal mask, each thread can selectively block some kinds of signals from being delivered to it, depending on their signal number. The pthread_sigmask function allows the calling thread to examine or change (or both) its signal mask. A signal mask can be set up and manipulated by means of the following functions:

sigemptyset: initializes a signal mask so that all signals are excluded from the mask.
sigfillset: initializes a signal mask so that all signals are included in the mask.
sigaddset: given a signal mask and a signal number, adds the specified signal to the signal mask; it has no effect if the signal is already in the mask.
sigdelset: given a signal mask and a signal number, removes the specified signal from the signal mask; it has no effect if the signal is not in the mask.
sigismember: given a signal mask and a signal number, checks whether the signal belongs to the signal mask or not.
A signal can be delivered to a thread if and only if that thread does not block the signal; when a signal is successfully delivered to a thread, that thread executes the process-level action associated with the signal. On the other hand, a thread may perform an explicit wait for one or more kinds of signals by means of the sigwait function; that function stops the execution of the calling thread until one of the signals passed as arguments to sigwait is conveyed to the thread. When this occurs, the thread accepts the signal and continues past the sigwait function. Since the standard specifies that signals in the range from SIGRTMIN to SIGRTMAX are subject to a priority hierarchy, when multiple signals in this range are pending, sigwait shall consume the lowest-numbered one. It should also be noted that, for this mechanism to work correctly, the thread must block the signals that it wishes to accept by means of sigwait (through its signal mask), otherwise signal delivery takes precedence. Two more powerful variants of the sigwait function exist: sigwaitinfo has an additional argument used to return additional information about the signal just accepted, including the information associated with the signal when it was first generated; furthermore, sigtimedwait also allows the caller to specify the maximum amount of time that shall be spent waiting for a signal to arrive. The way in which the system selects a thread within a process to convey a signal depends on where the signal is directed:

• If the signal is directed toward a specific thread, only that thread is a candidate for delivery or acceptance.
• If the signal is directed to the process as a whole, any thread belonging to that process is a candidate to receive the signal; in this case, the system selects exactly one thread within the process with the appropriate signal mask (for delivery), or performing a suitable sigwait (for acceptance).

If there is no suitable thread to convey the signal when it is first generated, the signal remains pending until its delivery or acceptance becomes possible, following the same rules outlined above, or until the process-level action associated with that kind of signal is changed and set to ignore it. In the latter case, the system forgets everything about the signal, and all other pending signals of the same kind.
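The following minimal sketch puts these mechanisms together: the main thread blocks SIGRTMIN and accepts it synchronously with sigwaitinfo, while a second thread queues a signal carrying a small payload with sigqueue. Error checking is omitted for brevity, and the payload value is an arbitrary illustrative choice.

    /* A minimal sketch of real-time signal generation and acceptance. */
    #include <signal.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    static void *sender(void *arg)
    {
        union sigval value;
        (void)arg;
        value.sival_int = 42;                /* payload carried by the signal */
        sigqueue(getpid(), SIGRTMIN, value); /* directed to the process       */
        return NULL;
    }

    int main(void)
    {
        sigset_t set;
        siginfo_t info;
        pthread_t tid;

        sigemptyset(&set);
        sigaddset(&set, SIGRTMIN);
        /* Block the signal so that acceptance via sigwaitinfo() is possible;
           otherwise, delivery would take precedence. */
        pthread_sigmask(SIG_BLOCK, &set, NULL);

        pthread_create(&tid, NULL, sender, NULL);

        sigwaitinfo(&set, &info);            /* accept the signal */
        printf("got signal %d, payload %d\n", info.si_signo,
               info.si_value.sival_int);

        pthread_join(tid, NULL);
        return 0;
    }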
11.3.5 Interprocess Synchronization and Communication

The main interprocess synchronization and communication mechanisms offered by the standard are the semaphore and the message queue, both described in Section 11.2.4. The blocking synchronization primitives have nonblocking and timed counterparts, to enhance their real-time predictability. Moreover, multithreading support also adds mutual exclusion devices, condition variables, and other synchronization mechanisms. The scope of these mechanisms can be limited to threads belonging to the same process to enhance their performance.

11.3.5.1 Message Queues

The mq_open function either creates or opens a message queue and connects it with the calling process; in the system, each message queue is uniquely identified by a name, like a file. This function returns a message queue descriptor that refers to and uniquely identifies the message queue; the descriptor must be passed to all other functions that operate on the message queue. Conversely, mq_close removes the association between the message queue descriptor and its message queue; as a result, the message queue descriptor is no longer valid after successful return from this function. Last, the mq_unlink function removes a message queue, provided no other processes reference it; if this is not the case, the removal is postponed until the reference count drops to zero. The number of elements that a message queue is able to buffer, and their maximum size, are constant for the lifetime of the message queue, and are set when the message queue is first created. The mq_send and mq_receive functions send and receive a message to and from a message queue, respectively. If the message cannot be immediately stored or retrieved (e.g., when mq_send is executed on a full message queue), these functions block as long as appropriate, unless the message queue was opened with the nonblocking option set; in that case, they return immediately if they are unable to perform their job.
The mq_timedsend and mq_timedreceive functions have the same behavior, but allow the caller to place an upper bound on the amount of time they may spend waiting. The standard allows a priority to be associated with each message, and specifies that the queueing policy of message queues must obey that priority so that, for example, mq_receive retrieves the highest-priority message currently stored in the queue. The mq_notify function allows the caller to arrange for the asynchronous notification of message arrival at an empty message queue, when the status of the queue transitions from empty to nonempty, according to the mechanism described in Section 11.3.4. The same function also allows the caller to remove a notification request it made previously. At any time, only a single process may be registered for notification by a message queue. The registration is removed implicitly when a notification is sent to the registered process, or when the process owning the registration explicitly removes it; in both cases, the message queue becomes available for a new registration. If both a notification request and an mq_receive call are pending on a given message queue, the latter takes precedence; that is, when a message arrives at the queue, it satisfies the mq_receive and no notification is sent. Last, the mq_getattr and mq_setattr functions allow the caller to get and set, respectively, some attributes of the message queue dynamically after creation; these attributes include, for example, the nonblocking flag just described, and may also include additional, implementation-specific flags.

11.3.5.2 Semaphores

Semaphores come in two flavors: unnamed and named. Unnamed semaphores are created by the sem_init function and must be shared among processes by means of the usual memory sharing mechanisms provided by the system. On the other hand, named semaphores, created and accessed by the sem_open function, exist as named objects in the system, like the message queues described above, and can therefore be accessed by name. Both functions, when successful, associate the calling process with the semaphore and return a descriptor for it. Depending on the kind of semaphore, either the sem_destroy (for unnamed semaphores) or the sem_close function (for named semaphores) must be used to remove the association between the calling process and a semaphore. For unnamed semaphores, the sem_destroy function also destroys the semaphore; named semaphores, instead, must be removed from the system with a separate function, sem_unlink. For both kinds of semaphore, a set of functions implements the classic P and V primitives, namely:

• The sem_wait function performs a P operation on the semaphore; the sem_trywait and sem_timedwait functions perform the same operation in polling mode and with a user-specified timeout, respectively.
• The sem_post function performs a V operation on the semaphore.
• The sem_getvalue function, which has no counterpart in the definition of semaphore found in the literature, returns the current value of a semaphore.
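As an illustration of the message-queue functions described in Section 11.3.5.1, the following minimal sketch creates a queue, sends two messages with different priorities, and shows that mq_receive retrieves the higher-priority one first; the queue name and sizes are arbitrary illustrative choices, and on some systems the program must be linked against the real-time library.

    /* A minimal sketch of priority-ordered message retrieval. */
    #include <mqueue.h>
    #include <fcntl.h>
    #include <stdio.h>

    int main(void)
    {
        struct mq_attr attr = { .mq_maxmsg = 8, .mq_msgsize = 64 };
        char buf[64]; /* must be at least mq_msgsize bytes */
        unsigned prio;

        mqd_t mq = mq_open("/demo_mq", O_CREAT | O_RDWR, 0600, &attr);

        mq_send(mq, "low priority", 13, 1);   /* priority 1  */
        mq_send(mq, "high priority", 14, 10); /* priority 10 */

        /* The higher-priority message is retrieved first. */
        mq_receive(mq, buf, sizeof(buf), &prio);
        printf("first: \"%s\" (priority %u)\n", buf, prio);

        mq_receive(mq, buf, sizeof(buf), &prio);
        printf("second: \"%s\" (priority %u)\n", buf, prio);

        mq_close(mq);
        mq_unlink("/demo_mq");
        return 0;
    }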
11.3.5.3 Mutexes

A mutex is a very specialized binary semaphore that can only be used to ensure mutual exclusion among multiple threads; it is therefore simpler and more efficient than a full-fledged semaphore. Optionally, it is possible to associate with each mutex a protocol to deal with priority inversion.

The pthread_mutex_init function initializes a mutex and prepares it for use; it takes an attribute object as argument, which works according to the general mechanism described in Section 11.3.1 and can be used to specify the attributes of the mutex, for example, the priority inversion protocol to be used for it. When default mutex attributes are appropriate, a static initialization technique is also available; in particular, the macro PTHREAD_MUTEX_INITIALIZER can be used to initialize a mutex that the application has statically allocated.
In any case, the pthread_mutex_destroy function destroys a mutex. The following main functions operate on the mutex after creation:

• The pthread_mutex_lock function locks the mutex if it is free; otherwise, it blocks until the mutex becomes available and then locks it. The pthread_mutex_trylock function does the same, but returns to the caller without blocking if the lock cannot be acquired immediately; the pthread_mutex_timedlock function allows the caller to specify a maximum amount of time to be spent waiting for the lock to become available.
• The pthread_mutex_unlock function unlocks a mutex.

Additional functions are defined for particular flavors of mutexes; for example, the pthread_mutex_getprioceiling and pthread_mutex_setprioceiling functions allow the caller to get and set, respectively, the priority ceiling of a mutex, and make sense only if the priority ceiling protocol has been selected for the mutex by means of a suitable setting of its attributes.

11.3.5.4 Condition Variables

A set of condition variables, in concert with a mutex, can be used to implement a synchronization mechanism similar to the monitor, without requiring the notion of monitor to be known at the programming language level. A condition variable must be initialized before use by means of the pthread_cond_init function; this function takes an attribute object as argument, which can be used to configure the condition variable to be created, according to the general mechanism described in Section 11.3.1. When default attributes are appropriate, the macro PTHREAD_COND_INITIALIZER is available to initialize a condition variable that the application has statically allocated. Then, the mutex and the condition variables can be used as follows:

• Each procedure belonging to the monitor must be explicitly bracketed with a mutex lock at the beginning, and a mutex unlock at the end.
• To block on a condition variable, a thread must call the pthread_cond_wait function, giving both the condition variable and the mutex used to protect the procedures of the monitor as arguments. This function atomically unlocks the mutex and blocks the caller on the condition variable; the mutex will be reacquired when the thread is unblocked, and before returning from pthread_cond_wait. To avoid blocking for a (potentially) unbounded time, the pthread_cond_timedwait function allows the caller to specify the maximum amount of time that may be spent waiting for the condition variable to be signaled.
• Inside a procedure belonging to the monitor, the pthread_cond_signal function, taking a condition variable as argument, can be called to unblock at least one of the threads that are blocked on the specified condition variable; the call has no effect if no threads are blocked on the condition variable. The rather relaxed specification of unblocking at least one thread, instead of exactly one, has been adopted by the standard to simplify the implementation of condition variables on multiprocessor systems, and to make it more efficient, mainly because condition variables are often used as the building blocks of higher-level synchronization primitives.
• A variant of pthread_cond_signal, called pthread_cond_broadcast, is available to unblock all threads that are currently waiting on a condition variable. As before, this function has no effect if no threads are waiting on the condition variable.

When no longer needed, condition variables shall be destroyed by means of the pthread_cond_destroy function, to save system resources.
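The following minimal sketch shows the monitor pattern just described: one thread waits on a condition variable for a shared counter to become nonzero, while another thread updates the counter and signals the change; note that the predicate is rechecked in a loop, since the standard allows more than one thread to be woken.

    /* A minimal sketch of the monitor pattern with a mutex and a
       condition variable. */
    #include <pthread.h>
    #include <stdio.h>

    static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  c = PTHREAD_COND_INITIALIZER;
    static int count = 0;

    static void *producer(void *arg)
    {
        (void)arg;
        pthread_mutex_lock(&m);   /* enter the "monitor" */
        count++;
        pthread_cond_signal(&c);  /* wake up at least one waiter */
        pthread_mutex_unlock(&m); /* leave the "monitor" */
        return NULL;
    }

    int main(void)
    {
        pthread_t tid;
        pthread_create(&tid, NULL, producer, NULL);

        pthread_mutex_lock(&m);
        while (count == 0)             /* recheck the predicate on wakeup */
            pthread_cond_wait(&c, &m); /* atomically unlocks m and blocks */
        printf("count = %d\n", count);
        pthread_mutex_unlock(&m);

        pthread_join(tid, NULL);
        return 0;
    }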
11.3.5.5 Shared Memory

Except for message queues, all IPC mechanisms described so far provide only synchronization among threads and processes, not data sharing.
Moreover, while all threads belonging to the same process share the same address space, so that they implicitly and inherently share all their global data, the same is not true for different processes; therefore, the POSIX standard specifies an interface to explicitly set up a shared memory object among multiple processes. The shm_open function either creates or opens a shared memory object and associates it with a file descriptor, which is then returned to the caller. In the system, each shared memory object is uniquely identified by a name, like a file. After creation, the state of a shared memory object, in particular all the data it contains, persists until the shared memory object is unlinked and all active references to it are removed; however, the standard does not specify whether a shared memory object remains valid after a system reboot. Conversely, close removes the association between a file descriptor and the corresponding shared memory object; as a result, the file descriptor is no longer valid after successful return from this function. Last, the shm_unlink function removes a shared memory object, provided no other processes reference it; if this is not the case, the removal is postponed until the reference count drops to zero. It should be noted that the association between a shared memory object and a file descriptor belonging to the calling process, performed by shm_open, does not map the shared memory into the address space of the process. In other words, merely opening a shared memory object does not make the shared data accessible to the process. In order to perform the mapping, the mmap function must be called; since the exact details of the address space structure may be unknown to, and uninteresting for, the programmer, the same function also provides the capability of automatically choosing a suitable portion of the caller's address space to place the mapping. The function munmap removes a mapping.
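As a concrete illustration, the following minimal sketch creates a shared memory object, sizes it with ftruncate, maps it with mmap, and writes to it; the object name and size are arbitrary illustrative choices, and on some systems the real-time library must be linked in.

    /* A minimal sketch of POSIX shared memory usage. */
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <string.h>

    int main(void)
    {
        const size_t len = 4096;

        int fd = shm_open("/demo_shm", O_CREAT | O_RDWR, 0600);
        ftruncate(fd, len);      /* give the object its size */

        /* Let the system choose the placement in the address space. */
        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        close(fd);               /* the mapping survives the close */

        strcpy(p, "hello");      /* any process mapping /demo_shm sees this */

        munmap(p, len);
        shm_unlink("/demo_shm");
        return 0;
    }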
11.3.6 Thread-Specific Data

All threads belonging to the same process implicitly share the same address space, so that they have shared access to all their global data. As a consequence, only the information allocated on the thread's stack, such as function arguments and local variables, is private to each thread. On the other hand, it is often useful in practice to have data structures that are private to a single thread, but can be accessed globally by the code of that thread. The POSIX standard responds to this need by defining the concept of thread-specific data, of which Figure 11.4 depicts the general usage. The pthread_key_create function creates a thread-specific data key visible to, and shared by, all threads in the process. The key values provided by this function are opaque objects used to access thread-specific data. In particular, the pair of functions pthread_getspecific and pthread_setspecific take a key as argument and allow the caller to get and set, respectively, a pointer uniquely bound with the given key and private to the calling thread.
FIGURE 11.4 Thread-specific data in POSIX: through a single key, created by pthread_key_create(), each thread (T1, T2, ...) retrieves its own private data with pthread_getspecific().
The pointer bound to the key by pthread_setspecific persists for the life of the calling thread, unless it is replaced by a subsequent call to pthread_setspecific. An optional destructor function may be associated with each key when the key itself is created. When a thread exits, if a given key has a valid destructor, and the thread has a valid (i.e., not NULL) pointer associated with that key, the pointer is disassociated from the key and set to NULL, and then the destructor is called with the previously associated pointer as argument. When it is no longer needed, a thread-specific data key should be deleted by invoking the pthread_key_delete function on it. It should be noted that, unlike in the previous case, this function does not invoke the destructor function associated with the key; it is therefore the responsibility of the application to perform any cleanup actions for data structures related to the key being deleted.
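The following minimal sketch illustrates thread-specific data: both threads run the same code, but each sees its own private value through the shared key, and the destructor releases each thread's storage automatically on exit.

    /* A minimal sketch of thread-specific data with a destructor. */
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    static pthread_key_t key;

    static void destructor(void *p)
    {
        free(p); /* called once per thread, at thread exit */
    }

    static void *body(void *arg)
    {
        int *mine = malloc(sizeof(int));
        *mine = (int)(long)arg;
        pthread_setspecific(key, mine); /* bind a private pointer */

        /* ...arbitrarily deep in the call chain... */
        int *back = pthread_getspecific(key);
        printf("thread sees %d\n", *back);
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_key_create(&key, destructor);
        pthread_create(&t1, NULL, body, (void *)1L);
        pthread_create(&t2, NULL, body, (void *)2L);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        pthread_key_delete(key);
        return 0;
    }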
11.3.7 Memory Management

The standard allows processes to lock parts or all of their address space in main memory by means of the mlock and mlockall functions; in addition, mlockall also allows the caller to demand that all pages that will become mapped into the address space of the process in the future be implicitly locked as well. The lock operation both forces the memory residence of the virtual memory pages involved, and prevents them from being paged out in the future. This is vital in operating systems that support demand paging and must nevertheless support real-time processing, because the paging activity could introduce undue and highly unpredictable delays when a real-time process attempts to access a page that is currently not in main memory and must therefore be retrieved from secondary storage. When the lock is no longer needed, the process can invoke either the munlock or the munlockall function to release it and enable demand paging again. Other memory management functions, such as the mmap function already described in Section 11.3.5.5, establish a mapping between the address space of the calling process and a memory object, possibly shared between multiple processes. The mapping facility is quite general; it can also be used to map other kinds of objects, such as files and devices, into the address space of a process, provided both the hardware and the operating system have this capability, which is not mandated by the standard. For example, once a file is mapped, a process can access it simply by reading or writing the data at the address range to which the file was mapped. Finally, it is possible for a process to change the access protections of portions of its address space by means of the mprotect function; in this case, it is assumed that protections will be enforced by the hardware. For example, to prevent inadvertent data corruption due to a software bug, one could protect critical data intended for read-only usage against write access.
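As a concrete illustration, the following minimal sketch locks the entire address space of a real-time process, present and future mappings included, for the duration of its time-critical work; appropriate privileges are typically required.

    /* A minimal sketch: ruling out page faults on the time-critical path. */
    #include <sys/mman.h>
    #include <stdio.h>

    int main(void)
    {
        if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
            perror("mlockall");
            return 1;
        }

        /* time-critical work, free of demand-paging delays */

        munlockall(); /* re-enable demand paging */
        return 0;
    }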
11.3.8 Asynchronous and List Directed Input and Output

Many operating systems carry out I/O operations synchronously with respect to the process requesting them. Thus, for example, if a process invokes a file read operation, it stays blocked until the operating system has completed it, either successfully or unsuccessfully. As a side effect, any process can have at most one pending I/O operation at any given time. While this programming model is simple, intuitive, and perfectly adequate for general-purpose systems, it shows its limits in a real-time environment, namely:

• I/O device access timings can vary widely, especially when an error occurs; hence, it is not always wise to suspend the execution of a process until the operation completes, because this would introduce a source of unpredictability into the system.
• It is often desirable to start more than one I/O operation simultaneously, under the control of a single process, for example, to enhance system performance by exploiting I/O hardware parallelism.
To satisfy these requirements, the standard defines a set of functions to start one or more I/O requests, to be carried out in parallel with process execution, and whose completion status can be retrieved asynchronously by the requesting process. Asynchronous and list-directed I/O functions revolve around the concept of the asynchronous I/O control block, struct aiocb; this structure holds all the information needed to describe an I/O operation, and includes members to:

• Specify the operation to be performed, read or write.
• Identify the file on which the operation must be carried out, by means of a file descriptor.
• Determine the portion of the file the operation will operate upon, by means of a file offset and a transfer length.
• Locate a data buffer in memory to be used to store or retrieve the data read from, or to be written to, the file.
• Give a priority classification to the operation.
• Request the asynchronous notification of the completion of the operation, either by a signal or by the asynchronous execution of a function, as described in Section 11.3.4.

Then, the following functions are available:

• The aio_read and aio_write functions take an I/O control block as argument and schedule a read or a write operation, respectively; both return to the caller as soon as the request has been queued for execution.
• As an extension, the lio_listio function schedules a list of (possibly asynchronous) I/O requests, each described by an I/O control block, with a single function call.
• The aio_error and aio_return functions allow the caller to retrieve the error and status information associated with an I/O control block, after the corresponding I/O operation has completed.
• The aio_fsync function asynchronously forces all currently queued I/O operations associated with the file indicated by the I/O control block passed as argument to the synchronized I/O completion state.
• The aio_suspend function can be used to block the calling thread until at least one of the I/O operations associated with a set of I/O control blocks passed as argument completes, or up to a maximum amount of time.
• The aio_cancel function cancels an I/O operation that has not been completed yet.
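The following minimal sketch schedules an asynchronous read with aio_read, is then free to perform other work, and finally collects the completion status; the file name is an arbitrary illustrative choice, and on some systems the real-time library must be linked in.

    /* A minimal sketch of asynchronous I/O with an aiocb control block. */
    #include <aio.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        char buf[128];
        struct aiocb cb;
        const struct aiocb *list[1];

        int fd = open("/etc/hostname", O_RDONLY); /* illustrative file */
        if (fd < 0) { perror("open"); return 1; }

        memset(&cb, 0, sizeof(cb));
        cb.aio_fildes = fd;          /* which file            */
        cb.aio_buf    = buf;         /* where to put the data */
        cb.aio_nbytes = sizeof(buf); /* transfer length       */
        cb.aio_offset = 0;           /* file offset           */

        aio_read(&cb);               /* returns as soon as the request is queued */

        /* ...other useful work could be performed here... */

        list[0] = &cb;
        aio_suspend(list, 1, NULL);  /* block until completion, no timeout */

        ssize_t n = aio_return(&cb); /* final status, like read() */
        printf("read %zd bytes\n", n);

        close(fd);
        return 0;
    }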
11.3.9 Clocks and Timers

Real-time applications very often rely on timing information to operate correctly; the POSIX standard specifies support for one or more timing bases, called clocks, of known resolution, whose value can be retrieved at will. In the system, each clock has its own unique identifier. The clock_gettime and clock_settime functions get and set the value of a clock, respectively, while the clock_getres function returns the resolution of a clock. Clock resolutions are implementation-defined and cannot be set by a process; some operating systems allow the clock resolution to be set at system generation or configuration time. In addition, applications can set one or more per-process timers, using a specified clock as a timing base, by means of the timer_create function. Each timer has a current value and, optionally, a reload value associated with it. The operating system decrements the current value of timers according to their clock and, when a timer expires, it notifies the owning process with an asynchronous notification of timer expiration; as described in Section 11.3.4, the notification can be carried out either by a signal, or by awakening a thread belonging to the process. On timer expiration, the operating system also reloads the timer with its reload value, if one has been set, thus possibly realizing a repetitive timer.
When a timer is no longer needed, it shall be removed by means of the timer_delete function, which both stops the timer and frees all resources allocated to it. Since, due to scheduling or processor load constraints, a process could lose one or more notifications of expiration, the standard also specifies a way for applications to retrieve, by means of the timer_getoverrun function, the number of "missed" notifications, that is, the number of extra timer expirations that occurred between the time at which a given timer expired and the time at which the notification associated with the expiration was eventually delivered to, or accepted by, the process. At any time, it is also possible to store a new value into, or retrieve the current value of, a timer by means of the timer_settime and timer_gettime functions, respectively.
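As a concrete illustration, the following minimal sketch arms a periodic 500 ms timer based on CLOCK_REALTIME, whose expirations are notified through a real-time signal accepted with sigwaitinfo; the period and signal number are arbitrary illustrative choices.

    /* A minimal sketch of a periodic POSIX timer. Error checking is
       omitted for brevity. */
    #include <signal.h>
    #include <time.h>
    #include <stdio.h>

    int main(void)
    {
        sigset_t set;
        struct sigevent sev = { 0 };
        struct itimerspec its = { 0 };
        timer_t tid;
        siginfo_t info;

        sigemptyset(&set);
        sigaddset(&set, SIGRTMIN);
        sigprocmask(SIG_BLOCK, &set, NULL);  /* accept, do not deliver */

        sev.sigev_notify = SIGEV_SIGNAL;     /* notify by real-time signal */
        sev.sigev_signo  = SIGRTMIN;
        timer_create(CLOCK_REALTIME, &sev, &tid);

        its.it_value.tv_nsec    = 500000000; /* first expiration: 500 ms */
        its.it_interval.tv_nsec = 500000000; /* reload value: periodic   */
        timer_settime(tid, 0, &its, NULL);

        for (int i = 0; i < 3; i++) {
            sigwaitinfo(&set, &info);
            printf("tick %d (overruns: %d)\n", i, timer_getoverrun(tid));
        }

        timer_delete(tid); /* stop the timer and free its resources */
        return 0;
    }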
11.3.10 Cancellation

Any thread may request the cancellation of another thread in the same process by means of the pthread_cancel function. Then, the target thread's cancelability state and type determine whether and when the cancellation takes effect. When the cancellation takes effect, the target thread is terminated. Each thread can atomically get and set its own way of reacting to a cancellation request by means of the pthread_setcancelstate and pthread_setcanceltype functions. In particular, three different settings are possible:

• The thread can ignore cancellation requests completely.
• The thread can accept cancellation requests immediately.
• The thread can accept cancellation requests only when its execution flow crosses a cancellation point.

A cancellation point can be explicitly placed in the code by calling the pthread_testcancel function; moreover, it should be remembered that many functions specified by the POSIX standard act as implicit cancellation points. The choice of the most appropriate response to cancellation requests depends on the application, and is a trade-off between the desirable feature of really being able to cancel a thread, and the necessity of avoiding the cancellation of a thread while it is executing in a critical section of code, both to keep the guarded data structures consistent and to ensure that any IPC object associated with the critical section, for example, a mutex, is released appropriately; otherwise, the critical region would stay locked forever, likely inducing a deadlock in the system. As an aid, the POSIX standard also specifies a mechanism that allows any thread to register a set of cleanup handlers on a stack, to be executed in LIFO order when the thread either exits voluntarily or accepts a cancellation request. The pthread_cleanup_push and pthread_cleanup_pop functions push and pop a cleanup handler into and from the handler stack; the latter function can also execute the handler it is about to remove.
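The following minimal sketch shows cleanup handlers at work: a worker thread acquires a mutex and registers a handler so that, when it is canceled while blocked at a cancellation point, the mutex is still released.

    /* A minimal sketch of cancellation with a cleanup handler. */
    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

    static void cleanup(void *arg)
    {
        pthread_mutex_unlock((pthread_mutex_t *)arg);
        puts("cleanup: mutex released");
    }

    static void *worker(void *arg)
    {
        (void)arg;
        pthread_mutex_lock(&m);
        pthread_cleanup_push(cleanup, &m); /* runs on cancellation or pop(1) */

        sleep(10);                         /* sleep() is a cancellation point */

        pthread_cleanup_pop(1);            /* not reached if canceled; 1 = run handler */
        return NULL;
    }

    int main(void)
    {
        pthread_t tid;
        pthread_create(&tid, NULL, worker, NULL);
        sleep(1);
        pthread_cancel(tid); /* takes effect at the next cancellation point */
        pthread_join(tid, NULL);
        return 0;
    }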
11.4 Real-Time, Open-Source Operating Systems

Although in the general-purpose operating system camp a handful of products dominates the market, there are more than 30 real-time operating systems available for use today, both commercial and experimental, and new ones are still being developed. This is due both to the large amount of research in this area and to the fact that real-time embedded applications are inherently less homogeneous than general-purpose applications, such as those found in office automation; hence, ad hoc operating system features are often needed. In addition, the computing power and the overall hardware architecture of embedded systems are much more varied than, for example, those of personal computers. This section focuses on open-source real-time operating systems, a recent source of novelty in embedded system software design. Open-source operating systems are especially promising, for two main reasons:

1. The source code of an open-source real-time operating system can be used both to develop real-world applications and for study and experimentation.
Therefore, open-source operating systems often implement the most advanced, state-of-the-art architectures and algorithms, because researchers can experiment with them at will and their work can immediately be reflected in real applications.
2. Open-source operating systems have no purchasing cost and are inherently royalty free, so their adoption can cut down the costs of an application.

Moreover, one of the most well-known issues of open-source operating systems, namely the lack of official technical support, is now being addressed, with more and more consulting firms specializing in their support. Among the open-source operating systems that have found their way into commercial products we recall:

eCos. The development of eCos [9] is coordinated by Red Hat Inc., and is based on a modular, layered real-time kernel. The most important innovation in eCos is its extensive configuration system, which operates at kernel build time through a large number of configuration points placed at the source code level, and allows a very fine-grained adaptation of the kernel to both application needs and hardware characteristics. The output of the configuration process is an operating system library that can then be linked with the application code. Its application programming interface is compatible with the POSIX standard, but it does not support multiple processes with independent address spaces, even when a Memory Management Unit (MMU) is available.

µClinux. This is a stripped-down version of the well-known Linux operating system. The most interesting features of µClinux [10] are its ability to run on microcontrollers that lack an MMU and its small size, compared with a standard Linux kernel. As is, µClinux does not have any real-time capability, because it inherits its standard processor scheduler from Linux; however, both the RT-Linux [11] and RTAI [12] real-time extensions are available for it.

RT-Linux and RTAI. These are hard real-time-capable extensions to the Linux operating system; they are similar, and their architecture was first outlined in 1997 [11–13]. The main design feature of both RT-Linux and RTAI is the clear separation between the real-time and nonreal-time domains: a small monolithic real-time kernel runs real-time tasks, and coexists with the Linux kernel. As a consequence, nonreal-time tasks running on the Linux kernel have the sophisticated services of a standard time-sharing operating system at their disposal, whereas real-time tasks operate in a protected, predictable, and low-latency environment. The real-time kernel performs first-level real-time scheduling and interrupt handling, and runs the Linux kernel as its lowest-priority task. In order to keep changes in the Linux kernel to an absolute minimum, the real-time kernel provides an emulation of the interrupt control hardware. In particular, any interrupt disable/enable request issued by the Linux kernel is not passed to the hardware, but is emulated in the real-time kernel instead; thus, for example, when Linux disables interrupts, the hardware interrupts actually stay enabled, and the real-time kernel queues and delays the delivery of any interrupt of interest to the Linux kernel. Real-time interrupts are not affected at all, and are handled as usual, without any performance penalty. To handle communication between real-time and nonreal-time tasks, RT-Linux and RTAI implement lock-free queues and shared memory.
In this way, real-time applications can rely on Linux system services for nontime-critical operations, such as filesystem access and graphical user interfaces.

RTEMS. The development of RTEMS [14] is coordinated by On-Line Applications Research Corporation, which also offers paid technical support. Its application development environment is based on open-source GNU tools, and it has a monolithic/layered architecture, commonly found in high-performance real-time executives. RTEMS complies with the POSIX 1003.1b application programming interface and supports multiple threads of execution, but does not implement a multiprocess environment with independent application address spaces. It also supports networking and a filesystem.
11.5 Virtual-Machine Operating Systems

According to its most general definition, a virtual machine is an abstract system, composed of one or more virtual processors and zero or more virtual devices. The implementation of a virtual machine can be carried out in a number of different ways, such as interpretation, (partial) just-in-time compilation, and hardware-assisted instruction-level emulation. Moreover, these techniques can be, and usually are, combined to obtain the best compromise between complexity of implementation and performance for a given class of applications. In this section, we focus on perfect virtualization, that is, the implementation of virtual machines whose processors and I/O devices are identical in all respects to their counterparts in a physical machine, namely the machine on which the virtualization software runs. The implementation of virtual machines is carried out by means of a peculiar kind of operating system kernel; hardware assistance keeps overheads to a minimum. As described in Section 11.2.1, the internal architecture of an operating system based on virtual machines revolves around the basic observation that an operating system must perform two essential functions:

• Multiprogramming
• System services

Accordingly, these operating systems fully separate the two functions and implement them as two distinct operating system components:

• A virtual machine monitor
• A guest operating system
11.5.1 Related Work
A particularly interesting application of the layered approach to operating system design, the virtual machine, was first introduced by Meyer and Seawright in the experimental CP-40 and, later, the CP-67 system [15] on an IBM 360/67. This early system was peculiar because it provided each user with a virtual IBM 360/65 (not 67), including I/O devices. Thus, processor and I/O virtualization was perfect, but MMU virtualization was not attempted at all, and the virtual machines therefore lacked an MMU. Later on, MMU virtualization was added by A. Auroux, and CP-67 evolved into a true IBM product, VM/370. Offspring of VM/370 are still in use today on IBM mainframes; for example, z/VM [16] runs on IBM zSeries mainframes and supports the execution of multiple z/OS and Linux operating system images, each in its own virtual machine. An extensive, early discussion of virtual machines and their properties can be found in References 17 and 18. More recently, microcode support for virtual machines was added to the 680x0 microprocessor family in the transition between the 68000 and the 68010 [19]; in particular, the privilege level required to execute some instructions was raised to make processor virtualization feasible. Commercially available virtualization products now include, for example, the VMware virtualization software for the Intel Pentium architecture [20]; in Reference 21 this product was used to build a prototype implementation of a virtual machine-based platform for trusted computing. More advanced attempts at virtualization in various forms can be found in References 22 and 23. In particular, Reference 22 discusses in detail the trade-off between “perfect” virtualization and efficiency.
11.5.2 Views of Processor State
The execution of each machine instruction both depends on, and affects, the internal processor state. In addition, the semantics of an instruction depend on the execution mode the processor was in
when the instruction itself was executed, since the processor mode directly determines the privilege level of the processor itself. For example:
• On the ARM V5 [24] processor, the execution of the ADD R13, R13, #1 instruction increments by one the contents of register R13.
• The outcome of the BEQ label instruction depends on the current value of the Z processor state flag, and conditionally updates the contents of the program counter.
The view that machine code has of the internal processor state depends on the mode the processor is running in. In particular, let us define two, somewhat simplified, views of the processor state:
User-mode view. It is the portion of the processor state that is accessible through machine instructions, with either read-only or read-write access rights, when the processor is in user mode. In other words, it is the portion of the processor state that can be accessed by unprivileged machine instructions.
Privileged-mode view. It is the portion of the processor state that is accessible through machine instructions, with either read-only or read-write access rights, when the processor is in privileged mode. It usually is a superset of the user-mode state and, if the processor supports a single privileged mode, coincides with full access to the processor state as a whole.
When the processor supports either privilege rings or multiple, independent privileged modes, the definition of privileged-mode view becomes more complicated, and involves either:
• A nested set of views, when the processor supports privilege rings. In this case, the inner view, corresponding to the most privileged processor mode, encompasses the machine state as a whole with the most powerful access rights; outer, less privileged modes have more restricted views of the processor state. Above, nested means that the outer view either has no visibility of a processor state item that is visible from the inner view, or has less powerful access rights than the inner view on one or more processor state items.
• A collection of independent views, when the processor supports multiple, independent privileged modes. In this case, it should be noted that the intersection among views can be, and usually is, not empty; for example, in the ARM V5 processor, user registers R0 through R7 are accessible from, and common to, all unprivileged and privileged processor modes. In addition, registers R8 through R12 are common to all but one processor mode.
It should also be noted that only the union of all views gives full access to the processor state: in general, no individual view can do the same, not even the view corresponding to the “most privileged” privileged mode, even if the processor specification contains such a hierarchical classification of privileged modes. Continuing the first example above, if the processor implements multiple, mode-dependent instances of register R13, the execution of the ADD instruction presented above in user mode will update the user-mode view of R13, but will not affect the view of the same register in any other mode. As is customary, we define a process as the activity of executing a program; a process therefore encompasses both the program text and a view of the current processor state when the execution takes place. The latter includes the notion of execution progress, which could be captured, for example, by a program counter.
11.5.3 Operating Principle
The most important concept behind processor virtualization is that a low-level system software component, the Virtual Machine Monitor (or VMM for short), also historically known as the Control Program (or CP), performs the following functions, possibly with hardware assistance:
• It gives to a set of machine code programs, running either in user or privileged mode, their own, independent view of the processor state; this gives rise to a set of independent sequential processes.
• Each view is “correct” from the point of view of the corresponding program, in the sense that the view is indistinguishable from the view the program would have when run on a bare machine,
without a VMM in between. This requirement supports the illusion that each process runs on its own processor, and that processor is identical to the physical processor below the VMM.
• It is able to switch the processor among the processes mentioned above; in this way, the VMM implements multiprogramming. Switching the processor involves both a switch among possibly different program texts, and a switch among distinct processor state views.
The key difference of the VMM approach with respect to traditional multiprogramming is that a traditional operating system confines user-written programs to user mode, allowing them to access privileged services only by means of a system call. Thus, each service request made by a user program traps to the operating system and switches the processor to privileged mode; the operating system, running in privileged mode, performs the service on behalf of the user program and then returns control to it, simultaneously bringing the processor back to user mode. A VMM system, instead, supports virtual processors that can run either in user or in privileged mode, just like the real, physical processors they mimic; the real processor must necessarily execute the processes inside the virtual machine (also called the virtual machine guest code) in user mode, to keep control of the system as a whole. We must therefore distinguish between:
• The real, physical processor mode, that is, the processor mode the physical processor is running in. At each instant, each physical processor in the system is characterized by its current processor mode.
• The virtual processor mode, that is, the processor mode each virtual processor is running in. At each instant, each virtual machine is characterized by its current virtual processor mode, which does not necessarily coincide with the physical processor mode, even when the virtual machine is being executed.
11.5.4 Virtualization by Instruction Emulation
The classic approach to processor virtualization is based on privileged instruction emulation; with this approach, the VMM maintains a virtual machine control block (or VMCB for short) for each virtual machine. Among other things, the VMCB holds the full processor state, both unprivileged and privileged, of a virtual processor; therefore, it contains state information belonging to, and accessible from, distinct views of the processor state, with different levels of privilege. When a virtual machine is being executed by a physical processor, the VMM transfers part of the VMCB into the physical processor state; when the VMM assigns the physical processor to another virtual machine, the physical processor state is transferred back into the VMCB. It is important to notice that virtual machines are always executed with the physical processor in user mode, regardless of the virtual processor mode. Most virtual machine instructions are executed directly by the physical processor, with zero overhead; however, some instructions must be emulated by the VMM and thus incur a trap handling overhead. In particular:
1. Unprivileged instructions act on, and depend on, the current view of the processor state only, and are executed directly by the physical processor. Two subcases are possible, depending on the current virtual processor mode:
(a) Both the virtual and the physical processor are in user mode. In this case, virtual and physical instruction execution and their corresponding processor state views fully coincide, and no further manipulation of the processor state is necessary.
(b) The virtual processor is running in a privileged mode and the physical processor is in user mode. So, instruction execution acts on the user-mode view of the processor state, whereas the intended effect is to act on one of the privileged views. To compensate for this, the VMM must update the contents of the user-mode view in the physical processor from the appropriate
portion of the VMCB whenever the virtual processor changes state. Even in this case, the overhead incurred during actual instruction execution is zero.
2. Privileged instructions act on one of the privileged views of the processor state. So, when the execution of a privileged instruction is attempted in physical user mode, the physical processor takes a trap to the VMM. In turn, the VMM must emulate either the trap or the trapped instruction, depending on the current virtual privilege level, and reflect the outcome of the emulation in the virtual processor state stored in the VMCB:
(a) If the virtual processor was in user mode when the trap occurred, the VMM must emulate the trap. Actual trap handling will be performed by the privileged software inside the virtual machine, in virtual privileged mode, because the emulation of the trap, among other things, switches the virtual processor into privileged mode. The virtual machine privileged software actually receives the emulated trap.
(b) If the virtual processor was in privileged mode, and the trap was triggered by lack of the required physical processor privilege level, the VMM must emulate the privileged instruction; in this case, the VMM itself performs trap handling and the privileged software inside the virtual machine does not receive the trap at all. Instead, it sees the outcome of the emulated execution of the privileged instruction.
(c) If the virtual processor was in privileged mode, and the trap was triggered for other reasons, the VMM must emulate the trap; the actual trap handling, if any, will be performed by the privileged software inside the virtual machine, in virtual privileged mode. It should be noted, in fact, that in most simple processor architectures and operating systems, a trap occurring in a privileged mode is usually considered to be a fatal condition, and triggers the immediate shutdown of the operating system itself.
In the first and third cases above, the behavior of the virtual processor exactly matches the behavior of a physical processor in the same situation, except that the trap entry mechanism is emulated in software instead of being performed either in hardware or in microcode. In the second case, the overall behavior of the virtual processor still matches the behavior of a physical processor in the same situation, but the trap is kept invisible to the virtual machine guest software because, in this case, the trap is instrumental for the VMM to properly catch and emulate the privileged instruction.
3. A third class of instructions includes unprivileged instructions whose outcome depends on a physical processor state item belonging to privileged processor state views only.
This third and last class of instructions is anomalous and problematic in nature from the point of view of processor virtualization, because these instructions allow a program to infer something about a processor state item that would not be accessible from its current view of the processor state. The presence of instructions of this kind hampers the privileged instruction emulation approach to processor virtualization just discussed, because that approach is based on the separation between physical processor state and virtual processor state, and enforces this separation by trapping (and then emulating in the virtual processor context) all instructions that try to access privileged processor state views.
Instructions of this kind are able to bypass this mechanism as a whole, because they generate no trap, so the VMM is unable to detect and emulate them; instead, they take information directly from the physical processor state.
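To make the case analysis above concrete, the following is a minimal sketch of a VMCB and of the corresponding trap dispatch logic, under the simplifying assumption of a processor with exactly two modes (user and privileged). All type, field, and function names (struct vmcb, vmm_trap, emulate_privileged, guest_trap_vector, resume_guest) are hypothetical and not taken from any specific VMM.

```c
/* Sketch of a virtual machine control block (VMCB) and of the trap
 * dispatch of Section 11.5.4, assuming a two-mode processor. */
#include <stdint.h>
#include <stdbool.h>

enum cpu_mode { MODE_USER, MODE_PRIV };

#define NREGS     16
#define R_TRAP_PC 14             /* where trap entry saves the old PC */

struct vmcb {
    uint32_t user_regs[NREGS];   /* user-mode view of the registers   */
    uint32_t priv_regs[NREGS];   /* privileged-mode view              */
    uint32_t pc, psw;            /* program counter, status word      */
    enum cpu_mode vmode;         /* current *virtual* processor mode  */
};

void emulate_privileged(struct vmcb *vm);           /* Section 11.5.6 */
uint32_t guest_trap_vector(const struct vmcb *vm);  /* Section 11.5.9 */
void resume_guest(struct vmcb *vm);  /* reload state, leave the trap  */

/* Entered on every trap taken while guest code runs in physical user
 * mode; priv_trap tells whether the cause was a privileged opcode. */
void vmm_trap(struct vmcb *vm, bool priv_trap)
{
    if (priv_trap && vm->vmode == MODE_PRIV) {
        /* Case 2(b): the guest legitimately executed a privileged
         * instruction; emulate it against the VMCB, invisibly. */
        emulate_privileged(vm);
    } else {
        /* Cases 2(a) and 2(c): reflect the trap into the virtual
         * machine by emulating the hardware trap-entry sequence. */
        vm->priv_regs[R_TRAP_PC] = vm->pc;  /* save return address    */
        vm->vmode = MODE_PRIV;              /* enter virtual priv mode */
        vm->pc = guest_trap_vector(vm);     /* jump to guest handler  */
    }
    resume_guest(vm);
}
```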
11.5.5 Processor-Mode Change
A change in the mode of a virtual processor may occur for several reasons, of which the main ones are:
• When the execution of an instruction triggers a trap; in this case, trap handling is synchronous with respect to the code being executed.
• When an interrupt request for the virtual machine comes in and is accepted; the interrupt request is asynchronous with respect to the code being executed, but the hardware implicitly synchronizes interrupt handling with instruction boundaries.
In both cases, the VMM takes control of the physical processor and then implements the mode change by manipulating the virtual machine’s VMCB and, if the virtual machine was being executed when the mode change was requested, a portion of the physical processor state. In particular, to implement the processor-mode change the VMM must perform the following actions, sketched in code after the list:
• Save the portion of the physical processor state pertaining to the processor state view of the old processor mode into the VMCB; for example, if the processor was running in user mode, the user-mode registers currently loaded in the physical processor registers must be saved.
• Update the VMCB to reflect the effects of the mode change on the virtual processor state; for example, a system call instruction will likely modify the program counter of the virtual processor and load a new processor status word.
• Load the physical processor state pertaining to the state view of the new processor mode from the VMCB; for example, if the processor is being switched to a privileged mode, the privileged-mode registers must be transferred from the VMCB into the user-mode registers of the physical processor. Notice that the mode of the registers saved in the VMCB and the accessibility mode of the physical processor registers into which they are loaded do not coincide.
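Continuing the VMCB sketch of Section 11.5.4, the three actions listed above could look as follows; hw_regs stands for the register file of the physical processor, and all names remain hypothetical.

```c
/* Sketch of a virtual processor-mode change, on the VMCB of the
 * earlier example. */
#include <stdint.h>

void vmm_mode_change(struct vmcb *vm, uint32_t hw_regs[NREGS],
                     uint32_t new_pc, enum cpu_mode new_mode)
{
    uint32_t *old_view = (vm->vmode == MODE_PRIV) ? vm->priv_regs
                                                  : vm->user_regs;
    uint32_t *new_view = (new_mode == MODE_PRIV)  ? vm->priv_regs
                                                  : vm->user_regs;

    /* 1. Save the physical registers into the old view in the VMCB. */
    for (int i = 0; i < NREGS; i++)
        old_view[i] = hw_regs[i];

    /* 2. Reflect the effects of the mode change in the VMCB; for a
     *    system call, for example, this loads a new PC and PSW.      */
    vm->vmode = new_mode;
    vm->pc    = new_pc;

    /* 3. Load the new view into the physical registers. The physical
     *    processor stays in user mode, so privileged-mode guest
     *    registers end up in the physical user-mode register file.   */
    for (int i = 0; i < NREGS; i++)
        hw_regs[i] = new_view[i];
}
```

Note, as the text observes, that the two modes do not coincide: when the virtual processor enters privileged mode, its privileged registers are loaded into what is, physically, the user-mode register file.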
11.5.6 Privileged Instruction Emulation
To perform its duties, the VMM must be able to receive and handle traps on behalf of a virtual machine, and these traps can be triggered for a variety of reasons. When using the privileged instruction emulation approach to processor virtualization, the most common trap reason is the request to emulate a privileged instruction. The VMM must perform privileged instruction emulation when a virtual processor attempts to execute a legal privileged instruction while in virtual privileged mode. In this case, the physical processor (running in physical user mode) takes a “privileged instruction” trap that would not have been taken if the processor were in privileged mode, as the virtual machine software expects it to be. The main steps of the instruction emulation sequence, sketched in code after the list, are:
• Save into the VMCB all registers in the view corresponding to the current virtual processor mode. This both “freezes” the virtual machine state for subsequent instruction emulation, and frees the physical processor state for VMM use.
• Locate and decode the instruction to be emulated in the virtual processor instruction stream. This operation may involve multiple steps because, for example, on superscalar or deeply pipelined architectures, the exact value of the program counter at the time of the trap may not be easy to compute.
• Switch the physical processor into the appropriate privileged mode for instruction emulation, that is, to the processor mode of the virtual processor. The trap handling mechanism of the physical processor always switches the processor into a privileged mode, but if the processor supports multiple privileged modes, the mode entered on the trap might not coincide with the actual privileged mode of the virtual processor.
• Emulate the instruction using the VMCB as the reference machine state for the emulation, and reflect its outcome into the VMCB itself. Notice that the execution of a privileged instruction may update both the privileged and the unprivileged portion of the virtual processor state, so the VMCB as a whole is involved. Also, the execution of a privileged instruction may change the processor mode of the virtual processor.
• Update the virtual program counter in the VMCB to the next instruction in the instruction stream of the virtual processor.
• Restore the virtual processor state from the updated VMCB and return from the trap.
In the last step above, the virtual processor state can in principle be restored either:
• From the VMCB of the virtual machine that generated the trap in the first place, if the processor scheduler of the VMM is not invoked after instruction emulation; this is the case just described.
• From the VMCB of another virtual machine, if the processor scheduler of the VMM is invoked after instruction emulation.
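A possible body for the emulate_privileged() hook of the earlier sketch, following the sequence just listed; fetching and decoding are abstracted behind hypothetical helpers (guest_fetch, opcode_of, operand_of), and a single illustrative opcode is shown.

```c
#include <stdint.h>

uint32_t guest_fetch(const struct vmcb *vm, uint32_t addr);
unsigned opcode_of(uint32_t insn);
uint32_t operand_of(const struct vmcb *vm, uint32_t insn);

enum { OP_WRITE_PSW = 1 /* , ... other privileged opcodes ... */ };

void emulate_privileged(struct vmcb *vm)
{
    /* The registers of the current view were already saved into the
     * VMCB on trap entry, freezing the virtual machine state. */
    uint32_t insn = guest_fetch(vm, vm->pc);  /* locate and decode    */

    /* Emulate against the VMCB as the reference machine state; this
     * may touch both privileged and unprivileged state, and may even
     * change the virtual processor mode. */
    switch (opcode_of(insn)) {
    case OP_WRITE_PSW:
        vm->psw = operand_of(vm, insn);
        break;
    default:
        /* ... one case per privileged opcode ... */
        break;
    }

    vm->pc += 4;   /* advance the virtual PC past the instruction     */
    /* The caller then restores a virtual processor state, either from
     * this VMCB or, if the VMM scheduler ran, from another one. */
}
```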
11.5.7 Exception Handling
When any synchronous exception other than a privileged instruction trap occurs in either virtual user or virtual privileged mode, the VMM, and not the guest operating system of the virtual machine, receives the trap in the first place. When the trap is not instrumental to the implementation of virtual machines, as happens in most cases, the VMM must simply emulate the trap mechanism itself inside the virtual machine, and appropriately update the VMCB to reflect the trap back to the privileged virtual machine code. Another situation in which the VMM must simply propagate the trap is the occurrence of a privileged instruction trap when the virtual processor is in virtual user mode. This occurrence usually, but not always, indicates a bug in the guest software: an easy counterexample arises when a VMM is running inside the virtual machine. A special case of exception is the one generated by the system call instruction, whenever it is implemented. However, from the point of view of the VMM, this kind of exception is handled exactly like all others; only the interpretation given to the exception by the guest code running in the virtual machine is different. It should also be noted that asynchronous exceptions, such as interrupt requests, must be handled in a different and more complex way, as described in the following section.
11.5.8 Interrupt Handling
We distinguish among three kinds of interrupt; each of them requires a different handling strategy by the VMM:
• Interrupts triggered by, and destined to, the VMM itself, for example, the VMM processor scheduler timeslice interrupt and the VMM console interrupt; in this case, no virtual machine ever notices the interrupt.
• Interrupts destined to a single virtual machine, for example, a disk interrupt for a physical disk permanently assigned to a virtual machine.
• Interrupts synthesized by the VMM, either by itself or as a consequence of another interrupt, and destined to a virtual machine, for example, a disk interrupt for a virtual disk emulated by the VMM, or a network interrupt for a virtual communication channel between virtual machines.
In all cases, the general approach to interrupt handling is the same, and the delivery of an interrupt request to a virtual machine implies at least the following steps:
• If the processor was executing in a virtual machine, save the status of the current virtual machine into the corresponding VMCB; then, switch the processor onto the VMM context and stack, and select the most privileged processor mode. Otherwise, the processor was already executing the VMM; the processor is already in the VMM context and stack, and runs at the right privilege level. In both cases, after this phase, the current virtual machine context has been secured in its VMCB and the physical processor can freely be used by VMM code; this is also a good boundary for the transition between the portion of the VMM written in assembly code and the bulk of the VMM, written in a higher-level programming language.
• Determine the type of interrupt request and the virtual machine to which it must be dispatched; then, emulate in the corresponding VMCB the interrupt processing normally performed by the physical processor. An additional complication arises if the target virtual machine is the current virtual machine and the VMM was in active execution, that is, it was emulating an instruction on behalf of the virtual machine itself when the request arrived. In this case, the simplest approach, which also adheres most closely to the behavior of the physical processor, is to defer interrupt emulation to the end of the current emulation sequence, as sketched below. To implement the deferred handling mechanism efficiently, some features of the physical processor, such as Asynchronous System Traps (ASTs) and deferrable software interrupts, may be useful; unfortunately, they are now uncommon on RISC (Reduced Instruction Set Computer) machines.
• Return either to the VMM or to the virtual machine code that was being executed when the interrupt request arrived. Notice that, at this point, no actual interrupt handling has taken place yet, and that some devices may require some limited intervention before returning from their interrupt handler, for example, to release their interrupt request line. In this case, it may be necessary to incorporate this low-level interrupt handling directly in the VMM, and at the same time ensure that it is idempotent when repeated by the virtual machine interrupt handler.
The opportunity and the relative advantages and disadvantages of invoking the VMM processor scheduler after each interrupt request will be discussed in Section 11.5.10.
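A sketch of the dispatch step follows, on the same simplified VMCB as before, extended with two hypothetical fields: pending_irqs (a bit mask of requests not yet emulated) and in_emulation (set while an emulation sequence for this virtual machine is in progress). The helper guest_irq_vector() is also hypothetical.

```c
#include <stdint.h>

uint32_t guest_irq_vector(const struct vmcb *vm, unsigned irq);

void vmm_deliver_irq(struct vmcb *vm, unsigned irq)
{
    vm->pending_irqs |= 1u << irq;       /* record the request        */

    if (vm->in_emulation)
        return;  /* Defer: delivery completes when the current
                  * emulation sequence ends, much as the hardware
                  * defers an interrupt to an instruction boundary.   */

    /* Emulate the interrupt entry normally performed by hardware.    */
    vm->priv_regs[R_TRAP_PC] = vm->pc;
    vm->vmode = MODE_PRIV;
    vm->pc = guest_irq_vector(vm, irq);
    vm->pending_irqs &= ~(1u << irq);
}
```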
11.5.9 Trap Redirection
A problem common to privileged instruction emulation, exception handling, and interrupt handling is that the VMM must be able to intercept any exception the processor takes while executing on behalf of a virtual machine and direct it toward its own handler. Most modern processors use a unified trap vector or dispatch table for all kinds of trap, exception, and interrupt. Each trap type has its own code, which is used as an index in the trap table to fetch the address in memory at which the corresponding trap handler starts. A slightly different approach is to execute the instruction inside the table directly (in turn, the instruction will usually be an unconditional jump instruction), but the net effect is the same. A privileged register, usually called the trap table base register, gives the starting address of the table. In either case, all vectors actually used by the physical processor when handling a trap reside in the privileged address space, and are accessed after the physical processor has been switched into an appropriate privileged mode. The VMM must have full control over these vectors, because it relies on them to intercept traps at runtime. On the other hand, the virtual machine guest code should be able to set its own trap table, with any vectors it desires; the latter table resides in the virtually privileged address space of the virtual machine, and must be accessible to the virtual machine guest code in read and write mode. The content of this table is not used by the physical processor, but by the VMM, to compute the target address to which to redirect traps via emulation. The simplest approach to accommodate these conflicting needs, when it is not possible to map the same virtual address to multiple, distinct physical addresses depending on the processor mode without software intervention (which is quite a common restriction on simple MMUs), is to reserve, in the addressing space of each virtual machine, a phantom page that is not currently in use by the guest code; grant read-only access to it only when the processor is in physical privileged mode, to keep it invisible to the guest code; store the actual trap table there; and direct the processor to use it by setting its trap table base register appropriately. The initialization of the actual trap table is performed by the VMM for each virtual machine during the initial instantiation of the virtual machine itself. Since the initialization is performed by the VMM (and not by the virtual machine), the read-only access restriction described above does not apply.
The VMM must then intercept any access made by the virtual machine guest code to the trap table base register, in order to properly locate the virtual trap table and be able to compute the target address to which to redirect traps; this computation is sketched at the end of this section. In other words, the availability of a trap table base register allows the guest code to set up its own, virtual trap table, and the VMM to set up the trap table obeyed by the physical processor, without resorting to virtual/physical address mapping functions that depend on the processor mode. Two complementary approaches can be followed to determine the exact location of the phantom page in the address space of the virtual machine. In principle, the two approaches are equivalent; the difference between them lies in the compromise between ease of implementation, runtime overheads, and flexibility:
• If the characteristics of the guest code, in particular of the guest operating systems, are well known, the location of the phantom page can be fixed in advance. This is the simpler choice, and also has the least runtime overhead.
• If one does not want to make any assumption at all about the guest code, the location of the phantom page must be computed dynamically from the current contents of the page table set up by the guest code, and may change with time (e.g., when the guest code decides to map the location in which the phantom page currently resides for its own use). When the location of the phantom page changes, the VMM must update the trap table base register accordingly; moreover, it must ensure that the first-level trap handlers contain only position-independent code.
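The redirection computation itself can be very simple, as the following sketch shows; guest_trap_base is assumed to be an additional, hypothetical VMCB field capturing the value the guest wrote to its (virtual) trap table base register, guest_read32() is a hypothetical helper that reads guest memory, and four-byte table entries holding handler addresses are assumed.

```c
#include <stdint.h>

uint32_t guest_read32(const struct vmcb *vm, uint32_t vaddr);

/* Compute the guest handler address to which a trap of the given
 * type must be reflected. The physical trap table, in the phantom
 * page, is controlled by the VMM; the guest's own table is used only
 * for this computation. */
uint32_t redirect_target(const struct vmcb *vm, unsigned trap_code)
{
    uint32_t entry_addr = vm->guest_trap_base + 4u * trap_code;
    return guest_read32(vm, entry_addr);
}
```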
11.5.10 VMM Processor Scheduler
The main purpose of the VMM, with respect to processor virtualization, is to emulate the privileged instructions issued by virtual processors in virtual privileged mode. Code fragments implementing the emulation are usually short and often require atomic execution, above all with respect to interrupt requests directed to the same virtual machine. The processor scheduling activity carried out by the VMM by means of the VMM processor scheduler is tied quite naturally to the emulation of privileged instructions, because the handling of a privileged instruction exception is a convenient scheduling point. On the other hand, the main role of the VMM in interrupt handling is to redirect each interrupt to the virtual machine(s) interested in it; this action, too, must be completed atomically with respect to instruction execution on the same virtual machine, as the physical processor itself would do. This suggests disabling the rescheduling of the physical processor if the processor was executing the VMM when the interrupt arrived, and delaying the rescheduling until the end of the current VMM emulation path. The advantage of this approach is twofold:
• The state of the VMM never needs to be saved when switching between different virtual machines.
• The VMM need not be reentrant with respect to itself, because a VMM-to-VMM context switch cannot occur.
By contrast, the processor allocation latency gets worse, because the maximum latency, not taking higher-priority activities into account, becomes the sum of:
• The longest emulation path in the VMM.
• The longest sequence of instructions in the VMM to be executed with interrupts disabled, due to synchronization constraints.
• The scheduling time.
• The virtual machine context save and restore time.
In a naive implementation, making the VMM not preemptable seems promising, because it is conceptually simple and imposes only a negligible performance penalty in the average case, if it is assumed that the occurrence of an interrupt that needs an immediate rescheduling while VMM execution is in progress is rare.
Also, some of the contributions to the processor allocation latency described earlier, mainly the length of instruction emulation paths in the VMM and the statistical distribution of the different instruction classes to be emulated in the instruction stream, will be better known only after some experimentation, because they also depend on the behavior of the guest operating systems and their applications. It must also be taken into account that making the VMM preemptable will likely introduce additional performance penalties in the average case:
• The switch between virtual machines will become more complex, because their corresponding VMM state must be switched as well.
• The preemption of some VMM operations, for example, the propagation of an interrupt request, may be ill-defined if it occurs while a privileged instruction of the same virtual machine is being emulated.
So, at least for soft real-time applications, preemption of individual instruction emulations in the VMM should be implemented only if it is strictly necessary to satisfy the latency requirements of the application, and only after extensive experimentation. From the point of view of the scheduling algorithms, at least in a naive implementation, a fixed-priority scheduler with a global priority level assigned to each virtual machine is deemed to be the best choice (a minimal sketch is given after the following list), because:
• It easily accommodates the common case in which nonreal-time tasks are confined under the control of a general-purpose operating system in a virtual machine, and real-time tasks either run under the control of a real-time operating system in another virtual machine, or each have a private virtual machine.
• The sensible selection of a more sophisticated scheduling algorithm can be accomplished only after extensive experimentation with the actual set of applications to be run, and when a detailed model of the real-time behavior of the application itself and of the devices it depends on is available.
• The choice of algorithms to be used when multiple, hierarchical schedulers are in effect in a real-time environment has not yet received extensive attention in the literature.
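A fixed-priority VMM scheduler of the kind just suggested can indeed be extremely simple; in the sketch below, each VMCB is assumed to carry two additional, hypothetical fields: a global priority and a runnable flag.

```c
#include <stddef.h>

/* Pick the highest-priority runnable virtual machine; a NULL result
 * means that no virtual machine is runnable and the VMM can idle. */
struct vmcb *vmm_pick_next(struct vmcb *vms[], int nvms)
{
    struct vmcb *best = NULL;
    for (int i = 0; i < nvms; i++) {
        if (!vms[i]->runnable)
            continue;
        if (best == NULL || vms[i]->priority > best->priority)
            best = vms[i];
    }
    return best;
}
```

With one virtual machine per real-time application and the general-purpose guest at the lowest priority, this directly mirrors the task placement described in the first bullet above.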
References
[1] A. Silberschatz, P.B. Galvin, and G. Gagne. Applied Operating Systems Concepts. John Wiley & Sons, Hoboken, NJ, 1999.
[2] A.S. Tanenbaum and A.S. Woodhull. Operating Systems — Design and Implementation. Prentice Hall, Englewood Cliffs, NJ, 1997.
[3] M.K. McKusick, K. Bostic, M.J. Karels, and J.S. Quarterman. The Design and Implementation of the 4.4BSD Operating System. Addison-Wesley, Reading, MA, 1996.
[4] B. Sprunt, L. Sha, and J.P. Lehoczky. Aperiodic Task Scheduling for Hard Real-Time Systems. Real-Time Systems, 1, 27–60, 1989.
[5] IEEE Std 1003.1-2001. The Open Group Base Specifications Issue 6. The IEEE and The Open Group, 2001. Also available online at http://www.opengroup.org/.
[6] OSEK/VDX. OSEK/VDX Operating System Specification. Available online at http://www.osek-vdx.org/.
[7] IEEE Std 1003.13-1998. Standardized Application Environment Profile — POSIX Realtime Application Support (AEP). The IEEE, New York, 1998.
[8] ISO/IEC 9899:1999. Programming Languages — C. International Standards Organization, Geneva, 1999.
[9] Red Hat Inc. eCos User Guide. Available online at http://sources.redhat.com/ecos/.
[10] Arcturus Networks Inc. µClinux Documentation. Available online at http://www.uclinux.org/.
[11] FSMLabs, Inc. RTLinuxPro Frequently Asked Questions. Available online at http://www.rtlinux.org/.
[12] Politecnico di Milano, Dip. di Ingegneria Aerospaziale. The RTAI Manual. Available online at http://www.aero.polimi.it/~rtai/.
[13] M. Barabanov and V. Yodaiken. Real-Time Linux. Linux Journal, February 1997. Also available online at http://www.rtlinux.org/.
[14] On-Line Applications Research. RTEMS Documentation. Available online at http://www.rtems.com/.
[15] R.A. Meyer and L.H. Seawright. A Virtual Machine Time-Sharing System. IBM Systems Journal, 9, 199–218, 1970.
[16] International Business Machines Corp. z/VM General Information, GC24-5991-05. Also available online at http://www.ibm.com/.
[17] R. Goldberg. Architectural Principles for Virtual Computer Systems. Ph.D. thesis, Harvard University, 1972.
[18] R. Goldberg. Survey of Virtual Machine Research. IEEE Computer Magazine, 7, 34–45, 1974.
[19] Motorola Inc. M68000 Programmer’s Reference Manual, M68000PM/AD, Rev. 1.
[20] VMware Inc. VMware GSX Server User’s Manual. Also available online at http://www.vmware.com/.
[21] T. Garfinkel, B. Pfaff, J. Chow, M. Rosenblum, and D. Boneh. Terra: A Virtual Machine-Based Platform for Trusted Computing. In Proceedings of the 19th ACM Symposium on Operating Systems Principles, October 2003.
[22] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer, I. Pratt, and A. Warfield. Xen and the Art of Virtualization. In Proceedings of the 19th ACM Symposium on Operating Systems Principles, October 2003.
[23] E. Bugnion, S. Devine, and M. Rosenblum. Disco: Running Commodity Operating Systems on Scalable Multiprocessors. In Proceedings of the 16th ACM Symposium on Operating Systems Principles, October 1997.
[24] D. Seal, Ed. ARM Architecture Reference Manual, 2nd ed. Addison-Wesley, Reading, MA, 2001.
12
Real-Time Operating Systems: The Scheduling and Resource Management Aspects

Giorgio C. Buttazzo
University of Pavia

12.1 Introduction
    Achieving Predictability
12.2 Periodic Task Handling
    Timeline Scheduling • Rate Monotonic Scheduling • Earliest Deadline First • Tasks with Deadlines Less than Periods
12.3 Aperiodic Task Handling
12.4 Protocols for Accessing Shared Resources
    Priority Inheritance Protocol • Priority Ceiling Protocol • Schedulability Analysis
12.5 New Applications and Trends
12.6 Conclusions
References
12.1 Introduction
Often, people say that real-time systems must react fast to external events. Such a definition, however, is not precise, because processing speed alone does not provide any information on the actual capability of the system to react to events in a timely fashion. In fact, the effect of controller actions in a system can only be evaluated when considering the dynamic characteristics of the controlled environment. A more precise definition would say that a real-time system is a system in which performance depends not only on the correctness of individual controller actions, but also on the time at which actions are produced [1]. The main difference between a real-time task and a nonreal-time task is that a real-time task must complete within a given deadline. In other words, a deadline is the maximum time allowed for a computational process to finish its execution. In real-time applications, a result produced after its deadline is not only late, but can be dangerous. Depending on the consequences caused by a missed deadline, real-time activities can be classified into hard and soft tasks [2]. A real-time task is said to be hard if missing a deadline may have catastrophic consequences in the controlled system. A real-time task is said to be soft if missing a deadline
causes a performance degradation, but does not jeopardize correct system behavior. An operating system able to manage hard tasks is called a hard real-time system [3,4]. In general, hard real-time systems have to handle both hard and soft activities. In a control application, typical hard tasks include sensory data acquisition, detection of critical conditions, motor actuation, and action planning. Typical soft tasks include user command interpretation, keyboard input, message visualization, system status representation, and graphical activities. The great interest in real-time systems is motivated by their growing diffusion in our society, in several application fields including chemical and nuclear power plants, flight control systems, traffic monitoring systems, telecommunication systems, automotive devices, industrial automation, military systems, space missions, and robotic systems. Despite this large application domain, most of today’s real-time control systems are still designed using ad hoc techniques and heuristic approaches. Very often, control applications with stringent time constraints are implemented by writing large portions of code in assembly language, programming timers, writing low-level drivers for device handling, and manipulating task and interrupt priorities. Although the code produced by these techniques can be optimized to run very efficiently, this approach has several disadvantages. First of all, the implementation of large and complex applications in assembly language is much more difficult and time consuming than using high-level programming languages. Moreover, the efficiency of the code strongly depends on the programmer’s ability. In addition, assembly code optimization makes a program more difficult to comprehend, complicating software maintenance. Finally, without the support of specific tools and methodologies for code and schedulability analysis, the verification of time constraints becomes practically impossible. The major consequence of this state of affairs is that control software produced by empirical techniques can be highly unpredictable. If all critical time constraints cannot be verified a priori and the operating system does not include specific features for handling real-time tasks, the system may apparently work well for a period of time, but may collapse in certain rare, but possible, situations. The consequences of a failure can sometimes be catastrophic and may injure people or cause serious damage to the environment. A trustworthy guarantee of system behavior under all possible operating conditions can only be achieved by adopting appropriate design methodologies and kernel mechanisms specifically developed for handling explicit timing constraints.
12.1.1 Achieving Predictability
The most important property of a real-time system is not high speed, but predictability. In a predictable system, we should be able to determine in advance whether all the computational activities can be completed within their timing constraints. The deterministic behavior of a system typically depends on several factors, ranging from the hardware architecture to the operating system, up to the programming language used to write the application. Architectural features that have a major influence on task execution include interrupts, direct memory access (DMA), caches, and prefetching mechanisms. Although such features improve the average performance of the processor, they introduce nondeterministic behavior in process execution, prolonging worst-case response times. Other factors that significantly affect task execution are due to the internal mechanisms used in the operating system, such as the scheduling algorithm, the synchronization mechanisms, the memory management policy, and the method used to handle I/O devices. The programming language also has an important impact on predictability, through the constructs it provides to handle the timing requirements specified for computational activities.
12.2 Periodic Task Handling
Periodic activities represent the major computational load in a real-time control system. For example, activities such as actuator regulation, signal acquisition, filtering, sensory data processing, action planning, and monitoring need to be executed with a frequency derived from the application requirements. A periodic task is characterized by an infinite sequence of instances, or jobs. Each job is characterized by a request time and a deadline. The request time r(k) of the kth job of a task represents the time at
which the task becomes ready for execution for the kth time. The interval of time between two consecutive request times is equal to the task period. The absolute deadline of the kth job, denoted with d(k), represents the time within which the job has to complete its execution, and r(k) < d(k) ≤ r(k + 1).
12.2.1 Timeline Scheduling
Timeline Scheduling (TS), also known as a cyclic executive, is one of the most used approaches to handle periodic tasks in defense military systems and traffic control systems. The method involves dividing the temporal axis into slices of equal length, in which one or more tasks can be allocated for execution, in such a way as to respect the frequencies derived from the application requirements. A timer synchronizes the activation of the tasks at the beginning of each time slice. In order to illustrate this method, consider the following example, in which three tasks, A, B, and C, need to be executed with a frequency of 40, 20, and 10 Hz, respectively. By analyzing the task periods, it is easy to verify that the optimal length for the time slice is 25 msec, which is the greatest common divisor of the periods. Hence, to meet the required frequencies, task A needs to be executed in every time slice, task B every two slices, and task C every four slices. A possible scheduling solution for this task set is illustrated in Figure 12.1.

FIGURE 12.1 Example of TS.

The duration of the time slice is also called a minor cycle, whereas the minimum period after which the schedule repeats itself is called a major cycle. In general, the major cycle is equal to the least common multiple of all the periods (in the example, it is equal to 100 msec). In order to guarantee a priori that a schedule is feasible on a particular processor, it is sufficient to know the task worst-case execution times and verify that the sum of the executions within each time slice is less than or equal to the minor cycle. In the example shown in Figure 12.1, if C_A, C_B, and C_C denote the execution times of the tasks, it is sufficient to verify that

$$C_A + C_B \le 25 \text{ msec}$$
$$C_A + C_C \le 25 \text{ msec}$$

The most relevant advantage of TS is its simplicity. The method can be implemented by programming a timer to interrupt with a period equal to the minor cycle and by writing a main program that calls the tasks in the order given in the major cycle, inserting a time synchronization point at the beginning of each minor cycle. Since the task sequence is not decided by a scheduling algorithm in the kernel, but is triggered by the calls made by the main program, there are no context switches, so the runtime overhead is very low. Moreover, the sequence of tasks in the schedule is always the same, can be easily visualized, and is not affected by jitter (i.e., task start times and response times are not subject to large variations). In spite of these advantages, TS has some problems. For example, it is very fragile during overload conditions. If a task does not terminate at the minor cycle boundary, we can either let it continue or abort it. In both cases, however, the system may enter a risky situation. In fact, if we leave the failing task in execution, it can cause a domino effect on the other tasks, breaking the entire schedule (timeline break). On the other hand, if the failing task is aborted, the system may be left in an inconsistent state, jeopardizing correct system behavior. Another big problem of the TS technique is its sensitivity to application changes. If updating a
task requires an increase of its computation time or its activation frequency, the entire scheduling sequence may need to be reconstructed from scratch. Considering the previous example, if task B is updated to B′ and the code change is such that C_A + C_B′ > 25 msec, then we have to divide B′ into two or more pieces to be allocated in the available intervals of the timeline. Changing the task frequencies may cause even more radical changes in the schedule. For example, if the frequency of task B changes from 20 to 25 Hz, the previous schedule is no longer valid, because the new minor cycle is equal to 10 msec and the new major cycle is equal to 200 msec. Finally, another limitation of TS is that it is difficult to handle aperiodic activities efficiently without changing the task sequence. The problems outlined above can be solved by using priority-based scheduling algorithms.
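As an illustration of the implementation scheme described at the beginning of this section, the following is a minimal cyclic executive for the three-task example of Figure 12.1; wait_for_tick() is a hypothetical primitive that blocks until the next interrupt of a timer programmed with a 25-msec period, and task_A, task_B, and task_C stand for the task bodies.

```c
void wait_for_tick(void);   /* assumed platform primitive (25-msec timer) */
void task_A(void);
void task_B(void);
void task_C(void);

int main(void)
{
    /* Schedule of Figure 12.1: A runs in every slice (40 Hz), B in
     * slices 0 and 2 (20 Hz), C in slice 1 (10 Hz), so that each
     * slice holds at most A+B or A+C, matching the feasibility
     * conditions C_A + C_B <= 25 msec and C_A + C_C <= 25 msec. */
    for (unsigned slice = 0; ; slice = (slice + 1) % 4) {
        wait_for_tick();              /* time synchronization point   */
        task_A();
        if (slice == 0 || slice == 2)
            task_B();
        if (slice == 1)
            task_C();
    }
}
```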
12.2.2 Rate Monotonic Scheduling
The rate monotonic (RM) algorithm assigns each task a priority directly proportional to its activation frequency, so that tasks with shorter period have higher priority. Since a period is usually kept constant for a task, the RM algorithm implements a static priority assignment, in the sense that task priorities are decided at task creation and remain unchanged for the entire application run. RM is typically preemptive, although it can also be used in a non-preemptive mode. In 1973, Liu and Layland [5] showed that RM is optimal among all static scheduling algorithms, in the sense that if a task set is not schedulable by RM, then the task set cannot be feasibly scheduled by any other fixed priority assignment. Another important result proved by the same authors is that a set Γ = {τ_1, . . . , τ_n} of n periodic tasks is schedulable by RM if

$$\sum_{i=1}^{n} \frac{C_i}{T_i} \le n\left(2^{1/n} - 1\right) \qquad (12.1)$$

where C_i and T_i represent the worst-case computation time and the period of task τ_i, respectively. The quantity

$$U = \sum_{i=1}^{n} \frac{C_i}{T_i}$$

represents the processor utilization factor and denotes the fraction of time used by the processor to execute the entire task set. Table 12.1 shows the values of n(2^{1/n} − 1) for n from 1 to 10. As can be seen, the factor decreases with n and, for large n, it tends to the following limit value:

$$\lim_{n \to \infty} n\left(2^{1/n} - 1\right) = \ln 2 \approx 0.69$$

TABLE 12.1 Maximum Processor Utilization for the RM Algorithm

n     U_lub
1     1.000
2     0.828
3     0.780
4     0.757
5     0.743
6     0.735
7     0.729
8     0.724
9     0.721
10    0.718
We note that the test by Liu and Layland only gives a sufficient condition for guaranteeing a feasible schedule under the RM algorithm. Hence, a task set can be schedulable by RM even though the utilization condition is not satisfied. Nevertheless, we can certainly state that a periodic task set cannot be feasibly scheduled by any algorithm if U > 1. A statistical study carried out by Lehoczky et al. [6] on randomly generated task sets showed that the utilization bound of the RM algorithm has an average value of 0.88, and becomes 1 for periodic tasks with harmonic period relations. Necessary and sufficient schedulability tests for RM have been proposed [6,10,11,29], but they have pseudo-polynomial complexity. Recently, Bini and Buttazzo derived a sufficient polynomial-time test, the Hyperbolic Bound [28], capable of accepting more tasks than the Liu and Layland test. In spite of the limitation on the schedulability bound, which in most cases prevents full processor utilization, the RM algorithm is widely used in real-time applications, mainly for its simplicity. At the same time, being a static scheduling algorithm, it can easily be implemented on top of commercial operating systems, using a set of fixed priority levels. Moreover, in overload conditions, the highest-priority tasks are less prone to missing their deadlines. For all these reasons, the Software Engineering Institute of Pittsburgh has prepared a user guide for the design and analysis of real-time systems based on the RM algorithm [7]. Since the RM algorithm is optimal among all fixed priority assignments, the schedulability bound can only be improved through a dynamic priority assignment.
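The sufficient test of Equation (12.1) is straightforward to code; the sketch below returns true when the Liu and Layland bound is met. A false answer is inconclusive, since the condition is only sufficient; note also that the same utilization U, compared against 1, gives exactly the EDF condition of the next section.

```c
#include <math.h>
#include <stdbool.h>

/* Sufficient RM schedulability test: U <= n(2^(1/n) - 1). */
bool rm_liu_layland_test(const double C[], const double T[], int n)
{
    double U = 0.0;                       /* processor utilization    */
    for (int i = 0; i < n; i++)
        U += C[i] / T[i];
    return U <= n * (pow(2.0, 1.0 / n) - 1.0);
}
```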
12.2.3 Earliest Deadline First
The earliest deadline first (EDF) algorithm entails selecting, among the ready tasks, the task with the earliest absolute deadline. The EDF algorithm is typically preemptive, in the sense that a newly arrived task can preempt the running task if its absolute deadline is earlier. If the operating system does not support explicit timing constraints, EDF (like RM) can be implemented on a priority-based kernel, where priorities are dynamically assigned to tasks. A task will receive the highest priority if its deadline is the earliest among those of the ready tasks, whereas it will receive the lowest priority if its deadline is the latest one; that is, a task gets a priority that is inversely proportional to its absolute deadline. The EDF algorithm is more general than RM, since it can be used to schedule both periodic and aperiodic task sets, because the selection of a task is based on the value of its absolute deadline, which can be defined for both types of tasks. Typically, a periodic task that has completed its execution is suspended by the kernel until its next release, coincident with the end of the current period. Dertouzos [8] showed that EDF is optimal among all online algorithms, while Liu and Layland [5] proved that a set Γ = {τ_1, . . . , τ_n} of n periodic tasks is schedulable by EDF if and only if

$$\sum_{i=1}^{n} \frac{C_i}{T_i} \le 1$$
It is worth noting that the EDF schedulability condition is necessary and sufficient to guarantee a feasible schedule. This means that, if it is not satisfied, no algorithm is able to produce a feasible schedule for that task set. The dynamic priority assignment allows EDF to exploit the full processor, reaching up to 100% processor utilization. When the task set has a utilization factor less than one, the residual fraction of time can be efficiently used to handle aperiodic requests activated by external events. In addition, compared with RM, EDF generates a lower number of context switches, thus causing less runtime overhead. On the other hand, RM is simpler to implement on a fixed priority kernel and is more predictable in overload situations, because higher-priority tasks are less likely to miss their deadlines.
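The EDF dispatching rule itself reduces to a scan of the ready jobs; the structure and names below are illustrative rather than any specific kernel's API.

```c
/* EDF dispatching: among the ready jobs, pick the one with the
 * earliest absolute deadline. */
struct job {
    double abs_deadline;       /* absolute deadline d(k)              */
    int    ready;              /* nonzero if the job is ready to run  */
};

int edf_pick(const struct job jobs[], int n)
{
    int best = -1;
    for (int i = 0; i < n; i++) {
        if (!jobs[i].ready)
            continue;
        if (best < 0 || jobs[i].abs_deadline < jobs[best].abs_deadline)
            best = i;
    }
    return best;               /* index of the job to run, -1 if none */
}
```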
12.2.4 Tasks with Deadlines Less than Periods
Using RM or EDF, a periodic task can be executed at any time during its period. The only guarantee provided by the schedulability test is that each task will be able to complete its execution before the next
release time. In some real-time applications, however, there is the need for some periodic task to complete within an interval less than its period. The deadline monotonic (DM) algorithm, proposed by Leung and Whitehead [9], extends RM to handle tasks with a relative deadline less than or equal to their period. According to DM, at each instant the processor is assigned to the task with the shortest relative deadline. In priority-based kernels, this is equivalent to assigning each task a priority P_i inversely proportional to its relative deadline D_i. With D_i fixed for each task, DM is classified as a static scheduling algorithm. In recent years, several authors [6,10,11] independently proposed a necessary and sufficient test to verify the schedulability of a periodic task set. For example, the method proposed by Audsley et al. [10] involves computing the worst-case response time R_i of each periodic task, derived by summing its computation time and the interference caused by tasks with higher priority:

$$R_i = C_i + \sum_{k \in hp(i)} \left\lceil \frac{R_i}{T_k} \right\rceil C_k \qquad (12.2)$$
where hp(i) denotes the set of tasks having priority higher than task i, and ⌈x⌉ denotes the ceiling of a rational number x, that is, the smallest integer greater than or equal to x. The equation above can be solved by an iterative approach, starting with R_i(0) = C_i and terminating when R_i(s) = R_i(s − 1). If R_i(s) > D_i for some task, then the task set cannot be feasibly scheduled by DM. Under EDF, the schedulability analysis for periodic task sets with deadlines less than periods is based on the processor demand criterion, proposed by Baruah et al. [12]. According to this method, a task set is schedulable by EDF if and only if, in every interval of length L (starting at time 0), the overall computational demand is no greater than the available processing time, that is, if and only if

$$\forall L > 0, \quad \sum_{i=1}^{n} \left\lfloor \frac{L + T_i - D_i}{T_i} \right\rfloor C_i \le L \qquad (12.3)$$
This test is computationally feasible, because L needs to be checked only for values equal to task absolute deadlines no larger than the least common multiple of the periods. A detailed analysis of EDF has been presented by Stankovic, Ramamritham, Spuri, and Buttazzo [30] under several workload conditions.
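The iterative solution of Equation (12.2) is easy to implement with integer time units, under the usual convention that tasks are indexed by decreasing priority so that hp(i) = {0, ..., i−1}; the sketch below declares failure as soon as some response time exceeds its deadline.

```c
#include <stdbool.h>

/* Exact DM/fixed-priority test via response-time analysis: tasks are
 * sorted by decreasing priority; C, T, D hold computation times,
 * periods, and relative deadlines in integer time units, so that the
 * ceiling (prev + T[k] - 1) / T[k] is exact. */
bool dm_response_time_test(const long C[], const long T[],
                           const long D[], int n)
{
    for (int i = 0; i < n; i++) {
        long R = C[i], prev;
        do {
            prev = R;
            R = C[i];
            for (int k = 0; k < i; k++)          /* interference      */
                R += ((prev + T[k] - 1) / T[k]) * C[k];
            if (R > D[i])
                return false;                    /* deadline miss     */
        } while (R != prev);                     /* fixed point       */
    }
    return true;                                 /* all tasks fit     */
}
```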
12.3 Aperiodic Task Handling
Although in a real-time system most acquisition and control tasks are periodic, there exist computational activities that must be executed only at the occurrence of external events (typically signaled through interrupts), which may arrive at irregular times. When the system must handle aperiodic requests of computation, we have to balance two conflicting interests: on the one hand, we would like to serve an event as soon as possible, to improve system responsiveness; on the other hand, we do not want to jeopardize the schedulability of periodic tasks. If aperiodic activities are less critical than periodic tasks, then the objective of a scheduling algorithm should be to minimize their response time, while guaranteeing that all periodic tasks (although being delayed by the aperiodic service) complete their executions within their deadlines. If some aperiodic task has a hard deadline, we should try to guarantee its timely completion offline. Such a guarantee can only be given by assuming that aperiodic requests, although arriving at irregular intervals, do not exceed a given maximum frequency, that is, they are separated by a minimum interarrival time. An aperiodic task characterized by a minimum interarrival time is called a sporadic task. Let us consider an example in which an aperiodic job J_a of 3 units of time must be scheduled by RM along with two periodic tasks, having computation times C_1 = 1, C_2 = 3 and periods T_1 = 4, T_2 = 6, respectively. As shown in Figure 12.2, if the aperiodic request is serviced immediately (i.e., with a priority higher than that assigned to periodic tasks), then task τ_2 will miss its deadline. The simplest technique for managing aperiodic activities while preserving the guarantee for periodic tasks is to schedule them in background. This means that an aperiodic task executes only when the processor is not busy with periodic tasks.
FIGURE 12.2 Immediate service of an aperiodic task. Periodic tasks are scheduled by RM.
t1 0
4
8
12
t2 0
6
12
Ja 0
FIGURE 12.3
2
4
6
8
10
12
Background service of an aperiodic task. Periodic tasks are scheduled by RM.
The disadvantage of this solution is that, if the computational load due to periodic tasks is high, the residual time left for aperiodic execution can be insufficient for satisfying their deadlines. Considering the same task set as before, Figure 12.3 illustrates how job Ja is handled by a background service.

The response time of aperiodic tasks can be improved by handling them through a periodic server dedicated to their execution. Like any other periodic task, a server is characterized by a period Ts and an execution time Cs, called the server capacity (or budget). In general, the server is scheduled with the algorithm adopted for the periodic tasks and, once activated, it serves the pending aperiodic requests within the limit of its current capacity. The order of service of the aperiodic requests is independent of the scheduling algorithm used for the periodic tasks, and it can be a function of the arrival time, computation time, or deadline. In recent years, several aperiodic service algorithms have been proposed in the real-time literature, differing in performance and complexity. Among the fixed priority algorithms we mention the Polling Server, the Deferrable Server [13,14], the Sporadic Server [15], and the Slack Stealer [16]. Among the servers using dynamic priorities (which are more efficient on average), we recall the Dynamic Sporadic Server [17,18], the Total Bandwidth Server [19], the Tunable Bandwidth Server [20], and the Constant Bandwidth Server [21]. In order to clarify the idea behind an aperiodic server, Figure 12.4 illustrates the schedule produced, under EDF, by a Dynamic Deferrable Server with capacity Cs = 1 and period Ts = 4.
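The bookkeeping behind the simplest of these servers reduces to replenishing a budget at every period and charging aperiodic execution against it. The following is a minimal sketch under stated assumptions (times in integer ticks; server_replenish and server_consume are hypothetical kernel hooks, not an API from the literature):

    /* Budget bookkeeping for a simple polling-style server with
       period Ts and capacity Cs. */
    typedef struct {
        long Ts;      /* server period */
        long Cs;      /* server capacity (budget per period) */
        long budget;  /* budget left in the current period */
    } AperiodicServer;

    /* Called by the kernel at every server release (period boundary). */
    void server_replenish(AperiodicServer *s) {
        s->budget = s->Cs;
    }

    /* Charge `requested` ticks of aperiodic service against the budget
       and return the ticks actually granted; when less than requested
       is granted, the remaining work waits for the next replenishment. */
    long server_consume(AperiodicServer *s, long requested) {
        long granted = requested < s->budget ? requested : s->budget;
        s->budget -= granted;
        return granted;
    }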
FIGURE 12.4 Aperiodic service performed by a Dynamic Deferrable Server. Periodic tasks, including the server, are scheduled by EDF. Cs is the remaining budget available for Ja.

FIGURE 12.5 Optimal aperiodic service under EDF.
We note that, when the absolute deadline of the server is equal to that of a periodic task, priority is given to the server in order to enhance aperiodic responsiveness. We also observe that the same task set would not be schedulable under a fixed priority system. Although the response time achieved by a server is shorter than that achieved through the background service, it is not the minimum possible. The minimum response time can be obtained with an optimal server (TB*) that assigns each aperiodic request the earliest possible deadline that still produces a feasible EDF schedule [20]. The schedule generated by the optimal TB* algorithm is illustrated in Figure 12.5, where the minimum response time for job Ja is equal to 5 units of time (obtained by assigning the job a deadline da = 7). As with all efficient solutions, the better performance is achieved at the price of a larger runtime overhead (due to the complexity of computing the minimum deadline). However, by adopting a variant of the algorithm, called the Tunable Bandwidth Server [20], overhead and performance can be balanced in order to select the best service method for a given real-time system. An overview of the most common aperiodic service algorithms (both under fixed and dynamic priorities) can be found in Reference 3.
FIGURE 12.6 Example of priority inversion.
12.4 Protocols for Accessing Shared Resources

When two or more tasks interact through shared resources (e.g., shared memory buffers), the direct use of classical synchronization mechanisms, such as semaphores or monitors, can cause a phenomenon known as priority inversion: a high priority task can be blocked by a low priority task for an unbounded interval of time. Such a blocking condition can create serious problems in safety-critical real-time systems, since it can cause deadlines to be missed. For example, consider three tasks, τ1, τ2, and τ3, having decreasing priority (τ1 is the task with the highest priority), and assume that τ1 and τ3 share a data structure protected by a binary semaphore S. As shown in Figure 12.6, suppose that at time t1 task τ3 enters its critical section, holding semaphore S. During the execution of τ3, at time t2, assume τ1 becomes ready and preempts τ3. At time t3, when τ1 tries to access the shared resource, it is blocked on semaphore S, since the resource is in use by τ3. Since τ1 is the highest priority task, we would expect it to be blocked for an interval no longer than the time needed by τ3 to complete its critical section. Unfortunately, however, the maximum blocking time for τ1 can become much larger. In fact, task τ3, while holding the resource, can be preempted by medium priority tasks (such as τ2), which will prolong the blocking interval of τ1 for their entire execution!

The situation illustrated in Figure 12.6 can be avoided by simply preventing preemption inside critical sections. This solution, however, is appropriate only for very short critical sections, because it could cause unnecessary delays for high priority tasks. For example, a low priority task inside a long critical section would prevent the execution of a high priority task, even though the two do not share any resource. A more efficient solution is to regulate the access to shared resources through the use of specific concurrency control protocols [22], designed to limit the priority inversion phenomenon.
12.4.1 Priority Inheritance Protocol

An elegant solution to the priority inversion phenomenon caused by mutual exclusion is offered by the priority inheritance protocol (PIP) [23]. Here, the problem is solved by dynamically modifying the priorities of the tasks that cause a blocking condition. In particular, when a task τa blocks on a shared resource, it transmits its priority to the task τb that is holding the resource. In this way, τb will execute its critical section with the priority of task τa.
FIGURE 12.7 Schedule produced using Priority Inheritance on the task set of Figure 12.6.
In general, τb inherits the highest priority among the tasks it blocks. Moreover, priority inheritance is transitive: if task τc blocks τb, which in turn blocks τa, then τc will inherit the priority of τa through τb. Figure 12.7 illustrates how the schedule shown in Figure 12.6 changes when the resources are accessed using the PIP. Until time t3 the system evolution is the same as the one shown in Figure 12.6. At time t3, the high priority task τ1 blocks after attempting to access the resource held by τ3 (direct blocking). In this case, however, the protocol imposes that τ3 inherit the maximum priority among the tasks blocked on that resource, so it continues the execution of its critical section at the priority of τ1. Under these conditions, at time t4, task τ2 is not able to preempt τ3, hence it blocks until the resource is released (push-through blocking). In other words, although τ2 has a nominal priority greater than that of τ3, it cannot execute, because τ3 has inherited the priority of τ1. At time t5, τ3 exits its critical section, releases the semaphore, and recovers its nominal priority. As a consequence, τ1 can proceed until its completion, which occurs at time t6. Only then can τ2 start executing. The PIP has the following property [23]: given a task τ, if n is the number of lower priority tasks sharing a resource with a task of priority higher than or equal to that of τ, and m is the number of semaphores that could block τ, then τ can be blocked for at most the duration of min(n, m) critical sections. Although the PIP limits the priority inversion phenomenon, the maximum blocking time for high priority tasks can still be significant, due to possible chained blocking conditions. Moreover, deadlock can occur if semaphores are not properly used in nested critical sections.
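In practice, priority inheritance is directly available in POSIX threads through the mutex protocol attribute. The fragment below is a minimal sketch (error handling reduced to returning the error code; make_pip_mutex is a name chosen here for illustration):

    #include <pthread.h>

    /* Create a mutex governed by the priority inheritance protocol:
       a task holding `m` temporarily inherits the priority of the
       highest priority task blocked on it. */
    int make_pip_mutex(pthread_mutex_t *m) {
        pthread_mutexattr_t attr;
        int rc;

        rc = pthread_mutexattr_init(&attr);
        if (rc != 0) return rc;
        rc = pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
        if (rc != 0) return rc;
        return pthread_mutex_init(m, &attr);
    }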
12.4.2 Priority Ceiling Protocol

The priority ceiling protocol (PCP) [23] provides a better solution to the priority inversion phenomenon, also avoiding chained blocking and deadlock conditions. The basic idea behind this protocol is to ensure that, whenever a task τ enters a critical section, its priority is the highest among those that can be inherited from all the lower priority tasks that are currently suspended in a critical section. If this condition is not satisfied, τ is blocked, and the task that is blocking τ inherits τ's priority. This idea is implemented by assigning each semaphore a priority ceiling equal to the highest priority of the tasks using that semaphore. Then, a task τ is allowed to enter a critical section only if its priority is strictly greater than all priority
ceilings of the semaphores held by the other tasks. As for the PIP, the inheritance mechanism is transitive. The PCP, besides avoiding chained blocking and deadlocks, has the property that each task can be blocked for at most the duration of a single critical section.
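POSIX threads also offer a ceiling-based protocol, PTHREAD_PRIO_PROTECT, which implements the immediate variant of the priority ceiling idea (the locking task is raised to the ceiling as soon as it acquires the mutex, rather than only when it blocks another task). A minimal sketch, with make_ceiling_mutex a name chosen here for illustration:

    #include <pthread.h>

    /* Create a mutex governed by the (immediate) priority ceiling
       protocol; `ceiling` must be at least the highest priority of
       any task that will ever lock this mutex. */
    int make_ceiling_mutex(pthread_mutex_t *m, int ceiling) {
        pthread_mutexattr_t attr;
        int rc;

        rc = pthread_mutexattr_init(&attr);
        if (rc != 0) return rc;
        rc = pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_PROTECT);
        if (rc != 0) return rc;
        rc = pthread_mutexattr_setprioceiling(&attr, ceiling);
        if (rc != 0) return rc;
        return pthread_mutex_init(m, &attr);
    }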
12.4.3 Schedulability Analysis

The importance of the protocols for accessing shared resources in a real-time system derives from the fact that they can bound the maximum blocking time experienced by a task. This is essential for analyzing the schedulability of a set of real-time tasks interacting through shared buffers or any other non-preemptable resource, for example, a communication port or bus. To verify the schedulability of task τi using the processor utilization approach, we need to consider the utilization factor of task τi, the interference caused by the higher priority tasks, and the blocking time caused by lower priority tasks. If Bi is the maximum blocking time that can be experienced by task τi, then the sum of the utilization factors due to these three causes cannot exceed the least upper bound of the scheduling algorithm, that is:

\[
\forall i,\; 1 \le i \le n, \qquad \sum_{k \in hp(i)} \frac{C_k}{T_k} + \frac{C_i + B_i}{T_i} \le i \left( 2^{1/i} - 1 \right) \tag{12.4}
\]
where hp(i) denotes the set of tasks with priority higher than τi . The same test is valid for both the protocols described above, the only difference being the amount of blocking that each task may experience.
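The test translates directly into code. The sketch below (illustrative only; tasks are assumed to be indexed by decreasing priority) applies Equation (12.4) to each task. Note that the test is only sufficient, so a task set that fails it may still be schedulable:

    #include <math.h>
    #include <stdbool.h>

    /* Utilization-based test of Equation (12.4); tasks are indexed
       by decreasing priority, so hp(i) = {0, ..., i-1}. */
    bool schedulable_with_blocking(const double *C, const double *T,
                                   const double *B, int n) {
        for (int i = 0; i < n; i++) {
            double u = (C[i] + B[i]) / T[i];   /* own load plus blocking */
            for (int k = 0; k < i; k++)
                u += C[k] / T[k];              /* higher priority interference */
            if (u > (i + 1) * (pow(2.0, 1.0 / (i + 1)) - 1.0))
                return false;                  /* bound violated for task i */
        }
        return true;
    }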
12.5 New Applications and Trends

In recent years, real-time system technology has been applied to several application domains where computational activities have less stringent timing constraints and occasional deadline misses are typically tolerated. Examples of such systems include monitoring systems, multimedia systems, flight simulators, and, in general, virtual reality games. In such applications, missing a deadline does not cause catastrophic effects on the system, but just a performance degradation. Hence, instead of requiring an absolute guarantee of the feasibility of the schedule, such systems demand an acceptable quality of service (QoS). It is worth observing that, since some timing constraints need to be handled anyway (although they are not critical), a non-real-time operating system, such as Linux or Windows, is not appropriate: first of all, such systems do not provide temporal isolation among tasks, so a sporadic peak load on one task may negatively affect the execution of other tasks in the system. Furthermore, the lack of concurrency control mechanisms that prevent priority inversion makes these systems unsuitable for guaranteeing a desired QoS level. On the other hand, a hard real-time approach is also not well suited for supporting such applications, because resources would be wasted due to static allocation mechanisms and pessimistic design assumptions. Moreover, in many multimedia applications, tasks are characterized by highly variable execution times (consider, for instance, an MPEG player), so providing precise estimates of task computation times is practically impossible, unless one uses overly pessimistic figures.

In order to provide efficient as well as predictable support for this type of real-time application, several new approaches and scheduling methodologies have been proposed. They increase the flexibility and the adaptability of a system to online variations. For example, temporal protection mechanisms have been proposed to isolate task overruns and reduce reciprocal task interference [21, 24]. Statistical analysis techniques have been introduced to provide a probabilistic guarantee aimed at improving system efficiency [21]. Other techniques have been devised to handle transient and permanent overload conditions in a controlled fashion, thus increasing the average computational load sustainable by the system. One method absorbs the overload by regularly aborting some jobs of a periodic task, without exceeding a maximum limit specified by the user through a QoS parameter describing the minimum number of jobs between two consecutive abortions [25,26]. Another technique handles overloads through a suitable variation of the task periods, managed so as to decrease the processor utilization to a desired level [27].
12.6 Conclusions

This chapter surveyed some kernel methodologies aimed at enhancing the efficiency and the predictability of real-time control applications. In particular, it presented some scheduling algorithms and analysis techniques for periodic and aperiodic task sets. Two concurrency control protocols were described for accessing shared resources in mutual exclusion while avoiding the priority inversion phenomenon. Each technique has the property of being analyzable, so that an offline guarantee can be provided that the schedule is feasible within the timing constraints imposed by the application. For soft real-time systems, such as multimedia systems or simulators, the hard real-time approach can be too rigid and inefficient, especially when the application tasks have highly variable computation times. In these cases, novel methodologies have been introduced to improve average resource exploitation. They are also able to guarantee a desired QoS level and to control performance degradation during overload conditions.

In addition to research efforts aimed at providing solutions to more complex problems, a concrete increase in the reliability of future real-time systems can only be achieved if the mature methodologies are actually integrated into next-generation operating systems and languages, defining new standards for the development of real-time applications. At the same time, programmers and software engineers need to be educated in the appropriate use of the available technologies.
References

[1] J. Stankovic, Misconceptions about Real-Time Computing: A Serious Problem for Next-Generation Systems. IEEE Computer, 21(10), 10–19, 1988.
[2] J. Stankovic and K. Ramamritham, Tutorial on Hard Real-Time Systems, IEEE Computer Society Press, Washington, 1988.
[3] G.C. Buttazzo, Hard Real-Time Computing Systems: Predictable Scheduling Algorithms and Applications, Kluwer Academic Publishers, Boston, MA, 1997.
[4] J. Stankovic, M. Spuri, M. Di Natale, and G. Buttazzo, Implications of Classical Scheduling Results for Real-Time Systems. IEEE Computer, 28, 16–25, 1995.
[5] C.L. Liu and J.W. Layland, Scheduling Algorithms for Multiprogramming in a Hard Real-Time Environment. Journal of the ACM, 20, 40–61, 1973.
[6] J.P. Lehoczky, L. Sha, and Y. Ding, The Rate Monotonic Scheduling Algorithm: Exact Characterization and Average Case Behavior. In Proceedings of the IEEE Real-Time Systems Symposium, pp. 166–171, 1989.
[7] M.H. Klein et al., A Practitioner's Handbook for Real-Time Analysis: Guide to Rate Monotonic Analysis for Real-Time Systems, Kluwer Academic Publishers, Boston, MA, 1993.
[8] M.L. Dertouzos, Control Robotics: The Procedural Control of Physical Processes. In Information Processing 74, North-Holland, Amsterdam, 1974.
[9] J. Leung and J. Whitehead, On the Complexity of Fixed Priority Scheduling of Periodic Real-Time Tasks. Performance Evaluation, 2, 237–250, 1982.
[10] N.C. Audsley, A. Burns, M. Richardson, K. Tindell, and A. Wellings, Applying New Scheduling Theory to Static Priority Preemptive Scheduling. Software Engineering Journal, 8, 284–292, 1993.
[11] M. Joseph and P. Pandya, Finding Response Times in a Real-Time System. The Computer Journal, 29, 390–395, 1986.
[12] S.K. Baruah, R.R. Howell, and L.E. Rosier, Algorithms and Complexity Concerning the Preemptive Scheduling of Periodic Real-Time Tasks on One Processor. Real-Time Systems, 2, 301–324, 1990.
[13] J.P. Lehoczky, L. Sha, and J.K. Strosnider, Enhanced Aperiodic Responsiveness in Hard Real-Time Environments. In Proceedings of the IEEE Real-Time Systems Symposium, pp. 261–270, 1987.
[14] J.K. Strosnider, J.P. Lehoczky, and L. Sha, The Deferrable Server Algorithm for Enhanced Aperiodic Responsiveness in Hard Real-Time Environments. IEEE Transactions on Computers, 44, 1995.
[15] B. Sprunt, L. Sha, and J. Lehoczky, Aperiodic Task Scheduling for Hard Real-Time Systems. Journal of Real-Time Systems, 1, 27–60, 1989.
[16] J.P. Lehoczky and S. Ramos-Thuel, An Optimal Algorithm for Scheduling Soft-Aperiodic Tasks in Fixed-Priority Preemptive Systems. In Proceedings of the IEEE Real-Time Systems Symposium, 1992.
[17] T.M. Ghazalie and T.P. Baker, Aperiodic Servers in a Deadline Scheduling Environment. Real-Time Systems, 9(1), 31–67, 1995.
[18] M. Spuri and G.C. Buttazzo, Efficient Aperiodic Service under Earliest Deadline Scheduling. In Proceedings of the IEEE Real-Time Systems Symposium, San Juan, PR, December 1994.
[19] M. Spuri and G. Buttazzo, Scheduling Aperiodic Tasks in Dynamic Priority Systems. Real-Time Systems, 10(2), 179–210, 1996.
[20] G. Buttazzo and F. Sensini, Optimal Deadline Assignment for Scheduling Soft Aperiodic Tasks in Hard Real-Time Environments. IEEE Transactions on Computers, 48(10), 1035–1052, 1999.
[21] L. Abeni and G. Buttazzo, Integrating Multimedia Applications in Hard Real-Time Systems. In Proceedings of the IEEE Real-Time Systems Symposium, Madrid, Spain, December 1998.
[22] R. Rajkumar, Synchronization in Real-Time Systems: A Priority Inheritance Approach, Kluwer Academic Publishers, Boston, MA, 1991.
[23] L. Sha, R. Rajkumar, and J.P. Lehoczky, Priority Inheritance Protocols: An Approach to Real-Time Synchronization. IEEE Transactions on Computers, 39, 1175–1185, 1990.
[24] I. Stoica, H. Abdel-Wahab, K. Jeffay, S. Baruah, J.E. Gehrke, and G.C. Plaxton, A Proportional Share Resource Allocation Algorithm for Real-Time, Time-Shared Systems. In Proceedings of the IEEE Real-Time Systems Symposium, December 1996.
[25] G. Buttazzo and M. Caccamo, Minimizing Aperiodic Response Times in a Firm Real-Time Environment. IEEE Transactions on Software Engineering, 25, 22–32, 1999.
[26] G. Koren and D. Shasha, Skip-Over: Algorithms and Complexity for Overloaded Systems that Allow Skips. In Proceedings of the IEEE Real-Time Systems Symposium, 1995.
[27] G. Buttazzo, G. Lipari, M. Caccamo, and L. Abeni, Elastic Scheduling for Flexible Workload Management. IEEE Transactions on Computers, 51, 289–302, 2002.
[28] E. Bini, G.C. Buttazzo, and G.M. Buttazzo, A Hyperbolic Bound for the Rate Monotonic Algorithm. In Proceedings of the 13th Euromicro Conference on Real-Time Systems, Delft, The Netherlands, pp. 59–66, June 2001.
[29] E. Bini and G.C. Buttazzo, The Space of Rate Monotonic Schedulability. In Proceedings of the 23rd IEEE Real-Time Systems Symposium, Austin, TX, December 2002.
[30] J. Stankovic, K. Ramamritham, M. Spuri, and G. Buttazzo, Deadline Scheduling for Real-Time Systems, Kluwer Academic Publishers, Boston, MA, 1998.
13
Quasi-Static Scheduling of Concurrent Specifications

Alex Kondratyev
Cadence Berkeley Laboratories and Politecnico di Torino

Luciano Lavagno
Politecnico di Torino

Claudio Passerone
Politecnico di Torino

Yosinori Watanabe
Cadence Berkeley Laboratories

13.1 Introduction
    Quasi-Static Scheduling • A Simple Example
13.2 Overview of Related Work
13.3 QSS for PNs
    Definitions • Specification Model • Schedulability Analysis • Algorithmic Implementation
13.4 QSS for Boolean Dataflow
    Definitions • Schedulability Analysis • Comparison to PN Model
13.5 Conclusions
References
13.1 Introduction

13.1.1 Quasi-Static Scheduling

The phenomenal growth of the complexity and breadth of use of embedded systems can be managed only by raising the level of abstraction at which design activities start and most design space exploration occurs. This enables greater reuse potential, but requires significant tool support for efficient analysis, mapping, and synthesis. In this chapter we deal with approaches aimed at providing designers with efficient methods for uniprocessor software synthesis, starting from formal models that explicitly represent the available concurrency. These methods can be extended to multiprocessor support and to hardware synthesis; however, these advanced topics are outside the scope of this chapter. Concurrent specifications, such as dataflow networks [1], Kahn process networks [2], Communicating Sequential Processes [3], synchronous languages [4], and graphical state machines [5], are interesting because they expose the inherent parallelism in the application, which is much harder to recover a posteriori by optimizing compilers. In such a specification, the application is described as a set of processes that sequentially execute operations and communicate with each other. In considering an
implementation of the application, it is often necessary to analyze how these processes interact with each other. This analysis is used for evaluating how often a process will be invoked during an execution of the system, or how much memory will be required for implementing the communication between the processes. Quasi-Static Scheduling (QSS) is a technique for finding sequences of operations to be executed across the processes that constitute a concurrent specification of the application. Several approaches have been proposed [6–11]; they use certain mathematical models to abstract the specification and aim to compute graphs of finite size such that the sequences are given by traversing the graphs. We call the sequences of operations, or the graph that represents them, a schedule of the specification. The schedule is static in the sense that it statically commits to a particular execution order of the operations of the processes. In general, there exists more than one possible order of operations to be executed, with a different implementation cost for each. On the other hand, by committing to a particular sequence, a static schedule allows a more rigorous analysis of the interaction among the processes than dynamic schedules, because one can precisely observe how the operations from different processes are interleaved to constitute the system execution.

The reason to start from a concurrent specification is twofold. First of all, coarse-grained parallelism is very difficult to recover from a sequential specification, except in relatively simple cases (e.g., nested loops with affine memory accesses [12]). Second, parallel specifications offer a good model to perform system-level partitioning experiments, aimed at finding the best mixed hardware/software implementation on a complex SOC platform. The reason to look for a sequencing of the originally concurrent operations is that in this chapter we are considering embedded software implementations, for which the context switching implied by a multithreaded concurrent implementation would be very expensive whenever concurrency can be resolved at compile time. This resolution is especially difficult if the specification involves data-dependent conditional constructs, such as if-then-else with a data-dependent condition, because different sets of operations may be executed depending on how the constructs are resolved. For such a specification, the static scheduling produces in principle a sequence of operations for each possible way of resolving the constructs (in practice, these multiple sequences are collapsed as much as possible, in order to reduce code size). Note that these constructs are resolved based on the data values, and therefore some of the resolutions of the constructs may not happen at runtime in a particular execution of the system. The information about data values is not available to the static scheduling algorithm, because the latter runs statically at compile time. In this sense, scheduling for a specification with such constructs is called quasi-static. It is responsible for providing a sequence of operations to be executed for each possible runtime resolution of data-dependent choices.

After a simple motivating example, we present an overview of some approaches proposed in the literature. In Section 13.2, we consider two questions that one is concerned with in QSS, and briefly describe how these questions are addressed in two different models that have been proposed in the literature.
One of the models is Petri nets (PNs), and the other is Boolean Dataflow (BDF) graphs. They model a given concurrent specification in different ways, and thus the expressiveness of the models and the issues that need to be accounted for to solve the scheduling problem are different. These two models, and the issues raised by their scheduling approaches, are presented in more detail in Sections 13.3 and 13.4, respectively.
13.1.2 A Simple Example

Figure 13.1 illustrates how QSS works. In Figure 13.1(a), there are two processes, each with a sequential program associated with it. The one shown on the left reads a value from port DATA into variable d, computes a value for the variable D, writes it to the port PORT, and then goes back to the beginning. The other process reads a value for variable N from port START, and then iterates the body of the for-loop N times. For each iteration, it reads two values from port IN and assigns them to x[0] and x[1], respectively. Here, the third argument of the read function designates the number of data items to be read at a time.
FIGURE 13.1 A simple example: (a) initial specification, (b) result of the schedule.
Since IN is connected to PORT, the process on the left must produce the data this read statement requires. However, it writes only one data item at a time to PORT, and therefore it needs to iterate the body of its while-loop twice in order to provide the number of data items required by the read. Once the values of x have been set, a value for variable y is computed. At the end of the for-loop, the result assigned to y is written to the port OUT, and the whole process is repeated. Throughout this chapter, we assume that the communication between processes is made through point-to-point finite buffers with first-in first-out (FIFO) semantics. Therefore, a read operation can complete only when the requested number of data items is available in the corresponding buffer. Similarly, a write operation can complete only if the number of data items does not exceed the predefined bound of the buffer after writing.

A result of scheduling this system is shown in Figure 13.1(b). It is a single process that interleaves the statements of the original two processes. Note that the resulting process does not have ports for PORT and IN, which were originally used for connecting the two processes, because the read and write operations on these ports are replaced by assignments of the variable D to x[0] and x[1]. In this way, scheduling uses data assignments to realize the communication between the original processes, which is often more efficient to implement. Further, it repeats the same set of operations, given by read; D = d*d; x[i] = D;, making explicit the fact that one of the original processes needs to iterate twice for each iteration of the for-loop of the other process. Such a repetition could be exploited to realize an efficient implementation, but it can be identified only by analyzing how the original processes interact with each other, and therefore it is not taken into account when implementing each process individually. The effectiveness of this kind of scheduling is shown by case studies such as [13], where QSS was applied to a part of the MPEG video decoding algorithm and the speed of the scheduled design improved by 45%. The improvement was mainly due to the replacement of the explicit communication among the processes by data assignments, and also to a reduction in the number of processes, which in turn reduced the amount of context switching.
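The blocking FIFO semantics assumed above can be made concrete with a small sketch (illustrative only; the Fifo type and its operations are not part of the specification language used in this chapter). An operation that cannot complete reports failure, so that a runtime can suspend the calling process and retry later:

    #include <stddef.h>

    /* A point-to-point FIFO channel with a predefined bound. */
    typedef struct {
        int   *buf;     /* storage of `bound` items */
        size_t bound;   /* predefined capacity */
        size_t count;   /* items currently stored */
        size_t head;    /* index of the oldest item */
    } Fifo;

    /* Read n items; returns 1 on success, 0 if the caller must block. */
    int fifo_read(Fifo *f, int *dst, size_t n) {
        if (f->count < n) return 0;            /* not enough data yet */
        for (size_t i = 0; i < n; i++) {
            dst[i] = f->buf[f->head];
            f->head = (f->head + 1) % f->bound;
        }
        f->count -= n;
        return 1;
    }

    /* Write n items; returns 1 on success, 0 if the bound would be
       exceeded and the caller must block. */
    int fifo_write(Fifo *f, const int *src, size_t n) {
        if (f->count + n > f->bound) return 0;
        for (size_t i = 0; i < n; i++)
            f->buf[(f->head + f->count + i) % f->bound] = src[i];
        f->count += n;
        return 1;
    }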
13.2 Overview of Related Work

When solving the scheduling problem, two main questions are usually of interest:

1. Does the specification have a bounded-length cyclic schedule? By length, we mean the number of "steps" in a schedule required to return the specification to its initial "state." This question is important if the specification is to be scheduled with a hard real-time constraint.
2. Can the specification be scheduled within bounded memory? This means that at every "state" of a schedule one can compute and move to the next step using a finite amount of memory for communication buffers, and eventually return to the original state.

A bounded-length cyclic schedule implies bounded memory, but not vice versa, as will be discussed in more detail in Section 13.4. Depending on the descriptive power of the model used to represent the specification, these questions have different answers. One such model is dataflow graphs, which are commonly used for digital signal processing applications. In Static Dataflow (SDF) Graphs [1], the number of tokens produced by a process1 on an output port, or consumed by a process from an input port, is fixed and known statically, or at "compile time." Computationally efficient algorithms exist to answer questions 1 and 2 for any SDF specification [1]. Furthermore, all schedulable graphs have bounded-length schedules and the required memory is bounded.

When the specification contains data-dependent conditional constructs, SDF graphs are insufficient to model it. An extension of SDF to handle such constructs can be done in two different ways: (1) by associating data values with token flows, or (2) by introducing nondeterminism structurally (see Figure 13.2). Examples of the first modeling approach can be found in a rich body of research on BDF Graphs and their derivatives/extensions [6, 7, 9, 11]. A similar modeling mechanism is also exploited in scheduling approaches starting from networks of finite state machine-like models [10,14]. Interestingly, answering the question about the existence of bounded-length schedules for an arbitrary BDF graph can be done nearly as simply as for SDF. However, the status of the bounded memory problem in BDF is very different. Annotating tokens with values makes the BDF model Turing complete, and the problem of finding a bounded memory schedule becomes undecidable [6]. For this reason, papers devoted to BDF scheduling propose heuristic approaches for schedulability analysis. An example of this is given in Reference 9. The proposed method initially sets a bound on the size of each buffer based on the structural information of the specification, and tries to find a schedule within the bound. If a schedule is not found, the procedure heuristically increases the sizes of some buffers and repeats the search. In order to claim the absence of a schedule within a given bound, the reachability space of the system defined for the bound is completely analyzed. Other heuristics exploit clustering algorithms, which in case of success derive a bounded memory schedule, while in case of failure leave the question open [6].

The work given in Reference 8 employs a PN as the underlying model for the system specification and searches for a schedule in its reachability space. It abstracts data-dependent conditional constructs of the specification using nondeterministic choices (see Figure 13.2). This abstraction in general helps improve the efficiency of the scheduling procedure, while making the approach conservative. The PN model is not Turing complete, and there are only a few problems that are undecidable for PNs [15]. Nevertheless, the decidability of the problem of finding bounded memory schedules for a PN has not been proven or disproven. However, for the important subclass of equal-choice PNs (an extension of free-choice PNs), bounded memory schedules are found efficiently (if they exist). A list of modeling approaches and the complexity of their schedulability problems is shown in Table 13.1, where O(|Cycle_seq|) denotes that the problem is decidable and its computational complexity is linear in the length of the sequence that brings the specification back to its initial state (called a cyclic sequence).
The PN model is not Turing complete and there are only a few problems that are undecidable for PNs [15]. Nevertheless, decidability of the problem of finding bounded memory schedules for a PN has not been proven or disproven. However for the important subclass of equal-choice PNs (an extension of free-choice PNs), bounded memory schedules are found efficiently (if they exist). A list of modeling approaches and the complexity of their schedulability problems is shown in Table 13.1, where O(|Cycle_seq|) denotes that the problem is decidable and its computational complexity is linear 1 In the terminology of
dataflow graphs, a process is often called actor, and we may use these terms interchangeably
in this chapter.
FIGURE 13.2 The PN and BDF modeling approaches.
TABLE 13.1 Models for Embedded Systems and the Complexity of Scheduling Problems

                                SDF graph        BDF graph        PN, equal-choice   PN, general
    Modeling data dependence    No               Yes              Yes                Yes
    Bounded length schedule     O(|Cycle_seq|)   O(|Cycle_seq|)   O(|Cycle_seq|)     O(|Cycle_seq|)
    Bounded memory schedule     O(|Cycle_seq|)   Undecidable      O(|Cycle_seq|)     Unknown
Note, however, that the size of this cyclic sequence can be exponential in the size of the SDF graph. We will review the scheduling approaches based on PNs and on BDF in more detail in Sections 13.3 and 13.4, respectively.
13.3 QSS for PNs

13.3.1 Definitions

A PN is defined by a tuple (P, T, F, M0), where P and T are sets of places and transitions, respectively. F is a function from (P × T) ∪ (T × P) to the nonnegative integers. A marking M is another function from P to the nonnegative integers, where M[p] denotes the number of tokens at p in M. M0 is the initial marking. A PN can be represented by a directed bipartite graph, where an edge [u, v] exists if F(u, v) is positive, in which case F(u, v) is called the weight of the edge. A transition t is enabled at a marking M if M[p] ≥ F(p, t) for all p of P. In this case, one may fire the transition at the marking, which yields a marking M′ given by M′[p] = M[p] − F(p, t) + F(t, p) for each p of P. In the sequel, M −t→ M′ denotes the fact that a transition t is enabled at a marking M and M′ is obtained by firing t at M. A transition t is said to be a source if F(p, t) = 0 for all p of P. A marking M′ is said to be reachable from M if there is a sequence of transitions fireable from M that leads to M′. The set of markings reachable from the initial marking is denoted by R(M0).
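These definitions translate almost literally into code. The sketch below (illustrative only; the sizes and the splitting of F into pre- and post-incidence arrays are choices made here, not part of the formal model) implements the enabling and firing rules:

    #include <stdbool.h>

    #define NP 9   /* number of places      (hypothetical size) */
    #define NT 6   /* number of transitions (hypothetical size) */

    typedef struct {
        int pre[NP][NT];   /* pre[p][t]  = F(p, t) */
        int post[NP][NT];  /* post[p][t] = F(t, p) */
    } PetriNet;

    /* t is enabled at M iff M[p] >= F(p, t) for every place p. */
    bool enabled(const PetriNet *n, const int M[NP], int t) {
        for (int p = 0; p < NP; p++)
            if (M[p] < n->pre[p][t])
                return false;
        return true;
    }

    /* Firing t at M yields M'[p] = M[p] - F(p, t) + F(t, p). */
    void fire(const PetriNet *n, const int M[NP], int t, int Mnew[NP]) {
        for (int p = 0; p < NP; p++)
            Mnew[p] = M[p] - n->pre[p][t] + n->post[p][t];
    }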
A transition system is a tuple A = (S, Σ, →, s_in), where S is a set of states, Σ is an alphabet of symbols, → ⊆ S × Σ × S is the transition relation, and s_in is the initial state. Given s ∈ S, t ∈ Σ is said to be fireable at s if there exists s′ ∈ S such that (s, t, s′) ∈ →, and we denote this with s −t→ s′. If Σ corresponds to the set of transitions T of a PN, a transition system can be used to represent a portion of the reachability space of that PN. This concept will be used in the definition of a schedule in Section 13.3.3. A key notion for defining schedules is that of equal conflict sets. A pair of transitions ti and tj is said to be in equal conflict if F(p, ti) = F(p, tj) for all p of P. These transitions are in conflict in the sense that ti is enabled at a given marking if and only if tj is enabled, and it is not possible to fire both transitions at the marking. Equal conflict is an equivalence relation defined on the set of transitions, and each equivalence class is called an equal conflict set (ECS). By definition, if one transition of an ECS is enabled at a given marking, all the other transitions of the ECS are also enabled. Thus, we may say that the ECS itself is enabled at the marking. A place p is said to be a choice place if it has more than one successor transition. A choice place is equal choice (a generalization of free choice [16]) if all its successor transitions are in the same ECS. A PN is said to be equal choice if all its choice places are equal choice.
13.3.2 Specification Model

A specification is given as a set of processes communicating via FIFOs, as described in Section 13.1.2. A PN is then used to model the specification by employing a specific abstraction. The PN for the example shown in Figure 13.1(a) is shown in Figure 13.3. Communication operations on ports and the computation operations of the processes are modeled by transitions. Places are used to represent both sequencing within processes (a single token models the program counter) and FIFO communication (the tokens model the presence of the data items, while hiding their values). For an input (output) port connected to the environment, a source (sink) transition is connected to the place for the port. Overall, the PN represents the control flow of the programs of the specification. Place p7 is a choice place, while the transitions D and E form an ECS, as defined in the previous section. This choice place models the data-dependent control construct given by the termination condition of the for-loop in the specification (Figure 13.1[a]). We note that the PN does not model the termination condition itself, that is, it abstracts away the condition on data with which control constructs are resolved. An automatic procedure to produce this PN model has been developed for the C language, extended with constructs for implementing the operations of FIFO communication [8]. In addition to the read and write operations on FIFOs, the extended language supports another communication construct called select [17]. It realizes synchronization-dependent control, which determines the control flow of the program depending on the availability of data items on input ports. It accepts as argument a set of input ports, and nondeterministically selects one of them with data available at the port.
FIGURE 13.3 PN model for the example of Figure 13.1.
In case none of them has data, the process blocks until some is available. The select operations introduce nondeterminism in the specification, which some may consider a bad idea.2 However, they are essential in order to model efficiently systems whose control flow depends on the availability of data at some ports. For example, when specifying a component of a multimedia application, parameters such as the coefficients used in a filtering algorithm are typically provided nondeterministically at a control port, since they depend on the quality and size of the images being processed, which can be dynamically changed by a user of the device or by a global quality-of-service manager. In this case, the process looks at the control port as well as the data ports, and if new data are available at the control port, the process uses them for the filtering, while otherwise it simply proceeds using the current values of the coefficients. Always polling for new coefficient values at every new image would lead to an unnecessary loss of performance [17]. An example of a specification including select is shown in Figure 13.4(a); it has two processes, two input ports (IN and COEF), and one output port (OUT). The processes communicate with each other through the channel DATA. The process Filter extracts data inserted by GetData, multiplies them by a coefficient, and sends them to the environment through the port OUT, where the availability of the coefficient is tested by the select statement. Figure 13.4(b) shows a PN for this specification. In formally defining a schedule based on this PN model, it is necessary to clarify how the environment is modeled. This work, following the notation introduced in Reference 1, uses source transitions, as depicted by Tin and Tcoef in Figure 13.4(b). However, the basic model of Reference 1 is extended by distinguishing between two types of inputs, and hence of source transitions. The types are called controllable and uncontrollable, respectively. The uncontrollable inputs are the stimuli to the system being scheduled, that is, the system execution takes place as a reaction to events provided by this type of input. The objective of the scheduling is then to find a sequence of operations to be executed in each such reaction. The scheduling problem is defined under the assumption that all the uncontrollable inputs are independent with respect to each other and with respect to the execution of the system. This means that the system cannot tell when the stimuli are provided by the environment or how they are related, and thus no such information can be assumed when schedules are sought. Therefore, a schedule must be designed so that when the system is ready to react to a stimulus from one uncontrollable input, it is also ready to react to a stimulus from any other uncontrollable input. In Figure 13.4, all the inputs are uncontrollable. Controllable inputs, on the other hand, represent data from the environment that the system can acquire whenever it decides to do so. It follows that schedules can be sought under the assumption that if a read operation is executed on a controllable input, then the operation will always succeed in reading the specified amount of data from the input without blocking the execution of the system.
For example, in the specification given in Figure 13.1, the input to the port START was specified as uncontrollable while the port DATA was given as controllable. This environment model covers the one used in dataflow models such as SDF and BDF, in which one input (the first one selected by the scheduler) can be considered as uncontrollable, while all other inputs are controllable.
2. An essential property of dataflow networks and Kahn process networks is the fact that the result of the computation is independent of the order in which the processes are scheduled for execution. This property no longer holds if the select construct is added.

13.3.3 Schedulability Analysis

The problem to be solved is to find a schedule for a given PN and to generate the software code that implements the schedule. A schedule is formally defined as a transition system that satisfies the following properties with respect to the given PN.
PROCESS GetData (InPort IN, OutPort DATA) {
    float sample, sum;
    int i;
    while (1) {
        sum = 0;
        for (i = 0; i < N; i++) {
            READ(IN, sample, 1);      /* read one sample */
            sum += sample;
            WRITE(DATA, sample, 1);   /* forward it to Filter */
        }
        WRITE(DATA, sum/N, 1);        /* append the running average */
    }
}

PROCESS Filter (InPort DATA, InPort COEF, OutPort OUT) {
    float c, d;
    int j;
    c = 1; j = 0;
    while (1) {
        SELECT (DATA, COEF) {
            case DATA:
                READ(DATA, d, 1);
                if (j == N) {
                    j = 0;
                    d = d*c;
                    WRITE(OUT, d, 1);
                } else
                    j++;
                break;
            case COEF:
                READ(COEF, c, 1);
                break;
        }
    }
}

FIGURE 13.4 System specification and corresponding PN.
Definition 13.1 (Sequential schedule). Given a Petri net N = (P, T, F, M0), a sequential schedule of N is a transition system Sch = (S, T, →, s0) with the following properties:

1. S is finite and there is a mapping µ : S → R(M0), with µ(s0) = M0.3
2. If transition t is fireable in state s, with s −t→ s′, then µ(s) −t→ µ(s′) in N.
3. If t1 is fireable in s, then t2 is fireable in s if and only if t2 ∈ ECS(t1).
4. For each state s ∈ S, there is a path s −σ→ s′ −t→ for each uncontrollable source transition t of N.
Property 2 implies trace containment between Sch and N (any feasible trace in the schedule is feasible in the original PN). Property 3 indicates that one ECS is scheduled at each state. Finally, the existence of the path in property 4 ensures that any input event from the environment will eventually be served.

3. This mapping is required in order to enable the same state to be visited multiple times with different termination criteria.
Intuitively, scheduling can be viewed as a game between the scheduler and the environment. The rules of the game are the following:

• The environment makes a move by firing any of the uncontrollable source transitions.
• The scheduler picks any of the enabled transitions to fire, with two exceptions: (a) it has no control over choosing which one of the uncontrollable source transitions fires, and (b) it cannot resolve the choice for data-dependent constructs, which are described by equal-choice places.

In cases (a) and (b) the scheduler must explore all possible branches during the traversal of the reachability space, that is, fire all the transitions from the same ECS. However, it can decide the moment for serving the source transitions or for resolving an equal choice, because it can finitely postpone these by choosing some other enabled transitions to fire. The goal of the game is to process any input from the environment while keeping the traversed space (and hence the amount of memory required to implement the communication buffers) finite. In case of success, the result is both to classify the original PN as schedulable and to derive the set of states (the schedule) that the scheduler can visit while serving an arbitrary mix of source transitions. Under the assumption that the environment is sufficiently slow to allow the scheduler to fire all nonsource transitions, the schedule is an upper approximation of the set of states visited during the real-time execution of the specification. This is due to the fact that the scheduler is constructed taking into account the worst possible conditions, since it has no knowledge about the environment behavior and data dependencies.
13.3.4 Algorithmic Implementation

In this section, we describe an algorithm for finding a schedule for each uncontrollable source transition a of a given PN. The algorithm, which is fully described in Reference 8, gradually creates a rooted tree, and a postprocessing step creates a cycle for each leaf to generate a schedule. The algorithm initially creates a root node corresponding to the initial marking of the PN, and fires the source transition a, generating a new marking. From here, it tries to create a tree by firing enabled transitions. For each node that is added to the tree, it checks whether a termination condition is satisfied, or whether an ancestor with the same marking exists. In the latter case, the search along the path is stopped and the branch is closed into a loop with the ancestor node. To avoid exploring the possibly infinite reachability space of the PN, the algorithm uses a heuristic to identify a boundary of that space, so that it does not need to search beyond it [8].

If a schedule is found, the corresponding code that implements the schedule must be generated. Although a direct translation of the schedule into code is possible, it usually increases the code size, since different paths of the schedule may be associated with the same sequence of transitions. Optimizations are thus required to reduce the code size. Also, ports that originally belong to different processes might become part of the same final task, and therefore do not require any communication primitive, but rather are implemented using assignments or circular buffers, whose size can be statically determined by analyzing the schedule.

As an example, let us consider the system illustrated in Figure 13.1(a). The PN model for the two processes is shown in Figure 13.3, where the source port START is uncontrollable, while the source port DATA is controllable. The ports PORT and IN are connected through place p4. In the initial marking, places p2 and p6 have a token each. After creating the root node of the tree, the algorithm fires the only uncontrollable source transition, START, generating a new node in the schedule with marking p2p5p6. Then transitions C and DATA are both enabled, and we may decide to fire C. In the newly created node with marking p2p7, transitions D and E are both enabled, and they constitute an equal-choice set. Therefore, the algorithm explores the two branches, until it can close loops for both of them. The final schedule is shown in Figure 13.5.
FIGURE 13.5 Schedule for the PN of Figure 13.3.
The last step is to generate the code, already shown in Figure 13.1(b). A node in the schedule with multiple outgoing arcs corresponds to an if-then-else, and loops are implemented using the goto statement. Note that in this example no optimization has been performed to reduce code size. On the other hand, the communication between the two processes has been implemented using assignments in the single task that is generated.
13.4 QSS for Boolean Dataflow

13.4.1 Definitions

An SDF graph [1] is a directed graph D = (A, E) with actors A represented by nodes and arcs E representing connections between the actors. These connections convey values between nodes, similar to the tokens in PNs. Values arrive at actors respecting FIFO ordering. Two mapping functions, I and O, are defined from E to the nonnegative integers. They define the consumption and production rates of values for the connections between nodes, that is, for a connection e = (a, b) from an actor a to an actor b, O(e) (respectively I(e)) shows how many tokens are produced at (consumed from) e when the actor a (b) fires. The initial marking M0 tells how many tokens reside on the arcs E before the SDF graph starts an execution. An actor a fires if every input arc e carries at least I(e) tokens. Firing an actor consumes I(e) tokens from each input arc and produces O(e) tokens on every output arc. A connection of an actor to its input (or output) is denoted as an input (or output) port. A simple example of an SDF graph is shown in Figure 13.6. In its initial marking, only actor a is enabled. Firing a produces a token on each output port (arcs (a, c) and (a, b)); a needs to fire twice to enable c because I(a, c) = 2. The feasible sequence of actor firings a, a, c, b returns the graph to the original marking.

An extension of SDF graphs to capture specifications with data dependency results in adding to the model dynamic actors [6] that satisfy the following properties:

1. An input port may be a conditional port, where the number of tokens consumed by the port is given by a two-valued integer function of the value of a Boolean token received at the special input port (the control port) of the same actor. One of the two values of the function is zero.
FIGURE 13.6 SDF graph.

FIGURE 13.7 If-then-else BDF graph: (a) a program fragment (A; if (b) { C; } else { D; } E;), (b) its BDF model built with SWITCH and SELECT actors, and (c) the corresponding incidence matrix Γ.
2. Control ports are never conditional ports, and always transfer exactly one token per execution.

The canonical examples of this type of actor are SWITCH and SELECT4 (e.g., see Figure 13.7[b]). The SWITCH actor consumes an input token and a control token. If the control token is TRUE, the input token is copied to the output labeled T; otherwise it is copied to the output labeled F. The SELECT actor performs the inverse operation, reading a token from the T input if the control token is TRUE, otherwise reading from the F input, and copying the token to the output. Figure 13.7(b) shows an example of a BDF graph that uses SWITCH and SELECT actors to model the piece of program in Figure 13.7(a).

4. Note that this is different from the select operation introduced in Section 13.3.2, because it is a deterministic operation: it depends not on the number of available input tokens, which in turn may depend on the scheduling order, but rather on the value of the control port, which is guaranteed to be independent of the scheduling order.

13.4.2 Schedulability Analysis

13.4.2.1 Bounded Length Schedules

Deriving a bounded-length schedule for a BDF graph reduces to the following two steps:

1. Finding a sequence of actors (a cyclic sequence) that returns the graph to the initial marking.
2. Simulating the firing of the cyclic sequence to make sure that it is fireable under the given initial marking.

The first task can be done by solving the so-called system of balance equations. This requires constructing the incidence matrix Γ of the BDF graph, which contains, in position (j, i), the integer O(e_j) if the ith actor produces O(e_j) tokens on the jth arc, and −I(e_j) if the ith actor consumes I(e_j) tokens from the jth arc (self-loop arcs are ignored, since their consistency checking is trivial). For dynamic actors, the number of produced and consumed tokens depends on control ports. This is represented in the incidence matrix by using symbolic variables p_i (one for each Boolean stream) that are interpreted as ratios of TRUE
values out of all the values present in the stream (this ratio is (1 − p_i) for FALSE values). Then the system of equations to be solved is

\[
\Gamma \times r = 0
\]

where 0 is a vector with all entries equal to 0, and r is the repetition vector, with one entry, called r_i, for each actor, representing how many times actor i fires in order to bring the BDF graph back to the original marking. If a nonzero solution of this system exists, then the repetition vector shows how many times each actor must fire to return the graph to the initial marking. Applying the above procedure to the incidence matrix in Figure 13.7(c), corresponding to the BDF graph of Figure 13.7(b), one finds the repetition vector r = [1, 1, (1 − p), p, 1, 1, 1]. Note that the existence of a solution cannot depend on the value of p, since the values of the Boolean stream b are arbitrary. By simulating the firing of the actors according to the values of r for both p = 0 and p = 1, one can see that the repetition vector indeed describes a fireable sequence of actors, and the existence of a bounded-length schedule for the BDF graph of Figure 13.7(b) is proved. This procedure is effective for any arbitrary BDF graph [6].

13.4.2.2 Bounded Memory Schedule

If a bounded-length schedule is found, then it obviously provides a bounded memory schedule as well. However, the converse is not true. There are specifications that do not have a bounded-length schedule, but are perfectly schedulable with bounded memory. A common example is given by a loop with an unknown bound on the number of iterations (e.g., see Figure 13.1). For such specifications the length of the cyclic sequence is unbounded, because it depends on the number of loop iterations. The problem of finding bounded memory schedules for BDF graphs is undecidable [6]; hence conservative heuristic techniques, which may not find a solution even if one exists (exactly like the algorithm of Section 13.3.4), must be used. We describe two of them: clustering and state enumeration.

Clustering. The goal of the clustering algorithm is to map a BDF graph into the traditional control structures used by high-level languages, such as if-then-else and do-while, whenever possible. The subgraphs corresponding to these structures can then be treated as atomic actors. At first, adjacent actors with the same producing/consuming rates are merged into a single cluster, where possible. Actors may not be merged if this would create deadlock, or if the resulting cluster would not be a BDF actor (e.g., it may depend on a control arc that is hidden by the merge operation). Then clusters are enclosed into conditional or loop constructs, as required in order to match the token production and consumption rates of their neighbors. The procedure terminates when no more changes to the BDF graph are possible. At this point, if the interior of each cluster has a schedule of bounded length, and the top-level cluster does as well, then the entire graph can be scheduled with bounded memory.

State Enumeration. One can enumerate the states that the system can reach by simulating the execution of the BDF graph, similar to the scheduling approach described in Section 13.3.4. If the graph cannot be scheduled in bounded memory, however, a straightforward state enumeration procedure will not terminate. One possible solution is to impose an upper bound on the number of tokens that may appear on each arc, according to some heuristic, and to assume that there is a problem if this bound is exceeded.
A technique similar to this is used in Ptolemy’s dynamic dataflow scheduler [18].
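The balance-equation test of Section 13.4.2.1 is easy to experiment with. The following sketch is not from this chapter; the incidence matrix is an illustrative reconstruction of an if-then-else BDF graph with six actors, not a copy of Figure 13.7(c). It computes the symbolic repetition vector with SymPy:

    import sympy as sp

    p = sp.Symbol('p')  # long-run ratio of TRUE tokens in the Boolean stream

    # Rows = arcs, columns = actors (source, SWITCH, F-branch, T-branch,
    # SELECT, sink). Entry (a, i) is the number of tokens produced (positive)
    # or consumed (negative) on arc a by one firing of actor i.
    Gamma = sp.Matrix([
        [1, -1,     0,  0,  0,       0],   # source -> SWITCH
        [0,  1 - p, -1, 0,  0,       0],   # SWITCH (F) -> F-branch
        [0,  p,     0, -1,  0,       0],   # SWITCH (T) -> T-branch
        [0,  0,     1,  0, -(1 - p), 0],   # F-branch -> SELECT (F)
        [0,  0,     0,  1, -p,       0],   # T-branch -> SELECT (T)
        [0,  0,     0,  0,  1,      -1],   # SELECT -> sink
    ])

    null = Gamma.nullspace()        # solutions of Gamma * r = 0
    assert len(null) == 1           # one-dimensional: the graph is consistent
    r = null[0] / null[0][0]        # normalize so that the source fires once
    print(sp.simplify(r.T))         # [1, 1, 1 - p, p, 1, 1]

The nonzero solution [1, 1, 1 − p, p, 1, 1] mirrors the structure of the repetition vector found above: the conditional branches fire with the symbolic rates 1 − p and p, independently of the concrete value of p.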
13.4.3 Comparison to PN Model
Boolean DataFlow graphs, being Turing complete, provide a very powerful specification model. It is remarkable that, in spite of this, some important scheduling problems (such as the existence of a bounded length schedule) have simple and efficient solutions. When a designer seeks schedules with these properties, BDF graphs are an excellent choice. An attractive feature of BDF modeling is that it is easy to keep track of the consistency of decisions made by different actors consuming the same Boolean stream. This is automatically ensured through the use of FIFO semantics in storing Boolean values at actor ports.
In PN modeling, on the other hand, data dependencies are fully abstracted as nondeterministic choices. This makes the designer responsible for ensuring that different choices are resolved consistently when they stem from the same data.

Undecidability in the BDF case comes from the fact that establishing the equivalence of Boolean streams is itself undecidable. Therefore, ensuring the consistency of choices made by dynamic actors is possible only when the stream is syntactically the same (as for the SWITCH and SELECT actors in Figure 13.7[b]), and hence when a single p variable can be used to represent both. Note that in such cases, that is, with syntactic equivalence, an improved version of the tool generating a PN model from a C model could annotate the PN so as to make this equivalence explicit. However, no such capability is available from the PN-based tools; they resort to the simple but cumbersome techniques described in Reference 13. Hence the BDF scheduling implementation in Ptolemy [18] is more user-friendly in this respect.

The abstraction of data by nondeterministic choices in PNs is, however, of great importance when solving more difficult scheduling problems. Applications very commonly contain computations with an unknown number of iterations, and for them the most interesting scheduling problem is finding a bounded memory schedule. Here the power of the BDF model becomes a burden, and makes it very difficult to devise efficient heuristics to solve the problem. To illustrate this, let us look at Figure 13.8, which represents a BDF graph corresponding to the example specification of Figure 13.1, where diamonds labeled with F denote the initial marking of the corresponding arcs with the Boolean value "False." Applying clustering to this example does not yield a conclusive decision about its schedulability, because clustering cannot prove that the Boolean value "False" would ever be produced at the input arcs of the SELECT actors, which is needed in order to return the network to the initial marking. This is an inherent problem of the clustering approach: it is not clear how often it succeeds in the schedulability analysis, unless the specification was devised from the start to use only recognizable actor patterns.

To find a schedule for the example in Figure 13.8 one must use the state enumeration approach. However, contrary to the PN case, the state of a BDF graph must also include the values stored in the FIFO queues for all Boolean streams of dynamic actors. This leads to significant memory penalties when
FIGURE 13.8 BDF graph for the example of Figure 13.1. (Two processes containing SWITCH and SELECT actors and operator actors such as ∗, +, −1, and >0?, connected by DATA and START arcs; diamonds labeled F mark arcs initially carrying the Boolean value "False.")
storing the state graph. Even worse, it also significantly reduces the opportunities for pruning the explored reachability space based on various termination conditions. These conditions impose a partial order between states and avoid generating the reachability space beyond "ordered" states [6, 8]. For PNs the partial order is established purely by markings, while for BDF graphs it requires, in addition to markings, considering the values of the Boolean streams. Due to this, the state graphs of BDF graphs have sparser ordering relations and are significantly larger. Hence we feel that for bounded memory quasi-static schedulability analysis, the PN approach is simpler and more suitable, especially if the limitations of current translators from C to PNs are addressed.
13.5 Conclusions
This chapter described modeling methods and scheduling algorithms that bridge the gap between specification and implementation of reactive systems. From a specification given in terms of concurrent communicating processes, and by deriving intermediate representations based on PNs and dataflow graphs, one can (unfortunately not always) obtain a sequential schedule that can be efficiently implemented on a processor. Future work should consider better heuristics for finding such schedules, since the problem is undecidable in general once data-dependent choices come into play. Furthermore, it would be interesting to extend the approach by considering sequential and concurrent implementations on several resources (e.g., CPUs and custom datapaths) [19].

Another body of future research concerns the extension of the notion of schedule into the time domain, in order to cope with performance constraints; all the approaches considered in this chapter assume infinite processing speed with respect to the speed of the environment. For real-time applications one would need to extend the scheduling frameworks with explicit annotations of system events with delays, and with timing-driven algorithms for schedule construction.
References

[1] E.A. Lee and D.G. Messerschmitt. Static scheduling of synchronous data flow graphs for digital signal processing. IEEE Transactions on Computers, C-36(1), 24–35, 1987.
[2] G. Kahn. The semantics of a simple language for parallel programming. In Proceedings of the IFIP Congress, August 1974.
[3] C.A.R. Hoare. Communicating Sequential Processes. International Series in Computer Science. Prentice Hall, Hertfordshire, 1985.
[4] N. Halbwachs. Synchronous Programming of Reactive Systems. Kluwer Academic Publishers, Boston, MA, 1993.
[5] D. Harel, H. Lachover, A. Naamad, A. Pnueli, M. Politi, R. Sherman, A. Shtull-Trauring, and M. Trakhtenbrot. STATEMATE: a working environment for the development of complex reactive systems. IEEE Transactions on Software Engineering, 16(4), 403–414, 1990.
[6] J. Buck. Scheduling Dynamic Dataflow Graphs with Bounded Memory Using the Token Flow Model. Ph.D. thesis, University of California, Berkeley, 1993.
[7] J.T. Buck. Static scheduling and code generation from dynamic dataflow graphs with integer valued control streams. In Proceedings of the 28th Asilomar Conference on Signals, Systems, and Computers, October 1994.
[8] J. Cortadella, A. Kondratyev, L. Lavagno, C. Passerone, and Y. Watanabe. Quasi-static scheduling of independent tasks for reactive systems. IEEE Transactions on Computer-Aided Design, 24(9), 2004.
[9] T.M. Parks. Bounded Scheduling of Process Networks. Ph.D. thesis, Department of EECS, University of California, Berkeley, 1995. Technical Report UCB/ERL 95/105.
[10] K. Strehl, L. Thiele, D. Ziegenbein, R. Ernst et al. Scheduling hardware/software systems using symbolic techniques. In Proceedings of the International Workshop on Hardware/Software Codesign, 1999.
[11] P. Wauters, M. Engels, R. Lauwereins, and J.A. Peperstraete. Cyclo-dynamic dataflow. In Proceedings of the 4th EUROMICRO Workshop on Parallel and Distributed Processing, January 1996.
[12] T. Stefanov, C. Zissulescu, A. Turjan, B. Kienhuis, and E. Deprettere. System design using Kahn process networks: the Compaan/Laura approach. In Proceedings of the Design Automation and Test in Europe Conference, February 2004.
[13] G. Arrigoni, L. Duchini, L. Lavagno, C. Passerone, and Y. Watanabe. False path elimination in quasi-static scheduling. In Proceedings of the Design Automation and Test in Europe Conference, March 2002.
[14] F. Thoen, M. Cornero, G. Goossens, and H. De Man. Real-time multi-tasking in software synthesis for information processing systems. In Proceedings of the International System Synthesis Symposium, 1995.
[15] J. Esparza. Decidability and complexity of Petri net problems – an introduction. In Lectures on Petri Nets I: Basic Models, Advances in Petri Nets, Lecture Notes in Computer Science, vol. 1491, 1996, pp. 374–428.
[16] T. Murata. Petri nets: properties, analysis, and applications. Proceedings of the IEEE, 77(4), 541–580, 1989.
[17] E.A. de Kock, G. Essink, W.J.M. Smits, P. van der Wolf, J.-Y. Brunel, W.M. Kruijtzer, P. Lieverse, and K.A. Vissers. YAPI: application modeling for signal processing systems. In Proceedings of the 37th Design Automation Conference, June 2000.
[18] J. Buck, S. Ha, E.A. Lee, and D.G. Messerschmitt. Ptolemy: a framework for simulating and prototyping heterogeneous systems. International Journal of Computer Simulation, 4(2), 1994.
[19] J. Cortadella, A. Kondratyev, L. Lavagno, A. Taubin, and Y. Watanabe. Quasi-static scheduling for concurrent architectures. Fundamenta Informaticae, 62, 171–196, 2004.
Timing and Performance Analysis

14 Determining Bounds on Execution Times (Reinhard Wilhelm)
15 Performance Analysis of Distributed Embedded Systems (Lothar Thiele and Ernesto Wandeler)
14 Determining Bounds on Execution Times
Reinhard Wilhelm, Universität des Saarlandes

14.1 Introduction: Tool Architecture and Algorithm; Timing Anomalies; Contexts
14.2 Cache-Behavior Prediction: Cache Memories; Cache Semantics; Abstract Semantics
14.3 Pipeline Analysis: Simple Architectures without Timing Anomalies; Processors with Timing Anomalies; Algorithm Pipeline-Analysis; Pipeline Modeling; Formal Models of Abstract Pipelines; Pipeline States
14.4 Path Analysis Using Integer Linear Programming
14.5 Other Ingredients: Value Analysis; Control Flow Specification and Analysis; Frontends for Executables
14.6 Related Work: A (Partly) Dynamic Method; Purely Static Methods
14.7 State of the Art and Future Extensions
Acknowledgments
References
Run-time guarantees play an important role in the area of embedded systems and especially hard real-time systems. These systems are typically subject to stringent timing constraints, which often result from the interaction with the surrounding physical environment. It is essential that the computations are completed within their associated time bounds; otherwise severe damage may result, or the system may be unusable. Therefore, a schedulability analysis has to be performed which guarantees that all timing constraints will be met. Schedulability analyses require upper bounds for the execution times of all tasks in the system to be known. These bounds must be safe, that is, they must never underestimate the real execution time. Furthermore, they should be tight, that is, the overestimation should be as small as possible. In modern microprocessor architectures, caches, pipelines, and all kinds of speculation are key features for improving (average-case) performance. Unfortunately, they make the analysis of the timing behavior of instructions very difficult, since the execution time of an instruction depends on the execution history. A lack of precision in the predicted timing behavior may lead to a waste of hardware resources, which would have to be invested in order to meet the requirements. For products which are manufactured
in high quantities, for example, for the automobile or telecommunications markets, this would result in intolerable expenses. The subject of this chapter is one particular approach, and the subtasks involved in computing safe and precise bounds on the execution times for real-time systems.
14.1 Introduction
Hard real-time systems are subject to stringent timing constraints which are dictated by the surrounding physical environment. We assume that a real-time system consists of a number of tasks, which realize the required functionality. A schedulability analysis for this set of tasks on a given hardware has to be performed in order to guarantee that all the timing constraints of these tasks will be met ("timing validation"). Existing techniques for schedulability analysis require upper bounds for the execution times of all the system's tasks to be known. These upper bounds are commonly called the worst-case execution times (WCETs), a misnomer that causes a lot of confusion and will therefore not be adopted in this presentation. In analogy, lower bounds on the execution time have been named best-case execution times (BCETs). These upper bounds (and lower bounds) have to be safe, that is, they must never underestimate (overestimate) the real execution time. Furthermore, they should be tight, that is, the overestimation (underestimation) should be as small as possible.

Figure 14.1 depicts the most important concepts of our domain. The system shows a certain variation of execution times depending on the input data and on the behavior of the environment. In general, the state space is too large to exhaustively explore all possible executions and thus determine the exact worst-case and best-case execution times, WCET and BCET, respectively. Some abstraction of the system is necessary to make a timing analysis of the system feasible. These abstractions lose information, and thus are responsible for the distance between WCETs and upper bounds and between BCETs and lower bounds. How much is lost depends both on the methods used for timing analysis and on system properties, such as the hardware architecture and the cleanness of the software. The two distances mentioned above, termed upper predictability and lower predictability, can thus be seen as a measure for the timing predictability of the system. Experience has shown that the two predictabilities can be quite different, cf. Reference 1. The methods used to determine upper bounds and lower bounds are the same; we will concentrate on the determination of upper bounds unless otherwise stated.

Methods to compute sharp bounds for processors with fixed execution times for each instruction have long been established [2,3]. However, in modern microprocessor architectures caches, pipelines, and all kinds of speculation are key features for improving (average-case) performance. Caches are used to bridge the gap between processor speed and the access time of main memory. Pipelines enable acceleration by overlapping the executions of different instructions. The consequence is that the execution time of individual instructions, and thus the contribution of one execution of an instruction to the program's execution time, can vary widely.
FIGURE 14.1 Basic notions concerning timing analysis of systems. (The time axis shows, in order: the lower bound, the best case, the worst case, and the upper bound of the variation of execution times.)
FIGURE 14.2 Different paths through the execution of a multiply instruction. Unlabeled transitions take 1 cycle.
The interval of execution times for one instruction is bounded by the execution times of the following two cases:

• The instruction goes "smoothly" through the pipeline: all loads hit the cache, no pipeline hazard happens, that is, all operands are ready, and no resource conflicts with other currently executing instructions exist.
• "Everything goes wrong," that is, instruction and/or operand fetches miss the cache, resources needed by the instruction are occupied, etc.

Figure 14.2 shows the different paths through a multiply instruction of a PowerPC processor. The instruction-fetch phase may find the instruction in the cache (cache hit), in which case it takes 1 cycle to load it. In the case of a cache miss, it may take something like 30 cycles to load the memory block containing the instruction into the cache. The instruction needs an arithmetic unit, which may be occupied by a preceding instruction; waiting for the unit to become free may take up to 19 cycles. This latency would not occur if the instruction fetch had missed the cache, because the cache-miss penalty of 30 cycles would have allowed any preceding instruction to terminate its arithmetic operation. The time it takes to multiply two operands depends on the size of the operands: for small operands, 1 cycle is enough; for larger ones, 3 cycles are needed. When the operation has finished, it has to be retired in the order it appeared in the instruction stream. The processor keeps a queue for instructions waiting to be retired; waiting for a place in this queue may take up to 6 cycles. On the dashed path, where the execution always takes the fast way, the overall execution time is 4 cycles. On the dotted path, where it always takes the slowest way, the overall execution time is 41 cycles.

We will call any increase in execution time during an instruction's execution a timing accident, and the number of cycles by which it increases, the timing penalty of this accident. Timing penalties for an instruction can add up to several hundred processor cycles. Whether the execution of an instruction encounters a timing accident depends on the execution state, for example, the contents of the cache(s) and the occupancy of other resources, and thus on the execution history. It is therefore obvious that the attempt to predict or exclude timing accidents needs information about the execution history. For certain classes of architectures, namely those without the timing anomalies of Section 14.1.2, excluding timing accidents means decreasing the upper bounds. However, for those with timing anomalies this assumption is not true.

14.1.1 Tool Architecture and Algorithm
A more or less standard architecture for timing-analysis tools has emerged [4–6]. Figure 14.3 shows one instance of this architecture.
14.1.1 Tool Architecture and Algorithm A more or less standard architecture for timing-analysis tools has emerged [4–6]. Figure 14.3 shows one instance of this architecture. A first phase, depicted on the left, predicts the behavior of processor © 2006 by Taylor & Francis Group, LLC
14-4
Embedded Systems Handbook
FIGURE 14.3 The architecture of the aiT timing-analysis tool.
A first phase, depicted on the left, predicts the behavior of processor components for the instructions of the program. It usually consists of a sequence of static program analyses of the program. Together, they allow the derivation of safe upper bounds for the execution times of basic blocks. A second phase, the column on the right, computes an upper bound on the execution times over all possible paths of the program. This is realized by mapping the control flow of the program to an Integer Linear Program and solving this by appropriate methods (a small sketch follows after the list below). This architecture has been successfully used to determine precise upper bounds on the execution times of real-time programs running on processors used in embedded systems [1,7–10]. A commercially available tool, aiT by AbsInt, cf. http://www.absint.de/wcet.htm, was implemented and is used in the aeronautics and automotive industries.

The structure of the first phase, processor-behavior prediction, often called microarchitecture analysis, may vary depending on the complexity of the processor architecture. A first, modular approach would be the following:

1. Cache-behavior prediction determines statically and approximately the contents of caches at each program point. For each access to a memory block, it is checked whether the analysis can safely predict a cache hit. Information about cache contents can be forgotten after the cache analysis; only the miss/hit information is needed by the pipeline analysis.
2. Pipeline-behavior prediction analyzes how instructions pass through the pipeline, taking cache-hit or cache-miss information into account. The cache-miss penalty is assumed for all cases where a cache hit cannot be guaranteed. At the end of simulating one instruction, the pipeline analysis continues with only those states that show the locally maximal execution times; all others can be forgotten.
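To make the second phase concrete, here is a minimal sketch of such an Integer Linear Program for a hypothetical four-block CFG whose loop is bounded by ten iterations. The block time bounds, the loop bound, and the use of the PuLP solver are all illustrative assumptions, not part of the aiT tool:

    import pulp

    t = {'entry': 5, 'loop': 40, 'body': 25, 'exit': 5}  # cycle bounds per block
    prob = pulp.LpProblem('upper_bound', pulp.LpMaximize)
    x = {b: pulp.LpVariable('x_' + b, lowBound=0, cat='Integer') for b in t}

    prob += pulp.lpSum(t[b] * x[b] for b in t)   # total execution time
    prob += x['entry'] == 1                      # the program runs once
    prob += x['exit'] == 1
    prob += x['loop'] == x['entry'] + x['body']  # flow into the loop header
    prob += x['body'] <= 10 * x['entry']         # loop bound constraint
    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    print(pulp.value(prob.objective))            # 700 cycles for these numbers

Maximizing the weighted execution counts x_i subject to control-flow and loop-bound constraints yields an upper bound over all paths, which is exactly the role of the second phase.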
14.1.2 Timing Anomalies
Unfortunately, this approach is not safe for many processor architectures. Most powerful microprocessors have so-called timing anomalies. Timing anomalies are counter-intuitive influences of the (local) execution time of one instruction on the (global) execution time of the whole program [11]. The interaction of
several processor features can be such that a locally faster execution of an instruction leads to a globally longer execution time of the whole program. For example, a cache miss contributes the cache-miss penalty to the execution time of a program. It was, however, observed for the MCF 5307 [12] that a cache miss may actually speed up program execution. Since the MCF 5307 has a unified cache and the fetch and execute pipelines are independent, the following can happen: a data access that is a cache hit is served directly from the cache. At the same time, the fetch pipeline fetches another instruction block from main memory, performing branch prediction and replacing two lines of data in the cache. These may be reused later on and cause two misses. If the data access had been a cache miss, the instruction fetch pipeline might not have fetched those two lines, because the execution pipeline may have resolved a misprediction before those lines were fetched.

The general case of a timing anomaly is the following. Different assumptions about the processor's execution state, for example, the fact that the instruction is or is not in the instruction cache, will result in a difference ΔTlocal of the execution time of the instruction between these two cases. Either assumption may lead to a difference ΔT of the global execution time compared to the other one. We say that a timing anomaly occurs if either

• ΔTlocal < 0, that is, the instruction executes faster, and ΔT < ΔTlocal (the overall execution is accelerated by more than the acceleration of the instruction) or ΔT > 0 (the program runs longer than before); or
• ΔTlocal > 0, that is, the instruction takes longer to execute, and ΔT > ΔTlocal (the overall execution is extended by more than the delay of the instruction) or ΔT < 0 (the overall execution of the program takes less time than before).

The case ΔTlocal < 0 ∧ ΔT > 0 is the critical one for our timing analysis. It makes it impossible to use local worst cases for the calculation of the program's execution time; the analysis has to follow all possible paths, as will be explained in Section 14.3.
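The MCF 5307 scenario above can be captured by a toy model. The cycle counts and block names below are invented for illustration; the point is only that the local worst case (the miss) produces the globally shorter run, that is, ΔTlocal > 0 but ΔT < 0:

    HIT, MISS = 1, 10  # hypothetical cycle counts

    def run(first_access_hits):
        time = 0
        cached = {'y1', 'y2'}        # data blocks needed later
        if first_access_hits:
            time += HIT
            cached -= {'y1', 'y2'}   # the concurrent prefetch evicts both
        else:
            time += MISS             # the miss stalls the fetch pipeline, so
                                     # the polluting prefetch never happens
        for blk in ('y1', 'y2'):     # two later accesses
            time += HIT if blk in cached else MISS
        return time

    print(run(first_access_hits=True))    # 21 cycles
    print(run(first_access_hits=False))   # 12 cycles: locally slower,
                                          # globally faster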
14.1.3 Contexts
The contribution of an individual instruction to the total execution time of a program may vary widely depending on the execution history. For example, the first iteration of a loop typically loads the caches, and later iterations profit from the loaded memory blocks being in the caches. In this case, the execution of an instruction in the first iteration encounters one or more cache misses and pays with the cache-miss penalty. Later executions, however, will execute much faster because they hit the cache. A similar observation holds for dynamic branch predictors: they may need a few iterations until they stabilize and predict correctly. Therefore, precision is increased if instructions are considered in their control-flow context, that is, the way control reached them.

Contexts are associated with basic blocks, that is, maximally long straight-line code sequences that can only be entered at the first instruction and left at the last. They indicate through which sequence of function calls and loop iterations control arrived at the basic block. Thus, when analyzing the cache behavior of a loop, precision can be increased by regarding the first iteration of the loop and all other iterations separately; more precisely, by unrolling the loop once and then analyzing the resulting code. (Actually, this unrolling transformation need not really be performed, but can be incorporated into the iteration strategy of the analyzer; we therefore speak of virtually unrolling the loops.)

Definition 14.1 Let p be a program with set of functions P = {p1, p2, ..., pn} and set of loops L = {l1, l2, ..., lm}. A word c over the alphabet P ∪ (L × IN) is called a context for a basic block b, if b can be reached by calling the functions and iterating through the loops in the order given in c.
Even if all loops have static loop bounds and recursion is also bounded, there are in general too many contexts to consider them exhaustively. A heuristic is used to keep relevant contexts apart and to summarize the rest conservatively, if their influence on the behavior of instructions does not differ significantly. Experience has shown [10] that a few first iterations and recursive calls are sufficient to "stabilize" the behavior information, as the above example indicates, and that the right differentiation of contexts is decisive for the precision of the prediction [13]. A particular choice of contexts transforms the call graph and the control-flow graph into a context-extended control-flow graph by virtually unrolling the loops and virtually inlining the functions as indicated by the contexts. The formal treatment of this concept is quite involved and shall not be given here; it can be found in Reference 14.
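A context in the sense of Definition 14.1 can be encoded as a word of function and (loop, iteration) letters. The following sketch (a hypothetical encoding, not the analyzer's data structure) keeps the first K iterations of a loop apart and collapses all later ones, which is the "first iteration versus all others" policy described above:

    K = 1  # number of loop iterations kept apart

    def enter_loop(ctx, loop):
        return ctx + ((loop, 0),)

    def next_iteration(ctx):
        loop, i = ctx[-1]
        return ctx[:-1] + ((loop, min(i + 1, K)),)  # iterations > K collapse

    ctx = ('main',)
    ctx = enter_loop(ctx, 'l1')
    for _ in range(4):
        print(ctx)
        ctx = next_iteration(ctx)
    # ('main', ('l1', 0)) once, then ('main', ('l1', 1)) for every later
    # iteration: the later iterations share one summarized context.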
14.2 Cache-Behavior Prediction
Abstract Interpretation [15] is used to compute invariants about cache contents. How the behavior of programs on processor pipelines is predicted follows in Section 14.3.
14.2.1 Cache Memories
A cache can be characterized by three major parameters:

• Capacity is the number of bytes it may contain.
• Line size (also called block size) is the number of contiguous bytes that are transferred from memory on a cache miss. The cache can hold at most n = capacity/line size blocks.
• Associativity is the number of cache locations where a particular block may reside. n/associativity is the number of sets of a cache.

If a block can reside in any cache location, then the cache is called fully associative. If a block can reside in exactly one location, then it is called direct mapped. If a block can reside in exactly A locations, then the cache is called A-way set associative. The fully associative and the direct mapped caches are special cases of the A-way set associative cache, where A = n and A = 1, respectively.

In the case of an associative cache, a cache line has to be selected for replacement when the cache is full and the processor requests further data. This is done according to a replacement strategy. Common strategies are LRU (Least Recently Used), FIFO (First In First Out), and random.

The set where a memory block may reside in the cache is uniquely determined by the address of the memory block, that is, the behavior of the sets is independent of each other. The behavior of an A-way set associative cache is completely described by the behavior of its n/A fully associative sets; this holds also for direct mapped caches, where A = 1. For the sake of space, we restrict our description to the semantics of fully associative caches with LRU replacement strategy. More complete descriptions that explicitly describe direct mapped and A-way set associative caches can be found in References 8 and 16.
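The parameters above determine how an address maps to a set; the following lines (standard textbook arithmetic with made-up parameter values, not taken from the chapter) make the relationship explicit:

    capacity  = 8192  # bytes (hypothetical)
    line_size = 32    # bytes per block
    assoc     = 4     # A

    n_lines = capacity // line_size   # n = capacity / line size -> 256
    n_sets  = n_lines // assoc        # n / associativity        -> 64

    def cache_set(address):
        block = address // line_size  # memory block number
        return block % n_sets         # the set is fixed by the address alone

    print(cache_set(0x1F40))          # 58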
14.2.2 Cache Semantics
In the following, we consider a (fully associative) cache as a set of cache lines L = {l1, ..., ln} and the store as a set of memory blocks S = {s1, ..., sm}. To indicate the absence of any memory block in a cache line, we introduce a new element I and set S′ = S ∪ {I}.

Definition 14.2 (concrete cache state) A (concrete) cache state is a function c : L → S′. Cc denotes the set of all concrete cache states. The initial cache state cI maps all cache lines to I.

If c(li) = sy for a concrete cache state c, then i is the relative age of the memory block sy according to the LRU replacement strategy, and not necessarily its physical position in the cache hardware.
FIGURE 14.4 Update of a concrete fully associative (sub-) cache.
The update function describes the effect on the cache of referencing a block in memory. The referenced memory block sx moves into l1 if it was in the cache already. All memory blocks in the cache that had been used more recently than sx increase their relative age by one, that is, they are shifted by one position to the next cache line. If the referenced memory block was not yet in the cache, it is loaded into l1 after all memory blocks in the cache have been shifted and the "oldest," that is, least recently used, memory block has been removed from the cache if the cache was full.

Definition 14.3 (cache update) A cache update function U : Cc × S → Cc determines the new cache state for a given cache state and a referenced memory block.

Updates of fully associative caches with LRU replacement strategy are pictured as in Figure 14.4.

14.2.2.1 Control Flow Representation
We represent programs by control flow graphs consisting of nodes and typed edges. The nodes represent basic blocks. A basic block is a sequence (of fragments) of instructions in which control flow enters at the beginning and leaves at the end without halt or possibility of branching except at the end. For cache analysis, it is most convenient to have one memory reference per control flow node. Therefore, our nodes may represent the different fragments of machine instructions that access memory. For imprecisely determined addresses of data references, one can use a set of possibly referenced memory blocks. We assume that for each basic block the sequence of references to memory is known. (This is appropriate for instruction caches but can be too restrictive for data caches and combined caches; see References 7 and 16 for weaker assumptions.) That is, there exists a mapping from control flow nodes to sequences of memory blocks: L : V → S∗. We can describe the effect of such a sequence on a cache with the help of the update function U. Therefore, we extend U to sequences of memory references by sequential composition: U(c, sx1, ..., sxy) = U(... (U(c, sx1)), ..., sxy). The cache state for a path (k1, ..., kp) in the control flow graph is given by applying U to the initial cache state cI and the concatenation of all sequences of memory references along the path: U(cI, L(k1), ..., L(kp)).

The Collecting Semantics of a program gathers at each program point the set of all execution states which the program may encounter at this point during some execution. A semantics on which to base a cache analysis has to model cache contents as part of the execution state. One could thus compute the collecting semantics and project the execution states onto their cache components to obtain the set of all possible cache contents for a given program point. However, the collecting semantics is in general not computable. Instead, one restricts the standard semantics to only those program constructs which involve the cache, that is, memory references. Only they have an effect on the cache, modelled by the cache update function U. This coarser semantics may execute program paths which are not executable in the start semantics.
Therefore, the Collecting Cache Semantics of a program computes a superset of the set of all concrete cache states occurring at each program point.

Definition 14.4 (Collecting Cache Semantics) The Collecting Cache Semantics of a program is

Ccoll(p) = {U(cI, L(k1), ..., L(kn)) | (k1, ..., kn) path in the CFG leading to p}

This collecting semantics would be computable, although often of enormous size. Therefore, another step abstracts it into a compact representation, so-called abstract cache states. Note that any information drawn from the abstract cache states allows safe deductions about sets of concrete cache states, that is, only precision may be lost in this two-step process. Correctness is guaranteed.
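As an executable illustration of Definition 14.3 and of the extension of U to reference sequences, one might write the following (a minimal sketch, with the concrete cache state represented as a list ordered by age, index 0 being the youngest; it mirrors Figure 14.4 but is not the chapter's code):

    I = None  # absence of a memory block

    def U(c, s):
        c = list(c)
        if s in c:
            i = c.index(s)           # s was cached: younger blocks age by one
            return [s] + c[:i] + c[i+1:]
        return [s] + c[:-1]          # s was not cached: the oldest is evicted

    def U_seq(c, refs):              # sequential composition over a sequence
        for s in refs:
            c = U(c, s)
        return c

    c = ['z', 'y', 'x', 't']
    print(U(c, 's'))                      # ['s', 'z', 'y', 'x']  (t evicted)
    print(U(['z', 's', 'x', 't'], 's'))   # ['s', 'z', 'x', 't']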
14.2.3 Abstract Semantics
The specification of a program analysis consists of the specification of an abstract domain and of the abstract semantic functions, mostly called transfer functions. The least upper bound operator ⊔ of the domain combines information when control flow merges.

We present two analyses. The must analysis determines a set of memory blocks that are in the cache at a given program point whenever execution reaches this point. The may analysis determines all memory blocks that may be in the cache at a given program point. The latter analysis is used to determine the absence of a memory block in the cache. The analyses are used to compute a categorization for each memory reference describing its cache behavior; the categories are described in Table 14.1. The domains for our abstract interpretations consist of abstract cache states.

Definition 14.5 (abstract cache state) An abstract cache state ĉ : L → 2^S maps cache lines to sets of memory blocks. Ĉ denotes the set of all abstract cache states.

The position of a line in an abstract cache will, as in the case of concrete caches, denote the relative age of the corresponding memory blocks. Note, however, that the domains of abstract cache states will have different partial orders and that the interpretation of abstract cache states will be different in the different analyses.

The following functions relate concrete and abstract domains. An extraction function, extr, maps a concrete cache state to an abstract cache state. The abstraction function, abstr, maps sets of concrete cache states to their best representation in the domain of abstract cache states; it is induced by the extraction function. The concretization function, concr, maps an abstract cache state to the set of all concrete cache states represented by it. It allows the interpretation of abstract cache states and is often induced by the abstraction function, cf. Reference 17.

Definition 14.6 (extraction, abstraction, concretization functions) The extraction function extr : Cc → Ĉ forms singleton sets from the images of the concrete cache states it is applied to, that is, extr(c)(li) = {sx} if c(li) = sx. The abstraction function abstr : 2^Cc → Ĉ is defined by abstr(C) = ⊔{extr(c) | c ∈ C}. The concretization function concr : Ĉ → 2^Cc is defined by concr(ĉ) = {c | extr(c) ⊑ ĉ}.
TABLE 14.1 Categorizations of Memory References and Memory Blocks

Category         Abbreviation   Meaning
Always hit       ah             The memory reference will always result in a cache hit.
Always miss      am             The memory reference will always result in a cache miss.
Not classified   nc             The memory reference could neither be classified as ah nor am.
So much for the commonalities of all the domains to be designed. Note that all the constructions are parameterized in the partial order ⊑ and the join operator ⊔. The transfer functions, the abstract cache update functions, all denoted Û, will describe the effects of a control flow node on an element of the abstract domain. They will be composed of two parts:

1. "Refreshing" the accessed memory block, that is, inserting it into the youngest cache line.
2. "Aging" some other memory blocks already in the abstract cache.

14.2.3.1 Termination of the Analyses
There are only a finite number of cache lines and, for each program, a finite number of memory blocks. This means that the domain of abstract cache states ĉ : L → 2^S is finite. Hence, every ascending chain is finite. Additionally, the abstract cache update functions Û are monotonic. This guarantees that all the analyses will terminate.

14.2.3.2 Must Analysis
As explained above, the must analysis determines a set of memory blocks that are in the cache at a given program point whenever execution reaches this point. Good information, in the sense of being valuable for the prediction of cache hits, is the knowledge that a memory block is in this set; the bigger the set, the better. As we will see, additional information will even tell how long a block will at least stay in the cache. This is connected to the "age" of a memory block. The partial order on the must-domain is therefore as follows. Take an abstract cache state ĉ. Above ĉ in the domain, that is, less precise, are states where memory blocks from ĉ are either missing or are older than in ĉ. Therefore, the ⊔-operator applied to two abstract cache states ĉ1 and ĉ2 will produce a state ĉ containing only those memory blocks contained in both, and will give them the maximum of their ages in ĉ1 and ĉ2 (see Figure 14.5). The positions of the memory blocks in the abstract cache state are thus upper bounds on the ages of the memory blocks in the concrete caches occurring in the collecting cache semantics. Concretization of an abstract cache state ĉ produces the set of all concrete cache states which contain all the memory blocks contained in ĉ, with ages not older than in ĉ. Cache lines not filled by these are filled with other memory blocks.

We use the abstract cache update function depicted in Figure 14.6. Let us argue the correctness of this update function. The following theorem formulates the soundness of the must-cache analysis.

Theorem 14.1 Let n be a program point, ĉin the abstract cache state at the entry to n, and s a memory line in ĉin with age k. Then: (i) for each 1 ≤ k ≤ A there are at most k memory lines in lines 1, 2, ..., k; and (ii) on all paths to n, s is in the cache with age at most k.

The solution of the must analysis problem is interpreted as follows: let ĉ be an abstract cache state at some program point. If sx ∈ ĉ(li) for a cache line li, then sx will definitely be in the cache whenever execution reaches this program point. A reference to sx is categorized as always hit (ah).
FIGURE 14.5 Combination for must analysis.
FIGURE 14.6 Update of an abstract fully associative (sub-) cache.

FIGURE 14.7 Combination for may analysis.
There is even a stronger interpretation of the fact that sx ∈ ĉ(li): sx will stay in the cache at least for the next n − i references to memory blocks that are not in the cache or are older than the memory blocks in ĉ, where sa being older than sb means: ∃li, lj : sa ∈ ĉ(li), sb ∈ ĉ(lj), i > j.

14.2.3.3 May Analysis
To determine whether a memory block sx will never be in the cache, we compute the complementary information, that is, sets of memory blocks that may be in the cache. "Good" information here is that a memory block is not in this set, because such a memory block can be classified as definitely not in the cache whenever execution reaches the given program point. Thus, the smaller the sets, the better. Additionally, older blocks will reach the desired situation, namely removal from the cache, faster than younger ones. The partial order on this domain is therefore as follows. Take some abstract cache state ĉ. Above ĉ in the domain, that is, less precise, are those states which contain additional memory blocks or where memory blocks from ĉ are younger than in ĉ. Therefore, the ⊔-operator applied to two abstract cache states ĉ1 and ĉ2 will produce a state ĉ containing those memory blocks contained in ĉ1 or ĉ2, and will give them the minimum of their ages in ĉ1 and ĉ2 (see Figure 14.7). The positions of the memory blocks in the abstract cache state are thus lower bounds on the ages of the memory blocks in the concrete caches occurring in the collecting cache semantics.

The solution of the may analysis problem is interpreted as follows: the fact that sx is in the abstract cache ĉ means that sx may be in the cache during some execution when the program point is reached. If sx is not in ĉ(li) for any li, then it will definitely not be in the cache on any execution. A reference to sx is then categorized as always miss (am).
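The two join operators and the must-update can be prototyped directly on a compact representation of abstract cache states as mappings from memory blocks to age bounds (0 = youngest). This is a sketch of the standard constructions, not the chapter's implementation; A is the associativity:

    A = 4  # number of lines of the fully associative (sub-) cache

    def must_join(c1, c2):
        # intersection of blocks, maximum (oldest) age bound
        return {b: max(c1[b], c2[b]) for b in c1.keys() & c2.keys()}

    def may_join(c1, c2):
        # union of blocks, minimum (youngest) age bound
        out = dict(c1)
        for b, age in c2.items():
            out[b] = min(age, out[b]) if b in out else age
        return out

    def must_update(c, b):
        # accessing b makes it youngest; blocks younger than b age by one
        old = c.get(b, A)            # age A means "not known to be cached"
        out = {}
        for blk, age in c.items():
            if blk != b:
                new_age = age + 1 if age < old else age
                if new_age < A:      # blocks of age >= A have aged out
                    out[blk] = new_age
        out[b] = 0
        return out

    c1 = {'s1': 0, 's2': 2}
    c2 = {'s2': 1, 's3': 3}
    print(must_join(c1, c2))  # {'s2': 2}: only s2 is guaranteed to be cached
    print(may_join(c1, c2))   # {'s1': 0, 's2': 1, 's3': 3}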
14.3 Pipeline Analysis
Pipeline analysis attempts to find out how instructions move through the pipeline. In particular, it determines how many cycles they spend in the pipeline. This largely depends on the timing accidents the instructions suffer. Timing accidents during pipelined executions can be of several kinds. Cache misses during instruction or data load stall the pipeline for as many cycles as the cache-miss penalty indicates. Functional units that an instruction needs may be occupied. Queues into which the instruction may have to be moved may be full, and prefetch queues, from which instructions have to be loaded, may be empty. The bus needed for a pipeline phase may be occupied by a different phase of another instruction.
Again, for architectures without timing anomalies we can use a simplified picture, in which the task is to find out which timing accidents can be safely excluded, because each excluded accident allows the bound for the execution time to be decreased. Accidents that cannot be safely excluded are assumed to happen. A cache analysis as described in Section 14.2 has annotated the instructions with cache-hit information; this information is used to exclude pipeline stalls at instruction or data fetches.

We will explain pipeline analysis in a number of steps, starting with concrete-pipeline execution. A pipeline goes through a number of pipeline phases and consumes a number of cycles when it executes a sequence of instructions; in general, a different number of cycles for different initial execution states. The execution of the instructions in the sequence overlaps in the instruction pipeline as far as the data dependences between instructions permit it and the pipeline conditions are satisfied. Each execution of a sequence of instructions starting in some initial state produces one trace, that is, a sequence of execution states; the length of the trace is the number of cycles this execution takes. Thus, concrete execution can be viewed as applying a function

function exec (b : basic block, s : pipeline state) t : trace

that executes the instruction sequence of basic block b starting in concrete pipeline state s, producing a trace t of concrete states. last(t) is the final state when executing b; it is the initial state for the successor block to be executed next.

So far, we talked about concrete execution on a concrete pipeline. Pipeline analysis regards abstract execution of sequences of instructions on abstract (models of) pipelines. The execution of programs on abstract pipelines produces abstract traces, that is, sequences of abstract states, where some information contained in the concrete states may be missing. There are several types of missing information:

• The cache analysis in general has incomplete information about cache contents.
• The latency of an arithmetic operation, if it depends on the operand sizes, may be unknown. It influences the occupancy of pipeline units.
• The state of a dynamic branch predictor changes over iterations of a loop and may be unknown for a particular iteration.
• Data dependences cannot safely be excluded because effective addresses of operands are not always statically known.
14.3.1 Simple Architectures without Timing Anomalies
In a first step, we assume a simple processor architecture with in-order execution and without timing anomalies, that is, an architecture where local worst cases contribute to the program's global execution time, cf. Section 14.1.2. For such architectures, it is safe to assume the local worst cases for unknown information; the corresponding timing penalties are simply added. For example, the cache-miss penalty has to be added for the instruction fetch of an instruction in the two cases that a cache miss is predicted, or that neither a cache miss nor a cache hit can be predicted.

The result of the abstract execution of an instruction sequence for a given initial abstract state is again one trace, although possibly of a greater length, and thus an upper bound properly bounding the execution time from above. Because worst cases were assumed for all uncertainties, this number of cycles is a safe upper bound for all executions of the basic block starting in concrete states represented by this initial abstract state.

The algorithm for pipeline analysis is quite simple. It uses a function

function êxec (b : cache-annotated basic block, ŝ : abstract pipeline state) t̂ : abstract trace

that executes the instruction sequence of basic block b, annotated with cache information, starting in the abstract pipeline state ŝ and producing a trace t̂ of abstract states. This function is applied to each basic block b in each of its contexts with the empty pipeline state ŝ0, corresponding to a flushed pipeline. Therefore, a linear traversal of the cache-annotated context-extended
Basic-Block Graph suffices. The result is a trace for the instruction sequence of the block, whose length is an upper bound for the execution time of the block in this context. Note that it still makes sense to analyze a basic block in several contexts, because the cache information for them may be quite different.

Note also that this algorithm is simple and efficient, but not necessarily very precise. Starting with a flushed pipeline at the beginning of the basic block is safe, but it ignores the potential overlap between consecutive basic blocks. A more precise algorithm is possible. The problem is with basic blocks having several predecessor blocks: which of their final states should be selected as the initial state of the successor block? The first solution involves working with sets of states for each pair of basic block and context. One analysis of each basic block and context would then be performed for each of the initial states. The resulting set of final states would be passed on to successor blocks, and the maximum of the trace lengths would be taken as the upper bound for this basic block in this context. The second solution would work with a single state per basic block and context, and would conservatively combine the set of predecessor final states into the initial state for the successor.
14.3.2 Processors with Timing Anomalies
In the next step, we assume more complex processors, including those with out-of-order execution. They typically have timing anomalies. Our assumption above, namely that local worst cases contribute worst-case times to the global execution times, is no longer valid. This forces us to consider several paths wherever uncertainty in the abstract execution state does not allow a decision between several successor states. Note that the absence of information turns the deterministic concrete pipeline into an abstract pipeline that is non-deterministic.

This situation is depicted in Figure 14.8. It demonstrates two cases of missing information in the abstract state. First, the abstract state lacks the information whether the instruction is in the I-cache. Pipeline analysis has to follow both cases for the instruction fetch, because it could turn out that the I-cache miss is, in fact, not the global worst case. Second, the abstract state does not contain information about the size of the operands; we also have to follow both paths here. The dashed paths have to be explored to obtain the execution times for this instruction. Depending on the architecture, we may be able to conservatively assume the case of large operands and suppress some paths.

The algorithm has to combine cache and pipeline analysis because of the interference between the two, which actually is the reason for the existence of the timing anomalies. For the cache analysis, it uses the abstract cache states discussed in Section 14.2. For the pipeline part, it uses analysis states, which are sets of abstract pipeline states, that is, sets of states of the abstract pipeline.
FIGURE 14.8 Different paths through the execution of a multiply instruction. Decisions inside the boxes cannot be taken deterministically based on the abstract execution state because of missing information.
The question arises whether one abstract cache state is to be combined with an analysis state ŝs as a whole, or an individual abstract cache state with each of the abstract pipeline states in ŝs. That is, there could be one abstract cache state for ŝs, representing the concrete cache contents for all abstract pipeline states in ŝs, or there could be one abstract cache state per abstract pipeline state in ŝs. The first choice saves memory during the analysis, but loses precision, because different pipeline states may cause different memory accesses and thus different cache contents, which would have to be merged into the one abstract state, thereby losing information. The second choice is more precise but requires more memory during the analysis. We choose the second alternative and thus define a new domain of analysis states Â of the following type:

Â = 2^(Ŝ × Ĉ)   (14.1)

where

Ŝ = set of abstract pipeline states   (14.2)
Ĉ = set of abstract cache states   (14.3)
The algorithm again uses a new function êxec_C:

function êxec_C (b : basic block, â : analysis state) T̂ : set of abstract traces

which analyzes a basic block b starting in an analysis state â consisting of pairs of abstract pipeline states and abstract cache states. As a result, it produces a set of abstract traces. The algorithm is as follows.
14.3.3 Algorithm Pipeline-Analysis
Perform fixpoint iteration over the context-extended Basic-Block Graph: for each basic block b in each of its contexts c, and for the initial analysis state â, compute êxec_C(b, â), yielding a set of traces {t̂1, t̂2, ..., t̂m}. Then max({|t̂1|, |t̂2|, ..., |t̂m|}) is the bound for this basic block in this context, and the set of output states {last(t̂1), last(t̂2), ..., last(t̂m)} is passed on to the successor block(s) in context c as initial states. Basic blocks (in some context) having more than one predecessor receive the union of the sets of output states as initial states.

The abstraction we use as analysis states is a set of abstract pipeline states, since the number of possible pipeline states for one instruction is not too big. Hence, our abstraction computes an upper bound to the collecting semantics. The abstract update for an analysis state â is thus the application of the concrete update to each abstract pipeline state in â, extended with the possibility of multiple successor states in case of uncertainties.

Figure 14.9 shows the possible pipeline states for a basic block in an example; such pictures are produced by the aiT tool on demand. The large dark grey boxes correspond to the instructions of the basic block, and the smaller rectangles in them stand for individual pipeline states. Their cycle-wise evolution is indicated by the strokes connecting them; each layer in the trees corresponds to one CPU cycle. Branches in the trees are caused by conditions that could not be statically evaluated, for example, a memory access with unknown address in the presence of memory areas with different access times. On the other hand, two pipeline states fall together when the details they differ in leave the pipeline. This happened, for instance, at the end of the second instruction, reducing the number of states from four to three.

The update function belonging to an edge (v, v′) of the control-flow graph updates each abstract pipeline state separately. When the bus unit is updated, the pipeline state may split into several successor states with different cache states. The initial analysis state is a set of empty pipeline states plus a cache that represents a cache with unknown content. There can be multiple concrete pipeline states in the initial states, since the alignment of the internal to the external clock of the processor is not known in the beginning, and every possibility (aligned, one cycle apart, etc.) has to be considered.
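A schematic rendering of this fixpoint iteration (my reconstruction for illustration; exec_c stands for the abstract basic-block simulator êxec_C, and the first key of cfg is assumed to be the entry node):

    from collections import deque

    def pipeline_analysis(cfg, exec_c, initial_state):
        # cfg maps each (block, context) node to its successor nodes
        in_states = {node: set() for node in cfg}
        entry = next(iter(cfg))
        in_states[entry] = {initial_state}
        bounds = {}
        worklist = deque([entry])
        while worklist:
            node = worklist.popleft()
            traces = exec_c(node, frozenset(in_states[node]))
            bounds[node] = max(len(t) for t in traces)  # bound for this block
            out = {t[-1] for t in traces}               # set of output states
            for succ in cfg[node]:
                if not out <= in_states[succ]:          # anything new?
                    in_states[succ] |= out
                    worklist.append(succ)
        return bounds

Because the sets of analysis states only grow and the abstract domains are finite, the iteration terminates.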
FIGURE 14.9 Possible pipeline states in a basic block.
Thus, prefetching must start from scratch, but pending bus requests are ignored. To obtain correct results, they must be taken into account by adding a fixed penalty to the calculated upper bounds.
14.3.4 Pipeline Modeling
The basis for pipeline analysis is a model of an abstract version of the processor pipeline that is conservative with respect to the timing behavior, that is, times predicted by the abstract pipeline must never be lower than those observed in concrete executions.

Some terminology is needed to avoid confusion. Processors have concrete pipelines, which may be described in some formal language, for example, VHDL; if this is the case, there exists a formal model of the pipeline. Our abstraction step, by which we eliminate many components of a concrete pipeline that are not relevant for the timing behavior, leads to an abstract pipeline. This may again be described in a formal language, for example, VHDL, and thus have a formal model.

Deriving an abstract pipeline is a complex task. It is demonstrated here for the Motorola ColdFire processor, a processor quite popular in the aeronautics and submarine industries. The presentation follows closely that of Reference 18. (The model of the abstract pipeline of the MCF 5307 was derived by hand; a computer-supported derivation would have been preferable, and ways to develop one are the subject of current research.)

14.3.4.1 The ColdFire MCF 5307 Pipeline
The pipeline of the ColdFire MCF 5307 consists of a fetch pipeline that fetches instructions from memory (or the cache), and an execution pipeline that executes instructions, cf. Figure 14.10. The fetch and execution pipelines are connected and, as far as speed is concerned, decoupled by a FIFO instruction buffer that can hold at most 8 instructions.

The MCF 5307 accesses memory through a bus hierarchy. The fast pipelined K-bus connects the cache and an internal 4 KB SRAM area to the pipeline; accesses to this bus are performed by the IC1/IC2, AGEX, and DSOC stages of the pipeline. On the next level, the M-bus connects the K-bus to the internal peripherals. This bus runs at the external bus frequency, while the K-bus is clocked with the faster internal core clock. The M-bus connects to the external bus, which accesses off-chip peripherals and memory.

The fetch pipeline performs branch prediction in the IED stage, redirecting fetching long before the branch reaches the execution stages. The fetch pipeline is stalled if the instruction buffer is full, or if the execution pipeline needs the bus for a memory access. All these stalls cause the fetch pipeline to wait for one cycle, after which the stall condition is checked again. The fetch pipeline is also stalled if the memory block to be fetched is not in the cache (a cache miss); it must then wait until the memory block has been loaded into the cache and forwarded to the pipeline. The instructions that are already in the later stages of the fetch pipeline are forwarded to the instruction buffer.
FIGURE 14.10 The pipeline of the Motorola ColdFire 5307 processor. (Instruction Fetch Pipeline: IAG, Instruction Address Generation; IC1/IC2, Instruction Fetch Cycles 1 and 2; IED, Instruction Early Decode; IB, FIFO Instruction Buffer. Operand Execution Pipeline: DSOC, Decode & Select, Operand Fetch; AGEX, Address Generation, Execute.)
The execution pipeline finishes the decoding of instructions, evaluates their operands, and executes the instructions. Each kind of operation follows a fixed schedule, which determines how many cycles the operation needs and in which cycles memory is accessed. (In fact, some instructions, such as MOVEM, have an execution schedule that depends on the value of an argument given as an immediate constant; these instructions can be taken into account by special means.) The execution time varies between 2 cycles and several dozen cycles. Pipelining admits a maximum overlap of 1 cycle between consecutive instructions: the last cycle of each instruction may overlap with the first cycle of the next one. In this first cycle, no memory access and no control-flow alteration happen; thus, cache and pipeline cannot be affected by two different instructions in the same cycle. The execution of an instruction is delayed if memory accesses lead to cache misses. Misaligned accesses lead to small time penalties of 1 to 3 cycles. Store operations are delayed if the distance to the previous store operation is less than 2 cycles (this does not hold if the previous store operation was issued by a MOVEM instruction). The start of the next instruction is delayed if the instruction buffer is empty.
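The store-distance rule lends itself to a two-line model. The following sketch (my paraphrase of the stated rule, ignoring the MOVEM exception) computes the cycles at which a sequence of store requests can actually issue:

    def store_issue_times(request_cycles, gap=2):
        issued, last = [], None
        for t in request_cycles:
            if last is not None and t - last < gap:
                t = last + gap        # stall until the 2-cycle distance holds
            issued.append(t)
            last = t
        return issued

    print(store_issue_times([0, 1, 2, 10]))  # [0, 2, 4, 10]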
14.3.5 Formal Models of Abstract Pipelines

An abstract pipeline can be seen as a big finite state machine that makes a transition on every clock cycle. The states of the abstract pipeline, although greatly simplified, still contain all timing-relevant information
of the processor. The number of transitions from the beginning of the execution of an instruction until its end gives the execution time of that instruction.

The abstract pipeline, although greatly reduced by leaving out irrelevant components, is still a very big finite state machine, but it has structure: its states can be naturally decomposed into components according to the architecture. This makes it easier to specify, verify, and implement a model of an abstract pipeline. In the formal approach presented here, an abstract pipeline state consists of several units with inner states that communicate with one another and with the memory via signals, and evolve cycle-wise according to their inner state and the signals received. Thus, the means of decomposition are units and signals. Signals may be instantaneous, meaning that they are received in the same cycle as they are sent, or delayed, meaning that they are received one cycle after they have been sent. Signals may carry data, for example, a fetch address. Note that these signals are only part of the formal pipeline model; they may or may not correspond to real hardware signals. The instantaneous signals between units are used to transport information between the units. The state transitions are coded in the evolution rules local to each unit.

Figure 14.11 shows the formal pipeline model for the ColdFire MCF 5307. It consists of the following units: IAG (instruction address generation), IC1 (instruction fetch cycle 1), IC2 (instruction fetch cycle 2), IED (instruction early decode), IB (instruction buffer), EX (execution unit), and SST (store stall timer). In addition, there is a bus unit modeling the buses that connect the CPU, the static RAM, the cache, and the main memory.
[Figure 14.11 Abstract model of the Motorola ColdFire 5307 processor. The units IAG, IC1, IC2, IED, IB, EX, and SST exchange signals such as set(a)/stop, addr(a), fetch(a), await(a), code(a), instr, next, start, read(A)/write(A), data/hold, store, put(a), cancel, and wait, and are connected to a bus unit.]
The signals between these units are shown as arrows. Most units directly correspond to a stage in the real pipeline. However, the SST unit is used to model the fact that two stores must be separated by at least two clock cycles; it is implemented as a (virtual) counter. The two stages of the execution pipeline are modeled by a single stage, EX, because instructions can only overlap by one cycle. The inner states and emitted signals of the units evolve in each cycle. The complexity of this state update varies from unit to unit. It can be as simple as a small table mapping pending signals and inner state to a new state and signals to be emitted, for example, for the IAG unit and the IC1 unit. It can be much more complicated if multiple dependencies have to be considered, for example, for the instruction reconstruction and branch prediction in the IED stage. In this case, the evolution is formulated in pseudo code. Full details on the model can be found in Reference 19.
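The following sketch illustrates how such a unit can be coded. It is not the implementation used in the tool described here; the types, the signal names, and the simple table-like IAG update are illustrative assumptions only.

```c
#include <stdbool.h>
#include <stdint.h>

/* Signals exchanged between units in one cycle (illustrative subset). */
typedef struct {
    bool     set;      /* set(a): redirect fetching to a new address   */
    bool     stop;     /* stop: stall address generation               */
    bool     fetch;    /* fetch(a): request the next instruction fetch */
    uint32_t addr;     /* address carried by set(a)/addr(a)            */
} signals_t;

/* Inner state of the IAG (instruction address generation) unit. */
typedef struct {
    uint32_t next_addr; /* address to be emitted via addr(a) */
    bool     stalled;
} iag_state_t;

/* One cycle of the IAG unit: a small table-like mapping from
 * (pending input signals, inner state) to (new state, output signals). */
static void iag_update(iag_state_t *s, const signals_t *in, signals_t *out)
{
    if (in->set) {                 /* branch taken: restart fetching */
        s->next_addr = in->addr;
        s->stalled   = false;
    }
    if (in->stop || s->stalled) {  /* stall: emit nothing this cycle */
        s->stalled = in->stop;
        return;
    }
    out->fetch = true;             /* emit addr(a) toward IC1        */
    out->addr  = s->next_addr;
    s->next_addr += 4;             /* sequential fetch (assumed width) */
}
```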
14.3.6 Pipeline States

Abstract pipeline states are formed by combining the inner states of IAG, IC1, IC2, IED, IB, EX, SST, and the bus unit, plus additional entries for pending signals, into one overall state. This overall state evolves from one cycle to the next. Practically, the evolution of the overall pipeline state can be implemented by updating the functional units one by one, in an order that respects the dependencies introduced by input signals and the generation of these signals.

14.3.6.1 Update Function for Pipeline States

For pipeline modeling, one needs a function that describes the evolution of the concrete pipeline state while traveling along an edge (ν, ν′) of the control-flow graph. This function can be obtained by iterating the cycle-wise update function of the previous paragraph. An initial concrete pipeline state at ν has an empty execution unit EX. It is updated until an instruction is sent from IB to EX. Updating of the concrete pipeline state then continues, using the knowledge that the successor instruction is ν′, until EX has become empty again. The number of cycles needed from the beginning until this point can be taken as the time needed for the transition from ν to ν′ for this concrete pipeline state.
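A minimal sketch of this iteration follows; the pipeline_state_t type, cycle_update, and the EX-unit predicates are hypothetical names for the interface just described, not the actual tool interface.

```c
#include <stdbool.h>

typedef struct pipeline_state pipeline_state_t;  /* combined unit states  */
typedef struct instr instr_t;                    /* instruction at a node */

/* Cycle-wise update and EX-unit predicates (hypothetical interface). */
void cycle_update(pipeline_state_t *s);
bool ex_has_started(const pipeline_state_t *s, const instr_t *v);
bool ex_is_empty(const pipeline_state_t *s);
void set_successor(pipeline_state_t *s, const instr_t *v_next);

/* Cycles needed for the transition from v to v', starting from a
 * concrete pipeline state s whose EX unit is empty. */
int transition_time(pipeline_state_t *s, const instr_t *v,
                    const instr_t *v_next)
{
    int cycles = 0;
    while (!ex_has_started(s, v)) {   /* until v is sent from IB to EX */
        cycle_update(s);
        cycles++;
    }
    set_successor(s, v_next);         /* successor is known to be v'   */
    while (!ex_is_empty(s)) {         /* until EX has emptied again    */
        cycle_update(s);
        cycles++;
    }
    return cycles;
}
```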
14.4 Path Analysis Using Integer Linear Programming

The structure of a program and the set of program paths can be mapped to an ILP in a very natural way. A set of constraints describes the control flow of the program, and solving these constraints yields very precise results [5]. However, the precision requirements demand that basic blocks be analyzed in different contexts, that is, distinguished by the different ways in which control can reach them. This makes the control flow quite complex, so that the mapping to an ILP may be very complex as well [14].

An ILP consists of two parts: the cost function and constraints on the variables used in the cost function. Our cost function represents the number of CPU cycles; correspondingly, it has to be maximized. Each variable in the cost function represents the execution count of one basic block of the program and is weighted by the execution time of that basic block. Additionally, variables are used for the traversal counts of the edges in the control flow graph, see Figure 14.12.

The integer constraints describing how often basic blocks are executed relative to each other can be automatically generated from the control flow graph (Figure 14.13). However, additional information about the program, provided by the user, is usually needed, as the problem of finding the worst-case program path is unsolvable in the general case. Loop and recursion bounds cannot always be inferred automatically and must therefore be provided by the user.

The ILP approach for program path analysis has the advantage that users are able to describe in precise terms virtually anything they know about the program by adding integer constraints. The system first generates the obvious constraints automatically and then adds user-supplied constraints to tighten the WCET bounds.
[Figure 14.12 A program snippet (a conditional: if v1 then v2 else v3 fi, followed by v4), the corresponding control flow graph with edges e1 to e6, and the ILP variables generated: a count variable cnt(v) for each basic block v and a traversal variable trav(e) for each edge e.]

[Figure 14.13 Control flow joins and splits and flow-preservation laws: for a node v with incoming edges $e_1, \ldots, e_n$ and outgoing edges $e'_1, \ldots, e'_m$, $\sum_{i=1}^{n} \mathit{trav}(e_i) = \mathit{cnt}(v) = \sum_{i=1}^{m} \mathit{trav}(e'_i)$.]
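To make the construction concrete, here is a sketch of the resulting ILP for such an if-then-else with blocks $v_1$ (condition), $v_2$ and $v_3$ (branches), and $v_4$ (join); the execution-time bounds $t_i$ come from processor-behavior prediction, and the assumption that the snippet is entered exactly once is illustrative:

\begin{align*}
\text{maximize}\quad & t_1\,\mathit{cnt}(v_1) + t_2\,\mathit{cnt}(v_2) + t_3\,\mathit{cnt}(v_3) + t_4\,\mathit{cnt}(v_4)\\
\text{subject to}\quad & \mathit{trav}(e_1) = \mathit{cnt}(v_1) = \mathit{trav}(e_2) + \mathit{trav}(e_3)\\
& \mathit{trav}(e_2) = \mathit{cnt}(v_2) = \mathit{trav}(e_4)\\
& \mathit{trav}(e_3) = \mathit{cnt}(v_3) = \mathit{trav}(e_5)\\
& \mathit{trav}(e_4) + \mathit{trav}(e_5) = \mathit{cnt}(v_4) = \mathit{trav}(e_6)\\
& \mathit{trav}(e_1) = 1
\end{align*}

A user-supplied loop bound would enter as one more linear constraint, for example, $\mathit{cnt}(v_{\mathrm{body}}) \le 100 \cdot \mathit{cnt}(v_{\mathrm{entry}})$ for a loop known to iterate at most 100 times (the bound 100 being hypothetical).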
14.5 Other Ingredients

14.5.1 Value Analysis

A static method for data-cache behavior prediction needs to know the effective memory addresses of data in order to determine where a memory access goes. However, effective addresses are only available at run time. Interval analysis, as described by Cousot and Halbwachs [20], can help here. It can compute intervals for address-valued objects like registers and variables. An interval computed for such an object at some program point bounds the set of potential values the object may have when program execution reaches this program point. Such an analysis, called value analysis in aiT, has been shown to be able to determine many effective addresses in disciplined code statically [10].
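A minimal sketch of the interval domain underlying such an analysis follows; it is illustrative only, since a real value analysis tracks intervals per register and program point and must handle wrap-around and widening.

```c
#include <limits.h>

/* An interval [lo, hi] abstracting the possible values of a register. */
typedef struct { long lo, hi; } interval_t;

/* Least upper bound: used where two control-flow paths join. */
static interval_t itv_join(interval_t a, interval_t b)
{
    interval_t r = { a.lo < b.lo ? a.lo : b.lo,
                     a.hi > b.hi ? a.hi : b.hi };
    return r;
}

/* Abstract transfer function for "r := r + c", e.g., address arithmetic. */
static interval_t itv_add_const(interval_t a, long c)
{
    interval_t r = { a.lo + c, a.hi + c };  /* overflow ignored here */
    return r;
}

/* Example: if the base address of an access is known to lie in
 * [0x1000, 0x1008], then after adding offset 4 it lies in
 * [0x1004, 0x100c], which suffices to bound the cache sets touched. */
```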
14.5.2 Control Flow Specification and Analysis

Any information about the possible flow of control of the program may increase the precision of the subsequent analyses. Control flow analysis may attempt to exclude infeasible paths, determine execution frequencies of paths, or determine the relation between the execution frequencies of different paths or subpaths. The purpose of control flow analysis is to determine the dynamic behavior of the program. This includes information about which functions are called and with which arguments, how many times loops iterate, whether there are dependencies between successive if-statements, etc. The main focus of flow analysis has been the determination of loop bounds, since the bounding of loops is a necessary step in finding an execution-time bound for a program.

Control-flow analysis can be performed manually or automatically. Automatic analyses have been based on various techniques, like symbolic execution, abstract interpretation, and pattern recognition on parse trees. The best precision is achieved by using interprocedural analysis techniques, but this has to be traded off against the extra computation time and memory required. All automatic techniques allow a user to complement the results and guide the analysis using manual annotations, since this is sometimes necessary in order to obtain reasonable results.

Since the flow analysis is in general performed separately from the path analysis, it does not know the execution times of individual program statements and must thus generate a safe (over)approximation including all possible program executions. The path analysis will later select, from the set of possible program paths, the path that corresponds to the upper bound, using the time information computed by processor behavior prediction.
Control flow specification is preferably done on the source level. Concepts based on source-level constructs are used in References 6 and 21.
14.5.3 Frontends for Executables

Any reasonably precise timing analysis takes fully linked executable programs as input. Source programs do not contain information about program and data allocation, which is essential for the described methods to predict the cache behavior. Executables must be analyzed to reconstruct the original control flow of the program. This may be a difficult task, depending on the instruction set of the processor and the code generated by the compiler used. A generic approach to this problem is described in References 14, 22, and 23.
14.6 Related Work

It is not possible in general to obtain upper bounds on running times for programs; otherwise, one could solve the halting problem. However, real-time systems only use a restricted form of programming, which guarantees that programs always terminate: recursion is not allowed (or explicitly bounded) and the maximal iteration counts of loops are known in advance. A worst-case running time of a program could easily be determined if the worst-case input for the program were known, but this is in general not the case. The alternative, to execute the program with all possible inputs, is often prohibitively expensive. As a consequence, approximations of the worst-case execution time are determined. Two classes of methods to obtain bounds can be distinguished:

• Dynamic methods employ real program executions to obtain approximations. These approximations are unsafe, as they only compute the maximum over a subset of all executions.
• Static methods only need the program itself, maybe extended with some additional information (like loop bounds).
14.6.1 A (Partly) Dynamic Method

A traditional method, still used in industry, combines measuring and static methods. Here, small snippets of code are measured for their execution time, then a safety margin is applied, and the results for the code pieces are combined according to the structure of the whole task. For example, if a task first executes a snippet A and then a snippet B, the resulting time is that measured for A, tA, added to that measured for B, tB: t = tA + tB. This reduces the amount of measurement that has to be done, as code snippets tend to be reused a lot in control software and only the distinct snippets need to be measured. It adds, however, the need for an argument about the correctness of the composition step applied to the measured snippet times. This argument typically relies on certain implicit assumptions about the worst-case initial execution state for these measurements. For example, the snippets are measured with an empty cache at the beginning of the measurement, under the assumption that this is the worst-case cache state. In Reference 19 it is shown that this assumption can be wrong. The problem of unknown worst-case input exists for this method as well, and it is still infeasible to measure execution times for all input values.
14.6.2 Purely Static Methods

14.6.2.1 The Timing-Schema Approach

In the timing-schemata approach [24], bounds for the execution times of a composed statement are computed from the bounds of its constituents. One timing schema is given for each type of statement. The basis is formed by known times for the atomic statements; these are assumed to be constant and available from a manual, or are assumed to be computed in a preceding phase. A bound for the whole program is obtained by combining results according to the structure of the program.
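Typical schemata look as follows (a sketch in the style of [24]; $T(\cdot)$ denotes an upper execution-time bound and $n$ a user-supplied loop bound):

\begin{align*}
T(S_1;\,S_2) &= T(S_1) + T(S_2)\\
T(\textbf{if } b \textbf{ then } S_1 \textbf{ else } S_2) &= T(b) + \max\bigl(T(S_1),\, T(S_2)\bigr)\\
T(\textbf{while } b \textbf{ do } S) &= (n+1)\,T(b) + n\,T(S)
\end{align*}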
The precision can be very bad because of some implicit assumptions underlying this method. Timing schemata assume compositionality of execution-time bounds, that is, they compute bounds for the execution times of composed constructs from already computed bounds of the constituents. However, as we have seen, the execution times of the constituents depend heavily on the execution history.

14.6.2.2 Symbolic Simulation

Another static method simulates the execution of the program on an abstract model of the processor. The simulation is performed without input; the simulator thus has to be capable of dealing with partly unknown execution states. This method combines flow analysis, processor-behavior prediction, and path analysis in one integrated phase [25,26]. One problem with this approach is that the analysis time is proportional to the actual execution time of the program, with a usually large factor for performing the simulation.

14.6.2.3 WCET Determination by ILP

Li, Malik, and Wolfe proposed an ILP-based approach to WCET determination [27–30]. Cache and pipeline behavior prediction are formulated as a single linear program. The i960KB is investigated, a 32-bit microprocessor with a 512-byte direct-mapped instruction cache and a fairly simple pipeline. Only structural hazards need to be modeled, thus keeping the complexity of the integer linear program moderate compared to the expected complexity of a model for a modern microprocessor. Variable execution times, branch prediction, and instruction prefetching are not considered at all. Using this approach for superscalar pipelines does not seem very promising, considering the analysis times reported in one of the articles.

One of the severe problems is the exponential increase of the size of the ILP in the number of competing l-blocks. l-blocks are maximally long contiguous sequences of instructions in a basic block mapped to the same cache set. Two l-blocks mapped to the same cache set compete if they do not have the same address tag. For a fixed cache architecture, the number of competing l-blocks grows linearly with the size of the program. Differentiation by contexts, absolutely necessary to achieve precision, increases this number further. Thus, the size of the ILP is exponential in the size of the program. Even though the problem is claimed to be a network-flow problem, the size of the ILP renders the approach impractical. Growing associativity of the cache increases the number of competing l-blocks; thus, increasing cache-architecture complexity also works against this approach. Nonetheless, their method of modeling the control flow as an ILP, the so-called Implicit Path Enumeration, is elegant and can be efficient if the size of the ILP is kept small. It has been adopted by many groups working in this area.

14.6.2.4 Timing Analysis by Static Program Analysis

The method described in this chapter uses a sequence of static program analyses for determining the program's control flow and its data accesses, and for predicting the processor's behavior for the given program. An early approach to timing analysis using data-flow analysis methods can be found in References 31 and 32. Jakob Engblom showed how to precompute parts of a timing analyzer to speed up the actual timing analysis for architectures without timing anomalies [33]. Reference 34 gives an overview of existing tools for timing analysis, both commercially available tools and academic prototypes.
14.7 State of the Art and Future Extensions

The timing-analysis technology described in this chapter is realized in the aiT tool and is used in the aeronautics and automotive industries. Several benchmarks have shown that the precision of the predicted upper bounds is on the order of 10% [10]. Obtaining such precision, however, requires competent users, since the available knowledge about the program's control flow may be difficult to specify.
The computational effort is high, but acceptable; future optimizations will reduce this effort. As is often the case in static program analysis, there is a trade-off between precision and effort: precision can be reduced if the effort is intolerable. The only real drawback of the described technology is the huge effort needed to produce abstract processor models. Work is under way to support this activity through transformations on the VHDL level.
Acknowledgments

Many former students have worked on different parts of the method presented in this chapter and have together built a timing-analysis tool satisfying industrial requirements. Christian Ferdinand studied cache analysis and showed that precise information about cache contents can be obtained. Stephan Thesing, together with Reinhold Heckmann and Marc Langenbach, developed methods to model abstract processors. Stephan went through the pains of implementing several abstract models for real-life processors such as the ColdFire MCF 5307 and the PPC 755. I owe him my thanks for help with the presentation of pipeline analysis. Henrik Theiling contributed the preprocessor technology for the analysis of executables and the translation of complex control flow to integer linear programs. Many thanks to him for his contribution to the path analysis section. Michael Schmidt implemented powerful versions of value analysis. Reinhold Heckmann managed to model even very complex cache architectures. Florian Martin implemented the program-analysis generator, PAG, which is the basis for many of the program analyses.
References

[1] Reinhold Heckmann, Marc Langenbach, Stephan Thesing, and Reinhard Wilhelm. The influence of processor architecture on the design and the results of WCET tools. Proceedings of the IEEE, 91: 1038–1054, 2003.
[2] P. Puschner and Ch. Koza. Calculating the maximum execution time of real-time programs. Real-Time Systems, 1: 159–176, 1989.
[3] Chang Yun Park and Alan C. Shaw. Experiments with a program timing tool based on source-level timing schema. IEEE Computer, 24: 48–57, 1991.
[4] Christopher A. Healy, David B. Whalley, and Marion G. Harmon. Integrating the timing analysis of pipelining and instruction caching. In Proceedings of the IEEE Real-Time Systems Symposium, December 1995, pp. 288–297.
[5] Henrik Theiling, Christian Ferdinand, and Reinhard Wilhelm. Fast and precise WCET prediction by separated cache and path analyses. Real-Time Systems, 18: 157–179, 2000.
[6] Andreas Ermedahl. A Modular Tool Architecture for Worst-Case Execution Time Analysis. Ph.D. thesis, Uppsala University, Uppsala, Sweden, 2003.
[7] Martin Alt, Christian Ferdinand, Florian Martin, and Reinhard Wilhelm. Cache behavior prediction by abstract interpretation. In Proceedings of SAS'96, Static Analysis Symposium, Vol. 1145 of Lecture Notes in Computer Science, Springer-Verlag, Heidelberg, 1996, pp. 52–66.
[8] Christian Ferdinand, Florian Martin, and Reinhard Wilhelm. Cache behavior prediction by abstract interpretation. Science of Computer Programming, 35: 163–189, 1999.
[9] C. Ferdinand, R. Heckmann, M. Langenbach, F. Martin, M. Schmidt, H. Theiling, S. Thesing, and R. Wilhelm. Reliable and precise WCET determination for a real-life processor. In Proceedings of the First International Workshop on Embedded Software (EMSOFT 2001), Vol. 2211 of Lecture Notes in Computer Science, Springer-Verlag, London, 2001, pp. 469–485.
[10] Stephan Thesing, Jean Souyris, Reinhold Heckmann, Famantanantsoa Randimbivololona, Marc Langenbach, Reinhard Wilhelm, and Christian Ferdinand. An abstract interpretation-based timing validation of hard real-time avionics software systems. In Proceedings of the 2003 International Conference on Dependable Systems and Networks (DSN 2003), IEEE Computer Society, Washington, 2003, pp. 625–632.
[11] Thomas Lundqvist and Per Stenström. Timing anomalies in dynamically scheduled microprocessors. In Proceedings of the 20th IEEE Real-Time Systems Symposium, December 1999, pp. 12–21.
[12] T. Reps, M. Sagiv, and R. Wilhelm. Shape analysis and applications. In Y.N. Srikant and Priti Shankar, Eds., The Compiler Design Handbook: Optimizations and Machine Code Generation, CRC Press, Boca Raton, FL, 2002, pp. 175–217.
[13] Florian Martin, Martin Alt, Reinhard Wilhelm, and Christian Ferdinand. Analysis of loops. In Proceedings of the International Conference on Compiler Construction (CC'98), Vol. 1383 of Lecture Notes in Computer Science, Springer-Verlag, Heidelberg, 1998, pp. 80–94.
[14] Henrik Theiling. Control Flow Graphs for Real-Time Systems Analysis. Ph.D. thesis, Universität des Saarlandes, Saarbrücken, Germany, 2002.
[15] Patrick Cousot and Radhia Cousot. Abstract interpretation: a unified lattice model for static analysis of programs by construction or approximation of fixpoints. In Proceedings of the 4th ACM Symposium on Principles of Programming Languages, Los Angeles, CA, 1977, pp. 238–252.
[16] Christian Ferdinand. Cache Behavior Prediction for Real-Time Systems. Ph.D. thesis, Universität des Saarlandes, Saarbrücken, 1997.
[17] Flemming Nielson, Hanne Riis Nielson, and Chris Hankin. Principles of Program Analysis. Springer-Verlag, Heidelberg, 1999.
[18] Marc Langenbach, Stephan Thesing, and Reinhold Heckmann. Pipeline modelling for timing analysis. In Manuel V. Hermenegildo and German Puebla, Eds., Static Analysis Symposium SAS 2002, Vol. 2477 of Lecture Notes in Computer Science, Springer-Verlag, Heidelberg, 2002, pp. 294–309.
[19] Stephan Thesing. Safe and Precise WCET Determination by Abstract Interpretation of Pipeline Models. Ph.D. thesis, Saarland University, Saarbrücken, 2004.
[20] Patrick Cousot and Nicolas Halbwachs. Automatic discovery of linear restraints among variables of a program. In Proceedings of the 5th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, Tucson, AZ, ACM Press, New York, 1978, pp. 84–96.
[21] Andreas Ermedahl and Jan Gustafsson. Deriving annotations for tight calculation of execution time. In Proceedings of Euro-Par '97, 1997, pp. 1298–1307.
[22] Henrik Theiling. Extracting safe and precise control flow from binaries. In Proceedings of the Seventh International Conference on Real-Time Computing Systems and Applications, IEEE Computer Society, 2000, pp. 23–30.
[23] Henrik Theiling. Generating decision trees for decoding binaries. In ACM SIGPLAN 2001 Workshop on Languages, Compilers, and Tools for Embedded Systems, 2001, pp. 112–120.
[24] Alan C. Shaw. Reasoning about time in higher-level language software. IEEE Transactions on Software Engineering, 15: 875–889, 1989.
[25] Thomas Lundqvist and Per Stenström. An integrated path and timing analysis method based on cycle-level symbolic execution. Real-Time Systems, 17: 183–207, 1999.
[26] Thomas Lundqvist. A WCET Analysis Method for Pipelined Microprocessors with Cache Memories. Ph.D. thesis, Department of Computer Engineering, Chalmers University of Technology, Sweden, 2002.
[27] Yau-Tsun Steven Li and Sharad Malik. Performance analysis of embedded software using implicit path enumeration. In Proceedings of the 32nd ACM/IEEE Design Automation Conference, June 1995, pp. 456–461.
[28] Yau-Tsun Steven Li, Sharad Malik, and Andrew Wolfe. Efficient microarchitecture modeling and path analysis for real-time software. In Proceedings of the IEEE Real-Time Systems Symposium, December 1995, pp. 298–307.
[29] Yau-Tsun Steven Li, Sharad Malik, and Andrew Wolfe. Performance estimation of embedded software with instruction cache modeling. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design, November 1995, pp. 380–387.
[30] Yau-Tsun Steven Li, Sharad Malik, and Andrew Wolfe. Cache modeling for real-time software: beyond direct mapped instruction caches. In Proceedings of the IEEE Real-Time Systems Symposium, December 1996.
[31] R. Arnold, F. Mueller, D. Whalley, and M. Harmon. Bounding worst-case instruction cache performance. In Proceedings of the IEEE Real-Time Systems Symposium, Puerto Rico, December 1994, pp. 172–181.
[32] Frank Mueller, David B. Whalley, and Marion Harmon. Predicting instruction cache behavior. In Proceedings of the ACM SIGPLAN Workshop on Language, Compiler and Tool Support for Real-Time Systems, 1994.
[33] Jakob Engblom. Processor Pipelines and Static Worst-Case Execution Time Analysis. Ph.D. thesis, Uppsala University, Uppsala, Sweden, 2002.
[34] Reinhard Wilhelm, Jakob Engblom, Stephan Thesing, and David Whalley. The determination of worst-case execution times — introduction and survey of available tools, 2004 (submitted).
15 Performance Analysis of Distributed Embedded Systems

Lothar Thiele and Ernesto Wandeler
Swiss Federal Institute of Technology

15.1 Performance Analysis: Distributed Embedded Systems • Basic Terms • Role in the Design Process • Requirements
15.2 Approaches to Performance Analysis: Simulation-Based Methods • Holistic Scheduling Analysis • Compositional Methods
15.3 The Performance Network Approach: Performance Network • Variability Characterization • Resource Sharing and Analysis • Concluding Remarks
References
15.1 Performance Analysis

15.1.1 Distributed Embedded Systems

An embedded system is a special-purpose information processing system that is closely integrated into its environment. It is usually dedicated to a certain application domain, and knowledge about the system behavior at design time can be used to minimize resources while maximizing predictability. The embedding into a technical environment and the constraints imposed by a particular application domain very often lead to heterogeneous and distributed implementations. In this case, systems are composed of hardware components that communicate via some interconnection network. The functional and nonfunctional properties of the whole system depend not only on the computations inside the various nodes but also on the interaction of the various data streams on the common communication media.

In contrast to multiprocessor or parallel computing platforms, the individual computing nodes have a high degree of independence and usually communicate via message passing. It is particularly difficult to maintain global state and workload information, as the local processing nodes usually make independent scheduling and resource access decisions. In addition, the dedication to an application domain very often leads to heterogeneous distributed implementations, where each node is specialized to its local environment and/or its functionality. For example, in an automotive application one may find nodes (usually called electronic control units) that contain a communication controller, a CPU, memory, and I/O interfaces. But depending on the particular
task of a node, it may contain additional digital signal processors (DSPs), different kinds of CPUs and interfaces, and different memory capacities. The same observation holds for the interconnection networks: they may be composed of several interconnected smaller sub-networks, each one with its own communication protocol and topology. For example, in automotive applications we may find Controller Area Networks (CAN), time-triggered protocols (TTP) as in TTCAN, or hybrid protocols as in FlexRay. The complexity of a design is particularly high if the computation nodes responsible for a single application are distributed across several networks. In this case, critical information may flow through several sub-networks and connecting gateways before it reaches its destination.

Recently, the architectural concepts described earlier (heterogeneity, distributivity, and parallelism) can be seen on several layers of granularity. The term system-on-a-chip refers to the implementation of sub-systems on a single device that contains a collection of (digital or analogue) interfaces, busses, memory, and heterogeneous computing resources such as FPGAs, CPUs, controllers, and DSPs. These individual components are connected using "networks-on-chip" that can be regarded as dedicated interconnection networks involving adapted protocols, bridges, or gateways.

Based on this assessment, it becomes obvious that heterogeneous and distributed embedded systems are inherently difficult to design and to analyze. In many cases, not only the availability, the safety, and the correctness of the computations of the whole embedded system are of major concern, but also the timeliness of the results. One cause of end-to-end timing constraints is the fact that embedded systems are frequently connected to a physical environment through sensors and actuators. Typically, embedded systems are reactive systems that are in continuous interaction with their environment, and they must execute at a pace determined by that environment. Examples are automatic control tasks, manufacturing systems, mechatronic systems, automotive/air/space applications, radio receivers and transmitters, and signal processing tasks in general. Also in the case of multimedia and content production, missing audio or video samples need to be avoided under all circumstances. As a result, many embedded systems must meet real-time constraints, that is, they must react to stimuli within the time interval dictated by the environment. A real-time constraint is called hard if not meeting it could result in a catastrophic failure of the system, and it is called soft otherwise. As a consequence, time-predictability in the strong sense cannot be guaranteed using statistical arguments.

Finally, let us give an example that shows part of the complexity in the performance and timing analysis of distributed embedded systems. The example, adapted from Reference 1, is particularly simple in order to point out one source of difficulties, namely the interaction of event streams on a communication resource (Figure 15.1).
P3 A1
Sensor
CPU
Memory
I/O
Bus
Input P4
...
DSP
Buffer
A2
P5, P6
Bus load t BCET
WCET
FIGURE 15.1 Interference of two applications on a shared communication resource.
The application A1 consists of a sensor that periodically sends bursts of data to the CPU, which stores them in memory using a task P1. These data are processed by the CPU using a task P2, with a worst-case execution time (WCET) and a best-case execution time (BCET). The processed data are transmitted via the shared bus to a hardware input/output device that is running task P3. We suppose that the CPU uses a preemptive fixed-priority scheduling policy, where P1 has the highest priority. The maximal workload on the CPU is obtained when P2 continuously uses its WCET and when the sensor simultaneously submits data.

There is a second, streaming application A2 that receives real-time data in equidistant packets via the Input interface. The Input interface runs task P4 to send the data to a DSP for processing with task P5. The processed packets are then transferred to a playout buffer, and task P6 periodically removes packets from the buffer, for example, for playback. We suppose that the bus uses an FCFS (first-come first-serve) scheme for arbitration. As the bus transactions from the applications A1 and A2 interfere on the common bus, there will be jitter in the packet stream received by the DSP, which eventually may lead to an undesirable buffer overflow or underflow. It is now interesting to note that the worst-case situation in terms of jitter occurs if the processing in A1 takes its BCET, as this leads to a blocking of the bus for a long time period. Therefore, the worst-case situation for the CPU load leads to a best case for the bus, and vice versa.

In more realistic situations, there will be simultaneous resource sharing on the computing and communication resources, there may be different protocols and scheduling policies on these resources, there may be a distributed architecture using interconnected sub-networks, and there may be additional nondeterminism caused by unknown input patterns and data. It is the purpose of performance analysis to determine the timing and memory properties of such systems.
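As a small worked instance of the memory dimensioning involved (all numbers hypothetical): if A2 receives packets at a rate of 1000 packets/s and bus interference can shift a packet by up to 5 ms against the periodic playout, the playout buffer must be able to hold at least $1000\ \text{packets/s} \times 0.005\ \text{s} = 5$ packets to absorb the jitter without underflow or overflow.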
15.1.2 Basic Terms

As a starting point for the analysis of timing and performance of embedded systems, it is very useful to clarify a few basic terms. Very often, the timing behavior of an embedded system can be described by the time interval between a specified pair of events. For example, the instantiation of a task, the occurrence of a sensor input, or the arrival of a packet could be a start event. Such events will be denoted as arrival events. Similarly, the finishing of an application or a part of it can again be modeled as an event, denoted as a finishing event. In case of a distributed system, the physical location of the finishing event may not be equal to that of the corresponding arrival event, and the processing may require the execution of a sequence or set of tasks and the use of distributed computing and communication resources. In this case, we talk about end-to-end timing constraints. Note that not all pairs of events in a system are necessarily critical, that is, have deadline requirements.

An embedded system processes the data associated with arrival events. The timing of computations and communications within the embedded system may depend on the input data (because of data-dependent behavior of tasks) and on the arrival pattern. In case of a conservative resource sharing strategy, such as the time-triggered architecture (TTA), the interference between these tasks is removed by applying a static sharing strategy. If the use of shared resources is controlled by dynamic policies, all activities may interact with each other and their timing properties influence each other. As shown in Section 15.1.1, it is necessary to distinguish between the following terms:

• Worst case and best case. The worst case and the best case are the maximal and minimal time intervals between the arrival and finishing events under all admissible system and environment states. The execution time may vary largely, owing to different input data and interference between concurrent system activities.
• Upper and lower bounds. Upper and lower bounds are quantities that bound the worst- and best-case behavior. These quantities are usually computed offline, that is, not during the runtime of the system.
• Statistical measures. Instead of computing bounds on the worst- and best-case behavior, one may also determine a statistical characterization of the runtime behavior of the system, for example, expected values, variances, and quantiles.
In the case of real-time systems, we are particularly interested in upper and lower bounds. They are used to verify statically whether the system meets its timing requirements, for example, deadlines. In contrast to the end-to-end timing properties, the term performance is less well defined. Usually, it refers to a mixture of the achievable deadline, the delay of events or packets, and the number of events that can be processed per time unit (throughput). There is a close relation between the delay of individual packets or events, the necessary memory in the embedded system, and the throughput: the required memory is proportional to the product of throughput and delay. Therefore, we will concentrate on the delay and memory properties in this chapter.

Several methods exist to determine or approximate the above quantities: besides analytic methods based on formal models, one may also consider simulation, emulation, or implementation. All the latter possibilities should be used with care, as only a finite set of initial states, environment behaviors, and execution traces can be considered. As is well known, the corner cases that lead to a WCET or BCET are usually not known, and thus incorrect results may be obtained. The huge state space of realistic system architectures makes it highly improbable that the critical instances of the execution can be determined without the help of analytical methods.

In order to understand the requirements for performance analysis methods in distributed embedded systems, we will classify possible causes for a large difference between the worst case and best case, or between the upper and lower bounds:

• Nondeterminism and interference. Let us suppose that there is only limited knowledge about the environment of the embedded system, for example, about the time when external events arrive or about their input data. In addition, there is interference of computation and communication on shared resources such as CPU, memory, bus, or network. Then we will say that the timing properties are nondeterministic with respect to the available information. Therefore, there will be a difference between the worst-case and the best-case behavior, as well as between the associated bounds. An example is a task whose execution time depends on its input data; another example is the communication of data packets on a bus in case of an unknown interference.
• Limited analyzability. If there is complete knowledge about the whole system, then the behavior of the system is determined. Nevertheless, it may be that, because of the system complexity, there is no feasible way of determining close upper and lower bounds on the worst- and best-case timing, respectively.

As a result of this discussion, we understand that methods to analyze the performance of distributed embedded systems must be (1) correct, in that they determine valid upper and lower bounds, and (2) accurate, in that the determined bounds are close to the actual worst case and best case.

In contrast to other chapters of the handbook, we will concentrate on the interaction between the task level of an embedded system and the distributed operation. We suppose that the whole application is partitioned into tasks and threads. The task level therefore refers to operating system issues such as scheduling, memory management, and arbitration of shared resources.
In addition, we are faced with applications that run on distributed resources. The corresponding layer contains methods of distributed scheduling and networking. On this level of abstraction we are interested in end-to-end timing and performance properties.
15.1.3 Role in the Design Process

One of the major challenges in the design process of embedded systems is to estimate essential characteristics of the final implementation early in the design. This can help in making important design decisions before investing too much time in detailed implementations. Typical questions faced by a designer during a system-level design process are: Which functions should be implemented in hardware and which in software (partitioning)? Which hardware components should be chosen (allocation)? How should the different functions be mapped onto the chosen hardware (binding)? Do the system-level timing properties meet the design requirements? What are the different bus utilizations, and which bus or processor acts as a bottleneck?
[Figure 15.2 Relation between design space exploration and performance analysis. An application specification is mapped onto an execution platform (mapping, scheduling, arbitration); performance analysis evaluates the resulting design, and design space exploration feeds the results back into new design choices.]
Then there are also questions related to the on-chip memory requirements and off-chip memory bandwidth. Typically, performance analysis or estimation is part of the design space exploration, where different implementation choices are investigated in order to determine the appropriate design trade-offs between the different conflicting objectives; for an overview see Reference 2. Following Figure 15.2, the estimation of system properties in an early design phase is an essential part of the design space exploration. Different choices of the underlying system architecture, the mapping of the applications onto this architecture, and the chosen scheduling and arbitration schemes will need to be evaluated in terms of the different quality criteria. In order to achieve acceptable design times, though, there is a need for automatic or semiautomatic (interactive) exploration methods. As a result, there are additional requirements for performance analysis if it is used for design space exploration, namely (1) simple reconfigurability with respect to architecture, mapping, and resource sharing policies, (2) a short analysis time in order to be able to test many different choices in a reasonable time frame, and (3) the possibility to cope with incomplete design information, as typically the lower layers are not designed or implemented yet.

Even if design space exploration as described is not part of the chosen design methodology, performance analysis is often part of the development process of software and hardware. In embedded system design, the functional correctness is validated after each major design step using simulation or formal methods. If there are nonfunctional constraints such as deadline or throughput requirements, they need to be validated as well, and all aspects of the design representation related to performance become "first-class citizens."

Finally, performance analysis of the whole embedded system may be done after completion of the design, in particular if the system is operated under hard real-time conditions, where timing failures lead to a catastrophic situation. As has been mentioned earlier, performance simulation is not appropriate in this case because the critical instances and test patterns are not known in general.
15.1.4 Requirements

Based on the discussion, one can list some of the requirements that a methodology for performance analysis of distributed embedded systems must satisfy:

• Correctness. The results of the analysis should be correct, that is, there exist no reachable system states and feasible reactions of the system environment such that the calculated bounds are violated.
• Accuracy. The lower and upper bounds determined by the performance analysis should be close to the actual worst- and best-case timing properties.
• Embedding into the design process. The underlying performance model should be sufficiently general to allow the representation of the application (which possibly uses different specification mechanisms), of the environment (periodic, aperiodic, bursty, different event types), of the mapping including the resource sharing strategies (preemption, priorities, time-triggered), and of the hardware platform. The method should seamlessly integrate into the functional specification and design methodology.
• Short analysis time. Especially if the performance analysis is part of a design space exploration, a short analysis time is important. In addition, the underlying model should allow for reconfigurability in terms of application, mapping, and hardware platform.

As distributed systems are heterogeneous in terms of the underlying execution platform, the diverse concurrently running applications, and the different scheduling and arbitration policies used, modularity is a key requirement for any performance analysis method. We can distinguish between several composition properties:

• Process composition. Often, events need to be processed by several consecutive application tasks. In this case, the performance analysis method should be modular in terms of this functional composition.
• Scheduling composition. Within one implementation, different scheduling methods can be combined, even within one computing resource (hierarchical scheduling); the same property holds for the scheduling and arbitration of communication resources.
• Resource composition. A system implementation can consist of different heterogeneous computing and communication resources. It should be possible to compose them in a similar way as processes and scheduling methods.
• Building components. Combinations of processes, associated scheduling methods, and architecture elements should be combinable into components. This way, one could associate a performance component with a combined hardware/operating system/software module of the implementation that exposes the performance requirements but hides internal implementation details.

It should be mentioned that none of the approaches known to date is able to satisfy all of the above-mentioned criteria. On the other hand, depending on the application domain and the chosen design approach, not all of the requirements are equally important. Section 15.2 summarizes some of the available methods, and in Section 15.3 one available method is described in more detail.
15.2 Approaches to Performance Analysis

In this survey, we select just a few representative and promising approaches that have been proposed for the performance analysis of distributed embedded systems.
15.2.1 Simulation-Based Methods

Currently, the performance estimation of embedded systems is mainly done using simulation or trace-based simulation. Examples of available approaches and software support are provided by the SystemC initiative, see for example References 3 and 4, which is supported by tools from companies such as Cadence (nc-systemc) and Synopsys (System Studio). In simulation-based methods, many dynamic and complex interactions can be taken into account, whereas analytic methods usually have to stick to a restrictive underlying model and suffer from limited scope. In addition, there is the possibility to match the level of abstraction in the representation of time to the required degree of accuracy. Examples of these different layers range from cycle-accurate models, for example, those used in the simulation of processors [5], up to networks of discrete event components that can be modeled in SystemC.

In order to determine timing properties of an embedded system, a simulation framework not only has to consider the functional behavior but also requires a concept of time and a way of taking into account properties of the execution platform, of the mapping between functional computation and communication processes and elements of the underlying hardware, and of resource sharing policies (as usually implemented in the operating system or directly in hardware).
[Figure 15.3 A hybrid method for performance estimation, based on simulation and analytic methods. Input stimuli drive a cosimulation based on an abstract architecture; the resulting abstract trace yields an initial Communication Analysis Graph (CAG), which is refined with the communication topology, mapping, and arbitration protocols and then analyzed to produce the performance estimate.]
This additional complexity leads to higher computation times, and performance estimation quickly becomes a bottleneck in the design. Besides, a substantial set-up effort is necessary if the mapping of the application to the underlying hardware platform changes, for example, in order to perform a design space exploration.

The fundamental problem of simulation-based approaches to performance estimation is insufficient corner-case coverage. As shown in the example in Figure 15.1, the sub-system corner case (high computation time of A1) does not lead to the system corner case (small computation time of A1). Designers must provide a set of appropriate simulation stimuli in order to cover all the corner cases that exist in the distributed embedded system. Failures of embedded systems very often relate to timing anomalies that happen infrequently and are therefore almost impossible to discover by simulation. In general, simulation provides estimates of the average system performance but does not yield worst-case results and cannot determine whether the system satisfies required timing constraints.

The approach taken by Lahiri et al. [6] combines performance simulation and analysis in a hybrid trace-based methodology. It is intended to fill the gap between pure simulation, which may be too slow to be used in a design space exploration cycle, and analytic methods, which are often too restricted in scope and not accurate enough. The approach as described concentrates on communication aspects of a distributed embedded system. The performance estimation is partitioned into several stages, see Figure 15.3:

• Stage 1. An initial cosimulation of the whole distributed system is performed. The simulation not only covers functional aspects (processing of data) but also captures the communication in an abstract manner, that is, in the form of events, tokens, or abstract data transfers. The resulting set of traces covers essential characteristics of computation and communication but does not contain data information anymore. Resource sharing, such as different arbitration schemes and access conflicts, is not yet taken into account here. The output of this step is a timing-inaccurate system execution trace.
• Stage 2. The traces from stage 1 are transformed into an initial Communication Analysis Graph (CAG). One can omit unnecessary details (e.g., the values of the data communicated, of which only the size might be important), and bursts of computation/communication events might be clustered by identifying only the start and end times of these bursts.
• Stage 3. A communication topology is chosen, the mapping of the abstract communications to paths in the communication architecture (network, bus, point-to-point links) is specified, and finally, the corresponding arbitration protocols are chosen.
• Stage 4. In the analytic part of the whole methodology, the CAG from stage 2 is transformed and refined using the information from stage 3. It now captures the computation, communication, and synchronization as seen on the target system. To this end, the initial CAG is augmented to incorporate the various latencies and additional computations introduced by moving from an abstract communication model to an actual one.
The resulting CAG can then be analyzed in order to estimate the system performance, determine critical paths, and collect various statistics about the computation and communication components.

The above approach still suffers from several disadvantages. All traces are the result of a simulation, and the coverage of corner cases is still limited. The underlying representation is a complete execution of the application in the form of a graph that may be of prohibitive size. The effects of the transformations applied in order to (1) reduce the size of the CAG and (2) incorporate the concrete communication architecture are not formally specified; therefore, it is not clear what the final analysis results represent. Finally, because of the separation between the functional simulation and the nonfunctional analysis, no feedback is possible. For example, a buffer overflow caused by a sporadic communication overload situation may lead to a difference in the functional behavior. Nevertheless, the described approach blends two important approaches to performance estimation, namely simulation and analytic methods, and makes use of the best properties of both worlds.
15.2.2 Holistic Scheduling Analysis

There is a large body of formal methods available for the scheduling of shared computing resources, for example, fixed-priority, rate-monotonic, and earliest-deadline-first scheduling, time-triggered policies like TDMA or round-robin, and static cyclic scheduling. From the WCETs of individual tasks, the arrival pattern of their activations, and the particular scheduling strategy, one can in many cases analyze the schedulability and the worst-case response times, see for example Reference 7. Many different application models and event patterns have been investigated, such as sporadic, periodic, jitter, and bursts. A large number of commercial tools allow, for this "one-model approach," the analysis of quantities such as resource load and response times. In a similar way, network protocols are increasingly supported by analysis and optimization tools.

Classical scheduling theory has been extended toward distributed systems, where the application is executed on several computing nodes and the timing properties of the communication between these nodes cannot be neglected. The seminal work of Tindell and Clark [8] combined fixed-priority preemptive scheduling at the computation nodes with TDMA scheduling on the interconnecting bus. These results are based on two major achievements:

• The communication system (in this case, the bus) was handled in a similar way as the computing nodes. Because of this integration of process and communication scheduling, the method was called a holistic approach to the performance analysis of distributed real-time systems.
• The second contribution was the analysis of the influence of the release jitter on the response time, where the release jitter denotes the worst-case time difference between the arrival (or activation) of a process and its release (making it available to the processor). Finally, the release jitter was linked to the message delay induced by the communication system.

This work was improved in terms of accuracy by Yen and Wolf [9] by taking into account correlations between arrivals of triggering events. In the meantime, many extensions and applications have been published along the same line of thought. Other combinations of scheduling and arbitration policies have been investigated, such as CAN [10] and, more recently, the FlexRay protocol [11]. The latter extension opens the holistic scheduling methodology to mixed event-triggered and time-triggered systems, where the processing and communication are driven by the occurrence of events or the advance of time, respectively. Nevertheless, it must be noted that the holistic approach does not scale to general distributed architectures, in that for every new kind of application structure, sharing of resources, and combination thereof, a new analysis needs to be developed. In general, the model complexity grows with the size of the system and the number of different scheduling techniques. In addition, the method is restricted to the classical models of task arrival patterns, such as periodic or periodic with jitter.
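To give a flavor of the kind of analysis being composed, the classical fixed-priority response-time recurrence with release jitter, in the style of Reference 8, reads (hp(i) denotes the set of tasks with higher priority than task i; $C_j$, $T_j$, and $J_j$ are the WCET, period, and release jitter of task $j$):

\[
w_i^{(k+1)} = C_i + \sum_{j \in hp(i)} \left\lceil \frac{w_i^{(k)} + J_j}{T_j} \right\rceil C_j,
\qquad R_i = w_i + J_i,
\]

where the iteration starts from $w_i^{(0)} = C_i$ and stops at a fixed point; $R_i$ is then an upper bound on the worst-case response time, and the message delays on the bus enter the analysis as contributions to the jitter terms $J_j$.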
15.2.3 Compositional Methods

Three main problems arise in the case of complex distributed embedded systems. First, the architecture of such systems, as already mentioned, is highly heterogeneous: the different architectural components are designed assuming different input event models and use different arbitration and resource sharing strategies. This makes any kind of compositional performance analysis difficult. Second, applications very often rely on a high degree of concurrency; the multiple control threads additionally complicate timing analysis. And third, we cannot expect that an embedded system only needs to process periodic events with a fixed amount of data associated to each event. If, for example, the event stream represents a sampled voice signal, then after several coding, processing, and communication steps, the amount of data per event as well as the timing may have changed substantially. In addition, stream-based systems often also have to process other event streams that are sporadic or bursty; for example, they have to react to external events or deal with best-effort traffic for coding, transcription, or encryption.

There are only a few approaches available that can handle such complex interactions. One approach is based on a unifying model of different event patterns in the form of arrival curves as known from the networking domain, see References 12 and 13. The proposed real-time calculus (RTC) represents the resources and their processing or communication capabilities in a compatible manner and therefore allows for modular hierarchical scheduling and arbitration analysis for distributed embedded systems. The approach will be explained in Section 15.3 in some more detail.

Richter et al. propose in References 1, 14, and 15 a method that is based on classical real-time scheduling results. They combine different well-known abstractions of task arrival patterns and provide additional interfaces between them. The approach is based on the following principles:

• The main goal is to make use of the very successful results in real-time scheduling, in particular for sharing a single processor or a single communication link, see for example References 7 and 16. For a large class of scheduling and arbitration policies and a set of arrival patterns (periodic, periodic with jitter, sporadic, and bursty), upper and lower bounds on the response time can be determined, that is, on the time difference between the arrival of a task and its finishing time. The abstraction of a task of the application therefore consists of a triggering event stream with a certain arrival pattern and the WCET and BCET on the resource. Several tasks can be mapped onto a single resource. Together with the scheduling policy, one can obtain for each task the associated lower and upper bounds on the response time. Communication and shared busses can be handled in a similar way.
• The application model is a simple concatenation of several tasks. The end-to-end delay can now be obtained by adding the individual contributions of the tasks; the necessary buffer memory can simply be computed taking into account the initial arrival pattern.
• Obviously, the approach is feasible only if the arrival patterns fit the few basic models for which results on computing bounds on the response time are available. In order to overcome this limitation, two types of interfaces are defined: (a) EMIF. Event Model Interfaces are used in the performance analysis only.
They perform a type conversion between certain arrival patterns, that is, they change the mathematical representation of event streams. (b) EAF. Event Adaptation Functions need to be used in cases where there exists no EMIF. In this case, the hardware/software implementation must be changed in order to make the system analyzable, for example, by adding playout buffers at appropriate locations. In addition, a new set of six arrival patterns was defined in Reference 1 which is more suitable for the proposed type conversion using EMIF and EAF, see Figure 15.4. In Figure 15.5, the example of Figure 15.1 is extended by adding to the tasks P1 to P6, appropriate arrival patterns (event stream abstractions) and EMIF/EAF interfaces. For example, we suppose that there is an analysis method for the bus arbitration scheme available that requires “periodic with jitter” as the input model. As the transformation from “periodic with burst” requires an EAF, the implementation must be changed to accommodate a buffer that smoothens the bursts. From “periodic” to “periodic with jitter,”
FIGURE 15.4 Some arrival patterns of tasks that can be used to characterize properties of event streams in Reference 1. T, J, and d denote the period, jitter, and minimal interarrival time, respectively. φ0 denotes a constant phase shift. [Figure: three panels showing admissible occurrences of events on the time axis: "periodic" ($t_{i+1} - t_i = T$), "periodic w/jitter" ($t_i = i \cdot T + \varphi_i + \varphi_0$, $0 \le \varphi_i \le J$), and "periodic w/burst" (the same with $J > T$ and the additional constraint $t_{i+1} - t_i \ge d$).]
FIGURE 15.5 Example of event stream interfaces for the example in Figure 15.1. [Figure: the application tasks P1 to P6 and the communications C1 and C2 are annotated with event model abstractions (sporadic, periodic, periodic w/jitter, periodic w/burst); an EAF with a buffer converts the bursty stream before the bus, and EMIFs convert between the remaining models.]
From "periodic" to "periodic with jitter," one can construct a lossless EMIF simply by setting the jitter J = 0. There is another interface between communication C1 and task P3 that converts the bursty output of the bus to a sporadic model. Now, one can apply performance analysis methods to all of the components. As a result, one may determine the minimal buffer size and an appropriate scheduling policy for the DSP such that no overflow or underflow occurs.

Several extensions have been worked out, for example, in order to deal with cyclic nonfunctional dependencies and to generalize the application model. Nevertheless, when compared against the requirements for a modular performance analysis, the approach has some inherent drawbacks. EAFs are a consequence of the limited class of supported event models and of the available analysis methods; here, the analysis method enforces a change in the implementation. Furthermore, the approach is not modular in terms of the resources, as their service is not modeled explicitly. For example, if several scheduling policies need to be combined on one resource (hierarchical scheduling), then for each new combination an appropriate analysis method must be developed. In this respect, the approach suffers from the same problem as the "holistic approach" described earlier. In addition, one is bound to the classical arrival patterns, which are not sufficient in the case of stream processing applications. Other event models either need to be converted with a loss in accuracy (EMIF) or the implementation must be changed (EAF).
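To make the interface idea concrete, the following Python sketch (a minimal illustration with hypothetical type names, not the tool of Reference 1) represents three of the classical event models by their parameters and implements two conversions: the lossless EMIF from "periodic" to "periodic with jitter" (J = 0), and a lossy EMIF from "periodic with jitter" to "sporadic," which retains only the minimal interarrival distance.

    from dataclasses import dataclass

    # Hypothetical parameter records for three of the classical event models.
    @dataclass
    class Periodic:
        T: float              # period

    @dataclass
    class PeriodicWithJitter:
        T: float              # period
        J: float              # jitter

    @dataclass
    class Sporadic:
        d: float              # minimal interarrival time

    def emif_periodic_to_jitter(m: Periodic) -> PeriodicWithJitter:
        # Lossless EMIF: a periodic stream is a periodic-with-jitter
        # stream with jitter J = 0.
        return PeriodicWithJitter(T=m.T, J=0.0)

    def emif_jitter_to_sporadic(m: PeriodicWithJitter) -> Sporadic:
        # Lossy EMIF: only the minimum distance between events survives.
        # Two consecutive events can be at most J closer than T
        # (degenerates to d = 0 when J >= T).
        return Sporadic(d=max(m.T - m.J, 0.0))

    print(emif_periodic_to_jitter(Periodic(T=10.0)))
    print(emif_jitter_to_sporadic(PeriodicWithJitter(T=10.0, J=4.0)))  # d = 6.0

The conversion in the second function discards the period information, which is exactly the loss in accuracy alluded to above.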
15.3 The Performance Network Approach

This section describes an approach to the performance analysis of embedded systems that is influenced by the worst-case analysis of communication networks. The network calculus as described in Reference 17 is based on Reference 18 and uses (min,+)-algebra to formulate the necessary operations. The network calculus is a promising analysis methodology, as it is designed to be modular in various respects and as the representation of event (or packet) streams is not restricted to the few classes mentioned in Section 15.2.3. In References 12 and 19, the method has been extended to the RTC in order to deal with distributed embedded systems by combining computation and communication. Because of the detailed modeling of the capabilities of the shared computing and communication resources as well as of the event streams, a high accuracy can be achieved, see Reference 20.

The following sections explain the basic approach. Note that the main performance analysis method is not bound to the use of the RTC: any suitable abstraction of event streams and resource characterization is possible. Only the actual computations that are done within the components of the performance network need to be changed appropriately.
15.3.1 Performance Network

In functional specification and verification, the given application is usually decomposed into components that communicate via event interfaces. The properties of the whole system are investigated by combining the behavior of the components. This kind of representation is common in the design of complex embedded systems and is supported by many tools and standards, for example, UML. It would be highly desirable if the performance analysis followed the same line of thinking, as it could then easily be integrated into the usual design methodology. Considering the discussion given earlier, we can identify two major additions that are necessary:

• Abstraction. Performance analysis is interested in making statements about the timing behavior not just for one specific input characterization but for a larger class of possible environments. Therefore, the concrete event streams that flow between the components must be represented in an abstract way. As an example, we have seen their characterization as "periodic" or "sporadic with jitter." In the same way, the nonfunctional properties of the application and the resource sharing mechanisms must be modeled appropriately.
• Resource modeling. In comparison to functional validation, we need to model the resource capabilities and how they are changed by the workload of tasks or communication. Therefore, in contrast to the approaches described before, we will model the resources explicitly, as "first class citizens" of the approach.

As an example of a performance network, let us look again at the simple example from Figure 15.1 and Figure 15.5. In Figure 15.6, we see a corresponding performance network. Because of the simplicity of the example, not all the modeling possibilities can be shown. On the left-hand side, you see the abstract inputs, which model the sources of the event streams that trigger the tasks of the application: "Timer" represents the periodic instantiation of the task that reads out the buffer for playback, "Sensor" models the periodic bursty events from the sensor, and "RT data" denotes the real-time data arriving in equidistant packets via the Input interface. The associated abstract event streams are transformed by the performance components. At the top, you can see the resource modules that model the service of the shared resources, for example, the Input, CPU, Bus, and I/O components. The abstract resource streams (vertical direction) interact with the event streams on the performance modules and performance components. The resource interfaces at the bottom represent the remaining resource service that is available to other applications that may run on the execution platform.

The performance components represent (1) the way in which the timing properties of input event streams are transformed into timing properties of output event streams and (2) the transformation of the resources. Of course, these components can be hierarchically grouped into larger components.
FIGURE 15.6 A simple performance network related to the example in Figure 15.1. [Figure: resource modules (Input, CPU, Bus, DSP, I/O) at the top; abstract inputs (Timer, Sensor, RT data) at the left; performance components P1 to P6, C1, and C2 connected by abstract event streams (horizontal) and abstract resource streams (vertical); resource interfaces at the bottom.]
The way in which the performance components are grouped, together with their transfer functions, reflects the resource sharing strategy. For example, P1 and P2 are connected serially in terms of the resource stream, and therefore they model a fixed-priority scheme with the high priority assigned to task P1. If the bus implements an FCFS strategy or a time-triggered protocol (TTP), the transfer functions of C1/C2 need to be determined such that the abstract representations of the event and resource streams are correctly transformed.
15.3.2 Variability Characterization

The timing characterization of event and resource streams is based on Variability Characterization Curves (VCCs), which substantially generalize the classical representations such as sporadic or periodic. As the event streams propagate through the distributed architecture, their timing properties get increasingly complex and the standard patterns cannot model them with appropriate accuracy.

The event streams are described using arrival curves $\bar{\alpha}^u(\Delta), \bar{\alpha}^l(\Delta) \in \mathbb{R}^{\geq 0}$, $\Delta \in \mathbb{R}^{\geq 0}$, which provide upper and lower bounds on the number of events in any time interval of length $\Delta$. In particular, there are at most $\bar{\alpha}^u(\Delta)$ and at least $\bar{\alpha}^l(\Delta)$ events within the time interval $[t, t + \Delta)$ for all $t \geq 0$. In a similar way, the resource streams are characterized using service functions $\beta^u(\Delta), \beta^l(\Delta) \in \mathbb{R}^{\geq 0}$, $\Delta \in \mathbb{R}^{\geq 0}$, which provide upper and lower bounds on the available service in any time interval of length $\Delta$. The unit of service depends on the kind of shared resource, for example, instructions (computation) or bytes (communication).

Note that, as defined above, the VCCs $\bar{\alpha}^u(\Delta)$ and $\bar{\alpha}^l(\Delta)$ are expressed in terms of events (marked by the bar on their symbols), while the VCCs $\beta^u(\Delta)$ and $\beta^l(\Delta)$ are expressed in terms of workload/service. A method to transform event-based VCCs to workload/resource-based VCCs and vice versa is presented later in this chapter. All calculations and transformations presented here are valid both with only event-based and with only workload/resource-based VCCs, but in this chapter mainly the event-based formulation is used.

Figure 15.7 shows arrival curves that specify the basic classical models shown in Figure 15.4. Note that in the case of sporadic patterns, the lower arrival curves are 0. In a similar way, Figure 15.8 shows a service curve of a simple TDMA bus access with period T, bandwidth b, and slot interval τ.
FIGURE 15.7 Basic arrival functions related to the patterns described in Figure 15.4. [Figure: staircase upper and lower arrival curves $\bar{\alpha}^u, \bar{\alpha}^l$ over the interval length $\Delta$ for the periodic, periodic w/jitter, and periodic w/burst patterns, parameterized by T, J, and d.]
FIGURE 15.8 Example of a service curve that describes a simple TDMA protocol. [Figure: upper and lower service curves $\beta^u, \beta^l$ over the interval length $\Delta$ for a TDMA resource with period T, slot length τ, and bandwidth b; within each period, service is accumulated with slope b during the slot of length τ only.]
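As an illustration of the analytic construction of such curves, consider the following Python sketch. It is a minimal illustration: the staircase formulas for the jitter model follow directly from the pattern definition, while the closed form used for the TDMA lower service curve is a common one and is stated here as an assumption of this sketch.

    import math

    def arrival_upper(delta, T, J=0.0):
        # Upper arrival curve of a "periodic with jitter" stream:
        # at most ceil((delta + J) / T) events in any window of length delta.
        return math.ceil((delta + J) / T) if delta > 0 else 0

    def arrival_lower(delta, T, J=0.0):
        # Lower arrival curve: at least floor((delta - J) / T) events.
        return max(0, math.floor((delta - J) / T))

    def tdma_service_lower(delta, T, tau, b):
        # Guaranteed (lower) service of a TDMA resource with one slot of
        # length tau per period T at bandwidth b (assumed closed form).
        k = delta / T
        return b * max(math.floor(k) * tau, delta - math.ceil(k) * (T - tau))

    # A purely periodic stream is the special case J = 0:
    print(arrival_upper(2.5, T=1.0))           # at most 3 events in 2.5 time units
    print(arrival_upper(2.5, T=1.0, J=0.6))    # jitter admits one more event: 4
    print(tdma_service_lower(2.0, T=1.0, tau=0.25, b=8.0))  # two full slots: 4.0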
Note that arrival curves can be approximated by piecewise linear functions. Moreover, there are of course finite representations of the arrival and service curves, for example, by decomposing them into an irregular initial part and a periodic part.

Where do the arrival and service functions come from, for example, those characterizing a processor (CPU in Figure 15.6) or an abstract input (Sensor in Figure 15.6)?

• Pattern. In some cases, the pattern of the event or resource stream is known, for example, bursty, periodic, sporadic, or TDMA. In this case, the functions can be constructed analytically, see for example, Figure 15.7 and Figure 15.8.
• Trace. In case of unknown arrival or service patterns, one may use a set of traces and compute the envelope. This can be done easily by using a sliding window of size $\Delta$ and determining the maximum and minimum number of events (or the maximum and minimum service) within the window.
• Data sheets. In other cases, one can derive the bounds from the characteristics of the generating device (in the case of an arrival curve) or of the hardware component (in the case of a service curve).

The performance components transform abstract event and resource streams. But so far, the arrival curve is defined in terms of events per time interval, whereas the service curve is given in terms of service per time interval. One possibility to bridge this gap is to define the concept of workload curves, which connect the number of successive events in an event stream with the maximal or minimal workload associated with them. They capture the variability in execution demands. The upper and lower workload curves $\gamma^u(e), \gamma^l(e) \in \mathbb{R}^{\geq 0}$ denote the maximal and minimal workload imposed on a specific resource by any sequence of $e$ consecutive events. If we have these curves available, then we can easily determine upper and lower bounds on the workload that an event stream imposes on a resource in any time interval of length $\Delta$ as $\alpha^u(\Delta) = \gamma^u(\bar{\alpha}^u(\Delta))$ and $\alpha^l(\Delta) = \gamma^l(\bar{\alpha}^l(\Delta))$, respectively, and analogously, $\bar{\beta}^u(\Delta) = \gamma^{l^{-1}}(\beta^u(\Delta))$ and $\bar{\beta}^l(\Delta) = \gamma^{u^{-1}}(\beta^l(\Delta))$.

As in the case of the arrival and service curves, the question arises where the workload curves can come from. A selection of possibilities is given below:

• WCET and BCET. The simplest possibility is to (1) assume that each event of an event stream triggers the same task and (2) assume that this task has a given WCET and BCET determined by other methods. An example of an associated workload curve is given in Figure 15.9. The same holds for communication events.
FIGURE 15.9 Two examples of modeling the relation between incoming events and the associated workload on a resource. The left-hand side shows a simple modeling in terms of the WCET and BCET of the task triggered by an event. The right-hand side models the workload generated by a task through a finite state machine. The workload curves can be constructed by considering the maximum or minimum weight paths with e transitions.
• Application modeling. The above method models the fact that not all events lead to the same execution load (or number of bits) by simply using upper and lower bounds on the execution time. The accuracy of this approach can be substantially improved if characteristics of the application are taken into account, for example, by (1) distinguishing between different event types, each one triggering a different task, and (2) modeling that it is not possible for many consecutive events to all have the WCET (or BCET). This way, one can model correlations in event streams, see Reference 21. The right-hand side of Figure 15.9 shows a simple example where a task is refined into a set of subtasks. At each incoming event, a subtask generates the associated workload and the program branches to one of its successors.
• Trace. As in the case of arrival curves, we can use a given trace and record the workload associated with each event, for example, by simulation. Based on this information, we can easily compute the upper and lower envelopes.

A more fine-grained modeling of an application is also possible, for example, by taking into account different event types in event streams, see Reference 22. By the same approach, it is also possible to model more complex task models, for example, a task with different production and consumption rates of events, or tasks with several event inputs, see Reference 23. Moreover, the same modeling holds for the load on the communication links of the execution platform. In order to construct a performance network according to Figure 15.6, we still need to take into account the resource sharing strategy.
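The two simplest constructions from the list above can be stated compactly. The following Python sketch (a minimal illustration; all names and trace values are hypothetical) derives workload curves from WCET/BCET bounds and computes envelopes from simulated per-event demand and timestamp traces:

    def workload_curves_wcet_bcet(e, wcet, bcet):
        # Simplest model: every event triggers the same task, so any e
        # consecutive events demand between e*BCET and e*WCET service units.
        return e * wcet, e * bcet          # (gamma_u(e), gamma_l(e))

    def workload_envelope(demands, e):
        # Trace-based construction: slide a window of e consecutive events
        # over a per-event demand trace and record the max/min total demand.
        sums = [sum(demands[i:i + e]) for i in range(len(demands) - e + 1)]
        return max(sums), min(sums)        # (gamma_u(e), gamma_l(e))

    def arrival_envelope_upper(timestamps, delta):
        # Upper arrival curve from a trace: the maximal event count in any
        # window [t, t + delta) is attained with the window starting at an
        # event (the lower envelope needs windows between events as well).
        return max(sum(1 for s in timestamps if t <= s < t + delta)
                   for t in timestamps)

    print(workload_curves_wcet_bcet(3, wcet=4, bcet=3))            # (12, 9)
    print(workload_envelope([4, 3, 4, 3, 3, 4], e=2))              # (7, 6)
    print(arrival_envelope_upper([0.0, 1.0, 1.2, 2.5, 3.0], 1.5))  # 3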
15.3.3 Resource Sharing and Analysis

In Figure 15.6, we see, for example, that the performance modules associated with tasks P1 and P2 are connected serially. This way, we can model a preemptive fixed-priority resource sharing strategy, as P2 only gets the CPU resource that is left after the workload of P1 has been served. Other resource sharing strategies can be modeled as well, see for example, Figure 15.10, where in addition a proportional share policy is modeled on the left. In this case, a fixed portion of the available resource (computation or communication) is associated with each task. Other sharing strategies, such as FCFS, are possible as well [17].

In the same Figure 15.10, we also see how the workload characterization described in the previous section is used to transform the incoming arrival curve into a representation that talks about the workload for a resource. After the transformation of the incoming stream by a block called RTC, the inverse workload transformation may be applied again in order to characterize the stream by means of events per time interval. This way, the performance modules can be freely combined, as their input and output representations match.
FIGURE 15.10 Two examples of resource sharing strategies and their model in the RTC. [Figure: a proportional share component (left), in which a Share block splits the service $[\beta^l, \beta^u]$ among the RTC blocks and a Sum block recombines the remaining service, and a fixed priority component (right), in which the RTC blocks are connected serially in terms of the resource stream; $\gamma$ and $\gamma^{-1}$ blocks convert the event streams $[\alpha^l, \alpha^u]$ between event-based and workload-based representations, yielding outputs $[\alpha^{l\prime}, \alpha^{u\prime}]$ and $[\beta^{l\prime}, \beta^{u\prime}]$.]
FIGURE 15.11 Functional model of resource sharing on computation and communication resources. [Figure: input streams are stored in per-stream buffers; a resource sharing stage selects among the queued events, which are then processed by the shared service.]
We still need to describe how a single workload stream and a resource stream interact on a resource. The underlying model and analysis very much depend on the underlying execution platform. As a common example, we suppose that the events (or data packets) corresponding to a single stream are stored in a queue before being processed, see Figure 15.11. The same model is used for computation as well as for communication resources. It matches well the common structure of operating systems, where ready tasks are lined up until the processor is assigned to one of them. Events belonging to one stream are processed in a FCFS manner, whereas the order between different streams depends on the particular resource sharing strategy. Following this model, one can derive the equations that describe the transformation of arrival and service curves by an RTC module according to Figure 15.10, see for example, Reference 13:
$$\bar{\alpha}^{u\prime} = [(\bar{\alpha}^u \otimes \bar{\beta}^u) \oslash \bar{\beta}^l] \wedge \bar{\beta}^u$$
$$\bar{\alpha}^{l\prime} = [(\bar{\alpha}^l \oslash \bar{\beta}^u) \otimes \bar{\beta}^l] \wedge \bar{\beta}^l$$
$$\bar{\beta}^{u\prime} = (\bar{\beta}^u - \bar{\alpha}^l) \,\bar{\oslash}\, 0$$
$$\bar{\beta}^{l\prime} = (\bar{\beta}^l - \bar{\alpha}^u) \,\bar{\otimes}\, 0$$

Following Reference 24, the operators used are called min-plus/max-plus convolutions,

$$(f \otimes g)(t) = \inf_{0 \le u \le t} \{ f(t-u) + g(u) \}$$
$$(f \,\bar{\otimes}\, g)(t) = \sup_{0 \le u \le t} \{ f(t-u) + g(u) \}$$

and min-plus/max-plus deconvolutions,

$$(f \oslash g)(t) = \sup_{u \ge 0} \{ f(t+u) - g(u) \}$$
$$(f \,\bar{\oslash}\, g)(t) = \inf_{u \ge 0} \{ f(t+u) - g(u) \}$$
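The following Python sketch is a finite-horizon approximation of these operators on curves sampled at integer interval lengths 0, 1, ..., N (a minimal illustration, not a complete RTC implementation; the deconvolutions are truncated at the horizon). The component function mirrors the four equations above, with the pointwise minimum ∧ defined next in the text:

    def minplus_conv(f, g):
        # (f (x) g)(t) = inf over 0 <= u <= t of f(t-u) + g(u)
        n = min(len(f), len(g))
        return [min(f[t - u] + g[u] for u in range(t + 1)) for t in range(n)]

    def maxplus_conv(f, g):
        n = min(len(f), len(g))
        return [max(f[t - u] + g[u] for u in range(t + 1)) for t in range(n)]

    def minplus_deconv(f, g):
        # (f (/) g)(t) = sup over u >= 0 of f(t+u) - g(u), truncated at horizon
        n = min(len(f), len(g))
        return [max(f[t + u] - g[u] for u in range(n - t)) for t in range(n)]

    def maxplus_deconv(f, g):
        n = min(len(f), len(g))
        return [min(f[t + u] - g[u] for u in range(n - t)) for t in range(n)]

    def rtc_component(au, al, bu, bl):
        # Transformation of the abstract event and resource streams by one
        # RTC performance module, following the four equations above.
        zero = [0.0] * min(len(bu), len(bl))
        au2 = [min(x, y) for x, y in
               zip(minplus_deconv(minplus_conv(au, bu), bl), bu)]
        al2 = [min(x, y) for x, y in
               zip(minplus_conv(minplus_deconv(al, bu), bl), bl)]
        bu2 = maxplus_deconv([b - a for b, a in zip(bu, al)], zero)
        bl2 = maxplus_conv([b - a for b, a in zip(bl, au)], zero)
        return au2, al2, bu2, bl2

In the fixed priority component of Figure 15.10, for example, the remaining-service outputs (bu2, bl2) of the high-priority module become the service inputs of the lower-priority module.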
Further, $\wedge$ denotes the minimum operator, $(f \wedge g)(t) = \min\{f(t), g(t)\}$.

Using these equations, the workload curves, and the characterization of the input event and resource streams, we can now determine the characterizations of all event and resource streams in a performance network such as that of Figure 15.6. From the resulting arrival curves (leaving the network on the right-hand side) and service curves (at the bottom), we can compute all the relevant information, such as the average resource loads, the end-to-end delays, and the necessary buffer spaces on the event and packet queues, see Figure 15.11. If the performance network contains cycles, then fixed-point iterations are necessary.

As an example, let us suppose that the upper input arrival curve of an event stream is $\bar{\alpha}^u(\Delta)$. Moreover, the stream is processed by a sequence of $N$ modules according to the right-hand side of Figure 15.10, with incoming service curves $\beta_i^l(\Delta)$, $1 \le i \le N$, and workload curves $\gamma_i^u(e)$. Then we can determine the maximal end-to-end delay and accumulated buffer space for this stream according to (see Reference 12)

$$\gamma_i^{-1}(W) = \sup\{e \ge 0 : \gamma_i^u(e) \le W\} \quad \forall\, 1 \le i \le N$$
$$\bar{\beta}_i^l(\Delta) = \gamma_i^{-1}(\beta_i^l(\Delta)) \quad \forall\, 1 \le i \le N$$
$$\bar{\beta}^l(\Delta) = \bar{\beta}_1^l(\Delta) \otimes \bar{\beta}_2^l(\Delta) \otimes \cdots \otimes \bar{\beta}_N^l(\Delta)$$
$$\mathrm{delay} \le \sup_{\Delta \ge 0} \, \inf\{\tau \ge 0 : \bar{\alpha}^u(\Delta) \le \bar{\beta}^l(\Delta + \tau)\}$$
$$\mathrm{backlog} \le \sup_{\Delta \ge 0} \{\bar{\alpha}^u(\Delta) - \bar{\beta}^l(\Delta)\}$$

FIGURE 15.12 Representation of the delay and accumulated buffer space computation in a performance network. [Figure: the upper arrival curve $\bar{\alpha}^u$ and the accumulated lower service curve $\bar{\beta}^l$, obtained from the service curves $\beta_1^l, \ldots, \beta_N^l$ via the workload curves $\gamma_1^u, \ldots, \gamma_N^u$; the delay and backlog appear as the maximal horizontal and vertical distance between the two curves.]
The curve $\gamma^{-1}(W)$ denotes the pseudo-inverse of a workload curve, that is, it yields the minimum number of events that can be processed if the service $W$ is available. Therefore, $\bar{\beta}_i^l(\Delta)$ is the minimal available service in terms of events per time interval. It has been shown in Reference 17 that the delay and backlog are determined by the accumulated service $\bar{\beta}^l(\Delta)$, which can be obtained using the convolution of all individual services. The delay and backlog can now be interpreted as the maximal horizontal and vertical distance, respectively, between the arrival and accumulated service curves, see Figure 15.12. All the above computations can be implemented efficiently if appropriate representations of the variability characterization curves are used, for example, piecewise linear, discrete points, or periodic.
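On sampled curves, the two bounds reduce to a horizontal and a vertical distance computation. The following Python sketch (a minimal illustration on integer-indexed curves; it assumes the accumulated service curve has already been formed, for example, with the min-plus convolution sketched earlier) computes both bounds:

    def delay_bound(alpha_u, beta_l):
        # Maximal horizontal distance: for each interval length d, find the
        # smallest shift tau with alpha_u(d) <= beta_l(d + tau), then take
        # the worst case over d.  Saturates at the horizon if the curves
        # never meet, so the horizon must be chosen large enough.
        n = len(alpha_u)
        worst = 0
        for d in range(n):
            tau = 0
            while d + tau < n and beta_l[d + tau] < alpha_u[d]:
                tau += 1
            worst = max(worst, tau)
        return worst

    def backlog_bound(alpha_u, beta_l):
        # Maximal vertical distance between the two curves.
        return max(a - b for a, b in zip(alpha_u, beta_l))

    alpha_u = [0, 3, 3, 4, 5, 6, 6, 7]     # events per interval (illustrative)
    beta_l  = [0, 1, 2, 3, 4, 5, 6, 7]     # guaranteed service in events
    print(delay_bound(alpha_u, beta_l))    # 2
    print(backlog_bound(alpha_u, beta_l))  # 2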
15.3.4 Concluding Remarks

Because of the modularity of the performance network, one can easily analyze a large number of different mapping and resource sharing strategies for design space exploration. Applications can be extended by adding tasks and performance modules. Moreover, different subsystems can use different kinds of resource sharing without sacrificing the performance analysis.

Of particular interest is the possibility to build a performance component for a combined hardware-software system that describes the performance properties of a whole subsystem. This way, a subcontractor can deliver a hardware/software/operating system module that already contains part of the application. The system house can then integrate the performance components of the subsystems in order to validate the performance of the whole system; to this end, it does not need to know the details of the subsystem implementations. In addition, a system house can also add an application to the subsystems. Using the resource interfaces that characterize the remaining service available from the subsystems, its timing correctness can easily be verified.

On the one hand, the performance network approach is correct in the sense that it yields upper and lower bounds on quantities such as end-to-end delay and buffer space. On the other hand, it is a worst-case approach that covers all possible corner cases independent of their probability. Even if the deviations from simulation results can be small, see for example, Reference 25, in many cases one is also interested in the average case behavior of distributed embedded systems. Therefore, performance analysis methods such as those described in this chapter can be considered complementary to existing simulation based validation methods.

Furthermore, any automated or semiautomated exploration of different design alternatives (design space exploration) could be separated into multiple stages, each having a different level of abstraction. It would then be appropriate to use an analytical performance evaluation framework, such as the one described in this chapter, during the initial stages and resort to simulation only when a relatively small set of potential architectures has been identified.
References

[1] K. Richter, D. Ziegenbein, M. Jersak, and R. Ernst. Model composition for scheduling analysis in platform design. In Proceedings of the 39th Design Automation Conference (DAC). ACM Press, New Orleans, LA, June 2002.
[2] L. Thiele, S. Künzli, and E. Zitzler. A modular design space exploration framework for embedded systems. IEE Proceedings Computers and Digital Techniques, Special Issue on Embedded Microelectronic Systems, 2004.
[3] SystemC homepage. http://www.systemc.org.
[4] T. Grötker, S. Liao, G. Martin, and S. Swan. System Design with SystemC. Kluwer Academic Publishers, Boston, May 2002.
[5] D. Burger and T.M. Austin. The SimpleScalar tool set, version 2.0. SIGARCH Computer Architecture News, 25, 1997, 13–25.
[6] K. Lahiri, A. Raghunathan, and S. Dey. System-level performance analysis for designing on-chip communication architectures. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 20, 2001, 768–783.
[7] G.C. Buttazzo. Hard Real-Time Computing Systems: Predictable Scheduling Algorithms and Applications. Kluwer Academic Publishers, Boston, 1997.
[8] K. Tindell and J. Clark. Holistic schedulability analysis for distributed hard real-time systems. Microprocessing and Microprogramming — Euromicro Journal, Special Issue on Parallel Embedded Real-Time Systems, 40, 1994, 117–134.
[9] T. Yen and W. Wolf. Performance estimation for real-time distributed embedded systems. IEEE Transactions on Parallel and Distributed Systems, 9, 1998, 1125–1136.
[10] K. Tindell, A. Burns, and A.J. Wellings. Calculating controller area network (CAN) message response times. Control Engineering Practice, 3, 1995, 1163–1169.
[11] T. Pop, P. Eles, and Z. Peng. Holistic scheduling and analysis of mixed time/event triggered distributed embedded systems. In Proceedings of the International Symposium on Hardware–Software Codesign (CODES). ACM Press, May 2002, pp. 187–192.
[12] L. Thiele, S. Chakraborty, M. Gries, A. Maxiaguine, and J. Greutert. Embedded software in network processors — models and algorithms. In Proceedings of the 1st Workshop on Embedded Software (EMSOFT), Lake Tahoe, CA, Vol. 2211 of Lecture Notes in Computer Science. Springer-Verlag, Heidelberg, 2001, pp. 416–434.
[13] L. Thiele, S. Chakraborty, M. Gries, and S. Künzli. A framework for evaluating design tradeoffs in packet processing architectures. In Proceedings of the 39th Design Automation Conference (DAC). ACM Press, New Orleans, LA, June 2002, pp. 880–885.
[14] K. Richter, M. Jersak, and R. Ernst. A formal approach to MPSoC performance verification. IEEE Computer, 36, 2003, 60–67.
[15] K. Richter and R. Ernst. Model interfaces for heterogeneous system analysis. In Proceedings of the 6th Design, Automation and Test in Europe (DATE). IEEE, Munich, Germany, March 2002, pp. 506–513.
[16] J.A. Stankovic, M. Spuri, K. Ramamritham, and G.C. Buttazzo. Deadline Scheduling for Real-Time Systems: EDF and Related Algorithms. Kluwer International Series in Engineering and Computer Science, Vol. 460. Kluwer Academic Publishers, Dordrecht, 1998.
[17] J.-Y. Le Boudec and P. Thiran. Network Calculus — A Theory of Deterministic Queuing Systems for the Internet, Vol. 2050 of Lecture Notes in Computer Science. Springer-Verlag, Heidelberg, 2001.
[18] R.L. Cruz. A calculus for network delay, part I: network elements in isolation. IEEE Transactions on Information Theory, 37, 1991, 114–131.
[19] L. Thiele, S. Chakraborty, and M. Naedele. Real-time calculus for scheduling hard real-time systems. In Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS), Vol. 4. IEEE, 2000, pp. 101–104.
[20] S. Chakraborty, S. Künzli, L. Thiele, A. Herkersdorf, and P. Sagmeister. Performance evaluation of network processor architectures: combining simulation with analytical estimation. Computer Networks, 41, 2003, 641–665.
[21] A. Maxiaguine, S. Künzli, and L. Thiele. Workload characterization model for tasks with variable execution demand. In Proceedings of Design, Automation and Test in Europe (DATE). IEEE Press, Paris, France, February 2004, pp. 1040–1045.
[22] E. Wandeler, A. Maxiaguine, and L. Thiele. Quantitative characterization of event streams in analysis of hard real-time applications. In Proceedings of the 10th IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS). IEEE Computer Society, May 2004, pp. 450–459.
[23] E. Wandeler and L. Thiele. Abstracting functionality for modular performance analysis of hard real-time systems. In Proceedings of the Asia South Pacific Design Automation Conference (ASP-DAC). IEEE, January 2005.
[24] F. Baccelli, G. Cohen, G. Olsder, and J.-P. Quadrat. Synchronization and Linearity. John Wiley & Sons, New York, 1992.
[25] S. Chakraborty, S. Künzli, and L. Thiele. A general framework for analysing system properties in platform-based embedded system designs. In Proceedings of the 6th Design, Automation and Test in Europe (DATE). Munich, Germany, March 2003.
Power Aware Computing
16 Power Aware Embedded Computing

Margarida F. Jacome and Anand Ramachandran
University of Texas

16.1 Introduction
16.2 Energy and Power Modeling
    Instruction- and Function-Level Models • Micro-Architectural Models • Memory and Bus Models • Battery Models
16.3 System/Application Level Optimizations
16.4 Energy Efficient Processing Subsystems
    Voltage and Frequency Scaling • Dynamic Resource Scaling • Processor Core Selection
16.5 Energy Efficient Memory Subsystems
    Cache Hierarchy Tuning • Novel Horizontal and Vertical Cache Partitioning Schemes • Dynamic Scaling of Memory Elements • Software-Controlled Memories, Scratch-Pad Memories • Improving Access Patterns to Off-Chip Memory • Special Purpose Memory Subsystems for Media Streaming • Code Compression • Interconnect Optimizations
16.6 Summary
References
16.1 Introduction

Embedded systems are pervasive in modern life. State-of-the-art embedded technology drives the ongoing revolution in consumer and communication electronics, and underpins substantial innovation in many other domains, including medical instrumentation, process control, etc. [1]. The impact of embedded systems in well-established "traditional" industrial sectors, for example, the automotive industry, is also increasing at a fast pace [1,2]. Unfortunately, as Complementary Metal-Oxide Semiconductor (CMOS) technology rapidly scales, enabling the fabrication of ever faster and denser Integrated Circuits (ICs), the challenges that must be overcome to deliver each new generation of electronic products multiply. In the last few years, power dissipation has emerged as a major concern. In fact, projections of power density increases owing to CMOS scaling clearly indicate that this is one of the fundamental problems that will ultimately preclude further scaling [3,4]. Although the power challenge is indeed considerable, much can be done to mitigate the deleterious effects of power dissipation, thus enabling performance and device density to be taken to truly unprecedented levels by the semiconductor industry throughout the next 10 to 15 years.
Power density has a direct impact on packaging and cooling costs, and can also affect system reliability, owing to electromigration and hot-electron degradation effects. Thus, the ability to decrease power density, while offering similar performance and functionality, critically enhances the competitiveness of a product. Moreover, for battery operated portable systems, maximizing battery lifetime translates into maximizing duration of service, an objective of paramount importance for this class of products. Power is thus a primary figure of merit in contemporaneous embedded system design.

Digital CMOS circuits have two main types of power dissipation: dynamic and static. Dynamic power is dissipated when the circuit performs the function(s) it was designed for, for example, logic and arithmetic operations (computation), data retrieval, storage, transport, etc. Ultimately, all of this activity translates into switching of the logic states held on circuit nodes. Dynamic power dissipation is thus proportional to $C \cdot V_{DD}^2 \cdot f \cdot r$, where $C$ denotes the total circuit capacitance, $V_{DD}$ and $f$ denote the circuit supply voltage and clock frequency, respectively, and $r$ denotes the fraction of transistors expected to switch at each clock cycle [5,6]. In other words, dynamic power dissipation is impacted to first order by circuit size/complexity, speed/rate, and switching activity. In contrast, static power dissipation is associated with preserving the logic state of circuit nodes between such switching activity, and is caused by subthreshold leakage mechanisms. Unfortunately, as device sizes shrink, the severity of leakage power is increasing at an alarming pace [3].

Clearly, the power problem must be addressed at all levels of the design hierarchy, from system to circuit, as well as through innovations in CMOS device technology [5,7,8]. In this survey we provide a snapshot of the state of the art in system- and architecture-level design techniques and methodologies aimed at reducing both static and dynamic power dissipation. Since such techniques focus on the highest level of the design hierarchy, their potential benefits are immense. In particular, at this high level of abstraction, the specifics of each particular class of embedded applications can be considered as a whole and, as will be shown in our survey, such an ability is critical to designing power/energy efficient systems, that is, systems that spend energy strictly when and where it is needed. Broadly speaking, this requires a proper design and allocation of system resources, geared toward addressing critical performance bottlenecks in a power efficient way.

Substantial power/energy savings can also be achieved through the implementation of adequate dynamic power management policies, for example, tracking instantaneous workloads (or levels of resource utilization) and "shutting down" idling/unused resources, so as to reduce leakage power, or "slowing down" under-utilized resources, so as to decrease dynamic power dissipation. These are clearly system level decisions/policies, in that their implementation typically impacts several architectural subsystems. Moreover, different decisions/policies may interfere or conflict with each other and, thus, assessing their overall effectiveness requires a system level (i.e., global) view of the problem.
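The quadratic dependence on supply voltage is what makes voltage the dominant knob. A small worked illustration in Python (the parameter values are illustrative only):

    def dynamic_power(C, vdd, f, r):
        # P_dyn is proportional to C * VDD^2 * f * r (units normalized).
        return C * vdd ** 2 * f * r

    # Scaling VDD from 1.2 V to 0.9 V at constant C, f, and r cuts dynamic
    # power to (0.9 / 1.2)^2, i.e., roughly 56% of its original value.
    print(dynamic_power(1.0, 0.9, 1.0, 0.2) / dynamic_power(1.0, 1.2, 1.0, 0.2))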
A typical embedded system architecture consists of a processing subsystem (including one or more processor cores, hardware accelerators, etc.), a memory subsystem, peripherals, and global and local interconnect structures (buses, bridges, crossbars, etc.). Figure 16.1 shows an abstract view of two such architecture instances. Broadly speaking, system level design consists of defining the specific embedded system architecture to be used for a particular product, as well as defining how the target embedded application (implementing the required functionality/services) is to be mapped onto that architecture.

Embedded systems come in many varieties and with many distinct design optimization goals and requirements. Even when two products provide the same basic functionality, say, video encoding/decoding, they may have fundamentally different characteristics, namely, different performance and quality-of-service requirements; one may be battery operated and the other not; etc. The implications of such product differentiation are of paramount importance when power/energy is considered. Clearly, the higher the system's required performance/speed (defined by metrics such as throughput, latency, bandwidth, response time, etc.), the higher its power dissipation will be. The key objective is thus to minimize the power dissipated to deliver the required level of performance [5,9]. The trade-offs, techniques, and optimizations required to develop such power aware or power efficient designs vary widely across the vast spectrum of embedded systems available in today's market, encompassing many complex decisions driven by system requirements as well as by intrinsic characteristics of the target applications [10].
FIGURE 16.1 Illustrative examples of a simple and a more complex embedded system architecture. [Figure: the simple architecture combines an embedded processor core, a DSP core, a VLIW core, memory, a modem, A/D and D/A converters, and I/O ports; the more complex one comprises a primary embedded processor core, VLIW cores, a master control ASIP core, a sound codec DSP core, a host interface, an ASIP memory controller with RAM and Flash ROM, a hardware accelerator (FFT, DCT, ...), A/D and D/A converters, and I/O ports.]
Consider, for example, the design task of deciding on the number and type of processing elements to be instantiated on an embedded architecture, that is, defining its processing subsystem. Power, performance, cost, and time-to-market considerations dictate whether one should rely entirely on readily available processors (i.e., off-the-shelf microcontrollers, Digital Signal Processors (DSPs), and/or general-purpose RISC cores), or should also consider custom execution engines, namely, Application Specific Instruction set Processors (ASIPs), possibly reconfigurable, and/or hardware accelerators (see Figure 16.1). Hardware/software partitioning is a critical step in this process [1]. It consists of deciding which of an application's segments/functions should be implemented in software (i.e., run on a processor core) and which (if any) should be implemented in hardware (i.e., execute on high performance, highly power efficient custom hardware accelerators). Naturally, hardware/software partitioning decisions should reflect the power/performance criticality of each such segment/function. Clearly, this is a complex multiobjective optimization problem defined on a huge design space that encompasses both hardware and software related decisions.

To compound the problem, the performance and energy efficiency of an architecture's processing subsystem cannot be evaluated in isolation, since its effectiveness can be substantially impacted by the memory subsystem (i.e., the adopted memory hierarchy/organization) and by the interconnect structures supporting communication/data transfers between processing components and to/from the environment in which the system is embedded. Thus, decisions with respect to these other subsystems and components must be made concurrently and assessed jointly.

Targeting up front a specific embedded system platform, that is, an architectural "subspace" relevant to a particular class of products/applications, can considerably reduce the design effort [11,12]. Still, the design space typically remains so complex that a substantial design space exploration may be needed in order to identify power/energy efficient solutions for the specified performance levels. Since time to market is critical, methodologies to efficiently drive such an exploration, as well as fast simulators and low complexity (yet good fidelity) performance, power, and energy estimation models, are critical to aggressively exploiting effective power/energy driven optimizations within a reasonable time frame.

Our survey starts by providing an overview of state-of-the-art models and tools used to evaluate the goodness of individual system design points. We then discuss power management techniques and optimizations aimed at aggressively improving the power/energy efficiency of the various subsystems of an embedded system.
16.2 Energy and Power Modeling

This section discusses high-level modeling and power estimation techniques aimed at assisting system- and architecture-level design. It would be unrealistic to expect a high degree of accuracy from power estimates produced during such an early design phase, since accurate power modeling requires detailed physical-level information that may not yet be available. Moreover, highly accurate estimation tools (working with detailed circuit/layout-level information) would be too time consuming to allow for any reasonable degree of design space exploration [1,5,13]. Thus, practically speaking, power estimation during early design space exploration should aim at ensuring a high degree of fidelity rather than necessarily accuracy. Specifically, the primary objective during this critical exploration phase is to assess the relative power efficiency of different candidate system architectures (populated with different hardware and/or software components), the relative effectiveness of alternative software implementations (of the same functionality), the relative effectiveness of different power management techniques, etc. Estimates that correctly expose such relative power trends across the design space region being explored provide the designer with the necessary information to guide the exploration process.
16.2.1 Instruction- and Function-Level Models

Instruction-level power models are used to assess the relative power/energy efficiency of different processors executing a given target embedded application, possibly with alternative memory subsystem configurations. Such models are thus instrumental during the definition of the main subsystems of an embedded architecture, as well as during hardware/software partitioning. Moreover, instruction-level power models can also be used to evaluate the relative effectiveness of different software implementations of the same embedded application, in the context of a specific embedded architecture/platform.

In their most basic form, instruction-level power models simply assign a power cost to each assembly instruction (or class of assembly instructions) of a programmable processor. The overall energy consumed by a program running on the target processor is estimated by summing up the instruction costs over a dynamic execution trace that is representative of the application [14–17]. Instruction-level power models were first developed by experimentally measuring the current drawn by a processor while executing different instruction sequences [14]. During this first modeling effort, it was observed that the power cost of an instruction may actually depend on previous instructions. Accordingly, the instruction-level power models developed in Reference 14 include several inter-instruction effects. Later studies observed that, for certain processors, the power dissipation incurred by the hardware responsible for fetching, decoding, analyzing, and issuing instructions, and then routing and reordering results, was so high that a simpler model that only differentiates between instructions that access on-chip resources and those that go off-chip would suffice for such processors [16].

Unfortunately, power estimation based on instruction-level models can still be prohibitively time consuming during early design space exploration, since it requires collecting and analyzing large instruction traces and, for many processors, considering a quadratically large number of inter-instruction effects. In order to "accelerate" estimation, processor specific, coarser function-level power models were later developed [18]. Such approaches are faster because they rely on the use of macromodels characterizing the average energy consumption of a library of functions/subroutines executing on a target processor [18]. The key challenge in this case is to devise macromodels that can properly quantify the power consumed by each subroutine of interest as a function of "easily observable parameters." Thus, for example, a quadratic power model of the form $an^2 + bn + c$ could be tentatively selected for an insertion sort routine, where $n$ denotes the number of elements to be sorted. Actual power dissipation then needs to be measured for a large number of experiments, run with different values of $n$. Finally, the values of the macromodel's coefficients $a$, $b$, and $c$ are derived using regression analysis, and the overall accuracy of the resulting macromodel is assessed [18].

The "high-level" instruction- and function-level power models discussed so far allow designers to "quickly" assess a large number of candidate system architectures and alternative software implementations, so as to narrow the design space to a few promising alternatives.
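The regression step above is a standard least-squares fit. A minimal Python sketch (the measurement values are illustrative only, not taken from Reference 18):

    import numpy as np

    # Hypothetical energy measurements of an insertion-sort routine for
    # several input sizes n (values are illustrative only).
    n = np.array([16, 32, 64, 128, 256], dtype=float)
    energy = np.array([3.1, 9.8, 35.2, 131.0, 508.0])   # e.g., in microjoules

    # Fit the a*n^2 + b*n + c macromodel by least squares.
    a, b, c = np.polyfit(n, energy, deg=2)

    def predict(m):
        return a * m ** 2 + b * m + c

    print(f"a={a:.5f} b={b:.4f} c={c:.2f}, predicted E(100) = {predict(100):.1f} uJ")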
Once this initial broad exploration is concluded, power models for each of the architecture’s main subsystems and components are needed, in order to support the detailed architectural design phase that follows.
16.2.2 Micro-Architectural Models

Micro-architectural power models are critical to evaluating the impact of different processing subsystem choices on power consumption, as well as the effectiveness of different (micro-architecture level) power management techniques implemented on the various subsystems. In the late 1980s and early 1990s, cycle accurate (or more precisely, cycle-by-cycle) simulators, such as Simplescalar [19], were developed to study the effect of architectural choices on the performance of general-purpose processors. Such simulators are in general very flexible, allowing designers/architects to explore the complex design space of contemporaneous processors. Namely, they include built-in parameters that can be used to specify the number and mix of functional units to be instantiated in the processor's datapath, the issue width of the machine, the size and associativity of the L1 and L2 caches, etc. By varying such parameters, designers can study the performance of different machine configurations for representative applications/benchmarks. As power consumption became more important, simulators to estimate dynamic power dissipation (e.g., "Wattch" [20], the "Cai–Lim model" [21], and "Simplepower" [22]) were later incorporated into these existing frameworks. Such an integration was performed seamlessly, by directly augmenting the "cycle-oriented" performance models for the various micro-architectural components with corresponding power models.

Naturally, the overall accuracy of these simulation-based power estimation techniques is determined by the level of detail of the power models used for the micro-architecture's constituent components. For out-of-order RISC cores, for example, the power consumed in finding independent instructions to issue is a function of the number of instructions currently in the instruction queue and of the actual dependencies between such instructions. Unfortunately, the use of detailed power models accurately capturing the impact of input and state data on the power dissipated by each component would prohibitively increase the already "long" micro-architectural simulation runtimes. Thus, most state-of-the-art simulators use very simple/straightforward empirical power models for datapath and control logic, and slightly more sophisticated models for regular structures such as caches [20]. In their simplest form, such models capture "typical" or "average" power dissipation for each individual micro-architectural component. Specifically, each time a given component is accessed/used during a simulation run, it is assumed to dissipate its corresponding "average" power. Slightly more sophisticated power macromodels for datapath components have been proposed in References 23–28, and shown to improve accuracy with a relatively small impact on simulation time.

So far, we have discussed power modeling of micro-architectural components, yet a substantial percentage of the overall power budget of a processor is actually spent on the global clock (up to 40 to 45% [5]). Thus, global clock power models must also be incorporated in these frameworks. The power dissipated on global clock distribution is impacted to first order by the number of pipeline registers (and thus by a processor's pipeline depth) and by global and local wiring capacitances (and thus by a processor's core area) [5]. Accordingly, different processor cores and/or different configurations of the same core may dissipate substantially different clock distribution power.
Power estimates incorporating such numbers are thus critical during processor core selection and configuration.

The component-level and clock distribution models discussed so far are used to estimate the dynamic power dissipation of a target micro-architecture. Yet, as mentioned earlier, static/leakage power dissipation is becoming a major concern, and thus micro-architectural techniques aimed at reducing leakage power are increasingly relevant. Models to support early estimation of static power dissipation emerged along the same lines as those used for dynamic power dissipation. The "Butts–Sohi" model, which is one of the most influential static power models developed so far, quantifies static energy in CMOS circuits/components using a lumped parameter model that maps technology and design effects into corresponding characterizing parameters [29]. Specifically, static power dissipation is modeled as $V_{DD} \cdot N \cdot k_{design} \cdot I_{leak}$, where $V_{DD}$ is the supply voltage and $N$ denotes the number of transistors in the circuit. $k_{design}$ is the design dependent parameter: it captures "circuit style" related characteristics of a component, including the average transistor aspect ratio, the average number of transistors switched off during "normal/typical" component operation, etc. Finally, $I_{leak}$ is the technology dependent parameter. It accounts for the impact
of threshold voltage, temperature, and other key parameters on the leakage current, for a specific fabrication process. From a system designer's perspective, static power can be reduced by lowering the supply voltage ($V_{DD}$), and/or by power supply gating, or $V_{DD}$ gating (as opposed to clock gating), of unused/idling devices (reducing $N$). Integrating models for estimating static power dissipation in cycle-by-cycle simulators thus enables embedded system designers to analyze critical static power versus performance trade-offs enabled by "power aware" features available in contemporaneous processors, such as dynamic voltage scaling and selective datapath (re)configuration. An improved version of the "Butts–Sohi" model, called "HotLeakage," which provides the ability to dynamically recalculate leakage currents (as temperature and voltage change owing to operating conditions and/or dynamic voltage scaling), has been integrated into the Simplescalar simulation framework, enabling such high-level trade-offs to be explored by embedded system designers [30].
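A minimal Python sketch of the lumped model and of the effect of VDD-gating idle components (all parameter values below are illustrative assumptions, not taken from Reference 29):

    def static_power(vdd, n_transistors, k_design, i_leak):
        # Butts-Sohi lumped model: P_static = VDD * N * k_design * I_leak.
        return vdd * n_transistors * k_design * i_leak

    def chip_static_power(vdd, i_leak, components):
        # components: list of (N, k_design, powered_on); VDD-gating an idle
        # component removes its transistors from the leaking total.
        return sum(static_power(vdd, n, k, i_leak)
                   for n, k, on in components if on)

    components = [(2_000_000, 0.10, True),    # core datapath
                  (4_000_000, 0.05, True),    # L1 caches
                  (1_000_000, 0.10, False)]   # gated-off accelerator
    print(chip_static_power(1.2, 20e-9, components))  # watts, illustrative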
16.2.3 Memory and Bus Models

Storage elements, such as caches, register files, queues, buffers, and tables, constitute a substantial part of the power budget of contemporaneous embedded systems [31]. Fortunately, the high regularity of some such memory structures (e.g., caches) permits the use of simple, yet reasonably accurate, power estimation techniques relying on automatically synthesized "structural designs" for such components. The Cache Access and Cycle Time (CACTI) framework implements this synthesis-driven power estimation paradigm. Specifically, given a specific cache hierarchy configuration (defined by parameters such as cache size, associativity, and line size), as well as information on the minimum feature size of the target technology [32], it internally generates a coarse structural design for that cache configuration. It then derives delay and power estimates for that particular design, using parameterized built-in C models for the various constituent elements, namely, SRAM cells, row and column decoders, word and bit lines, precharge circuitry, etc. [33,34]. CACTI's synthesis algorithms used to generate the structural design of the memory hierarchy (which include defining the aspect ratio of memory blocks, the number of instantiated sub-banks, etc.) have been shown to consistently deliver reasonably "good" designs across a large range of cache hierarchy parameters [34]. CACTI can thus be used to "quickly" generate power estimates (starting from high-level architectural parameters) exhibiting a reasonably good fidelity over a large region of the design space. During design space exploration, the designer may thus consider a number of alternative L1 and L2 cache configurations and use CACTI to obtain "access-based" power dissipation estimates for each such configuration with good fidelity. Naturally, the memory access traces used by CACTI should be generated by a micro-architecture simulator (e.g., Simplescalar) working with a memory simulator (e.g., Dinero [35]), so that they reflect the bandwidth requirements of the embedded application of interest.

Buses are also a significant contributor to dynamic power dissipation [5,36]. The dynamic power dissipation on a bus is proportional to $C \cdot V_{DD}^2 \cdot f_a$, where $C$ denotes the total capacitance of the bus (including metal wires and buffers), $V_{DD}$ denotes the supply voltage, and $f_a$ denotes the average switching frequency of the bus [36]. In this high-level model, the average switching frequency of the bus ($f_a$) is defined by the product of two terms, namely, the average number of bus transitions per word and the bus frequency (given in bus words per second). The average number of bus transitions per word can be estimated by simulating sample programs and collecting the corresponding transition traces. Although this model is coarse, it may suffice during the early design phases under consideration.
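A small Python sketch of this bus model (a minimal illustration; the trace and electrical parameters are hypothetical). The average transitions per word are estimated as the Hamming distance between consecutive bus words:

    def avg_transitions_per_word(trace):
        # Average number of bit flips between consecutive words on the bus.
        flips = sum(bin(a ^ b).count("1") for a, b in zip(trace, trace[1:]))
        return flips / max(len(trace) - 1, 1)

    def bus_power(C, vdd, words_per_sec, trace):
        # P_bus is proportional to C * VDD^2 * fa, with
        # fa = (average transitions per word) * (bus words per second).
        fa = avg_transitions_per_word(trace) * words_per_sec
        return C * vdd ** 2 * fa

    trace = [0x00FF00FF, 0x00FF0000, 0xFFFF0000, 0x0000FFFF]  # sample bus words
    print(avg_transitions_per_word(trace))        # (8 + 8 + 32) / 3 = 16 flips/word
    print(bus_power(30e-12, 1.2, 100e6, trace))   # illustrative parameters, watts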
16.2.4 Battery Models

The capacity of a battery is a nonlinear function of the current drawn from it; that is, if one increases the average current drawn from a battery by a factor of two, the "remaining" deliverable battery capacity, and thus its lifetime, decreases by more than half. Peukert's formula models such nonlinear behavior by defining the capacity of a battery as $k / I^\alpha$, where $k$ is a constant depending on the battery design, $I$ is the
discharge current, and $\alpha$ quantifies the "nonideal" behavior of the battery [37].¹ More effective system-level trade-offs between quality/performance and duration of service can be implemented by taking such nonlinearity (also called the rate-capacity effect) into consideration. However, in order to properly evaluate the effectiveness of such techniques/trade-offs during system level design, adequate battery models and metrics are needed.

The energy-delay product is a well-known metric used to assess the energy efficiency of a system. It basically quantifies a system's performance loss per unit gain in energy consumption. To emphasize the importance of accurately exploring key trade-offs between battery lifetime and system performance, a new metric, viz., the battery-discharge delay product, has recently been proposed [38]. Task scheduling strategies and dynamic voltage scaling policies adopted for battery operated embedded systems should thus aim at minimizing the battery-discharge delay product, rather than the energy-delay product, since the former captures the important rate-capacity effect alluded to above, whereas the energy-delay product is insensitive to it. Yet, a metric such as the battery-discharge delay product requires the use of precise/detailed battery models. One such detailed battery model has been recently proposed, which can predict the remaining capacity of a rechargeable lithium-ion battery in terms of several critical factors, viz., discharge rate (current), battery output voltage, battery temperature, and cycle age (i.e., the number of times a battery has been charged and discharged) [39].

¹ For an "ideal" battery, that is, a battery whose capacity is independent of the way current is drawn, $\alpha = 0$, while for a real battery $\alpha$ may be as high as 0.7 [37].
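The rate-capacity effect is easy to see numerically. A minimal Python sketch of Peukert's formula (parameter values are illustrative only):

    def peukert_capacity(k, current, alpha):
        # Deliverable capacity k / I^alpha; alpha = 0 models an ideal battery.
        return k / current ** alpha

    def lifetime_hours(k, current, alpha):
        # Duration of service = deliverable capacity / discharge current.
        return peukert_capacity(k, current, alpha) / current

    # Doubling the draw from 0.5 A to 1.0 A with alpha = 0.7 cuts the
    # lifetime by much more than half (about 65 h down to 20 h here):
    print(lifetime_hours(k=20.0, current=0.5, alpha=0.7))
    print(lifetime_hours(k=20.0, current=1.0, alpha=0.7))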
16.3 System/Application Level Optimizations

When designing an embedded system, in particular, a battery powered one, it may be useful to explore different task implementations exhibiting different power/energy versus quality-of-service characteristics, so as to provide the system with the desired functionality while meeting cost, battery lifetime, and other critical requirements. Namely, one may be interested in trading off accuracy for energy savings on a handheld GPS system, or image quality for energy savings on an image decoder, etc. [40–42]. Such application level "tuning/optimizations" may be performed statically (i.e., one may use a single implementation for each task) or at runtime, under the control of a system level power manager. For example, if the battery level drops below a certain threshold, the power manager may drop some services and/or swap some tasks to less power hungry ("lower quality") software versions, so that the system can remain operational for a specified window of additional time. An interesting example of such dynamic power management was implemented for the Mars Pathfinder, an unmanned space robot that draws power from both a nonrechargeable battery and solar cells [43]. In this case, the power manager tracks the power available from the solar cells, ensuring that most of the robot's active work is done during daylight, since the solar energy cannot be stored for later use.

In addition to dropping and/or swapping tasks, a system's dynamic power manager may also shut down or slow down subsystems/modules that are idling or under-utilized. The Advanced Configuration and Power Interface (ACPI) is a widely adopted standard that specifies an interface between an architecture's power managed subsystems/modules (e.g., display drivers, modems, hard disk drivers, processors, etc.) and the system's dynamic power manager. It is assumed that the power managed subsystems have at least two power states (ACTIVE and STANDBY) and that one can dynamically switch between them. Using ACPI, one can implement virtually any power management policy, including fixed time-out, predictive shutdown, predictive wake-up, etc. [44]. Since the transition from one state to another may take a substantial amount of time and consume a nonnegligible amount of energy, the selection of a proper power management policy for the various subsystems is critical to achieving good energy savings with a negligible impact on performance. The so-called time-out policy, widely used in modern computers, simply switches a subsystem to STANDBY mode when the elapsed time after the last utilization reaches a given fixed threshold. Predictive shutdown uses the previous history of the subsystem to predict the next
expected idle time and, based on that, decides whether or not the subsystem should be shut down. More sophisticated policies, using stochastic methods [40–42,45], can also be implemented, yet they are more complex, and thus the power consumed in running the associated dynamic power management algorithms may render them inadequate or inefficient for certain classes of systems. The system-level techniques discussed here act on each subsystem as a whole. Although the effectiveness of such techniques has been demonstrated across a wide variety of systems, finer-grained self-monitoring techniques, implemented at the subsystem level, can substantially add to these savings. Such techniques are discussed in the following sections.
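To illustrate the two simplest policies above, here is a minimal sketch of a fixed time-out policy and of a history-based predictive shutdown policy; the device structure, the ACTIVE/STANDBY hooks, and the thresholds are hypothetical placeholders rather than part of the ACPI specification.

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical device handle; a real implementation would drive the
 * ACTIVE/STANDBY transitions through an ACPI-style interface. */
typedef struct {
    uint64_t last_use_ms;   /* time of last utilization            */
    uint64_t prev_idle_ms;  /* length of the previous idle period  */
    bool     standby;
} device_t;

enum { TIMEOUT_MS = 500, BREAKEVEN_MS = 200 };  /* illustrative values */

/* Fixed time-out policy: enter STANDBY once the device has been idle
 * for TIMEOUT_MS. */
void timeout_policy(device_t *d, uint64_t now_ms)
{
    if (!d->standby && now_ms - d->last_use_ms >= TIMEOUT_MS)
        d->standby = true;              /* enter_standby(d) on real HW */
}

/* Simple predictive shutdown: take the previous idle period as the
 * prediction of the next one, and shut down immediately whenever the
 * predicted idle time exceeds the state-transition break-even time. */
void predictive_policy(device_t *d)
{
    if (!d->standby && d->prev_idle_ms >= BREAKEVEN_MS)
        d->standby = true;
}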
16.4 Energy Efficient Processing Subsystems

As mentioned earlier, hardware/software codesign methodologies partition the functionality of an embedded system into hardware and software components. Software components, executing on programmable microcontrollers or processors (either general purpose or application specific), are the preferred solution, since their use can substantially reduce design and manufacturing costs, as well as shorten time-to-market. Custom hardware components are typically used only when strictly necessary, namely, when an embedded system’s power budget and/or performance constraints preclude the use of software. Accordingly, a large number of power aware processor families are available in today’s market, providing a vast gamut of alternatives suiting the requirements/needs of most embedded systems [46,47]. In the sequel we discuss power aware features available in contemporary processors, as well as several power related issues relevant to processor core selection.
16.4.1 Voltage and Frequency Scaling

Dynamic power consumption in a processor (be it general purpose or application specific) can be decreased by reducing two of its key contributors, viz., supply voltage and clock frequency. In fact, since the power dissipated in a CMOS circuit is proportional to the square of the supply voltage, the most effective way to reduce power is to scale down the supply voltage. Note, however, that the propagation delay across a CMOS transistor is proportional to VDD/(VDD − VT)², where VDD is the supply voltage and VT is the threshold voltage. So, unfortunately, as the supply voltage decreases, the propagation delay increases, and the clock frequency (i.e., speed) may need to be decreased accordingly [48]. Accordingly, many contemporary processor families, such as Intel’s XScale [49], IBM’s PowerPC 405LP [50], and Transmeta’s Crusoe [51], offer dynamic voltage and frequency scaling features. For example, the Intel 80200 processor, which belongs to the XScale family mentioned earlier, supports a software programmable clock frequency. Specifically, the voltage can be varied from 1.0 to 1.5 V, in small increments, with the frequency varying correspondingly from 200 to 733 MHz, in steps of 33/66 MHz. The simplest way to take advantage of the scaling features discussed above is to carefully identify the “smallest” supply voltage (and corresponding operating frequency) that guarantees that the target embedded application meets its timing constraints, and then run the processor at that fixed setting. If the workload is reasonably constant throughout execution, this simple scheme may suffice. However, if the workload varies substantially during execution, more sophisticated techniques that dynamically adapt the processor’s voltage and frequency to the varying workload can deliver more substantial power savings. Naturally, when designing such techniques it is critical to consider the cost of transitioning from one setting to another, that is, the delay and power consumption overheads incurred by each transition. For example, for the Intel 80200 processor, changing the processor frequency could take up to 20 µsec, while changing the voltage could take up to 1 msec [49]. Most processors developed for the mobile/portable market already support some form of built-in mechanism for voltage/frequency scaling. Intel’s SpeedStep technology, for example, detects if the system is currently plugged into a power outlet or running on a battery, and based on that, either runs the
FIGURE 16.2 Power consumption without and with dynamic voltage and frequency scaling. (Two power versus time plots: at high voltage and high frequency the task completes well before its deadline, leaving idle time; at low voltage and low frequency, execution is stretched to just meet the deadline at a lower power level.)
processor at the highest voltage/frequency or switches it to a less power hungry mode. Transmeta’s Crusoe processors offer a power manager called LongRun [51], which is implemented in the processor’s firmware. LongRun relies on the historical utilization of the processor to guide clock rate selection: it increases the processor’s clock frequency if the current utilization is high and decreases it if the utilization is low. More sophisticated/aggressive dynamic scaling techniques should vary the core’s voltage/frequency based on some predictive strategy, while carefully monitoring performance, so as to ensure that it does not drop below a certain threshold and/or that task deadlines are not consistently missed [52–55] (see Figure 16.2). Such voltage/frequency scaling techniques must necessarily rely on adequate dynamic workload prediction and performance metrics, and thus require the direct intervention of the operating system and/or of the applications themselves. Although more complex than the simple schemes discussed above, several such predictive techniques have been shown to deliver substantial gains for applications with well-defined task deadlines, for example, hard/soft real-time systems. Simple interval-based prediction schemes take the amount of idle time in the previous interval as a measure of the processor’s utilization for the next time interval, and use that prediction to decide on the voltage/frequency settings to be used throughout its duration. Naturally, many critical issues must be factored in when defining the duration of such an interval (or “prediction window”), including the overhead costs associated with switching voltage/frequency settings. While a prediction scheme based on a single interval may deliver substantial power gains with marginal loss in performance [52], looking exclusively at a single interval may not suffice for many applications; namely, the processor may end up oscillating inefficiently between different voltage/frequency settings [53]. Simple smoothing techniques, for example, using an exponentially moving average of previous intervals, can be employed to mitigate the problem [53,56]. Finally, note that the benefits of dynamic voltage and frequency scaling are not limited to reducing dynamic power consumption. When the voltage/frequency of a battery operated system is lowered, the instantaneous current drawn by the processor decreases accordingly, leading to a more effective utilization of the battery capacity, and thus to increased duration of service.
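The following is a minimal sketch of such an interval-based predictor with exponential smoothing, in the spirit of [52,53,56]; the operating points (loosely modeled on the 200 to 733 MHz XScale range mentioned above) and the fixed-point smoothing constant are illustrative assumptions.

#include <stdint.h>

/* Hypothetical operating points; a real table would also carry the
 * matching supply voltages. */
static const uint32_t freq_mhz[] = { 200, 333, 466, 600, 733 };
enum { NLEVELS = 5 };

/* Exponentially weighted moving average of per-interval utilization in
 * fixed point (0..256 maps to 0..100%); SMOOTH/256 is the weight of the
 * newest sample, damping the oscillation between settings noted above. */
enum { SMOOTH = 64 };
static uint32_t util_avg = 128;

/* Called once per prediction window with the measured busy fraction
 * (0..256) of the interval that just ended; returns the frequency to
 * use during the next interval. */
uint32_t next_frequency(uint32_t util_sample)
{
    util_avg = (SMOOTH * util_sample + (256 - SMOOTH) * util_avg) / 256;

    /* Map smoothed utilization onto an operating point: run just fast
     * enough that the predicted work fits in the interval. */
    uint32_t level = (util_avg * NLEVELS) / 257;   /* 0..NLEVELS-1 */
    return freq_mhz[level];
}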
16.4.2 Dynamic Resource Scaling

Dynamic resource scaling refers to exploiting adaptive, fine-grained hardware resource reconfiguration techniques in order to improve power efficiency. Dynamic resource scaling requires enhancing the microarchitecture with the ability to selectively “disable” components, fully or partially, through either clock gating or VDD gating.² The effectiveness of dynamic resource scaling is predicated on the fact that many applications have variable workloads, that is, have execution phases with substantial Instruction-Level Parallelism (ILP) and other phases with much less inherent parallelism. Thus, by dynamically “scaling down” micro-architecture components during such “low activity” periods, substantial power savings can potentially be achieved.

² Note that, in contrast to clock gating, all state information is lost when a circuit is VDD gated.
Techniques for reducing static power dissipation on processor cores can directly exploit resource scaling features, once they become widely available on processors. Namely, under-utilized or idling resources can be partially or fully VDD gated, thus reducing leakage power. Several utilization-driven techniques have already been proposed that can selectively shut down functional units, segments of register files, and other datapath components when such conditions arise [57–60]. Naturally, the delay overhead incurred by power supply gating should be carefully factored into these techniques, so that the corresponding static energy savings are achieved with only a small degradation in performance. Dynamic energy consumption can also be reduced by dynamically “scaling down” power hungry micro-architecture components, for example, reducing the size of the issue window and/or reducing the effective width of the pipeline, during periods of low “activity” (say, when low ILP code is being executed). Moreover, if one is able to scale down those micro-architecture components that define the critical path delay of the machine’s pipeline (typically located on the rename and window access stages), additional opportunities for voltage scaling, and thus dynamic power savings, can be created. This is thus a very promising area, still undergoing intensive research [61].
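As an illustration of such utilization-driven scaling, the following simulator-style sketch resizes a partitioned issue queue based on sampled IPC, in the spirit of [59,60]; the partition sizes and IPC thresholds are purely hypothetical.

/* Sample committed IPC each epoch and enable/disable 8-entry
 * issue-queue partitions, trading peak ILP extraction for dynamic and
 * leakage power.  All thresholds are purely hypothetical. */
enum { PARTITIONS = 4, ENTRIES_PER_PARTITION = 8 };

static int active_partitions = PARTITIONS;

void scale_issue_queue(double sampled_ipc)
{
    if (sampled_ipc < 0.8 && active_partitions > 1)
        active_partitions--;      /* low-ILP phase: shrink the window */
    else if (sampled_ipc > 1.6 && active_partitions < PARTITIONS)
        active_partitions++;      /* ILP picked up: grow it back      */
    /* Effective window size = active_partitions * ENTRIES_PER_PARTITION;
     * disabled partitions would be clock or VDD gated in hardware.    */
}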
16.4.3 Processor Core Selection

The power aware techniques discussed so far can be broadly applied to general purpose as well as application specific processors. Naturally, the selection of the processor core(s) to be instantiated in the architecture’s processing subsystem is also likely to have a substantial impact on overall power consumption, particularly when computation intensive embedded systems are considered. A plethora of programmable processing elements, including microcontrollers, general-purpose processors, DSPs, and ASIPs, addressing the specific needs of virtually every segment of the embedded systems’ market, is currently offered by vendors. As alluded to above, for computation intensive embedded systems with moderately high to stringent timing constraints, the selection of processor cores is particularly critical, since high performance usually signifies high levels of power dissipation. For these systems, ASIPs and DSPs have the potential to be substantially more energy efficient than their general-purpose counterparts, yet their “specialized” nature poses significant compilation challenges³ [62]. In contrast, general-purpose processors are easier to compile to, being typically shipped with good optimizing compilers, as well as debuggers and other development tools. Unfortunately, their “generality” incurs a substantial power overhead. In particular, high-performance general-purpose processors require the use of power hungry hardware assists, including reservation stations, reorder buffers and rename logic, and complex branch prediction logic to alleviate control stalls. Still, their flexibility and high quality development tools are very attractive for systems with stringent time-to-market constraints, making them definitely relevant for embedded systems. IBM/Motorola’s PowerPC family and the ARM family are examples of general-purpose processors enhanced with power aware features that are widely used in modern embedded systems. A plethora of specialized/customizable processors is also offered by several vendors, including specialized media cores from Philips, Trimedia, MIPS, etc., DSP cores offered by Texas Instruments, StarCore, and Motorola, and customizable cores from Hewlett-Packard-STMicroelectronics and Tensilica. The Instruction Set Architectures (ISAs) and features offered on these specialized processors can vary substantially, since they are designed and optimized for different classes of applications. However, several of them, including TI’s TMS320C6x family, HP-STS’s Lx, the StarCore and Trimedia families, and Philips’ Nexperia, use a “Very Long Instruction Word” (VLIW) [63,64] or “Explicitly Parallel Instruction Computing” (EPIC) [65] paradigm. One of the key differences between VLIW and superscalar architectures is that VLIW machines rely on the compiler to extract ILP, and then schedule and bind instructions to functional units statically, while their high performance superscalar counterparts use dedicated (power hungry) hardware to perform runtime dependence checking, instruction reordering, etc. Thus, in broad
terms, VLIW machines eliminate power hungry micro-architecture components by moving the corresponding functionality to the compiler.⁴ Moreover, wide VLIW machines are generally organized as a set of small clusters with local register files.⁵ Thus, in contrast to traditional superscalar machines, which rely on power hungry multiported monolithic register files, multicluster VLIW machines scale better with increasing issue widths, that is, dissipate less dynamic power and can work at faster clock rates. Yet, they are harder to compile to [66–72]. In summary, the VLIW paradigm works very well for many classes of embedded applications, as attested to by the large number of VLIW processors currently available in the market. However, it poses substantial compilation challenges, some of which are still undergoing active research [8,73,74]. Fortunately, many processing intensive embedded applications have only a few time critical loop kernels. Thus, only a very small percentage of the overall code needs to be actually subjected to the complex, time consuming compiler optimizations required by VLIW machines. In the case of media applications, for example, such loop kernels may represent as little as 3% of the overall program, and yet take up to 95% of the execution time [75]. When the performance requirements of an embedded system are extremely high, using a dedicated coprocessor aimed at accelerating the execution of time-critical kernels (under the control of a host computer) may be the only feasible solution. Imagine, one such programmable coprocessor, was designed to accelerate the execution of kernels of streaming media applications [76]. It can deliver up to 20 GFLOPS at a relatively low power cost (2 GFLOPS/W), yet such power efficiency does not come without a cost [76]. As expected, Imagine has a complex programming paradigm, which requires extracting all time-critical kernels from the target application and carefully reprogramming them using Imagine’s “stream-oriented” coding style, so that both data and instructions can be efficiently routed from the host to the accelerator. Programming Imagine thus requires a substantial effort, yet its power efficiency makes it very attractive for systems that demand such high levels of performance. Finally, at the highest end of the performance spectrum, one may need to consider using fully customized hardware accelerators. Comprehensive methodologies to design such accelerators are discussed in detail in Reference 77.

³ Several DSP/ASIP vendors provide preoptimized assembly libraries to mitigate this problem.
⁴ Since functional unit binding decisions are made by the compiler, VLIW code is larger than RISC/superscalar code. We will discuss techniques to address this problem later in our survey.
⁵ A cluster is a set of functional units connected to a local register file. Clusters communicate with each other through a dedicated interconnection network.
16.5 Energy Efficient Memory Subsystems

While processor speeds have been increasing at a very fast rate (∼60% a year), memory performance has increased at a comparatively modest rate (∼7% a year), leading to the well-known “processor-memory performance gap” [31]. In order to alleviate the memory access latency problem, modern processor designs use increasingly large on-chip caches, with up to 60% of the transistors dedicated to on-chip memory and support circuitry [31]. As a consequence, power dissipation in the memory subsystem contributes a substantial fraction of the energy consumed by modern processors. A study targeting the StrongARM SA-110, a low-power processor widely used in embedded systems, revealed that more than 40% of the processor’s power budget is taken up by on-chip data and instruction caches [78]. For high-performance general-purpose processors, this percentage is even higher, with up to 90% of the power budget consumed by memory elements and circuits aimed at alleviating the aforementioned memory bottleneck [31]. Power aware memory designs have thus received considerable attention in recent years.
16.5.1 Cache Hierarchy Tuning

The energy cost of accessing data/instructions from off-chip memories can be as much as two orders of magnitude higher than that of an access to on-chip memory [79]. By retaining instructions and data with high spatial and/or temporal locality on-chip, caches can substantially reduce the number of costly off-chip data transfers, thus leading to potentially quite substantial energy savings.
The “one-size-fits-all” nature of the general-purpose domain dictates that one should use large caches with a high degree of associativity, so as to try to ensure high hit rates (and thus low average memory access latencies) for as many applications as possible. Unfortunately, as one increases cache size and associativity, larger circuits and/or more circuits are activated on each access to the cache, leading to a corresponding increase in dynamic energy consumption. Clearly, in the context of embedded systems, one can do much better. Specifically, by carefully tuning/scaling the configuration of the cache hierarchy, so that it more efficiently matches the bandwidth requirements and access patterns of the target embedded application, one can essentially achieve memory access latencies similar to those delivered by “larger” (general purpose) memory subsystems, while substantially decreasing the average energy cost of such accesses. Since several of the cache hierarchy parameters exhibit conflicting trends, an aggressive design space exploration over many candidate cache configurations is typically required in order to properly tune the cache hierarchy of an embedded system. Figure 16.3 summarizes the results of one such design space exploration performed for a media application. Namely, the graph in Figure 16.3 plots the energy-delay product metric for a wide range of L1 and L2 on-chip D-cache configurations. The delay term is given by the number of cycles taken to process a representative data set from start to completion. The energy term accounts for the energy spent on data memory accesses for the particular execution trace.⁶ As can be seen, the design space is very complex, reflecting the conflicting trends alluded to above. For this case study, the best cache hierarchy configurations exhibit an energy-delay product that is about one order of magnitude better (i.e., lower) than that of the worst configurations. Perhaps even more interesting is that some of the worst memory subsystem configurations use quite large caches with
FIGURE 16.3 Design space exploration: energy-delay product for various L1 and L2 D-cache configurations for a JPEG application running on an XScale-like processor core. (Energy × delay, in mJ × cycles × 100,000, plotted over L1 configurations of 4 KB/16 KB/32 KB with 32 B lines, 1-way/2-way associativity, and 1 K/4 K/16 K/32 K kill windows, and L2 configurations with 64 B/128 B/256 B lines, 1-way/2-way associativity, and 8 K/16 K/32 K/64 K kill windows.)

⁶ Specifically, the energy term accounts for accesses to the L1 and L2 on-chip D-caches, and to main memory.
a high degree of associativity, clearly indicating that no substantial performance gains would be achieved (for this particular media application) by using such aggressively dimensioned memory subsystems. For many embedded systems/applications, the power efficiency of the memory subsystem can be improved even more aggressively, yet this requires the use of novel (nonstandard) memory system designs, as discussed in the sections that follow.
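The exploration loop itself is straightforward, as the following sketch suggests; here simulate_cycles() and simulate_energy_mj() are simple stand-ins for a trace-driven cache simulator (e.g., in the style of Dinero [35]) combined with an energy model (e.g., in the style of CACTI [32,34]), and the candidate sizes are illustrative.

#include <stdio.h>
#include <stdint.h>

typedef struct { uint32_t l1_kb, l1_ways, l2_kb, l2_ways; } cache_cfg_t;

/* Stand-in models; a real flow would replace these with per-configuration
 * simulation results for the target application's trace. */
static double simulate_cycles(const cache_cfg_t *c)
{
    return 1.0e9 / (c->l1_kb * c->l1_ways + 0.1 * (double)c->l2_kb * c->l2_ways);
}
static double simulate_energy_mj(const cache_cfg_t *c)
{
    return 0.010 * c->l1_kb * c->l1_ways + 0.001 * c->l2_kb * c->l2_ways;
}

int main(void)
{
    static const uint32_t l1_sizes[] = { 4, 16, 32 };     /* KB, illustrative */
    static const uint32_t l2_sizes[] = { 128, 256, 512 }; /* KB, illustrative */
    cache_cfg_t best = { 0, 0, 0, 0 };
    double best_edp = -1.0;

    /* Exhaustive sweep, keeping the lowest energy-delay product. */
    for (int i = 0; i < 3; i++)
        for (uint32_t w1 = 1; w1 <= 2; w1++)
            for (int j = 0; j < 3; j++)
                for (uint32_t w2 = 1; w2 <= 2; w2++) {
                    cache_cfg_t c = { l1_sizes[i], w1, l2_sizes[j], w2 };
                    double edp = simulate_energy_mj(&c) * simulate_cycles(&c);
                    if (best_edp < 0.0 || edp < best_edp) {
                        best_edp = edp;
                        best = c;
                    }
                }

    printf("best: L1 %u KB %u-way, L2 %u KB %u-way (EDP %.3g)\n",
           best.l1_kb, best.l1_ways, best.l2_kb, best.l2_ways, best_edp);
    return 0;
}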
16.5.2 Novel Horizontal and Vertical Cache Partitioning Schemes

In recent years, several novel cache designs have been proposed to aggressively reduce the average dynamic energy consumption incurred by memory accesses. Energy efficiency is improved in these designs by taking direct advantage of specific characteristics of target classes of applications. The memory footprint of instructions and data in media applications, for example, tends to be very small, thus creating unique opportunities for energy savings [75]. Since streaming media applications are pervasive in today’s portable electronics market, they have been a preferred application domain for validating the effectiveness of such novel cache designs. Vertical partition schemes [80–83], as the name suggests, introduce additional small buffers/caches before the first level of the “traditional” memory hierarchy. For applications with “small” working sets, this strategy can lead to considerable dynamic power savings. A concrete example of a vertical partition scheme is the filter cache [84], a very small cache placed in front of the standard L1 data cache. If the filter cache is properly dimensioned, dynamic energy consumption in the memory hierarchy can be substantially reduced, not only by accessing most of the data from the filter cache, but also by powering down (clock gating) the L1 D-cache to a STANDBY mode during periods of inactivity [84]. Although switching the L1 D-cache to STANDBY mode results in delay/energy penalties when there is a miss in the filter cache, it was observed that for media applications the energy-delay product did improve quite significantly when the two techniques were combined. Predecoded instruction buffers [85] and loop buffers [86] are variants of the vertical partitioning scheme discussed above, applied to instruction caches (I-caches). The key idea of the first scheme is to store recently used instructions in an instruction buffer, in decoded form, so as to reduce the average dynamic power spent on fetching and decoding instructions. The second scheme allows one to hold time-critical loop bodies (identified a priori by the compiler or by the programmer) in small, and thus energy efficient, dedicated loop buffers. Horizontal partition schemes refer to the placement of additional (small) buffers or caches at the same level as the L1 cache. For each memory reference, the appropriate (level one) cache to be accessed is determined by dedicated decoding circuitry residing between the processor core and the memory hierarchy. Naturally, the method used to partition data across the set of first level caches should ensure that the cache selection logic is simple, so that cache access times are not significantly affected. Region-based caches implement one such “horizontal partitioning” scheme, by adding two small 2 KB L1 D-caches to the first level of the memory hierarchy, one for stack and one for global data. This arrangement has also been shown to achieve substantial gains in dynamic energy consumption for streaming media applications with a negligible impact on performance [87].
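The following simulator-style sketch captures the essence of the filter cache access path described above; the sizes and the per-access energy constants are illustrative only, not taken from [84].

#include <stdbool.h>
#include <stdint.h>

/* Tiny direct-mapped filter cache in front of L1 (16 lines x 32 B). */
enum { FC_LINES = 16, FC_LINE_BITS = 5 };

static uint32_t fc_tags[FC_LINES];
static bool     fc_valid[FC_LINES];

static double energy_nj = 0.0;   /* accumulated data-access energy */

/* Returns true on a filter-cache hit; on a miss, the (clock-gated)
 * L1 D-cache is woken up and the line is refilled, paying both the
 * wake-up penalty and the larger per-access L1 energy. */
bool access_data(uint32_t addr)
{
    uint32_t tag = addr >> FC_LINE_BITS;
    uint32_t idx = tag % FC_LINES;

    energy_nj += 0.1;                  /* cheap filter-cache access   */
    if (fc_valid[idx] && fc_tags[idx] == tag)
        return true;

    energy_nj += 1.0;                  /* wake and access L1 on a miss */
    fc_tags[idx] = tag;
    fc_valid[idx] = true;
    return false;
}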
16.5.3 Dynamic Scaling of Memory Elements

With an increasing number of on-chip transistors being devoted to storage elements in modern processors, of which only a very small set is active at any point in time, static power dissipation is expected to soon become a key contributor to a processor’s power budget. State-of-the-art techniques to reduce static power consumption in on-chip memories are based on the simple observation that, in general, data or instructions fetched into a given cache line have an immediate flurry of accesses during a “small” interval of time, followed by a relatively “long” period of time where they are not used, before eventually being evicted to make way for new data/instructions [88,89]. If one can “guess” when that period starts, it is
possible to “switch off” (i.e., VDD gate) the corresponding cache lines without introducing extra cache misses, thereby saving static energy with no impact on performance [90,91]. Cache Decay was one of the earliest attempts to exploit such “generational” memory usage behavior to decrease leakage power [90]. The original Cache Decay implementation used a simple policy that turned off cache lines after a fixed number of cycles (the decay interval) had elapsed since the last access. Note that if the selected decay interval happens to be too small, cache lines are switched off prematurely, causing extra cache misses, and if it is too large, opportunities for saving leakage energy are missed. Thus, when such a simple scheme is used, it is critical to tune the “fixed” decay interval very carefully, so that it adequately matches the access patterns of the embedded application of interest. Adaptive strategies, varying the decay interval at runtime so as to dynamically adjust it to the changing access patterns, have been proposed more recently, enabling the use of the cache decay principle across a wider range of applications [90,91]. Similar leakage energy reduction techniques have also been proposed for issue queues [59,60,92] and branch prediction tables [93]. Naturally, leakage energy reduction techniques for instruction/program caches are also very critical [94]. A technique has been recently proposed that monitors the performance of the instruction cache over time, and dynamically scales (via VDD gating) its size, so as to closely match the size of the working set of the application [94].
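A minimal sketch of the fixed-interval policy is given below; the line count, tick granularity, and decay interval are illustrative and, as noted above, would need careful per-application tuning.

#include <stdbool.h>
#include <stdint.h>

/* Fixed-interval cache decay in the spirit of [90]: each line carries a
 * small counter, periodically incremented; a line that reaches the decay
 * interval without being accessed is VDD gated. */
enum { NLINES = 512, DECAY_TICKS = 8 };  /* e.g., 8 coarse-grained ticks */

typedef struct { uint8_t idle_ticks; bool powered; } decay_line_t;
static decay_line_t lines[NLINES];

void on_access(uint32_t line)            /* called on every cache hit   */
{
    lines[line].idle_ticks = 0;
    lines[line].powered = true;          /* re-enable if it was gated   */
}

void on_decay_tick(void)                 /* called once per decay epoch */
{
    for (uint32_t i = 0; i < NLINES; i++)
        if (lines[i].powered && ++lines[i].idle_ticks >= DECAY_TICKS)
            lines[i].powered = false;    /* gate VDD: line state is lost */
}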
16.5.4 Software-Controlled Memories, Scratch-Pad Memories

Most of the novel designs and/or techniques discussed so far require an application-driven tuning of several architecturally visible parameters. However, similar to more “traditional” cache hierarchies, the memory subsystem interface implemented on these novel designs still exposes a flat view of the memory hierarchy to the compiler/software. That is, the underlying details of the memory subsystem architecture are essentially transparent to both. Dynamic power dissipation incurred by accesses to basic memory modules occurs owing to switching activity in bit lines, word lines, and input and output lines. Traditional caches have additional switching overheads, owing to the circuitry (comparators, multiplexers, tags, etc.) needed to provide the “flat” memory interface alluded to above. Since the hardware assists necessary to support such a transparent view of the memory hierarchy are quite power hungry, additional energy saving opportunities can be created by relying more on the compiler (and less on dedicated hardware) to manage the memory subsystem. The use of software-controlled (rather than hardware-controlled) memory components is thus becoming increasingly prevalent in power aware embedded system design. Scratch-Pads are an example of such novel, software-controlled memories [95–100]. Scratch-Pads are essentially on-chip partitions of main memory directly managed by the compiler. Namely, decisions concerning data/instruction placement in on-chip Scratch-Pads are made statically by the compiler, rather than dynamically by dedicated hardware circuitry. Therefore, these memories are much less complex, and thus less power hungry, than traditional caches. As one would expect, the ability to aggressively improve energy-delay efficiency through the use of Scratch-Pads is predicated on the quality of the decisions made by the compiler on the subset of data/instructions that are to be assigned to that limited memory space [95,101]. Several compiler-driven techniques have been proposed to identify the data/instructions that can be assigned to the Scratch-Pad most profitably, with frequency of use being one of the key selection criteria [102–104]. The Cool Cache architecture [105], also proposed for media applications, is a good example of a novel, power aware memory subsystem that relies on the use of software-controlled memories. It uses a small Scratch-Pad and a “software-controlled cache,” each of which is implemented on a different on-chip SRAM. The program’s scalars are mapped to the small (2 KB) Scratch-Pad [100].⁷ Nonscalar data is mapped to the software-controlled cache, and the compiler is responsible for translating virtual addresses to SRAM lines, using a small register lookup area. Even though cache misses are handled in software,
thereby incurring substantial latency/energy penalties, the overall architecture has been shown to yield substantial energy-delay product improvements for media applications, when compared to traditional cache hierarchies [105]. The effectiveness of techniques such as the above is so pronounced that several embedded processors currently offer a variety of software-controlled memory blocks, including configurable Scratch-Pads (TI’s 320C6x [106]), lockable caches (Intel’s XScale [49] and Trimedia [107]), and stream buffers (Intel’s StrongARM [49]).

⁷ This size was found to be sufficient for most media applications.
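As a concrete, deliberately simplified illustration of such compiler-driven selection, the sketch below greedily assigns data objects to a Scratch-Pad by access count per byte — one plausible reading of the frequency-of-use criterion in [102–104], not the actual algorithm of any of those works; the data structures and capacity are assumptions.

#include <stdint.h>
#include <stdlib.h>

/* One candidate data object, as seen by the compiler after profiling. */
typedef struct {
    uint32_t size;      /* bytes                          */
    uint32_t accesses;  /* profiled access count          */
    int      in_spm;    /* 1 if placed in the Scratch-Pad */
} obj_t;

/* Sort by access density (accesses per byte), descending. */
static int by_density(const void *a, const void *b)
{
    const obj_t *x = a, *y = b;
    double dx = (double)x->accesses / x->size;
    double dy = (double)y->accesses / y->size;
    return (dy > dx) - (dy < dx);
}

/* Greedily fill the Scratch-Pad with the "hottest" objects; everything
 * else stays in cached main memory. */
void allocate_spm(obj_t *objs, int n, uint32_t spm_bytes)
{
    uint32_t used = 0;
    qsort(objs, n, sizeof *objs, by_density);
    for (int i = 0; i < n; i++) {
        objs[i].in_spm = (used + objs[i].size <= spm_bytes);
        if (objs[i].in_spm)
            used += objs[i].size;
    }
}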
16.5.5 Improving Access Patterns to Off-Chip Memory

During the last few decades, there has been a substantial effort in the compiler domain aimed at minimizing the number of off-chip memory accesses incurred by optimized code, as well as at enabling the implementation of aggressive prefetching strategies. This includes devising compiler techniques to restructure, reorganize, and lay out data in off-chip memory, as well as techniques to properly reorder a program’s memory access patterns [108–112]. Prefetching techniques have received considerable attention lately, particularly in the domain of embedded streaming media applications. Instruction and data prefetching techniques can be hardware- or software-driven [113–120]. Hardware-based data prefetching techniques try to dynamically predict when a given piece of data will be needed, so as to load it into cache (or into some dedicated on-chip buffer) before it is actually referenced by the application (i.e., explicitly required by a demand access) [114–116]. In contrast, software-based data prefetching techniques work by inserting prefetch instructions for selected data references at carefully chosen points in the program; such explicit prefetch instructions are executed by the processor to move data into cache [117–120]. It has been extensively demonstrated that, when properly used, prefetching techniques can substantially improve average memory access latencies [113–120]. Moreover, techniques that prefetch substantial chunks of data (rather than, say, a single cache line), possibly into a dedicated buffer, can also simultaneously decrease dynamic power dissipation [121]. Namely, when data is brought from off-chip memory in large bursts, energy-efficient burst/page access modes can be exploited more effectively. Moreover, by prefetching large quantities of instructions/data, the average length of DRAM idle times is expected to increase, thus creating more profitable opportunities for the DRAM to be switched to a lower power mode [122–124]. Naturally, it is important to ensure that the overhead associated with the prefetching mechanism itself, as well as potential increases in static energy consumption owing to additional storage requirements, do not outweigh the benefits achieved from enabling more energy-efficient off-chip accesses [124].
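A minimal example of the software-driven approach is shown below for a streaming loop; __builtin_prefetch is a GCC/Clang intrinsic, and the prefetch distance is an application-dependent tuning parameter, not a universal constant.

/* Streaming loop with explicit software prefetching.  The third
 * argument of __builtin_prefetch (0 here) marks the data as having no
 * temporal locality, matching the streaming access pattern. */
enum { PF_DIST = 16 };   /* elements ahead; tune per application */

long sum_stream(const int *a, long n)
{
    long sum = 0;
    for (long i = 0; i < n; i++) {
        if (i + PF_DIST < n)
            __builtin_prefetch(&a[i + PF_DIST], 0, 0);  /* read, no reuse */
        sum += a[i];
    }
    return sum;
}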
16.5.6 Special Purpose Memory Subsystems for Media Streaming

As alluded to before, streaming media applications have been a preferred application domain for validating the effectiveness of many novel, power aware memory designs. Although the compiler is consistently given a more preeminent role in the management of these novel memory subsystems, they require no fundamental changes to the adopted programming paradigm. Additional opportunities for energy savings can be unlocked by adopting a programming paradigm that directly exposes those elements of an application that should be considered by an optimizing compiler during performance versus power trade-off exploration. The two special purpose memory subsystems discussed below do precisely that, in the context of streaming media applications. Xtream-Fit is a special purpose data memory subsystem targeted to generic uni-processor embedded system platforms executing media applications [124]. Xtream-Fit’s on-chip memory consists of a Scratch-Pad, to hold constants and scalars, and a novel software-controlled streaming memory, partitioned into regions, each of which holds one of the input or output streams used/produced by the target application. The use of software-controlled memories by Xtream-Fit ensures that dynamic energy consumption is low, while the region-based organization of the streaming memory enables the implementation of very simple
and yet effective shutdown policies to turn off different memory regions, as the data they hold become “dead.” Xtream-Fit’s programming model is actually quite simple, requiring only a minor “reprogramming” effort. It simply requires organizing/partitioning the application code into a small set of processing and data transfer tasks. Data transfer tasks prefetch streaming media data (the amount required by the next set of processing tasks) into the streaming memory. The amount of prefetched data is explicitly exposed via a single customization parameter. By varying this single parameter, the compiler can thus aggressively minimize the energy-delay product, considering both dynamic and leakage power dissipated in on-chip and off-chip memories [124]. While Xtream-Fit provides sufficient memory bandwidth for generic uni-processor embedded media architectures, it cannot support the very high bandwidth requirements of high-performance media accelerators. For example, Imagine, the multicluster media accelerator alluded to previously, uses its own specialized memory hierarchy, consisting of a streaming memory, a 128 KB stream register file, and stream buffers and register files local to each of its eight clusters. Imagine’s memory subsystem delivers a very high bandwidth (2.1 GB/sec) with very high energy-efficiency, yet it requires the use of a specialized programming paradigm. Namely, data transfers to/from the host are controlled by a stream controller, and those between the stream register file and the functional units by a microcontroller, both of which have to be programmed separately, using Imagine’s own “stream-oriented” programming style [125]. Systems that demand still higher performance and/or energy-efficiency may require memory architectures fully customized to the target application. Comprehensive methodologies for designing high-performance memory architectures for custom hardware accelerators are discussed in detail in References 36 and 126.
16.5.7 Code Compression

Code size affects both program storage requirements and off-chip memory bandwidth requirements, and can thus have a first order impact on the overall power consumed by an embedded system. Instruction compression schemes decrease both requirements by storing frequently fetched/executed instruction sequences in main memory (i.e., off-chip) in an encoded/compressed form [127–129]. Naturally, when one such scheme is adopted, it is important to factor in the overhead incurred by the on-chip decoding circuitry, so that it does not outweigh the gains achieved on storage and interconnect elements. Furthermore, different approaches have been considered for storing such selected instruction sequences on-chip, in either compressed or decompressed form. On-chip storage of instructions in compressed form saves on-chip storage, yet instructions must be decoded every time they are executed, adding latency/power overheads. Instruction subsetting is an alternative instruction compression scheme, where instructions that are not commonly used are discarded from the instruction set, thus enabling the “reduced” instruction set to be encoded using fewer bits [130]. The Thumb instruction set is a classic example of a compressed instruction set, featuring the most commonly used 32-bit ARM instructions compressed into a 16-bit wide format. Thumb instructions are decompressed transparently into full 32-bit ARM instructions in real time, with no performance loss.
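The sketch below illustrates one simple dictionary-based decoder of the general kind surveyed in [127–129]; the token format, dictionary size, and escape convention are invented for illustration and do not correspond to any particular published scheme.

#include <stdint.h>

enum { DICT_BITS = 7, DICT_SIZE = 1 << DICT_BITS };  /* 128 common words */
static uint32_t dictionary[DICT_SIZE];  /* filled offline from a profile */

/* Decode one token from the compressed instruction stream.  Token format
 * (hypothetical): top bit 0 -> 7-bit dictionary index; top bit 1 -> the
 * next four bytes hold an uncompressed 32-bit instruction. */
uint32_t decode(const uint8_t **stream)
{
    uint8_t tok = *(*stream)++;
    if ((tok & 0x80u) == 0)
        return dictionary[tok];          /* frequent instruction       */

    uint32_t literal = 0;
    for (int i = 0; i < 4; i++)          /* rare instruction, inlined  */
        literal = (literal << 8) | *(*stream)++;
    return literal;
}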
16.5.8 Interconnect Optimizations

Power dissipation in on- and off-chip interconnect structures is also a significant contributor to an embedded system’s power budget [131]. A shared bus is a commonly used interconnect structure, as it offers a good trade-off between generality/simplicity and performance. Power consumption on the bus can be reduced by decreasing its supply voltage, capacitance, and/or switching activity. Bus splitting, for example, reduces bus capacitance by splitting long bus lines into smaller sections, with one section relaying the data to the next [132]. Power consumption in this approach is reduced at the expense of a small penalty in latency, incurred at each relay point. Bus switching activity, and thus dynamic power dissipation, can also be substantially reduced by using an appropriate bus encoding scheme [133–138].
Bus invert coding [133], for example, is a simple and yet widely used coding scheme. The first step of bus invert coding is to compute the Hamming distance between the current bus value and the previous bus value. If this distance is greater than half the total number of bits, the data value is transmitted in inverted form, with an additional invert bit allowing the data to be correctly interpreted at the other end. Several other encoding schemes have been proposed, achieving lower switching activity at the expense of higher encoding and decoding complexity [134–138]. With the increasing adoption of System-on-Chip design methodologies for embedded systems, devising energy-delay efficient interconnect architectures for such large scale systems is becoming increasingly critical and is still undergoing intensive research [5].
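Bus invert coding is simple enough to state directly in code; the following sketch assumes a 32-bit bus with one extra invert line.

#include <stdint.h>

/* Bus-invert coding [133] for a 32-bit bus: if more than half the lines
 * would toggle relative to the value currently on the bus, drive the
 * inverted word and assert the invert line. */
typedef struct { uint32_t data; int invert; } bus_word_t;

static int popcount32(uint32_t x)
{
    int n = 0;
    while (x) { x &= x - 1; n++; }     /* clear lowest set bit */
    return n;
}

bus_word_t encode(uint32_t value, uint32_t prev_on_bus)
{
    bus_word_t out = { value, 0 };
    if (popcount32(value ^ prev_on_bus) > 16) {  /* Hamming distance > 16 */
        out.data = ~value;
        out.invert = 1;
    }
    return out;
}

uint32_t decode_bus(bus_word_t w)
{
    return w.invert ? ~w.data : w.data;
}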
16.6 Summary

Design methodologies for today’s embedded systems must necessarily treat power consumption as a primary figure of merit. At the system and architecture levels of design abstraction, power aware embedded system design requires the availability of high-fidelity power estimation and simulation frameworks. Such frameworks are essential to enable designers to explore and evaluate, in reasonable time, the complex energy-delay trade-offs realized by different candidate architectures, subsystem realizations, and power management techniques, and thus to quickly identify promising solutions for the target application of interest. The detailed system and architecture level design phases that follow should adequately combine coarse, system level dynamic power management strategies with fine-grained, self-monitoring techniques, exploiting voltage and frequency scaling, as well as advanced dynamic resource scaling and power-driven reconfiguration techniques.
References

[1] G. Micheli, R. Ernst, and W. Wolf, Eds. Readings in Hardware/Software Co-Design. Morgan Kaufmann Publishers, Norwell, MA, 2002. [2] F. Balarin, P. Giusto, A. Jurecska, C. Passerone, E. Sentovich, B. Tabbara, M. Chiodo, H. Hsieh, L. Lavagno, A.L. Sangiovanni-Vincentelli, and K. Suzuki. Hardware–Software Co-Design of Embedded Systems: The POLIS Approach. Kluwer Academic Publishers, Dordrecht, 1997. [3] S. Borkar. Design Challenges of Technology Scaling. IEEE Micro, 19: 23–29, 1999. [4] http://public.itrs.net/ [5] M. Pedram and J.M. Rabaey. Power Aware Design Methodologies. Kluwer Academic Publishers, Dordrecht, 2002. [6] R. Gonzalez, B. Gordon, and M. Horowitz. Supply and Threshold Voltage Scaling for Low Power CMOS. IEEE Journal of Solid-State Circuits, 32(8): 1210–1216, 1997. [7] A.P. Chandrakasan, S. Sheng, and R.W. Brodersen. Low-Power CMOS Digital Design. IEEE Journal of Solid-State Circuits, 27(4): 473–484, 1992. [8] A.A. Jerraya, S. Yoo, N. Wehn, and D. Verkest, Eds. Embedded Software for SoC. Kluwer Academic Publishers, Dordrecht, 2003. [9] A.P. Chandrakasan and R.W. Brodersen. Low Power Digital CMOS Design. Kluwer Academic Publishers, Dordrecht, 1995. [10] T.L. Martin, D.P. Siewiorek, A. Smailagic, M. Bosworth, M. Ettus, and J. Warren. A Case Study of a System-Level Approach to Power-Aware Computing. ACM Transactions on Embedded Computing Systems, Special Issue on Power-Aware Embedded Computing, 2(3): 255–276, 2003. [11] A.S. Vincentelli and G. Martin. A Vision for Embedded Systems: Platform-Based Design and Software Methodology. IEEE Design and Test of Computers, 18(6): 23–33, 2001. [12] J.M. Rabaey and A.S. Vincentelli. System-on-a-Chip — A Platform Perspective. In Keynote Presentation, Korean Semiconductor Conference, 2002. Available at http://bwrc.eecs.berkeley.edu/People/Faculty/jan/presentations/platformdesign.pdf
[13] J.T. Buck, S. Ha, E.A. Lee, and D.G. Messerschmitt. Ptolemy: A Framework for Simulating and Prototyping Heterogeneous Systems. International Journal of Computer Simulation, Special Issue on Simulation Software Development, 4: 155–182, 1994. [14] V. Tiwari, S. Malik, and A. Wolfe. Power Analysis of Embedded Software: A First Step Towards Software Power Minimization. IEEE Transactions on Very Large Scale Integration Systems, 2(4): 437–445, 1994. [15] P.M. Chau and S.R. Powell. Power Dissipation of VLSI Array Processing Systems. Journal of VLSI Signal Processing, 4(2–3): 199–212, 1992. [16] J. Russell and M. Jacome. Software Power Estimation and Optimization for High-Performance 32-bit Embedded Processors. In Proceedings of the International Conference on Computer Design, 1998, pp. 328–333. [17] C. Brandolese, W. Fornaciari, F. Salice, and D. Sciuto. An Instruction-Level Functionality-Based Energy Estimation Model for 32-bits Microprocessors. In Proceedings of the Design Automation Conference, 2000, pp. 346–351. [18] G. Qu, N. Kawabe, K. Usami, and M. Potkonjak. Function-Level Power Estimation Methodology for Microprocessors. In Proceedings of the Design Automation Conference, 2000, pp. 810–813. [19] D.C. Burger and T.M. Austin. The SimpleScalar Tool Set, Version 2.0. Computer Architecture News, 25(3): 13–25, 1997. [20] D. Brooks, V. Tiwari, and M. Martonosi. Wattch: A Framework for Architectural Level Power Analysis and Optimizations. In Proceedings of the International Symposium on Computer Architecture, 2000, pp. 83–94. [21] G. Cai and C.H. Lim. Architectural Level Power/Performance Optimization and Dynamic Power Estimation. In Cool Chips Tutorial, International Symposium on Microarchitecture, 1999. [22] W. Ye, N. Vijaykrishnan, M. Kandemir, and M. J. Irwin. The Design and Use of Simplepower: A Cycle-Accurate Energy Estimation Tool. In Proceedings of the Design Automation Conference, 2000, pp. 340–345. [23] G. Jochens, L. Kruse, E. Schmidt, and W. Nebel. A New Parameterizable Power Macro-Model for Datapath Components. In Proceedings of the Design Automation and Test in Europe, 1999. [24] A. Bogliolo, L. Benini, and G.D. Micheli. Regression-Based RTL Power Modeling. ACM Transactions on Design Automation of Electronic Systems, 5(3): 337–372, 2000. [25] S.A. Theoharis, C.E. Goutis, G. Theodoridis, and D. Soudris. Accurate Data Path Models for RT-Level Power Estimation. In Proceedings of the International Workshop on Power and Timing Modeling, Optimization and Simulation, 1998, pp. 213–222. [26] M. Khellah and M.I. Elmasry. Effective Capacitance Macro-Modelling for Architectural-Level Power Estimation. In Proceedings of the Eighth Great Lakes Symposium on VLSI, 1998, pp. 414–419. [27] Z. Chen, K. Roy, and E.K. Chong. Estimation of Power Dissipation Using a Novel Power Macromodeling Technique. IEEE Transactions on Computer Aided Design of Integrated Circuits and Systems, 19(11): 1363–1369, 2000. [28] R. Melhem and R. Graybill, Eds. Challenges for Architectural Level Power Modeling. In Power Aware Computing. Kluwer Academic Publishers, Dordrecht, 2001. [29] J.A. Butts and G.S. Sohi. A Static Power Model for Architects. In Proceedings of the International Symposium on Microarchitecture, 2000, pp. 191–201. [30] Y. Zhang, D. Parikh, K. Sankaranarayanan, K. Skadron, and M. Stan. HotLeakage: A TemperatureAware Model of Subthreshold and Gate Leakage for Architects. Technical report, Department of Computer Science, University of Virginia, 2003. [31] D. Patterson, T. Anderson, N. Cardwell, R. 
Fromm, K. Keeton, C. Kozyrakis, R. Thomas, and K. Yelick. A Case for Intelligent RAM. IEEE Micro, 17(2): 33–44, 1997. [32] S.J.E. Wilton and N.M. Jouppi. CACTI: An Enhanced Cache Access and Cycle Time Model. Technical report, Digital Equipment Corporation, Western Research Lab, 1996. [33] M. Kamble and K. Ghose. Analytical Energy Dissipation Models For Low Power Caches. In Proceedings of the International Symposium on Low Power Electronics and Design, 1997, pp. 143–148.
[34] G. Reinman and N.M. Jouppi. CACTI 2.0: An Integrated Cache Timing and Power Model. Technical report, Compaq Computer Corporation, Western Research Lab, 2001. [35] J. Edler and M.D. Hill. Dinero IV Trace-Driven Uniprocessor Cache Simulator, 1998. http://www.cs.wisc.edu/~markhill/DineroIV/ [36] F. Catthoor, S. Wuytack, E. DeGreef, F. Balasa, L. Nachtergaele, and A. Vandecappelle. Custom Memory Management Methodology: Exploration of Memory Organization for Embedded Multimedia System Design. Kluwer Academic Publishers, Dordrecht, 1998. [37] T. Martin and D. Siewiorek. A Power Metric for Mobile Systems. In International Symposium on Low Power Electronics and Design, 1996, pp. 37–42. [38] M. Pedram and Q. Wu. Battery-Powered Digital CMOS Design. IEEE Transactions on Very Large Scale Integration Systems, 10: 601–607, 2002. [39] P. Rong and M. Pedram. An Analytical Model for Predicting the Remaining Battery Capacity of Lithium-Ion Batteries. In Proceedings of the Design Automation and Test in Europe, 2003, pp. 11148–11149. [40] M. Srivastava, A. Chandrakasan, and R. Brodersen. Predictive System Shutdown and other Architectural Techniques for Energy Efficient Programmable Computation. IEEE Transactions on Very Large Scale Integration Systems, 4(1): 42–55, 1996. [41] Q. Qiu and M. Pedram. Dynamic Power Management Based on Continuous-Time Markov Decision Processes. In Proceedings of the Design Automation Conference, 1999, pp. 555–561. [42] T. Simunic, L. Benini, P. Glynn, and G. De Micheli. Dynamic Power Management of Portable Systems. In Proceedings of the International Conference on Mobile Computing and Networking, 2000, pp. 11–19. [43] J. Liu, P. Chou, N. Bagherzadeh, and F. Kurdahi. A Constraint-Based Application Model and Scheduling Techniques for Power-Aware Systems. In Proceedings of the International Conference on Hardware/Software Codesign, 2001, pp. 153–158. [44] http://www.acpi.info/ [45] T. Simunic, L. Benini, A. Acquaviva, P. Glynn, and G. De Micheli. Dynamic Voltage Scaling for Portable Systems. In Proceedings of the Design Automation Conference, 2001, pp. 524–529. [46] R. Gonzalez and M. Horowitz. Energy Dissipation in General Purpose Microprocessors. IEEE Journal of Solid-State Circuits, 31(9): 1277–1284, 1996. [47] T.D. Burd and R.W. Brodersen. Processor Design for Portable Systems. Journal of VLSI Signal Processing, 13(2–3): 203–221, 1996. [48] T. Ishihara and H. Yasuura. Voltage Scheduling Problem for Dynamically Variable Voltage Processors. In Proceedings of the International Symposium on Low Power Electronics and Design, 1998, pp. 197–202. [49] http://www.intel.com/ [50] http://www.ibm.com/ [51] http://www.transmeta.com/ [52] M. Weiser, B. Welch, A.J. Demers, and S. Shenker. Scheduling for Reduced CPU Energy. In Proceedings of the Symposium on Operating Systems Design and Implementation, 1994, pp. 13–23. [53] K. Govil, E. Chan, and H. Wasserman. Comparing Algorithm for Dynamic Speed-Setting of a Low-Power CPU. In Proceedings of the International Conference on Mobile Computing and Networking, 1995, pp. 13–25. [54] T. Pering, T. Burd, and R. Brodersen. The Simulation and Evaluation of Dynamic Voltage Scaling Algorithms. In Proceedings of the International Symposium on Low Power Electronics and Design, 1998, pp. 76–81. [55] T. Pering, T. Burd, and R. Brodersen. Voltage Scheduling in the lpARM Microprocessor System. In Proceedings of the International Symposium on Low Power Electronics and Design, 2000, pp. 96–101. [56] K. Flautner, S. Reinhardt, and T. Mudge.
Automatic Performance Setting for Dynamic Voltage Scaling. ACM Journal of Wireless Networks, 8(5): 507–520, 2002.
[57] D. Brooks and M. Martonosi. Value-Based Clock Gating and Operation Packing: Dynamic Strategies for Improving Processor Power and Performance. ACM Transactions on Computer Systems, 18(2): 89–126, 2000. [58] S. Dropsho, V. Kursun, D.H. Albonesi, S. Dwarkadas, and E.G. Friedman. Managing Static Leakage Energy in Microprocessor Functional Units. In Proceedings of the International Symposium on Microarchitecture, 2002, pp. 321–332. [59] D. Ponomarev, G. Kucuk, and K. Ghose. Reducing Power Requirements of Instruction Scheduling Through Dynamic Allocation of Multiple Datapath Resources. In Proceedings of the International Symposium on Microarchitecture, 2001, pp. 90–101. [60] A. Buyuktosunoglu, D. Albonesi, P. Bose, P. Cook, and S. Schuster. Tradeoffs in Power-Efficient Issue Queue Design. In Proceedings of the International Symposium on Low Power Electronics and Design, 2002, pp. 184–189. [61] C.J. Hughes, J. Srinivasan, and S.V. Adve. Saving Energy with Architectural and Frequency Adaptations for Multimedia Applications. In Proceedings of the International Symposium on Microarchitecture, 2001, pp. 250–261. [62] M.F. Jacome and G. de Veciana. Design Challenges for New Application Specific Processors. IEEE Design and Test of Computers, Special Issue on System Design of Embedded Systems, 17(2): 50–60, 2000. [63] R.P. Colwell, R.P. Nix, J.J. O'Donnell, D.B. Papworth, and P.K. Rodman. A VLIW Architecture for a Trace Scheduling Compiler. IEEE Transactions on Computers, 37(8): 967–979, 1988. [64] G.R. Beck, D.W.L. Yen, and T.L. Anderson. The Cydra 5 Mini-Supercomputer: Architecture and Implementation. The Journal of Supercomputing, 7(1/2): 143–180, 1993. [65] M.S. Schlansker and B.R. Rau. EPIC: An Architecture for Instruction-Level Parallel Processors. Technical report HPL-99-111, Hewlett-Packard Laboratories, 2000. [66] W.W. Hwu, R.E. Hank, D.M. Gallagher, S.A. Mahlke, D.M. Lavery, G.E. Haab, J.C. Gyllenhaal, and D.I. August. Compiler Technology for Future Microprocessors. Proceedings of the IEEE, 83(12): 1625–1640, 1995. [67] J.R. Ellis. Bulldog: A Compiler for VLIW Architectures. MIT Press, Cambridge, MA, 1985. [68] J. Dehnert and R. Towle. Compiling for the Cydra-5. Journal of Supercomputing, 7(1/2): 181–227, 1993. [69] C. Dulong, R. Krishnaiyer, D. Kulkarni, D. Lavery, W. Li, J. Ng, and D. Sehr. An Overview of the Intel IA-64 Compiler. Intel Technology Journal, Q4, 1999, pp. 1–15. [70] M.F. Jacome, G. de Veciana, and V. Lapinskii. Exploring Performance Tradeoffs for Clustered VLIW ASIPs. In Proceedings of the International Conference on Computer-Aided Design, 2000, pp. 504–510. [71] V. Lapinskii, M.F. Jacome, and G. de Veciana. Application-Specific Clustered VLIW Datapaths: Early Exploration on a Parameterized Design Space. IEEE Transactions on Computer Aided Design of Integrated Circuits and Systems, 21(8): 889–903, 2002. [72] S. Pillai and M.F. Jacome. Compiler-Directed ILP Extraction for Clustered VLIW/EPIC Machines: Predication, Speculation and Modulo Scheduling. In Proceedings of the Design Automation and Test in Europe, 2003, p. 10422. [73] P. Marwedel and G. Goossens, Eds. Code Generation for Embedded Processors. Kluwer Academic Publishers, Dordrecht, 1995. [74] C. Liem. Retargetable Compilers for Embedded Core Processors. Kluwer Academic Publishers, Dordrecht, 1997. [75] J. Fritts, W. Wolf, and B. Liu. Understanding Multimedia Application Characteristics for Designing Programmable Media Processors. In SPIE Photonics West, Media Processors, 1999, pp. 2–13. [76] B. Khailany, W.J.
Dally, S. Rixner, U.J. Kapasi, P. Mattson, J. Namkoong, J.D. Owens, B. Towles, and A. Chang. Imagine: Media Processing with Streams. IEEE Micro, 21: 35–46, 2001.
[77] J. Rabaey, H. De Man, J. Vanhoof, G. Goossens, and F. Catthoor. CATHEDRAL-II: A Synthesis System for Multiprocessor DSP Systems. In Silicon Compilation. Addison-Wesley, Reading, MA, 1987. [78] J. Montanaro, R.T. Witek, K. Anne, A.J. Black, E.M. Cooper, D.W. Dobberpuhl, P.M. Donahue, J. Eno, A. Farell, G.W. Hoeppner, D. Kruckemyer, T.H. Lee, P. Lin, L. Madden, D. Murray, M. Pearce, S. Santhanam, K.J. Snyder, R. Stephany, and S.C. Thierauf. A 160 MHz 32b 0.5 W CMOS RISC Microprocessor. In Proceedings of the International Solid-State Circuits Conference, Digest of Technical Papers, 31(11): 1703–1714, 1996. [79] P. Hicks, M. Walnock, and R.M. Owens. Analysis of Power Consumption in Memory Hierarchies. In Proceedings of the International Symposium on Low Power Electronics and Design, 1997, pp. 239–242. [80] K. Ghose and M.B. Kamble. Reducing Power in Superscalar Processor Caches Using Subbanking, Multiple Line Buffers and Bit-Line Segmentation. In Proceedings of the International Symposium on Low Power Electronics and Design, 1999, pp. 70–75. [81] C.-L. Su and A.M. Despain. Cache Design Trade-Offs for Power and Performance Optimization: A Case Study. In Proceedings of the International Symposium on Low Power Electronics and Design, 1995, pp. 63–68. [82] J. Kin, M. Gupta, and W.H. Mangione-Smith. Filtering Memory References to Increase Energy Efficiency. IEEE Transactions on Computers, 49(1): 1–15, 2000. [83] A.H. Farrahi, G.E. Téllez, and M. Sarrafzadeh. Memory Segmentation to Exploit Sleep Mode Operation. In Proceedings of the Design Automation Conference, 1995, pp. 36–41. [84] J. Kin, M. Gupta, and W.H. Mangione-Smith. The Filter Cache: An Energy Efficient Memory Structure. In Proceedings of the International Symposium on Microarchitecture, 1997, pp. 184–193. [85] R.S. Bajwa, M. Hiraki, H. Kojima, D.J. Gorny, K. Nitta, A. Shridhar, K. Seki, and K. Sasaki. Instruction Buffering to Reduce Power in Processors for Signal Processing. IEEE Transactions on Very Large Scale Integration Systems, 5(4): 417–424, 1997. [86] L. Lee, B. Moyer, and J. Arends. Instruction Fetch Energy Reduction Using Loop Caches for Embedded Applications with Small Tight Loops. In Proceedings of the International Symposium on Low Power Electronics and Design, 1999, pp. 267–269. [87] H.-H. Lee and G. Tyson. Region-Based Caching: An Energy-Delay Efficient Memory Architecture for Embedded Processors. In Proceedings of the International Conference on Compilers, Architectures and Synthesis for Embedded Systems, 2000, pp. 120–127. [88] D.A. Wood, M.D. Hill, and R.E. Kessler. A Model for Estimating Trace-Sample Miss Ratios. In Proceedings of the SIGMETRICS Conference on Measurement and Modeling of Computer Systems, 1991, pp. 79–89. [89] D.C. Burger, J.R. Goodman, and A. Kagi. The Declining Effectiveness of Dynamic Caching for General-Purpose Microprocessors. University of Wisconsin-Madison Computer Sciences Technical report 1261, 1995. [90] S. Kaxiras, Z. Hu, and M. Martonosi. Cache Decay: Exploiting Generational Behavior to Reduce Cache Leakage Power. In Proceedings of the International Symposium on Computer Architecture, 2001, pp. 240–251. [91] H. Zhou, M.C. Toburen, E. Rotenberg, and T.M. Conte. Adaptive Mode Control: A Static-Power-Efficient Cache Design. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, 2001, pp. 61–72. [92] D. Folegnani and A. Gonzalez. Energy-Effective Issue Logic. In Proceedings of the International Symposium on Computer Architecture, 2001, pp. 230–239. [93] Z.
Hu, P. Juang, K. Skadron, D. Clark, and M. Martonosi. Applying Decay Strategies to Branch Predictors for Leakage Energy Savings. In Proceedings of the International Conference on Computer Design, 2002, pp. 442–445.
[94] S.-H. Yang, M.D. Powell, B. Falsafi, K. Roy, and T.N. Vijaykumar. An Integrated Circuit/Architecture Approach to Reducing Leakage in Deep-Submicron High-Performance I-Caches. In Proceedings of the International Symposium on High-Performance Computer Architecture, 2001, pp. 147–158.
[95] P.R. Panda, N.D. Dutt, and A. Nicolau. Efficient Utilization of Scratch-Pad Memory in Embedded Processor Applications. In Proceedings of the European Design and Test Conference, 1997, pp. 7–11.
[96] D. Chiou, P. Jain, S. Devadas, and L. Rudolph. Application-Specific Memory Management for Embedded Systems Using Software-Controlled Caches. In Proceedings of the Design Automation Conference, 2000, pp. 416–419.
[97] L. Benini, A. Macii, and M. Poncino. A Recursive Algorithm for Low-Power Memory Partitioning. In Proceedings of the International Symposium on Low Power Electronics and Design, 2000, pp. 78–83.
[98] R. Banakar, S. Steinke, B.-S. Lee, M. Balakrishnan, and P. Marwedel. Scratchpad Memory: A Design Alternative for Cache On-Chip Memory in Embedded Systems. In Proceedings of the International Workshop on Hardware/Software Codesign, 2002, pp. 73–78.
[99] M. Kandemir, J. Ramanujam, M. Irwin, N. Vijaykrishnan, I. Kadayif, and A. Parikh. Dynamic Management of Scratch-Pad Memory Space. In Proceedings of the Design Automation Conference, 2001, pp. 690–695.
[100] O.S. Unsal, Z. Wang, I. Koren, C.M. Krishna, and C.A. Moritz. On Memory Behavior of Scalars in Embedded Multimedia Systems. In Proceedings of the Workshop on Memory Performance Issues, Goteborg, Sweden, 2001.
[101] P.R. Panda, F. Catthoor, N.D. Dutt, K. Danckaert, E. Brockmeyer, C. Kulkarni, A. Vandercappelle, and P.G. Kjeldsberg. Data and Memory Optimization Techniques for Embedded Systems. ACM Transactions on Design Automation of Electronic Systems, 6(2): 149–206, 2001.
[102] J. Sjödin, B. Fröderberg, and T. Lindgren. Allocation of Global Data Objects in On-Chip RAM. In Proceedings of the Workshop on Compiler and Architectural Support for Embedded Computer Systems, Washington, DC, USA, 1998.
[103] T. Ishihara and H. Yasuura. A Power Reduction Technique with Object Code Merging for Application Specific Embedded Processors. In Proceedings of the Design, Automation and Test in Europe, 2000, pp. 617–623.
[104] S. Steinke, L. Wehmeyer, B.-S. Lee, and P. Marwedel. Assigning Program and Data Objects to Scratchpad for Energy Reduction. In Proceedings of the Design, Automation and Test in Europe, 2002, pp. 409–417.
[105] O.S. Unsal, R. Ashok, I. Koren, C.M. Krishna, and C.A. Moritz. Cool Cache: A Compiler-Enabled Energy Efficient Data Caching Framework for Embedded/Multimedia Processors. ACM Transactions on Embedded Computing Systems, Special Issue on Power-Aware Embedded Computing, 2(3): 373–392, 2003.
[106] http://www.ti.com/
[107] http://www.trimedia.com/
[108] M.E. Wolf and M. Lam. A Data Locality Optimizing Algorithm. In Proceedings of the Conference on Programming Language Design and Implementation, 1991, pp. 30–44.
[109] S. Carr, K.S. McKinley, and C. Tseng. Compiler Optimizations for Improving Data Locality. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems, 1994, pp. 252–262.
[110] S. Coleman and K.S. McKinley. Tile Size Selection Using Cache Organization and Data Layout. In Proceedings of the Conference on Programming Language Design and Implementation, 1995.
[111] M.J. Wolfe. High Performance Compilers for Parallel Computing. Addison-Wesley Publishers, Reading, MA, 1995, pp. 279–290.
[112] M. Kandemir, J. Ramanujam, A. Choudhary, and P. Banerjee. A Layout-Conscious Iteration Space Transformation Technique. IEEE Transactions on Computers, 50(12): 1321–1335, 2001.
[113] N.P. Jouppi. Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers. In Proceedings of the International Symposium on Computer Architecture, 1990, pp. 364–373.
[114] T.F. Chen and J.L. Baer. Effective Hardware-Based Data Prefetching for High Performance Processors. IEEE Transactions on Computers, 44(5): 609–623, 1995.
[115] J.W.C. Fu, J.H. Patel, and B.L. Janssens. Stride Directed Prefetching in Scalar Processors. In Proceedings of the International Symposium on Microarchitecture, 1992, pp. 102–110.
[116] S.S. Pinter and A. Yoaz. A Hardware-Based Data Prefetching Technique for Superscalar Processors. In Proceedings of the International Symposium on Microarchitecture, 1996, pp. 214–225.
[117] D. Callahan, K. Kennedy, and A. Porterfield. Software Prefetching. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems, 1991, pp. 40–52.
[118] A.C. Klaiber and H.M. Levy. An Architecture for Software Controlled Data Prefetching. In Proceedings of the International Symposium on Computer Architecture, 1991, pp. 43–53.
[119] T.C. Mowry, M.S. Lam, and A. Gupta. Design and Evaluation of a Compiler Algorithm for Prefetching. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems, 1992, pp. 62–73.
[120] D.F. Zucker, R.B. Lee, and M.J. Flynn. Hardware and Software Cache Prefetching Techniques for MPEG Benchmarks. IEEE Transactions on Circuits and Systems for Video Technology, 10(5): 782–796, 2000.
[121] Y. Choi and T. Kim. Memory Layout Technique for Variables Utilizing Efficient DRAM Access Modes in Embedded System Design. In Proceedings of the Design Automation Conference, 2003, pp. 881–886.
[122] X. Fan, C.S. Ellis, and A.R. Lebeck. Memory Controller Policies for DRAM Power Management. In Proceedings of the International Symposium on Low Power Electronics and Design, 2001, pp. 129–134.
[123] V. Delaluz, A. Sivasubramaniam, M. Kandemir, N. Vijaykrishnan, and M.J. Irwin. Scheduler-Based DRAM Energy Management. In Proceedings of the Design Automation Conference, 2002, pp. 697–702.
[124] A. Ramachandran and M. Jacome. Xtream-Fit: An Energy-Delay Efficient Data Memory Subsystem for Embedded Media Processing. In Proceedings of the Design Automation Conference, 2003, pp. 137–142.
[125] P. Mattson. A Programming System for the Imagine Media Processor. PhD thesis, Stanford University, 2001.
[126] P. Grun, N. Dutt, and A. Nicolau. Memory Architecture Exploration for Programmable Embedded Systems. Kluwer Academic Publishers, Dordrecht, 2003.
[127] A. Wolfe and A. Chanin. Executing Compressed Programs on an Embedded RISC Architecture. In Proceedings of the International Symposium on Microarchitecture, 1992, pp. 81–91.
[128] C. Lefurgy, P. Bird, I.-C. Chen, and T. Mudge. Improving Code Density Using Compression Techniques. In Proceedings of the International Symposium on Microarchitecture, 1997, pp. 194–203.
[129] H. Lekatsas and W. Wolf. SAMC: A Code Compression Algorithm for Embedded Processors. IEEE Transactions on Computer Aided Design of Integrated Circuits and Systems, 18(12): 1689–1701, 1999.
[130] W.E. Dougherty, D.J. Pursley, and D.E. Thomas. Instruction Subsetting: Trading Power for Programmability. In Proceedings of the International Workshop on Hardware/Software Codesign, 1998.
[131] D. Sylvester and K. Keutzer. A Global Wiring Paradigm for Deep Submicron Design. IEEE Transactions on Computer Aided Design of Integrated Circuits and Systems, 19(2): 242–252, 2000.
[132] C.-T. Hsieh and M. Pedram. Architectural Power Optimization by Bus Splitting. In Proceedings of the Conference on Design, Automation and Test in Europe, 2000, pp. 612–616.
[133] M.R. Stan and W.P. Burleson. Bus-Invert Coding for Low-Power I/O. IEEE Transactions on Very Large Scale Integration Systems, 3(1): 49–58, 1995.
[134] H. Mehta, R.M. Owens, and M.J. Irwin. Some Issues in Gray Code Addressing. In Proceedings of the Sixth Great Lakes Symposium on VLSI, 1996, pp. 178–181.
[135] L. Benini, G. De Micheli, E. Macii, M. Poncino, and S. Quez. System-Level Power Optimization of Special Purpose Applications — The Beach Solution. In Proceedings of the International Symposium on Low Power Electronics and Design, 1997, pp. 24–29.
[136] L. Benini, G. De Micheli, E. Macii, D. Sciuto, and C. Silvano. Address Bus Encoding Techniques for System-Level Power Optimization. In Proceedings of the Design, Automation and Test in Europe, 1998, pp. 861–867.
[137] P.R. Panda and N.D. Dutt. Low-Power Memory Mapping Through Reducing Address Bus Activity. IEEE Transactions on Very Large Scale Integration Systems, 7(3): 309–320, 1999.
[138] N. Chang, K. Kim, and J. Cho. Bus Encoding for Low-Power High-Performance Memory Systems. In Proceedings of the Design Automation Conference, 2000, pp. 800–805.
Security in Embedded Systems

17 Design Issues in Secure Embedded Systems
   A.G. Voyiatzis, A.G. Fragopoulos, and D.N. Serpanos
17
Design Issues in Secure Embedded Systems

A.G. Voyiatzis, A.G. Fragopoulos, and D.N. Serpanos
University of Patras

17.1 Introduction ................................................................. 17-1
17.2 Security Parameters ..................................................... 17-2
     Abilities of Attackers • Security Implementation Levels • Implementation Technology and Operational Environment
17.3 Security Constraints in Embedded Systems Design ... 17-4
     Energy Considerations • Processing Power Limitations • Flexibility and Availability Requirements • Cost of Implementation
17.4 Design of Secure Embedded Systems ........................ 17-7
     System Design Issues • Application Design Issues
17.5 Cryptography and Embedded Systems ...................... 17-10
     Physical Security • Side-Channel Cryptanalysis • Side-Channel Implementations • Fault-Based Cryptanalysis • Passive Side-Channel Cryptanalysis • Countermeasures
17.6 Conclusions ................................................................. 17-20
References ........................................................................... 17-20
17.1 Introduction

A computing system is typically considered an embedded system when it is a programmable device with limited resources (energy, memory, computation power, etc.) that serves one or a few applications and is embedded in a larger system. Their limited resources make them unsuitable for use as general-purpose computing systems. However, they usually have to meet hard requirements, such as time deadlines and other real-time processing requirements. Embedded systems can be classified in two general categories: (1) standalone embedded systems, where all hardware and software components of the system are physically close and incorporated into a single device, for example, a Personal Digital Assistant (PDA) or a system in a washing machine or a fax, and there is no attachment to a network, and (2) distributed (networked) embedded systems, where several autonomous components — each one a standalone embedded system — communicate with each other over a network in order to deliver services or support an application. Several architectural and design parameters have led to the development of distributed embedded applications, such as the placement of processing power at the physical point where an event takes place, data reduction, etc. [1].
The increasing capabilities of embedded systems combined with their decreasing cost have enabled their adoption in a wide range of applications and services, from financial and personalized entertainment services to automotive and military applications in the field. Importantly, in addition to the typical requirements for responsiveness, reliability, availability, robustness, and extensibility, many conventional embedded systems and applications have significant security requirements. However, security is a resource-demanding function that needs special attention in embedded computing. Furthermore, the wide deployment of small devices used in critical applications has triggered the development of new, strong attacks that exploit systemic characteristics; traditional attacks focused on algorithmic characteristics instead, because attackers were unable to experiment with the physical devices used in secure applications. Thus, the design of secure embedded systems requires special attention. In this chapter we provide an overview of security issues in embedded systems. Section 17.2 presents the parameters of security systems, while Section 17.3 describes the effect of security on the resource-constrained environment of embedded systems. Section 17.4 presents the main issues in the design of secure embedded systems. Finally, Section 17.5 covers in detail attacks on, and countermeasures for, cryptographic algorithm implementations in embedded systems, considering the critical role of cryptography and the novel systemic attacks developed due to the wide availability of embedded computing systems.
17.2 Security Parameters

Security is a generic term used to indicate several different requirements in computing systems. Depending on the system and its use, different security properties may need to be satisfied in each operational environment. Overall, secure systems need to meet all or a subset of the following requirements [2,3]:

1. Confidentiality. Data stored in the system or transmitted from the system have to be protected from disclosure; this is usually achieved through data encryption.
2. Integrity. A mechanism to ensure that data received in a communication are indeed the data that were transmitted.
3. Nonrepudiation. A mechanism to ensure that all entities (systems or applications) participating in a transaction cannot deny their actions in the transaction.
4. Availability. The system's ability to perform its primary functions and serve its legitimate users without any disruption, under all conditions, including malicious attacks that aim to disrupt service, such as the well-known Denial of Service (DoS) attacks.
5. Authentication. The ability of the receiver of a message to identify the message sender.
6. Access control. The ability to ensure that only legal users may take part in a transaction and have access to system resources. To be effective, access control is typically used in conjunction with authentication.

These requirements are placed by different parties involved in the development and use of computing systems, for example, vendors, application providers, and users. For example, vendors need to ensure the protection of their Intellectual Property (IP) that is embedded in the system, while end users want to be certain that the system will provide secure user identification (only authorized users may access the system and its applications, even if the system gets in the hands of malicious users) and high availability, that is, that the system will be available under all circumstances; also, content providers are concerned about the protection of their IP, for example, that the data delivered through an application are not copied. Ravi et al. [3,4] have identified the participating parties in system and application development and use as well as their security requirements. This classification enables us to identify several possible malicious users, depending on a party's view; for example, for the hardware manufacturer, even a legal end user of a portable device (e.g., a PDA or a mobile phone) can be a possible malicious user. Considering the security requirements and the interested parties above, the design of a secure system requires identification and definition of the following parameters: (1) the abilities of the attackers, (2) the level at which security should be implemented, and (3) the implementation technology and operational environment.
17.2.1 Abilities of Attackers

Malicious users can be classified into several categories, depending on their knowledge, equipment, etc. Abraham et al. [5] propose a classification into three categories, based on knowledge, hardware and software equipment, and funds:

1. Class I — clever outsiders. Very intelligent attackers, not well funded and without sophisticated equipment. They do not have specific knowledge of the attacked system; basically, they try to exploit hardware vulnerabilities and software glitches.
2. Class II — knowledgeable insiders. Attackers with an outstanding technical background and education, using highly sophisticated equipment and, often, with inside information about the system under attack; such attackers include former employees who participated in the development cycle of the system.
3. Class III — funded organizations. Attackers who mostly work in teams and have excellent technical skills and theoretical background. They are well funded, have access to very advanced tools, and have the capability to analyze the system — technically and theoretically — and develop highly sophisticated attacks. Such organizations could be well-organized educational foundations, government institutions, etc.
17.2.2 Security Implementation Levels

Security can be implemented at various system levels, ranging from protection of the physical system itself to application and network security. Clearly, different mechanisms and implementation technologies have to be used to implement security at different levels. In general, four levels of security are considered: (1) physical, (2) hardware, (3) software, and (4) network and protocol security. Physical security mechanisms aim to protect systems from unauthorized physical access to the system itself. Protecting systems physically ensures data privacy and data and application integrity. According to US Federal Standard 1027, physical security mechanisms are considered successful when they ensure that a possible attack has a low probability of success and a high probability of the attacker being traced within reasonable time. The wide adoption of embedded computing systems in a variety of devices, such as smartcards, mobile devices, and sensor networks, as well as the ability to network them, for example, through the Internet or VPNs, has led to revision and reconsideration of physical security. Weingart [6] surveys possible attacks and countermeasures concerning physical security, concluding that physical security needs continuous improvement and revision in order to remain at the leading edge. Hardware security may be considered a subset of physical security, referring to security issues concerning the hardware parts of a computer system. Hardware-level attacks exploit circuit and technological vulnerabilities and take advantage of possible hardware defects. These attacks do not necessarily require very sophisticated and expensive equipment. Anderson and Kuhn [7] describe several ways to attack smartcards and microcontrollers through the use of unusual voltages and temperatures that affect the behavior of specific hardware parts, or through microprobing a smartcard chip, such as the Subscriber Identity Module (SIM) chip found in cellular phones. Reverse-engineering attack techniques are equally successful, as Blythe et al. [8] reported for a wide range of microprocessors. Their work concluded that special hardware protection mechanisms are necessary to avoid these types of attacks; such mechanisms include silicon coatings of the chip, increased complexity in the chip layout, etc. One of the major goals in the design of secure systems is the development of secure software, free of flaws and security vulnerabilities that may appear under certain conditions. Numerous software security flaws have been identified in real systems, for example, by Landwehr et al. [9], and there have been several cases where malicious intruders hacked into systems through exploitation of software defects [10]. Some methods for the prevention of such problems have been proposed by Tevis and Hamilton [11]. The use of the Internet, which is an unsafe interconnection for information transfer, as a backbone network for communicating entities, together with the wide deployment of wireless networks, demonstrates that improvements have to be made in existing protocol architectures in order to provide new, secure protocols [12,13]. Such protocols will ensure authentication between communicating entities, integrity of
communicated data, protection of the communicating parties, and nonrepudiation (the inability of an entity to deny its participation in a communication transaction). Furthermore, special attention has to be paid to the design of secure protocols for embedded systems, due to their physical constraints, that is, limited battery power and limited processing and memory resources, as well as their cost and communication requirements.
17.2.3 Implementation Technology and Operational Environment

In regard to implementation technology, systems can be classified by static versus programmable technology and fixed versus extensible architecture. When static technology is used, the hardware-implemented functions are fixed and inflexible, but they offer higher performance and can reduce cost. However, static systems can be more vulnerable to attacks, because, once a flaw is identified — for example, in the design of the system — it is impossible to patch already deployed systems, especially in the case of large installations, such as SIM cards for cellular telephony or pay-per-view TV. Static systems must be implemented only once and correctly, which is an unattainable expectation in computing. In contrast, programmable systems are not limited as static ones are, but their flexibility can prove useful to an attacker as well; system flexibility may allow an attacker to manipulate the system in ways not expected or defined by the designer. Programmability is typically achieved through the use of specialized software over a general-purpose processor or hardware. Fixed architectures are composed of specific hardware components that cannot be altered. Typically, it is almost impossible to add functionality in later stages, but they have lower implementation cost and are, in general, less vulnerable because they offer limited choices to attackers. An extensible architecture is like a general-purpose processor, capable of interfacing with several peripherals through standardized connections. Peripherals can be changed or upgraded easily to increase security or to provide new functionality. However, an attacker can connect malicious peripherals or interface with the system in untested or unexpected ways. As testing is more difficult relative to static systems, one cannot be too confident that the system operates correctly under every possible input. Field Programmable Gate Arrays (FPGAs) combine benefits of all types of systems and architectures, because they offer hardware implementation performance together with programmability, enabling system reconfiguration. They are widely used to implement cryptographic primitives in various systems. Thus, significant attention has to be paid to the security of FPGAs as independent systems. Research efforts address this issue with systematic approaches and identify open problems in FPGA security; for example, Wollinger et al. [14] provide such an approach and address several open problems, including resistance under physical attacks.
17.3 Security Constraints in Embedded Systems Design

The design of secure systems requires special considerations, because security functions are resource-demanding, especially in terms of processing power and energy consumption. The limited resources of embedded systems require novel design approaches in order to deal with trade-offs between efficiency — speed and cost — and effectiveness — satisfaction of the functional and operational requirements.
17.3.1 Energy Considerations

Embedded systems are often battery powered, that is, they are power constrained. Battery capacity constitutes a major bottleneck for security processing on embedded systems. Unfortunately, improvements in battery capacity do not keep pace with the increasing performance, complexity, and functionality of the systems they power. Gunther et al. [15], Buchmann [16], and Lahiri et al. [17] report a widening "battery gap," due to the exponential growth of power requirements and the linear growth in energy density. Thus, the power subsystem of embedded systems is a weak point of system security. A malicious
attacker, for example, may form a DoS attack by draining the system's battery more quickly than usual. Martin et al. [18] describe three ways in which such an attack may take place: (1) service request power attacks, (2) benign power attacks, and (3) malignant power attacks. In service request attacks, a malicious user may repeatedly request that the device serve a power-hungry application, even if the application is not supported by the device. In benign power attacks, the legitimate user is forced to execute an application with high power requirements, while in malignant power attacks malicious users modify the executable code of an existing application, in order to drain as much battery power as possible without changing the application functionality. They conclude that such attacks may reduce battery life by one to two orders of magnitude. Inclusion of security functions in an embedded system places extra requirements on power consumption due to: (1) the extra processing power necessary to perform various security functions, such as authentication, encryption, decryption, signing, and data verification, (2) the transmission of security-related data between various entities, if the system is distributed, for example, a wireless sensor network, and (3) the energy required to store security-related parameters. Embedded systems are often used to deploy performance-critical functions, which require a lot of processing power. Inclusion of the cryptographic algorithms used as building blocks in secure embedded design can drain the system battery considerably. The energy consumption of the cryptographic algorithms used in security protocols has been analyzed thoroughly, for example, by Potlapally et al. [19]. They present a general framework showing that asymmetric algorithms have the highest energy cost, symmetric algorithms are the next most power-hungry category, and hash algorithms are at the bottom. The power required by cryptographic algorithms is significant, as measurements indicate [20]. Importantly, in many applications the power consumed by security functions is larger than that used for the applications themselves. For example, Raghunathan et al. [21] present the battery gap for a sensor node with an embedded processor, calculating the number of transactions that the node can serve, working in secure or insecure mode, until the system battery runs out. Their results show that, in secure mode, the battery is exhausted in less than half the time required in insecure mode. Many applications that involve embedded systems are implemented through distributed, networked platforms, resulting in a power overhead due to communication between the various nodes of the system [1]. Considering a wireless sensor network, which is a typical distributed embedded system, one can easily see that significant energy is consumed in communication between the various nodes. Factors such as modulation type, data rate, transmit power, and security overhead affect power consumption significantly [22]. Savvides et al. [23] showed that the radio communication between nodes consumes most of the power, that is, 50 to 60% of the total power, when using the WINS (Wireless Integrated Network Sensor) platform [24]. Furthermore, in a wireless sensor network, the security functions consume energy due to extra internode exchange of cryptographic information (key exchange, authentication information) and per-message security overhead, which is a function of both the number and the size of messages [20].
It is important to identify the energy consumption of alternative security mechanisms. Hodjat and Verbauwhede [25], for example, have measured the energy consumption of two widely used protocols for key exchange between entities in a distributed environment: (1) the Diffie–Hellman protocol [26] and (2) the basic Kerberos protocol [26]. Their results show that Diffie–Hellman, implemented using elliptic curve public key cryptography, consumes 1213.7 mJ, 4296 mJ, and 9378.3 mJ for 128-bit, 192-bit, and 256-bit keys, respectively, while the Kerberos key exchange protocol using symmetric cryptography consumes 139.62 mJ; this indicates that the Kerberos protocol configuration consumes significantly less energy.
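These figures translate directly into battery life. The following back-of-the-envelope sketch, written in C, assumes a hypothetical 3.6 V, 800 mAh battery dedicated entirely to key exchange (both battery values are illustrative assumptions, not from the cited work; the per-exchange costs are those reported above):

    #include <stdio.h>

    int main(void) {
        /* Hypothetical battery: 3.6 V, 800 mAh (assumed for illustration). */
        const double battery_joules = 3.6 * 0.800 * 3600.0;   /* ~10368 J */

        /* Per-exchange energy costs reported by Hodjat and Verbauwhede [25]. */
        const double dh_ecc_256_mj = 9378.3;   /* Diffie-Hellman, 256-bit ECC */
        const double kerberos_mj   = 139.62;   /* Kerberos, symmetric keys    */

        printf("DH (ECC-256): %.0f exchanges per charge\n",
               battery_joules / (dh_ecc_256_mj / 1000.0));   /* ~1106  */
        printf("Kerberos:     %.0f exchanges per charge\n",
               battery_joules / (kerberos_mj / 1000.0));     /* ~74259 */
        return 0;
    }

Under these assumptions the symmetric-key protocol permits roughly 67 times more exchanges per charge, which illustrates why protocol selection is an energy design decision and not only a security one.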
17.3.2 Processing Power Limitations

Security processing places significant additional requirements on the processing power of embedded systems, since conventional architectures are quite limited. The term security processing is used to indicate the portion of the system computational effort that is dedicated to the implementation of the security requirements. Since embedded systems have limited processing power, they cannot cope efficiently with
the execution of the complex cryptographic algorithms used in the secure design of an embedded system. For example, the generation of a 512-bit key for the RSA public key algorithm requires 3.4 min on the PalmIIIx PDA, while encryption using DES takes only 4.9 msec per 64-bit block, leading to an encryption rate of about 13 Kbps [27]. The adoption of modern embedded systems in high-end systems (servers, firewalls, and routers) with increasing data transmission rates and complex security protocols, such as SSL, makes the security processing gap wider and demonstrates that existing embedded architectures need to be improved in order to keep up with the increasing computational requirements placed by security processing. The width of the processing gap has been exposed by measurements, such as those by Ravi et al. [4], who measured the security processing gap in the client-server model using the SSL protocol for various embedded microprocessors. Specifically, considering a StrongARM (206 MHz SA-1110) processor, which may be used in a low-end system such as a PDA or a mobile device, dedicating 100% of the processing power to SSL processing achieves data rates of up to 1.8 Mbps, while a 2.8 GHz Xeon achieves data rates of up to 29 Mbps. Considering that the data rates of low-end systems range between 128 Kbps and 2 Mbps, while data rates of high-end systems range between 2 and 100 Mbps, it is clear that these processors cannot sustain the upper end of their target data rates even when fully dedicated to security processing, leaving a security processing gap.
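The 13 Kbps figure follows directly from the per-block latency; a minimal sketch reproducing the arithmetic (the 64-bit block size is defined by the DES standard, and the 4.9 msec latency is the measurement reported above):

    #include <stdio.h>

    int main(void) {
        const double block_bits = 64.0;     /* DES operates on 64-bit blocks */
        const double block_time_s = 4.9e-3; /* reported time per block [27]  */

        double rate_kbps = (block_bits / block_time_s) / 1000.0;
        printf("DES throughput: %.1f Kbps\n", rate_kbps);   /* ~13.1 Kbps */

        /* Even a modest 128 Kbps link would need roughly 10x more
           security processing than this processor can deliver. */
        printf("Shortfall vs. 128 Kbps link: %.1fx\n", 128.0 / rate_kbps);
        return 0;
    }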
17.3.3 Flexibility and Availability Requirements

Designing and implementing security in an embedded system does not mean that the system's operational security characteristics remain fixed over time. Considering that security requirements evolve and security protocols are continuously strengthened, embedded systems need to be flexible and adaptable to changes in security requirements, without losing their performance and availability goals or their primary security objectives. Modern embedded systems are characterized by their ability to operate in different environments, under various conditions. Such an embedded system must be able to achieve different security objectives in every environment; thus, the system must be characterized by significant flexibility and efficient adaptation. For example, consider a PDA with mobile telecommunication capabilities that may operate in a wireless environment [28–30] or provide 3G cellular services [31]; different security objectives must be satisfied in each case. Another issue that must be addressed is the implementation of different security requirements at different layers of the protocol architecture. Consider, for example, a mobile PDA that must be able to execute several security protocols, such as IPSec [13], SSL [12], and WEP [32], depending on its specific application. Importantly, availability is a significant requirement that needs special support, considering that it must be provided in a world of evolving functionality and increasing system complexity. Conventional embedded systems should aim to provide high availability not only in their expected, attack-free environment but in an emerging hostile environment as well.
17.3.4 Cost of Implementation

Inclusion of security in embedded system design can increase system cost dramatically. The problem originates from the strong resource limitations of embedded systems, which require the system to exhibit high performance as well as a high level of security while retaining a low implementation cost. It is necessary to perform a careful, in-depth analysis of the designed system, in terms of the abilities of possible adversaries, the environmental conditions under which the system will operate, etc., in order to estimate cost realistically. Consider, for example, the incorporation of a tamper-resistant cryptographic module in an embedded system. As described by Ravi et al. [4], according to the Federal Information Processing Standard [33], a designer can distinguish four levels of security requirements for cryptographic modules. The choice of security level influences design and implementation cost significantly; thus, the manufacturer faces a trade-off between the security requirements to be implemented and the cost of manufacturing.
17.4 Design of Secure Embedded Systems

Secure embedded systems must provide basic security properties, such as data integrity, as well as mechanisms and support for more complex security functions, such as authentication and confidentiality. Furthermore, they have to support the security requirements of applications, which are implemented, in turn, using the security mechanisms offered by the system. In this section, we describe the main design issues at both the system and application level.
17.4.1 System Design Issues

The design of secure embedded systems needs to address several issues and parameters, ranging from the employed hardware technology to software development methodologies. Although several techniques used in general-purpose systems can be used effectively in embedded system development as well, there are specific design issues that need to be addressed separately, because they are unique to, or more acute in, embedded systems, given the high volume of available low-cost systems that malicious users can use to develop attacks. The major design issues are tamper-resistance properties, memory protection, IP protection, management of processing power, communication security, and embedded software design. These issues are covered in the following paragraphs. Modern secure embedded systems must be able to operate in various environmental conditions, without loss of performance or deviation from their primary goals. In many cases they must survive various physical attacks, and so must have tamper-resistance mechanisms. Tamper resistance is the property that enables systems to prevent the distortion of their physical parts. In addition to tamper-resistance mechanisms, there exist tamper-evidence mechanisms, which allow users or technical staff to identify tampering attacks and take countermeasures. Computer systems are vulnerable to tampering attacks, where malicious users intervene in hardware system parts and compromise them in order to take advantage of them. The security of many critical systems relies on the tamper resistance of smartcards and other embedded processors. Anderson and Kuhn [7] describe various techniques and methods to attack tamper-resistant systems, concluding that tamper-resistance mechanisms need to be extended or reevaluated. Memory technology may be an additional weakness in system implementation. Typical embedded systems have ROM, RAM, and EEPROM memory to store data. EEPROM constitutes the vulnerable spot of such systems, because it can be erased through appropriate electrical signaling by malicious users [7]. Intellectual Property (IP) protection of manufacturers is an important issue addressed in secure embedded systems. Complicated systems tend to be partitioned into smaller independent modules, leading to module reusability and cost reduction. These modules embody the IP of their manufacturers, which needs to be protected from third-party users who might claim and use them. The illegal user of an IP block does not necessarily need full, detailed knowledge of the IP component, since IP blocks are independent modules that can very easily be incorporated and integrated with the rest of the system components. Lach et al. [34] propose a fingerprinting technique for IP blocks implemented using FPGAs, through a unique marker embedded onto the IP hardware that identifies both the origin and the recipient of the IP block. They also state that the removal of such a mark is extremely difficult, with a probability of success of less than one in a million. Implementation of security techniques for tamper resistance, tamper prevention, and IP protection may require additional processing power, which is limited in embedded systems. The "processing gap" between the computational requirements of security and the available processing power of embedded processors requires special consideration. A variety of architectures and enhancements in security protocols have been proposed in order to bridge that gap. Burke et al.
[35] propose enhancements to the Instruction Set Architecture (ISA) of embedded processors, in order to efficiently calculate various cryptographic primitives, such as permutations, bit rotations, fast substitutions, and modular arithmetic. Another approach is to build dedicated cryptographic embedded coprocessors with their own ISA; the CryptoManiac coprocessor [36] is an example of this approach. Several vendors, for example,
Infineon [37] and ARM [38], have manufactured microcontrollers with embedded coprocessors dedicated to cryptographic functions. Intel [39] announced a new generation of 64-bit embedded processors with features that can speed up processing-hungry algorithms, such as cryptographic ones; these features include larger register sets, parallel execution of computations, improvements in large-integer multiplication, etc. In a third approach, software optimizations are exploited. Potlapally et al. [40] have conducted extensive research on the improvement of public-key algorithms, studying various algorithmic optimizations and identifying an algorithm design space where performance improves significantly. Also, SmartMIPS [41] provides system flexibility and adaptation to changes in security requirements through high-performance software-based enhancements of its cryptographic modules, while supporting various cryptographic algorithms. Even if the "processing gap" is bridged and security functions are provided, embedded systems are required to support secure communications as well, considering that embedded applications are often implemented in a distributed environment where communicating systems may exchange (possibly) sensitive data over an untrusted network, wired, wireless, or mobile, such as the Internet, a Virtual Private Network, or the public telephone network. In order to fulfill the basic security requirements for secure communications, embedded systems must be able to use strong cryptographic algorithms and to support various protocols. One of the fundamental requirements regarding secure protocols is interoperability, leading to the requirement for system flexibility and adaptability. Since an embedded system can operate in several environments, for example, a mobile phone may provide 3G cellular services or connect to a wireless LAN, it is necessary for the system to operate securely in all environments without loss of performance. Furthermore, as security protocols are developed for various layers of the OSI reference model, embedded systems must be adaptable to different security requirements at each layer of the architecture. Finally, the continuous evolution of security protocols requires system flexibility, as new standards are developed, requirements are reevaluated, and new cryptographic techniques are added to the overall architecture. A comprehensive presentation of the evolution of security protocols in wireless communications, such as WTLS [42], MET [43], and IPSec [13], is provided by Raghunathan et al. [21]. An important consideration in the development of flexible secure communication subsystems for embedded systems is the limitation of energy, processing, and memory resources. The performance/cost trade-off demands special attention to the placement of protocol functions in hardware (for high performance) or software (for cost reduction). Embedded software, such as the operating system or application-specific code, constitutes a crucial factor in secure embedded system design. Kocher and co-workers [3] identify three basic factors that make embedded software development a challenging area of security: (1) complexity of the system, (2) system extensibility, and (3) connectivity. Embedded systems serve critical, complex, and hard-to-implement applications, with many parameters that need to be considered, which, in turn, leads to "buggy" and vulnerable software.
Furthermore, the required extensibility of conventional embedded systems makes the exploitation of vulnerabilities relatively easy. Finally, as modern embedded systems are designed with network connectivity, the higher the system's degree of connectivity, the higher the risk that a software breach spreads over time. Many attacks implemented by malicious users exploit software glitches and lead to system unavailability, which can have a disastrous impact, for example, in the case of a DoS attack on a military embedded system. Landwehr et al. [9] present a survey of common software security faults, helping designers learn from past faults. Tevis and Hamilton [11] propose methods to detect and prevent software vulnerabilities, focusing on weaknesses that have to be avoided in order to prevent buffer overflow attacks, heap overflow attacks, array indexing attacks, etc. They also provide programs that help designers analyze the security of their software. Buffer overflow attacks constitute the most widely used type of attack leading to unavailability of the attacked system; with these attacks, malicious users exploit system vulnerabilities and are able to execute malicious code, which can cause several problems, such as a system crash (preventing legitimate users from using the system), loss of sensitive data, etc. Shao et al. [44] propose a technique, called Hardware–Software Defender, which aims to protect an embedded system from buffer overflow attacks; their proposal is to design a secure instruction set, extending the instruction set of existing microprocessors, and to demand from outside
software developers to call secure functions from that set. The limited memory resources of embedded systems, specifically the lack of disk space and virtual memory, make the system vulnerable to memory-hungry applications: applications that require an excessive amount of memory have no swap file to grow into and can very easily cause an out-of-memory unavailability of the system. Given the significance of this potential problem and attack, Biswas et al. [45] propose mechanisms to protect an embedded system from such a memory overflow, thus providing reliability and availability of the system: (1) use of software runtime checks to detect possible out-of-memory conditions, (2) allowing out-of-memory data segments to be placed in free system space, and (3) compressing used and no-longer-needed data. The bounds-checking discipline that defends against the buffer overflow attacks discussed above is sketched below.
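The following minimal C sketch contrasts an unsafe idiom with a bounded one; the function names and buffer size are hypothetical and illustrative, and the example is not drawn from the cited works:

    #include <stdio.h>
    #include <string.h>

    #define BUF_LEN 16

    /* Unsafe: strcpy() writes past buf when the input exceeds
       BUF_LEN - 1 bytes, overwriting adjacent stack memory
       (e.g., a return address) -- the classic buffer overflow. */
    void handle_request_unsafe(const char *input) {
        char buf[BUF_LEN];
        strcpy(buf, input);               /* no bounds check */
        printf("processing: %s\n", buf);
    }

    /* Safer: the copy is explicitly bounded and NUL-terminated,
       so oversized input is truncated instead of corrupting memory. */
    void handle_request_safe(const char *input) {
        char buf[BUF_LEN];
        strncpy(buf, input, BUF_LEN - 1);
        buf[BUF_LEN - 1] = '\0';
        printf("processing: %s\n", buf);
    }

    int main(void) {
        /* A 32-byte "attack" string easily exceeds the 16-byte buffer. */
        const char *attack = "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA";
        handle_request_safe(attack);      /* truncates safely */
        /* handle_request_unsafe(attack);    would corrupt the stack */
        return 0;
    }

The Hardware–Software Defender approach of Shao et al. [44] aims to enforce such protection at the instruction-set level, rather than relying on every programmer to follow the safe idiom.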
17.4.2 Application Design Issues

Embedded system applications present significant challenges to system designers, who must achieve efficient and secure systems. A key issue in secure embedded design is user identification and access control. User identification includes the mechanisms that guarantee that only legitimate users have access to system resources and that can verify, whenever requested, the identity of the user who has access to the system. The explosive growth of mobile devices and their use in critical, sensitive transactions, such as bank transactions, e-commerce, etc., demand secure systems with high performance and low cost. This demand has become urgent and crucial considering the successful attacks on these systems, such as the recent hardware hacking attacks on PIN (Personal Identification Number)-based bank ATMs (Automatic Teller Machines), which have led to significant loss of money and decreased the credibility of financial organizations in the eyes of the public. A solution to this problem may come from an emerging technology for user identification that is based on biometric recognition, for both user identification and verification. Biometrics is based on recognizing patterns in biological data acquired from a user who wants to gain access to a system, for example, palm prints [46], fingerprints [47], iris scans, etc., and comparing them with data stored in databases that identify the legitimate users of the system [48]. Moon et al. [49] propose a secure smartcard with biometric capabilities, claiming that such a system is less vulnerable to attacks than software-based solutions and that the combination of smartcard and fingerprint recognition is much more robust than PIN-based identification. Implementation of such systems is realistic, as Tang et al. [50] illustrated with the implementation of a fingerprint recognition system of high reliability and high speed; they achieved an average computational time per fingerprint image of less than 1 sec, using a fixed-point arithmetic StrongARM 200 MHz embedded processor. As mentioned previously, an embedded system must store information that enables it to identify and validate the users who have access to the system. But how does an embedded system store this information? Embedded systems use several types of memory to store different types of data: (1) ROM and EPROM to store program data serving generic applications, (2) RAM to store temporary data, and (3) EEPROM and FLASH memories to store mobile downloadable code [20]. In an embedded device such as a PDA or a mobile phone, several pieces of sensitive information, such as PINs, credit card numbers, personal data, keys, and certificates for authorization purposes, may be permanently stored in secondary storage media. The requirement to protect this information, together with the rapid growth of the communication capabilities of embedded devices, for example, mobile Internet access, which makes embedded systems vulnerable to network attacks as well, leads to increasing demands for secure storage space. The use of strong cryptographic algorithms to ensure data integrity and confidentiality is not feasible in most embedded systems, mainly due to their limited computational resources. Benini et al. [51] present a survey of architectures and techniques used to implement memory for embedded systems, taking into consideration their energy limitations.
Rosenthal [52] presents an effective way to ensure that data cannot be erased or destroyed, by "hiding" memory from the processor through the use of a serial EEPROM; this is the same as a standard EEPROM, with the only difference that a serial link binds the memory to the processor, which reads and writes data using a strict protocol. Actel [53] describes security issues and design considerations for the implementation of embedded memories using FPGAs, claiming
that SRAM FPGAs are vulnerable to Class I [5] attacks, while nonvolatile Flash- and antifuse-based FPGA memories are preferable, since they provide higher levels of security relative to SRAM FPGAs. Another key issue in secure embedded systems design is to ensure that any digital content stored in or downloaded to the embedded system will be used according to the terms and conditions the content provider has set and in accordance with the agreements between user and provider; such content includes software for a specific application or a hardware component embedded in the system by a third-party vendor. It is essential that conventional embedded devices, mobile or not, be enhanced with Digital Rights Management (DRM) mechanisms, in order to protect the digital IP of manufacturers and vendors. Trusted computing platforms constitute one approach to resolve this problem. Such platforms are significant, in general, as indicated by the Trusted Computing Platform Alliance (TCPA) [54], which tries to standardize the methods to build trusted platforms. For embedded systems, IP protection can be implemented in various ways. A method to produce a trusted computing platform based on a trusted, secure hardware component, called a spy, can lead to systems executing one or more applications securely [55,56]. Ways to transform a 3G mobile device into a trusted one have been investigated by Messerges and Dabbish [57], who protect content through analysis and probing of the various components in a trusted system; for example, the operating system of the embedded device is enhanced with a DRM hardware–software security component, which transforms the system into a trusted one. Alternatively, Thekkath et al. [58] propose a method to prevent unauthorized reading, modification, and copying of proprietary software code, using an eXecute Only Memory (XOM) system that permits only code execution. The concept is that code stored in a device can be marked as "execute only" and content-sensitive applications can be stored in independent compartments [59]. If an application tries to access data outside its compartment, it is stopped. Significant attention has to be paid to protection against possible attacks through malicious downloadable software, such as viruses, Trojans, logic bombs, etc. [60]. The wide deployment of distributed embedded systems and the Internet has resulted in the requirement that portable embedded systems, for example, mobile phones and PDAs, be able to download and execute various software applications. This ability may be new to the world of portable, highly constrained embedded systems, but it is not new in the world of general-purpose systems, which have long been able to download and execute Java applets and executable files from the Internet or from other network resources. One major problem in this service is that users cannot be sure about the content of the software that is downloaded and executed on their systems, who its creator is, and what its origin is. Kingpin and Mudge [61] provide a comprehensive presentation of security issues in personal digital assistants, analyzing in detail what malicious software is, that is, viruses, Trojans, backdoors, etc., where it resides, and how it spreads, giving future users of such devices a deeper understanding of the extra security risks that arise with the use of mobile downloadable code.
An additional important consideration is the robustness of the downloaded code: once the mobile code is considered secure, downloaded, and executed, it must not affect preinstalled system software. Various techniques have been proposed to protect remote hosts from malicious mobile code. The sandbox technique, proposed by Rubin and Geer [62], is based on the idea that the mobile code cannot execute system functions; that is, it cannot affect the file system or open network connections. Instead of disabling mobile code from execution, one can empower it using enhanced security policies, as Venkatakrishnan et al. [63] propose. Necula [64] suggests the use of proof-carrying code: the producer of the mobile code, a possibly untrusted source, must embed some type of proof that the remote host can test in order to establish the validity of the mobile code.
17.5 Cryptography and Embedded Systems

Secure embedded systems should support the basic security functions for (1) confidentiality, (2) integrity, and (3) authentication. Cryptography provides a mechanism that ensures that the previous three requirements are met. However, implementation of cryptography in embedded systems can be a challenging task. The requirement of high performance has to be achieved in a resource-limited environment; this
task is even more challenging when low power constraints exist. Performance usually dictates an increased cost, which is not always desirable or possible. Cryptography can protect digital assets provided that the secret keys of the algorithms are stored and accessed in a secure manner. For this, the use of specialized hardware devices to store the secret keys and to implement cryptographic algorithms is preferred over the use of general-purpose computers. However, this also increases the implementation cost and results in reduced flexibility. On the other hand, flexibility is required, because modern cryptographic protocols do not rely on a specific cryptographic algorithm but rather allow the use of a wide range of algorithms for increased security and adaptability to advances in cryptanalysis. For example, both the SSL and IPSec network protocols support numerous cryptographic algorithms to perform the same function, for example, encryption. The protocol enables negotiation of the algorithms to be used, in order to ensure that both parties use the desirable level of protection dictated by their security policies. Apart from the performance issue, a correct cryptographic implementation requires expertise that is not always available or affordable during the lifecycle of a system. Insecure implementations of theoretically secure algorithms have made their way to headline news quite often in the past. An excellent survey on cryptography implementation faults is provided in [65], while Anderson [66] focuses on the causes of cryptographic system failures in banking applications. A common misunderstanding concerns the use of random numbers. Pure Linear Feedback Shift Registers (LFSRs) and other pseudorandom number generators produce random-looking sequences that may be sufficient for scientific experiments but can be disastrous for cryptographic algorithms, which require unpredictable random input. On the other hand, the cryptographic community has focused on proving the theoretical security of various cryptographic algorithms and has paid little attention to actual implementations on specific hardware platforms. In fact, many algorithms are designed with portability in mind, and efficient implementation on a specific platform meeting specific requirements can be quite tricky. This communication gap between vendors and cryptographers intensifies in the case of embedded systems, which can have many design choices and constraints that are not easily comprehensible. In the late 1990s, Side-Channel Attacks (SCAs) were introduced. SCAs are a method of cryptanalysis that focuses on the implementation characteristics of a cryptographic algorithm in order to derive its secret keys. This advancement bridged the gap between embedded systems, a common target of such attacks, and cryptographers. Vendors became aware of and concerned by this new form of attack, while cryptographers focused on the specifics of implementations in order to advance their cryptanalysis techniques. In this section, we present side-channel cryptanalysis. First, we introduce the concept of tamper resistance, the implementation of side channels, and information leakage through them from otherwise secure devices; then, we demonstrate how this information can be exploited to recover the secret keys of a cryptographic algorithm, presenting case studies of attacks on the RSA algorithm.
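To preview why implementations leak, consider square-and-multiply modular exponentiation, the core operation of RSA. The following toy C sketch (illustrative only, with parameters far smaller than the multiprecision arithmetic of real RSA, and not drawn from the attacked implementations in the literature) shows that each exponent bit costs one squaring, while only the set bits cost an extra multiplication; the running time therefore depends on the secret exponent, which is precisely the signal that timing attacks exploit:

    #include <stdint.h>
    #include <stdio.h>

    /* Left-to-right square-and-multiply: computes (base^exp) mod m.
       The data-dependent branch on each key bit is the timing leak. */
    static uint64_t modexp(uint64_t base, uint64_t exp, uint64_t m) {
        uint64_t result = 1 % m;
        for (int i = 63; i >= 0; i--) {
            result = (result * result) % m;   /* always: square      */
            if ((exp >> i) & 1)
                result = (result * base) % m; /* only when bit is 1  */
        }
        return result;
    }

    int main(void) {
        /* Toy parameters; real RSA uses 1024+ bit operands, where the
           per-bit timing differences are far more pronounced. */
        printf("%llu\n", (unsigned long long)modexp(5, 0xA7, 221));
        return 0;
    }

Countermeasures, discussed later in this chapter, typically remove this data dependence, for example, by performing the multiplication regardless of the bit value.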
17.5.1 Physical Security

Secrecy is always a desirable property. In the case of cryptographic algorithms, the secret keys must be stored, accessed, used, and destroyed in a secure manner, in order to provide the required security functions. This requirement is often overlooked, and design or implementation flaws result in insecure cryptographic implementations.

It is well known that general-purpose computing systems and operating systems cannot provide sufficient protection mechanisms for cryptographic keys. For example, SSL certificates for web servers are stored unprotected on the servers' disks and rely on file system permissions for protection. This is necessary because web servers must offer secure services unattended; the alternative, a human providing the password to access the certificate for each connection, would not be efficient in the era of e-commerce, where thousands of transactions are made every day. On the other hand, any software bug in the operating system, in a high-privileged application, or in the web server software itself may expose this certificate to malicious users.

Embedded systems are commonly used for implementing security functions. Since they are complete systems, they can perform the necessary cryptographic operations in a sealed and controlled environment [67–69]. Tamper resistance refers to the ability of a system to resist tampering attacks, that is, attempts to bypass its attack-prevention mechanisms.
The IBM PCI Cryptographic Coprocessor [70] is such a system, having achieved FIPS 140-2 Level 4 certification [33]. The advance of DRM technology into consumer devices and general-purpose computers drives the use of embedded systems for cryptographic protection of IP. Smartcards are a well-known example of tamper-resistant embedded systems; they are used for financial transactions and subscription-based service provision.

In many cases, embedded systems used for security-critical operations do not implement any tamper-resistance mechanisms. Rather, a thin layer of obscurity is preferred, for both simplicity and performance reasons. However, as users become more interested in bypassing the security mechanisms of the system, the thin layer of obscurity is easily broken and the cryptographic keys are publicly exposed. The Adobe eBook software encryption [71], the Microsoft XBox case [72], the USB hardware token devices [73], and the DVD CSS copy protection scheme [74] are examples of systems that implemented security by obscurity and were easily broken.

Finally, an often neglected issue is lifecycle-wide management of cryptographic systems. While a device may be withdrawn from operation, the data it has stored or processed over time may still need to be protected. Key security that relies on the fact that only authorized personnel have access to the system may not be sufficient once the device is recycled. Garfinkel and Shelat [75], Skorobogatov [76], and Gutmann [77] present methods for recovering data from devices using noninvasive techniques.
17.5.2 Side-Channel Cryptanalysis

Until the mid-1990s, academic research on cryptography focused on the mathematical properties of cryptographic algorithms. Paul Kocher was the first to present cryptanalysis attacks that were based on the implementation properties of a system. Kocher observed that an implementation of the RSA algorithm required varying amounts of time to encrypt a block of data, depending on the secret key used. Careful analysis of the timing differences allowed him to derive the secret key, and he extended this method to other algorithms as well [78]. This result came as a surprise, since the RSA algorithm had withstood years of mathematical cryptanalysis and was considered secure [79]. A short time later, Boneh et al. presented theoretical attacks deriving the secret keys from implementations of the RSA algorithm and the Fiat-Shamir and Schnorr identification schemes [80], revised in Reference 81, while similar results were presented by Bao et al. [82].

These findings revealed a new class of attacks on cryptographic algorithms. The term side-channel attacks (SCAs), which first appeared in Reference 83, has been widely used to refer to this type of cryptanalysis, while the terms fault-based cryptanalysis, implementation cryptanalysis, active/passive hardware attacks, leakage attacks, and others have also been used. Cryptographic algorithms acquired a new security dimension: that of their exact implementation. Cryptographers had previously focused on understanding the underlying mathematical problems, in order to prove or conjecture the security of a cryptographic algorithm in terms of abstract mathematical symbols. Now, in spite of the hardness of the underlying mathematical problems, an implementation may be vulnerable and allow the extraction of secret keys or other sensitive material.

Implementation vulnerabilities are, of course, not a new security concept. In the previous section, we presented some impressive attacks on security that were based on implementation faults. The new insight of SCA is that even cryptographic algorithms that are otherwise considered secure can be vulnerable to such faults. This observation is of significant importance, since cryptography is widely used as a major building block for security; if the cryptographic algorithms can be rendered insecure, the whole construction collapses.

Embedded systems, and especially smartcards, are a popular target for SCA. To understand this, recall that such systems are usually owned by a service provider, such as a mobile phone operator, a TV broadcaster, or a bank, and possessed by service clients. The service provider relies on the security of the embedded system in order to prove service usage by the clients, such as phone calls, movie viewing, or purchases, and to charge the client accordingly. On the other hand, consumers have an incentive to bypass these mechanisms in order to enjoy free services. Given that SCAs are implementation specific and rely, as we will present later,
on the ability to interfere, passively or actively, with the device implementing a cryptographic algorithm, embedded systems are a further attractive target: their resource limitations make the attack efforts easier.

In the following, we present the classes of SCA and the countermeasures that have been developed. The technical field remains highly active, since ingenious channels continuously appear in the literature. Embedded system vendors must study the attacks carefully, evaluate the associated risks for their environment, and ensure that appropriate countermeasures are implemented in their systems; furthermore, they must be prepared to adapt promptly to new techniques for deriving secrets from their systems.
17.5.3 Side-Channel Implementations

A side channel is any physical channel that can carry information from the operation of a device implementing a cryptographic operation; such channels are not captured by the existing abstract mathematical models. The definition is quite broad, and the inventiveness of attackers is noticeable. Timing differences, power consumption, electromagnetic emissions, acoustic noise, and faults have all been exploited to leak information out of cryptographic systems. The channel realizations can be categorized in three broad classes: physical or probing attacks, fault-induction or glitch attacks, and emission attacks, such as TEMPEST. We briefly review the first two classes; readers interested in TEMPEST attacks are referred to Reference 84.

The side channels may seem unavoidable and a frightening threat. However, it should be strongly emphasized that in most cases, reported attacks, both theoretical and practical, rely for their success on detailed knowledge of the platform under attack and of the specific implementation of the cryptographic algorithm. For example, power analysis is successful in most cases because cryptographic algorithms tend to use only a small subset of a processor's instruction set, and especially simple instructions, such as LOAD, STORE, XOR, AND, and SHIFT, in order to achieve elegant, portable, and high-performance implementations. This allows an attacker to minimize the power profiles he or she must construct and simplifies the distinction of the different instructions being executed.

17.5.3.1 Fault-Induction Techniques

Devices are always susceptible to erroneous computations or other kinds of faults, for several reasons. Faulty computations are a known issue in space systems because, in deep space, devices are exposed to radiation that can cause temporary or permanent bit flips, gate destruction, or other problems. Incomplete testing during manufacturing may allow imperfect designs to reach the market, as in the case of the Intel Pentium FDIV bug [85], and devices may be operated in conditions outside their specifications [86]. Careful manipulation of the power supply or the clock oscillator can also cause glitches in code execution by tricking the processor, for example, into executing unknown instructions or bypassing a control statement [87].

Some researchers have questioned the feasibility of fault-injection attacks on real systems [88]. While fault injection may seem to require expensive and specialized equipment, there have been reports that it can be achieved with low-cost and readily available equipment. Anderson and Kuhn [89] and Anderson [66] present low-cost attacks on tamper-resistant devices, which achieve extraction of secret information from smartcards and similar devices. Kömmerling and Kuhn [87] present noninvasive fault-injection techniques, for example, by manipulating the power supply. Anderson [90] supports the view that the underground community has been using such techniques for quite a long time to break the security of smartcards of pay-TV systems. Furthermore, Weingart [6] and Aumüller et al. [91] present attacks performed in a controlled lab environment, proving that fault-injection attacks are feasible. Skorobogatov and Anderson [140] introduce low-cost light flashes, such as a camera flash, as a means to introduce errors, while eddy-current attacks are introduced in Reference 92.
A comprehensive presentation of fault-injection methods is given in Reference 93, along with experimental evidence of the applicability of the methods to industrial systems and anecdotal information.
The combined time–space isolation problem [94] is of significant importance in fault-induction attacks. The space isolation problem refers to isolating the appropriate space (area) of the chip in which to introduce the fault. It has four parameters:

1. Macroscopic. The part of the chip where the fault can be injected. Possible answers can be one or more of the following: main memory, address bus, system bus, register file.

2. Bandwidth. The number of bits that can be affected. It may be possible to change just one bit or multiple bits at once. The exact number of changed bits can be controllable (e.g., one) or follow a random distribution.

3. Granularity. The area where the error can occur. The attacker may control the fault-injection position at bit level or only over a wider area, such as a byte or a multibyte region. The fault-injected area can be covered by a single error or by multiple errors, and the errors may concentrate around the targeted point or be evenly distributed over the area.

4. Lifetime. The time duration of the fault. It may be a transient or a permanent fault. For example, a power glitch may cause a transient fault at a memory location, since the next time the location is written, the new value will be stored correctly. In contrast, a cell or gate destruction results in a permanent error, since the output bit will be stuck at 0 or 1, independent of the input.

The time isolation problem refers to the time at which a fault is injected. An attacker may be able to synchronize exactly with the clock of the chip or may introduce the error in a random fashion. This granularity is the only parameter of the time isolation problem. Clearly, the ability to inject a fault at clock-period granularity is desirable for the attacker, but impractical in real-world applications.

17.5.3.2 Passive Side Channels

Passive side channels are not a new concept in cryptography and security. The information available from the now partially declassified TEMPEST project reveals helpful insights into how electromagnetic emissions occur and how they can be used to reconstruct signals for surveillance purposes. A good review of the subject is provided in Chapter 15 of Reference 90. Kuhn [84,95,96] presents innovative use of electromagnetic emissions to reconstruct information from CRT and LCD displays, while Loughry and Umphress [97] reconstruct information flowing through network devices using the emissions of their LEDs.

The new concept in this area is the fact that such emissions can also be used to derive secret information from an otherwise secure device. Probably the first such attack took place in 1956 [98]. MI5, the British intelligence service, used a microphone to capture the sound of the rotor clicks of a Hagelin cipher machine, in order to deduce the core position of some of its rotors. This reduced the problem of calculating the initial setup of the machine to within the range of their then-available resources and allowed them to eavesdrop on the encrypted communications for quite a long time. While this so-called acoustic cryptanalysis may seem outdated, researchers recently provided a fresh look at the topic, by monitoring low-frequency (kHz) sounds and correlating them with the operations performed by a high-frequency (GHz) processor [99].

Researchers have been quite creative and have used many types of emissions or other physical interactions of the device with the environment in which it operates.
Kocher [78] introduces the idea of monitoring the execution time of a cryptographic algorithm in order to identify the secret keys used. The key concept in this approach is that an implementation of an algorithm may contain branches and other conditional execution, or may follow different execution paths. If these variations depend on the bit values of a secret key, then a statistical analysis can reveal the secret key bit-by-bit. Coron et al. [100] explain the power dissipation sources and causes, while Kocher et al. [101] present how power consumption can also be correlated with key bits. Rao and Rohatgi [102] and Quisquater and Samyde [86] introduce electromagnetic analysis. Probing attacks can also be applied to reveal the Hamming weight of data transferred across a bus or stored in memory; this approach is also heavily dependent on the exact hardware platform [103,104].

While passive side channels are usually considered in the context of embedded systems and other resource-limited environments, complex computing systems may also exhibit them. Page [105] explores the theoretical use of timing variations due to the processor cache in order to extract secret keys.
Song et al. [106] take advantage of a timing channel in the secure communication protocol SSH to recover user passwords, while Felten and Schneider [107] present timing attacks on web privacy. A malicious web server can inject client-side code that fetches some specific pages transparently on behalf of the user; the server would like to know whether the user has visited these pages before. The time difference between fetching a web page from the remote server and accessing it from the user's cache is sufficient to identify whether the user has visited the page before. A more impressive result, directly related to cryptography, is presented in Reference 108, where remote timing attacks on web servers implementing the SSL protocol are shown to be practical: a malicious user can extract the private key of the server's certificate by measuring the server's response times.
17.5.4 Fault-Based Cryptanalysis

The first theoretical active attacks are presented in References 80 and 82. The attacks in the former paper focused on RSA, when implemented with the Chinese Remainder Theorem (CRT) and the Montgomery multiplication method, and on the Fiat-Shamir and Schnorr identification schemes. The latter work focuses on cryptosystems whose security is based on the Discrete Logarithm Problem and presents attacks on the ElGamal signature scheme, the Schnorr signature scheme, and DSA. The attack on the Schnorr signature scheme is extended, with some modification, to the identification scheme as well. Furthermore, the second paper independently reports an attack on RSA with Montgomery multiplication.

Since then, this area has been quite active, both in developing attacks based on fault induction and in developing countermeasures. The attacks have succeeded against most of the popular and widely used algorithms. In the following, we give a brief review of the literature.

The attacks on RSA with Montgomery multiplication have been extended by attacking the signing key instead of the message [109]; similar attacks are presented for the LUC and KMOV (based on elliptic curves) cryptosystems. In Reference 110, the attacks are generalized to any RSA-type cryptosystem, with the LUC and Demytko cryptosystems as examples. Faults can be used to expose the private key of the RSA–KEM scheme [111], and transient faults can be used to derive RSA and DSA secret keys from applications compatible with the OpenPGP format [112]. The Bellcore attack on the Fiat-Shamir scheme is shown to be incomplete in Reference 94, where the Precautious Fiat-Shamir scheme, which defends against it, is introduced. A new attack that succeeds against both the classical and the Precautious Fiat-Shamir schemes is presented in Reference 113.

Beginning with Biham and Shamir [114], fault-based cryptanalysis also focused on symmetric key cryptosystems. DES is shown to be vulnerable to the so-called Differential Fault Analysis (DFA), using only 50 to 200 faulty ciphertexts. The method is also extended to unknown cryptosystems, and an example of an attack on the once classified algorithm SkipJack is presented. Another variant of the attack on DES takes advantage of permanent instead of transient faults. The same ideas are explored and extended for completely unknown cryptosystems in Reference 115, while Jacob et al. [116] use faults to attack obfuscated ciphers in software and extract secret material while avoiding de-obfuscation of the code.

For some time it was believed that fault-induction attacks could only succeed against cryptographic schemes based on algebraic hard mathematical problems, such as number factoring and discrete logarithm computation. Elliptic Curve Cryptosystems (ECCs) are a preferable choice for implementing cryptography, since they offer security equivalent to that of algebraic public key algorithms while requiring only about a tenth of the key bits. Biehl et al. [117] extend DFA to ECC and, especially, to schemes whose security is based on the discrete logarithm problem over elliptic curve fields. Furthermore, Zheng and Matsumoto [118] use transient and permanent faults to attack random number generators, a crucial building block for cryptographic protocols, and the ElGamal signature scheme. Rijndael [119] was selected as the AES algorithm [120], the replacement of DES.
The case of the AES algorithm is quite interesting, considering that it was submitted after the introduction of SCA; thus, its authors had taken the appropriate countermeasures to ensure that the algorithm resisted all known cryptanalysis techniques applicable to their design. The original proposal [119] even noted timing attacks and how they could be prevented. Koeune and Quisquater [121] describe how a careless implementation
of the AES algorithm can be exploited by a timing attack to derive the secret key used. The experiments carried out show that the key can be derived with 3000 samples per key byte, at minimal cost and with high probability. The algorithm's proposal is aware of this issue and immune to such a simple attack. However, DFA proved successful against AES. Although DFA was designed for attacking algorithms with a Feistel structure, such as DES, Dusart et al. [122] show that it can be applied to AES, which does not have such a structure. Four different fault-injection models are presented, and the attacks succeed for all key sizes (128, 192, and 256 bits). Their experiments show that, with ten pairs of faulty/correct messages in hand, a 128-bit AES key can be extracted in a few minutes. Blömer et al. [123] present additional fault-based attacks on AES, assuming several kinds of fault models. The strictest model, requiring exact synchronization in space and time for the error injection, succeeds in deriving a 128-bit secret key after collecting 128 faulty ciphertexts, while the least strict model derives the 128-bit key after collecting 256 faulty ciphertexts.

17.5.4.1 Case Study: RSA–Chinese Remainder Theorem

The RSA cryptosystem remains a viable and preferable public key cryptosystem, having withstood years of cryptanalysis [79]. The security of the RSA public key algorithm relies on the hardness of factoring large numbers into prime factors. The elements of the algorithm are N = pq, the product of two large prime numbers, the exponents e and d (the public and secret exponents, respectively), and the modular exponentiation operation m^k mod N. To sign a message m, the sender computes s = m^d mod N, using his or her secret key. The receiver computes m = s^e mod N to verify the signature of the received message. The modular exponentiation operation is computationally intensive for large primes, and it is the major computational bottleneck in an RSA implementation. The CRT allows fast modular exponentiation. Using RSA with CRT, the sender computes s1 = m^d mod p and s2 = m^d mod q and combines the two results, based on the CRT, computing S = (a*s1 + b*s2) mod N for some predefined values a and b. The CRT method is quite popular, especially for embedded systems, since it allows about four times faster execution and smaller memory storage for intermediate results (observe that p and q typically have half the size of N).

The Bellcore attack [80,81], as it is commonly referenced, is quite simple and powerful against RSA with CRT. It suffices to have one correct signature S for a message m and one faulty signature S', caused by an incorrect computation of one of the two intermediate results s1 and s2. It does not matter whether the error occurred in the first or the second intermediate result, or how many bits were affected by the error. Assuming that an error indeed occurred, it suffices to compute gcd(S - S', N), which will equal q if the error occurred in the computation of s1, and p if it occurred in s2. This factors N and, thus, the security of the algorithm is broken. Lenstra [124] improves this attack by requiring only a known message and a faulty signature, instead of two signatures. In this case, it suffices to compute gcd(M - (S')^e, N) to reveal one of the two prime factors.
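A minimal, self-contained C sketch of the Bellcore attack on toy parameters follows (our own illustration: the primes, exponent, and helper names are hypothetical, and real RSA requires multiprecision arithmetic). A one-bit fault is injected into s1, and the gcd of the signature difference with N recovers the factor q:

    #include <stdio.h>
    #include <stdint.h>

    /* Square-and-multiply modular exponentiation (toy sizes only). */
    uint64_t modexp(uint64_t base, uint64_t exp, uint64_t mod) {
        uint64_t r = 1;
        base %= mod;
        while (exp > 0) {
            if (exp & 1)
                r = (r * base) % mod;
            base = (base * base) % mod;
            exp >>= 1;
        }
        return r;
    }

    uint64_t gcd(uint64_t a, uint64_t b) {
        while (b) { uint64_t t = a % b; a = b; b = t; }
        return a;
    }

    int main(void) {
        const uint64_t p = 61, q = 53, N = p * q;  /* N = 3233 */
        const uint64_t d = 2753, m = 65;           /* secret exponent, message */

        /* CRT signing: s1 = m^d mod p and s2 = m^d mod q, recombined
         * with a = q*(q^-1 mod p) and b = p*(p^-1 mod q); the inverses
         * are computed via Fermat's little theorem. */
        uint64_t s1 = modexp(m, d, p), s2 = modexp(m, d, q);
        uint64_t a = q * modexp(q, p - 2, p);
        uint64_t b = p * modexp(p, q - 2, q);
        uint64_t S = (a * s1 + b * s2) % N;        /* correct signature */

        /* One-bit transient fault in s1; s2 is left untouched. */
        uint64_t S_bad = (a * (s1 ^ 1) + b * s2) % N;

        /* S and S_bad agree mod q but not mod p, so the gcd is q. */
        uint64_t diff = S > S_bad ? S - S_bad : S_bad - S;
        printf("gcd(S - S', N) = %llu (q = %llu)\n",
               (unsigned long long)gcd(diff, N), (unsigned long long)q);
        return 0;
    }

Note that the attacker needs to know neither the position nor the value of the fault; the agreement of the two signatures modulo q alone factors N.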
Boneh et al. [80] propose double computations as a means to detect such erroneous computations. However, this is not always efficient, especially in resource-limited environments or where performance is an important issue; the approach is also of no help in case a permanent error has occurred. Kaliski and Robshaw [125] propose signature verification, by checking the equality S^e mod N = M. Since the public exponent may be quite large, this check can be rather time consuming for a resource-limited system. Shamir [126] describes a software method for protecting RSA with CRT from fault and timing attacks. The idea is to use a random integer t and perform a "blinded" CRT by computing Spt = m^d mod (p*t) and Sqt = m^d mod (q*t). If the equality Spt = Sqt mod t holds, then the initial computation is considered error-free and the result of the CRT can be released from the device. Yen et al. [127] further improve this countermeasure for efficient implementation without performance penalties, but Blömer et al. [128] show that this improvement in fact renders RSA with CRT totally insecure. Aumüller et al. [91] provide another software countermeasure against faulty RSA–CRT computations. However, Yen et al. [129], using a weak fault model, show that both these countermeasures [91,126] are still vulnerable if the attacker focuses on the modular reduction operation s_p = s_pt mod p that they employ. The attacks are valid for both transient and permanent errors, and, again, appropriate countermeasures are proposed.
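A rough sketch of Shamir's check on the same toy parameters is shown below (again our own illustration, reusing modexp() from the previous sketch; real code draws a fresh random t for every signature and uses multiprecision integers):

    #include <stdint.h>

    uint64_t modexp(uint64_t base, uint64_t exp, uint64_t mod); /* as above */

    /* Returns nonzero when the two blinded CRT halves agree mod t,
     * that is, when no fault was detected and the result may be
     * released from the device. */
    int blinded_crt_check(uint64_t m, uint64_t d,
                          uint64_t p, uint64_t q, uint64_t t) {
        uint64_t spt = modexp(m, d, p * t);     /* Spt = m^d mod (p*t) */
        uint64_t sqt = modexp(m, d, q * t);     /* Sqt = m^d mod (q*t) */
        /* Both halves equal m^d mod t when computed correctly, so a
         * fault in either exponentiation breaks the equality with
         * high probability. */
        return (spt % t) == (sqt % t);
    }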
As we show next, the implementation of error-checking functions using the final or intermediate results of RSA computations can create an additional side channel (a meta-channel), even though faulty computations never leave the sealed device. Assume that an attacker knows that a bit in a register holding part of the key was invasively set to zero during the computation, and that the device checks the correctness of the output by double computation. If the device outputs a signed message, then no error was detected and, thus, the respective bit of the key is zero. If the device does not output a signed message, or outputs an error message, then the respective bit of the key is one. Such a safe-error attack is presented in Reference 130, focusing on RSA when implemented with Montgomery multiplication. Yen et al. [131] extend the idea of safe-error attacks from memory faults to computational faults and present such an attack on RSA with Montgomery multiplication, which can also be applied to scalar multiplication on elliptic curves.

An even simpler approach is to attack both an intermediate computation and the condition check. A condition check can be a single point of failure, and an attacker can easily mount an attack against it, provided that he or she has the means to introduce errors in computations [128]. Indeed, in most cases, a condition check is implemented as a bit comparison with a zero flag. Blömer et al. [128] extend the idea of checking vulnerable points of the computation by exhaustively testing every computation performed for an RSA–CRT signing, including the CRT combination. The proposed solution seems the most promising at the moment, allowing only attacks by powerful adversaries that can solve the time–space isolation problem precisely. However, it should already be clear that advances in this area of cryptanalysis are continuous, and implementers should always be prepared to adapt to new attacks.
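The following toy simulation (our own illustration, reusing modexp() and the toy key from the earlier sketches) makes the safe-error meta-channel explicit: the simulated device detects faults by double computation and suppresses faulty outputs, and that suppression alone leaks one key bit per faulted run.

    #include <stdio.h>
    #include <stdint.h>

    uint64_t modexp(uint64_t base, uint64_t exp, uint64_t mod); /* as above */

    int main(void) {
        const uint64_t N = 61 * 53, m = 65, d = 2753;   /* toy key */
        const int nbits = 12;                           /* 2753 < 2^12 */
        uint64_t recovered = 0;

        for (int i = 0; i < nbits; i++) {
            /* The attacker forces bit i of the key register to zero. */
            uint64_t d_forced = d & ~(1ULL << i);
            uint64_t out   = modexp(m, d_forced, N);    /* faulted run */
            uint64_t check = modexp(m, d, N);           /* check run   */
            /* A mismatch makes the device suppress its output; the
             * attacker records the silence as "bit i was one." */
            if (out != check)
                recovered |= 1ULL << i;
        }
        printf("recovered d = %llu\n", (unsigned long long)recovered);
        return 0;
    }

In this idealized model, a single faulted run per key bit suffices, and no faulty signature ever leaves the device.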
17.5.5 Passive Side-Channel Cryptanalysis

Passive side-channel cryptanalysis has received a lot of attention since its introduction in 1996 by Paul Kocher [78]. Passive attacks are considered harder to defend against and cause great concern, due to their noninvasive nature. Fault-induction attacks require some form of manipulation of the device; thus, sensors or other similar means can be used to detect such actions and shut down, or even zero out, the device. In a passive attack, the physical characteristics of the device are merely monitored, usually with readily available probes and other hardware. It is therefore not an easy task to detect the presence of a malicious user, especially when only a few measurements are required or when abnormal operation (such as continuous requests for encryptions/decryptions) cannot be identified.

The first results are by Kocher [78]: timing variations in the execution of cryptographic algorithms such as Diffie–Hellman key exchange, RSA, and DSS are used to derive their secret keys bit-by-bit. Although mentioned before, we should emphasize that timing attacks and other forms of passive SCA require knowledge of the exact implementation of the cryptographic algorithm under attack. Dhem et al. [132] describe a timing attack against the RSA signature algorithm, which derives a 512-bit secret key with 200,000 to 300,000 timing measurements. Schindler et al. [133] improve the timing attacks on RSA modular exponentiation by a factor of 50, allowing extraction of a 512-bit key using as few as 5,000 timing measurements. The approach uses an error-correction (estimator) function, which can detect erroneous bit decisions as the key extraction process evolves. Hevia and Kiwi [134] introduce a timing attack against DES that reveals the Hamming weight of the key, by exploiting the fact that a conditional bit "wrap around" function results in variable execution time of the software implementing the algorithm. They succeed in recovering the Hamming weight of the key, an amount of information equivalent to 3.95 key bits (out of a 56-bit key). The most threatening issue is that keys with low or high Hamming weight are sparse; so, if the attack reveals that the key has such a weight, the key space that must be searched is reduced dramatically. The RC5 algorithm has also been subjected to timing attacks, due to conditional statement execution in its code [135].

Kocher et al. [101] extend the attackers' arsenal further by demonstrating the vulnerability of DES to power analysis attacks, and more specifically to Differential Power Analysis (DPA), a technique that combines differential cryptanalysis and careful engineering, and to Simple Power Analysis (SPA). SPA refers to power analysis attacks that can be performed by monitoring only a single or a few power traces, probably with the same encryption key. SPA succeeds in revealing the operations performed by the device, such as permutations, comparisons, and multiplications. Practically, any algorithm implementation that executes
some statements conditionally, based on data or key material, is at least susceptible to power analysis attacks. This holds for public key, secret key, and elliptic curve cryptosystems. DPA has been successfully applied at least to block ciphers, such as IDEA, RC5, and DES [83].

Electromagnetic Attacks (EMAs) have contributed some impressive results on what information can be reconstructed. Gandolfi et al. [136] report results from cryptanalysis of real-world cryptosystems, such as DES and RSA. Furthermore, they demonstrate that electromagnetic emissions may be preferable to power analysis, in the sense that fewer traces are needed to mount an attack and these traces carry richer information for deriving the secret keys. However, the full power of EMA attacks has not been utilized yet, and we should expect more results on real-world cryptanalysis of popular algorithms.

17.5.5.1 Case Study: RSA–Montgomery

Previously, we explained the importance of a fast modular exponentiation primitive for the RSA cryptosystem. Montgomery multiplication is a fast implementation of this primitive function [137]. The left-to-right repeated square-and-multiply method is depicted in Figure 17.1, in C pseudocode.

    INPUT:  M, N, d = (d_{n-1} d_{n-2} ... d_1 d_0)_2
    OUTPUT: S = M^d mod N

    S = 1;
    for (i = n-1; i >= 0; i--) {
        S = S^2 mod N;
        if (d_i == 1) {
            S = S * M mod N;
        }
    }
    return S;

FIGURE 17.1 Left-to-right repeated square-and-multiply algorithm.

The timing attack of Kocher [78] exploits the timing variation caused by the condition statement on the fourth line. If the respective bit of the secret exponent is "1," then a square (line 3) and a multiply (line 5) operation are executed, while if the bit is "0," only a square operation is performed. In summary, the exact time of executing the loop n times depends directly on the exact values of the bits of the secret exponent.

An attacker proceeds as follows. Assume that the first m bits of the secret exponent are known. The attacker has a device identical to the one containing the secret exponent and can control the key used for each encryption. The attacker collects from the attacked device the total execution times T_1, T_2, ..., T_k of signature operations on some known messages M_1, M_2, ..., M_k. He also performs the same operations on the controlled device, so as to collect another set of measurements t_1, t_2, ..., t_k, where he fixes the first m bits of the key, targeting the (m+1)st bit. Kocher's key observation is that, if the unknown bit d_{m+1} = 1, then the two sets of measurements are correlated, while if d_{m+1} = 0, the two sets behave like independent random variables. This differentiation allows the attacker to extract the secret exponent bit-by-bit.

Depending on the implementation, a simpler form of the attack may be possible. SPA does not require lengthy statistical computations but rather relies on power traces of execution profiles of a cryptographic algorithm. For this example, Schindler et al. [133] explain how the power profiles can be used. Execution of line 5 in the above code requires an additional multiplication. Even if the spikes in power consumption of the squaring and multiplication operations are indistinguishable, the multiplication requires additional load operations; thus, the power spikes will be wider than in the case where only squaring is performed.
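Returning to the statistical variant of the attack, the decision between the two hypotheses for bit d_{m+1} reduces to a simple correlation test. The sketch below (our own illustration; the timing values are synthetic placeholders) computes the Pearson correlation between the two sets of measurements:

    #include <math.h>
    #include <stdio.h>
    #include <stddef.h>

    /* Pearson correlation between two sets of timing measurements. */
    static double corr(const double *x, const double *y, size_t k) {
        double mx = 0, my = 0, cxy = 0, vx = 0, vy = 0;
        for (size_t i = 0; i < k; i++) { mx += x[i]; my += y[i]; }
        mx /= k; my /= k;
        for (size_t i = 0; i < k; i++) {
            cxy += (x[i] - mx) * (y[i] - my);
            vx  += (x[i] - mx) * (x[i] - mx);
            vy  += (y[i] - my) * (y[i] - my);
        }
        return cxy / sqrt(vx * vy);
    }

    int main(void) {
        /* T: total signing times from the attacked device;
         * t: times from the attacker's device with the known m-bit
         * prefix and a guess of 1 for bit d_{m+1} (synthetic data). */
        double T[] = { 103.1, 98.7, 110.2, 95.4, 107.9 };
        double t[] = {  51.0, 48.9,  55.1, 47.2,  53.8 };
        /* A clearly positive correlation supports d_{m+1} = 1;
         * a near-zero value supports d_{m+1} = 0. */
        printf("correlation = %.3f\n", corr(T, t, 5));
        return 0;
    }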
17.5.6 Countermeasures

In the previous sections, we provided a review of SCA, both fault-based and passive. In this section, we review the countermeasures that have been proposed. The list is by no means exhaustive, since new results appear continuously and countermeasures are steadily improving.
The proposed countermeasures can be classified into two main classes: hardware protection mechanisms and mathematical protection mechanisms.

A first layer of protection against SCA consists of hardware protection mechanisms, such as passivation layers, which prevent direct access between a (malicious) user and the system implementing the cryptographic algorithm, or memory address bus obfuscation. Various sensors can also be embedded in the device, in order to detect and react to abnormal environmental conditions, such as extreme temperatures and power or clock variations. Such mechanisms are widely employed in smartcards for financial transactions and other high-risk applications. These protection layers can be effective against fault-injection attacks, since they shield the device against external manipulation. However, they cannot protect the device from attacks based on external observation, such as power analysis techniques.

The previous countermeasures do not alter the current designs of the circuits, but rather add protection layers on top of them. A second approach is the design of a new generation of chips to implement cryptographic algorithms and to process sensitive information. Such circuits have asynchronous/self-clocking/dual-rail logic; each part of the circuit may be clocked independently [138]. Fault attacks that rely on external clock manipulation (such as glitch attacks) are not feasible in this case. Furthermore, timing or power analysis attacks become harder for the attacker, since there is no global clock that correlates the input data with the emitted power. Such countermeasures have the potential to become common practice. Their application, however, must be carefully evaluated, since they may occupy a large area of the circuit; manufacturers usually justify such expansions in order to increase the system's available memory, not to implement another security feature. Furthermore, such mechanisms require changes in the production line, which is not always feasible.

A third approach aims to implement the cryptographic algorithms so that no key information leaks. Proposed techniques include modifying the algorithm to run in constant time, adding random delays in the execution of the algorithm, randomizing the exact sequence of operations without affecting the final result, and adding dummy operations in the execution of the algorithm. These countermeasures can defeat timing attacks, but careful design must be employed to defeat power analysis attacks as well. For example, dummy operations or random delays are easily distinguishable in a power trace, since they tend to consume less power than ordinary cryptographic operations. Furthermore, differences in power traces between profiles of known operations can also reveal a permutation of operations: a modular multiplication is known to consume more power than a simple addition, so if their execution order is interchanged, they will still be identifiable. In more resource-rich systems, where high-level programming languages are used, compiler or human optimizations can remove these artifacts from the program, or change the implementation in a way that results in vulnerability to SCA. The same holds if memory caches are used and the algorithm is implemented so that the latency between cache and main memory can be detected, either in timing or in power traces.
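As an illustration of the constant-time and dummy-operation ideas, the following sketch (our own, on toy operand sizes) rewrites the square-and-multiply loop of Figure 17.1 so that every iteration performs both a square and a multiply, replacing the key-dependent branch with a branch-free select. A real implementation needs multiprecision arithmetic and must still consider operand-dependent power leakage.

    #include <stdint.h>

    /* Square-and-always-multiply with a constant-time select. */
    uint64_t modexp_ct(uint64_t M, uint64_t d, int nbits, uint64_t N) {
        uint64_t S = 1;
        for (int i = nbits - 1; i >= 0; i--) {
            S = (S * S) % N;                    /* square every iteration   */
            uint64_t tmp = (S * M) % N;         /* multiply every iteration */
            uint64_t bit = (d >> i) & 1u;
            uint64_t mask = (uint64_t)0 - bit;  /* all-ones iff bit == 1 */
            S = (tmp & mask) | (S & ~mask);     /* branch-free select    */
        }
        return S;
    }

Since the operation sequence no longer depends on the key bits, the naive loop's timing and SPA leakage disappears, at the cost of one extra multiplication per zero bit.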
Insertion of random delays or other forms of noise should also be considered carefully, because a large mean delay translates directly into reduced performance, which is not always acceptable.

The second class of countermeasures focuses on the mathematical strengthening of the algorithms against such attacks. The RSA blinding technique by Shamir [126] is such an example; the method guards the system from leaking meaningful information, because the leaked information is related to the random number used for blinding instead of to the key; thus, even if the attacker manages to reveal a number, this will be the random number and not the key. Note that a different random number is used for each signing or encryption operation; thus, the faults injected into the system are applied to a different random number every time, and the collected information is useless.

On the borderline between mathematical and implementation protection lie proposals to check cryptographic operations for correctness, in order to counter fault-injection attacks. However, these checks can themselves be exploited as side channels of information, or can degrade performance significantly. For example, performing each computation twice and comparing the results halves the throughput an implementation can achieve; furthermore, in the absence of other countermeasures, the comparison function can be bypassed (e.g., by a clock glitch or a fault injection in the comparison function) or used as a side channel as well. If multiple checks are employed, measuring the rejection time can reveal at what stage of the algorithm the error
occurred; if the checks are independent, this can be utilized to extract the secret key, even when the implementation does not output the faulty computation [111,139].
17.6 Conclusions

Security constitutes a significant requirement in modern embedded computing systems. Their widespread use in services that involve sensitive information, in conjunction with their resource limitations, has led to a significant number of innovative attacks that exploit system characteristics and result in the loss of critical information. The development of secure embedded systems is an emerging field in computer engineering, requiring skills from cryptography, communications, hardware, and software.

In this chapter, we surveyed the security requirements of embedded computing systems and described the technologies that are more critical to them than to general-purpose computing systems. Considering the innovative system (side-channel) attacks that were developed with the motivation of breaking secure embedded systems, we presented the known SCAs in detail and described technologies for countermeasures against the known attacks. Clearly, the technical area of secure embedded systems is far from mature. Innovative attacks and successful countermeasures are continuously emerging, promising an attractive and rich technical area for research and development.
References

[1] W. Wolf, Computers as Components — Principles of Embedded Computing Systems Design. Elsevier, Amsterdam, 2000.
[2] W. Freeman and E. Miller, An experimental analysis of cryptographic overhead in performance-critical systems. In Proceedings of the Seventh International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, 1999, p. 348.
[3] S. Ravi, P. Kocher, R. Lee, G. McGraw, and A. Raghunathan, Security as a new dimension in embedded system design. In Proceedings of the 41st Annual Conference on Design Automation, 2004, pp. 753–760.
[4] S. Ravi, A. Raghunathan, P. Kocher, and S. Hattangady, Security in embedded systems: design challenges. Transactions on Embedded Computing Systems, 3, 461–491, 2004.
[5] D.G. Abraham, G.M. Dolan, G.P. Double, and J.V. Stevens, Transaction security system. IBM Systems Journal, 30, 206–229, 1991.
[6] S.H. Weingart, Physical security devices for computer subsystems: a survey of attacks and defenses. In Cryptographic Hardware and Embedded Systems — CHES 2000: Second International Workshop, 2000, p. 302.
[7] R. Anderson and M. Kuhn, Tamper resistance — a cautionary note. In Proceedings of the Second Usenix Workshop on Electronic Commerce, 1996, pp. 1–11.
[8] S. Blythe, B. Fraboni, S. Lall, H. Ahmed, and U. de Riu, Layout reconstruction of complex silicon chips. IEEE Journal of Solid-State Circuits, 28, 138–145, 1993.
[9] C.E. Landwehr, A.R. Bull, J.P. McDermott, and W.S. Choi, A taxonomy of computer program security flaws. ACM Computing Surveys, 26, 211–254, 1994.
[10] G. Hoglund and G. McGraw, Exploiting Software: How to Break Code. Addison-Wesley Professional, Reading, MA, 2004.
[11] J.J. Tevis and J.A. Hamilton, Methods for the prevention, detection and removal of software security vulnerabilities. In Proceedings of the 42nd Annual Southeast Regional Conference, 2004, pp. 197–202.
[12] P. Kocher, SSL 3.0 specification. http://wp.netscape.com/eng/ssl3/
[13] IETF, IPSec working group. http://www.ietf.org/html.charters/ipsec-charter.html
[14] T. Wollinger, J. Guajardo, and C. Paar, Security on FPGAs: state-of-the-art implementations and attacks. Transactions on Embedded Computing Systems, 3, 534–574, 2004.
[15] S.H. Gunther, F. Binns, D.M. Carmean, and J.C. Hall, Managing the impact of increasing microprocessor power consumption. Intel Technology Journal, Q1: 9, 2001. http://developer.intel.com/technology/itj/q12001/articles/art_4.htm
[16] I. Buchmann, Batteries in a Portable World, 2nd ed. Cadex Electronics Inc., May 2001.
[17] K. Lahiri, S. Dey, D. Panigrahi, and A. Raghunathan, Battery-driven system design: a new frontier in low power design. In Proceedings of the 2002 Conference on Asia South Pacific Design Automation/VLSI Design, 2002, p. 261.
[18] T. Martin, M. Hsiao, D. Ha, and J. Krishnaswami, Denial-of-service attacks on battery-powered mobile computers. In Proceedings of the Second IEEE International Conference on Pervasive Computing and Communications (PerCom'04), 2004, p. 309.
[19] N.R. Potlapally, S. Ravi, A. Raghunathan, and N.K. Jha, Analyzing the energy consumption of security protocols. In Proceedings of the 2003 International Symposium on Low Power Electronics and Design, 2003, pp. 30–35.
[20] D.W. Carman, P.S. Kruus, and B.J. Matt, Constraints and approaches for distributed sensor network security. NAI Labs, Technical report 00-010, 2000. Available at: http://www.cs.umbc.edu/courses/graduate/CMSC691A/Spring04/papers/nailabs_report_00010_final.pdf
[21] A. Raghunathan, S. Ravi, S. Hattangady, and J. Quisquater, Securing mobile appliances: new challenges for the system designer. In Design, Automation and Test in Europe Conference and Exhibition (DATE'03). IEEE, 2003, p. 10176.
[22] V. Raghunathan, C. Schurgers, S. Park, and M. Srivastava, Energy aware wireless microsensor networks. IEEE Signal Processing Magazine, 19, 40–50, 2002.
[23] A. Savvides, S. Park, and M.B. Srivastava, On modeling networks of wireless microsensors. In Proceedings of the 2001 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, 2001, pp. 318–319.
[24] Rockwell Scientific, Wireless integrated networks systems. http://wins.rsc.rockwell.com
[25] A. Hodjat and I. Verbauwhede, The energy cost of secrets in ad-hoc networks (short paper). http://citeseer.ist.psu.edu/hodjat02energy.html
[26] B. Schneier, Applied Cryptography: Protocols, Algorithms, and Source Code in C. John Wiley & Sons, New York, 1995.
[27] N. Daswani and D. Boneh, Experimenting with electronic commerce on the PalmPilot. In Proceedings of the Third International Conference on Financial Cryptography, 1999, pp. 1–16.
[28] A. Perrig, J. Stankovic, and D. Wagner, Security in wireless sensor networks. Communications of the ACM, 47, 53–57, 2004.
[29] S. Ravi, A. Raghunathan, and N. Potlapally, Securing wireless data: system architecture challenges. In Proceedings of the 15th International Symposium on System Synthesis, 2002, pp. 195–200.
[30] IEEE 802.11 Working Group, IEEE 802.11 wireless LAN standards. http://grouper.ieee.org/groups/802/11/
[31] 3GPP, 3G Security; Security Architecture. 3GPP Organization, TS 33.102, 30-09-2003, Rel-6, 2003.
[32] Intel Corporation, VPN and WEP, wireless 802.11b security in a corporate environment. http://www.intel.com/business/bss/infrastructure/security/vpn_wep.htm
[33] NIST, FIPS PUB 140-2: security requirements for cryptographic modules. Available at http://csrc.nist.gov/cryptval/140-2.htm
[34] J. Lach, W.H. Mangione-Smith, and M. Potkonjak, Fingerprinting digital circuits on programmable hardware. In Information Hiding: Second International Workshop, IH'98, Vol. 1525 of Lecture Notes in Computer Science, Springer-Verlag, 1998, pp. 16–31.
[35] J. Burke, J. McDonald, and T. Austin, Architectural support for fast symmetric-key cryptography. In Proceedings of the Ninth International Conference on Architectural Support for Programming Languages and Operating Systems, 2000, pp. 178–189.
[36] L. Wu, C. Weaver, and T. Austin, CryptoManiac: a fast flexible architecture for secure communication. In Proceedings of the 28th Annual International Symposium on Computer Architecture, 2001, pp. 110–119.
[37] Infineon, SLE 88 Family Products. http://www.infineon.com/
[38] ARM, ARM SecurCore Family, 2004. http://www.arm.com/products/CPUs/securcore.html
[39] S. Moore, Enhancing Security Performance Through IA-64 Architecture, 2000. Intel Corp., http://www.intel.com/cd/ids/developer/asmo-na/eng/microprocessors/itanium/index.htm
[40] N. Potlapally, S. Ravi, A. Raghunathan, and G. Lakshminarayana, Optimizing public-key encryption for wireless clients. In Proceedings of the IEEE International Conference on Communications, May 2002.
[41] MIPS Inc., SmartMIPS Architecture, 2004. http://www.mips.com/ProductCatalog/P_SmartMIPSASE/productBrief
[42] Open Mobile Alliance, http://www.wapforum.org/what/technical.htm
[43] Mobile Electronic Transactions, http://www.mobiletransaction.org/
[44] Z. Shao, C. Xue, Q. Zhuge, E.H. Sha, and B. Xiao, Security protection and checking in embedded system integration against buffer overflow attacks. In Proceedings of the International Conference on Information Technology: Coding and Computing (ITCC'04), Vol. 2, 2004, p. 409.
[45] S. Biswas, M. Simpson, and R. Barua, Memory overflow protection for embedded systems using run-time checks, reuse and compression. In Proceedings of the 2004 International Conference on Compilers, Architecture, and Synthesis for Embedded Systems, 2004, pp. 280–291.
[46] J. You, Wai-Kin Kong, D. Zhang, and King Hong Cheung, On hierarchical palmprint coding with multiple features for personal identification in large databases. IEEE Transactions on Circuits and Systems for Video Technology, 14, 234–243, 2004.
[47] K.C. Chan, Y.S. Moon, and P.S. Cheng, Fast fingerprint verification using subregions of fingerprint images. IEEE Transactions on Circuits and Systems for Video Technology, 14, 95–101, 2004.
[48] A.K. Jain, A. Ross, and S. Prabhakar, An introduction to biometric recognition. IEEE Transactions on Circuits and Systems for Video Technology, 14, 4–20, 2004.
[49] Y.S. Moon, H.C. Ho, and K.L. Ng, A secure smart card system with biometrics capability. In Proceedings of the IEEE 1999 Canadian Conference on Electrical and Computer Engineering, 1999, pp. 261–266.
[50] T.Y. Tang, Y.S. Moon, and K.C. Chan, Efficient implementation of fingerprint verification for mobile embedded systems using fixed-point arithmetic. In Proceedings of the 2004 ACM Symposium on Applied Computing, 2004, pp. 821–825.
[51] L. Benini, A. Macii, and M. Poncino, Energy-aware design of embedded memories: a survey of technologies, architectures, and optimization techniques. Transactions on Embedded Computing Systems, 2, 5–32, 2003.
[52] S. Rosenthal, Serial EEPROMs provide secure data storage for embedded systems. SLTF Consulting, http://www.sltf.com/articles/pein/pein9101.htm
[53] Actel Corporation, Design security in nonvolatile flash and antifuse FPGAs. Technical report 5172163-0/11.01, 2001.
[54] Trusted Computing Group, https://www.trustedcomputinggroup.org/home
[55] D.N. Serpanos and R.J. Lipton, Defense against man-in-the-middle attack in client-server systems with secure servers. In Proceedings of IEEE ISCC'2001, Hammammet, Tunisia, July 3–5, 2001, pp. 9–14.
[56] R.J. Lipton, S. Rajagopalan, and D.N. Serpanos, Spy: a method to secure clients for network services. In Proceedings of the 22nd International Conference on Distributed Computing Systems Workshops (Workshop ADSN'2002), Vienna, Austria, July 2–5, 2002, pp. 23–28.
[57] T.S. Messerges and E.A. Dabbish, Digital rights management in a 3G mobile phone and beyond. In Proceedings of the 2003 ACM Workshop on Digital Rights Management, 2003, pp. 27–38.
[58] D.L.C. Thekkath, M. Mitchell, P. Lincoln, D. Boneh, J. Mitchell, and M. Horowitz, Architectural support for copy and tamper resistant software. In Proceedings of the Ninth International Conference on Architectural Support for Programming Languages and Operating Systems, 2000, pp. 168–177.
[59] J.H. Saltzer and M.D. Schroeder, The protection of information in computer systems. Proceedings of the IEEE, 63, 1278–1308, 1975.
[60] T. King, Security+ Training Guide. Que Certification, 2003.
[61] Kingpin and Mudge, Security analysis of the Palm operating system and its weaknesses against malicious code threats. In Proceedings of the 10th USENIX Security Symposium, 2001, pp. 135–152.
[62] A.D. Rubin and D.E. Geer Jr., Mobile code security. IEEE Internet Computing, 2, 30–34, 1998.
[63] V.N. Venkatakrishnan, R. Peri, and R. Sekar, Empowering mobile code using expressive security policies. In Proceedings of the 2002 Workshop on New Security Paradigms, 2002, pp. 61–68.
[64] G.C. Necula, Proof-carrying code. In Proceedings of the 24th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL '97), 1997, pp. 106–119.
[65] P. Gutmann, Lessons learned in implementing and deploying crypto software. In Proceedings of the 11th USENIX Security Symposium, 2002, pp. 315–325.
[66] R.J. Anderson, Why cryptosystems fail. In Proceedings of ACM CCS'93, ACM Press, November 1993, pp. 215–217.
[67] Andrew J. Clark, Physical protection of cryptographic devices. In Proceedings of Eurocrypt '87, 1987, pp. 83–93.
[68] D. Chaum, Design concepts for tamper-responding systems. In Advances in Cryptology: Proceedings of Crypto '83, 1983, pp. 387–392.
[69] S.H. Weingart, S.R. White, W.C. Arnold, and G.P. Double, An evaluation system for the physical security of computing systems. In Proceedings of the Sixth Annual Computer Security Applications Conference, 1990, pp. 232–243.
[70] IBM Corporation, IBM PCI Cryptographic Coprocessor, September 2004. Available at http://www-3.ibm.com/security/cryptocards/html/pcicc.shtml
[71] EFF, U.S. v. ElcomSoft and Sklyarov FAQ, September 2004. Available at http://www.eff.org/IP/DMCA/US_v_Elcomsoft/us_v_sklyarov_faq.html
[72] A. Huang, Keeping secrets in hardware: the Microsoft Xbox case study. In Revised Papers from the Fourth International Workshop on Cryptographic Hardware and Embedded Systems, 2003, pp. 213–227.
[73] Kingpin, Attacks on and countermeasures for USB hardware token devices. In Proceedings of the Fifth Nordic Workshop on Secure IT Systems Encouraging Co-operation, 2000, pp. 135–151.
[74] D.S. Touretzky, Gallery of CSS Descramblers, September 2004. Available at http://www.cs.cmu.edu/~dst/DeCSS/Gallery
[75] S.L. Garfinkel and A. Shelat, Remembrance of data passed: a study of disk sanitization practices. IEEE Security and Privacy Magazine, 1, 17–27, 2003.
[76] S. Skorobogatov, Low temperature data remanence in static RAM. Technical report UCAM-CL-TR-536, University of Cambridge, 2002.
[77] P. Gutmann, Data remanence in semiconductor devices. In Proceedings of the 10th USENIX Security Symposium, 2001.
[78] P.C. Kocher, Timing attacks on implementations of Diffie-Hellman, RSA, DSS, and other systems. In Proceedings of CRYPTO '96, Lecture Notes in Computer Science, 1996, pp. 104–113.
[79] D. Boneh, Twenty years of attacks on the RSA cryptosystem. Notices of the American Mathematical Society (AMS), 46, 203–213, 1999.
[80] Dan Boneh, Richard A. DeMillo, and Richard J. Lipton, On the importance of checking cryptographic protocols for faults. In Proceedings of Eurocrypt '97, Vol. 1233 of Lecture Notes in Computer Science, 1997, pp. 37–51.
[81] Dan Boneh, Richard A. DeMillo, and Richard J. Lipton, On the importance of eliminating errors in cryptographic computations. Journal of Cryptology: The Journal of the International Association for Cryptologic Research, 14, 101–119, 2001.
[82] F. Bao, R.H. Deng, Y. Han, A.B. Jeng, A.D. Narasimhalu, and T. Ngair, Breaking public key cryptosystems on tamper resistant devices in the presence of transient faults. In Proceedings of the Fifth International Workshop on Security Protocols, 1998, pp. 115–124.
[83] John Kelsey, Bruce Schneier, David Wagner, and Chris Hall, Side channel cryptanalysis of product ciphers. In Proceedings of ESORICS 1998, 1998, pp. 97–110.
[84] Markus G. Kuhn, Compromising emanations: eavesdropping risks of computer displays. Technical report UCAM-CL-TR-577, University of Cambridge, December 2003.
[85] Intel Corporation, Analysis of the floating point flaw in the Pentium processor, November 1994. Available at http://support.intel.com/support/processors/pentium/fdiv/wp/ (September 2004).
[86] Jean-Jacques Quisquater and David Samyde, ElectroMagnetic analysis (EMA): measures and countermeasures for smart cards. In Proceedings of the International Conference on Research in Smart Cards, E-Smart 2001, Lecture Notes in Computer Science, 2001, pp. 200–210.
[87] Oliver Kömmerling and Markus G. Kuhn, Design principles for tamper-resistant smartcard processors. In Proceedings of the USENIX Workshop on Smartcard Technology (Smartcard '99), USENIX Association, Chicago, IL, May 10–11, 1999, pp. 9–20.
[88] D.P. Maher, Fault induction attacks, tamper resistance, and hostile reverse engineering in perspective. In Proceedings of the First International Conference on Financial Cryptography, 1997, pp. 109–122.
[89] Ross J. Anderson and Markus G. Kuhn, Low cost attacks on tamper resistant devices. In Proceedings of the Fifth International Security Protocols Conference, Vol. 1361 of Lecture Notes in Computer Science, M. Lomas et al., Eds., Springer-Verlag, Paris, France, April 7–9, 1997, pp. 125–136.
[90] Ross J. Anderson, Security Engineering: A Guide to Building Dependable Distributed Systems. John Wiley & Sons, New York, 2001.
[91] C. Aumüller, P. Bier, W. Fischer, P. Hofreiter, and J. Seifert, Fault attacks on RSA with CRT: concrete results and practical countermeasures. In Revised Papers from the Fourth International Workshop on Cryptographic Hardware and Embedded Systems, Springer-Verlag, 2003, pp. 260–275.
[92] David Samyde, Sergei Skorobogatov, Ross Anderson, and Jean-Jacques Quisquater, On a new way to read data from memory. In Proceedings of CHES 2002, Lecture Notes in Computer Science, 2003.
[93] Hagai Bar-El, Hamid Choukri, David Naccache, Michael Tunstall, and Claire Whelan, The Sorcerer's Apprentice guide to fault attacks. In Workshop on Fault Diagnosis and Tolerance in Cryptography, 2004.
[94] Artemios G. Voyiatzis and Dimitrios N. Serpanos, Active hardware attacks and proactive countermeasures. In Proceedings of IEEE ISCC 2002, 2002.
[95] Markus G. Kuhn, Optical time-domain eavesdropping risks of CRT displays. In Proceedings of the IEEE Symposium on Security and Privacy, 2002, pp. 3–18.
[96] Markus G. Kuhn, Electromagnetic eavesdropping risks of flat-panel displays. Presented at the Fourth Workshop on Privacy Enhancing Technologies, Toronto, Canada, May 26–28, 2004.
[97] J. Loughry and D.A. Umphress, Information leakage from optical emanations. ACM Transactions on Information and System Security, 5, 262–289, 2002.
[98] P. Wright, Spycatcher: The Candid Autobiography of a Senior Intelligence Officer. Viking, New York, 1987.
[99] Adi Shamir and Eran Tromer, Acoustic cryptanalysis — on noisy people and noisy machines. Eurocrypt 2004 rump session presentation, September 2004. Available at http://www.wisdom.weizmann.ac.il/~tromer/acoustic/
[100] J. Coron, D. Naccache, and P. Kocher, Statistics and secret leakage. ACM Transactions on Embedded Computing Systems, 3, 492–508, 2004.
© 2006 by Taylor & Francis Group, LLC
Design Issues in Secure Embedded Systems
17-25
[101] P. Kocher, J. Jaffe, and B. Jun, Differential power analysis. In Proceedings of the CRYPTO ’99, IACR, 1999, pp. 388–397. [102] Josyula R. Rao, and Pankaj Rohatgi, Empowering side-channel attacks. IACR Cryptography ePrint Archive: report 2001/037, September, 2004. Available at http://eprint.iacr.org/ 2001/037/ [103] Mehdi-Laurent Akkar, Régis Bevan, Paul Dischamp, and Didier Moyar, Power analysis, what is now possible. In Advances in Cryptology — ASIACRYPT 2000: 6th International, Springer-Verlag, 2000, pp. 489–502. [104] Thomas S. Messerges, Ezzy A. Dabbish, and Robert H. Sloan, Investigation of power analysis attacks on smartcards. In Proceedings of USENIX Workshop on Electronic Commerce, 1999, pp. 151–161. [105] D. Page, Theoretical use of cache memory as a cryptanalytic side-channel. Technical report CSTR-02-003, Computer Science Department, University of Bristol, Bristol, 2002. [106] Dawn Xiaodong Song, David Wagner, and Xuqing Tian, Timing analysis of keystrokes and timing attacks on SSH. In Proceedings of the 10th USENIX Security Symposium, USENIX Association, 2001. [107] E.W. Felten and M.A. Schneider, Timing attacks on web privacy. In Proceedings of the Seventh ACM Conference on Computer and Communications Security, ACM Press, 2000, pp. 25–32. [108] David Brumley and Dan Boneh, Remote timing attacks are practical. In Proceedings of the 12th USENIX Security Symposium, 2003. [109] J. Marc and Q. Jean-Jacques, Faulty RSA encryption. Technical report CG-1997/8, UCL Crypto Group, 1997. [110] Marc Joye and Jean-Jacques Quisquater, Attacks on systems using Chinese remaindering. Technical report CG1996/9, UCL Crypto Group, Belgium, 1996. [111] Vlastimil Klíma and Tomáš Rosa, Further results and considerations on side channel attacks on RSA. IACR Cryptography ePrint Archive: report 2002/071, September 2004. Available at http://eprint.iacr.org/2002/071/ [112] Vlastimil and Tomáš Rosa, Attack on private signature keys of the OpenPGP format, PGP(TM) programs and other applications compatible with OpenPGP. IACR Cryptology ePrint Archive report 2002/073, IACR, September 2004. Available at http://eprint.iacr.org/2002/076.pdf [113] A.G. Voyiatzis and D.N. Serpanos, A fault-injection attack on Fiat-Shamir cryptosystems. In Proceedings of the 24th International Conference on Distributed Computing Systems Workshops (ICDCS 2004 Workshops), 2004, pp. 618–621. [114] Eli Biham and Adi Shamir, Differential fault analysis of secret key cryptosystems. Lecture Notes in Computer Science. Springer-Verlag, 1294, 513–525, 1997. [115] P. Paillier, Evaluating differential fault analysis of unknown cryptosystems. In Proceedings of the Second International Workshop on Practice and Theory in Public Key Cryptography, 1999, pp. 235–244. [116] M. Jacob, D. Boneh, and E. Felten, Attacking an obfuscated cipher by injecting faults. In Proceedings of the 2002 ACM Workshop on Digital Rights Management, 2002. [117] Ingrid Biehl, Bernd Meyer, and Voker Müller, Differential fault attacks on elliptic curve cryptosystems. In Proceedings of CRYPTO 2000, Vol. 1880 of Lecture Notes in Computer Science, 2000, pp. 131–146. [118] Y. Zheng and T. Matsumoto, Breaking real-world implementations of cryptosystems by manipulating their random number generation. In Proceedings of the 1997 Symposium on Cryptography and Information Security, 1997. [119] Joan Daemen and Vincent Rijmen, The block cipher Rijndael. In Proceedings of Smart Card Research and Applications 2000, Lecture Notes in Computer Science, 2000, pp. 288–296. 
[120] NIST, NIST, Advanced Encryption Standard (AES), Federal Information Processing Standards Publication 1997, November 26, 2001.
© 2006 by Taylor & Francis Group, LLC
17-26
Embedded Systems Handbook
[121] François Koeune and Jean-Jacques Quisquater, A timing attack against Rijndael. Technical report CG-1999/1, Universite Catholique de Louvain, 1999. [122] P. Dusart, L. Letourneux, and O. Vivolo, Differential fault analysis on AES. In Proceedings of the International Conference on Applied Cryptography and Network Security, Lecture Notes in Computer Science, 2003, pp. 293–306. [123] Johaness Blömer and Jean-Pierre Seifert, Fault-based cryptanalysis of the advanced encryption standard (AES). In Financial Cryptography 2003, Vol. 2742 of Lecture Notes in Computer Science, 2003, pp. 162–181. [124] Arjen Lenstra, Memo on RSA signature generation in the presence of faults. September 28, 1996. (Manuscript, available from the author.) [125] B. Kaliski and M.J.B. Robshaw, Comments on some new attacks on cryptographic devices. RSA Laboratories Bulletin, 5 July, 1997. [126] Adi Shamir, Method and apparatus for protecting public key schemes from timing and fault attacks. US Patent No. 5,991,415, United States Patent and Trademark Office, November 23, 1999. [127] S. Yen, S. Kim, S. Lim, and S. Moon, RSA speedup with residue number system immune against hardware fault cryptanalysis. In Proceedings of the Fourth International Conference on Information Security and Cryptology, Seoul, 2002, pp. 397–413. [128] J. Blömer, M. Otto, and J. Seifert, A new CRT-RSA algorithm secure against bellcore attacks. In Proceedings of the 10th ACM Conference on Computer and Communication Security, 2003, pp. 311–320. [129] Sung-Ming Yen, Sangjae Moon, and Jae-Cheol Ha, Hardware fault attack on RSA with CRT revisited. In Proceedings of ICISC 2002, Lecture Notes in Computer Science, 2003, pp. 374–388. [130] S. Yen and M. Joye, Checking before output may not be enough against fault-based cryptanalysis. IEEE Transactions on Computers, 49, 967–970, 2000. [131] S. Yen, S. Kim, S. Lim, and S. Moon, A countermeasure against one physical cryptanalysis may benefit another attack. In Proceedings of the Fourth International Conference on Information Security and Cryptology, Seoul, 2002, pp. 414–427. [132] J. Dhem, F. Koeune, P. Leroux, P. Mestr, J. Quisquater, and J. Willems, A practical implementation of the timing attack. In Proceedings of the International Conference on Smart Card Research and Applications, 1998, pp. 167–182. [133] Werner Schindler, François Koeune, and Jean-Jacques Quisquater, Unleashing the full power of timing attack. UCL Crypto Group Technical report CG-2001/3, Universite Catholique de Louvain 2001. [134] A. Hevia and M. Kiwi, Strength of two data encryption standard implementations under timing attacks. ACM Transactions on Information and System Security, 2, 416–437, 1999. [135] Helena Handschuh and Heys Howard, A timing attack on RC5. In Proceedings of the Fifth Annual International Workshop on Selected Areas in Cryptography, SAC’98, 1998. [136] K. Gandolfi, C. Mourtel, and F. Olivier, Electromagnetic analysis: concrete results. In Proceedings of the Third International Workshop on Cryptographic Hardware and Embedded Systems, 2001, pp. 251–261. [137] K. Koç, T. Acar, and B.S. Kaliski Jr., Analyzing and comparing montgomery multiplication algorithms. IEEE Micro, 16, 26–33, 1996. [138] Simon Moore, Ross Anderson, Paul Cunningham, Robert Mullins, and George Taylor, Improving smart card security using self-timed circuits. In Proceedings of the Eighth International Symposium on Advanced Research in Asynchronous Circuits and Systems, 2002. 
[139] Kouichi Sakurai and Tsuyoshi Takagi, A reject timing attack on an IND-CCA2 public-key cryptosystem. In Proceedings of ICISC 2002, Lecture Notes in Computer Science, 2003. [140] S.P. Skorobogatov and R.J. Anderson, Optical fault induction attacks. In Revised Papers from the Fourth International Workshop on Cryptographic Hardware and Embedded Systems, 2003, pp. 2–12.
© 2006 by Taylor & Francis Group, LLC
II System-on-Chip Design

18 System-on-Chip and Network-on-Chip Design Grant Martin
19 A Novel Methodology for the Design of Application-Specific Instruction-Set Processors Andreas Hoffmann, Achim Nohl, and Gunnar Braun
20 State-of-the-Art SoC Communication Architectures José L. Ayala, Marisa López-Vallejo, Davide Bertozzi, and Luca Benini
21 Network-on-Chip Design for Gigascale Systems-on-Chip Davide Bertozzi, Luca Benini, and Giovanni De Micheli
22 Platform-Based Design for Embedded Systems Luca P. Carloni, Fernando De Bernardinis, Claudio Pinello, Alberto L. Sangiovanni-Vincentelli, and Marco Sgroi
23 Interface Specification and Converter Synthesis Roberto Passerone
24 Hardware/Software Interface Design for SoC Wander O. Cesário, Flávio R. Wagner, and A.A. Jerraya
25 Design and Programming of Embedded Multiprocessors: An Interface-Centric Approach Pieter van der Wolf, Erwin de Kock, Tomas Henriksson, Wido Kruijtzer, and Gerben Essink
26 A Multiprocessor SoC Platform and Tools for Communications Applications Pierre G. Paulin, Chuck Pilkington, Michel Langevin, Essaid Bensoudane, Damien Lyonnard, and Gabriela Nicolescu
18 System-on-Chip and Network-on-Chip Design

Grant Martin
Tensilica Inc.

18.1 Introduction
18.2 System-on-a-Chip
18.3 System-on-a-Programmable-Chip
18.4 IP Cores
18.5 Virtual Components
18.6 Platforms and Programmable Platforms
18.7 Integration Platforms and SoC Design
18.8 Overview of the SoC Design Process
18.9 System-Level Design
18.10 Interconnection and Communication Architectures for SoC
18.11 Computation and Memory Architectures for SoC
18.12 IP Integration Quality and Certification Methods and Standards
18.13 Summary
References
18.1 Introduction

“System-on-Chip” (SoC) is a phrase that has been much talked about in recent years [1]. It is more than a design style, more than an approach to the design of Application-Specific Integrated Circuits (ASICs), more than a methodology. Rather, SoC represents a major revolution in IC design — a revolution enabled by advances in process technology allowing the integration of all or most of the major components and subsystems of an electronic product onto a single chip, or integrated chipset [2]. This revolution in design has been embraced by many designers of complex chips, as the performance, power consumption, cost, and size advantages of using the highest available level of integration have proven to be extremely important for many designs. In fact, the design and use of SoCs is arguably one of the key problems in designing real-time embedded systems.

The move to SoC began sometime in the mid-1990s. At that point, the leading CMOS-based semiconductor process technologies of 0.35 and 0.25 µm were sufficiently capable of allowing the integration of many of the major components of a second-generation wireless handset or a digital set-top box onto a single chip. The digital baseband functions of a cell phone — a Digital Signal Processor (DSP), hardware (HW) support for voice encoding and decoding, and a RISC processor — could all be placed onto a single die. Although such a baseband SoC was far from the complete cell phone electronics — there were major components such as the RF transceiver, the analog power control, the analog baseband, and passives that were not integrated — the evolutionary path with each new process generation, to integrate more and more onto a single die, was clear. Today’s chipset would become tomorrow’s chip. The problems of integrating the hybrid technologies involved in making up a complete electronic system would be solved. Thus, eventually, SoC could encompass design components drawn from the standard and more adventurous domains of digital, analog, RF, reconfigurable logic, sensors, actuators, optical, chemical, microelectromechanical systems, and even biological and nanotechnology.

With this viewpoint of continued process evolution leading to ever-increasing levels of integration into ever-more-complex SoC devices, the issue of a SoC being a single chip at any particular point in time is somewhat moot. Rather, the word “system” in System-on-Chip is more important than “chip.” What is most important about a SoC — whether packaged as a single chip, an integrated chipset, a System-in-Package (SiP), or a System-on-Package (SoP) — is that it is designed as an integrated system, making design trade-offs across the processing domains and across the individual chip and package boundaries.
18.2 System-on-a-Chip

Let us define a SoC as a complex integrated circuit, or integrated chipset, which combines the major functional elements or subsystems of a complete end product into a single entity. These days, all interesting SoC designs include at least one programmable processor, and very often a combination of at least one RISC control processor and one DSP. They also include on-chip communications structures — processor bus(es), peripheral bus(es), and perhaps a high-speed system bus. A hierarchy of on-chip memory units, together with links to off-chip memory, is especially important for the SoC processors: caches (very often separate instruction and data caches) and main memories. For most signal processing applications, some degree of HW-based acceleration is provided through dedicated functional units, offering higher performance and lower energy consumption. For interfacing to the external, real world, SoCs include a number of peripheral processing blocks and, owing to the analog nature of the real world, these may include analog components as well as digital interfaces (e.g., to system buses at a higher packaging level). Although there is much interesting research on incorporating MEMS-based sensors and actuators, and on SoC applications incorporating chemical processing (lab-on-a-chip), these are, with rare exceptions, research topics only. However, future commercial SoCs may include such subsystems as well as optical communications interfaces. Figure 18.1 illustrates what a typical SoC might contain for consumer applications.

FIGURE 18.1 A typical SoC device for consumer applications: microprocessor and DSP subsystems, each with instruction and data caches, RAM, and flash, sharing a system bus with a DMA controller and external memory access; a bus bridge to a peripheral bus carrying MPEG decode, video interface, USB, audio codec, PCI, disk controller, 100Base-T, test, and PLL blocks.

One key point about SoC, often forgotten by those approaching it from a HW-oriented perspective, is that all interesting SoC designs encompass both hardware (HW) and software (SW) components — that is, programmable processors, Real-Time Operating Systems (RTOSs), and other aspects of HW-dependent SW such as peripheral device drivers, as well as middleware stacks for particular application domains, and possibly optimized assembly code for DSPs. Thus, the design and use of SoCs cannot remain a HW-only concern — it involves aspects of system-level design and engineering, HW–SW trade-off and partitioning decisions, and SW architecture, design, and implementation.
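One concrete artifact that ties the HW and SW views of such a device together is its address map. The sketch below shows how a base address map for a device like that of Figure 18.1 might be captured in a C++ header shared between HW and SW teams; every address, block name, and offset here is an invented placeholder, not taken from any real part.

```cpp
// Hypothetical address map for a consumer SoC of the kind shown in
// Figure 18.1. All values are illustrative assumptions.
#include <cstdint>

namespace soc_map {
  // Blocks on the system bus
  constexpr std::uint32_t SRAM_BASE    = 0x00000000; // on-chip RAM
  constexpr std::uint32_t FLASH_BASE   = 0x10000000; // boot/code flash
  constexpr std::uint32_t EXT_MEM_BASE = 0x20000000; // external memory window
  constexpr std::uint32_t DMA_BASE     = 0x40000000;

  // Blocks on the peripheral bus, reached through the bus bridge
  constexpr std::uint32_t MPEG_BASE    = 0x50000000;
  constexpr std::uint32_t USB_BASE     = 0x50010000;
  constexpr std::uint32_t AUDIO_BASE   = 0x50020000;

  // A device driver addresses registers at fixed offsets, e.g.:
  constexpr std::uint32_t USB_CTRL   = USB_BASE + 0x00;
  constexpr std::uint32_t USB_STATUS = USB_BASE + 0x04;
}

// Typical embedded idiom for a memory-mapped register access.
inline volatile std::uint32_t& reg(std::uint32_t addr) {
  return *reinterpret_cast<volatile std::uint32_t*>(addr);
}
```

Agreeing on such a map early — and keeping it in one shared file — is one small, tangible instance of the HW–SW co-engineering the text describes.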
18.3 System-on-a-Programmable-Chip

Recently, attention in the SoC world has begun to expand from implementations using custom, ASIC, or Application-Specific Standard Part (ASSP) design approaches to include the design and use of complex reconfigurable logic parts with embedded processors and other application-oriented blocks of intellectual property. These complex FPGAs (Field-Programmable Gate Arrays) are offered by several vendors, including Xilinx (Virtex-II PRO Platform FPGA) and Altera (SOPC), but are referred to by
several names: highly programmable SoCs, system-on-a-programmable-chip, embedded FPGAs. The key idea behind this approach to SoC is to combine large amounts of reconfigurable logic with embedded RISC processors (either custom laid-out, “hardened” blocks, or synthesizable processor cores), in order to allow very flexible and tailorable combinations of HW and SW processing to be applied to a particular design problem. Algorithms that consist of significant amounts of control logic, plus significant quantities of dataflow processing, can be partitioned between the control RISC processor (e.g., in Xilinx Virtex-II PRO, a PowerPC processor) and reconfigurable logic offering HW acceleration. Although the resulting combination does not offer the highest performance, lowest energy consumption, or lowest cost in comparison with custom IC or ASIC/ASSP implementations of the same functionality, it does offer tremendous flexibility in modifying the design in the field, and it avoids the expensive Non-Recurring Engineering (NRE) charges of those approaches. Thus, new applications, interfaces, and improved algorithms can be downloaded to products working in the field.

Products in this area also include other processing and interface cores, such as Multiply-Accumulate (MAC) blocks, which are specifically aimed at DSP-type dataflow signal and image processing applications, and high-speed serial interfaces for wired communications, such as SERDES (serializer/deserializer) blocks. In this sense, system-on-a-programmable-chip SoCs are not exactly application-specific, but not completely generic either.

It remains to be seen whether system-on-a-programmable-chip SoCs will become a successful way of delivering high-volume consumer applications, or will end up restricted to the two main applications for high-end FPGAs: rapid prototyping of designs which will be re-targeted to ASIC or ASSP implementations, and use in high-end, relatively expensive parts of the communications infrastructure that require in-field
flexibility and can tolerate the trade-offs in cost, energy consumption, and performance. Certainly, the use of synthesizable processors on more moderate FPGAs to realize SoC-style designs is one alternative to the cost issue. Intermediate forms, such as the metal-programmable gate-array-style logic fabrics combined with hard-core processor subsystems and other cores offered in the “Structured ASIC” products of LSI Logic (RapidChip) and NEC (Instant Silicon Solutions Platform), sit between the full-mask ASIC and ASSP approach and the field-programmable gate array approach. Here the trade-offs are much slower design creation (a few weeks rather than a day or so), higher NRE than FPGA (but much lower than a full set of masks), and better cost, performance, and energy consumption than FPGA (perhaps 15 to 30% worse than an ASIC approach). Further interesting compromise or hybrid approaches, such as ASIC/ASSP with on-chip FPGA regions, are also emerging to give design teams more choices.
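The cost positioning sketched above can be made tangible with a simple NRE-versus-unit-cost calculation. The figures below are invented placeholders chosen only to illustrate the crossover reasoning; real NRE and unit costs vary widely by process generation and vendor.

```cpp
// Illustrative break-even sketch for the implementation choices
// discussed above; every figure is an assumed placeholder.
#include <iostream>

struct Option { const char* name; double nre_usd, unit_usd; };

int main() {
  Option opts[] = { {"FPGA",            50e3, 40.0},
                    {"structured ASIC", 300e3, 12.0},
                    {"full-mask ASIC",  1.5e6,  8.0} };
  for (long units : {10000L, 100000L, 1000000L}) {
    std::cout << units << " units:";
    for (const auto& o : opts)
      std::cout << "  " << o.name << " $"
                << (o.nre_usd + o.unit_usd * units) / units << "/unit";
    std::cout << "\n";
  }
}
```

Even with these made-up numbers the pattern described in the text emerges: the FPGA wins at low volume, the structured ASIC in the middle range, and the full-mask ASIC only once volumes are high enough to amortize the mask set.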
18.4 IP Cores

The design of SoC would not be possible if every design started from scratch. In fact, the design of SoC depends heavily on the reuse of Intellectual Property blocks — what are called “IP cores.” IP reuse has emerged as a strong trend over the last 8 to 9 years [3] and has been one key element in closing what the International Technology Roadmap for Semiconductors [4] calls the “design productivity gap” — the difference between the rate of increase in complexity offered by advancing semiconductor process technology and the rate of increase in designer productivity offered by advances in design tools and methodologies.

But reuse is not just important as a way of enhancing designer productivity — although it has dramatic impacts on that. It also provides a mechanism for design teams to create SoC products that span multiple design disciplines and domains. The availability of both hard (laid-out and characterized) and soft (synthesizable) processor cores from a number of processor IP vendors allows design teams who would not be able to design their own processor from scratch to drop them into their designs, and thus add RISC control and DSP functionality to an integrated SoC without having to master the art of processor design within the team. In this sense, the advantages of IP reuse go beyond productivity: it offers both a large reduction in design risk and a way to do SoC designs that would otherwise be infeasible, owing to the length of time it would take to acquire the expertise and design the IP from scratch. This ability — to acquire, in prepackaged form, design domain expertise outside one’s own design team’s set of core competencies — is a key requirement for the evolution of SoC design going forward.

SoC up to this point has concentrated to a large part on integrating digital components, perhaps with some analog interface blocks treated as black boxes. The hybrid SoCs of the future, incorporating domains unfamiliar to the integration team, such as RF or MEMS, require the concept of “drop-in” IP to be extended to these new domains. We are not yet at that state — considerable evolution in the IP business and in the methodologies of IP creation, qualification, evaluation, integration, and verification is required before we will be able to easily specify and integrate truly heterogeneous sets of disparate IP blocks into a complete hybrid SoC. However, the same issues existed at the beginning of the SoC revolution in the digital domain. They have been solved to a large extent through the creation of standards for IP creation, evaluation, exchange, and integration — primarily for digital IP blocks but extending also to Analog/Mixed-Signal (AMS) cores.

Among the leading organizations in the identification and creation of such standards has been the Virtual Socket Interface Alliance (VSIA) [5], formed in 1996 and having at its peak more than 200 IP, systems, semiconductor, and Electronic Design Automation (EDA) corporate members. Although often criticized over the years for a lack of formal and acknowledged adoption of its IP standards, VSIA has had a more subtle influence on the electronics industry: many companies instituting reuse programmes internally, many IP, systems, and semiconductor companies engaging in IP creation and exchange, and many design groups have used VSIA IP standards as a key starting point for developing their own standards and methods for IP-based design. In this sense, use of VSIA outputs has enabled a kind of IP reuse in the IP business.
VSIA, for example, in its early architectural documents of 1996 to 1997, helped define the strong, industry-adopted understanding of what it means for an IP block to be in “hard” or “soft” form. Other important contributions include the widely read system-level design model taxonomy created by one of its working groups. Its standards, specifications, and documents thus represent a very useful resource for the industry [6].

Also important for the rise of IP-based design, and for the emergence of a third-party IP industry (which has taken much longer to develop than originally hoped in the mid-1990s), are the business issues surrounding IP evaluation, purchase, delivery, and use. Organizations such as the Virtual Component Exchange (VCX) [7] emerged to look at these issues and provide solutions. Although the VCX still exists, it is clear that the vast majority of IP business relationships between firms occur within a more ad hoc supplier-to-customer framework.
18.5 Virtual Components

The VSIA has had a strong influence on the nomenclature of the SoC- and IP-based design industry. The concept of the “virtual socket” — a description of all the design interfaces which an IP core must satisfy, together with the design models and integration information which must be provided with the core, to allow it to be more easily integrated or “dropped into” an SoC design — comes from Printed Circuit Board (PCB) design, where components are sourced and purchased in prepackaged form and can be dropped into a board design in a standardized way. The dual of the “virtual socket” then becomes the “virtual component.” Not only in the VSIA context but also more generally in the industry, an IP core represents a design block which might be reusable; a virtual component represents a design block that is intended for reuse, and which has been developed and qualified to be highly reusable. The things that separate IP cores from virtual components are in general:

• Virtual components conform in their development and verification processes to well-established design processes and quality standards.
• Virtual components come with design data, models, associated design files, scripts, characterization information, and other deliverables which conform to one or another well-accepted standard for IP reuse — for example, the VSIA deliverables, or another internal or external set of standards.
• Virtual components in general should have been fabricated at least once, and characterized post-fabrication to validate their claimed characteristics.
• Virtual components should have been reused at least once by an external design team, and usage reports and feedback should be available.
• Virtual components should have been rated for quality using an industry-standard metric such as OpenMORE (originated by Synopsys and Mentor Graphics) or the VSI Quality standard (which has OpenMORE as one of its inputs).

To a large extent, the developments of the last decade in IP reuse have been focused on defining the standards and processes to turn the ad hoc reuse of IP cores into a well-understood and reliable process for acquiring and reusing virtual components — thus enhancing the analogy with PCB design.
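A deliberately simplified way to picture the distinction is as a deliverables record that an incoming-inspection script might evaluate. The fields, threshold, and pass rule below are hypothetical illustrations, not the actual OpenMORE or VSI Quality metrics.

```cpp
// Toy virtual-component qualification record; field names and the
// pass rule are invented for illustration, not an industry standard.
#include <iostream>
#include <string>

struct VirtualComponent {
  std::string name;
  bool followed_quality_process;  // documented design/verification flow
  bool standard_deliverables;     // models, scripts, characterization data
  bool silicon_proven;            // fabricated and characterized once
  int  external_reuses;           // prior uses by other design teams
  int  quality_score;             // an OpenMORE-style 0-100 rating, say
};

bool qualifies_as_virtual_component(const VirtualComponent& vc) {
  return vc.followed_quality_process && vc.standard_deliverables &&
         vc.silicon_proven && vc.external_reuses >= 1 &&
         vc.quality_score >= 80;  // threshold is an arbitrary assumption
}

int main() {
  VirtualComponent usb{"usb_ctrl", true, true, true, 2, 87};
  std::cout << usb.name
            << (qualifies_as_virtual_component(usb)
                    ? " qualifies\n" : " is still just an IP core\n");
}
```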
18.6 Platforms and Programmable Platforms

The emphasis in the preceding sections has been on IP (or virtual component) reuse on a somewhat ad hoc, block-by-block basis in SoC design. Over the past several years, however, there has arisen a more integrated approach to the design of complex SoCs and the reuse of virtual components — what has been called “platform-based design.” This is dealt with at much greater length in another chapter in this book, and much more information is available in References 8 to 11. Suffice it here to define platform-based design in the SoC context from one perspective.
We can define platform-based design as a planned design methodology which reduces the time, effort, and risk involved in designing and verifying a complex SoC. This is accomplished by extensive reuse of combinations of HW and SW IP. Rather than reusing IP block by block, platform-based design assembles groups of components into a reusable platform architecture. This reusable architecture, together with libraries of preverified, precharacterized, application-oriented HW and SW virtual components, is a SoC integration platform.

There are several reasons for the growing popularity of the platform approach in industrial design. These include the increase in design productivity, the reduction in risk, the ability to utilize preintegrated virtual components from other design domains more easily, and the ability to reuse SoC architectures created by experts.

Industrial platforms include full application platforms, reconfigurable platforms, and processor-centric platforms [12]. Full application platforms, such as Philips Nexperia and TI OMAP, provide a complete implementation vehicle for specific product domains [13]. Processor-centric platforms, such as ARM PrimeXsys, concentrate on the processor, its required bus architecture, and basic sets of peripherals, along with an RTOS and basic SW drivers. Reconfigurable or “highly programmable” platforms, such as the Xilinx Platform FPGA and Altera’s SOPC, deliver hard-core processors plus reconfigurable logic, along with associated IP libraries and design tool flows.
18.7 Integration Platforms and SoC Design

The use of SoC integration platforms changes the SoC design process in two fundamental ways:

1. The basic platform must be designed, using whatever ad hoc or formalized SoC design process the platform creators decide on. Section 18.8 outlines some of the basic steps required to build a SoC, whether building a platform or using a more ad hoc, block-based integration process. However, when constructing a SoC platform for reuse in derivative design, it is important to remember that it may not be necessary to take the whole platform and its associated HW and SW component libraries through complete implementation. Enough implementation must be done to allow the platform and its constituent libraries to be fully characterized and modeled for reuse. It is also essential that the platform creation phase produce, in an archivable and retrievable form, all the design files required for the platform and its libraries to be reused in a derivative design process. This must also include the setup of the appropriate configuration programs or scripts to allow automatic creation of a configured platform during derivative design (a sketch of such a configurator appears at the end of this section).

2. A design process must be created and qualified for all the derivative designs which will be created on the basis of the SoC integration platform. This must include processes for retrieving the platform from its archive, for entering the derivative design configuration into a platform configurator, for generating the design files for the derivative, and for generating the appropriate verification environment(s) for the derivative. It must also give derivative design teams the ability to select components from libraries, to modify these components and validate them within the overall platform context, and, to the extent supported by the platform, to create new components for their particular application.

Reconfigurable or highly programmable platforms introduce an interesting addition to the platform-based SoC design process [14]. Platform FPGAs and SOPC devices can be thought of as a “meta-platform”: a platform for creating platforms. Design teams can obtain these devices from companies such as Xilinx and Altera, containing a basic set of more generic capabilities and IP: embedded processors, on-chip buses, special IP blocks such as MACs and SERDES, and a variety of other prequalified IP blocks. They can then customize the meta-platform to their own application space by adding application domain-specific IP libraries. Finally, the combined platform can be provided to derivative design teams, who can select the basic meta-platform and configure it within the scope intended by the intermediate platform creation team, selecting the IP blocks needed for their exact derivative application. More on platform-based design will be found in another chapter in this book.
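As promised above, here is a minimal sketch of what a platform configurator might do: take a base platform description plus a derivative request, check the request against the scope the platform creators allowed, and emit a configuration. All option names, limits, and the output format are hypothetical.

```cpp
// Toy platform configurator: validates a derivative request against
// the configurable scope of a base platform, then emits settings
// that would drive design-file generation. Names are invented.
#include <iostream>
#include <set>
#include <string>

struct Platform {
  std::set<std::string> optional_ip;  // blocks a derivative may add
  int max_dsps;                       // configuration limit
};

struct Derivative {
  std::set<std::string> chosen_ip;
  int dsps;
};

bool configure(const Platform& p, const Derivative& d) {
  if (d.dsps > p.max_dsps) return false;          // outside intended scope
  for (const auto& ip : d.chosen_ip)
    if (!p.optional_ip.count(ip)) return false;   // not in the library
  std::cout << "PARAM num_dsps = " << d.dsps << "\n";
  for (const auto& ip : d.chosen_ip)
    std::cout << "INSTANTIATE " << ip << "\n";    // would drive generation
  return true;
}

int main() {
  Platform base{{"mpeg_decode", "usb", "audio_codec"}, 2};
  Derivative mp3_player{{"usb", "audio_codec"}, 1};
  if (!configure(base, mp3_player)) std::cout << "request rejected\n";
}
```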
18.8 Overview of the SoC Design Process

The most important thing to remember about SoC design is that it is a multi-disciplinary design process, which needs to exercise design processes from across the spectrum of electronics. Design teams must gain some fluency with all these disciplines, but the integrative and reuse nature of SoC design means that they may not need to become deep experts in all of them. Indeed, avoiding the need for designers to understand all methodologies, flows, and domain-specific design techniques is one of the key reasons for reuse and one of its key enablers of productivity. Nevertheless, from Design-for-Test (DFT) through digital and analog HW design, from verification through system-level design, from embedded SW through IP procurement and integration, from SoC architecture through IC analysis, a wide variety of knowledge is required of the team, if not of every designer. Figure 18.2 illustrates the basic constituents of the SoC design process.

FIGURE 18.2 Steps in the SoC design process: SoC requirements analysis; SoC architecture (communications architecture, choice of processors); system-level design (HW–SW partitioning, system modeling, performance analysis); acquisition of HW and SW IP; building a transaction-level golden testbench; defining the SW architecture; configuring and floorplanning the SoC HW microarchitecture; DFT architecture and implementation; HW IP assembly, SW assembly, and AMS HW implementation; HW and HW–SW verification; final SoC HW assembly and verification; and fabrication, testing, packaging, and lab verification with SW.

We will now define each of these steps as illustrated.

SoC requirements analysis. This is the basic step for defining and specifying a complex SoC, based on the needs of the end product into which it will be integrated. The primary input into this step is the marketing definition of the end product and the resulting characteristics of what the SoC should be, both functional and nonfunctional (e.g., cost, size, energy consumption, performance: latency and throughput, package selection). This process of requirements analysis must ultimately answer the question: is the product feasible? Is the desired SoC feasible to design, and with what effort and in what timeframe? How much
reuse will be possible? Is the SoC design based on legacy designs of previous-generation products (or, in the case of platform-based design, to be built on an existing platform offering)?

SoC architecture. In this phase, the basic structure of the desired SoC is defined. It is vitally important to decide on the communications architecture that will be used as the backbone of the SoC on-chip communications network: an inadequate communications architecture will cripple the SoC, with as big an impact as the use of an inappropriate processor subsystem. Of course, the choice of communications architecture is impossible to divorce from the basic processor choices — for example: do I use a RISC control processor? Do I have an on-board DSP? How many of each? What are the processing demands of my SoC application? Do I integrate the bare processor core, or use a whole processor subsystem provided by an IP company (most processor IP companies have moved from offering just processor cores to offering whole processor subsystems, including hierarchical bus fabrics tuned to their particular processor’s needs)? Do I have some ideas, based on legacy SoC design in this space, as to how SW and HW should be partitioned? What memory hierarchy is appropriate? What are the sizes, levels, performance requirements, and configurations of the embedded memories most appropriate to the application domain for the SoC?

System-level design. This is an important phase of the SoC process — but one that is often done in a relatively ad hoc way: the whiteboard and the spreadsheet are as much used by SoC architects as more capable toolsets. However, ad hoc C/C++-based models have long been used in the system design phase to validate basic architectural choices, and designers of complex signal processing algorithms for voice and image processing have long adopted dataflow models and associated tools to define their algorithms, choose optimal bit-widths, and validate performance, whether destined for HW or SW implementation. A flurry of activity in the last few years on different C/C++ modeling standards for system architects has consolidated on SystemC [15]. The system nature of SoC demands a growing use of system-level modeling and analysis as these devices grow more complex. The basic processes carried out in this phase include HW–SW partitioning: the allocation of functions to dedicated HW blocks, to SW on processors (and the decision of RISC versus DSP), or to a combination of both, together with decisions on the communications mechanisms used to interface HW and SW (or HW–HW and SW–SW). In addition, the construction of system-level models, and the analysis through simulation and other analytical tools of the correct functioning, performance, and other nonfunctional attributes of the intended SoC, is necessary. Finally, all additional required IP blocks which can be sourced outside, or reused from the design group’s legacy, must be identified — both HW and SW. The remaining new functions will need to be implemented as part of the overall SoC design process.

IP acquisition. After system-level design and the identification of the processors, the communications architecture, and the other HW or SW IP required for the design, the group must undertake an IP acquisition stage.
This can, to a large extent, be done in parallel with other work such as system-level design (assuming early identification of the major external IP) or building golden transaction-level testbench models. Fortunate design groups will be working in companies with a large legacy of existing, well-crafted IP (rather, “virtual components”) organized in databases which can be easily searched; or in those with access via supplier agreements to large external IP libraries; or at least in those with experience of IP search, evaluation, purchase, and integration. For these lucky groups, the problems at this stage are greatly ameliorated. Others with less experience or infrastructure will need to explore these processes for the first time, hopefully making use of IP suppliers’ experience with the legal and other processes required. Here, external standards bodies such as VSIA and VCX have done much useful work that will smooth the path, at least a little. One key issue in IP acquisition is to conduct rigorous and thorough incoming inspection of IP, to ensure its completeness and correctness to the greatest extent possible prior to use, and to resolve any quality problems with suppliers early — long before SoC integration. Every hour spent at this stage will pay back by avoiding much longer schedule slips later. The IP quality guidelines discussed earlier are a foundation for a quality process at this point.

Build a transaction-level golden testbench. The system model built up during the system-level design stage can form the basis for a more elaborated design model, using “transaction-level abstractions” [16], which represents the underlying HW–SW architecture and components in more detail — sufficient detail to act as a functional virtual prototype for the SoC design. This golden model can be used at this stage to
verify the microarchitecture of the design and to verify detailed design models for HW IP at the Hardware Description Language (HDL) level within the overall system context. It can thus be reused all the way down the SoC design and implementation cycle.

Define the SoC SW architecture. SoC is of course not just about HW [17]. As well as often dictating the right on-chip communications architecture, the choice of processor(s) and the nature of the application domain have a very heavy influence on the SW architecture. For example, RTOS choice is limited by the processor ports which have been done and by the application domain (OSEK is an RTOS for automotive systems, Symbian OS for portable wireless devices, PalmOS for Personal Digital Assistants, etc.). Beyond the basic RTOS, every SoC peripheral device will need a device driver, hopefully based on reuse and configuration of templates; various middleware application stacks (e.g., telephony, multimedia image processing) are important parts of the SW architecture; and voice and image encoding and decoding on portable devices is often based on assembly-code IP for DSPs. There is thus a strong need, in defining the SoC, to fully elaborate the SW architecture, to allow reuse, easy customization, and effective verification of the overall HW–SW device.

Configure and floorplan the SoC microarchitecture. At this point we are beginning to deal with the SoC on a more physical and detailed logical basis. Of course, during high-level architecture and system-level design, the team has been looking at physical implementation issues (although our design process diagram shows everything as a waterfall kind of flow, in reality SoC design, like all electronics design, is more of an iterative, incremental process — that is, more akin to the famous “spiral” model for SW). But before detailed HW design and integration begin, it is important that the team agree on the basic physical floorplan; that all the IP blocks are properly and fully configured; that the basic microarchitectures (test, power, clocking, bus, timing) have been fully defined and configured; and that HW implementation can proceed. In addition, this process should also generate the downstream verification environments which will be used throughout the implementation processes — whether based on SW simulation, emulation, rapid prototypes, or other hybrid verification approaches.

DFT architecture and implementation. The test architecture is only one of the key microarchitectures which must be implemented; it is complicated by IP legacy and by the fact that it is often impossible to impose one DFT style (such as BIST or scan) on all IP blocks. Rather, wrappers or adaptations of standard test interfaces (such as JTAG ports) may be necessary to fit all IP blocks together into a coherent test architecture and plan.

AMS HW implementation. Most SoCs incorporating AMS blocks use them to interface to the external world. VSIA, among other groups, has done considerable work in defining how AMS IP blocks should be created to allow them to be more easily integrated into mainly digital SoCs (the “Big D/little a” SoC), along with guidelines and rules for such integration. Experience with these rules, guidelines, and extra deliverables has been, on the whole, promising — but they have more impact between internal design groups today than on the industry as a whole. The “Big A/Big D” mixed-signal SoC is still relatively rare.

HW IP assembly and integration.
This design step is in many ways the most traditional. Many design groups have experience in assembling, in an incremental fashion, design blocks done by various designers or subgroups into the agreed-on architectures for communications, bussing, clocking, power, etc. The main difference with SoC is that many of the design blocks may be externally sourced IP. To avoid difficulties at this stage, the importance of rigorous qualification of incoming IP, and of early definition of the SoC microarchitecture to which all blocks must conform, cannot be overstated.

SW assembly and implementation. Just as with HW, the SW IP, together with any new or modified SW tasks created for the particular SoC under design, must be assembled and validated for conformance to interfaces and for expected operational quality. It is important to verify as much of the SW as possible in its normal system operating context.

HW and HW–SW verification. Although represented as a single box in the diagram, this is perhaps one of the largest consumers of design time and effort, and the major determinant of final SoC quality. Vital to effective verification is the setup of a targeted SoC verification environment, reusing the golden testbench models created at higher levels of the design process. In addition, highly capable, multi-language, mixed simulation environments are important (e.g., SystemC models and HDL implementation models need to
be mixed in the verification process, and effective links between them are crucial). There are a large number of different verification tools and techniques [18], ranging from SW-based simulation environments to HW emulators, HW accelerators, and FPGA- and bonded-core-based rapid prototyping approaches. In addition, formal techniques such as equivalence checking and model/property checking have enjoyed some successful usage in verifying parts of SoC designs, or the design at multiple stages in the process. Mixed approaches to HW–SW verification range from incorporating Instruction Set Simulators (ISSs) of the processors into SW-based simulation, to linking HW emulation of the HW blocks (compiled from the HDL code) to SW running natively on a host workstation — linked either in an ad hoc fashion by design teams or using a commercial mixed verification environment. Alternatively, HDL models of new HW blocks running in a SW simulator can be linked to emulation of the rest of the system running in HW — a mix of emulation and the use of bonded-out processor cores for executing SW. It is important that as much of the system SW as possible be exercised in the context of the whole system, using the most appropriate verification technology that can get the design team close to real-time execution speed (running no more than about 100× slower than real time is the minimum needed to exercise significant amounts of SW). The trend toward transaction-based modeling of systems — where transactions range in abstraction from untimed functional communications via message calls, through abstract bus communications models, through cycle-accurate bus functional models, and finally to cycle- and pin-accurate transformations of transactions onto the fully detailed interfaces — allows verification to occur at several levels, or with mixed levels of design description. Finally, a newer trend in verification is assertion-based verification, using a variety of input languages (PSL/Sugar, e, Vera, or regular Verilog and VHDL) to model design properties, which can then be monitored during simulation to ensure that certain properties are satisfied or that certain error conditions never occur. Combinations of formal property checking and simulation-based assertion checking have also been created — viz. “semiformal verification.” The most important thing to remember about verification is that, armed with a host of techniques and tools, it is essential for design teams to craft a well-ordered verification process that allows them to definitively answer the question “how do we know that verification is done?” and thus allows the SoC to be sent to fabrication.

Final SoC HW assembly and verification. Often done in parallel with, or overlapping, “those final few simulation runs” of the verification stage, this phase includes final place and route of the chip, any hand modifications required, and final physical verification (using design rule checking and layout-versus-schematic [netlist] tools). It also includes important analysis steps for issues which arise in advanced semiconductor processes, such as IR drop, signal integrity, and power network integrity, as well as design transformations for manufacturability (OPC, etc.).

Fabrication, testing, packaging, and lab verification. When a SoC has been shipped to fabrication, it would seem time for the design team to relax.
Instead, this is an opportunity for additional verification to be carried out — especially further verification of the system SW running in the context of the HW design — and for any fixes (of the SW, or of the SoC HW in hopefully no more than one expensive iteration of the design) to be determined and planned. When the tested, packaged parts arrive back for verification in the lab, the ideal scenario is to load the SW into the system and have the SoC and its system booted up and running SW within a few hours. Interestingly, the most advanced SoC design teams, with well-ordered design methodologies and processes, are able to achieve this quite regularly.
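To make the transaction-level “golden testbench” idea from the steps above more concrete, the following is a minimal SystemC sketch in the untimed-to-loosely-timed style the chapter describes. The DMA transaction, module names, and all timing numbers are invented for illustration; a production golden model would be far richer.

```cpp
// Minimal transaction-level sketch (hypothetical modules and timings).
#include <systemc.h>
#include <iostream>

struct DmaRequest { unsigned src, dst, len; };   // an abstract transaction

// sc_fifo<T> needs a stream inserter for custom payload types.
inline std::ostream& operator<<(std::ostream& os, const DmaRequest& r) {
  return os << "dma(" << r.len << "B)";
}

SC_MODULE(Cpu) {                  // stands in for SW issuing requests
  sc_fifo_out<DmaRequest> cmd;
  void run() {
    for (unsigned i = 0; i < 4; ++i) {
      DmaRequest r = {0x1000 + i * 256, 0x8000 + i * 256, 256};
      cmd.write(r);               // blocking, untimed transaction call
      wait(10, SC_NS);            // crude model of per-request SW overhead
    }
  }
  SC_CTOR(Cpu) { SC_THREAD(run); }
};

SC_MODULE(DmaEngine) {            // stands in for the HW block
  sc_fifo_in<DmaRequest> cmd;
  void run() {
    for (;;) {
      DmaRequest r = cmd.read();
      wait(r.len * 2, SC_NS);     // assumed 2 ns/byte over the system bus
      std::cout << sc_time_stamp() << ": moved " << r.len
                << " bytes to 0x" << std::hex << r.dst << std::dec << "\n";
    }
  }
  SC_CTOR(DmaEngine) { SC_THREAD(run); }
};

int sc_main(int, char*[]) {
  sc_fifo<DmaRequest> ch(2);      // bounded channel back-pressures the CPU
  Cpu cpu("cpu");
  DmaEngine dma("dma");
  cpu.cmd(ch);
  dma.cmd(ch);
  sc_start(1, SC_US);             // long enough to drain all requests
  return 0;
}
```

Because the CPU and DMA engine exchange abstract transactions rather than pin wiggles, the same testbench can later drive an HDL implementation through a transactor — which is what allows the golden model to be reused down the implementation cycle.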
18.9 System-Level Design

As discussed earlier when describing the overall SoC design flow, system-level design and SoC are essentially made for each other. A key aim of IP reuse, and of SoC techniques such as platform-based design, is to make the “back end” (RTL-to-GDSII) design implementation processes easier — fast and low risk — and to shift the major design phase for SoC up in time and in abstraction level to the system level. This also means that the back-end tools and flows for SoC designs do not necessarily differ from those used for complex ASIC, ASSP, and custom IC design; what may differ for SoC is the methodology overlaid on those underlying tools and flows — how they are used, and how blocks are sourced and integrated. The fundamental IP-based nature of SoC design, however, has a stronger influence at the system level.
It is at the system level that the vital tasks of deciding on and validating the basic system architecture and the choice of IP blocks are carried out. In general, this is known as “design space exploration” (DSE). As part of this exploration, SoC platform customization for a particular derivative is carried out, should the SoC platform approach be used. Essentially, one can think of platform DSE as a similar task to general DSE, except that the scope and boundaries of the exploration are much more tightly constrained: the basic communications architecture and platform processor choices may be fixed, and the design team may be restricted to choosing certain customization parameters and optional IP from a library. Other tasks include HW–SW partitioning, usually restricted to decisions about key processing tasks which might be mapped into either HW or SW form and which have a big impact on system performance, energy consumption, on-chip communications bandwidth consumption, or other key attributes. Of course, in multiprocessor systems there are “SW–SW” partitioning or codesign issues as well — deciding on the assignment of SW tasks to the various processor options. Again, perhaps 80 to 95% of these decisions are or can be made a priori, especially if a SoC is based on either a platform or an evolution of an existing system; such codesign decisions are usually made on a small number of functions which have critical impact. Because partitioning, codesign, and DSE tasks at the system level involve much more than HW–SW issues, a more appropriate term for this activity is “function-architecture codesign” [19,20]. In this codesign model, systems are described on two equivalent levels:

• The functional intent of the system — for example, a network of applications, decomposed into individual sets of functional tasks which may be modeled using a variety of models of computation, such as discrete event, finite state machine, or dataflow.
• The architectural structure of the system — the communications architecture and the major IP blocks such as processor(s), memory(ies), and HW blocks — captured or modeled, for example, using some kind of IP or platform configurator.

The methodology implied in this approach is then to build explicit mappings between the functional view of the system and the architectural view, which carry within them the implicit partitioning of both computation and communications (a toy example of scoring one such mapping follows at the end of this section). This hybrid model can then be simulated, the results analyzed, and a variety of ancillary models (e.g., cost, power, performance, communications bandwidth consumption, etc.) utilized to examine the suitability of the system architecture as a vehicle for realizing the end-product functionality. The function-architecture codesign approach has been implemented and used in both research and commercial tools [21], and it forms the foundation of many system-level codesign approaches going forward. In addition, it has been found extremely well suited to platform-based design of SoC [22].
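As promised above, here is a deliberately tiny sketch of the kind of mapping evaluation a function-architecture codesign tool performs internally. All workloads, resource figures, and the scoring model are invented; real tools use far more detailed performance and power models and explore many candidate mappings.

```cpp
// Toy function-architecture mapping evaluator; every number is an
// illustrative assumption.
#include <algorithm>
#include <iostream>
#include <map>
#include <string>

struct Task     { double mops; };                  // workload, millions of ops
struct Resource { double mops_per_s, mw_active; }; // throughput, active power

int main() {
  std::map<std::string, Task> func = {
      {"viterbi", {200}}, {"control", {20}}, {"audio_decode", {80}}};
  std::map<std::string, Resource> arch = {
      {"risc", {150, 120}}, {"dsp", {600, 90}}, {"hw_acc", {2000, 30}}};
  // One candidate mapping of functional tasks onto architectural resources.
  std::map<std::string, std::string> mapping = {
      {"viterbi", "hw_acc"}, {"control", "risc"}, {"audio_decode", "dsp"}};

  double worst_ms = 0.0, energy_proxy = 0.0;
  for (const auto& m : mapping) {
    double ms = func[m.first].mops / arch[m.second].mops_per_s * 1000.0;
    worst_ms = std::max(worst_ms, ms);              // tasks assumed parallel
    energy_proxy += ms * arch[m.second].mw_active;  // power x busy time
  }
  std::cout << "latency " << worst_ms << " ms, energy proxy "
            << energy_proxy << " mW*ms\n";
}
```

A DSE loop would wrap this evaluation, enumerating alternative mappings and ranking them against cost, power, and bandwidth models before committing to a partitioning.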
18.10 Interconnection and Communication Architectures for SoC

This topic is dealt with in more detail in other chapters in this book. Suffice it to say here that current SoC architectures rely on fairly traditional hierarchies of standard on-chip buses — for example, processor-specific buses, high-speed system buses, and lower-speed peripheral buses — using standards such as ARM’s AMBA and IBM’s CoreConnect [13] and traditional master–slave bus approaches. Recently, there has been a lot of interest in Network-on-Chip (NoC) communications architectures based on packet switching, and a number of approaches have been reported in the literature, but this remains primarily a research topic both in universities and in industrial research labs [23].
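As a brief flavor of what packet-switched NoC routing involves, here is a sketch of dimension-ordered (XY) routing, a simple deterministic scheme commonly studied for 2D mesh NoCs; the coordinate convention and port names are assumptions of this sketch, not drawn from any particular proposal.

```cpp
// Dimension-ordered (XY) routing on a 2D mesh: a packet travels along
// the X dimension first, then Y, which avoids cyclic channel
// dependencies (and hence deadlock) on a mesh.
struct Coord { int x, y; };
enum class Port { Local, East, West, North, South };

Port route_xy(Coord here, Coord dest) {
  if (dest.x > here.x) return Port::East;   // resolve the X offset first...
  if (dest.x < here.x) return Port::West;
  if (dest.y > here.y) return Port::North;  // ...then the Y offset
  if (dest.y < here.y) return Port::South;
  return Port::Local;                       // arrived: deliver to local core
}
```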
18.11 Computation and Memory Architectures for SoC

The primary processors used in SoC are embedded RISCs such as ARM processors, PowerPCs, and MIPS-architecture processors, together with some of the configurable processors designed specifically for SoC, such as those from Tensilica and ARC. In addition, embedded DSPs from traditional suppliers such as TI, Motorola, ParthusCeva, and others are quite common in many consumer applications, providing embedded signal processing for voice and image data. Research groups have looked at compiling or synthesizing application-specific processors or coprocessors [24,25], and these have interesting potential in future SoCs, which may incorporate networks of heterogeneous configurable processors collaborating to offer large amounts of computational parallelism. This is an especially interesting prospect given the wider use of reconfigurable logic, which opens up the possibility of dynamically adapting the SoC to application needs. However, most multiprocessor SoCs today involve at most two to four processors of conventional design; the larger networks are today more often found in the industrial or university lab.

Although several years ago most embedded processors in early SoCs did not use cache-based memory hierarchies, this has changed significantly over the years, and most RISC and DSP processors now include significant amounts of Level 1 cache memory, as well as higher-level memory units both on- and off-chip (off-chip flash memory is often used for embedded SW tasks which may be only infrequently required). System design tasks and tools must consider the structure, size, and configuration of the memory hierarchy as one of the key SoC configuration decisions to be made.
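A back-of-envelope calculation illustrates why the memory-hierarchy configuration is such a key decision. The sketch below computes average memory access time (AMAT) from hit times and miss rates; all the numbers are invented placeholders, not measurements of any real core.

```cpp
// AMAT = L1_hit + L1_miss_rate * (L2_hit + L2_miss_rate * DRAM_penalty)
// All figures are illustrative assumptions, in processor cycles.
#include <iostream>

int main() {
  double l1_hit = 1.0, l1_miss = 0.05;  // small on-chip I/D caches, say
  double l2_hit = 8.0, l2_miss = 0.30;  // larger on-chip SRAM
  double dram_penalty = 60.0;           // off-chip memory access
  double amat = l1_hit + l1_miss * (l2_hit + l2_miss * dram_penalty);
  std::cout << "AMAT = " << amat << " cycles\n";  // 2.3 cycles here
}
```

Halving the L1 miss rate in this toy model (say, by doubling the cache size) cuts the AMAT to about 1.65 cycles; weighing such gains against the extra area and power is exactly the kind of trade-off a system-level configuration tool must support.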
18.12 IP Integration Quality and Certification Methods and Standards

We have emphasized the design reuse aspects of SoC and the need for design teams creating SoCs to reuse both internally and externally sourced IP blocks. In the discussion of the design process above, we mentioned issues such as IP quality standards and the need for incoming inspection and qualification of IP. The issue of IP quality remains one of the biggest impediments to the use of IP-based design for SoC [26]. The quality standards and metrics available from VSIA and OpenMORE, and their further enhancement, help, but only to a limited extent. The industry could clearly use a formal certification body or lab for IP quality that would ensure conformance to IP transfer requirements and the integration quality of the blocks. Such a certification process would of necessity be quite complex, owing to the large number of configurations possible for many IP blocks and the almost infinite variety of SoC contexts into which they might be integrated. Certified IP would begin to deliver the “virtual components” of the VSIA vision.

In the absence of formal external certification (and such third-party labs seem a long way off, if they ever emerge), design groups must provide their own certification processes and real reuse-quality metrics, based on their internal design experiences. Platform-based design methods help, owing to the advantages of prequalifying and characterizing groups of IP blocks and libraries of compatible domain-specific components. Short of independent evaluation and qualification, this is the best that design groups can do currently.

One key issue to remember is that IP not created for reuse, with all the deliverables created and validated according to a well-defined set of standards, is inherently not reusable. The effort required to make a reusable IP block has been estimated at 50 to 200% more than that required to make a block for one-time use. Even at the most conservative figure (200% extra, so that one reusable block costs as much as three single-use blocks), the investment breaks even by the third use and pays back positively thereafter. Planned and systematic IP reuse, with investment in those blocks with the greatest SoC use potential, gives a high chance of achieving significant productivity soon after starting a reuse programme. But ad hoc attempts to reuse existing design blocks not designed to reuse standards have failed in the past and are unlikely to provide the quality and productivity desired.
18.13 Summary
In this chapter, we have defined SoC and surveyed a large number of the issues involved in its design. The outline of the important methods and processes involved in SoC design defines a methodology which can be adopted by design groups and adapted to their specific requirements. Productivity in SoC design demands high levels of design reuse; third-party and internal IP sources, and the chance to build a library of reusable IP blocks (true virtual components), are within reach of most design groups today.
The wide variety of design disciplines involved in SoC means that unprecedented collaboration between designers of all backgrounds — from systems experts through embedded SW designers and architects to HW designers — is required. But the rewards of SoC justify the effort required to succeed.
References
[1] Merrill Hunt and Jim Rowson, Blocking in a system on a chip. IEEE Spectrum, 33(11), 35–41, November 1996.
[2] Rochit Rajsuman, System-on-a-Chip Design and Test. Artech House, Norwood, Massachusetts, 2000.
[3] Michael Keating and Pierre Bricaud, Reuse Methodology Manual for System-on-a-Chip Designs. Kluwer Academic Publishers, Dordrecht, 1998 (1st ed.), 1999 (2nd ed.), 2002 (3rd ed.).
[4] International Technology Roadmap for Semiconductors (ITRS), 2001 edn. http://public.itrs.net/.
[5] Virtual Socket Interface Alliance, http://www.vsia.org. This includes access to its various public documents, including the original Reuse Architecture document of 1997, as well as more recent documents supporting IP reuse released to the public domain.
[6] B. Bailey, G. Martin, and T. Anderson (eds.), Taxonomies for the Development and Verification of Digital Systems. Springer, New York, 2005.
[7] The Virtual Component Exchange (VCX). Available at http://www.thevcx.com/.
[8] Henry Chang, Larry Cooke, Merrill Hunt, Grant Martin, Andrew McNelly, and Lee Todd, Surviving the SOC Revolution: A Guide to Platform-Based Design. Kluwer Academic Publishers, Dordrecht, 1999.
[9] K. Keutzer, S. Malik, A.R. Newton, J. Rabaey, and A. Sangiovanni-Vincentelli, System-Level Design: Orthogonalization of Concerns and Platform-Based Design. IEEE Transactions on CAD of ICs and Systems, 19, 1523, 2000.
[10] Alberto Sangiovanni-Vincentelli and Grant Martin, Platform-Based Design and Software Design Methodology for Embedded Systems. IEEE Design and Test of Computers, 18, 23–33, 2001.
[11] IEEE Design and Test of Computers, Special Issue on Platform-Based Design of SoCs, 19, 4–63, 2002.
[12] G. Martin and F. Schirrmeister, A Design Chain for Embedded Systems. IEEE Computer, Embedded Systems Column, 35(3), 100–103, March 2002.
[13] Grant Martin and Henry Chang, Eds., Winning the SOC Revolution: Experiences in Real Design. Kluwer Academic Publishers, Dordrecht, May 2003.
[14] Patrick Lysaght, FPGAs as Meta-Platforms for Embedded Systems. In Proceedings of the IEEE Conference on Field Programmable Technology, Hong Kong, December 2002.
[15] Thorsten Groetker, Stan Liao, Grant Martin, and Stuart Swan, System Design with SystemC. Kluwer Academic Publishers, Dordrecht, May 2002.
[16] Janick Bergeron, Writing Testbenches, 3rd ed. Kluwer Academic Publishers, Dordrecht, 2003.
[17] G. Martin and C. Lennard, Improving Embedded SW Design and Integration for SOCs. Invited paper, Custom Integrated Circuits Conference, May 2000, pp. 101–108.
[18] Prakash Rashinkar, Peter Paterson, and Leena Singh, System-on-a-Chip Verification: Methodology and Techniques. Kluwer Academic Publishers, Dordrecht, 2001.
[19] F. Balarin, M. Chiodo, P. Giusto, H. Hsieh, A. Jurecska, L. Lavagno, C. Passerone, A. Sangiovanni-Vincentelli, E. Sentovich, K. Suzuki, and B. Tabbara, Hardware–Software Co-Design of Embedded Systems: The POLIS Approach. Kluwer Academic Publishers, Dordrecht, 1997.
[20] S. Krolikoski, F. Schirrmeister, B. Salefski, J. Rowson, and G. Martin, Methodology and Technology for Virtual Component Driven Hardware/Software Co-Design on the System Level. Paper 94.1, ISCAS 99, Orlando, FL, May 30–June 2, 1999.
[21] G. Martin and B. Salefski, System Level Design for SOCs: A Progress Report — Two Years On. In System-on-Chip Methodologies and Design Languages, Jean Mermet, Ed. Kluwer Academic Publishers, Dordrecht, 2001, pp. 297–306.
[22] G. Martin, Productivity in VC Reuse: Linking SOC Platforms to Abstract Systems Design Methodology. In Virtual Component Design and Reuse, Ralf Seepold and Natividad Martinez Madrid, Eds. Kluwer Academic Publishers, Dordrecht, 2001, pp. 33–46.
[23] Axel Jantsch and Hannu Tenhunen, Eds., Networks on Chip. Kluwer Academic Publishers, Dordrecht, 2003.
[24] Vinod Kathail, Shail Aditya, Robert Schreiber, B. Ramakrishna Rau, Darren C. Cronquist, and Mukund Sivaraman, PICO: Automatically Designing Custom Computers. IEEE Computer, 35, 39–47, 2002.
[25] T.J. Callahan, J.R. Hauser, and J. Wawrzynek, The Garp Architecture and C Compiler. IEEE Computer, 33, 62–69, 2000.
[26] DATE 2002 Proceedings, Session 1A: How to Choose Semiconductor IP?: Embedded Processors, Memory, Software, Hardware. In Proceedings of DATE 2002, Paris, March 2002, pp. 14–17.
19
A Novel Methodology for the Design of Application-Specific Instruction-Set Processors

Andreas Hoffmann, Achim Nohl, and Gunnar Braun
CoWare Inc.

19.1 Introduction
19.2 Related Work
19.3 ASIP Design Flow
    Architecture Exploration • LISA Language
19.4 LISA Processor Design Platform
    Hardware Designer Platform — For Exploration and Processor Generation • Software Designer Platform — For Software Application Design • System Integrator Platform — For System Integration and Verification
19.5 SW Development Tools
    Assembler and Linker • Simulator
19.6 Architecture Implementation
    LISA Language Elements for HDL Synthesis • Implementation Results
19.7 Tools for Application Development
    Examined Architectures • Efficiency of the Generated Tools
19.8 Requirements and Limitations
    LISA Language • HLL C-compiler • HDL Generator
19.9 Conclusion and Future Work
References
19.1 Introduction
From Andreas Hoffmann, Tim Kogel, Achim Nohl, Gunnar Braun, Oliver Schliebusch, Oliver Wahlen, Andreas Wieferink, and Heinrich Meyr, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 20, 2001. With permission.

In consumer electronics and telecommunications, high product volumes increasingly go along with short product lifetimes. Driven by the advances in semiconductor technology, combined with the need for new
applications like digital TV and wireless broadband communications, the amount of system functionality realized on a single chip is growing enormously. Higher integration and thus increasing miniaturization have led to a shift from using distributed hardware components towards heterogeneous system-on-chip (SOC) designs [1]. Due to the complexity introduced by such SOC designs and time-to-market constraints, the designer's productivity has become the vital factor for successful products. For this reason, a growing amount of system functions and signal processing algorithms is implemented in software rather than in hardware by employing embedded processor cores. In the current technical environment, embedded processors and the necessary development tools are designed manually, with very little automation. This is because the design and implementation of an embedded processor, such as a DSP device embedded in a cellular phone, is a highly complex process composed of the following phases: architecture exploration, architecture implementation, application software design, and system integration and verification.

During the architecture exploration phase, software development tools (i.e., HLL compiler, assembler, linker, and cycle-accurate simulator) are required to profile and benchmark the target application on different architectural alternatives. This process is usually iterative and is repeated until a best fit between the selected architecture and the target application is obtained. Every change to the architecture specification requires a completely new set of software development tools. As these changes to the tools are carried out mainly by hand, the result is a long, tedious, and extremely error-prone process. Furthermore, the lack of automation makes it very difficult to match the profiling tools to an abstract specification of the target architecture.

In the architecture implementation phase, the specified processor has to be converted into a synthesizable HDL model. With this additional manual transformation, it is quite obvious that considerable consistency problems arise between the architecture specification, the software development tools, and the hardware implementation.

During the software application design phase, software designers need a set of production-quality software development tools. Since the software application designer and the hardware processor designer place different requirements on software development tools, new tools are required. For example, the processor designer needs a cycle/phase-accurate simulator for hardware/software partitioning and profiling, which is very accurate but inevitably slow, whereas the application designer demands more simulation speed than accuracy. At this point, the complete software development tool-suite is usually re-implemented by hand — consistency problems are self-evident.

In the system integration and verification phase, co-simulation interfaces must be developed to integrate the software simulator for the chosen architecture into a system simulation environment. These interfaces vary with the architecture currently under test. Again, manual modification of the interfaces is required with each change of the architecture.

The effort of designing a new architecture can be reduced significantly by using a retargetable approach based on a machine description.
The Language for Instruction Set Architectures (LISA) [2,3] was developed for the automatic generation of consistent software development tools and synthesizable HDL code. A LISA processor description covers the instruction-set, the behavioral, and the timing model of the underlying hardware, thus providing all essential information for the generation of a complete set of development tools including compiler, assembler, linker, and simulator. Moreover, it contains enough micro-architectural detail to generate synthesizable HDL code of the modelled architecture. Changes to the architecture are easily transferred to the LISA model and are applied automatically to the generated tools and hardware implementation. In addition, the speed and functionality of the generated tools allow their use even after product development has finished; consequently, there is no need to rewrite the tools to bring them up to production-quality standard. By representing an unambiguous abstraction of the real hardware, a LISA model description bridges the gap between hardware and software design: it provides the software developer with all required information and enables the hardware designer to synthesize the architecture from the same specification the software tools are based on. The chapter is organized as follows: Section 19.2 reviews existing approaches to machine description languages and discusses their applicability to the design of application-specific instruction-set processors. Section 19.3 presents an overview of a typical ASIP design flow using LISA, from specification to implementation. Moreover, different processor models are worked out which contain the required information
the tools need for their retargeting. In addition, sample LISA code segments are presented showing how the different models are expressed in the LISA language. Section 19.4 introduces the LISA processor design platform; following that, the different areas of application are illuminated in more detail. In Section 19.5 the generated software development tools are presented, with a focus on the different simulation techniques that are applicable. Section 19.6 shows the path to implementation and gives results for a case study carried out using the presented methodology. To prove the quality of the generated software development tools, Section 19.7 gives simulation benchmark results for modelled state-of-the-art processors. In Section 19.8, requirements and limitations of the presented approach are explained. Section 19.9 summarizes the chapter and gives an outlook on future research topics.
19.2 Related Work
Hardware description languages (HDLs) like VHDL or Verilog are widely used to model and simulate processors, but mainly with the goal of developing hardware. Using these models for architecture exploration and production-quality software development tool generation has a number of disadvantages, especially for cycle-based or instruction-level processor simulation. They cover a huge amount of hardware implementation detail that is not needed for performance evaluation, cycle-based simulation, and software verification. Moreover, the description of detailed hardware structures has a significant impact on simulation speed [4,5]. Another problem is that the extraction of the instruction set is a highly complex, manual task, and some instruction-set information, for example, assembly syntax, cannot be obtained from HDL descriptions at all.

There are many publications on machine description languages providing instruction-set models. Most approaches using such models address retargetable code generation [6–9]. Other approaches address retargetable code generation and simulation. The approaches of Maril [10], as part of the Marion environment, and a system for VLIW compilation [11] both use latency annotation and reservation tables for code generation. But models based on operation latencies are too coarse for cycle-accurate simulation or even generation of synthesizable HDL code. The language nML was developed at TU Berlin [12,13] and adopted in several projects [14–17]. However, the underlying instruction sequencer does not allow describing the mechanisms of pipelining as required for cycle-based models. Processors with more complex execution schemes and instruction-level parallelism, like the Texas Instruments TMS320C6x, cannot be described, even at the instruction-set level, because of the numerous combinations of instructions. The same restriction applies to ISDL [18], which is very similar to nML. The language ISDL is an enhanced version of the nML formalism and allows the generation of a complete tool-suite consisting of HLL compiler, assembler, linker, and simulator. Even the possibility of generating synthesizable HDL code is reported, but no results are given on the efficiency of the generated tools or the generated HDL code. The EXPRESSION language [19] allows cycle-accurate processor description based on a mixed behavioral/structural approach. However, no results on simulation speed have been published, nor is it clear whether it is feasible to generate synthesizable HDL code automatically. The FlexWare2 environment [20] is capable of generating assembler, linker, simulator, and debugger from the Insulin formalism. A link to implementation is non-existent, but test vectors can be extracted from the Insulin description to verify the HDL model. The HLL compiler is derived from a separate description targeting the CoSy [21] framework. Recently, various ASIP development systems have been introduced [22–24] for systematic co-design of instruction-set and micro-architecture implementation using a given set of application benchmarks. The PEAS-III system [25] is an ASIP development environment based on a micro-operation description of instructions that allows the generation of a complete tool-suite consisting of HLL compiler, assembler, linker, and simulator, including HDL code. However, no further information is given about the formalism that parameterizes the tool generators, nor have any results been published on the efficiency of the generated tools.
The MetaCore system [26] is a benchmark-driven ASIP development system based on a formal representation language. The system accepts a set of benchmark programs and estimates the hardware cost and performance for the configuration under test. Following that, software development tools and
synthesizable HDL code are generated automatically. As the formal specification of the ISA is similar to the ISPS formalism [27], complex pipeline operations such as flushes and stalls can hardly be modelled. In addition, flexibility in designing the instruction-set is limited to a predefined set of instructions. Tensilica Inc. customizes a RISC processor within the Xtensa system [28]. As the system is based on an architecture template comprising quite a number of base instructions, it is far too powerful and thus not suitable for highly application-specific processors, which in many cases employ only very few instructions.

Our interest in a complete retargetable tool-suite for architecture exploration, production-quality software development, architecture implementation, and system integration for a wide range of embedded processor architectures motivated the introduction of the LISA language used in our approach. In many aspects, LISA incorporates ideas similar to nML. However, our experience with different DSP architectures showed that significant limitations of existing machine description languages must be overcome to allow the description of modern commercial embedded processors. For this reason, LISA includes improvements in the following areas:

• Capability to provide cycle-accurate processor models, including constructs to specify pipelines and their mechanisms such as stalls, flushes, operation injection, etc.
• Extension of the target class of processors to include SIMD, VLIW, and superscalar real-world processor architectures.
• Explicit language statements addressing compiled simulation techniques.
• Distinction between the detailed bit-true description of operation behavior, including side-effects, for simulation and implementation on the one hand, and the assignment to arithmetical functions for the instruction-selection task of the compiler on the other hand, which allows the abstraction level of the behavioral part of the processor model to be determined freely.
• Strong orientation towards the programming languages C/C++; LISA is a framework which encloses pure C/C++ behavioral operation description.
• Support for instruction aliasing and complex instruction coding schemes.
19.3 ASIP Design Flow
Powerful application-specific programmable architectures are increasingly required in the DSP, multimedia, and networking application domains in order to meet demanding cost and performance requirements. The complexity of algorithms and architectures in these application domains prohibits an ad hoc implementation and calls for an elaborate design methodology with efficient tool support. In this section, a seamless ASIP design methodology based on LISA is introduced. Moreover, it is demonstrated how the outlined concepts are captured by the LISA language elements. The expressiveness of the LISA formalism, providing high flexibility with respect to abstraction level and architecture category, is especially valuable for the design of high-performance processors.
19.3.1 Architecture Exploration
The LISA-based methodology sets in after the algorithms intended for execution on the programmable platform have been selected. The algorithm design is beyond the scope of LISA and is typically performed in an application-specific system-level design environment, such as COSSAP [29] for wireless communications or OPNET [30] for networking. The outcome of the algorithmic exploration is a pure functional specification, usually represented by means of an executable prototype written in a high-level language (HLL) like C, together with a requirement document specifying cost and performance parameters. In the following, the steps of our proposed design flow depicted in Figure 19.1 are described, in which the ASIP designer successively refines the application jointly with the LISA model of the programmable target architecture.

FIGURE 19.1 LISA-based ASIP development flow. [Figure: four exploration steps, each pairing an application refinement with a LISA model refinement and an exploration result: (1) assembly algorithm kernel / data-path model / ISA-accurate profiling (data); (2) assembly program / instruction model / ISA-accurate profiling (data + control); (3) revised assembly program / cycle-true model / cycle-accurate profiling (data + control); (4) assembly program / RTL model / hardware cost + timing.]

First, the performance-critical algorithmic kernels of the functional specification have to be identified. This task can easily be performed with a standard profiling tool that instruments the application
code in order to generate HLL execution statistics during the simulation of the functional prototype. Thus the designer becomes aware of the performance-critical parts of the application and is prepared to define the data path of the programmable architecture at the assembly instruction level. Starting from a LISA processor model which implements an arbitrary basic instruction set, the LISA model can be enhanced with parallel resources, special-purpose instructions, and registers in order to improve the performance for the considered application. At the same time, the algorithmic kernel of the application code is translated into assembly by making use of the specified special-purpose instructions. By employing the assembler, linker, and processor simulator derived from the LISA model (cf. Section 19.5), the designer can iteratively profile and modify the programmable architecture in cadence with the application until both fulfill the performance requirements.

After the processing-intensive algorithmic kernels have been considered and optimized, the instruction set needs to be completed. This is accomplished by adding instructions to the LISA model which are dedicated to the low-speed control and configuration parts of the application. While these parts usually represent major portions of the application in terms of code amount, they have only negligible influence on the overall performance. Therefore it is very often feasible to employ the HLL C-compiler derived from the LISA model and accept suboptimal assembly code quality in return for a significant cut in design time.

So far, the optimization has only been performed with respect to the software-related aspects, while neglecting the influence of the micro-architecture. For this purpose, the LISA language provides capabilities to model the cycle-accurate behavior of pipelined architectures. The LISA model is supplemented by the instruction pipeline, and the execution of all instructions is assigned to the respective pipeline stages. If the architecture does not provide automatic interlocking mechanisms, the application code has to be revised to take pipeline effects into account. The designer is now able to verify that the cycle-true processor model still satisfies the performance requirements.

At the last stage of the design flow, the HDL generator (see Section 19.6) can be employed to generate synthesizable HDL code for the base structure and the control path of the architecture. After implementing the dedicated execution units of the data path, reliable figures on hardware cost and performance parameters (e.g., design size, power consumption, clock frequency) can be derived by running the HDL processor model through the standard synthesis flow. At this level of detail, the designer can tweak the computational efficiency of the architecture by applying different implementations of the data-path execution units.
19.3.2 LISA Language
The LISA language [2,3] aims at the formalized description of programmable architectures, their peripherals, and their interfaces. LISA closes the gap between purely structurally oriented languages (VHDL, Verilog) and instruction-set languages.
FIGURE 19.2 Model requirements for ASIP design. [Table relating the generated tools to the model components of a LISA description (memory, resource, behavioral, instruction set, timing, and micro-architecture model): the HLL compiler requires register allocation (memory model), instruction scheduling (resource and timing models), and instruction selection (instruction set model); the assembler requires instruction translation (instruction set model); the linker requires memory allocation (memory model); the simulator requires simulation of storage (memory model), operation simulation (behavioral model), decoder/disassembler (instruction set model), and operation scheduling (timing model); the debugger requires display configuration (memory model) and profiling; the HDL generator requires the basic structure (memory model), write-conflict resolution (resource model), the instruction decoder (instruction set model), operation scheduling (timing model), and operation grouping (micro-architecture model).]
LISA descriptions are composed of resources and operations. The declared resources represent the storage objects of the hardware architecture (e.g., registers, memories, pipelines) which capture the state of the system. Operations are the basic objects in LISA. They represent the designer’s view of the behavior, the structure, and the instruction set of the programmable architecture. A detailed reference of the LISA language can be found in Reference 31. The process of generating software development tools and synthesizing the architecture requires information on architectural properties and the instruction set definition as depicted in Figure 19.2. These requirements can be grouped into different architectural models — the entirety of these models constitutes the abstract model of the target architecture. The LISA machine description provides information consisting of the following model components:
The memory model. This lists the registers and memories of the system with their respective bit widths, ranges, and aliasing. The compiler gets information on available registers and memory spaces. The memory configuration is provided to perform object-code linking. During simulation, the entirety of storage elements represents the state of the processor, which can be displayed in the debugger. The HDL code generator derives the basic architecture structure from it. In LISA, the resource section lists the definitions of all objects which are required to build the memory model. A sample resource section of the ICORE architecture described in Reference 32 is shown in Figure 19.3. The resource section begins with the keyword RESOURCE, followed by curly braces enclosing all object definitions. The definitions are made in C-style and can be attributed with keywords such as REGISTER, PROGRAM_COUNTER, etc. These keywords are not mandatory, but they are used to classify the definitions in order to configure the debugger display. The resource section in Figure 19.3 shows the declaration of the program counter, register file, memories, the four-stage instruction pipeline, and pipeline registers.

The resource model. This describes the available hardware resources and the resource requirements of operations. Resources reflect properties of hardware structures which can be accessed exclusively by one operation at a time. The instruction scheduling of the compiler depends on this information, and the HDL code generator uses it for resource conflict resolution. Besides the definition of all objects, the resource section in a LISA processor description provides information about the availability of hardware resources; in this way, properties such as the number of ports of a register bank or a memory are reflected. Moreover, the behavior section within LISA operations announces the use of processor resources. This takes place in the section header using the keyword USES in conjunction with the resource name and the information whether the used resource is read, written, or both (IN, OUT, or INOUT, respectively).
RESOURCE {
    PROGRAM_COUNTER int          PC;
    REGISTER        signed int   R[0..7];
    DATA_MEMORY     signed int   RAM[0..255];
    PROGRAM_MEMORY  unsigned int ROM[0..255];

    PIPELINE ppu_pipe = { FI; ID; EX; WB };
    PIPELINE_REGISTER IN ppu_pipe {
        bit[6] Opcode;
        short  operandA;
        short  operandB;
    };
}

FIGURE 19.3 Specification of the memory model.
RESOURCE {
    REGISTER    unsigned int R([0..7])6;
    DATA_MEMORY signed int   RAM([0..15]);
}

OPERATION NEG_RM {
    BEHAVIOR USES (IN R[]; OUT RAM[];) {
        /* C-code */
        RAM[address] = (-1) * R[index];
    }
}

FIGURE 19.4 Specification of the resource model.
For illustration purposes, sample LISA code taken from the ICORE architecture is shown in Figure 19.4. The availability of resources is defined by enclosing the C-style resource definition in round braces, followed by the number of simultaneously allowed accesses; if the number is omitted, one allowed access is assumed. The figure shows the declaration of a register bank and a memory with six ports and one port, respectively. Furthermore, the behavior section of the operation announces the use of these hardware resources for read and write.

The instruction set model. This identifies valid combinations of hardware operations and admissible operands. It is expressed by the assembly syntax, the instruction word coding, and the specification of legal operands and addressing modes for each instruction. Compilers and assemblers can identify instructions based on this model. The same information is used in the reverse process of decoding and disassembling. In LISA, the instruction set model is captured within operations. Operation definitions collect the description of different properties of the instruction set model, which are defined in several sections:

• The CODING section describes the binary image of the instruction word.
• The SYNTAX section describes the assembly syntax of instructions, operands, and execution modes.
• The SEMANTICS section specifies the transition function of the instruction.
OPERATION COMPARE_IMM {
    DECLARE {
        LABEL index;
        GROUP src1, dest = { register };
    }
    CODING    { 0b10011 index=0bx[5] src1 dest }
    SYNTAX    { "CMP" src1 ~"," index ~"," dest }
    SEMANTICS { CMP (dest,src1,index) }
}

FIGURE 19.5 Specification of the instruction set model.
OPERATION register {
    DECLARE { LABEL index; }
    CODING  { index=0bx[4] }
    EXPRESSION { R[index] }
}

OPERATION ADD {
    DECLARE {
        GROUP src1, src2, dest = { register };
    }
    CODING { 0b010010 src1 src2 dest }
    BEHAVIOR {
        /* C-code */
        dest = src1 + src2;
        saturate(&dest);
    }
}

FIGURE 19.6 Specification of the behavioral model.
Figure 19.5 shows an excerpt of the ICORE LISA model contributing instruction set model information for the compare-immediate instruction. The DECLARE section contains local declarations of identifiers and admissible operands. Operation register is not shown in the figure but comprises the definition of the valid coding and syntax for src1 and dest, respectively.

The behavioral model. This abstracts the activities of hardware structures to operations changing the state of the processor for simulation purposes. The abstraction level of this model can range widely between the hardware implementation level and the level of HLL statements. The BEHAVIOR and EXPRESSION sections within LISA operations describe components of the behavioral model. Here, the behavior section contains pure C-code that is executed during simulation, whereas the expression section defines the operands and execution modes used in the context of operations. An excerpt of the ICORE LISA model is shown in Figure 19.6. Depending on the coding of the src1, src2, and dest fields, the behavior code of operation ADD works with the respective registers of register bank R. As arbitrary C-code is allowed, function calls can be made to libraries which are later linked to the executable software simulator.

The timing model. This specifies the activation sequence of hardware operations and units. The instruction latency information lets the compiler find an appropriate schedule and provides timing relations between operations for simulation and implementation. Several parts of a LISA model contribute to the timing model. First, pipelines are declared in the resource section; the declaration starts with the keyword PIPELINE, followed by an identifying name and the list of stages. Second, operations are assigned to pipeline stages using the keyword IN together with the name of the pipeline and the identifier of the respective stage, such as:

    OPERATION name_of_operation IN ppu_pipe.EX    (19.1)
RESOURCE {
    PIPELINE ppu_pipe = { FI; ID; EX; WB };
}

OPERATION CORDIC IN ppu_pipe.EX {
    ACTIVATION { WriteBack }
    BEHAVIOR {
        PIPELINE_REGISTER(ppu_pipe, EX/WB).ResultE = cordic();
    }
}

OPERATION WriteBack IN ppu_pipe.WB {
    BEHAVIOR {
        R[value] = PIPELINE_REGISTER(ppu_pipe, EX/WB).ResultE;
    }
}

FIGURE 19.7 Specification of the timing model.
Third, the ACTIVATION section in the operation description is used to activate other operations in the context of the current instruction. The activated operations are launched as soon as the instruction enters the pipeline stage to which the activated operation is assigned. Non-assigned operations are launched in the pipeline stage of their activation. To exemplify this, Figure 19.7 shows sample LISA code taken from the ICORE architecture. Operations CORDIC and WriteBack are assigned to stages EX and WB of pipeline ppu_pipe, respectively. Here, operation CORDIC activates operation WriteBack, which will be launched in the following cycle (in correspondence to the spatial ordering of pipeline stages) in case of an undisturbed flow of the pipeline. Moreover, in the ACTIVATION section, pipelines are controlled by means of the predefined functions stall, shift, flush, insert, and execute, which are automatically provided by the LISA environment for each pipeline declared in the resource section. All these pipeline control functions can be applied to single stages as well as to whole pipelines, for example:

    PIPELINE(ppu_pipe,EX/WB).stall();    (19.2)
Using this very flexible mechanism, arbitrary pipelines, hazards, and mechanisms like forwarding can be modelled in LISA.

The micro-architecture model. This allows grouping of hardware operations to functional units and contains the exact micro-architecture implementation of structural components such as adders, multipliers, etc. This enables the HDL generator to generate the appropriate HDL code from a more abstract specification. In analogy to the syntax of the VHDL language, operation grouping to functional units is formalized using the keyword ENTITY in the resource section of the LISA model, for example:

    ENTITY Alu { Add, Sub }    (19.3)
Here, LISA operations Add and Sub are assigned to the functional unit Alu. Information on the exact micro-architectural implementation of structural components can be included in the LISA model, for example, by calling DesignWare components [33] from within the behavior section or by inlining HDL code.
19.4 LISA Processor Design Platform
The LISA processor design platform (LPDP) is an environment that allows the automatic generation of software development tools for architecture exploration, hardware implementation, software development tools for application design, and hardware–software co-simulation interfaces from one sole specification of the target architecture in the LISA language. Figure 19.8 shows the components of the LPDP environment.

FIGURE 19.8 LISA processor design environment. [Figure: the LISA architecture specification at the center drives four activities: architecture exploration with C compiler, assembler, linker, and simulator (hardware designer); architecture implementation; software application design with C-compiler, assembler/linker, and simulator/debugger (software designer); and integration and verification of the system-on-chip (system integrator).]
19.4.1 Hardware Designer Platform — For Exploration and Processor Generation
As indicated in Section 19.3, architecture design requires the designer to work in two fields (see Figure 19.9): on the one hand, the development of the software part including compiler, assembler, linker, and simulator, and on the other hand, the development of the target architecture itself. The software simulator produces profiling data and thus may answer questions concerning the instruction set, the performance of an algorithm, and the required size of memory and registers. The required silicon area or power consumption can only be determined in conjunction with a synthesizable HDL model. To accommodate these requirements, the LISA hardware designer platform can generate the following tools:

• LISA language debugger for debugging the instruction-set with a graphical debugger frontend.
• Exploration C-compiler for the non-critical parts of the application.
• Exploration assembler which translates text-based instructions into object code for the respective programmable architecture.
• Exploration linker which is controlled by a dedicated linker command file.
• Instruction-set architecture (ISA) simulator providing extensive profiling capabilities, such as instruction execution statistics and resource utilization.

Besides the ability to generate a set of software development tools, synthesizable HDL code (both VHDL and Verilog) for the processor's control path and instruction decoder can be generated automatically from the LISA processor description. This also comprises the pipeline and pipeline controller, including complex interlocking mechanisms, forwarding, etc.

FIGURE 19.9 Exploration and implementation. [Figure: the LISA description of the target architecture is processed by the language compiler into the LISA C-compiler, assembler, linker, and simulator for exploration (evaluation results: profiling data, execution speed), and into an HDL description that is run through synthesis tools to a gate-level model for implementation (evaluation results: chip size, clock speed, power consumption).]
For the data path, hand-optimized HDL code has to be inserted manually into the generated model. This approach has been chosen because the data path typically represents the critical part of the architecture in terms of power consumption and speed (critical path). It is obvious that deriving both the software tools and the hardware implementation model from one sole specification of the architecture in the LISA language has significant advantages: only one model needs to be maintained; changes to the architecture are applied automatically to the software tools and the implementation model; and the consistency problems among the software tools, and between software tools and implementation model, are reduced significantly.
19.4.2 Software Designer Platform — For Software Application Design
To cope with the requirements of functionality and speed in the software design phase, the tools generated for this purpose are an enhanced version of the tools generated during the architecture exploration phase. The generated simulation tools are enhanced in speed by applying the compiled simulation principle [34] — where applicable — and are faster by one to two orders of magnitude than the tools currently provided by architecture vendors. The compiled simulation principle requires that the content of the program memory not change during the simulation run, which holds true for most DSPs. However, for architectures running the program from external memory, or working with operating systems which load/unload applications to/from internal program memory, this simulation technique is not suitable. For this purpose, an interpretive simulator is also provided.
19.4.3 System Integrator Platform — For System Integration and Verification
Once the processor software simulator is available, it must be integrated and verified in the context of the whole system (SOC), which can include a mixture of different processors, memories, and interconnect components. In order to support system integration and verification, the LPDP system integrator platform provides a well-defined application programmer interface (API) to interconnect the instruction-set simulator generated from the LISA specification with other simulators. The API allows the simulator to be controlled by stepping, running, and setting breakpoints in the application code, and provides access to the processor resources.
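As a rough illustration of such an interface, the following C sketch shows what a co-simulation API of this kind might look like. All names and signatures here are hypothetical illustrations, not the actual LPDP API:

    /* Hypothetical co-simulation interface sketch; illustrative names only. */
    typedef struct lisa_sim lisa_sim_t;             /* opaque simulator handle       */

    lisa_sim_t *sim_open(const char *application);  /* load application object code  */
    void        sim_close(lisa_sim_t *sim);

    int  sim_step(lisa_sim_t *sim);                 /* advance one cycle/instruction */
    int  sim_run(lisa_sim_t *sim);                  /* run until breakpoint or exit  */
    void sim_set_breakpoint(lisa_sim_t *sim, unsigned pc);

    /* resource access for the surrounding system simulation */
    unsigned long sim_read_register(lisa_sim_t *sim, const char *name);
    void          sim_write_memory(lisa_sim_t *sim, unsigned addr, unsigned long value);

A system simulation backplane would typically call sim_step() once per system cycle and exchange data with other component simulators through the resource-access functions.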
The following sections present the different areas addressed by the LISA processor design platform in more detail — software development tools and HDL code generation. Additionally, Section 19.7 demonstrates the high quality of the generated software development tools by comparing them to those shipped by the processor vendors.
19.5 SW Development Tools
The ability to automatically generate HLL C-compilers, assemblers, linkers, and ISA simulators from LISA processor models enables the designer to explore the design space rapidly. In this section, specialties and requirements of these tools are discussed, with particular focus on the different simulation techniques.
19.5.1 Assembler and Linker
The LISA assembler processes textual assembly source code and transforms it into linkable object code for the target architecture. The transformation is characterized by the instruction-set information defined in a LISA processor description. Besides the processor-specific instruction-set, the generated assembler provides a set of pseudo-instructions (directives) to control the assembling process and initialize data. Section directives enable the grouping of assembled code into sections which can be positioned separately in memory by the linker. Symbolic identifiers for numeric values and addresses are standard assembler features and are supported as well. Moreover, besides mnemonic-based instruction formats, C-like algebraic assembly syntax can be processed by the LISA assembler. The linking process is controlled by a linker command file which keeps a detailed model of the target memory environment and an assignment table of the module sections to their respective target memories. Moreover, the linker can be provided with an additional memory model, separate from the memory configuration in the LISA description, which allows linking code into external memories outside the architecture model.
19.5.2 Simulator
Due to the large variety of architectures and the facility to develop models on different levels of abstraction in the domains of time and architecture (see Section 19.3), the LISA software simulator incorporates several simulation techniques, ranging from the most flexible interpretive simulation to more application- and architecture-specific compiled simulation techniques. Compiled simulators offer a significant increase in instruction (cycle) throughput; however, the compiled simulation technique is not applicable in every case. To cope with this problem, the most appropriate simulation technique for the desired purpose (debugging, profiling, verification), architecture (instruction-accurate, cycle-accurate), and application (DSP kernel, operating system) can be chosen before the simulation is run. An overview of the simulation techniques available in the generated LISA simulator is given in the following:

The interpretive simulation technique is employed in most commercially available instruction set simulators. In general, interpretive simulators run significantly slower than compiled simulators; however, unlike compiled simulation, this technique can be applied to any LISA model and application.

Dynamically scheduled, compiled simulation reduces simulation time by performing the steps of instruction decoding and operation sequencing prior to simulation. This technique cannot be applied to models using external memories or to applications consisting of self-modifying program code.

Besides the compilation steps performed in dynamic scheduling, static scheduling and code translation additionally implement operation instantiation. While the latter technique is used for instruction-accurate models, the former is suitable for cycle-accurate models including instruction pipelines. Beyond that, the same restrictions apply as for dynamically scheduled simulation.
A detailed discussion of the different compiled simulation techniques is given in the following sections, while performance results are given in Section 19.7. The interpretive simulator is not discussed further.

19.5.2.1 Compiled Simulation
The objective of compiled simulation is to reduce simulation time. Considering instruction set simulation, an efficient run-time reduction can be achieved by performing repeatedly executed operations only once, before the actual simulation is run, thus inserting an additional translation step between application load and simulation. The preprocessing of the application code can be split into three major steps [35] (a schematic C sketch of the resulting table structure is given below):

1. Within the step of instruction decoding, instructions, operands, and modes are determined for each instruction word found in the executable object file. In compiled simulation, instruction decoding is performed only once per instruction, whereas interpretive simulators decode the same instruction multiple times, for example, if it is part of a loop. In this way, instruction decoding is completely omitted at run-time, which reduces simulation time significantly.

2. Operation sequencing is the process of determining all operations to be executed for the accomplishment of each instruction found in the application program. During this step, the program is translated into a table-like structure indexed by the instruction addresses. The table lines contain pointers to functions representing the behavioral code of the respective LISA operations. Although all involved operations are identified during this step, their temporal execution order is still unknown.

3. The determination of the operation timing (scheduling) is performed within the step of operation instantiation and simulation loop unfolding. Here, the behavior code of the operations is instantiated by generating the respective function calls for each instruction in the application program, thus unfolding the simulation loop that drives the simulation into the next state.

Besides fully compiled simulation, which incorporates all of the above steps, partial implementations of the compiled principle are possible by performing only some of them. Each additional step gives a further run-time reduction, but also requires a non-negligible amount of compilation time. The trade-off between compilation time and simulation time is (qualitatively) shown in Figure 19.10. Two levels of compiled simulation are of particular interest: dynamic scheduling, and static scheduling or code translation. In the case of dynamic scheduling, the task of selecting operations from overlapping instructions in the pipeline is performed at run-time of the simulation; static scheduling already schedules the operations at compile-time.
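The following minimal C sketch illustrates the table-like structure produced by steps 1 and 2 and the dispatch loop that replaces run-time decoding. It assumes purely sequential control flow and uses invented names; it is an illustration of the principle, not the actual generated simulator code:

    #include <stddef.h>

    typedef void (*op_fn)(void);        /* behavioral code of one LISA operation */

    #define MAX_OPS_PER_INSN 4

    typedef struct {                    /* one line of the table-like structure, */
        op_fn  ops[MAX_OPS_PER_INSN];   /* filled by instruction decoding and    */
        size_t num_ops;                 /* operation sequencing (steps 1 and 2)  */
    } decoded_insn;

    /* Dispatch loop: no decoding at run-time, only calls through the table.
     * Control-flow instructions would update pc instead of the increment. */
    static void simulate(const decoded_insn *table, size_t program_length)
    {
        for (size_t pc = 0; pc < program_length; ++pc)
            for (size_t i = 0; i < table[pc].num_ops; ++i)
                table[pc].ops[i]();
    }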
FIGURE 19.10 Levels of compiled simulation. [Plot: compilation time versus simulation time; from fully interpretive (no compilation effort, slowest simulation) via compile-time decoding, operation sequencing (dynamic scheduling), and operation instantiation (static scheduling, code translation) to fully compiled (highest compilation effort, fastest simulation).]
19.5.2.2 Dynamic Scheduling
As shown in Figure 19.10, dynamic scheduling performs instruction decoding and operation sequencing at compile-time. However, the temporal execution order of LISA operations is determined at simulator run-time. While the operation scheduling is rather simple for instruction-accurate models, it becomes a complex task for models with instruction pipelines. In order to reflect the instructions' timing exactly and to consider all possibly occurring pipeline effects like flushes and stalls, a generic pipeline model is employed that simulates the instruction pipeline at run-time. The pipeline model is parameterized by the LISA model description and can be controlled via predefined LISA operations, which include the following (a schematic C sketch follows the list):

• Insertion of operations into the pipeline (stages)
• Execution of all operations residing in the pipeline
• Pipeline shift
• Removal of operations (flush)
• Halt of the entire pipeline or of particular stages (stall)
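The following C fragment is a schematic sketch of such a generic run-time pipeline model: stages holding lists of operations, plus shift/stall/flush control. Names, sizes, and the simplified stall semantics are invented for the example and do not reproduce the generated simulator's actual data structures:

    #define NUM_STAGES 4                       /* e.g., FI, ID, EX, WB            */
    #define MAX_OPS    8

    typedef void (*op_fn)(void);

    typedef struct {
        op_fn ops[NUM_STAGES][MAX_OPS];        /* operations residing per stage   */
        int   count[NUM_STAGES];
        int   stalled[NUM_STAGES];             /* stall flags set by operations   */
    } pipeline_t;

    static void pipe_insert(pipeline_t *p, int stage, op_fn op)
    {
        p->ops[stage][p->count[stage]++] = op; /* operation injection             */
    }

    static void pipe_execute(const pipeline_t *p) /* run all resident operations  */
    {
        for (int s = 0; s < NUM_STAGES; ++s)
            for (int i = 0; i < p->count[s]; ++i)
                p->ops[s][i]();
    }

    static void pipe_shift(pipeline_t *p)      /* advance the pipeline one cycle  */
    {
        for (int s = NUM_STAGES - 1; s > 0; --s) {
            if (p->stalled[s - 1]) {           /* stalled stage keeps its ops;    */
                p->count[s] = 0;               /* a bubble enters the next stage  */
                continue;
            }
            for (int i = 0; i < p->count[s - 1]; ++i)
                p->ops[s][i] = p->ops[s - 1][i];
            p->count[s]     = p->count[s - 1];
            p->count[s - 1] = 0;
        }
    }

    static void pipe_flush(pipeline_t *p, int stage) /* removal of operations     */
    {
        p->count[stage] = 0;
    }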
Unlike in statically scheduled simulation, operations are inserted into and removed from the pipeline dynamically; that is, each operation injects further operations upon its execution. The information about operation timing is provided in the LISA description, that is, by the activation section as well as by the assignment of operations to pipeline stages (see Section 19.3.2 — timing model). It is obvious that the maintenance of the pipeline model at simulation time is expensive. Execution profiling on the generated simulators for the Texas Instruments TMS320C62xx [36] and TMS320C54x [37] revealed that more than fifty percent of the simulator's run-time is consumed by the simulation of the pipeline. The situation can be improved by implementing the step of operation instantiation, consequently superseding the need for pipeline simulation. This, in turn, implies static scheduling, in other words, determining at compile-time the operation timing that results from overlapping instructions in the pipeline. Although there is no pipeline model in instruction-accurate processor models, it will be shown that operation instantiation also gives a significant performance increase for these models. Beyond that, operation instantiation is relatively easy to implement for instruction-accurate models (in contrast to pipelined models).

19.5.2.3 Static Scheduling
Generally, operation instantiation can be described as the generation of an individual piece of (behavioral) simulator code for each instruction found in the application program. While this is straightforward for instruction-accurate processor models, cycle-true, pipelined models require a more sophisticated approach. Considering instruction-accurate models, the shortest temporal unit that can be executed is an instruction; that means the actions to be performed for the execution of an individual instruction are determined by the instruction alone. In the simulation of pipelined models, the granularity is defined by cycles. However, since several instructions might be active at the same time due to overlapping execution, the actions performed during a single cycle are determined by the respective state of the instruction pipeline. As a consequence, instead of instantiating operations for each single instruction of the application program, behavioral code has to be generated for each occurring pipeline state. Several such pipeline states might exist for each instruction, depending on the execution context of the instruction, that is, the instructions executed in the preceding and following cycles. As pointed out previously, the principle of compiled simulation relies on an additional translation step taking place before the simulation is run. This step is performed by a so-called simulation compiler, which implements the three steps presented in Section 19.5.2.1. Obviously, the simulation compiler is a highly architecture-specific tool, which is therefore retargeted from the LISA model description.
19.5.2.3.1 Operation Instantiation
The objective of static scheduling is the determination of all possible pipeline states according to the instructions found in the application program. For purely sequential pipeline flow, that is, in case no control hazards occur, the determination of the pipeline states can be achieved simply by overlapping consecutive instructions subject to the structure of the pipeline. In order to store the generated pipeline states, pipeline state tables are used, providing an intuitive representation of the instruction flow in the pipeline. Inserting instructions into pipeline state tables is referred to as scheduling in the following.

A pipeline state table is a two-dimensional array storing pointers to LISA operations: one dimension represents the location within the application, the other the location within the pipeline, that is, the stage in which the operation is executed. When a new instruction has to be inserted into the state table, both intra-instruction and inter-instruction precedence must be considered to determine the table elements in which the corresponding operations will be entered. Consequently, the actual time at which an operation is executed depends on the scheduling of the preceding instruction as well as on the scheduling of the operation(s) assigned to the preceding pipeline stage within the current instruction. Furthermore, control hazards causing pipeline stalls and/or flushes influence the scheduling of the instruction following the occurrence of the hazard.

A simplified illustration of the scheduling process is given in Figure 19.11. Figure 19.11(a) shows the pipeline state table after a branch instruction has been inserted, composed of the operations fetch, decode, branch, and update_pc as well as a stall operation. The table columns represent the pipeline stages; the rows represent consecutive cycles (with earlier cycles in upper rows). The arrows indicate activation chains. The scheduling of a new instruction always follows the intra-instruction precedence; that is, fetch is scheduled before decode, decode before branch, and so on. The appropriate array element for fetch is determined by its assigned pipeline stage (FE) and according to inter-instruction precedences. Since the branch instruction follows the add instruction (which has already been scheduled), the fetch operation is inserted below the first operation of add (not shown in Figure 19.11[a]). The other operations are inserted according to their precedences. The stall of pipeline stage FE, which is issued from the decode operation of branch, is processed by tagging the respective table element as stalled. When the next instruction is scheduled, the stall is accounted for by moving the decode operation to the next table row, that is, the next cycle (see Figure 19.11[b]). Pipeline flushes are handled in a similar manner: if a selected table element is marked as flushed, the scheduling of the current instruction is abandoned. Assuming purely sequential instruction flow, the task of establishing a pipeline state table for the entire application program is very straightforward. However, every (sensible) application contains a certain amount of control flow (e.g., loops) interrupting this sequential execution. The occurrence of such control flow instructions makes the scheduling process extremely difficult or, in a few cases, even impossible.
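As a data-structure illustration of the pipeline state table described above, consider this minimal C sketch; the stage names, sizes, and field names are invented for the example:

    typedef void (*op_fn)(void);

    enum { FE, DC, EX, WB, NUM_STAGES };        /* columns: pipeline stages       */

    #define MAX_CYCLES 1024                     /* rows: consecutive cycles       */

    typedef struct {
        op_fn table[MAX_CYCLES][NUM_STAGES];    /* scheduled operation per cell   */
        int   stalled[MAX_CYCLES][NUM_STAGES];  /* cell tagged as stalled         */
        int   flushed[MAX_CYCLES][NUM_STAGES];  /* cell tagged as flushed         */
        int   rows;                             /* rows used so far               */
    } state_table_t;

    /* Scheduling an operation would pick its row as the later of: the row
     * after the preceding instruction's operation in the same stage
     * (inter-instruction precedence) and the row after the previous stage's
     * operation of the same instruction (intra-instruction precedence);
     * a cell tagged as stalled pushes the operation one row further down. */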
FIGURE 19.11 Inserting instructions into pipeline state table. [Two pipeline state tables (a) and (b): columns are the stages FE, DC, EX, and WB; rows are consecutive cycles holding operations such as fetch, decode, incr, add, sub, write_r, branch, and upd_pc, with arrows indicating activation chains. In (a) the decode operation of the branch has tagged an FE-stage element as stalled; in (b) the stall is accounted for by moving the following decode operation down to the next row.]
FIGURE 19.12 Pipeline behavior for a conditional branch. [Table: cycle-by-cycle contents of the pipeline stages PF, FE, DC, AC, RD, and EX while a conditional branch BC addr (at addresses a2,a3, between instruction i1 at a1 and instructions i4, i5, ... at a4, a5, ...) flows through the pipeline, including stall cycles until the condition is evaluated in EX; execution then continues either with the sequential instructions i4, i5, i6, ... or with the target instructions k1, k2, ... starting at address b1.]
Generally, all instructions modifying the program counter cause interrupts in the control flow. Furthermore, only instructions providing an immediate target address, that is, branches and calls whose target address is known at compile-time, can be scheduled statically. If indirect branches or calls occur, switching back to dynamic scheduling at run-time is unavoidable. Fortunately, most control flow instructions can be scheduled statically. As an example, Figure 19.12 shows the pipeline states for a conditional branch instruction as found in the TMS320C54x's instruction-set. Since the respective condition cannot be evaluated until the instruction is executed, scheduling has to be performed for both eventualities (condition true or false), splitting the program into alternative execution paths. The selection of the appropriate block of prescheduled pipeline states is performed by switching among different state tables at simulator run-time. In order to prevent the entire pipeline state table from doubling each time a conditional branch occurs, alternative execution paths are left as soon as an already generated state has been reached. Unless several conditional instructions reside in the pipeline at the same time, these alternative paths usually span only a few rows. 19.5.2.3.2 Simulator Instantiation After all instructions of the application program have been processed, and thus the entire operation schedule has been established, the simulator code can be instantiated. The simulation compiler backend thereby generates either C code or an operation table with the respective function pointers, both being alternative representations of the application program. Figure 19.13 shows a simplified excerpt of the generated C code for a branch instruction. Cases represent instructions, while a new line starts a new cycle.
switch (pc) {
case 0x1584: fetch(); decode(); sub(); write_registers();
case 0x1585: fetch(); decode(); test_condition(); add();
case 0x1586: branch(); write_registers();
             fetch(); update_pc();
             fetch(); decode();
             fetch(); decode(); load();
             goto_0x1400_;
}

FIGURE 19.13 Generated simulator code.
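For the alternative operation-table representation mentioned above, a rough sketch is the following C fragment, in which the simulation compiler fills an array of function-pointer rows that a trivial main loop walks at run-time; the type and field names are illustrative assumptions, not the actual generated data structures.

typedef void (*lisa_op_t)(void);   /* behavioral code of one operation */

#define MAX_OPS 8

struct sched_row {                 /* one row = one simulated cycle    */
    int       n_ops;
    lisa_op_t ops[MAX_OPS];        /* operations of this cycle, in     */
};                                 /*   pipeline-stage order           */

/* Main simulation loop: walk the prescheduled operation table row by
   row and call the behavioral C code through the stored pointers.    */
void run(const struct sched_row *table, int n_rows)
{
    for (int row = 0; row < n_rows; row++)
        for (int i = 0; i < table[row].n_ops; i++)
            table[row].ops[i]();
}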
19.5.2.4 Instruction-Based Code Translation The need for a scheduling mechanism arises from the presence of an instruction pipeline in the LISA model. However, even instruction-accurate processor models without a pipeline benefit from the step of operation instantiation. The technique applied here is called instruction-based code translation. Due to the absence of instruction overlap, simulator code can be instantiated for each instruction independently, thus simplifying simulator generation to the concatenation of the respective behavioral code specified in the LISA description. In contrast to direct binary-to-binary translation techniques [38], the translation of target-specific into host-specific machine code uses C source code as intermediate format. This keeps the simulator portable, and thus independent of the simulation host. Since instruction-based code translation generates program code that grows linearly in size with the number of instructions in the application, the use of this simulation technique is restricted to small and medium-sized applications (less than ≈10k instructions, depending on model complexity). For large applications, the resulting poorer cache utilization on the simulation host reduces the performance of the simulator significantly.
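As a hedged sketch of what instruction-based code translation produces for a pipeline-less, instruction-accurate model, each target instruction could be emitted as an independent block of host C code; the register names, addresses, and instruction semantics below are invented purely for illustration.

#include <stdint.h>

static int32_t  R[8];   /* modelled register file     */
static uint32_t pc;     /* modelled program counter   */

/* Each target instruction is translated independently; with no
   instruction overlap, the simulator is just the concatenation
   of such blocks.                                              */
static void sim_insn_at_0x0100(void)  /* e.g., ADD R1, R2, R3 */
{
    R[1] = R[2] + R[3];
    pc++;
}

static void sim_insn_at_0x0101(void)  /* e.g., SUB R4, R1, R5 */
{
    R[4] = R[1] - R[5];
    pc++;
}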
19.6 Architecture Implementation As we are targeting the development of application specific instruction set processors (ASIPs), which are highly optimized for one specific application domain, the HDL code generated from a LISA processor description has to fulfill tight constraints to be an acceptable replacement for HDL code handwritten by experienced designers. Especially power consumption, chip area, and execution speed are critical points for this class of architectures. For this reason, the LPDP platform does not claim to be able to efficiently synthesize the complete HDL code of the target architecture. The data path of an architecture in particular is highly critical and must in most cases be optimized manually. Frequently, full-custom design techniques must be used to meet power consumption and clock speed constraints. For this reason, the generated HDL code is limited to the following parts of the architecture:

• Coarse processor structure such as register set, pipeline, pipeline registers, and test-interface.
• Instruction decoder setting data and control signals which are carried through the pipeline and activate the respective functional units executed in context of the decoded instruction.
• Pipeline controller handling different pipeline interlocks, pipeline register flushes, and supporting mechanisms such as data forwarding.

Additionally, hardware operations as they are described in the LISA model can be grouped to functional units (see Section 19.3.2 — micro-architecture model). Those functional units are generated as wrappers, that is, the ports of the functional units as well as the interconnects to the pipeline registers and other functional units are generated automatically, while the content needs to be filled manually with code. Driver conflicts emerging on the interconnects are resolved automatically by the insertion of multiplexers.
The disadvantage of rewriting the data path in the HDL description by hand is that the behavior of hardware operations within those functional units has to be described and maintained twice — on the one hand in the LISA model and on the other hand in the HDL model of the target architecture. Consequently, verification becomes a problem here, which will be addressed in future research.
19.6.1 LISA Language Elements for HDL Synthesis The following sections will show in detail how different parts of the LISA model contribute to the generated HDL model of the target architecture. 19.6.1.1 The Resource Section The resource section provides general information about the structure of the architecture (e.g., registers, memories, and pipelines, see Section 19.3.2 — resource/memory model). Based on this information, the coarse structure of the architecture can be generated automatically. Figure 19.14 shows an excerpt of the resource declaration in the LISA model of the ICORE architecture [32], which was used in our case study. The ICORE architecture has two different register sets: one for general-purpose use named R, consisting of eight separate 32-bit registers, and one for the address registers named AR, consisting of four 11-bit elements. The round brackets indicate the maximum number of simultaneous accesses allowed for the respective register bank — six for the general-purpose register bank R and one for the address register set. From that, the respective number of access ports to the register banks can be generated automatically. With this information — bit-true widths, ranges, and access ports — the register banks can be easily synthesized. Moreover, a data and a program memory resource are declared, both 32 bits wide and with just one allowed access per cycle. Since memory types vary widely and are generally very technology dependent, they cannot be further specified in the LISA model; instead, wrappers are generated with the appropriate number of access ports. Before synthesis, the wrappers need to be filled manually with code for the respective technology. The resources labelled as PORT are accessible from outside the model and can be attached to a testbench — in the ICORE, the RESET pin and the STATE_BUS. Besides the processor resources such as memories, ports, and registers, pipelines and pipeline registers are also declared. The ICORE architecture contains a four-stage instruction pipeline consisting of the stages FI (instruction fetch), ID (instruction decode), EX (instruction execution), and WB (write-back to registers). In between those pipeline stages, pipeline registers are located which forward information about the instruction such as instruction opcode, operand registers, etc. The declared pipeline registers are instantiated multiple times between the stages and are completely generated from the LISA model. For the pipeline and the stages, entities are created which are filled, in a subsequent phase of the HDL generator run, with code for functional units, instruction decoder, pipeline controller, etc.

RESOURCE {
    REGISTER S32     R([0..7])6;    /* GP Registers      */
    REGISTER bit[11] AR([0..3]);    /* Address Registers */

    DATA_MEMORY S32    RAM([0..255]);  /* Memory Space    */
    PROGRAM_MEMORY U32 ROM([0..255]);  /* Instruction ROM */

    PORT bit[1]  RESET;       /* Reset pin           */
    PORT bit[32] STATE_BUS;   /* Processor state bus */

    PIPELINE ppu_pipe = { FI; ID; EX; WB };
    PIPELINE_REGISTER IN ppu_pipe {
        bit[6] Opcode;
        ...
    };
}

FIGURE 19.14 Resource declaration in the LISA model of the ICORE architecture.
FIGURE 19.15 Entity hierarchy in generated HDL model. (The architecture entity contains the register, pipeline, and memory entities of the base structure; the pipeline entity contains the stage entities FE, DC, and EX with the pipeline registers FE/DC and DC/EX in between; LISA entities such as Branch, ALU, and Shifter reside within their respective stage entities.)
19.6.1.2 Grouping Operations to Functional Units The LISA language describes the target architecture's behavior and timing at the granularity of hardware operations, whereas synthesis requires hardware operations to be grouped into functional units that can then be filled with hand-optimized HDL code for the data path. A well-known construct from the VHDL language was adopted for this purpose: the ENTITY (see Section 19.3.2 — micro-architecture model). Using the ENTITY to group hardware operations into a functional unit not only provides essential information for the HDL code generator, but also for retargeting the HLL C-compiler, which requires information about the availability of hardware resources to schedule instructions. As indicated in Section 19.6.1.1, the HDL code derived from the LISA resource section already comprises a pipeline entity including further entities for each pipeline stage and the respective pipeline registers. The entities defined in the LISA model are now part of the respective pipeline stages as shown in Figure 19.15. Here, a Branch entity is placed into the entity of the Decode stage. Moreover, the EX stage contains an ALU and a Shifter entity. As it is possible in LISA to assign hardware operations to pipeline stages, this information is sufficient to locate the functional units within the pipeline they are assigned to. As already pointed out, the entities of the functional units are wrappers which need to be filled with HDL code by hand. Nevertheless, Section 19.6.2.1 will show that by far the largest part of the target architecture can be generated automatically from a LISA model. 19.6.1.3 Generation of the Instruction Decoder The generated HDL decoder is derived from information in the LISA model on the coding of instructions (see Section 19.3.2 — instruction-set model). Depending on the structuring of the LISA architecture description, decoder processes are generated in several pipeline stages. The specified signal paths within the target architecture can be divided into data signals and control signals. The control signals are a straightforward derivation of the operation activation tree which is part of the LISA timing model (see Section 19.3.2 — timing model). The data signals are explicitly modelled by the designer by writing values into pipeline registers, and implicitly fixed by the declaration of used resources in the behavior sections of LISA operations.
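The following C-level sketch illustrates the kind of decode logic that can be derived from coding and activation information; the generator's real output is HDL, and all field positions, opcodes, and signal names here are hypothetical assumptions for illustration only.

#include <stdint.h>

struct ctrl_signals {
    int alu_en, shifter_en, branch_en;  /* unit activation signals */
    uint8_t src1, src2, dst;            /* register operand fields */
};

void decode(uint32_t insn, struct ctrl_signals *c)
{
    uint8_t opcode = (insn >> 26) & 0x3F;   /* bit[6] opcode field  */
    c->alu_en = c->shifter_en = c->branch_en = 0;
    switch (opcode) {                       /* activation tree:     */
    case 0x01: c->alu_en     = 1; break;    /*   decode -> alu      */
    case 0x02: c->shifter_en = 1; break;    /*   decode -> shifter  */
    case 0x03: c->branch_en  = 1; break;    /*   decode -> branch   */
    }
    c->dst  = (insn >> 21) & 0x07;          /* operand fields, as   */
    c->src1 = (insn >> 18) & 0x07;          /*   carried onward in  */
    c->src2 = (insn >> 15) & 0x07;          /*   pipeline registers */
}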
19.6.2 Implementation Results The ICORE, which was used in our case study, is a low-power application specific instruction set processor (ASIP) for DVB-T acquisition and tracking algorithms. It has been developed in cooperation with Infineon Technologies. The primary tasks of this architecture are FFT window positioning, sampling-clock synchronization for interpolation/decimation, and carrier frequency offset estimation. In a previous project, this architecture was completely designed by hand using semi-custom design, and a large amount of effort was spent in optimizing the architecture towards extremely low power consumption while keeping the clock frequency at 120 MHz. At that time, a LISA model was already realized for architecture exploration purposes and for verifying the model against the handwritten HDL implementation.
FIGURE 19.16 The complete generated HDL model. (The four-stage ICORE pipeline FI, ID, EX, WB with instruction fetch, decoder, pipeline control, and write-back; functional-unit entities such as DAG, ZOLP, Branch, Shifter, Mult, Minmax, ALU, MOVE, IIC, Bitmanip, Addsub, and Mem; plus registers, memory, and I/O control. Manually written data-path entities are distinguished from automatically generated processes.)
Except for the data path within functional units, the HDL code of the architecture has been generated completely. Figure 19.16 shows the composition of the model. The dark boxes have been filled manually with HDL code, whereas the light boxes and interconnects are the result of the generation process. 19.6.2.1 Comparison of Development Time The LISA model of the ICORE as well as the original handwritten HDL model of the ICORE architecture have been developed by one designer. The initial manual realization of the HDL model (without the time needed for architecture exploration) took approx. three months. As already indicated, a LISA model was built in this first realization of the ICORE for architecture exploration and verification purposes. It took the designer approx. one month to learn the LISA language and to create a cycle-accurate LISA model. After completion of the HDL generator, it took another two days to refine the LISA model to RTL-accuracy. The handwritten functional units (data path), which were added manually to the generated HDL model, could be completed in less than a week. This comparison clearly indicates that the time-expensive work in realizing the HDL model was creating the structure, controller, and decoder of the architecture. In addition, a major decrease in total architecture design time can be seen, as the LISA model results from the design exploration phase. 19.6.2.2 Gate Level Synthesis To verify the feasibility of automatically generating HDL code from LISA architecture descriptions in terms of power consumption, clock speed, and chip area, a gate level synthesis was carried out. The model has not been changed (i.e., manually optimized) to enhance the results.
19.6.2.2.1 Timing and Size Comparison The results of the gate-level synthesis with timing and area optimization were compared to the handwritten ICORE model, which comprised the same architectural features. Moreover, the same synthesis scripts were used for both models. It shall be emphasized that the performance values are nearly the same for both models. Furthermore, it is interesting that the same critical paths were found in both the handwritten and the generated model. The critical paths occur exclusively in the data path, which confirms the presumption that the data path is the most critical part of the architecture and should thus not be generated automatically from an abstract processor model. 19.6.2.2.2 Critical Path The synthesis has been performed with a clock period of 8 nsec, which equals a frequency of 125 MHz. The critical path, starting from the pipeline register through the shifter unit and multiplexer to the next pipeline register, violates this timing constraint by 0.36 nsec. This matches the handwritten ICORE model, which was manually improved from this starting point at gate level. The longest combinational path of the ID stage runs through the decoder and the DAG entity and amounts to 3.7 nsec. Therefore, the generated decoder does not affect the critical path in any way. 19.6.2.2.3 Area The synthesized area was a minor criterion, due to the fact that the constraints for the handwritten ICORE model are not area sensitive. The total area of the generated ICORE model is 59,009 gates. The combinational area takes 57% of the total area. The handwritten ICORE model takes a total area of 58,473 gates. The most complex part of the generated ICORE is the decoder. The area of the automatically generated decoder in the ID stage is 4693 gates, whereas the area of the handwritten equivalent is 5500 gates. This result must be considered carefully, as the control logic varies in some implemented features — for example, the handwritten decoder and program flow controller support an idle and a suspended state of the core. 19.6.2.2.4 Power Consumption Comparison Figure 19.17 shows the comparison of power consumption of the handwritten versus the generated ICORE realization. The handwritten model consumes 12.64 mW, whereas the implementation generated from a LISA model consumes 14.51 mW. The reason for the slightly worse power consumption of the generated model versus the handwritten one is the early version of the LISA HDL generator, which in its current state allows access to all registers and memories within the model via the test-interface. Without this unnecessary overhead, the same results as for the hand-optimized model are achievable.
FIGURE 19.17 Power consumption of different ICORE realizations (handwritten ICORE: 12.64 mW; generated ICORE: 14.51 mW).
FIGURE 19.18 Graphical debugger frontend.
To summarize, it could be shown in this chapter that it is feasible to generate efficient HDL code from architecture descriptions in the LISA language.
19.7 Tools for Application Development The LPDP application software development tool-suite includes an HLL C-compiler, assembler, linker, and simulator as well as a graphical debugger frontend. With these tools, a complete software development environment is available which ranges from the C/assembly source file up to simulation within a comfortable graphical debugger frontend. The tools are an enhanced version of those used for architecture exploration. For the software simulator, the enhancements concern the ability to graphically visualize the debugging process of the application under test. The LISA debugger frontend ldb is a generic GUI for the generated LISA simulator (see Figure 19.18). It visualizes the internal state of the simulation process. Both the C-source code and the disassembly of the application as well as all configured memories and (pipeline) registers are displayed. All contents can be changed in the frontend at run-time of the application. The progress of the simulator can be controlled by stepping and running through the application and setting breakpoints. The code generation tools (assembler and linker) are enhanced in functionality as well. The assembler supports more than 30 common assembler directives, labels, and symbols, named user sections, and generation of a source listing and symbol table, and provides detailed error reporting and debugging facilities, whereas the linker is driven by a powerful linker command file with the ability to link sections into different address spaces, paging support, and the possibility to define user-specific memory models.
19.7.1 Examined Architectures To examine the quality of the generated software development tools, four different architectures have been considered. The architectures were carefully chosen to cover a broad range of architectural characteristics and are widely used in the field of digital signal processing (DSP) and microcontrollers (µC). Moreover, the abstraction level of the models ranges from phase-accuracy (TMS320C62x) to instruction-set accuracy (ARM7): ARM7. The ARM7 core is a 32-bit microcontroller from Advanced RISC Machines Ltd [39]. The realization of a LISA model of the ARM7 µC at instruction-set accuracy took approx. two weeks.
ADSP2101. The Analog Devices ADSP2101 is a 16-bit fixed-point DSP with a 20-bit instruction-word width [40]. The realization of the LISA model of the ADSP2101 at cycle-accuracy took approx. 3 weeks. TMS320C54x. The Texas Instruments TMS320C54x is a high-performance 16-bit fixed-point DSP with a six-stage instruction pipeline [37]. The realization of the model at cycle-accuracy (including pipeline behavior) took approx. 8 weeks. TMS320C62x. The Texas Instruments TMS320C62x is a general-purpose fixed-point DSP based on a very long instruction-word (VLIW) architecture containing an eleven-stage pipeline [36]. The realization of the model at phase-accuracy (including pipeline behavior) took approx. 6 weeks. These architectures were modelled at the respective abstraction level with LISA, and software development tools were generated successfully. The speed of the generated tools was then compared with that of the tools shipped by the respective architecture vendor. Of course, the LISA tools work at the same level of accuracy as the vendor tools. The vendor tools are exclusively using the interpretive simulation technique.
19.7.2 Efficiency of the Generated Tools Measurements took place on an AMD Athlon system with a clock frequency of 800 MHz. The system is equipped with 256 MB of RAM and is part of a networked environment. It runs under the operating system Linux, kernel version 2.2.14. Tool compilation was performed with GNU GCC, version 2.92. The generation of the complete tool-suite (HLL C-compiler, simulator, assembler, linker, and debugger frontend) takes, depending on the complexity of the considered model, between 12 sec (ARM7 µC, instruction-set accurate) and 67 sec (C6x DSP, phase-accurate). Due to the early stage of research on the retargetable compiler (see Section 19.8), no results on code quality are presented. 19.7.2.1 Performance of the Simulator Figures 19.19 to 19.22 show the speed of the generated simulators in instructions per second and cycles per second, respectively. Simulation speed was quantified by running an application on the respective simulator and counting the number of processed instructions/cycles. The set of simulated applications on the architectures comprises a simple 20-tap FIR filter, an ADPCM (Adaptive Differential Pulse Code Modulation) G.721 coder/decoder, and a GSM speech codec. For the ARM7, an ATM-QFC protocol application was additionally run, which is responsible for flow control and configuration in an ATM port-processor chip. As expected, the compiled simulation technique applied by the generated LISA simulators outperforms the vendor simulators by one to two orders of magnitude.
FIGURE 19.19 Speed of the ARM7 µC at instruction-accuracy. (Speed in mega-instructions per second for the FIR, ADPCM, and ATM-QFC benchmarks; compared are LISA compiled simulation with code translation, LISA compiled simulation with dynamic scheduling, the interpretive ARMulator, and the real ARM7 hardware running at 25 MHz.)

FIGURE 19.20 Speed of the ADSP2101 DSP at cycle-accuracy. (Speed in mega-cycles per second for the FIR, ADPCM, and GSM benchmarks; compared are LISA compiled simulation with code translation, LISA compiled simulation with dynamic scheduling, and the interpretive Analog Devices xsim 2101, which reaches about 0.01 mega-cycles per second.)

FIGURE 19.21 Speed of C54x DSP at cycle-accuracy. (Speed in mega-cycles per second for the FIR, ADPCM, and GSM benchmarks; compared are LISA compiled simulation with static scheduling, LISA compiled simulation with dynamic scheduling, and the interpretive Texas Instruments sim54x, which reaches about 0.075 mega-cycles per second.)

FIGURE 19.22 Speed of the C6x DSP at cycle-accuracy. (Speed in kilo-cycles per second for the FIR, ADPCM, and GSM benchmarks; compared are LISA compiled simulation with dynamic scheduling and the interpretive Texas Instruments simulator, which reaches about 15 kilo-cycles per second.)
As both the ARM7 and the ADSP2101 LISA models contain no instruction pipeline, two different flavors of compiled simulation are applied in the benchmarks: instruction-based code translation and dynamic scheduling (see Section 19.5.2.4). This shows that the highest possible degree of simulation compilation offers an additional speed-up of a factor of 2–7 compared to dynamically scheduled compiled simulation. As explained in Section 19.5.2.4, the speed-up decreases with bigger applications due to cache misses on the simulating host. It is interesting to see that, considering an ARM7 µC running at a frequency of 25 MHz, the software simulator running at 31 MIPS even outperforms the real hardware. This makes application development feasible before the actual silicon is at hand. The LISA model of the C54x DSP is cycle-accurate and contains an instruction pipeline. Therefore, compiled simulation with static scheduling is applied (see Section 19.5.2.3). This pays off with an additional speed-up of a factor of 5 compared to a dynamically scheduled compiled simulator. Due to the superscalar instruction dispatching mechanism used in the C62x architecture, which is highly run-time dependent, the LISA simulator for the C62x DSP uses only compiled simulation with dynamic scheduling. However, the dynamically scheduled compiled simulator still offers a significant speed-up of a factor of 65 compared to the native TI simulator. 19.7.2.2 Performance of Assembler and Linker The generated assembler and linker are not as time critical as the simulator. It shall be mentioned, though, that the performance (i.e., the number of assembled/linked instructions per second) of the automatically generated tools is comparable to that of the vendor tools.
19.8 Requirements and Limitations In this section, the requirements and current limitations of different aspects of processor design using the LISA language are discussed. These affect the modelling capabilities of the language itself as well as the generated tools.
19.8.1 LISA Language Common to all models described in LISA is the underlying zero-delay model. This means that all transitions are provided correctly at each control step. Control steps may be clock phases, clock cycles, instruction cycles, or even higher levels. Events between these control steps are not modelled. However, this property meets the requirements that current co-simulation environments [41–43] place on processor simulators to be used for HW/SW co-design [44,45]. Besides, the LISA language currently contains no formalism to describe memory hierarchies such as multi-level caches. However, existing C/C++ models of memory hierarchies can easily be integrated into the LISA architecture model.
19.8.2 HLL C-compiler Due to the early stage of research, no further details on the retargetable compiler are presented within the scope of this chapter. At its current status, the quality of the generated code is only fair. However, it is evident that the proposed new ASIP design methodology can only be carried out efficiently in the presence of an efficient retargetable compiler. In our case study presented in Section 19.6, major parts of the application were realized in assembly code.
19.8.3 HDL Generator As LISA allows modelling the architecture using a combination of both LISA language elements and pure C/C++ code, certain coding guidelines need to be obeyed in order to generate synthesizable HDL code of the target architecture. Firstly, only the LISA language elements are considered; thus, the usage of C code in the model needs to be limited to the description of the data path, which is not taken into account for
HDL code generation anyway. Secondly, architectural properties which can be modelled in LISA but are not synthesizable include pipelined functional units and multiple-instruction-word decoders.
19.9 Conclusion and Future Work In this chapter we presented the LISA processor design platform LPDP, a novel framework for the design of application-specific instruction-set processors. The LPDP platform helps the architecture designer in different domains: architecture exploration, implementation, application software design, and system integration/verification. In a case study it was shown that an ASIP, the ICORE architecture, was completely realized using this novel design methodology — from exploration to implementation. The implementation results concerning maximum frequency, area, and power consumption were comparable to those of the hand-optimized version of the same architecture realized in a previous project. Moreover, the quality of the generated software development tools was compared to that of the semiconductor vendors' tools. LISA models were realized and tools successfully generated for the ARM7 µC, the Analog Devices ADSP2101, the Texas Instruments C62x, and the Texas Instruments C54x at instruction-set, cycle, and phase-accuracy, respectively. Due to the usage of the compiled simulation principle, the generated simulators run one to two orders of magnitude faster than the vendor simulators. In addition, the generated assembler and linker can compete well in speed with the vendor tools. Our future work will focus on modelling further real-world processor architectures and improving the quality of our retargetable C-compiler. In addition, formal ways to model memory hierarchies will be addressed. For the HDL generator, data path synthesis will be examined in the context of the SystemC modelling language.
References

[1] M. Birnbaum and H. Sachs, How VSIA answers the SOC dilemma. IEEE Computer, 32, 42–50, 1999.
[2] S. Pees, A. Hoffmann, V. Zivojnovic, and H. Meyr, LISA — machine description language for cycle-accurate models of programmable DSP architectures. In Proceedings of the Design Automation Conference (DAC). New Orleans, June 1999.
[3] V. Živojnović, S. Pees, and H. Meyr, LISA — machine description language and generic machine model for HW/SW co-design. In Proceedings of the IEEE Workshop on VLSI Signal Processing. San Francisco, October 1996.
[4] K. Olukotun, M. Heinrich, and D. Ofelt, Digital system simulation: methodologies and examples. In Proceedings of the Design Automation Conference (DAC), June 1998.
[5] J. Rowson, Hardware/software co-simulation. In Proceedings of the Design Automation Conference (DAC), 1994.
[6] R. Stallman, Using and Porting the GNU Compiler Collection, gcc-2.95 ed. Free Software Foundation, Boston, MA, 1999.
[7] G. Araujo, A. Sudarsanam, and S. Malik, Instruction set design and optimization for address computation in DSP architectures. In Proceedings of the International Symposium on System Synthesis (ISSS), 1996.
[8] C. Liem et al., Industrial experience using rule-driven retargetable code generation for multimedia applications. In Proceedings of the International Symposium on System Synthesis (ISSS), September 1995.
[9] D. Engler, VCODE: a retargetable, extensible, very fast dynamic code generation system. In Proceedings of the International Conference on Programming Language Design and Implementation (PLDI), May 1996.
[10] D. Bradlee, R. Henry, and S. Eggers, The Marion system for retargetable instruction scheduling. In Proceedings of the ACM SIGPLAN'91 Conference on Programming Language Design and Implementation. Toronto, Canada, 1991, pp. 229–240.
[11] B. Rau, VLIW compilation driven by a machine description database. In Proceedings of the 2nd Code Generation Workshop. Leuven, Belgium, 1996.
[12] M. Freericks, The nML machine description formalism. Technical Report 1991/15, Technische Universität Berlin, Fachbereich Informatik, Berlin, 1991.
[13] A. Fauth, J. Van Praet, and M. Freericks, Describing instruction set processors using nML. In Proceedings of the European Design and Test Conference. Paris, March 1995.
[14] M. Hartoog et al., Generation of software tools from processor descriptions for hardware/software codesign. In Proceedings of the Design Automation Conference (DAC), June 1997.
[15] W. Geurts et al., Design of DSP systems with chess/checkers. In Proceedings of the 2nd International Workshop on Code Generation for Embedded Processors. Leuven, March 1996.
[16] J. Van Praet et al., A graph based processor model for retargetable code generation. In Proceedings of the European Design and Test Conference (ED&TC), March 1996.
[17] V. Rajesh and R. Moona, Processor modeling for hardware software codesign. In Proceedings of the International Conference on VLSI Design. Goa, India, January 1999.
[18] G. Hadjiyiannis, S. Hanono, and S. Devadas, ISDL: an instruction set description language for retargetability. In Proceedings of the Design Automation Conference (DAC), June 1997.
[19] A. Halambi et al., EXPRESSION: a language for architecture exploration through compiler/simulator retargetability. In Proceedings of the Conference on Design, Automation & Test in Europe (DATE), March 1999.
[20] P. Paulin, Design automation challenges for application-specific architecture platforms. In Proceedings of the SCOPES 2001 — Workshop on Software and Compilers for Embedded Systems, March 2001.
[21] ACE — Associated Compiler Experts, The COSY Compilation System, 2001. http://www.ace.nl/products/cosy.html
[22] T. Morimoto, K. Saito, H. Nakamura, T. Boku, and K. Nakazawa, Advanced processor design using hardware description language AIDL. In Proceedings of the Asia South Pacific Design Automation Conference (ASPDAC), March 1997.
[23] I. Huang, B. Holmer, and A. Despain, ASIA: automatic synthesis of instruction-set architectures. In Proceedings of the SASIMI Workshop, October 1993.
[24] M. Gschwind, Instruction set selection for ASIP design. In Proceedings of the International Workshop on Hardware/Software Codesign, May 1999.
[25] S. Kobayashi et al., Compiler generation in PEAS-III: an ASIP development system. In Proceedings of the SCOPES 2001 — Workshop on Software and Compilers for Embedded Systems, March 2001.
[26] C.-M. Kyung, Metacore: an application specific DSP development system. In Proceedings of the Design Automation Conference (DAC), June 1998.
[27] M. Barbacci, Instruction set processor specifications (ISPS): the notation and its application. IEEE Transactions on Computers, C-30, 24–40, 1981.
[28] R. Gonzales, Xtensa: a configurable and extensible processor. IEEE Micro, 20, 2000.
[29] Synopsys, COSSAP. http://www.synopsys.com
[30] OPNET, http://www.opnet.com
[31] LISA Homepage, ISS, RWTH Aachen, 2001. http://www.iss.rwth-aachen.de/lisa
[32] T. Gloekler, S. Bitterlich, and H. Meyr, Increasing the power efficiency of application-specific instruction set processors using datapath optimization. In Proceedings of the IEEE Workshop on Signal Processing Systems (SIPS). Lafayette, October 2001.
[33] Synopsys, DesignWare Components, 1999. http://www.synopsys.com/products/designware/designware.html
[34] A. Hoffmann, A. Nohl, G. Braun, and H. Meyr, Generating production quality software development tools using a machine description language. In Proceedings of the Conference on Design, Automation & Test in Europe (DATE), March 2001.
[35] S. Pees, A. Hoffmann, and H. Meyr, Retargeting of compiled simulators for digital signal processors using a machine description language. In Proceedings of the Conference on Design, Automation & Test in Europe (DATE). Paris, March 2000.
[36] Texas Instruments, TMS320C62x/C67x CPU and Instruction Set Reference Guide, March 1998.
[37] Texas Instruments, TMS320C54x CPU and Instruction Set Reference Guide, October 1996.
[38] R. Sites et al., Binary translation. Communications of the ACM, 36, 69–81, 1993.
[39] Advanced Risc Machines Ltd., ARM7 Data Sheet, December 1994.
[40] Analog Devices, ADSP2101 User's Manual, September 1993.
[41] Synopsys, Eaglei, 1999. http://www.synopsys.com/products/hwsw
[42] Cadence, Cierto, 1999. http://www.cadence.com/technology/hwsw
[43] Mentor Graphics, Seamless, 1999. http://www.mentor.com/seamless
[44] L. Guerra et al., Cycle and phase accurate DSP modeling and integration for HW/SW co-verification. In Proceedings of the Design Automation Conference (DAC), June 1999.
[45] R. Earnshaw, L. Smith, and K. Welton, Challenges in cross-development. IEEE Micro, 17, 28–36, 1997.
20 State-of-the-Art SoC Communication Architectures

José L. Ayala and Marisa López-Vallejo
Universidad Politécnica de Madrid

Davide Bertozzi and Luca Benini
University of Bologna

20.1 Introduction ... 20-1
20.2 AMBA Bus ... 20-2
    AMBA System Bus • AMBA AHB Basic Operation • Advanced Peripheral Bus • Advanced AMBA Evolutions
20.3 CoreConnect Bus ... 20-7
    Processor Local Bus • On-Chip Peripheral Bus • Device Control Register Bus
20.4 STBus ... 20-10
    Bus Topologies
20.5 Wishbone ... 20-11
    The Wishbone Bus Transactions
20.6 SiliconBackplane MicroNetwork ... 20-12
    System Interconnect Bandwidth • Configuration Resources
20.7 Other On-Chip Interconnects ... 20-14
    Peripheral Interconnect Bus • Avalon • CoreFrame
20.8 Analysis of Communication Architectures ... 20-15
    Scalability Analysis
20.9 Packet-Switched Interconnection Networks ... 20-20
20.10 Conclusions ... 20-21
References ... 20-21
20.1 Introduction The current high levels of on-chip integration allow for the implementation of increasingly complex Systems-on-Chip (SoC), consisting of heterogeneous components such as general-purpose processors, Digital Signal Processors (DSPs), coprocessors, memories, I/O units, and dedicated hardware accelerators. In this context, MultiProcessor Systems-on-Chip (MPSoC) are emerging as an effective solution to meet the demand for computational power posed by application domains such as network processors and parallel media processors. MPSoCs combine the advantages of parallel processing with the high integration levels of SoCs. It is expected that future MPSoCs will integrate hundreds of processing units and storage elements, and their performance will be increasingly interconnect dominated [1]. Interconnect technology and
architecture will become the limiting factor for achieving operational goals, and the efficient design of low-power, high-performance on-chip communication architectures will pose novel challenges. The main issue regards the scalability of system interconnects, since the trend toward system integration is expected to continue. State-of-the-art on-chip buses rely on shared communication resources and on an arbitration mechanism that is in charge of serializing bus access requests. This widely adopted solution unfortunately suffers from power and performance scalability limitations; therefore, a lot of effort is being devoted to the development of advanced bus topologies (e.g., partial or full crossbars, bridged buses) and protocols, some of which are already implemented in commercially available products. In the long run, a more aggressive approach will be needed, and a design paradigm shift will most probably lead to packetized on-chip communication based on micronetworks of interconnects or Networks-on-Chip (NoC) [2,3]. This chapter focuses on state-of-the-art SoC communication architectures, providing an overview of the most relevant ones from an industrial and research viewpoint. Beyond describing the distinctive features of each of them, the chapter sketches the main evolution guidelines for these architectures by means of a protocol and topology analysis framework. Finally, some basic concepts on packet-switched interconnection networks will be put forward. Open bus specifications such as Advanced MicroController Bus Architecture (AMBA) and CoreConnect will obviously be described in more detail, providing the background needed to understand the more general description of proprietary industrial bus architectures, while at the same time making it possible to assess their contribution to the advance of the field.
20.2 AMBA Bus AMBA is a bus standard which was originally conceived by ARM to support communication among ARM processor cores. However, nowadays AMBA is one of the leading on-chip busing systems because it is licensed and deployed for use with third-party Intellectual Property (IP) cores [4]. Designed for custom silicon, the AMBA specification provides standard bus protocols for connecting on-chip components, custom logic, and specialized functions. These bus protocols are independent of the ARM processor and generalized for different SoC structures. AMBA defines a segmented bus architecture, wherein two bus segments are connected with each other via a bridge that buffers data and operations between them. A system bus is defined, which provides a high-speed, high-bandwidth communication channel between embedded processors and high-performance peripherals. Two system buses are actually specified: the AMBA High-Speed Bus (AHB) and the Advanced System Bus (ASB). Moreover, a low-performance and low-power peripheral bus (called Advanced Peripheral Bus, APB) is specified, which accommodates communication with general-purpose peripherals and is connected to the system bus via a bridge, acting as the only APB master. The overall AMBA architecture is illustrated in Figure 20.1.
20.2.1 AMBA System Bus ASB is the first generation of AMBA system bus, and sits above APB in that it implements the features required for high-performance systems, including burst transfers, pipelined transfer operation, and multiple bus masters. AHB is a later generation of AMBA bus which is intended to address the requirements of high-performance synthesizable designs with high clock frequencies. ASB is used for simpler, more cost-effective designs, whereas more sophisticated designs call for the employment of the AHB. For this reason, a detailed description of AHB follows. The main features of AMBA AHB can be summarized as follows: Multiple bus masters. Optimized system performance is obtained by sharing resources among different bus masters. A simple request-grant mechanism is implemented between the arbiter and each bus master. In this way, the arbiter ensures that only one bus master is active on the bus, and also that when no masters are requesting the bus, a default master is granted.
FIGURE 20.1 Schematic architecture of AMBA bus. (An ARM CPU, test interface controller, SRAM, external memory, SDRAM controller, and color LCD controller sit on the high-speed AMBA AHB system bus, which is connected through a bridge to the low-power AMBA APB peripheral bus hosting audio codec, smart card, synchronous serial port, and UART interfaces.)
Pipelined and burst transfers. Address and data phases of a transfer occur during different clock periods. In fact, the address phase of any transfer occurs during the data phase of the previous transfer. This overlapping of address and data is fundamental to the pipelined nature of the bus and allows for high-performance operation, while still providing adequate time for a slave to provide the response to a transfer. This also implies that ownership of the data bus is delayed with respect to ownership of the address bus. Moreover, support for burst transfers allows for efficient use of memory interfaces by providing transfer information in advance. Split transactions. They maximize the use of bus bandwidth by enabling high-latency slaves to release the system bus during dead time while they complete processing of their access requests. Wide data bus configurations. Support for high-bandwidth data-intensive applications is provided using wide on-chip memories. System buses support 32-, 64-, and 128-bit data bus implementations with a 32-bit address bus, as well as smaller byte and half-word designs. Nontristate implementation. AMBA AHB implements separate read and write data buses in order to avoid the use of tristate drivers. In particular, master and slave signals are multiplexed onto the shared communication resources (read and write data buses, address bus, control signals). A typical AMBA AHB system contains the following components: AHB master. Only one bus master at a time is allowed to initiate and complete read and write transactions. Bus masters drive out the address and control signals, and the arbiter determines which master has its signals routed to all of the slaves. A central decoder controls the read data and response signal multiplexor, which selects the appropriate signals from the slave that has been addressed. AHB slave. It signals the status of the pending transaction back to the active master. It can indicate whether the transfer completed successfully, whether an error occurred, whether the master should retry the transfer, or whether a split transaction has begun. AHB arbiter. The bus arbiter serializes bus access requests. The arbitration algorithm is not specified by the standard, and its selection is left as a design parameter (fixed priority, round-robin, latency-driven, etc.), although the request-grant based arbitration protocol has to be kept fixed. AHB decoder. This is used for address decoding and provides the select signal to the intended slave.
20.2.2 AMBA AHB Basic Operation In a normal bus transaction, the arbiter grants the bus to the master until the transfer completes, and the bus can then be handed over to another master. However, in order to avoid excessive arbitration latencies, the arbiter can break up a burst. In that case, the master must rearbitrate for the bus in order to complete the remaining data transfers. A basic AHB transfer consists of four clock cycles. During the first one, the request signal is asserted, and in the best case, at the end of the second cycle a grant signal from the arbiter can be sampled by the master. Then, address and control signals are asserted for slave sampling on the next rising edge, and during the last cycle the data phase is carried out (the read data bus is driven, or information on the write data bus is sampled). A slave may insert wait states into any transfer, thus extending the data phase, and a ready signal is available for this purpose. Four-, eight-, and sixteen-beat bursts are defined in the AMBA AHB protocol, as well as undefined-length bursts. During a burst transfer, the arbiter rearbitrates the bus when the penultimate address has been sampled, so that the asserted grant signal can be sampled by the corresponding master at the same point where the last address of the burst is sampled. This makes bus master handover at the end of a burst transfer very efficient. For long transactions, the slave can decide to split the operation, warning the arbiter that the master should not be granted access to the bus until the slave indicates it is ready to complete the transfer. This transfer splitting mechanism is supported by all advanced on-chip interconnects, since it prevents high-latency slaves from keeping the bus busy without performing any actual transfer of data. Split transfers can thus significantly improve bus efficiency, that is, reduce the number of bus busy cycles used just for control (e.g., protocol handshake) rather than for actual data transfers. Advanced arbitration features are required in order to support split transfers, as well as more complex master and slave interfaces.
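As a rough illustration of the request-grant handshake and the one-cycle offset between address and data phases, consider the following toy cycle-level model in C; it is a sketch under simplifying assumptions (single always-ready slave, fixed-priority arbitration) and not the ARM-specified signal interface.

enum { N_MASTERS = 3 };

struct ahb {
    int hbusreq[N_MASTERS];  /* request lines from the masters  */
    int hgrant;              /* index of the granted master     */
    int addr_owner;          /* master in its address phase     */
    int data_owner;          /* master in its data phase        */
};

/* One bus clock cycle: data-bus ownership always trails address-bus
   ownership by one cycle, which is the essence of AHB pipelining.  */
void ahb_cycle(struct ahb *b)
{
    b->data_owner = b->addr_owner;   /* finish previous address phase */
    b->addr_owner = b->hgrant;       /* granted master drives the bus */

    /* Fixed-priority arbitration with master 0 as default master;
       the AHB standard deliberately leaves the algorithm open.      */
    b->hgrant = 0;
    for (int m = N_MASTERS - 1; m >= 1; m--)
        if (b->hbusreq[m]) b->hgrant = m;
}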
20.2.3 Advanced Peripheral Bus The AMBA APB is intended for general-purpose low-speed, low-power peripheral devices. It enables their connection to the main system bus via a bridge. All bus devices are slaves, the bridge being the only peripheral bus master. This is a static bus that provides simple addressing, with latched addresses and control signals for easy interfacing. ARM recommends a dual Read and Write bus implementation, but APB can be implemented with a single tristated data bus. The main features of this bus are the following:

• Unpipelined architecture
• Low gate count
• Low-power operation: (a) reduced loading of the main system bus is obtained by isolating the peripherals behind the bridge; (b) peripheral bus signals are only active during low-bandwidth peripheral transfers.

AMBA APB operation can be abstracted as a state machine with three states. The default state for the peripheral bus is IDLE, which switches to the SETUP state when a transfer is required. The SETUP state lasts just one cycle, during which the peripheral select signal is asserted. The bus then moves to the ENABLE state, which also lasts only one cycle and which requires the address, control, and data signals to remain stable. Then, if other transfers are to take place, the bus goes back to the SETUP state, otherwise to IDLE. As can be observed, AMBA APB should be used to interface to any peripherals which are low-bandwidth and do not require the high performance of a pipelined bus interface.
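The three-state operation just described maps directly onto a small state machine; the following C sketch only illustrates the IDLE/SETUP/ENABLE sequencing, not ARM's bridge implementation.

enum apb_state { IDLE, SETUP, ENABLE };

/* Next-state function of the APB bridge; 'transfer_pending' is set
   whenever the system bus has a peripheral transfer queued up.    */
enum apb_state apb_next(enum apb_state s, int transfer_pending)
{
    switch (s) {
    case IDLE:                      /* default state                */
        return transfer_pending ? SETUP : IDLE;
    case SETUP:                     /* select signal asserted;      */
        return ENABLE;              /*   lasts exactly one cycle    */
    case ENABLE:                    /* address, control, and data   */
                                    /*   must remain stable         */
        return transfer_pending ? SETUP : IDLE;
    }
    return IDLE;
}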
20.2.4 Advanced AMBA Evolutions Recently, some advanced specifications of the AMBA bus have appeared, featuring increased performance and better link utilization. In particular, the Multi-Layer AHB and the AMBA AXI interconnect schemes will be briefly addressed in the following subsections. It should be observed that interconnect performance improvement can be achieved by adopting new topologies and by choosing new protocols, at the expense of silicon area. The former strategy leads from shared buses to bridged clusters, partial or full crossbars, and eventually to NoCs, in an attempt to increase available bandwidth and to reduce local contention. The latter strategy instead tries to maximize link utilization by adopting more sophisticated control schemes and thus permitting a better sharing of existing resources. Multi-Layer AHB can be seen as an evolution of bus topology while keeping the AHB protocol unchanged. On the contrary, AMBA AXI represents an advanced interconnect fabric protocol. 20.2.4.1 Multi-Layer AHB The Multi-Layer AHB specification emerges with the aim of increasing the overall bus bandwidth and providing a more flexible interconnect architecture with respect to AMBA AHB. This is achieved by using a more complex interconnection matrix which enables parallel access paths between multiple masters and slaves in a system [5]. Therefore, the multi-layer bus architecture allows the interconnection of unmodified standard AHB master and slave modules with an increased available bus bandwidth. The resulting architecture becomes very simple and flexible: each AHB layer only has one master, so no arbitration and master-to-slave muxing is needed within a layer. Moreover, the interconnect protocol implemented in these layers can be very simple: it does not have to support request and grant, nor retry or split transactions. The additional hardware needed for this architecture with respect to the AHB is a multiplexer to connect the multiple masters to the peripherals, and arbitration is also required at the point where more than one master wants to access the same slave simultaneously. Figure 20.2 shows a schematic view of the multi-layer concept. The interconnect matrix contains a decode stage for every layer in order to determine which slave is required during the transfer. The multiplexer is used to route the request from the specific layer to the desired slave. The arbitration protocol decides the sequence of accesses of layers to slaves based on a priority assignment. The layer with the lowest priority has to wait for the slave to be freed. Different arbitration schemes can be used, and every slave port has its own arbitration. Input layers can be served in a round-robin fashion, changing every transfer or every burst transaction, or based on a fixed priority scheme (see the sketch below). The number of input/output ports on the interconnect matrix is completely flexible and can be adapted to suit system requirements. As the number of masters and slaves implemented in the system increases, the complexity of the interconnection matrix can become significant, and some optimization techniques have to be used: defining multiple masters on a single layer, multiple slaves appearing as a single slave to the interconnect matrix, and defining local slaves to a particular layer. Finally, it is interesting to outline the capability of this topology to support multi-port slaves. Some devices, such as SDRAM controllers, work much more efficiently when processing transfers from different layers in parallel.
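A minimal C sketch of such per-slave-port arbitration follows, here with a round-robin policy; the dimensions, names, and request encoding are illustrative assumptions only.

enum { N_LAYERS = 4, N_SLAVES = 4 };

/* req[l][s] is nonzero when layer l addresses slave s this cycle */
static int req[N_LAYERS][N_SLAVES];
static int last_served[N_SLAVES];    /* round-robin state per port */

/* Every slave port has its own arbitration; layers that lose
   simply stall until the slave is freed.                         */
int slave_port_arbitrate(int s)
{
    for (int i = 1; i <= N_LAYERS; i++) {
        int l = (last_served[s] + i) % N_LAYERS;
        if (req[l][s]) {
            last_served[s] = l;
            return l;               /* layer granted this cycle    */
        }
    }
    return -1;                      /* no layer requests this slave */
}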
FIGURE 20.2 Schematic view of the multi-layer AHB interconnect. (Each master drives its own layer; per-slave decode stages and multiplexers in the interconnect matrix route requests from the layers to the slaves.)

20.2.4.2 AMBA AXI Protocol AXI is the latest generation AMBA interface. It is designed to be used as a high-speed submicron interconnect, and also includes optional extensions for low-power operation [6]. This high-performance protocol provides flexibility in the implementation of interconnect architectures while still keeping backward compatibility with existing AHB and APB interfaces. AMBA AXI builds upon the concept of point-to-point connection. AMBA AXI does not provide masters and slaves with visibility of the underlying interconnect, instead featuring the concept of master interfaces and symmetric slave interfaces. This approach, besides allowing seamless topology scaling, has
the advantage of simplifying the handshake logic of attached devices, which only need to manage a point-to-point link. To provide high scalability and parallelism, five different logical monodirectional channels are provided in AXI interfaces: a read address channel, a read data channel, a write address channel, a write data channel, and a write response channel. Activity on different channels is mostly asynchronous (e.g., data for a write can be pushed to the write channel before or after the write address is issued to the address channel), and can be parallelized, allowing multiple outstanding read and write requests. Figure 20.3(a) shows how a read transaction uses the read address and read data channels. The write operation over the write address and write data channels is presented in Figure 20.3(b). As can be observed, the data is transferred from the master to the slave using a write data channel, and it is transferred from the slave to the master using a read data channel. In write transactions, in which all the data flows from the master to the slave, the AXI protocol has an additional write response channel to allow the slave to signal to the master the completion of the write transaction. However, the AXI protocol is a master/slave-to-interconnect interface definition, and this enables a variety of different interconnect implementations. Therefore, the mapping of channels, as visible by the interfaces, to actual internal communication lanes is decided by the interconnect designer; single resources might be shared by all channels of a certain type in the system, or a variable amount of dedicated signals might be available, up to a full crossbar scheme. The rationale of this split-channel implementation is based upon the observation that the required bandwidth for addresses is usually much lower than that for data (e.g., a burst requires a single address but maybe four or eight data transfers). Availability of independently scalable resources might, for example, lead to medium-complexity designs sharing a single internal address channel while providing multiple data read and write channels. Finally, some of the key incremental features of the AXI protocol can be listed as follows:

• Support for out-of-order completion of transactions.
• Easy addition of register stages to provide timing closure.
• Support for multiple address issuing.
• Separate read and write data channels to enable low-cost Direct Memory Access (DMA).
• Support for unaligned data transfers.
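To make the split-channel idea concrete, the following C sketch models the channels of one interface as independent queues; all types, names, and field layouts are illustrative assumptions and not the AXI signal-level definition.

#include <stdint.h>

#define QDEPTH 16

typedef struct { uint32_t id, addr; uint8_t burst_len; } ax_req_t;
typedef struct { uint32_t id, data; int last; }          w_beat_t;
typedef struct { uint32_t id; int okay; }                b_resp_t;

struct axi_port {
    ax_req_t aw[QDEPTH]; int aw_n;  /* write address channel   */
    w_beat_t w [QDEPTH]; int w_n;   /* write data channel      */
    b_resp_t b [QDEPTH]; int b_n;   /* write response channel  */
    ax_req_t ar[QDEPTH]; int ar_n;  /* read address channel    */
    w_beat_t r [QDEPTH]; int r_n;   /* read data channel       */
};

/* Data for a write can be pushed before its address is issued; the
   slave matches beats to addresses by transaction ID, which is also
   what enables out-of-order completion across different IDs.       */
void issue_write(struct axi_port *p, uint32_t id, uint32_t addr,
                 uint32_t data)
{
    p->w[p->w_n++]   = (w_beat_t){ .id = id, .data = data, .last = 1 };
    p->aw[p->aw_n++] = (ax_req_t){ .id = id, .addr = addr,
                                   .burst_len = 0 };
}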
FIGURE 20.3 Architecture of transfers: (a) read operation, (b) write operation. (In a read, address and control travel from the master interface to the slave interface over the read address channel, and read data returns over the read data channel; in a write, address and control use the write address channel, the data beats use the write data channel, and the slave's completion status returns over the write response channel.)
20.3 CoreConnect Bus

CoreConnect is an IBM-developed on-chip bus that eases the integration and reuse of processor, subsystem, and peripheral cores within standard product platform designs. It is a complete and versatile architecture clearly targeting high-performance systems, and many of its features might be overkill in simple embedded applications [7]. The CoreConnect bus architecture serves as the foundation of IBM Blue Logic™ designs as well as of non-IBM devices. The Blue Logic ASIC/SoC design methodology is the approach proposed by IBM [8] to extend conventional ASIC design flows to current design needs: low-power and multiple-voltage products, reconfigurable logic, custom design capability, and analog/mixed-signal designs. Each of these offerings requires a well-balanced coupling of technology capabilities and design methodology. The use of this bus architecture allows the hierarchical design of SoCs. As can be seen in Figure 20.4, the IBM CoreConnect architecture provides three buses for interconnecting cores, library macros, and custom logic:

• Processor Local Bus (PLB)
• On-Chip Peripheral Bus (OPB)
• Device Control Register (DCR) Bus

The PLB connects the processor to high-performance peripherals, such as memories, DMA controllers, and fast devices. Bridged to the PLB, the OPB supports slower-speed peripherals. Finally, the DCR bus is a separate control bus that connects all devices, controllers, and bridges and provides a separate
FIGURE 20.4 Schematic structure of the CoreConnect bus.
path to set and monitor the individual control registers. It is designed to transfer data between the CPU's general-purpose registers and the slave logic's device control registers. It removes configuration registers from the memory address map, which reduces loading and improves bandwidth of the PLB. This architecture shares many high-performance features with the AMBA bus specification: both architectures allow split, pipelined, and burst transfers, multiple bus masters, and 32-, 64-, or 128-bit data paths. In addition, CoreConnect also supports multiple masters on the peripheral bus. Design toolkits are available for the CoreConnect bus and include functional models, monitors, and a bus functional language to drive the models. These toolkits provide an advanced validation environment for engineers designing macros to attach to the PLB, OPB, and DCR buses.
20.3.1 Processor Local Bus

The PLB is the main system bus, targeting high-performance and low-latency on-chip communication. More specifically, PLB is a synchronous, multi-master, arbitrated bus. It supports concurrent read and write transfers, thus yielding a maximum bus utilization of two data transfers per clock cycle. Moreover, PLB implements address pipelining, which reduces bus latency by overlapping a new write request with an ongoing write transfer and up to three read requests with an ongoing read transfer [9]. Access to PLB is granted through a central arbitration mechanism that allows masters to compete for bus ownership. This arbitration mechanism is flexible enough to allow various priority schemes to be implemented; in particular, each master presents one of four levels of request priority (a behavioral sketch of such an arbiter closes this subsection). Additionally, an arbitration locking mechanism is provided to support master-driven atomic operations. PLB also exhibits the ability to overlap the bus request/grant protocol with an ongoing transfer. The PLB specification describes a system architecture along with a detailed description of the signals and transactions. PLB-based custom logic systems require the use of a PLB macro to interconnect the various master and slave macros. The PLB macro is the key component of the PLB architecture, and consists of a bus arbitration control unit and the control logic required to manage the address and dataflow through the PLB. Each PLB master is attached to the PLB through separate address, read data, and write data buses and a set of transfer qualifier signals, while PLB slaves are attached through shared, but decoupled, address, read data, and write data buses (each one with its own transfer control and status signals). The separate address and data buses from the masters allow simultaneous transfer requests. The PLB macro arbitrates among them and sends
the address, data, and control signals from the granted master to the slave bus. The slave response is then routed back to the appropriate master. Up to 16 masters can be supported by the arbitration unit, while there are no restrictions on the number of slave devices.
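As an illustration of the four-level priority mechanism, the sketch below shows one arbitration decision. It is a minimal behavioral sketch with invented names, not IBM's arbiter: ties are broken by the lowest master index, whereas a real implementation may apply a fairer tie-breaking policy among masters at the same level.

#include <cstdio>
#include <vector>

// Sketch of PLB-style arbitration: each requesting master drives a 2-bit
// priority (0 lowest .. 3 highest); the arbiter grants the highest level.
// Lowest-index tie-breaking is a simplification of real implementations.
struct PlbArbiter {
    struct Req { bool valid; int priority; };

    int grant(const std::vector<Req>& reqs) {
        int winner = -1, best = -1;
        for (size_t m = 0; m < reqs.size(); ++m)
            if (reqs[m].valid && reqs[m].priority > best) {
                best = reqs[m].priority;
                winner = static_cast<int>(m);
            }
        return winner;   // -1 means the bus stays idle this cycle
    }
};

int main() {
    PlbArbiter arb;
    // Master 0 asks at priority 1; masters 1 and 2 both ask at priority 3.
    std::vector<PlbArbiter::Req> reqs = {{true, 1}, {true, 3}, {true, 3}};
    std::printf("granted master %d\n", arb.grant(reqs));  // master 1 wins
}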
20.3.2 On-Chip Peripheral Bus

The OPB typically connects low-bandwidth devices such as serial and parallel ports, UARTs, timers, etc., and represents a separate, independent level of the bus hierarchy. It is implemented as a multi-master, arbitrated bus. It is a fully synchronous interconnect with a common clock, but its devices can run with slower clocks, as long as all of the clocks are synchronized with the rising edge of the main clock. This bus uses a distributed multiplexer attachment implementation instead of tristate drivers. The OPB supports multiple masters and slaves by implementing the address and data buses as a distributed multiplexer. This type of structure is suitable for the less data-intensive OPB and allows adding peripherals to a custom core logic design without changing the I/O on either the OPB arbiter or existing peripherals. All of the masters are capable of providing an address to the slaves, whereas both masters and slaves are capable of driving and receiving the distributed data bus. PLB masters gain access to the peripherals on the OPB through the OPB bridge macro. The OPB bridge acts as a slave device on the PLB and a master on the OPB. It supports word (32-bit), half-word (16-bit), and byte read and write transfers on the 32-bit OPB data bus, as well as bursts, and has the capability to perform target-word-first line read accesses. The OPB bridge performs dynamic bus sizing, allowing devices with different data widths to communicate efficiently: when the OPB bridge master performs an operation wider than the selected OPB slave can support, the bridge splits the operation into two or more smaller transfers (as sketched after the feature list below). Some of the main features of the OPB specification are:

• Fully synchronous operation
• Dynamic bus sizing: byte, half-word, full-word, and double-word transfers
• Separate address and data buses
• Support for multiple OPB bus masters
• Single-cycle transfer of data between OPB bus master and OPB slaves
• Sequential address (burst) protocol
• 16-cycle fixed bus timeout provided by the OPB arbiter
• Bus arbitration overlapped with the last cycle of bus transfers
• Optional OPB DMA transfers
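The dynamic bus sizing performed by the bridge can be pictured as follows. This is a behavioral sketch with invented names, not IBM's implementation; in particular, little-endian sub-word extraction is assumed for brevity, whereas CoreConnect is in fact big-endian.

#include <cstdio>

// Sketch of OPB-bridge dynamic bus sizing: a wide master access is split
// into several narrower transfers matching the selected slave's data width.
void bridged_write(unsigned addr, unsigned data, int master_bytes, int slave_bytes) {
    for (int off = 0; off < master_bytes; off += slave_bytes) {
        unsigned mask = (slave_bytes >= 4) ? 0xFFFFFFFFu
                                           : ((1u << (8 * slave_bytes)) - 1);
        // Little-endian sub-word extraction for brevity; CoreConnect is
        // actually big-endian, a detail this sketch deliberately omits.
        unsigned sub = (data >> (8 * off)) & mask;
        std::printf("OPB write 0x%0*x to 0x%08x (%d bytes)\n",
                    2 * slave_bytes, sub, addr + off, slave_bytes);
    }
}

int main() {
    // A full-word (4-byte) write to a half-word (2-byte) slave becomes
    // two back-to-back half-word transfers on the OPB.
    bridged_write(0x8000, 0xCAFEBABE, 4, 2);
}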
20.3.3 Device Control Register Bus

The DCR bus provides an alternative path to the system for setting the individual device control registers. The latter are on-chip registers that, from an architectural viewpoint, are implemented outside the processor core. Through the DCR bus, the host CPU can set up the device-control-register sets without loading down the main PLB. This bus has a single master, the CPU interface, which can read or write the individual device control registers. The DCR bus architecture allows data transfers among OPB peripherals to occur independently from, and concurrently with, data transfers between processor and memory, or among other PLB devices. The DCR bus architecture is based on a ring topology to connect the CPU interface to all devices. The DCR bus is typically implemented as a distributed multiplexer across the chip, such that each subunit not only has a path to place its own DCRs on the CPU read path, but also has a path which bypasses its DCRs and places another unit's DCRs on the CPU read path. The DCR bus consists of a 10-bit address bus and a 32-bit data bus. This is a synchronous bus, wherein slaves may be clocked either faster or slower than the master, although synchronization of clock signals with the DCR bus clock is required. Finally, bursts are not supported by this bus; read and write transfers take a minimum of two cycles and can optionally be extended by the slaves or by the single master.
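The distributed-multiplexer read path lends itself to a simple chain model: each unit either drives its own register value onto the path or forwards its neighbor's, so exactly one unit supplies any given read. The sketch below is hypothetical and purely behavioral; unit and register names are ours.

#include <cstdio>
#include <map>

// Sketch of the DCR distributed read path: units are chained, and each one
// either places its own register on the path or passes through the value
// coming from the previous unit in the ring.
struct DcrUnit {
    std::map<unsigned, unsigned> regs;   // 10-bit DCR addresses owned by unit

    unsigned read_path(unsigned addr, unsigned upstream) const {
        auto it = regs.find(addr);
        return (it != regs.end()) ? it->second   // our register: drive it
                                  : upstream;    // not ours: bypass
    }
};

int main() {
    DcrUnit uart{{{0x010, 0xAA}}};
    DcrUnit dma {{{0x020, 0x55}}};
    unsigned addr = 0x020;
    // The CPU interface (the single DCR master) sees the value emerging
    // from the end of the chain.
    unsigned v = dma.read_path(addr, uart.read_path(addr, 0));
    std::printf("DCR[0x%03x] = 0x%02x\n", addr, v);
}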
20.4 STBus

STBus is an STMicroelectronics proprietary on-chip bus protocol, dedicated to SoCs designed for high-bandwidth applications such as audio/video processing [10]. The STBus interfaces and protocols are closely related to the industry-standard VCI (Virtual Component Interface). The components interconnected by an STBus are either initiators (which initiate transactions on the bus by sending requests) or targets (which respond to requests). The bus architecture is decomposed into nodes (sub-buses in which initiators and targets can communicate directly), and internode communications are performed through First In First Out (FIFO) buffers. Figure 20.5 shows a schematic view of the STBus interconnect. STBus implements three different protocols that can be selected by the designer in order to meet complexity, cost, and performance constraints. From lower to higher, they can be listed as follows:

Type 1: Peripheral protocol. This type is the low-cost implementation for low/medium performance. Its simple design allows a synchronous handshake protocol and provides a limited transaction set. The peripheral STBus is targeted at modules that require a low-complexity, medium-data-rate communication path with the rest of the system. This typically includes standalone modules, such as general-purpose input/output, or modules which require independent control interfaces in addition to their main memory interface.

Type 2: Basic protocol. In this case, the limited operation set of the peripheral interface is extended to a full operation set, including compound operations, source labeling, and some priority and transaction labeling. Moreover, this implementation supports split and pipelined accesses, and is aimed at devices which need high performance but do not require the additional system efficiency associated with shaped request/response packets or the ability to reorder outstanding operations.

Type 3: Advanced protocol. The most advanced implementation upgrades the previous interfaces with support for out-of-order execution and shaped packets, and is equivalent to the advanced VCI protocol. Split and pipelined accesses are supported. It allows performance to be improved either by letting more operations occur concurrently, or by rescheduling operations more efficiently.

The type 2 protocol preserves the order of requests and responses. One constraint is that, when communicating with a given target, an initiator cannot send a request to a new target until it has received all the responses from the current target. Requests still awaiting a response are called pending, and a pending request controller manages them. A given type 2 target is assumed to send its responses in the same order in which the requests arrived. In the type 3 protocol, the order of responses may not be guaranteed, and an initiator can communicate with any target, even if it has not received all responses from a previous one.
FIGURE 20.5 Schematic view of the STBus interconnect.
Associated with these protocols, hardware components have been designed in order to build complete reconfigurable interconnections between initiators and targets. A toolkit with a graphical interface has been developed around STBus to automatically generate the top-level backbone, cycle-accurate high-level models, a path to implementation, bus analysis (latencies, bandwidth), and bus verification (protocol and behavior). An STBus system includes three generic architectural components. The node arbitrates and routes the requests and, optionally, the responses. The converter is in charge of converting requests from one protocol to another (for instance, from basic to advanced). Finally, the size converter is used between two buses of the same type but of different widths; it includes buffering capability. The STBus can implement various arbitration strategies and allows them to be changed dynamically. In a simplified single-node system example, a communication between one initiator and a target is performed in several steps:

• A request/grant step between the initiator and the node takes place, corresponding to an atomic rendezvous operation of the system.
• The request is transferred from the node to the target.
• A response-request/grant step is carried out between the target and the node.
• The response-request is transferred from the node to the initiator.
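The type 2 ordering rule described above lends itself to a compact sketch: a pending request controller lets an initiator pipeline further requests to its current target, but blocks a request toward a new target while responses are still outstanding. This is a behavioral model with invented names, not ST's hardware block.

#include <cassert>
#include <cstdio>

// Sketch of the STBus type 2 ordering rule: an initiator may pipeline
// requests to its current target, but must drain all pending responses
// before addressing a different target.
struct PendingRequestController {
    int current_target = -1;
    int pending = 0;

    bool may_issue(int target) const {
        return pending == 0 || target == current_target;
    }
    void issue(int target) {
        assert(may_issue(target));
        current_target = target;
        ++pending;
    }
    void response() { --pending; }
};

int main() {
    PendingRequestController prc;
    prc.issue(/*target=*/0);
    prc.issue(0);                                   // pipelined: same target, OK
    std::printf("switch to target 1 now? %s\n",
                prc.may_issue(1) ? "yes" : "no");   // no: 2 responses pending
    prc.response(); prc.response();
    std::printf("switch to target 1 now? %s\n",
                prc.may_issue(1) ? "yes" : "no");   // yes
}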
20.4.1 Bus Topologies

STBus can instantiate different bus topologies, trading off communication parallelism against architectural complexity. In particular, system interconnects with different scalability properties can be instantiated, such as:

• Single shared bus: suitable for simple low-performance implementations. It features minimum wiring area but limited scalability.
• Full crossbar: targets complex high-performance implementations, at a large wiring area overhead.
• Partial crossbar: an intermediate solution, with medium performance, implementation complexity, and wiring overhead.

It is worth observing that STBus allows for the instantiation of complex bus systems, such as heterogeneous multi-node buses (thanks to size or type converters), and facilitates bridging with different bus architectures, provided proper protocol converters are made available (e.g., between STBus and AMBA).
20.5 Wishbone

The Wishbone SoC interconnect [11] defines two types of interfaces, called master and slave. Master interfaces are cores that are capable of generating bus cycles, while slave interfaces are capable of receiving bus cycles. Some relevant Wishbone features worth mentioning are the multi-master capability, which enables multiprocessing; the arbitration methodology, which is defined by end users according to their needs; and the scalable data bus widths and operand sizes. Moreover, the hardware implementation of bus interfaces is simple and compact, and the hierarchical view of the Wishbone architecture supports structured design methodologies [12]. The hardware implementation supports various IP core interconnection schemes, including: point-to-point connection, shared bus, crossbar switch implementation, dataflow interconnection, and off-chip interconnection. The crossbar switch interconnection is usually used when connecting two or more masters together so that every one can access two or more slaves. In this scheme, the master initiates an addressable bus cycle to a target slave. The crossbar switch interconnection allows more than one master to use the bus, provided they do not access the same slave. In this way, the master requests a channel on the switch and, once this is established, data is transferred in a point-to-point way.
On the one hand, the overall data transfer rate of the crossbar switch is higher than that of shared bus mechanisms, and can be expanded to support extremely high data transfer rates. On the other hand, the main disadvantages are more complex interconnection logic and larger routing resource requirements.
20.5.1 The Wishbone Bus Transactions

The Wishbone architecture defines different transaction cycles according to the action performed (read or write) and the blocking/nonblocking access. For instance, single read/write transfers are carried out as follows. The master requests the operation and places the slave address onto the bus. Then the slave places data onto the data bus and asserts an acknowledge signal. The master monitors this signal and releases the request signals once data have been latched. Two or more back-to-back read/write transfers can also be strung together; in this case, the starting and stopping points of the transfers are identified by the assertion and negation of a specific signal [13]. A Read-Modify-Write (RMW) transfer is also specified, which can be used in multiprocessor and multitasking systems in order to allow multiple software processes to share common resources by using semaphores. This is commonly done on interfaces for disk controllers, serial ports, and memory. The RMW transfer reads and writes data to a memory location in a single bus cycle. For the correct implementation of this bus transaction, shared bus interconnects have to be designed in such a way that, once the arbiter grants the bus to a master, it will not rearbitrate the bus until the current master gives it up. Also, it is important to note that a master device must support the RMW transfer in order for it to be effective, and this is generally done by means of special instructions forcing RMW bus transactions.
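A common use of the RMW cycle is a test-and-set semaphore. The sketch below shows the intent at a purely behavioral level: the mutex stands in for the arbiter rule that the bus is not rearbitrated until the RMW cycle completes, so no other master can slip between the read and the write. None of this is Wishbone signal-level code; all names are ours.

#include <cstdio>
#include <mutex>

// Behavioral sketch of a Wishbone-style RMW cycle used as a test-and-set
// semaphore. The mutex models the arbiter holding the grant for the whole
// indivisible read-modify-write bus cycle.
static std::mutex bus_grant;
static unsigned semaphore = 0;        // 0 = free, 1 = taken

// Returns true if the calling master acquired the semaphore.
bool rmw_test_and_set() {
    std::lock_guard<std::mutex> hold(bus_grant);  // single indivisible cycle
    unsigned old = semaphore;                     // read phase
    semaphore = 1;                                // write phase
    return old == 0;
}

int main() {
    std::printf("master A acquires: %s\n", rmw_test_and_set() ? "yes" : "no");
    std::printf("master B acquires: %s\n", rmw_test_and_set() ? "yes" : "no");
    semaphore = 0;                                // release (a plain write)
    std::printf("master B retries:  %s\n", rmw_test_and_set() ? "yes" : "no");
}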
20.6 SiliconBackplane MicroNetwork

SiliconBackplane MicroNetwork is a family of innovative communication architectures licensed by Sonics for use in SoC design. The Sonics architecture provides CPU independence, true mix-and-match of IP cores, a unified communication medium, and a structure that makes an SoC design simpler to partition, analyze, design, verify, and test [14]. The SiliconBackplane MicroNetwork allows high-speed pipelined transactions (the data bandwidth of the interconnect scales from 50 MB/sec to 4.8 GB/sec) where the real-time Quality of Service (QoS) of multiple simultaneous dataflows is guaranteed. A network utilization of up to 90% can be achieved. The SiliconBackplane relies on the SonicsStudio™ development environment for architectural exploration, and the availability of pre-characterization results enables reliable performance analysis and reduction of interconnect timing closure uncertainties. The ultimate goal is to avoid over-designing interconnects. The architecture can be described as a distributed communication infrastructure (thus facilitating place-and-route) which can easily be extended hierarchically in the form of Tiles (collections of functions requiring minimal assistance from the rest of the die). Among other features, the SiliconBackplane MicroNetwork provides advanced error handling in hardware (features for SoC-wide error detection and support mechanisms for software clean-up and recovery of unresponsive cores), runtime reconfiguration to meet changing application demands, and data multicast. The SiliconBackplane system consists of a physical interconnect bus configured with a combination of agents. Each IP core communicates with an attached agent through ports implementing the Open Core Protocol (OCP) standard interface. The agents then communicate with each other using a network of interconnects based on the SiliconBackplane protocol. The latter includes patented transfer mechanisms aimed at maximizing interconnect bandwidth utilization and optimized for streaming multimedia applications [15]. Figure 20.6 shows a schematic view of the SiliconBackplane system.
FIGURE 20.6 Schematic view of the SiliconBackplane system.
A few specific components can be identified in an agent architecture:

Initiators. Implement the interface between the bus and a master core (CPU, DSP, DMA, etc.). The initiator receives requests over the OCP, transmits them according to the SiliconBackplane standard, and finally processes the responses from the target.

Targets. Implement the interface between the physical bus and a slave device (memories, UARTs, etc.). This module serves as the bridge between the system and the OCP.

Service agents. Enhanced initiators that provide additional capabilities such as debug and test.
20.6.1 System Interconnect Bandwidth

One of the most interesting features of the SiliconBackplane network is the possibility of allocating bandwidth based on a two-level arbitration policy. The system designer can preallocate bandwidth to high-priority initiators by means of Time-Division Multiple Access (TDMA): an initiator agent with a preassigned time slot has first rights over that slot. If the owner does not need it, the slot is reallocated in a round-robin fashion to one of the system devices, and this represents the second level of the arbitration policy. The TDMA approach provides fast access to variable-latency subsystems and is a simple mechanism to guarantee QoS. The TDMA bandwidth allocation tables are stored in a configuration register at every initiator, and can be dynamically overwritten to fit the system needs. On the other hand, the fair round-robin allocation scheme can be used to guarantee bandwidth availability to initiators with less predictable access patterns, since some or many of the TDMA slots may turn out to be left unallocated. The round-robin arbitration policy is particularly suitable for best-effort traffic.
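The two-level scheme can be captured in a few lines: the current slot's preassigned owner wins if it is requesting; otherwise the slot falls back to round-robin among the remaining requestors. This is a hypothetical sketch with slot-table contents, sizes, and names chosen arbitrarily by us.

#include <cstdio>
#include <vector>

// Sketch of SiliconBackplane-style two-level arbitration: level 1 is a
// TDMA slot table, level 2 is round-robin over slots the owner declines.
struct TwoLevelArbiter {
    std::vector<int> slot_table;   // slot_table[t % size]: preallocated owner
    int rr = 0;                    // round-robin pointer (second level)
    int cycle = 0;

    int grant(const std::vector<bool>& req) {
        int owner = slot_table[cycle++ % slot_table.size()];
        if (req[owner]) return owner;              // level 1: slot owner wins
        for (size_t i = 0; i < req.size(); ++i) {  // level 2: reclaim the slot
            int cand = (rr + 1 + static_cast<int>(i)) % static_cast<int>(req.size());
            if (req[cand]) { rr = cand; return cand; }
        }
        return -1;                                 // bus idle this slot
    }
};

int main() {
    // Initiator 0 is preallocated 3 of every 4 slots (e.g., a real-time
    // video stream); initiators 1 and 2 carry best-effort traffic.
    TwoLevelArbiter arb{{0, 0, 0, 1}};
    std::vector<bool> req = {false, true, true};   // owner 0 is idle right now
    for (int t = 0; t < 4; ++t)
        std::printf("slot %d -> initiator %d\n", t, arb.grant(req));
}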
20.6.2 Configuration Resources

All the configurable IP cores implemented in the SiliconBackplane system can be configured either at compile time or dynamically, by means of specific configuration registers. These configuration devices are accessible by the operating system. Configuration registers are individually set for each agent, depending upon the services provided to the attached cores. The types of configuration registers are:

• Unbuffered registers hold configuration values for the agent or its subsystem core.
• Buffered registers hold configuration values that must be simultaneously updated in all agents.
• Broadcast configuration registers hold values that must remain identical in multiple agents.
20.7 Other On-Chip Interconnects

20.7.1 Peripheral Interconnect Bus

The PI Bus was developed by several European semiconductor companies (Advanced RISC Machines, Philips Semiconductors, SGS-THOMSON Microelectronics, Siemens, TEMIC/MATRA MHS) within the framework of a European project (OMI, the Open Microprocessor Initiative; the PI Bus has been incorporated as OMI Standard OMI 324.3D). Philips subsequently developed an extended, backward-compatible PI Bus protocol standard that is frequently used in hardware systems [16]. The high bandwidth and low overhead of the PI Bus provide a comfortable environment for connecting processor cores, memories, coprocessors, I/O controllers, and other functional blocks in high-performance chips for time-critical applications. The PI Bus functional modules are arranged in macrocells, and a wide range of functions are provided. Macrocells with a PI Bus interface can be easily integrated into a chip layout even if they are designed by different manufacturers. The potential bus agents require only a PI Bus interface of low complexity. Since no concrete implementation is specified, the PI Bus can be adapted to the individual requirements of the target chip design; for instance, the widths of the address and data buses may be varied. The main features of this bus are:

• Processor-independent implementation and design
• Demultiplexed operation
• Clock-synchronous operation
• Peak transfer rate of 200 MB/sec (50 MHz bus clock)
• Address and data bus scalable (up to 32 bits)
• 8-, 16-, 32-bit data access
• Broad range of transfer types, from single to multiple data transfers
• Multi-master capability
The PI Bus does not provide cache coherency support, broadcasts, dynamic bus sizing, or unaligned data access. Finally, the University of Sussex has developed a VHDL toolkit to meet the needs of embedded system designers using the PI Bus; macrocell testing for PI Bus compliance is also possible using the framework available in the toolkit [17].
20.7.2 Avalon

Avalon is Altera's parameterized interface bus used by the Nios embedded processor. The Avalon switch fabric has a set of predefined signal types with which a user can connect one or more IP blocks. It can only be implemented on Altera devices using SOPC Builder, a system development tool that automatically generates the Avalon switch fabric logic [18]. The Avalon switch fabric enables simultaneous multi-master operation for maximum system performance by using a technique called slave-side arbitration, which determines which master gains access to a certain slave in the event that multiple masters attempt to access the same slave at the same time. Therefore, simultaneous transactions for all bus masters are supported, and arbitration for peripherals or memory interfaces that are shared among masters is automatically included. The Avalon interconnect includes chip-select signals for all peripherals, even user-defined peripherals, to simplify the design of the embedded system. Separate, dedicated address and data paths provide an easy interface to on-chip user logic. User-defined peripherals are not required to decode data and address bus cycles. Dynamic bus sizing allows developers to use low-cost, narrow memory devices that do not match the native bus size of their CPU. The switch fabric supports each type of transfer supported by the Avalon interface. Each peripheral port into the switch is generated with a reduced amount of logic to meet the requirements of the peripheral, including wait-state logic, data width matching, and passing of wait signals.
Read and write operations with latency can be performed. Latent transfers are useful to masters wanting to issue multiple sequential read or write requests to a slave, which may require multiple cycles for the first transfer but fewer cycles for subsequent sequential transfers. This can be beneficial for instruction-fetch operations and DMA transfers to or from SDRAM. In these cases, the CPU or DMA master may prefetch (post) multiple requests prior to completion of the first transfer and thereby reduce overall access latency. Interestingly, the Avalon interface includes signals for streaming data between master/slave pairs. These signals indicate the peripheral’s capacity to provide or accept data. A master does not have to access status registers in the slave peripheral to determine whether the slave can send or receive data. Streaming transactions maximize throughput between master–slave pairs, while avoiding data overflow or underflow on the slave peripherals. This is especially useful for DMA transfers [19].
20.7.3 CoreFrame

The CoreFrame architecture has been developed by Palmchip Corporation and relies on point-to-point signals and multiplexing instead of shared tristate lines. It aims at delivering high performance while simultaneously reducing design and verification time. The distinctive features of CoreFrame are [20]:

• 400 MB/sec bandwidth at 100 MHz (bus speed is scalable to technology and design requirements)
• Unidirectional buses only
• Central, shared memory controller
• Single-clock-cycle data transfers
• Zero-wait-state register accesses
• Separate peripheral I/O and DMA buses
• Simple protocol for reduced gate count
• Low capacitive loading for high-frequency operation
• Hidden arbitration for DMA bus masters
• Application-specific memory map and peripherals
The most distinctive feature of CoreFrame is the separation of I/O and memory transfers onto different buses: the PalmBus provides the I/O backplane and allows the processor to configure and control peripheral blocks, while the MBus provides a DMA connection from peripherals to main memory, allowing direct data transfer without processor intervention. Other on-chip interconnects are not described here owing to lack of space: IPBus from IDT [21], IP Interface from Motorola [22], the MARBLE asynchronous bus from the University of Manchester [23], Atlantic from Altera [24], ClearConnect from ClearSpeed Technology [25], and FISPbus from Mentor Graphics [26].
20.8 Analysis of Communication Architectures

Traditional SoC interconnects, as exemplified by AMBA AHB, are based upon low-complexity shared buses, in an attempt to minimize area overhead. Such architectures, however, are not adequate to support the trend for SoC integration, motivating the need for more scalable designs. As observed in Section 20.2.4, interconnect performance can be improved by adopting new topologies (from shared buses to bridged clusters, partial or full crossbars, and eventually to NoCs) and by choosing new protocols that maximize link utilization, in both cases at the expense of silicon area. While both approaches can be followed at the same time, we perform separate analyses for the sake of clarity. First, the scalability of evolving interconnect fabric protocols is assessed: three state-of-the-art shared buses are stressed under an increasing traffic load. A traditional AMBA AHB link is compared against the more advanced, but also more expensive, evolutionary solutions offered by STBus (type 3) and AMBA AXI (based upon a Synopsys implementation).
These system interconnects were selected for analysis because of their distinctive features, which make it possible to sketch the evolution of shared-bus-based communication architectures. AMBA AHB makes two data links (one for read, one for write) available, but only one of them can be active at any time. Only one bus master can own the data wires at any time, preventing the multiplexing of requests and responses on the interconnect signals. Transaction pipelining (i.e., split ownership of data and address lines) is provided, but not as a means of allowing multiple outstanding requests, since address sampling is only allowed at the end of the previous data transfer. Bursts are supported, but only as a way to cut down on rearbitration times, and AHB slaves do not have a native burst notion. Overall, AMBA AHB is designed for a low silicon area footprint. The STBus interconnect (with shared bus topology) implements split request and response channels. This means that, while a system initiator is receiving data from an STBus target, another one can issue a second request to a different target. As soon as the response channel frees up, the second request can immediately be serviced, thus hiding target wait states behind those of the first transfer. The amount of saved wait states depends on the depth of the prefetch FIFO buffers on the slave side. Additionally, the split-channel feature allows for multiple outstanding requests by masters, with support for out-of-order retirement. An additional relevant feature of STBus is its low-latency arbitration, which is performed in a single cycle. Finally, AMBA AXI builds upon the concept of point-to-point connection and exhibits complex features, such as multiple outstanding transaction support (with out-of-order or in-order delivery selectable by means of transaction IDs) and time interleaving of traffic toward different masters on internal data lanes. Several logical monodirectional channels are provided in AXI interfaces, and activity on them can be parallelized, allowing multiple outstanding read and write requests. In our protocol exploration, to provide a fair comparison, a “shared bus” topology is assumed, which comprises a single internal lane for each of the AXI channels. Figure 20.7 shows an example of the efficiency improvements made possible by advanced interconnects in the test case of slave devices having two wait states, with three system processors and four-beat burst transfers.
FIGURE 20.7 Concept waveforms showing burst interleaving for the three interconnects. (a) AMBA AHB, (b) STBus (with minimal buffering), (c) STBus (with more buffering), and (d) AMBA AXI.
AMBA AHB has to pay two cycles of penalty per transferred datum. STBus is able to hide the latencies of subsequent transfers behind those of the first one, with an effectiveness that is a function of the available buffering. AMBA AXI is capable of interleaving transfers by sharing data channel ownership in time. Under conditions of peak load, when transactions always overlap, AMBA AHB is limited to a 33% efficiency (transferred words over elapsed clock cycles), while both STBus and AMBA AXI can theoretically reach 100% throughput.
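These peak-load figures follow from simple cycle bookkeeping. Denoting by w the number of wait states paid per datum on a non-pipelined shared bus, and assuming every transferred word costs its wait states plus one data cycle:

\eta_{\mathrm{AHB}} = \frac{1}{1 + w}, \qquad
w = 2 \;\Rightarrow\; \eta_{\mathrm{AHB}} = \tfrac{1}{3} \approx 33\%, \qquad
\eta_{\mathrm{STBus}},\, \eta_{\mathrm{AXI}} \xrightarrow{\text{full overlap}} 100\%

With w = 1 the same expression yields the 50% theoretical ceiling quoted for AHB in the scalability analysis below; interconnects that overlap the wait states of one transaction with the data beats of another drive the effective w seen by the bus toward zero.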
20.8.1 Scalability Analysis

SystemC models of AMBA AHB, AMBA AXI (provided within the Synopsys CoCentric/Designware® [27] suites), and STBus are used within the framework of the MPARM simulation platform [28–30]. For the STBus model, the depth of the FIFOs instantiated by the target side of the interconnect is a configurable parameter; their impact can be noticed in the concept waveforms in Figure 20.7. 1-stage (“STBus” hereafter) and 4-stage (“STBus [B]”) FIFOs were benchmarked. The simulated on-chip multiprocessor consists of a configurable number of ARM cores attached to the system interconnect. Traffic workload and pattern can easily be tuned by running different benchmark code on the cores, by scaling the number of system processors, or by changing the amount of processor cache, which leads to different amounts of cache refills. Slave devices are assumed to introduce one wait state before responses. To assess interconnect scalability, a benchmark independently but concurrently runs on every system processor, performing accesses to its private slave (involving bus transactions). This means that, while producing real functional traffic patterns, the test setup was not constrained by bottlenecks owing to shared slave devices. Scalability properties of the system interconnects can be observed in Figure 20.8, reporting the execution time variation when attaching an increasing number of system cores to a single shared interconnect under heavy traffic load. Core caches are kept very small (256 bytes) in order to cause many cache misses and therefore significant levels of interconnect congestion. Execution times are normalized against those for a two-processor system, trying to isolate the scalability factor alone. The heavy bus congestion case is considered here because the same analysis performed under light traffic conditions (e.g., with 1 kB caches) shows that all of the interconnects perform very well (they all stay close to 100% at all times), with only AHB showing a moderate performance decrease of 6% when moving from two to eight running processors. With 256-byte caches, the resulting execution times, as Figure 20.8 shows, get 77% worse for AMBA AHB when moving from two to eight cores, while AXI and STBus manage to stay within 12% and 15%. The impact of FIFOs in STBus is noticeable, since the interconnect with minimal buffering shows execution times 36% worse than in the two-core setup. The reason behind the behavior pointed out in Figure 20.8 is that, under heavy traffic load and with many processors, interconnect saturation takes place. This is clearly indicated in Figure 20.9, which reports the fraction of cycles during which some transaction was pending on the bus with respect to total execution time. In such a congested environment, as Figure 20.10 shows, AMBA AXI and STBus (with 4-stage FIFOs) are able to achieve transfer efficiencies (defined as data actually moved over bus contention time) of up to 81% and 83%, respectively, while AMBA AHB reaches only 47%, near its maximum theoretical efficiency of 50% (one wait state per data word). These plots stress the impact that comparatively low-area-overhead optimizations can sometimes have in complex systems. According to simulation results, some of the advanced features in AMBA AXI provided highly scalable bandwidth, but at the price of higher latency in low-contention setups. Figure 20.11 shows the minimum and average number of cycles required to complete a single write and a burst read transaction in STBus and AMBA AXI.
STBus has a minimal overhead for transaction initiation, as low as a single cycle if communication resources are free. This is confirmed by figures showing a best-case three-cycle latency for single accesses (initiation, wait state, data transfer) and a nine-cycle latency for four-beat bursts. AMBA AXI, owing to its complex channel management and arbitration, requires more time to initiate and close a transaction: recorded minimum completion times are 6 and 11 cycles for single writes and burst reads, respectively.
FIGURE 20.8 Execution times with 256 bytes caches.

FIGURE 20.9 Bus busy time with 256 bytes caches.
As bus traffic increases, completion latencies of AMBA AXI and STBus get more and more similar, because the bulk of transaction latency is spent in contention. It must be pointed out, however, that protocol improvements alone cannot overcome the intrinsic performance bound owing to the shared nature of the interconnect resources. While protocol features can push the saturation boundary further and approach 100% efficiency, traffic loads taking advantage of more parallel topologies will always exist. The charts reported here already show some traces of saturation even for the most advanced interconnects. However, the improved performance achieved by more parallel topologies strongly depends on the kind of bus traffic.
FIGURE 20.10 Bus usage efficiency with 256 bytes caches.

FIGURE 20.11 Transaction completion latency with 256 bytes caches.
In fact, if the traffic is dominated by accesses to shared devices (shared memory, semaphores, interrupt module), these accesses have to be serialized anyway, thus reducing the effectiveness of area-hungry parallel topologies. It is therefore evident that crossbars behave best when data accesses are local and no destination conflicts arise. This is reflected in Figure 20.12, showing average completion latencies of read accesses for different bus topologies: shared buses (AMBA AHB and STBus), partial crossbars (STBus-32 and STBus-54), and full crossbars (STBus-FC). Four benchmarks are considered, consisting of matrix multiplications performed independently by each processor or in pipeline, with or without an underlying OS (Operating System)
FIGURE 20.12 Reads average latency.
(OS-IND, OS-PIP, ASM-IND, and ASM-PIP, respectively). IND benchmarks do not give rise to interprocessor communication, which is instead at the core of PIP benchmarks; communication goes through the shared memory. Moreover, OS-assisted code implicitly uses both semaphores and interrupts, while the standalone ASM applications rely on an explicit semaphore polling mechanism for synchronization purposes. Crossbars show a substantial advantage in the OS-IND and ASM-IND benchmarks, wherein processors only access private memories: this operation is obviously suitable for parallelization. Both ST-FC and ST-54 achieve the minimum theoretical latency, since no conflict on private memories ever arises. ST-32 trails immediately behind ST-FC and ST-54, with rare conflicts which do not occur systematically because execution times shift among conflicting processors. OS-PIP still shows significant improvement for crossbar designs. ASM-PIP, in contrast, puts ST-BUS on the same level as the crossbars, and sometimes the shared bus even proves slightly faster. This can be explained by the continuous semaphore polling performed by this (and only this) benchmark; while crossbars may have an advantage in private memory accesses, the resulting speedup only gives processors more opportunities to poll the semaphore device, which becomes a bottleneck. Unpredictability of conflict patterns can then explain why a simple shared bus can sometimes slightly outperform crossbars; the selection of bus topology should therefore carefully match the target communication pattern.
20.9 Packet-Switched Interconnection Networks

Previous sections have illustrated on-chip interconnection schemes based on shared buses and on evolutionary communication architectures. This section introduces a more revolutionary approach to on-chip communication, known as Network-on-Chip (NoC) [2,3]. The NoC architecture consists of a packet-switched interconnection network integrated onto a single chip, and it is likely to better support the trend for SoC integration. The basic idea is borrowed from the wide-area networks domain, and envisions router (or switch)-based networks of interconnects on which on-chip packetized communication takes place. Cores access the network by means of proper interfaces, and have their packets forwarded to destination through a certain number of hops. SoCs differ from wide-area networks in their local proximity and because they exhibit less nondeterminism. Local, high-performance networks, such as those developed for large-scale multiprocessors, have
similar requirements and constraints. However, some distinctive features, such as energy constraints and design-time specialization, are unique to SoC networks. Topology selection for NoCs is a critical design issue. It is determined by how efficiently the communication requirements of an application can be mapped onto a certain topology, and by physical-level considerations. In fact, regular topologies can be designed with better control of electrical parameters and therefore of communication noise sources (such as crosstalk), although they might result in link under-utilization or localized congestion from an application viewpoint. Irregular topologies, by contrast, have to deal with more complex physical design issues but are more suitable for implementing customized, domain-specific communication architectures. Two-dimensional mesh networks are a reference solution for regular NoC topologies (a routing sketch for such meshes closes this section). The scalable and modular nature of NoCs and their support for efficient on-chip communication potentially leads to NoC-based multiprocessor systems characterized by high structural complexity and functional diversity. On one hand, these features need to be properly addressed by means of new design methodologies, while on the other hand more effort has to be devoted to modeling on-chip communication architectures and integrating them into a single modeling and simulation environment combining both processing elements and communication architectures. The development of NoC architectures and their integration into a complete MPSoC design flow is the main focus of an ongoing worldwide research effort [30–33].
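For regular 2D meshes, dimension-ordered (XY) routing is the textbook switching decision: a packet first travels along the X axis to the destination column, then along Y. The sketch below illustrates the per-switch decision; the port names and coordinate convention are ours, and real NoC switches add buffering, flow control, and arbitration omitted here.

#include <cstdio>

// Dimension-ordered (XY) routing decision for a 2D-mesh NoC switch: route
// along X until the destination column is reached, then along Y. XY routing
// is deadlock-free on meshes because it never turns from Y back to X.
enum class Port { Local, East, West, North, South };

Port route_xy(int x, int y, int dest_x, int dest_y) {
    if (dest_x > x) return Port::East;
    if (dest_x < x) return Port::West;
    if (dest_y > y) return Port::North;
    if (dest_y < y) return Port::South;
    return Port::Local;   // arrived: deliver to the attached core
}

int main() {
    static const char* names[] = {"Local", "East", "West", "North", "South"};
    // Hop-by-hop path of a packet from switch (0,0) to switch (2,1).
    int x = 0, y = 0;
    for (;;) {
        Port p = route_xy(x, y, 2, 1);
        std::printf("(%d,%d) -> %s\n", x, y, names[static_cast<int>(p)]);
        if (p == Port::Local) break;
        if (p == Port::East) ++x; else if (p == Port::West) --x;
        else if (p == Port::North) ++y; else --y;
    }
}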
20.10 Conclusions

This chapter addresses the critical issue of on-chip communication for gigascale MPSoCs. An overview of the most widely used on-chip communication architectures is provided, and evolution guidelines aiming at overcoming scalability limitations are sketched. Advances concern both communication protocols and topologies, although it is becoming clear that in the long term more aggressive approaches, namely packet-switched interconnection networks, will be required to sustain system performance.
References

[1] R. Ho, K.W. Mai, and M.A. Horowitz. The future of wires. Proceedings of the IEEE, 89: 490–504, 2001.
[2] L. Benini and G. De Micheli. Networks on chips: a new SoC paradigm. IEEE Computer, 35: 70–78, 2002.
[3] J. Henkel, W. Wolf, and S. Chakradhar. On-chip networks: a scalable, communication-centric embedded system design paradigm. In Proceedings of the International Conference on VLSI Design, January 2004, pp. 845–851.
[4] ARM. AMBA Specification v2.0, 1999.
[5] ARM. AMBA Multi-Layer AHB Overview, 2001.
[6] ARM. AMBA AXI Protocol Specification, 2003.
[7] IBM Microelectronics. CoreConnect Bus Architecture Overview, 1999.
[8] G.W. Doerre and D.E. Lackey. The IBM ASIC/SoC methodology. A recipe for first-time success. IBM Journal of Research & Development, 46: 649–660, 2002.
[9] IBM Microelectronics. The CoreConnect Bus Architecture White Paper, 1999.
[10] P. Wodey, G. Camarroque, F. Barray, R. Hersemeule, and J.P. Cousin. LOTOS code generation for model checking of STBus based SoC: the STBus interconnection. In Proceedings of ACM and IEEE International Conference on Formal Methods and Models for Co-Design, June 2003, pp. 204–213.
[11] R. Herveille. Combining WISHBONE Interface Signals. Application Note, April 2001.
[12] R. Herveille. WISHBONE System-on-Chip (SoC) Interconnection Architecture for Portable IP Cores. Specification, 2002.
[13] R. Usselmann. OpenCores SoC Bus Review, 2001.
[14] Sonics Inc. µ-Networks. Technical Overview, 2002.
[15] Sonics Inc. SiliconBackplane III MicroNetwork IP. Product Brief, 2002.
[16] P. de Nier. Property checking of PI-bus modules. In Proceedings of the Workshop on Circuits, Systems and Signal Processing (ProRISC99), J.P. Veen, Ed. STW, Technology Foundation, Mierlo, The Netherlands, 1999, pp. 343–354.
[17] ESPRIT, 1996, http://www.cordis.lu/esprit/src/results/res_area/omi/omi10.htm
[18] Altera. AHB to Avalon & Avalon to AHB Bridges, 2003.
[19] Altera. Avalon Bus Specification, 2003.
[20] Palmchip. Overview of the CoreFrame Architecture, 2001.
[21] IDT. IDT Peripheral Bus (IPBus). Intermodule Connection Technology Enables Broad Range of System-Level Integration, 2002.
[22] Motorola. IP Interface. Semiconductor Reuse Standard, 2001.
[23] W.J. Bainbridge and S.B. Furber. MARBLE: an asynchronous on-chip macrocell bus. Microprocessors and Microsystems, 24: 213–222, 2000.
[24] Altera. Atlantic Interface. Functional Specification, 2002.
[25] ClearSpeed. ClearConnect Bus. Scalable High Performance On-Chip Interconnect, 2003.
[26] Summary of SoC Interconnection Buses, 2004, http://www.silicore.net/uCbusum.htm
[27] Synopsys CoCentric, 2004, http://www.synopsys.com
[28] L. Benini, D. Bertozzi, D. Bruni, N. Drago, F. Fummi, and M. Poncino. SystemC cosimulation and emulation of multiprocessor SoC designs. IEEE Computer, 36: 53–59, 2003.
[29] F. Poletti, D. Bertozzi, A. Bogliolo, and L. Benini. Performance analysis of arbitration policies for SoC communication architectures. Journal of Design Automation for Embedded Systems, 8: 189–210, 2003.
[30] M. Loghi, F. Angiolini, D. Bertozzi, L. Benini, and R. Zafalon. Analyzing on-chip communication in a MPSoC environment. In Proceedings of the IEEE Design Automation and Test in Europe Conference (DATE04), February 2004, pp. 752–757.
[31] E. Rijpkema, K. Goossens, and A. Radulescu. Trade-offs in the design of a router with both guaranteed and best-effort services for networks on chip. In Proceedings of Design Automation and Test in Europe, March 2003, pp. 350–355.
[32] K. Lee et al. A 51 mW 1.6 GHz on-chip network for low-power heterogeneous SoC platform. In ISSCC Digest of Technical Papers, 2004, pp. 152–154.
[33] E. Bolotin, I. Cidon, R. Ginosar, and A. Kolodny. QNoC: QoS architecture and design process for network on chip. The Journal of Systems Architecture, Special Issue on Networks on Chip, 50(2–3): 105–128, February 2004.
21 Network-on-Chip Design for Gigascale Systems-on-Chip

Davide Bertozzi and Luca Benini, University of Bologna
Giovanni De Micheli, Stanford University

21.1 Introduction
21.2 Design Challenges for On-Chip Communication Architectures
21.3 Related Work
21.4 NoC Architecture: Network Link • Switch • Network Interface
21.5 NoC Topology: Domain-Specific NoC Synthesis Flow
21.6 Conclusions
Acknowledgment
References
21.1 Introduction

The increasing integration densities made available by shrinking device geometries will have to be exploited to meet the computational requirements of parallel applications, such as multimedia processing, automotive, multiwindow TV, and ambient intelligence. As an example, systems designed for ambient intelligence will be based on high-speed digital signal processing, with computational loads ranging from 10 MOPS for lightweight audio processing, 3 GOPS for video processing, and 20 GOPS for multilingual conversation interfaces, up to 1 TOPS for synthetic video generation. This computational challenge will have to be addressed at manageable power levels and affordable costs [1]. Such performance cannot be provided by a single processor; it requires a heterogeneous on-chip multiprocessor system containing a mix of general-purpose programmable cores, application-specific processors, and dedicated hardware accelerators. In this context, the performance of gigascale Systems-on-Chip (SoCs) will be communication dominated, and only an interconnect-centric system architecture will be able to cope with this problem. Current on-chip interconnects consist of low-cost shared arbitrated buses, based on the serialization of bus access requests; only one master at a time can be granted access to the bus. The main drawback of this solution is its lack of scalability, which will result in unacceptable performance degradation for complex SoCs (more than a dozen integrated cores).
FIGURE 21.1 Example of NoC architecture (NI: network interface; S: switch).
Moreover, the connection of new blocks to a shared bus increases its associated load capacitance, resulting in more energy-consuming bus transactions. A scalable communication infrastructure that better supports the trend of SoC integration consists of an on-chip micronetwork of interconnects, generally known as the Network-on-Chip (NoC) architecture [2–4]. The basic idea is borrowed from the wide-area networks domain, and envisions router (or switch)-based networks on which on-chip packetized communication takes place, as depicted in Figure 21.1. Cores access the network by means of proper interfaces, and have their packets forwarded to destination through a certain number of hops. The scalable and modular nature of NoCs and their support for efficient on-chip communication potentially leads to NoC-based multiprocessor systems characterized by high structural complexity and functional diversity. On one hand, these features need to be properly addressed by means of new design methodologies [5], while on the other hand more efforts have to be devoted to modeling on-chip communication architectures and integrating them into a single modeling and simulation environment combining both processing elements and communication infrastructures [6–8]. These efforts are needed to include the on-chip communication architecture in any quantitative evaluation of system design during design space exploration [9,10], so as to be able to assess the impact of the interconnect on achieving a target system performance. An important design decision for NoCs regards the choice of topology. Several researchers [4,5,11,12] envision NoCs as regular tile-based topologies (such as mesh networks and fat trees), which are suitable for interconnecting homogeneous cores in a chip multiprocessor. However, SoC component specialization (used by designers to optimize performance at low power consumption and competitive cost) leads to the on-chip integration of heterogeneous cores having varied functionality, size, and communication requirements. If a regular interconnect is designed to match the requirements of a few communication-hungry components, it is bound to be largely overdesigned with respect to the needs of the remaining components. This is the main reason why most current SoCs use irregular topologies, such as bridged buses and dedicated point-to-point links [13]. This chapter introduces basic principles and guidelines for NoC design. At first, the motivation for the design paradigm shift of SoC communication architectures from shared buses to NoCs is examined. Then, the chapter goes into the details of the NoC building blocks (switch, network interface, and switch-to-switch links), discussing design guidelines and presenting a case study where some of the most advanced concepts in NoC design have been applied to a real NoC architecture (called Xpipes and developed at the University of Bologna [14]). Finally, the challenging issue of heterogeneous NoC design will be addressed, and the effects of mapping the communication requirements of an application onto a domain-specific NoC, instead of a network with regular topology, will be detailed by means of an illustrative example.
21.2 Design Challenges for On-Chip Communication Architectures

SoC design challenges that are driving the evolution of traditional bus architectures toward NoCs can be outlined as follows:

Technology issues. While gate delays scale down with technology, global wire delays typically increase or remain constant as repeaters are inserted. It is estimated that in 50 nm technology, at a clock frequency of 10 GHz, a global wire delay might range from 6 to 10 clock cycles [2]. Therefore, limiting the on-chip distance traveled by critical signals will be key to guaranteeing the performance of the overall system, and will be a common design guideline for all kinds of system interconnects. Other challenges posed by deep submicron technologies, by contrast, are leading to a paradigm shift in the design of SoC communication architectures. For instance, global synchronization of cores on future SoCs will be unfeasible owing to deep submicron effects (clock skew, power associated with the clock distribution tree, etc.), and an alternative scenario consists of self-synchronous cores that communicate with one another through a network-centric architecture [15]. Finally, signal integrity issues (crosstalk, power supply noise, soft errors, etc.) will lead to more transient and permanent failures of signals, logic values, devices, and interconnects, thus raising the reliability concern for on-chip communication [16]. In many cases, on-chip networks can be designed as regular structures, allowing the electrical parameters of wires to be optimized and well controlled. This leads to lower communication failure probabilities, thus enabling the use of low-swing signaling techniques [17], and to the capability of exploiting performance optimization techniques, such as wavefront pipelining [18].

Performance issues. In traditional buses, all communication actors share the same bandwidth. As a consequence, performance does not scale with the level of system integration, but degrades significantly. However, once the bus is granted to a master, access occurs with no additional delay. NoCs, on the contrary, can provide much better performance scalability. No delays are experienced for accessing the communication infrastructure, since multiple outstanding transactions originated by multiple cores can be handled at the same time, resulting in more efficient utilization of network resources. However, given a certain network dimension (e.g., number of instantiated switches), large latency fluctuations for packet delivery could be experienced as a consequence of network congestion. This is unacceptable when hard real-time constraints of an application have to be met, and two solutions are viable: network overdimensioning (for NoCs designed to support Best Effort [BE] traffic only) or implementation of dedicated mechanisms to provide guarantees for timing-constrained traffic (e.g., loss-less data transport, minimal bandwidth, bounded latency, minimal throughput, etc.) [19].

Design productivity issues. It is well known that synthesis and compiler technology development do not keep up with IC manufacturing technology development [20]. Moreover, time-to-market needs to be kept as low as possible. Reuse of complex preverified design blocks is an efficient means to increase productivity, and regards both computation resources and the communication infrastructure [21]. It would be highly desirable to have processing elements that could be employed in different platforms by means of a plug-and-play design style.
To this end, a scalable and modular on-chip network represents a more efficient communication infrastructure than shared-bus-based architectures. The reuse of processing elements is facilitated by the definition of standard network interfaces, which also make the modularity property of the NoC effective. The Virtual Socket Interface Alliance (VSIA) has attempted to set the characteristics of this interface industry-wide [22]. Open Core Protocol (OCP) [23] is another example of a standard interface socket for cores. It is worth remarking that such network interfaces also decouple the development of new cores from the evolution of new communication architectures: the core developer does not have to make assumptions about the system into which the core will be plugged. Similarly, designers of new on-chip interconnects are not constrained by the detailed interfacing requirements of particular legacy SoC components. Finally, let us observe that NoC components (e.g., switches or interfaces) can be instantiated multiple times in the same design (as opposed to the arbiter of a traditional shared bus, which is instance-specific) and reused in a large number of products targeting a specific application domain.
The development of NoC architectures and protocols is fueled by the aforementioned arguments, in spite of the challenges represented by the need for new design methodologies and an increased complexity of system design.
21.3 Related Work

The need to progressively replace on-chip buses with micronetworks was extensively discussed in [2,4], and a number of NoC architectures have been proposed in the literature so far.

Sonics MicroNetwork [24] is an on-chip network making use of communication-architecture-independent interface sockets. The MicroNetwork is an example of the evolutionary solutions [25], which start from a physical implementation as a shared bus and propose generalizations to support higher bandwidth (such as partial and full crossbars). The STBUS interconnect from STMicroelectronics is another example of an evolutionary architecture, providing designers with the capability to instantiate shared bus, partial crossbar, or full crossbar interconnect configurations. Even though these architectures provide higher bandwidth than simple buses, addressing the wiring delay and scalability challenge in the long term requires more radical solutions.

One of the earliest contributions in this area is the Maia heterogeneous signal processing architecture, proposed by Zhang et al. [26], based on a hierarchical mesh network. Unfortunately, Maia's interconnect is fully instance-specific. Furthermore, routing is static at configuration time: network switches are programmed once and for all for a given application (as in a Field Programmable Gate Array [FPGA]). Thus, communication is based on circuit switching, as opposed to packet switching. In this direction, Dally and Lacy [27] sketch the architecture of a VLSI multicomputer using 2009 technology: a chip with 64 processor-memory tiles is envisioned, and communication is based on packet switching. This seminal work draws upon past experience in designing parallel computers and reconfigurable architectures (FPGAs and their evolutions) [28–30].

Most proposed NoC platforms are packet switched and exhibit regular structure. An example is a mesh interconnection, which can rely on a simple layout and on a switch design that is independent of the network size. The NOSTRUM network described in Reference 5 takes this approach: the platform includes both a mesh architecture and the design methodology. The Scalable Programmable Integrated Network (SPIN) described in Reference 31 is another regular, fat-tree-based network architecture. It adopts cut-through switching to minimize message latency and storage requirements in the design of network switches. The Linkoeping SoCBUS [32] is a two-dimensional mesh network that uses a packet connected circuit (PCC) to set up routes through the network: a packet is switched through the network, locking the circuit as it goes. This notion of virtual circuit leads to deterministic communication behavior but restricts routing flexibility for the rest of the communication traffic.

The need to map communication requirements of heterogeneous cores may lead to the adoption of irregular topologies. The motivation for such architectures lies in the fact that each block can be optimized for a specific application (e.g., video or audio processing), and link characteristics can be adapted to the communication requirements of the interconnected cores. Supporting heterogeneous architectures requires a major design effort and leads to coarser-granularity control of physical parameters.
Many recent heterogeneous SoC implementations are still based on shared buses (such as the single-chip MPEG-2 codec reported in Reference 33), but the growing complexity of customizable media embedded processor architectures for digital media processing will soon require NoC-based communication architectures and proper hardware/software development tools. The Aethereal NoC design framework presented in Reference 34 aims at providing a complete infrastructure for developing heterogeneous NoCs with end-to-end quality of service guarantees. The network supports guaranteed throughput (GT) for real-time applications and BE traffic for timing-unconstrained applications. Support for heterogeneous architectures requires highly configurable network building blocks, customizable at instantiation time for a specific application domain. For instance, the Proteo NoC [35] consists of
a small library of predefined, parameterized components that allow the implementation of a large range of different topologies, protocols, and configurations. Xpipes interconnect [14] and its synthesizer XpipesCompiler [36] push this approach to the limit, by instantiating an application-specific NoC from a library of composable soft macros (network interface, link, and switch). The components are highly parameterizable and provide reliable and latency-insensitive operation.
21.4 NoC Architecture

Most of the terminology for on-chip packet-switched communication is adapted from the computer network and multiprocessor domains. Messages that have to be transmitted across the network are usually partitioned into fixed-length packets. Packets in turn are often broken into message flow control units called flits. In the presence of channel width constraints, multiple physical channel cycles can be used to transfer a single flit. A phit is the unit of information that can be transferred across a physical channel in a single step. Flits represent logical units of information, as opposed to phits, which correspond to physical quantities. In many implementations, a flit is set to be equal to a phit.

The basic building blocks for packet-switched communication across NoCs are:

1. Network link
2. Switch
3. Network interface

Each is described hereafter.
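Before looking at each block, the packet/flit hierarchy can be made concrete with a short sketch. The following C++ fragment is illustrative only: the packet and flit sizes are assumptions, not those of any particular NoC, and if the physical channel is narrower than a flit (phit < flit), each flit simply takes several channel cycles to transfer.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative sizes (assumptions for this sketch, not a real NoC spec).
constexpr std::size_t PACKET_PAYLOAD_BYTES = 64; // fixed-length packets
constexpr std::size_t FLIT_BYTES = 4;            // 32-bit flits

struct Flit {
    enum class Type { Header, Payload, Tail } type;
    std::uint8_t data[FLIT_BYTES];
};

// Partition a message into fixed-length packets and each packet into flits:
// one header flit (routing/control info), payload flits, and one tail flit
// (end-of-packet marker, typically carrying parity or CRC bits).
std::vector<std::vector<Flit>> packetize(const std::vector<std::uint8_t>& msg) {
    std::vector<std::vector<Flit>> packets;
    for (std::size_t off = 0; off < msg.size(); off += PACKET_PAYLOAD_BYTES) {
        std::vector<Flit> pkt;
        pkt.push_back(Flit{Flit::Type::Header, {}});
        const std::size_t end = std::min(msg.size(), off + PACKET_PAYLOAD_BYTES);
        for (std::size_t i = off; i < end; i += FLIT_BYTES) {
            Flit f{Flit::Type::Payload, {}};
            for (std::size_t b = 0; b < FLIT_BYTES && i + b < end; ++b)
                f.data[b] = msg[i + b];
            pkt.push_back(f);
        }
        pkt.push_back(Flit{Flit::Type::Tail, {}});
        packets.push_back(std::move(pkt));
    }
    return packets;
}
```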
21.4.1 Network Link

The performance of the interconnect is a major concern in scaled technologies. As geometries shrink, gate delay improves much faster than the delay of long wires. Therefore, long wires increasingly determine the maximum clock rate, and hence the performance, of the entire design. The problem is particularly serious for domain-specific heterogeneous SoCs, where the wire structure is highly irregular and may include both short and extremely long switch-to-switch links. Moreover, it has been estimated that only a fraction of the chip area (between 0.4 and 1.4%) will be reachable in one clock cycle [37].

A solution to the interconnect-delay problem consists of pipelining interconnects [38,39]. Wires can be partitioned into segments by means of relay stations (whose function is similar to that of the latches on a pipelined data path), with segment lengths chosen to satisfy predefined timing requirements (e.g., the desired clock speed of the design). In this way, link delay is turned into latency, and the data introduction rate is no longer bounded by the link delay. The latency of a channel connecting two modules may now end up being more than one clock cycle. Therefore, if the functionality of the design is based on the sequencing of the signals and not on their exact timing, link pipelining does not change the functional correctness of the design. This requires the system to be made of modules whose behavior does not depend on the latency of the communication channels (latency-insensitive operation). As a consequence, interconnect pipelining can be seen as part of a new and more general methodology for deep submicron (DSM) designs, which can be envisioned as synchronous distributed systems composed of functional modules that exchange data on communication channels according to a latency-insensitive protocol. This protocol ensures that functionally correct modules behave correctly independently of the channel latencies [38]. The effectiveness of the latency-insensitive design methodology is strongly related to the ability of maintaining sufficient communication throughput in the presence of increased channel latencies.

The International Technology Roadmap for Semiconductors (ITRS) 2001 [15] assumes that interconnect pipelining is the strategy of choice in its estimates of achievable clock speeds for MPUs. Some industrial designs already make use of interconnect pipelining. For instance, the NETBURST microarchitecture of the Pentium 4 contains instances of a stage dedicated exclusively to handling wire delays: in fact, a so-called drive
stage is used only to move signals across the chip without performing any computation and can therefore be seen as a physical implementation of a relay station [40].

The Xpipes interconnect makes use of pipelined links and of latency-insensitive operation in the implementation of its building blocks. Switch-to-switch links are subdivided into basic segments whose length guarantees that the desired clock frequency (i.e., the maximum speed provided by a certain technology) can be used. In this way, the system operating frequency is not bound by the delay of long links. According to the link length, a certain number of clock cycles is needed by a flit to cross the interconnect. If network switches are designed in such a way that their functional correctness depends on the flit arrival order and not on flit timing, the input links of a switch can be different and of any length. These design choices are at the basis of the latency-insensitive operation of the NoC, and they allow the construction of an arbitrary network topology and hence support for heterogeneous architectures.

FIGURE 21.2 Pipelined link model and latency-insensitive link-level error control.

Figure 21.2 illustrates the link model, which is equivalent to a pipelined shift register. Pipelining has been used for both data and control lines. The figure also illustrates how pipelined links are used to support latency-insensitive link-level error control, ensuring robustness against communication errors; the retransmission of a corrupted flit between two successive switches is represented. Multiple outstanding flits propagate across the link during the same clock cycle. When flits are correctly received at the destination switch, an ACK is propagated back to the source, and after N clock cycles (where N is the
length of the link expressed in number of repeater stages) the flit is discarded from the buffer of the source switch. A corrupted flit, on the contrary, is NACKed and will be retransmitted in due time. The implemented retransmission policy is Go-Back-N, chosen to keep the switch complexity as low as possible.
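A behavioral sketch of the Go-Back-N policy at the source switch follows. This is a software illustration under simplifying assumptions (one feedback signal per in-flight flit, delivered in order), not the RTL of the Xpipes link; the window size would be set to the 2N + M round-trip budget discussed in Section 21.4.2.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>

struct Flit { std::uint64_t payload; };

// Behavioral sketch of Go-Back-N link-level retransmission: the source keeps
// a copy of every in-flight flit until it is ACKed; on a NACK it rewinds and
// resends the corrupted flit and everything transmitted after it.
class GoBackNSender {
public:
    explicit GoBackNSender(std::size_t window) : window_(window) {}

    bool can_send() const { return inflight_.size() < window_; }

    void send(const Flit& f) { inflight_.push_back(f); /* drive link here */ }

    // Called when the ACK/NACK for the oldest in-flight flit comes back
    // (after N link cycles each way plus M cycles of switch latency).
    void on_feedback(bool ack) {
        if (ack) {
            inflight_.pop_front();             // safe to discard the copy
        } else {
            // Go-Back-N: retransmit the NACKed flit and all its successors.
            for (const Flit& f : inflight_) resend(f);
        }
    }

private:
    void resend(const Flit&) { /* drive link again */ }
    std::size_t window_;
    std::deque<Flit> inflight_;
};
```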
21.4.2 Switch

The task of the switch is to carry packets injected into the network to their final destination, following a statically defined or dynamically determined routing path. The switch transfers packets from one of its input ports to one or more of its output ports. Switch design is usually characterized by a power-performance trade-off: power-hungry switch memory resources may be required to support high-performance on-chip communication. A specific switch design may include both input and output buffers, or only one type of buffer. Input queuing uses fewer buffers, but suffers from head-of-line blocking. Virtual output queuing has higher performance, but at the cost of more buffers.

Network flow control (or routing mode) specifically addresses the limited amount of buffering resources in switches. Three policies are feasible in this context [41] (they are contrasted in a short sketch below). In store-and-forward routing, an entire packet is received and entirely stored before being forwarded to the next switch. This is the most demanding approach in terms of memory requirements and switch latency. Virtual cut-through routing also requires buffer space for an entire packet, but allows lower-latency communication: a packet is forwarded as soon as the next switch guarantees that the complete packet will be accepted; if this is not the case, the current router must be able to store the whole packet. Finally, a wormhole routing scheme can be employed to reduce switch memory requirements and to permit low-latency communication. The first flit of a packet contains routing information, and header-flit decoding enables the switches to establish the path; subsequent flits simply follow this path in a pipelined fashion by means of switch output port reservation. A flit is passed to the next switch as soon as enough space is available to store it, even though there is not enough space to store the whole packet. If a certain flit faces a busy channel, subsequent flits have to wait at their current locations and are therefore spread over multiple switches, thus blocking the intermediate links. This scheme avoids buffering the full packet at one switch and keeps end-to-end latency low, although it is more sensitive to deadlock and may result in low link utilization.

Guaranteeing quality of service in switch operation is another important design issue, which needs to be addressed when time-constrained (hard or soft real-time) traffic is to be supported. Throughput guarantees or latency bounds are examples of time-related guarantees. Contention-related delays are responsible for large fluctuations of performance metrics, and a fully predictable system can be obtained only by means of contention-free routing schemes. With circuit switching, a connection is set up over which all subsequent data is transported. Therefore, contention resolution takes place at setup time at the granularity of connections, and time-related guarantees can be given during data transport. In time-division circuit switching (see Reference 24 for an example), bandwidth is shared by time-division multiplexing connections over circuits. In packet switching, contention is unavoidable since packet arrival cannot be predicted; arbitration mechanisms and buffering resources must therefore be implemented at each switch, delaying data in an unpredictable manner and making it difficult to provide guarantees. BE NoC architectures can mainly rely on network overdimensioning to bound fluctuations of performance metrics.
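Returning to the three flow-control policies above, a compact way to contrast them is to express the condition under which a switch may start forwarding. The sketch below uses illustrative names; note that store-and-forward additionally requires the entire packet to have been received locally, which the code records only as a comment.

```cpp
#include <cstddef>

enum class FlowControl { StoreAndForward, VirtualCutThrough, Wormhole };

// Minimum downstream buffer space (in flits) that must be guaranteed before
// a switch forwards a packet of packet_flits flits.
std::size_t space_needed_to_forward(FlowControl fc, std::size_t packet_flits) {
    switch (fc) {
    case FlowControl::StoreAndForward:
        // Also requires the whole packet to be received and stored locally
        // before forwarding begins: highest memory cost and latency.
        return packet_flits;
    case FlowControl::VirtualCutThrough:
        // May forward early, but only once the next hop guarantees room
        // for the complete packet; otherwise it must store it all.
        return packet_flits;
    case FlowControl::Wormhole:
        // One flit of space suffices; a stalled packet stays spread over
        // multiple switches, blocking the intermediate links.
        return 1;
    }
    return 0; // unreachable; keeps compilers happy
}
```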
The Aethereal NoC architecture makes use of a router that combines GT and BE services [34]. The GT router subsystem is based on a time-division multiplexed circuit switching approach. A router uses a slot table to (1) avoid contention on a link, (2) divide up bandwidth per link between connections, and (3) switch data to the correct output. Every slot table T has S time slots (rows) and N router outputs (columns). There is a logical notion of synchronicity: all routers in the network are in the same fixed-duration slot. In a slot s, at most one block of data can be read/written per input/output port. In the next slot, the read blocks are written to their appropriate output ports. Blocks thus propagate in a store-and-forward fashion. The latency a block incurs per router is equal to the duration of a slot, and bandwidth is guaranteed in multiples of the block size per S slots.
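A minimal sketch of this slot-table mechanism follows, with illustrative types and names (the real Aethereal tables are programmed through BE packets, as described below): a table with S slots and N outputs, advanced in lockstep network-wide, tells each router which input, if any, feeds each output in the current slot.

```cpp
#include <cstddef>
#include <optional>
#include <vector>

// GT slot table: S time slots (rows) x N router outputs (columns).
// Entry (s, o) names the input port whose block is switched to output o
// in slot s; an empty entry means the slot is unreserved (usable by BE
// traffic in the combined GT-BE router).
class SlotTable {
public:
    SlotTable(std::size_t slots, std::size_t outputs)
        : table_(slots, std::vector<std::optional<int>>(outputs)) {}

    // At most one block per input/output port per slot.
    void reserve(std::size_t slot, std::size_t output, int input) {
        table_[slot][output] = input;
    }

    // All routers share the same logical slot counter (synchronicity),
    // so the lookup wraps around every S slots.
    std::optional<int> input_for(std::size_t slot, std::size_t output) const {
        return table_[slot % table_.size()][output];
    }

private:
    std::vector<std::vector<std::optional<int>>> table_;
};
```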
The BE router uses packet switching; it has been shown that both input queuing with wormhole or virtual cut-through routing and virtual output queuing with wormhole routing are feasible in terms of buffering cost. The BE and GT router subsystems are combined in the Aethereal router architecture of Figure 21.3. The GT router offers a fixed end-to-end latency for its traffic, which is given the highest priority by the arbiter. The BE router uses all the bandwidth (slots) that has not been reserved or used by GT traffic. GT router slot tables are programmed by means of BE packets (see the arrow "program" in Figure 21.3). Negotiations, resulting in slot allocation, can be done at compile time and configured deterministically at runtime; alternatively, negotiations can be done at runtime.

FIGURE 21.3 A combined GT–BE router: (a) conceptual view; (b) hardware view.

A different perspective has been taken in the design of the switch for the BE Xpipes NoC. Figure 21.4 shows an example configuration with four inputs, four outputs, and two virtual channels multiplexed across the same physical output link. A physical link is assigned to different virtual channels on a flit-by-flit basis, thereby improving network throughput. Switch operation is latency-insensitive, in that correct operation is guaranteed for arbitrary link pipeline depth.

FIGURE 21.4 Example of switch configuration with two virtual channels.
In fact, as explained above, network links in the Xpipes interconnect are pipelined with a flexible number of stages, thereby decoupling the link data introduction rate from the link's physical length. For latency-insensitive operation, the switch has virtual channel registers to store 2N + M flits, where N is the link length (expressed as a number of basic repeater stages) and M is a switch architecture-related contribution (12 cycles in this design). The reason is that each transmitted flit has to be acknowledged before being discarded from the buffer. Before an ACK is received, the flit has to travel across the link (N cycles), an ACK/NACK decision has to be taken at the destination switch (a portion of the M cycles), and the ACK/NACK signal has to be propagated back (N cycles) and recognized by the source switch (the remaining portion of the M cycles). During this time, another 2N + M flits are transmitted but not yet ACKed.

Output buffering was chosen for the Xpipes switches, and the resulting architecture is reported in Figure 21.5. It consists of a replication of the same output module, with each module accepting all input ports as its inputs. Flow control signals generated by each output block are directed to a centralized module that takes care of generating proper ACKs or NACKs for the incoming flits from the different input ports. Each output module is deeply pipelined (seven pipeline stages) so as to maximize the operating clock frequency of the switch. Architectural details of the pipelined output module are illustrated in Figure 21.6. Forward flow control is used, and a flit is transmitted to the next switch only when adequate storage is available. The CRC decoders for error detection work in parallel with the switch operation, thereby hiding their impact on switch latency.
FIGURE 21.5 Architecture of output buffered Xpipes switch.
FIGURE 21.6 Architecture of an Xpipes switch output module.
The first pipeline stage checks the header of incoming packets on the different input ports to determine whether those packets have to be routed through the output port under consideration. Only matching packets are forwarded to the second stage, which resolves contention based on a round-robin policy. Arbitration is carried out upon receipt of the tail flits of preceding packets, so that all other flits of a packet can be propagated without contention resolution at this stage. A NACK is generated for flits of nonselected packets. The third stage is just a multiplexer, which selects the prioritized input port. The following arbitration stage keeps the status of the virtual channel registers and decides whether the flits can be stored into the registers or not. A header flit is sent to the register with more free locations, followed by successive flits of the same packet. The fifth stage is the actual buffering stage, and the ACK/NACK message at this stage indicates whether a flit has been successfully stored or not. The following stage takes care of forward flow control, and finally a last arbitration stage multiplexes the virtual channels onto the physical output link.

Finally, the switch is highly parameterizable. Design parameters are: number of I/O ports, flit width, number of virtual channels, length of switch-to-switch links, and size of output registers.
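These parameters translate naturally into a configuration record. The sketch below uses illustrative names; the virtual-channel register depth, however, follows the 2N + M rule explained above (N link repeater stages, M = 12 cycles of switch-internal latency in this design).

```cpp
#include <cstddef>

// Illustrative configuration record for an Xpipes-like switch instance.
struct SwitchConfig {
    std::size_t num_inputs;           // number of I/O ports
    std::size_t num_outputs;
    std::size_t flit_width_bits;      // flit width
    std::size_t num_virtual_channels;
    std::size_t link_stages;          // N: switch-to-switch link pipeline depth
    std::size_t output_register_size; // size of output registers
};

// Credit window per virtual channel: a flit is kept until its ACK returns,
// which takes N cycles out on the link, part of M for the ACK/NACK decision,
// N cycles back, and the rest of M to be recognized at the source.
constexpr std::size_t vc_register_depth(std::size_t link_stages,
                                        std::size_t switch_latency = 12) {
    return 2 * link_stages + switch_latency; // 2N + M flits in flight
}
```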
21.4.3 Network Interface

The most relevant tasks of the network interface are: (1) hiding the details of the network communication protocol from the cores, so that they can be developed independently of the communication infrastructure; (2) communication protocol conversion (from end-to-end to network protocol); and (3) data packetization (packet assembly, delivery, and disassembly).

The first objective can be achieved by means of standard interfaces. For instance, the VSIA vision [22] is to specify open standards and specifications that facilitate the integration of software and hardware
virtual components from multiple sources. Interfaces of different complexity are described in the standard, from the Peripheral Virtual Component Interface (VCI) to the Basic VCI and the Advanced VCI. Another example of a standard socket to interface cores to networks is the Open Core Protocol (OCP) [23]. Its main characteristics are a high degree of configurability, to adapt to the core's functionality, and the independence of the request and response phases, thus supporting multiple outstanding requests and pipelining of transfers.

Data packetization is a critical task for the network interface, and it has an impact on the communication latency, in addition to the latency of the communication channel itself. The packet-preparation process consists of building the packet header, the payload, and the packet tail. The header contains the necessary routing and network control information (e.g., source and destination addresses). When source routing is used, the destination address is ignored and replaced with a route field that specifies the route to the destination. This overhead in terms of packet header is counterbalanced by simpler routing logic at the network switches: they simply have to look at the route field and route the packet to the specified switch output port. The packet tail indicates the end of a packet and usually contains parity bits for error-detecting or error-correcting codes.

An insight into the Xpipes network interface implementation will provide an example of these concepts. It provides a standardized OCP-based interface to network nodes. The network interface for cores that initiate communication (initiators) needs to turn OCP-compliant transactions into packets to be transmitted across the network. It represents the slave side of an OCP end-to-end connection, and it is therefore referred to as the network interface slave (NIS). Its architecture is shown in Figure 21.7.

FIGURE 21.7 Architecture of the Xpipes NIS.
The NIS has to build the packet header, which has to be spread over a variable number of flits depending on the length of the path to the destination node. In fact, Xpipes relies on a static routing algorithm called street-sign routing. Routes are derived by the network interface by accessing a look-up table based on the destination address. This information consists of direction bits, read by each switch, indicating the output port of the switch to which the flits belonging to a certain packet have to be directed. The look-up table is accessed by the STATIC_PACKETING block, a finite state machine that forwards the routing information numSB (number of hops to destination) and lutword (word read from the look-up table), as well as the request-related information datastream from the initiator core, to the DP_FAST block, provided the enable signal busy_dpfast is not asserted.

Based on the input data, module DP_FAST has the task of building the flits to be transmitted via the output buffer BUFFER_OUT, according to the mechanism illustrated in Figure 21.8. Let us assume that a packet requires numSB = 5 hops to get to its destination, and that the direction to be taken at each switch is expressed by DIR. Module DP_FAST builds the first flit by concatenating the flit type field with the path information. If there is some space left in the flit, it is filled with header information derived from the input datastream. The unused part of the datastream is stored in a regpark register, so that a new datastream can be read from the STATIC_PACKETING block. The following header and payload flits are formed by combining data stored in regpark and reg_datastream. No partially filled flits are transmitted, to make transmission more efficient. Finally, module BUFFER_OUT stores flits to be sent across the network, and allows the NIS to keep preparing successive flits when the network is congested. The size of this buffer is a design parameter.

FIGURE 21.8 Mechanism for building header flits.

The response phase is carried out by means of two modules. SYNCHRO receives incoming flits and reads out only the useful information (e.g., it discards route fields). At the same time, it contains buffering resources
to synchronize the network's requests to transmit the remaining packet flits with the core's consuming rate. The RECEIVE_RESPONSE module translates useful header and payload information into OCP-compliant response fields. When a read transaction is initiated by the master core, the STATIC_PACKETING block asserts a start_receive_response signal that triggers the waiting phase of the RECEIVE_RESPONSE module for the requested data. As a consequence, the NIS supports only one outstanding read operation, to keep interface complexity low. Although no read-after-read transactions can be initiated until the previous one has completed, an indefinite number of write transactions can be carried out after an outstanding read has been initiated.

The architecture of a network interface master is similar to the one just described, and is not reported here for lack of space. At instantiation time, the main network-interface-related parameters to be set are: total number of core blocks, flit width, and maximum number of hops across the network.
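A behavioral sketch of the header-building mechanism of Figure 21.8 follows. Field widths and names are illustrative assumptions (the actual Xpipes flit format is not reproduced here): one street-sign direction per hop is packed after the flit-type field, and datastream bits that do not fit are carried into the next flit, the role played in hardware by the regpark register.

```cpp
#include <cstdint>
#include <vector>

// Illustrative field widths (assumptions for this sketch).
constexpr unsigned FLIT_W  = 32; // flit width in bits
constexpr unsigned FTYPE_W = 2;  // flit-type field at the head of the flit
constexpr unsigned DIR_W   = 3;  // one street-sign direction per hop

// Build the header flit(s): pack numSB DIR fields (read from the route
// look-up table entry `lutword`, indexed by destination) after the flit-type
// field, then fill leftover space with datastream bits, spilling into
// following flits as needed.
std::vector<std::uint32_t> build_header_flits(std::uint64_t lutword,
                                              unsigned numSB,
                                              std::uint64_t datastream) {
    std::vector<std::uint32_t> flits;
    std::uint32_t flit = 0;
    unsigned used = FTYPE_W; // bits reserved for the flit-type field
    for (unsigned hop = 0; hop < numSB; ++hop) {
        const std::uint32_t dir = static_cast<std::uint32_t>(
            lutword >> (hop * DIR_W)) & ((1u << DIR_W) - 1);
        if (used + DIR_W > FLIT_W) { flits.push_back(flit); flit = 0; used = 0; }
        flit |= dir << used;
        used += DIR_W;
    }
    // Fill remaining space with datastream bits (for brevity, this sketch
    // stops once no set bits remain in the datastream word).
    while (datastream != 0) {
        if (used == FLIT_W) { flits.push_back(flit); flit = 0; used = 0; }
        flit |= static_cast<std::uint32_t>(datastream & 1u) << used;
        datastream >>= 1;
        ++used;
    }
    flits.push_back(flit);
    return flits;
}
```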
21.5 NoC Topology

The individual components of SoCs are inherently heterogeneous, with widely varying functionality and communication requirements. The communication infrastructure should optimally match communication patterns among these components, accounting for the individual component needs. As an example, consider the implementation of an MPEG4 decoder [42], depicted in Figure 21.9(b), where blocks are drawn roughly to scale and links represent interblock communication. First, the embedded memory (SDRAM) is much larger than all other cores and is a critical communication bottleneck; block sizes are highly nonuniform, and the floorplan does not match the regular, tile-based floorplan shown in Figure 21.9(a). Second, the total communication bandwidth to/from the embedded SDRAM is much larger than that required for communication among the other cores. Third, many neighboring blocks do not need to communicate. Even though it may be possible to implement MPEG4 onto a homogeneous fabric, there is a significant risk of either underutilizing many tiles and links or, at the opposite extreme, achieving poor performance because of localized congestion. These factors motivate the use of an application-specific on-chip network [26].

FIGURE 21.9 Homogeneous versus heterogeneous architectural template: (a) tile-based on-chip multiprocessor; (b) MPEG4 SoC.

With an application-specific network, the designer is faced with the additional task of designing network components (e.g., switches) with different configurations (e.g., different I/Os, virtual channels, buffers) and interconnecting them with links of uneven length. These steps require significant design time, as well as the need to verify the network components and their communications for every design. The library-based nature of network building blocks seems the more appropriate solution to support domain-specific custom NoCs. Two relevant examples have been reported in the open literature: the Proteo and Xpipes interconnects. Proteo consists of a fully reusable and scalable component library where the
components can be used to implement networks ranging from very simple bus-emulation structures to complex packet networks. It uses a standardized VCI interface between the functional cores and the communication network. Proteo is described using synthesizable VHDL and relies on an interconnect node architecture that targets flexible on-chip communication. It is used as a testing platform when the efficiency of network topologies and routing schemes is investigated for on-chip environments. The node is constructed from a collection of parameterized and reusable hardware blocks, including components such as FIFO (first-in first-out) buffers, routing controllers, and standardized interface wrappers. A node can be tuned to fulfill the desired communication characteristics by properly selecting the internal architecture of the node itself. The Xpipes NoC takes a similar approach: as described throughout this chapter, its network building blocks have been designed as highly configurable and design-time composable soft macros, described in SystemC at the cycle-accurate level.

An optimal system solution will also require an efficient mapping of high-level abstractions onto the underlying platform. This mapping procedure involves optimizations and trade-offs among many complex constraints, including quality of service, real-time response, power consumption, area, etc. Tools are urgently needed to explore this mapping process, and to assist and automate optimization where possible. The first challenge for these tools is to bridge the gap in building custom NoCs that optimally match the communication requirements of the system. The network components they build should be highly optimized for that particular NoC design, providing large savings in area, power, and latency with respect to standard NoCs based on regular structures. In the following section, an example of a design methodology for heterogeneous SoCs is briefly illustrated. It is relative to the Xpipes interconnect and relies on a tool that automatically instantiates an application-specific NoC for heterogeneous on-chip multiprocessors (called XpipesCompiler [36]).
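Such mapping flows start from a core graph annotated with communication requirements, like the one shown in Figure 21.11 in the next section. A minimal representation, with illustrative names, might be:

```cpp
#include <string>
#include <vector>

// Core graph: vertices are cores; directed edges carry the average
// bandwidth requirement between two cores (in MB/sec), as in Figure 21.11.
struct CommEdge {
    std::string src, dst;
    double mb_per_sec;
};

struct CoreGraph {
    std::vector<std::string> cores;
    std::vector<CommEdge> edges;

    // Aggregate traffic into a core: a quick way to spot communication
    // bottlenecks such as the shared SDRAM of the MPEG4 decoder.
    double inbound_bandwidth(const std::string& core) const {
        double total = 0.0;
        for (const CommEdge& e : edges)
            if (e.dst == core) total += e.mb_per_sec;
        return total;
    }
};
```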
21.5.1 Domain-Specific NoC Synthesis Flow

The complete XpipesCompiler-based NoC design flow is depicted in Figure 21.10. From the specification of an application, the designer (or a high-level analysis and exploration tool) creates a high-level view of the SoC floorplan, including nodes (with their network interfaces), links, and switches. Based on clock
speed target and link routing, the number of pipeline stages for each link is also specified. The information on the network architecture is specified in an input file for the XpipesCompiler; routing tables for the network interfaces are also specified. The tool takes as additional input the SystemC library of soft network components. The output is a SystemC hierarchical description, which includes all switches, links, network nodes, and interfaces, and specifies their topological connectivity. The final description can then be compiled and simulated at the cycle-accurate and signal-accurate level. At this point, the description can be fed to backend register transfer level (RTL) synthesis tools for silicon implementation. In a nutshell, the XpipesCompiler generates a set of network component instances that are custom-tailored to the specification contained in its input network description file. This tool allows a very instructive comparison of the effects (in terms of area, power, and performance) of mapping applications on customized domain-specific NoCs and on regular mesh NoCs.

FIGURE 21.10 NoC synthesis flow with XpipesCompiler.

Let us focus on the MPEG4 decoder already introduced in this chapter. Its core graph representation, together with its communication requirements, is reported in Figure 21.11; the edges are annotated with the average bandwidth requirements of the cores in MB/sec. Customized application-specific NoCs that closely match the application's communication characteristics have been manually developed and compared to a regular mesh topology. The different NoC configurations are reported in Figure 21.12.

FIGURE 21.11 Core graph representation of an example MPEG4 design with annotated average communication requirements.

In the MPEG4 design considered, many of the cores communicate with each other through the shared SDRAM, so a large switch is used for connecting the SDRAM with the other cores (Figure 21.12[b]) while smaller switches are used elsewhere. An alternate custom NoC is also considered (Figure 21.12[c]): it is an optimized mesh network, with superfluous switches and switch I/Os removed.

Area (in 0.1 µm technology) and power estimates for the different NoC configurations are reported in Table 21.1. Since all cores communicate with many other cores, many switches are needed, and therefore area savings are not extremely significant for custom NoCs. Based on the average traffic through each network component, the power dissipation for each NoC design has been calculated. Power savings for the custom solutions are not very significant either, as most of the traffic traverses the larger switches connected to the memories. As power dissipation in a switch increases nonlinearly with switch size, there is more power dissipation in the switches of custom NoC1 (which has an 8 × 8 switch) than in the mesh NoC. However, most of the traffic traverses short links in this custom NoC, thereby giving marginal power savings for the whole design.
FIGURE 21.12 NoC configurations for MPEG4 decoder: (a) mesh NoC, (b) application-specific NoC1, and (c) application-specific NoC2.

TABLE 21.1 Area and Power Estimates for the MPEG4-Related NoC Configurations

NoC configuration   Area (mm²)   Ratio mesh/cust   Power (mW)   Ratio mesh/cust
Mesh                1.31         -                 114.36       -
Custom 1            0.86         1.52              110.66       1.03
Custom 2            0.71         1.85              93.66        1.22
Figure 21.13 reports the variation of average packet latency (for 64-byte packets and 32-bit flits) with link bandwidth. Custom NoCs, as synthesized by the XpipesCompiler, have lower packet latencies, as the average number of switch and link traversals is lower. At the minimum plotted bandwidth value, almost 10% latency saving is achieved. Moreover, latency increases more rapidly with the mesh NoC as the link bandwidth decreases. Custom NoCs also have better link utilization: around 1.5 times the link utilization of a mesh topology. Area, power, and performance optimizations by means of custom NoCs turn out to be more difficult for MPEG4 than for other applications, such as the Video Object Plane Decoder and MultiWindow Displayer [36].

FIGURE 21.13 Average packet latency as a function of the link bandwidth.
21.6 Conclusions

This chapter has described the motivation for packet-switched networks as the communication paradigm for deep submicron SoCs. After an overview of NoC proposals from the open literature, the chapter has gone into the details of the NoC architectural components (switch, network interface, and point-to-point links), illustrating the Xpipes library of composable soft macros as a case study. Finally, the challenging issue of heterogeneous NoC design has been addressed, showing an example NoC synthesis flow and detailing area, power, and performance metrics of customized application-specific NoC architectures with respect to regular mesh topologies. The chapter aims at highlighting the main guidelines and open issues for NoC design on gigascale SoCs.
Acknowledgment

This work was supported in part by MARCO/DARPA Gigascale Silicon Research Center.
References

[1] F. Boekhorst. Ambient Intelligence, the Next Paradigm for Consumer Electronics: How will it Affect Silicon? In ISSCC 2002, Vol. 1, February 2002, pp. 28–31.
[2] L. Benini and G. De Micheli. Networks on Chips: a New SoC Paradigm. IEEE Computer, 35, 2002, 70–78.
[3] P. Wielage and K. Goossens. Networks on Silicon: Blessing or Nightmare? In Proceedings of the Euromicro Symposium on Digital System Design DSD02, September 2002, pp. 196–200.
[4] W.J. Dally and B. Towles. Route Packets, not Wires: On-Chip Interconnection Networks. In Proceedings of the Design Automation Conference DAC01, June 2001, pp. 684–689.
[5] S. Kumar, A. Jantsch, J.P. Soininen, M. Forsell, M. Millberg, J. Oeberg, K. Tiensyrja, and A. Hemani. A Network on Chip Architecture and Design Methodology. In IEEE Symposium on VLSI ISVLSI02, April 2002, pp. 105–112.
[6] L. Benini, D. Bertozzi, D. Bruni, N. Drago, F. Fummi, and M. Poncino. SystemC Cosimulation and Emulation of Multiprocessor SoC Designs. IEEE Computer, 36, 2003, 53–59.
[7] S. Nugent, D.S. Wills, and J.D. Meindl. A Hierarchical Block-Based Modeling Methodology for SOC in GENESYS. In Proceedings of the IEEE ASIC/SOC Conference, September 2002, pp. 239–243.
[8] P. Gerin, S. Yoo, G. Nicolescu, and A.A. Jerraya. Scalable and Flexible Cosimulation of SoC Designs with Heterogeneous Multi-Processor Target Architecture. In Proceedings of the ASP-DAC 2001, January/February 2001, pp. 63–68.
[9] H. Blume, H. Huebert, H.T. Feldkaemper, and T.G. Noll. Model-Based Exploration of the Design Space for Heterogeneous Systems on Chip. In Proceedings of the IEEE Conference on Application-Specific Systems, Architectures and Processors ASAP02, 2002.
[10] P.G. Paulin, C. Pilkington, and E. Bensoudane. StepNP: a System-Level Exploration Platform for Network Processors. IEEE Design and Test of Computers, November–December 2002, pp. 17–26.
[11] P. Guerrier and A. Greiner. A Generic Architecture for On-Chip Packet Switched Interconnections. In Proceedings of the Design, Automation and Test in Europe DATE00, March 2000, pp. 250–256.
[12] S.J. Lee et al. An 800 MHz Star-Connected On-Chip Network for Application to Systems on a Chip. In ISSCC03, February 2003.
[13] H. Yamauchi et al. A 0.8 W HDTV Video Processor with Simultaneous Decoding of Two MPEG2 MP@HL Streams and Capable of 30 Frames/s Reverse Playback. In ISSCC02, Vol. 1, February 2002, pp. 473–474.
[14] M. Dall'Osso, G. Biccari, L. Giovannini, D. Bertozzi, and L. Benini. Xpipes: a Latency Insensitive Parameterized Network-on-Chip Architecture for Multi-Processor SoCs. In ICCD03, October 2003.
[15] ITRS. 2001, http://public.itrs.net/Files/2001ITRS/Home.htm.
[16] D. Bertozzi, L. Benini, and G. De Micheli. Energy-Reliability Trade-Off for NoCs. In Networks on Chip, A. Jantsch and H. Tenhunen, Eds., Kluwer Academic Press, Boston, MA, 2003, pp. 107–129.
[17] H. Zhang, V. George, and J.M. Rabaey. Low-Swing On-Chip Signaling Techniques: Effectiveness and Robustness. IEEE Transactions on VLSI Systems, 8, 2000, 264–272.
[18] J. Xu and W. Wolf. Wave Pipelining for Application-Specific Networks-on-Chips. In CASES02, October 2002, pp. 198–201.
[19] K. Goossens, J. Dielissen, J. van Meerbergen, P. Poplavko, A. Radulescu, E. Rijpkema, E. Waterlander, and P. Wielage. Guaranteeing the Quality of Services in Networks on Chip. In Networks on Chip, A. Jantsch and H. Tenhunen, Eds., Kluwer Academic Press, Boston, MA, 2003, pp. 61–82.
[20] ITRS, 1999, http://public.itrs.net/files/1999_SIA_Roadmap/.
[21] A. Jantsch and H. Tenhunen. Will Networks on Chip Close the Productivity Gap? In Networks on Chip, A. Jantsch and H. Tenhunen, Eds., Kluwer Academic Press, Boston, MA, 2003, pp. 3–18.
[22] VSI Alliance. Virtual Component Interface Standard, 2000.
[23] OCP International Partnership. Open Core Protocol Specification, 2001.
[24] D. Wingard. MicroNetwork-Based Integration for SoCs. In Design Automation Conference DAC01, June 2001, pp. 673–677.
[25] D. Flynn. AMBA: Enabling Reusable On-Chip Designs. IEEE Micro, 17, 1997, 20–27.
[26] H. Zhang et al. A 1V Heterogeneous Reconfigurable DSP IC for Wireless Baseband Digital Signal Processing. IEEE Journal of Solid-State Circuits, 35, 2000, 1697–1704.
[27] W.J. Dally and S. Lacy. VLSI Architecture: Past, Present and Future. In Conference on Advanced Research in VLSI, 1999, pp. 232–241.
[28] D. Culler, J.P. Singh, and A. Gupta. Parallel Computer Architecture: A Hardware/Software Approach. Morgan Kaufmann, San Francisco, CA, 1999.
[29] K. Compton and S. Hauck. Reconfigurable Computing: A Survey of Systems and Software. ACM Computing Surveys, 34, 2002, 171–210.
[30] R. Tessier and W. Burleson. Reconfigurable Computing and Digital Signal Processing: a Survey. Journal of VLSI Signal Processing, 28, 2001, 7–27.
[31] J. Walrand and P. Varaiya. High-Performance Communication Networks. Morgan Kaufmann, San Francisco, CA, 2000.
[32] D. Liu et al. SoCBUS: The Solution of High Communication Bandwidth on Chip and Short TTM. Invited paper in Real Time and Embedded Computing Conference, September 2002.
[33] S. Ishiwata et al. A Single Chip MPEG-2 Codec Based on Customizable Media Embedded Processor. IEEE Journal of Solid-State Circuits, 38, 2003, 530–540.
[34] E. Rijpkema, K. Goossens, A. Radulescu, J. van Meerbergen, P. Wielage, and E. Waterlander. Trade Offs in the Design of a Router with both Guaranteed and Best-Effort Services for Networks on Chip. In Design Automation and Test in Europe DATE03, March 2003, pp. 350–355.
[35] I. Saastamoinen, D. Siguenza-Tortosa, and J. Nurmi. Interconnect IP Node for Future Systems-on-Chip Designs. In IEEE Workshop on Electronic Design, Test and Applications, January 2002, pp. 116–120.
[36] A. Jalabert, S. Murali, L. Benini, and G. De Micheli. XpipesCompiler: a Tool for Instantiating Application Specific Networks-on-Chip. In DATE 2004, pp. 884–889.
[37] V. Agarwal, M.S. Hrishikesh, S.W. Keckler, and D. Burger. Clock Rate Versus IPC: The End of the Road for Conventional Microarchitectures. In Proceedings of the 27th Annual International Symposium on Computer Architecture, June 2000, pp. 248–250.
[38] L.P. Carloni, K.L. McMillan, and A.L. Sangiovanni-Vincentelli. Theory of Latency-Insensitive Design. IEEE Transactions on CAD of ICs and Systems, 20, 2001, 1059–1076.
[39] L. Scheffer. Methodologies and Tools for Pipelined On-Chip Interconnects. In International Conference on Computer Design, 2002, pp. 152–157.
[40] P. Glaskowsky. Pentium 4 (Partially) Previewed. Microprocessor Report, 14, 2000, 10–13.
[41] J. Duato, S. Yalamanchili, and L. Ni. Interconnection Networks: An Engineering Approach. IEEE Computer Society Press, Washington, 1997.
[42] E.B. Van der Tol and E.G.T. Jaspers. Mapping of MPEG4 Decoding on a Flexible Architecture Platform. In SPIE 2002, January 2002, pp. 1–13.
22 Platform-Based Design for Embedded Systems

Luca P. Carloni, Fernando De Bernardinis, Claudio Pinello, Alberto L. Sangiovanni-Vincentelli, and Marco Sgroi
University of California at Berkeley

22.1 Introduction
22.2 Platform-Based Design
22.3 Platforms at the Articulation Points of the Design Process
    (Micro-)Architecture Platforms • API Platform • System Platform Stack
22.4 Network Platforms
    Definitions • Quality of Service • Design of Network Platforms
22.5 Fault-Tolerant Platforms
    Types of Faults and Platform Redundancy • Fault-Tolerant Design Methodology • The API Platform (FTDF Primitives) • Fault-Tolerant Deployment • Replica Determinism
22.6 Analog Platforms
    Definitions • Building Performance Models • Mixed-Signal Design Flow with Platforms • Reconfigurable Platforms
22.7 Concluding Remarks
Acknowledgments
References
22.1 Introduction

Platform-Based Design (PBD) [1,2] has emerged as an important design style as the electronics industry faced serious difficulties owing to three major factors:

1. The disaggregation (or "horizontalization") of the electronics industry began about a decade ago and has affected the structure of the industry, favoring the move from a vertically oriented business model to a horizontally oriented one. In the past, electronic system companies used to maintain full control of the product development cycle, from product definition to final manufacturing. Today, the identification of a new market opportunity, the definition of the detailed system specifications, the development and assembly of the components, and the manufacturing of the final product are tasks performed more and more frequently by distinct organizations. In fact, the complexity of electronic designs and the number of technologies that must be mastered to bring winning products to market have forced electronic companies to focus on their core competence. In this
scenario, the integration of the design chain becomes a serious problem at the hand-off points from one company to another.

2. The pressure for reducing the time-to-market of electronic products, in the presence of exponentially increasing complexity, has forced designers to adopt methods that favor component reuse at all levels of abstraction. Furthermore, each organization that contributes a component to the final product naturally strives for the flexibility in its design approach that allows it to make continuous adjustments and accommodate last-minute engineering changes.

3. The dramatic increase in NonRecurring Engineering (NRE) costs owing to mask making at the Integrated Circuit (IC) implementation level (a set of masks for the 90 nm technology node costs more than two million US dollars), development of production plants (a new fab costs more than two billion US dollars), and design cost (a new-generation microprocessor design requires more than 500 designers, with all the associated costs in tools and infrastructure!) has created, on the one hand, the necessity of correct-the-first-time designs and, on the other hand, the push for consolidation of efforts in manufacturing.¹

The combination of these factors has caused several system companies to substantially reduce their ASIC (Application Specific Integrated Circuits) design efforts. Traditional paradigms in electronic system and IC design have to be revisited and readjusted or altogether abandoned. Along the same line of reasoning, IC manufacturers are moving toward the development of parts that have guaranteed high-volume production from a single mask set (or that are likely to have high-volume production, if successful), thus moving differentiation and optimization to reconfigurability and programmability.

Platform-Based Design has emerged over the years as a way of coping with the problems listed earlier. The term "platform" has been used in several domains: from service providers to system companies, from tier-one suppliers to IC companies. In particular, IC companies have lately been very active in espousing platforms. The TI OMAP platform for cellular phones, the Philips Viper and Nexperia platforms for consumer electronics, and the Intel Centrino platform for laptops are a few examples. Recently, Intel has been characterized by its CEO Otellini as a "platform company." As is often the case for fairly radical new approaches, the methodology emerged as a sequence of empirical rules and concepts, but we have reached a point where a rigorous design process is needed, together with supporting EDA environments and tools. PBD:

• Sets the foundation for developing economically feasible design flows because it is a structured methodology that theoretically limits the space of exploration, yet still achieves superior results in the fixed time constraints of the design.
• Provides a formal mechanism for identifying the most critical hand-off points in the design chain. The hand-off point between system companies and IC design companies, and the one between IC design companies (or divisions) and IC manufacturing companies (or divisions), represent the articulation points of the overall design process.
• Eliminates expensive design iterations because it fosters design reuse at all abstraction levels, thus enabling the design of an electronic product by assembling and configuring platform components in a rapid and reliable fashion.
• Provides an intellectual framework for the complete electronic design process.

This chapter presents the foundations of this discipline and outlines a variety of domains where the PBD principles can be applied. In particular, Section 22.2 defines the main principles of PBD. Our goal is to provide a precise reference that may be used as the basis for reaching a common understanding in the electronic system and circuit design community. Then, we present the platforms that define the articulation points between system definition and implementation (Section 22.3). In the following sections
This chapter presents the foundations of this discipline and outlines a variety of domains where the PBD principles can be applied. In particular, Section 22.2 defines the main principles of PBD. Our goal is to provide a precise reference that may be used as the basis for reaching a common understanding in the electronic system and circuit design community. Then, we present the platforms that define the articulation points between system definition and implementation (Section 22.3). In the following sections 1 The cost of fabs has changed the landscape of IC manufacturing in a substantial way forcing companies to team up
for developing new technology nodes (see, e.g., the recent agreement among Motorola, Philips, and ST Microelectronics and the creation of Renesas in Japan).
we show that the PBD paradigm can be applied to all levels of design: from very high levels of abstraction such as communication networks (Section 22.4) and fault-tolerant platforms for the design of safety-critical feedback-control systems (Section 22.5) to low levels such as analog parts (Section 22.6), where performance is the main focus.
22.2 Platform-Based Design

The basic tenets of PBD are:

• The identification of design as a meeting-in-the-middle process, where successive refinements of specifications meet with abstractions of potential implementations.
• The identification of precisely defined layers where the refinement and abstraction processes take place. Each layer supports a design stage that provides an opaque abstraction of the lower layers while still allowing accurate performance estimations. This information is incorporated in appropriate parameters that annotate design choices at the present layer of abstraction.

These layers of abstraction are called platforms to stress their role in the design process and their solidity. A platform is a library of components that can be assembled to generate a design at that level of abstraction. This library contains not only computational blocks that carry out the appropriate computation but also communication components that are used to interconnect the functional components. Each element of the library has a characterization in terms of performance parameters together with the functionality it can support. For every platform level, there is a set of methods used to map the upper layers of abstraction into the platform and a set of methods used to estimate the performance of lower-level abstractions. As illustrated in Figure 22.1, the meeting-in-the-middle process is the combination of two efforts:

• Top-down: map an instance of the top platform into an instance of the lower platform and propagate constraints.
• Bottom-up: build a platform by defining the library that characterizes it and a performance abstraction (e.g., number of literals for technology-independent optimization, area and propagation delay for a cell in a standard cell library).

A platform instance is a set of architecture components that are selected from the library and whose parameters are set. Often the combination of two consecutive layers and their "filling" can be interpreted as a unique abstraction layer with an "upper" view, the top abstraction layer, and a "lower" view, the bottom layer. A platform stack is a pair of platforms, along with the tools and methods that are used to map the upper layer of abstraction onto the lower layer. Note that we can allow a platform stack to include several sub-stacks if we wish to span a large number of abstractions.
FIGURE 22.1 Interactions between abstraction layers.
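To make these definitions concrete, the following sketch renders a platform as a performance-annotated component library, a platform instance as a parameterized selection from it, and a platform stack as the pairing of a top-down mapping step with a bottom-up performance-estimation step. All names and types are illustrative assumptions, not an actual PBD tool data model.

```cpp
#include <map>
#include <string>
#include <vector>

// A platform is a library of computation and communication components,
// each annotated with the performance parameters visible at this layer
// (e.g., area, propagation delay, power).
struct Component {
    std::string name;
    std::map<std::string, double> perf;
};

struct Platform {
    std::vector<Component> library;
};

// A platform instance: components selected from the library, with their
// parameters set.
struct PlatformInstance {
    std::vector<Component> selected;
    std::map<std::string, double> parameters;
};

// A platform stack: map the upper layer onto the lower one (top-down,
// propagating constraints) and abstract the lower layer's performance
// upward (bottom-up estimation).
struct PlatformStack {
    Platform upper, lower;
    PlatformInstance (*map_down)(const Platform& upper, const Platform& lower);
    std::map<std::string, double> (*estimate_up)(const PlatformInstance&);
};
```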
Platforms should be defined to eliminate large loop iterations for affordable designs: they should restrict the design space via new forms of regularity and structure that surrender some design potential for lower cost and first-pass success. The library of function and communication components is the design space that we can explore at the appropriate level of abstraction.

Establishing the number, location, and components of intermediate platforms is the essence of PBD. In fact, designs with different requirements and specifications may use different intermediate platforms, hence different layers of regularity and design-space constraints. A critical step of the PBD process is the definition of intermediate platforms to support predictability, which enables the abstraction of implementation detail to facilitate higher-level optimization, and verifiability, that is, the ability to formally ensure correctness.

The trade-offs involved in the selection of the number and characteristics of platforms relate to the size of the design space to be explored and the accuracy of the estimation of the characteristics of the solution adopted. Naturally, the larger the step across platforms, the more difficult it is to predict performance, to optimize at the higher levels of abstraction, and to provide a tight lower bound. In fact, the design space for this approach may actually be smaller than the one obtained with smaller steps, because it becomes harder to explore meaningful design alternatives and the restriction on the search impedes complete design-space exploration. Ultimately, predictions/abstractions may be so inaccurate that design optimizations are misguided and the lower bounds are incorrect.

It is important to emphasize that the PBD paradigm applies to all levels of design. While it is rather easy to grasp the notion of a programmable hardware platform, the concept is completely general and should be exploited through the entire design flow to solve the design problem. In the following sections, we will show that platforms can be applied to low levels of abstraction such as analog components, where flexibility is minimal and performance is the main focus, as well as to very high levels of abstraction such as networks, where platforms have to provide connectivity and services. In the former case platforms abstract hardware to provide (physical) implementation, while in the latter communication services abstract software layers (protocols) to provide global connectivity.
22.3 Platforms at the Articulation Points of the Design Process

As we mentioned in Section 22.2, the key to the application of the design principle is the careful definition of the platform layers. Platforms can be defined at several points of the design process. Some levels of abstraction are more important than others in the overall design trade-off space. In particular, the articulation point between system definition and implementation is a critical one for design quality and time. Indeed, the very notion of PBD originated at this point (see [1,3–5]). In References 1, 2, and 5, we have discovered that at this level there are indeed two distinct platforms forming a system platform stack. These need to be defined together with the methods and the tools necessary to link them: a (micro-)architecture platform and an Application Programming Interface (API) platform. The API platform allows system designers to use the services that a (micro-)architecture offers. In the world of Personal Computers (PCs), this concept is well known and is the key to the development of application software on different hardware that share some commonalities allowing the definition of a unique API.
22.3.1 (Micro-)Architecture Platforms

Integrated circuits used for embedded systems will most likely be developed as an instance of a particular (micro-)architecture platform. That is, rather than being assembled from a collection of independently developed blocks of silicon functionalities, they will be derived from a specific family of micro-architectures, possibly oriented toward a particular class of problems, that can be extended or reduced by the system developer. The elements of this family are a sort of “hardware denominator” that could be shared across multiple applications. Hence, an architecture platform is a family of micro-architectures that share some commonality, the library of components that are used to define the micro-architecture. Every element
of the family can be obtained quickly through the personalization of an appropriate set of parameters controlling the micro-architecture. Often, the family may have additional constraints on the components of the library that can or should be used. For example, a particular micro-architecture platform may be characterized by the same programmable processor and the same interconnection scheme, while the peripherals and the memories of a specific implementation may be selected from the predesigned library of components depending on the given application. Depending on the implementation platform that is chosen, each element of the family may still need to go through the standard manufacturing process, including mask making. This approach then combines the need of saving design time with the optimization of the element of the family for the application at hand. Although it does not solve the mask cost issue directly, it should be noted that the mask cost problem is primarily owing to the generation of multiple mask sets for multiple design spins, which is addressed by the architecture platform methodology. The less constrained the platform, the more freedom a designer has in selecting an instance and the more potential there is for optimization, if time permits. However, more constraints mean stronger standards and easier addition of components to the library that defines the architecture platform (as with PC platforms). Note that the basic concept is similar to the cell-based design layout style, where regularity and the reuse of library elements allow faster design time at the expense of some optimality. The trade-off between design time and design “quality” needs to be kept in mind. The economics of the design problem must dictate the choice of the design style. The higher the granularity of the library, the more leverage we have in shortening the design time. Given that the elements of the library are reused, there is a strong incentive to optimize them. In fact, we argue that the “macro-cells” should be designed with great care and attention given to area and performance. It also makes sense to offer variations of cells with the same functionality but with implementations that differ in performance, area, and power dissipation. Architecture platforms are, in general, characterized by (but not limited to) the presence of programmable components. Then, each of the platform instances that can be derived from the architecture platform maintains enough flexibility to support an application space that guarantees the production volumes required for economically viable manufacturing. The library that defines the architecture platform may also contain reconfigurable components, which come in two flavors. With runtime reconfigurability, FPGA (Field Programmable Gate Array) blocks can be customized by the user without the need of changing the mask set, thus saving both design cost and fabrication cost. With design-time reconfigurability, where the silicon is still application specific, only design time is reduced. An architecture platform instance is derived from an architecture platform by choosing a set of components from its library and by setting the parameters of the reconfigurable components of the library. The flexibility, or the capability of supporting different applications, of a platform instance is guaranteed by its programmable components. Programmability will ultimately be of various forms.
One is software programmability, to indicate the presence of a microprocessor, Digital Signal Processor (DSP), or any other software-programmable component. Another is hardware programmability, to indicate the presence of reconfigurable logic blocks such as FPGAs, whereby the logic function can be changed by software tools without requiring a custom set of masks. Some of the new architecture and/or implementation platforms being offered in the market mix the two types into a single chip. For example, Triscend, Altera, and Xilinx offer FPGA fabrics with embedded hard processors. Software programmability yields a more flexible solution, since modifying software is, in general, faster and cheaper than modifying FPGA personalities. On the other hand, logic functions mapped on FPGAs execute orders of magnitude faster and with much less power than the corresponding implementation as a software program. Thus, the trade-off here is between flexibility and performance.
22.3.2 API Platform

The concept of architecture platform by itself is not enough to achieve the level of application software reuse we require. The architecture platform has to be abstracted at a level where the application software “sees” a high-level interface with the hardware, which we call the API or Programmer Model. A software
layer is used to perform this abstraction. This layer wraps the essential parts of the architecture platform:
• The programmable cores and the memory subsystem via a Real-Time Operating System (RTOS).
• The I/O subsystem via the device drivers.
• The network connection via the network communication subsystem.

In our framework, the API is a unique abstract representation of the architecture platform via the software layer. Therefore, the application software can be reused for every platform instance. Indeed, the API is a platform itself that we can call the API platform. Of course, the higher the abstraction level at which a platform is defined, the more instances it contains. For example, to share the source code, we need to have the same operating system but not necessarily the same instruction set, while to share the binary code, we need to add the architectural constraints that force us to use the same ISA (Instruction Set Architecture), thus greatly restricting the range of architectural choices. The RTOS is responsible for the scheduling of the available computing resources and of the communication between them and the memory subsystem. Note that in several embedded system applications the available computing resources consist of a single microprocessor. In others, such as wireless handsets, the combination of a Reduced Instruction Set Computer (RISC) microprocessor or controller and a DSP has been used widely in 2G, and now for 2.5G and 3G, and beyond. In set-top boxes, a RISC for control and a media processor have also been used. In general, we can imagine a multiple-core architecture platform where the RTOS schedules software processes across the different computing engines.
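The decoupling effect of the API can be sketched as follows (illustrative Python; the interface, class names, and methods are our own assumptions, not a real RTOS or vendor API): the application is written only against the abstract interface, and any architecture platform instance implementing it can run the application unchanged.

```python
from abc import ABC, abstractmethod

class ApiPlatform(ABC):
    """Abstract services the software layer exports to application code."""
    @abstractmethod
    def read_sensor(self, channel: int) -> float: ...
    @abstractmethod
    def send(self, port: str, payload: bytes) -> None: ...

class SingleCoreInstance(ApiPlatform):
    def read_sensor(self, channel: int) -> float:
        return 0.0        # would call a device driver on the lone processor
    def send(self, port: str, payload: bytes) -> None:
        pass              # would call the network communication subsystem

class RiscPlusDspInstance(ApiPlatform):
    def read_sensor(self, channel: int) -> float:
        return 0.0        # the RTOS may schedule this on the DSP side
    def send(self, port: str, payload: bytes) -> None:
        pass

def control_step(p: ApiPlatform) -> None:
    # Application code sees only the API platform, so it is reusable
    # across every architecture platform instance beneath it.
    x = p.read_sensor(0)
    p.send("actuator0", str(x).encode())
```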
22.3.3 System Platform Stack

The basic idea of the system platform stack is captured in Figure 22.2. The vertex of the two cones represents the combination of the API and the architecture platform. A system designer maps the application onto the abstract representation that “includes” a family of architectures that can be chosen to optimize cost, efficiency, energy consumption, and flexibility. The mapping of the application onto the actual architecture in the family specified by the API can be carried out, at least in part, automatically if a set of appropriate software tools (e.g., software synthesis, RTOS synthesis, device-driver synthesis) is available. It is clear that the synthesis tools have to be aware of the architecture features as well as of the API. This set of tools makes use of the software layer to go from the API platform to the architecture platform. Note that the system platform effectively decouples the application development process (the upper triangle) from the architecture implementation process (the lower triangle). Note also that, once we use the abstract definition of “API” as described earlier, we may obtain extreme cases such as traditional PC platforms on
[Figure: two cones meeting at the system platform — the application space above, with an application instance mapped onto the platform (platform mapping), and the architectural space below, with a platform instance exported upward (platform design-space export).]
FIGURE 22.2 System platform stack.
one side and full hardware implementation on the other. Of course, the programmer model for a full custom hardware solution is trivial since there is a one-to-one map between the functions to be implemented and the physical blocks that implement them. In the latter case, PBD amounts to adding some higher levels of abstraction to traditional design methodologies.
22.4 Network Platforms

In distributed systems the design of the protocols and channels that support the communication among the system components is a difficult task owing to the tight constraints on performance and cost. To make the communication design problem more manageable, designers usually decompose the communication function into distinct protocol layers and design each layer separately. According to this approach, of which the Open Systems Interconnection (OSI) Reference Model is a particular instance, each protocol layer together with the lower layers defines a platform that provides communication services (CSs) to the upper layers and to the application-level components. Identifying the most effective layered architecture for a given application requires one to solve a trade-off between performance, which increases by minimizing the number of layers, and design manageability, which improves with the number of intermediate steps. Present embedded system applications, owing to their tight constraints, increasingly demand the codesign of protocol functions that in less-constrained applications are assigned to different layers and considered separately (e.g., cross-layer design of MAC and routing protocols in sensor networks). The definition of an optimal layered architecture, the design of the correct functionality for each protocol layer, and the design-space exploration for the choice of the physical implementation must be supported by tools and methodologies that allow designers to evaluate the performance and guarantee the satisfaction of the constraints after each step. For these reasons, we believe that the PBD principles and methodology provide the right framework to design communication networks. In this section, first, we formalize the concept of Network Platform (NP). Then, we outline a methodology for selecting, composing, and refining NPs [6].
22.4.1 Definitions

A Network Platform is a library of resources that can be selected and composed together to form a Network Platform Instance (NPI) and support the interaction among a group of interacting components. The structure of a NPI is defined by abstracting computation resources as nodes and communication resources as links. Ports interface nodes with links or with the environment of the NPI. The structure of a node or a link is defined by its input and output ports; the structure of a NPI is defined by a set of nodes and the links connecting them. The behaviors and the performance of a NPI are defined in terms of the type and the quality of the CSs it offers. We formalize the behaviors of a NPI using the Tagged Signal Model [7]. NPI components are modeled as processes, and events model the instances of the send and receive actions of the processes. An event is associated with a message that has a type and a value, and with tags that specify attributes of the corresponding action instance (e.g., when it occurs in time). The set of behaviors of a NPI is defined by the intersection of the behaviors of the component processes. A NPI is defined as a tuple NPI = (L, N, P, S), where:
• L = {L1, L2, . . . , LNl} is a set of directed links.
• N = {N1, N2, . . . , NNn} is a set of nodes.
• P = {P1, P2, . . . , PNp} is a set of ports. A port Pi is a triple (Ni, Li, d), where Ni ∈ N is a node, Li ∈ L ∪ Env is a link or the NPI environment, and d = in if it is an input port, d = out if it is an output port. The ports that interface the NPI with the environment define the sets P^in = {(Ni, Env, in)} ⊆ P and P^out = {(Ni, Env, out)} ⊆ P.
• S = ∩_{i=1,...,Nn+Nl} Ri is the set of behaviors, where Ri indicates the set of behaviors of a resource that can be a link in L or a node in N.
The basic services provided by a NPI are called Communication Services (CSs). A CS consists of a sequence of message exchanges through the NPI from its input to its output ports. A CS can be accessed by NPI users through the invocation of send and receive primitives whose instances are modeled as events. A NPI API consists of the set of methods that are invoked by the NPI users to access the CS. For the definition of a NPI API it is essential to specify not only the service primitives but also the type of CS they provide access to (e.g., reliable send, out-of-order delivery, etc.). Formally, a CS is a tuple (P̄^in, P̄^out, M, E, h, g), where P̄^in ⊆ P^in and P̄^out ⊆ P^out are the ports through which the service is accessed, M is the set of messages being transferred, E is the set of events modeling the send and receive action instances, h maps each event to the port at which it is observed, and g maps each event to the message it carries.
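One possible encoding of the NPI structure defined above is sketched below (illustrative Python; the class names and the choice to represent Env as None are our own assumptions). It checks only the structural rules, not behaviors.

```python
from dataclasses import dataclass
from typing import Literal, Optional

ENV = None  # stands for the NPI environment (Env)

@dataclass(frozen=True)
class Node:
    name: str

@dataclass(frozen=True)
class Link:
    name: str

@dataclass(frozen=True)
class Port:
    node: Node
    link: Optional[Link]             # None means the port faces Env
    direction: Literal["in", "out"]

@dataclass
class NPI:
    links: list[Link]
    nodes: list[Node]
    ports: list[Port]

    def boundary(self, d: str) -> list[Port]:
        # P^in / P^out: ports that interface the NPI with its environment
        return [p for p in self.ports if p.link is ENV and p.direction == d]

n0, n1 = Node("n0"), Node("n1")
l0 = Link("l0")
npi = NPI([l0], [n0, n1],
          [Port(n0, ENV, "in"), Port(n0, l0, "out"),
           Port(n1, l0, "in"), Port(n1, ENV, "out")])
assert len(npi.boundary("in")) == 1 and len(npi.boundary("out")) == 1
```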
22.4.2 Quality of Service

NPIs can be classified according to the number, the type, the quality, and the cost of the CSs they offer. Rather than in terms of event sequences, a CS is more conveniently described using Quality of Service (QoS) parameters such as error rate, latency, throughput, and jitter, and cost parameters such as the consumed power and the manufacturing cost of the NPI components. QoS parameters can be simply defined by using the annotation functions that associate individual events with quantities, such as the time when an event occurs and the power consumed by an action. Hence, one can compare the values of pairs of input and output events associated with the same message to quantify the error rate, or compare the timestamps of events observed at the same port to compute the jitter. The most relevant QoS parameters are defined using a notation where e_{i,j} indicates an event carrying the ith message and observed at the jth port of P̄^in ∪ P̄^out, and v(e) and t(e) represent, respectively, the value of the message carried by event e and the timestamp of the action modeled by event e.

Delay. The communication delay of a message is given by the difference between the timestamps of the input and output events carrying that message. Assuming that the ith message is transferred from input port j1 to output port j2, the delay Δ_i of the ith message, the average delay Δ_Av, and the peak delay Δ_Peak are defined, respectively, as Δ_i = t(e_{i,j2}) − t(e_{i,j1}), Δ_Av = Σ_{i=1}^{|M|} (t(e_{i,j2}) − t(e_{i,j1}))/|M|, and Δ_Peak = max_i {t(e_{i,j2}) − t(e_{i,j1})}.

Throughput. The throughput is given by the number of output events in an interval (t0, t1), that is, the cardinality of the set {e_i ∈ E | h(e_i) ∈ P̄^out, t(e_i) ∈ (t0, t1)}.

Error rate. The Message Error Rate (MER) is given by the ratio between the number of lost or corrupted output events and the total number of input events. Given Lost_M = {e_i ∈ E | h(e_i) ∈ P̄^in, ¬∃ e_j ∈ E s.t. h(e_j) ∈ P̄^out, g(e_j) = g(e_i)}, Corr_M = {e_i ∈ E | h(e_i) ∈ P̄^in, ∃ e_j ∈ E s.t. h(e_j) ∈ P̄^out, g(e_j) = g(e_i), v(e_j) ≠ v(e_i)}, and In_M = {e_i ∈ E | h(e_i) ∈ P̄^in}, the MER = (|Lost_M| + |Corr_M|)/|In_M|. Using information on the message encoding, the MER can be converted to packet and bit error rates.

The number of CSs that a NPI can offer is large, so the concept of Class of Communication Services (CCS) is introduced to simplify the description of a NPI. A CCS defines a new abstraction (and therefore a platform) that groups together CSs of similar type and quality. For example, a CCS may include all the CSs that transfer a periodic stream of messages with no errors; another CCS may include all the CSs that transfer a stream of input messages arriving at a bursty rate with a 1% error rate. CCSs can be identified based on the type of messages (e.g., packets, audio samples, video pixels, etc.), the input arrival pattern (e.g., periodic, bursty, etc.), and the range of QoS parameters. For each NPI supporting multiple CSs, there are several ways to group them into CCSs. It is the task of the NPI designer to identify the CCSs and provide the proper abstractions to facilitate the use of the NPI.
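As a concrete reading of these definitions, the sketch below computes the three QoS figures from an event log; the event encoding (tuples of message id, port direction, value, timestamp) is our own illustrative choice, not part of the formalism.

```python
# Each event: (msg_id, port, value, timestamp); ports tagged "in"/"out".
def delays(events):
    t_in  = {m: t for (m, p, v, t) in events if p == "in"}
    t_out = {m: t for (m, p, v, t) in events if p == "out"}
    d = [t_out[m] - t_in[m] for m in t_out if m in t_in]    # per-message Δ_i
    return sum(d) / len(d), max(d)                          # Δ_Av, Δ_Peak

def throughput(events, t0, t1):
    return sum(1 for (m, p, v, t) in events if p == "out" and t0 < t < t1)

def mer(events):
    ins  = {m: v for (m, p, v, t) in events if p == "in"}
    outs = {m: v for (m, p, v, t) in events if p == "out"}
    lost = sum(1 for m in ins if m not in outs)                    # Lost_M
    corr = sum(1 for m in ins if m in outs and outs[m] != ins[m])  # Corr_M
    return (lost + corr) / len(ins)

log = [(1, "in", 7, 0.0), (1, "out", 7, 1.5),
       (2, "in", 9, 2.0), (2, "out", 8, 3.0),   # corrupted message
       (3, "in", 4, 4.0)]                        # lost message
print(delays(log), throughput(log, 0, 5), mer(log))  # (1.25, 1.5), 2, 0.666...
```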
22.4.3 Design of Network Platforms

The design methodology for NPs derives a NPI implementation by successive refinement from the specification of the behaviors of the interacting components and the declaration of the constraints that a NPI implementation must satisfy. The most abstract NPI is defined by a set of end-to-end direct logical links connecting pairs of interacting components. Communication refinement of a NPI defines at each step a more detailed NPI by replacing one or multiple links in the original NPI with a set of components or NPIs. During this process another NPI can be used as a resource to build other NPIs. A correct refinement procedure generates a NPI that provides CSs equivalent to those offered by the original NPI with respect to the constraints defined at the upper level. A typical communication refinement step requires defining both the structure of the refined NPI, that is, its components and topology, and the behavior of these components, that is, the protocols deployed at each node. One or more NP components (or predefined NPIs) are selected from a library and composed to create CSs of better quality. Two types of composition are possible. One consists of choosing a NPI and extending it with a protocol layer to create CSs at a higher level of abstraction (vertical composition). The other is based on the concatenation of NPIs using an intermediate component called an adapter (or gateway) that maps sequences of events between the ports being connected (horizontal composition).
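A minimal sketch of vertical composition, under our own simplified assumptions: a lossy link CS is wrapped by an acknowledgment-and-retransmission layer, yielding a reliable CS at a higher level of abstraction. The names and the loss model are illustrative, not from the cited methodology.

```python
import random

class LossyLinkCS:
    """Lower-level CS: delivers a message or silently drops it."""
    def __init__(self, loss_rate=0.3):
        self.loss_rate = loss_rate
        self.delivered = []
    def send(self, msg) -> bool:
        if random.random() < self.loss_rate:
            return False                  # lost: no acknowledgment comes back
        self.delivered.append(msg)
        return True                       # models a received acknowledgment

class ReliableCS:
    """Vertical composition: a retransmission layer stacked on the lossy CS."""
    def __init__(self, lower: LossyLinkCS, max_tries=10):
        self.lower, self.max_tries = lower, max_tries
    def send(self, msg):
        for _ in range(self.max_tries):   # retry until the lower CS succeeds
            if self.lower.send(msg):
                return
        raise RuntimeError("CS constraint violated: message not delivered")

npi = ReliableCS(LossyLinkCS())
for m in range(5):
    npi.send(m)                           # reliable delivery over a lossy link
```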
22.5 Fault-Tolerant Platforms

The increasing role of embedded software in real-time feedback-control systems drives the demand for fault-tolerant design methodologies [8]. The aerospace and automotive industries offer many examples of systems whose failure may have unacceptable costs (financial, human, or both). Designing cost-sensitive real-time control systems for safety-critical applications requires a careful analysis of the cost/coverage trade-offs of fault-tolerant solutions. This further complicates the task of deploying the embedded software that implements the control algorithms on the execution platform. The latter is often distributed around the plant, as is typical, for instance, in automotive applications. In this section, we present a synthesis-based design methodology that relieves the designers from the burden of specifying detailed mechanisms for addressing the execution platform faults, while involving them in the definition of the overall fault-tolerance strategy. Thus, they can focus on addressing plant faults within their control algorithms, selecting the best components for the execution platform, and defining an accurate fault model. Our approach is centered on a new model of computation, Fault-Tolerant Data Flows (FTDF), that enables the integration of formal validation techniques.
22.5.1 Types of Faults and Platform Redundancy

In a real-time feedback-control system, like the one in Figure 22.3, the controller interacts with the plant by means of sensors and actuators. A controller is a hardware–software system where the software algorithms that implement the control law run on an execution platform. An execution platform is a distributed system that is typically made of a software layer (RTOS, middleware services, . . . ) and a hardware layer (a set of processing elements, called Electronic Control Units or ECUs, connected via communication channels such as buses, crossbars, or rings). The design of these heterogeneous reactive distributed systems is made even more challenging by the requirement of making them resilient to faults. Technically, a fault is the cause of an error, an error is the part of the system state which may cause a failure, and a failure is the deviation of the system from the specification [9]. A deviation from the specification may be owing to the designer's mistakes (“bugs”) or to accidents occurring while the system is operating. The latter can be classified into two categories that are relevant for feedback-control systems: plant faults and execution platform faults. Theoretically, all bugs can be eliminated before the system is deployed. In practice, they are minimized by using design environments that are based on precise Models of Computation (MoCs), whose well-defined semantics enable formal validation techniques [10–12] (e.g., synchronous languages [13]).
[Figure: the plant with its sensors and actuators; the controller comprises sensor and actuator drivers, the control-law algorithms, the embedded software with RTOS and middleware, and a hardware architecture of interconnected ECUs forming the execution platform.]
FIGURE 22.3 A real-time control system.
Instead, plant faults and execution platform faults must be dealt with online. Hence, they must be included in the specification of the system to be designed. Plant faults, including faults in sensors and actuators, must be handled at the algorithmic level using estimation techniques and adaptive control methods. For instance, a drive-by-wire system [14, 15] might need to handle properly a tire puncture or the loss of one of the four brakes. Faults in the execution platform affect the computation, storage, and communication elements. For instance, a loss of power may turn off an ECU, momentarily or forever. System operation can be preserved in spite of platform faults if alternative resources supplying the essential functionality of the faulty one are available. Hence, the process of making the platform fault-tolerant usually involves the introduction of redundancy, with an obvious impact on the final cost. While the replication of a bus or the choice of a faster microprocessor may not sensibly affect the overall cost of a new airplane, their impact is quite significant for high-volume products like those of the automotive industry. The analysis of the trade-offs between higher redundancy and lower costs is a challenging hardware–software codesign task that designers of fault-tolerant systems for cost-sensitive applications must face, in addition to the following two: (1) how to introduce redundancy, and (2) how to deploy the redundant design on a distributed execution platform. Since these activities are both tedious and error prone, designers often rely on off-the-shelf solutions to address fault tolerance, such as the Time-Triggered Architecture (TTA) [16]. One of the main advantages of off-the-shelf solutions is that the application does not need to be aware of the fault-tolerant mechanisms that are transparently provided by the architecture to cover the execution platform faults. Instead, designers may focus their attention on avoiding design bugs and tuning the control algorithms to address the plant faults. However, the rigidity of off-the-shelf solutions may lead to suboptimal results from a design cost viewpoint.
22.5.2 Fault-Tolerant Design Methodology

We present an interactive design methodology that involves designers in the exploration of the redundancy/cost trade-off [17]. To do so efficiently, we need automatic tools to bridge the different platforms in the system platform stack. In particular, we introduce automatic synthesis techniques that process simultaneously the algorithm specification, the characteristics of the chosen execution platform, and the corresponding fault model. Using this methodology, the designers focus on the control algorithms and the selection of the components and architecture for the execution platform. In particular, they also
specify the relative criticality of each algorithm process. Based on a statistical analysis of the failure rates, which should be part of the characterization of the execution platform library, designers specify the expected set of platform faults, that is, the fault model. Then, we use this information to (1) automatically deduce the necessary software process replication, (2) distribute each process on the execution platform, and (3) derive an optimal scheduling of the processes on each ECU to satisfy the overall timing constraints. Together, the three steps (replication, mapping, and scheduling) result in the automatic deployment of the embedded software on the distributed execution platform. Platforms export performance estimates, and we can determine for each control process its worst-case execution time (WCET) on a given component.² Then, we can use a set of verification tools to assess the quality of the deployment; most notably, we have a static timing analysis tool to predict the worst-case latency from sensors to actuators. When the final results do not satisfy the timing constraints for the control application, precise guidelines are returned to the designers, who may use them to refine the control algorithms, modify the execution platform, and revisit the fault model. While being centered on a synthesis step, our approach does not exclude the use of predesigned components, such as TTA modules, communication protocols such as TTP [19], and fault-tolerant operating systems. These components can be part of a library of building blocks that the designer uses to further explore the fault-coverage/cost trade-off. Finally, the proposed methodology is founded on a new MoC, FTDF, thus making it amenable to the integration of formal validation techniques. The corresponding API platform consists primarily of the FTDF MoC.

22.5.2.1 Fault Model

For the sake of simplicity we assume fail silence: components either provide correct results or do not provide any result at all. Recent work shows that fail-silent platforms can be realized with limited area overhead and virtually no performance penalty [20]. The fail-silence assumption can be relaxed if invalid results are detected otherwise, as in the case of CRC-protected communication and voted computation [21]. However, it is important to note that the proposed API platform (FTDF) is fault-model independent. For instance, the presence of value errors, where majority voting is needed, can be accounted for in the implementation of the FTDF communication media (see Section 22.5.3). The same is true for Byzantine failures, where components can have any behavior, including malicious ones like coordinating to bring the system down to a failure [22]. In addition to the type of faults, a fault model also specifies the number (or even the mix) of faults to be tolerated [23]. A statistical analysis of the various components' MTBFs (Mean Time Between Faults), their interactions, and MTBRs (Mean Time Between Repairs) should determine which subsystems have a compound MTBF that is so short as to be of concern, and should be part of the platform component characterization. The use of failure patterns to capture effectively these interactions was proposed in Reference 24, which is the basis of our approach [17].

22.5.2.2 Setup

Consider the feedback-control system in Figure 22.3.
The control system repeats the following sequence at each period T_max: (1) sensors are sampled, (2) software routines are executed, and (3) actuators are updated with the newly processed data. The actuator updates are applied to the plant at the end of the period to help minimize jitter, a well-known technique in the real-time control community [25, 26]. In order to guarantee correct operation, the WCET among all possible iterations, that is, the worst-case latency from sensors to actuators, must be smaller than the given period T_max (the real-time constraint), which is determined by the designers of the controller based on the characteristics of the application. Moreover, the critical subset of the control algorithms must be executed in spite of the specified platform faults.

22.5.2.3 Example

Figure 22.4 illustrates a FTDF graph for a paradigmatic feedback-control application, the inverted pendulum control system. The controller is described as a bipartite directed graph G where the vertices, called actors and communication media, represent software processes and data communication.

²See Reference 18 for some issues and techniques to estimate WCETs.
[Figure: the inverted pendulum (the plant) with three sensors and two actuators; sensor data flow through communication media (m) to an input actor, then to coarse and fine control tasks, an arbiter, and an output actor driving the actuators.]
FIGURE 22.4 Controlling an inverted pendulum.
[Figure: three ECUs (ECU0, ECU1, ECU2) interconnected by two channels (CH0, CH1).]
FIGURE 22.5 A simple platform graph.
Figure 22.5 illustrates a possible platform graph (PG), where vertices represent ECUs and communication channels and edges describe their interconnections.

22.5.2.4 Platform Characteristics

Each vertex of PG is characterized by its failure rate and by its timing performance. A failure pattern is a subset of vertices of PG that may fail together during the same iteration, with a probability high enough to be of concern. A set of failure patterns identifies the fault scenarios to be tolerated. Based on the timing performance, we can determine the WCET of actors on the different ECUs and the worst-case transmission time of data on channels. Graphs G and PG are related in two ways:
• Fault-tolerance binding: for each failure pattern the execution of a corresponding subset of the actors of G must be guaranteed. This subset is identified a priori based on the relative criticality assignment.
• Functional binding: a set of mapping constraints and performance estimates indicate where on PG each vertex of G may be mapped and the corresponding WCET.

These bindings are the basis to derive a fault-tolerant deployment of G on PG. We use software replication to achieve fault tolerance: critical routines are replicated statically (at compile time) and executed on separate ECUs, and the processed data are routed on multiple communication paths to withstand channel failures. In particular, to have a correct deployment in the absence of faults, it is necessary that all actors and data communications are mapped onto ECUs and channels in PG. Then, to have a correct fault-tolerant deployment, critical elements of G must be mapped onto additional PG vertices to guarantee their correct and timely execution under any possible failure pattern in the fault model.

22.5.2.5 Design Flow

Using the interactive design flow of Figure 22.6, designers:
• Specify the controller (the top-left FTDF graph)
• Assemble the execution platform (the top-right PG)
• Specify a set of failure patterns (subsets of PG)
• Specify the fault-tolerance binding (fault behavior)
• Specify the functional binding
All this information contributes to specifying what the system should do and how it should be implemented. A synthesis tool automatically:
• Introduces redundancy in the FTDF graph
• Maps actors and their replicas onto PG
• Schedules their execution

Finally, a verification tool checks whether the fault-tolerant behavior and the timing constraints are met. If no solution is found, the tool returns a violation witness that can be used to revisit the specification and to provide hints to the synthesis tool.
[Figure: the FTDF graph (sensors, input, coarse and fine control, arbiter, output, actuators) annotated with the fault behavior is mapped onto the platform graph; replicated actors are distributed over ECU0, ECU1, and ECU2, connected by channels CH0 and CH1.]
FIGURE 22.6 Interactive design flow.
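To illustrate the fault-tolerance binding checked by such a flow, the sketch below verifies, for each failure pattern, that every critical actor still has at least one replica mapped on a surviving ECU. This is our own simplified feasibility check (it ignores channels and timing), not the synthesis algorithm of Reference 17.

```python
# mapping: actor name -> set of ECUs hosting its replicas
# failure_patterns: iterable of sets of ECUs that may fail together
def covers_all_patterns(critical, mapping, failure_patterns):
    for pattern in failure_patterns:
        for actor in critical:
            surviving = mapping[actor] - pattern
            if not surviving:            # no replica left under this pattern
                return False, (actor, pattern)
    return True, None

mapping = {"input": {"ECU0", "ECU1"},
           "coarse_ctrl": {"ECU0", "ECU2"},
           "arbiter": {"ECU1", "ECU2"}}
ok, witness = covers_all_patterns(
    ["input", "coarse_ctrl", "arbiter"], mapping,
    [{"ECU0"}, {"ECU1"}, {"ECU0", "ECU2"}])
print(ok, witness)  # False: coarse_ctrl loses all replicas if ECU0, ECU2 fail
```

A witness like the one printed above plays the role of the violation witness returned to the designer, who can then add a replica or revisit the fault model.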
22.5.3 The API Platform (FTDF Primitives)

In this section we present the structure and general semantics of the FTDF MoC. The basic building blocks are actors and communication media. FTDF actors exchange data tokens at each iteration with synchronous semantics [13]. An actor belongs to one of six possible classes: sensors, actuators, inputs, outputs, tasks, or arbiters. Sensor and actuator actors read and update, respectively, the sensor and actuator devices interacting with the plant. Input actors perform sensor fusion, output actors are used to balance the load on the actuators, while task actors are responsible for the computation workload. Arbiter actors mix the values that come from actors with different criticality and are directed to the same output actor (e.g., braking command and Antilock Braking System [ABS]).³ Finally, state memories are connected to actors and operate as one-iteration delays. With a slight abuse of terminology, the terms state memory and memory actor are used interchangeably in this section.

22.5.3.1 Tokens

Each token consists of two fields: Data, the actual data being communicated, and Valid, a boolean flag indicating the outcome of fault detection on this token. When Valid is “false,” either no data is available for this iteration or the available data is not correct. In both cases the Data field should be ignored. The Valid flag is just an abstraction of more concrete and robust fault detection implementations.

22.5.3.2 Communication Media

Communication occurs via unidirectional (possibly many-to-many) communication media. All replicas of the same source actor write to the same medium, and all destination actors read from it. Media act as both mergers and repeaters, sending the single “merged” result to all destinations. More formally, the medium provides the correct merged result or an invalid token if no correct result is determined. Assuming fail silence, merging amounts to selecting any of the valid results; assuming value errors, majority voting is necessary; assuming Byzantine faults requires rounds of voting (see the consensus problem [27]). Communication media must be distributed to withstand platform faults. Typically, this means having a repeater on each source ECU and a merger on each destination ECU (broadcasting communication channels help reduce message traffic greatly). Using communication media, actors always receive exactly one token per input, and the application behavior is independent of the type of platform faults. The transmission of tokens is initiated by the active elements: regular actors and memory actors.

22.5.3.2.1 Regular Actors

When an actor fires, its sequential code is executed. This code is: stateless (state must be stored in memory actors), deterministic (identical inputs generate identical outputs), nonblocking (once fired, it does not wait for further tokens, data, or signals from other actors), and terminating (bounded WCET). The firing rule specifies which subsets of input tokens must be valid to fire the actor, typically all of them (AND firing rule). However, the designer may need to specify partial firing rules for input and arbiter actors. For example, an input actor reading data from three sensors may produce a valid result even when one of the sensors cannot deliver data (e.g., when the ECU where the sensor is mapped is faulty).

22.5.3.2.2 Memory Actors (State Memories)

A memory provides its state at the beginning of an iteration and has a source actor, possibly replicated, that updates its state at every iteration.
State memories are analogous to latches in a sequential digital circuit: they store the results produced during the current iteration for use in the next one. Finally, FTDF graphs can express redundancy, that is, one or more actors may be replicated. All the replicas of an actor v ∈ A are denoted by R(v) ⊂ A. Note that any two actors in R(v) are of the same type and must compute the same function. This basic condition is motivated in Section 22.5.5, where replica determinism is discussed. Note that the replication of sensors and actuators is not performed automatically because it may have a major impact on cost; we discuss the implications of this choice in Reference 17.

³We advocate running non-safety-critical tasks, for example, door controllers, on separate hardware. However, some performance-enhancement tasks, for example, side-wind compensation, may share sensors and actuators with critical tasks (steer-by-wire). It may be profitable to have them share the execution platform as well.
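A minimal sketch of the token and merging abstractions described above, under the fail-silence assumption; the classes and the one-line merger are illustrative, not the FTDF implementation itself.

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class Token:
    data: Any
    valid: bool      # outcome of fault detection; ignore data when False

INVALID = Token(None, False)

def merge_fail_silent(tokens: list[Token]) -> Token:
    # Under fail silence any valid replica result is correct: pick the first.
    # With value errors, majority voting would replace this selection.
    for t in tokens:
        if t.valid:
            return t
    return INVALID   # no replica delivered a result this iteration

def fire_and(actor_fn, inputs: list[Token]) -> Token:
    # AND firing rule: the actor fires only if all input tokens are valid.
    if all(t.valid for t in inputs):
        return Token(actor_fn(*[t.data for t in inputs]), True)
    return INVALID   # actor is skipped; downstream sees an invalid token

out = fire_and(lambda a, b: a + b,
               [merge_fail_silent([Token(3, True), INVALID]),
                Token(4, True)])
print(out)           # Token(data=7, valid=True)
```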
22.5.4 Fault-Tolerant Deployment

The result of the synthesis is a redundant mapping L, that is, an association of elements of the FTDF network to multiple elements of the execution platform, and, for each element in the execution platform, a schedule S, that is, a total order in which actors should be executed and data should be transmitted. A pair (L, S) is called a deployment. To avoid deadlocks, the total orders defined by S must be compatible with the partial order in L, which in turn derives directly from the partial order in which the FTDF actors in the application must be executed. To avoid causality problems, memory actors are scheduled before any other actor, thus using the results of the previous iteration. Schedules based on total orders are called static: there are no runtime decisions to make; each ECU and each channel controller simply follows the schedule. However, in the context of a faulty execution platform an actor may not receive enough valid inputs to fire, and this may lead to starvation. This problem is solved by skipping an actor if it cannot fire and by skipping a communication if no data is available [24].
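The sketch below runs one iteration of such a static schedule with the skip rule, reusing the Token helpers from the previous sketch; the schedule encoding is again our own illustration rather than the synthesized format.

```python
# One ECU's static schedule: an ordered list of (actor_fn, input_names, output).
# Entries execute in the fixed total order; an actor that cannot fire is
# skipped and its output slot stays invalid, so downstream actors may skip too.
def run_iteration(schedule, tokens):
    for fn, in_names, out_name in schedule:
        ins = [tokens.get(n, INVALID) for n in in_names]
        tokens[out_name] = fire_and(fn, ins)   # INVALID when inputs missing
    return tokens

schedule = [(lambda s: s * 2.0, ["sensor0"], "input0"),
            (lambda x: x + 1.0, ["input0"], "ctrl_out")]
print(run_iteration(schedule, {"sensor0": Token(0.5, True)}))
print(run_iteration(schedule, {"sensor0": INVALID}))   # both actors skip
```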
22.5.5 Replica Determinism

Given a mapping L, it is important to preserve replica determinism: if two replicas of the same actor fire, they produce identical results. For general MoCs the order of arrival of results must also be the same for all replicas; the synchrony of FTDF makes this check unnecessary. Clearly, the execution platform must contain the implementation of a synchronization algorithm [28]. Replica determinism in FTDF can be achieved by enforcing two conditions: (1) all replicas compute the same function, and (2) for any failure pattern, if two replicas get a firing subset of inputs, they get the same subset of inputs. Condition (1) is enforced by construction by allowing only identical replicas. Condition (2) amounts to a consensus problem, and it can either be checked at runtime (as in Byzantine agreement rounds of voting) or analyzed statically at compile time (if the fault model is milder). Our interest in detectably faulty execution platforms makes the latter approach appear more promising and economical. Condition (2) is trivially true for all actors with the AND firing rule. For input and arbiter actors the condition must be checked and enforced [17].
22.6 Analog Platforms

Emerging applications such as multimedia devices (video cell phones, digital cameras, and wireless PDAs, to mention but a few) are driving the SoC market towards the integration of analog components in almost every system. Today, system-level analog design is a process dominated by heuristics. Given a set of specifications/requirements that describes the system to be realized, the selection of a feasible (let alone optimal) implementation architecture comes mainly out of experience. Usually, what is achieved is just a feasible point at the system level, while optimality is sought locally at the circuit level. This practice is caused by the number of second-order effects that are very hard to deal with at a high level without actually designing the circuit. Platform-based design can provide the necessary insight to develop a methodology for analog components that takes into consideration system-level specifications and can choose among a set of possible solutions, including digital approaches wherever it is feasible to do so. If the “productivity gap” between analog and digital components is not overcome, the time-to-market and design quality of SoCs will be seriously affected by the small analog sections required to interface with the real world. Moreover, SoC designs will expose system-level explorations that would be severely limited if the analog section is not provided with a proper abstraction level that allows system performance estimation in an efficient way and across the analog/digital boundary. Therefore, there is a strong need to develop more abstract design techniques that can encapsulate analog design into a methodology that could shorten design time without
compromising the quality of the solutions, leading to a hardware/software/analog co-design paradigm for embedded systems.
22.6.1 Definitions

The platform abstraction process can be extended to analog components in a very natural way. Deriving behavioral and performance models, however, is more involved owing to the tight dependency of analog components on device physics, which requires the use of continuous mathematics for modeling the relations among design variables. Formally, an Analog Platform (AP) consists of a set of components, each decorated with:
• A set of input variables u ∈ U, a set of output (performance) variables y ∈ Y, a set of “internal” variables (including state variables) x ∈ X, and a set of configuration parameters κ ∈ K; some parameters take values in a continuous space, some take values in a discrete set, for example, when they encode the selection of a particular alternative.
• A behavioral model that expresses the behavior of the component, represented implicitly as F(u, y, x, κ) = 0, where F(·) may include integro-differential components; in general, this set of relations uniquely determines x and y given u and κ. Note that the variables considered here can be functions of time and that the functional F includes constraints on the set of variables (for example, the initial conditions on the state variables).
• A feasible performance model. Let φ_y(u, κ) denote the map that computes the performance y corresponding to a particular value of u and κ by solving the behavioral model. The set of feasible analog performances (such as gain, distortion, and power) is the set described by the relation P(y(u)) = 1 ⇔ ∃κ, y(u) = φ_y(κ, u).
• Validity laws L(u, y, x, κ) ≤ 0, that is, constraints (or assumptions) on the variables and parameters of the component that define the range of the variables for which the behavioral and performance models are valid.

Note that there is no real need to define the feasible performance model, since the necessary information is all contained in the behavioral model. We prefer to keep them separate because of the use we make of them in explaining our approach. At the circuit level of abstraction, the behavioral models are the circuit equations, with x being the voltages, currents, and charges, and y being a subset of x and/or a function of x and κ when they express performance figures such as power or gain. To compute performance models, we need to solve the behavioral models, which implies solving ordinary differential equations, a time-consuming task. In the past, methods to approximate the relation between y and κ (the design variables) with an explicit function were proposed. In general, to compute this approximation, a number of evaluations of the behavioral model for a number of parameters κ is performed (by simulation, for example) and then an interpolation or approximation scheme is used to derive the approximation to the map φ_y. We see in Section 22.6.2 how to compute an approximation to the feasible performance set directly.

Example 22.1
Considering an OTA for an arbitrary application, we can start building a platform from the circuit level by defining:
• U as the set of all possible input voltages V_in(t) s.t. |V_in| < 100 mV and bandwidth of V_in < 3 MHz; Y as the space of vectors {V_out(t), gain, IIP3, r_out} (IIP3 is the third-order intermodulation intercept point referred to the input, r_out is the output resistance); X as the set of all internal currents and voltages; and K as the set of transistor sizings.
• For a transistor-level component, the behavioral model F consists of the solution of the circuit equations, e.g., through a circuit simulator.
• φ_y(u, κ) as the set of all possible y.
• Validity laws L obtained from Kirchhoff laws when composing individual transistors and other constraints, e.g., maximum power ratings or breakdown voltages.
We can build a higher-level (level 1) OpAmp platform where:
• U¹ is the same; Y¹ is the output voltage of the OpAmp; X is empty; K¹ consists of possible {gain, IIP3, r_out} triples (thus it is a projection of Y⁰);
• F¹ can be expressed in explicit form,
y₁(t) = h(t) ⊗ (a₁·u(t) + a₃·u(t)³) + noise,  y₂ = a₁,  y₃ = (4/3)·(a₁/a₃)
• φ_y is the set of possible y;
• there are no validity constraints (L < 0 always).
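As a numerical reading of this level-1 model, the sketch below evaluates the static nonlinearity a₁·u + a₃·u(t)³ for a two-tone input; the coefficient values and the omission of the filter h(t) and the noise term are our own simplifications for illustration.

```python
import numpy as np

a1, a3 = 100.0, -4.0e4          # illustrative weakly nonlinear coefficients
fs, f1, f2 = 1.0e6, 9.0e3, 11.0e3
t = np.arange(0, 2e-3, 1 / fs)
u = 1e-3 * (np.sin(2 * np.pi * f1 * t) + np.sin(2 * np.pi * f2 * t))

# Level-1 behavioral model without h(t) and noise: memoryless nonlinearity.
y1 = a1 * u + a3 * u**3

y2 = a1                          # small-signal gain
y3 = (4.0 / 3.0) * (a1 / a3)     # the a1/a3 figure entering the IIP3 estimate
print(y2, y3)

# The spectrum of y1 exhibits intermodulation products at 2*f1-f2 and 2*f2-f1,
# which is what the IIP3 performance parameter summarizes.
spectrum = np.abs(np.fft.rfft(y1))
```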
When a platform instance is considered, we have to compose the models of the components to obtain the corresponding models for the instance. The platform instance is then characterized by:
• a set of internal variables of the platform ξ = [ξ1, ξ2, ..., ξn] ∈ Ξ;
• a set of inputs of the platform h ∈ H;
• a set of performances υ ∈ ϒ;
• a set of parameters ζ ∈ Z.
The variable names are different from the names used to denote the variables of the components to stress that there may be situations where some of the component variables change roles (for example, an input variable of one component may become an internal variable; a new parameter can be identified in the platform instance that is not visible or useful at the component level). To compose the models, we have to include the composition rules in the platform. The legal compositions are characterized by the interconnect equations, which specify which variables are shared when composing components, and by constraints that define when the composition is indeed possible. These constraints may involve ranges of variables as well as nonlinear relations among variables. Formally, a connection establishes a pairwise equality between internal variables (for example, ξi = ξj), inputs, and performances; we denote the set of interconnect relations with c(h, ξ, ζ, κ) = 0, which are in general a set of linear equalities. The composition constraints are denoted by L(h, ξ, υ, ζ) ≤ 0, which are, in general, nonlinear inequalities. Note that in the platform instance all internal variables of the components are present as well as all input variables. In addition, there is no internal or input variable of the platform instance that is not an internal or input variable of one of the components. The behavioral model of the platform instance is the union of all behavioral models of the components conjoined with the interconnect relations. The validity laws are the conjunction of the validity laws of the components and of the composition constraints. The feasible performance model may be defined anew on the platform instance, but it may also be obtained by composition of the performance models of the components. There is an important and interesting case when the composition may be done considering only the feasible performance models of the components obtained by appropriate approximation techniques. In this case, the composition constraints assume the semantics of defining when the performance models may be composed. For example, if we indicate with λ the parameters related to internal nodes that characterize the interface in Figure 22.7(a) (e.g., input/output impedance in the linear case), then matching between the λs has to be enforced during composition. In fact, both P_A and P_B were characterized with specific λs (Figure 22.7[b]), so L has to constrain the A–B composition consistently with the performance models. In this case, an architectural exploration step, consisting of forming different platform instances out of the component library and evaluating them, can be performed very quickly, albeit possibly with restrictions on the space of the considered instances caused by the composition constraints.
[Figure: (a) platform composition, with A driving B through an interface characterized by parameter λ; (b) the characterization setups for platforms A and B, using interface parameters λ_L and λ_S.]
FIGURE 22.7 Interface parameter λ during composition A–B and characterization of A and B.
Example 22.2
We can build a level 2 platform consisting of an OpAmp (OA) and a unity-gain buffer following it (UB; the reader can easily find a proper definition for it), and then we can define a higher-level OpAmp platform component so that:
• ξ1 = V_in^OA, ξ2 = V_out^OA, ξ3 = V_in^UB, ξ4 = V_out^UB, and we connect the two components in series specifying ξ2 = ξ3;
• h connected to ξ1 is the set of input voltages V_in(t);
• ϒ is the space of υ1(t), the cascade response in time, υ2 = gain, and υ3 = IIP3. In this case υ2 immediately equals y2^OA, while υ3 is a nonlinear function of y^OA and y^UB;
• Z consists of all parameters specifying a platform instance; in this case we may have Z = Y^OA ∪ Y^UB;
• a platform instance composability law L requires that the load impedance Z_L > 100·r_out both at the output of the OpAmp and at the output of the unity buffer.
22.6.2 Building Performance Models

An important part of the methodology is obtaining performance models. We already mentioned that we need to approximate the set Ŷ explicitly, eliminating the dependence on the internal variables x. To do so, a simulation-based approach is proposed.

22.6.2.1 Performance Model Approximation

In general terms, simulation maps a configuration set (typically connected) K into a performance set in Y, thus establishing a relation among points belonging to the mapped set. Classic regression schemes
provide an efficient approximation to the mapping function φ(·); however, our approach requires dealing with performance data in two different ways. The first one, referred to as the performance model P, allows discriminating between points in Ŷ and points in Y\Ŷ. The second one, µ(·) = φ⁻¹(·), implements the inverse mapping from Ŷ into K, used to map down from a higher-level platform layer to a lower one. However, fundamental issues (i.e., φ(·) having to be an invertible function) and accuracy issues (a regression from R^m into R^n) suggest a table-lookup implementation for µ(·), possibly followed by a local optimization phase to improve the mapping. Therefore, we will mainly focus on basic performance models P. The set Ŷ ⊂ Y defines a relation in Y denoted with P. We use Support Vector Machines (SVMs) as a way of approximating the performance relation P [29]. SVMs provide approximating functions of the form

f(x) = sign( Σ_i α_i e^{−γ·|x−x_i|²} − ρ )    (22.1)
where x is the vector to be classified, the x_i are observed vectors, the α_i are weighting multipliers, ρ is a biasing constant, and γ is a parameter controlling the fit of the approximation. More specifically, SVMs exploit mappings to Hilbert spaces so that hyperplanes can be used to perform classification. Mapping to high-dimensional spaces is achieved through kernel functions, so that a kernel k(κ, ·) is associated with each point κ. Since the only general assumption we can make on φ(·) is continuity and on K is connectivity,⁴ we can only deduce that Ŷ is connected as well. Therefore, the radial basis function Gaussian kernel is chosen, k(κ, κ′) = e^{−γ·|κ−κ′|²}, where γ is a parameter of the kernel and controls the “width” of the kernel function around κ. We resort to a particular formulation of SVMs known as the one-class SVM, where an optimal hyperplane is determined to separate the data from the origin. The optimal hyperplane can be computed very efficiently through a quadratic program, as detailed in [30].

22.6.2.2 Optimizing the Approximation Process

Sampling schemes for approximating unknown functions are exponentially dependent on the size of the function support. In the case of circuits, none but the very simplest could be realistically characterized in this way. Fortunately, there is no need to sample the entire space K, since we can use additional information obtained from design considerations to exclude parts of the parameter space. The set of “interesting” parameters is delimited by a set of constraints of three types:
• topological — constraints derived from the use of particular circuit structures, such as two stacked transistors sharing the same current or a set of V_DS summing to zero;
• physical — constraints induced by device physics, such as V_GS–V_DS relations to enforce saturation or g_m–I_D relations;
• performance — constraints on circuit performances, such as the minimum gain or minimum phase margin that can be achieved.

Additional constraints can be added as the designers' understanding of the circuit improves. The more constraints we add, the smaller the interesting configuration space K. However, if a constraint is tight, that is, if it defines lower-dimensional manifolds (for example, when the constraint is an equality) or if the measure of the manifold is small, it is likely to introduce some bias in the sampling mechanism because of the difficulty of selecting points in these manifolds. To eliminate this ill-conditioning effect, we “relax” these constraints to include a larger set of interesting parameters. We adopt a statistical means of relaxing constraints by introducing random errors with the aim of dithering systematic errors and recovering accuracy in a statistical sense. Given an equality constraint f(κ) = 0 and its approximation f̃(κ) = 0, we derive a relaxation |f̃(κ)| ≤ ε. For each constraint f, some statistics have to be gathered on ε so as to minimize the overhead that introducing it places on the size of K. Once we have this set of constraints, we need to take them into account to define the space of interesting parameters. To do so, we can establish an ordering of the constraints so that the evaluation of the space is faster and sampling can be restricted effectively. Analog Constraint Graphs (ACGs) are introduced as a bipartite graph representation of the configuration constraints. One set of nodes corresponds to equations, the other to the variables κ. Bipartite graphs are a common form for dealing with systems of equations [31].
We exploit ACGs to find solutions to configuration constraints, thus providing an executable configuration sampler to be used in our platform configuration framework. A maximal bipartite matching in the ACG is used to compute an evaluation order for the equations, which is then translated into executable code capable of generating configurations in K by construction. In our experience, even with reasonably straightforward constraints, ratios of the order of 10⁻⁶ were observed for size(K̂)/size(K), with K ⊂ R¹⁶. When we deal with the intersection of achievable performances and performance constraints in the top-down phase of the design, we can add the performance constraints to the set of constraints we use to restrict the sampling space, so that the results reported above are even more impressive.
⁴More generally, a union of a finite number of connected sets.
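To make the one-class SVM formulation of Equation (22.1) concrete, the sketch below fits a performance model P on sampled performance points and uses it to classify candidate performance vectors. We use scikit-learn's OneClassSVM with an RBF kernel, which matches the form of Equation (22.1); the data, parameter values, and performance axes are purely illustrative stand-ins for simulated circuit performances.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# Stand-ins for simulated performances y = phi_y(kappa), e.g. (gain, power).
# In the real flow these points come from circuit simulation over sampled K.
y_samples = np.column_stack([rng.uniform(40, 60, 500),     # gain (dB)
                             rng.uniform(1.0, 3.0, 500)])  # power (mW)

# One-class SVM, RBF kernel: f(x) = sign(sum_i alpha_i k(x, x_i) - rho)
model = OneClassSVM(kernel="rbf", gamma=0.1, nu=0.05).fit(y_samples)

candidates = np.array([[50.0, 2.0],    # inside the sampled performance cloud
                       [80.0, 0.1]])   # far outside: likely not achievable
print(model.predict(candidates))       # +1 for points in Y_hat, -1 otherwise
```

In practice the performance axes would be normalized before fitting, and γ and ν tuned against held-out simulation data, since they control how tightly the model wraps the feasible set.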
22.6.3 Mixed-Signal Design Flow with Platforms

The essence of platform-based design is building a set of abstractions that facilitate the design of complex systems by a successive refinement/abstraction process. The abstraction takes place when an existing set of components forming a platform at a given level of abstraction is elevated to a higher level by building appropriate behavioral and performance models together with the appropriate validity laws. This process can either take the components at one level of abstraction and abstract each of them, or abstract a set of platform instances. Since both platform instances and platform components are described at the same level of abstraction, the process is essentially the same; what changes is the exploration approach.

On the other side of the coin, the top-down phase progresses through refinement. Design goals are captured as constraints and a cost function. At the highest level of abstraction, the constraints are intersected with the feasible performance set to identify the set of achievable performances that satisfy the design constraints. The cost function is then optimized with respect to the parameters of the platform instances at the highest level of abstraction, ensuring that they lie in the intersection of the constraint set and the feasible performance set. This constrained optimization problem yields a point in the feasible performance space and in the parameter space for the platform instances at the highest level of abstraction. Using the inverse of the abstraction map φ_y, these points are mapped back to a lower level of abstraction, where the process is repeated to yield a new point in the achievable performance set and in the parameter space, until we reach a level where circuit diagrams and even a layout are available. If the abstraction map is conservative, then every time we map down, we always find a consistent solution that can be achieved. Hence the design process can be shortened considerably. The crux of the matter is how many feasible points are not considered because of the conservative approximation. Thus the usual design speed versus design quality trade-off has to be explored.

In mathematical terms, the bottom-up phase consists of defining an abstraction ψ_l that maps the inputs, performances, internal variables, parameters, behavioral and performance models, and validity laws of a component or platform instance at level l into the corresponding objects at the next level (l + 1). The map is conservative if all feasible performance vectors ŷ_{l+1} correspond to feasible performance vectors ŷ_l. Note that if approximations are involved in defining the models and the maps, this is not necessarily true, i.e., abstraction maps may be nonconservative. In other words, a feasible performance vector at level l + 1 may not correspond to a feasible one at level l. A simplified diagram of the bottom-up phase for circuit-level components is shown in Figure 22.8. For each library component, we define a behavioral model and a performance model. Then, a suitable topology is determined, an ACG is derived to constrain the configuration space K, and a performance model is generated. This phase can be iterated, leading to a library that can span multiple topologies, as reported in Figure 22.9. The top-down phase then proceeds by formulating a top-level design problem as an optimization problem with a cost function C(y_top) and a set of constraints defined in the Y_top space, g_top(y_top) ≤ 0, that identifies a feasible set in Y_top.
The complete optimization problem has to include the set Ŷ_top that defines the set of achievable performances at the top level. The intersection of the two sets defines the feasible set for the optimization process. The result of the process is a point y_top^opt. The process then maps the selected point back to the lower levels of the hierarchy. If the abstractions are conservative, the top-down process is straightforward. Otherwise, at each level of the hierarchy, we have to verify the result using the performance models, the behavioral models, and the validity laws. In some cases, a better design may be obtained by introducing in the top-down phase cost functions and constraints that are defined only at a particular abstraction level. In this case, the space of achievable performances intersected with this new set of constraints defines the search space for the optimization process. At times, it is more convenient to project the cost function and the constraints of the higher-level abstraction down to the next level. In this case, the search space is the result of the intersection of three sets in the performance space, and the cost function is a combination of the projected cost function and the one defined at this level. A flow chart summarizing the top-down flow with platforms is shown in Figure 22.10. In Figure 22.11 the set of configurations evaluated during an optimization run for the UMTS front-end in [32] is reported, visualizing how multiple topologies are exploited in selecting optimal points.
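A minimal sketch of this top-level step, under assumptions of ours (scipy as the solver, a two-dimensional performance space, and the one-class SVM of the earlier sketch standing in for the achievable set Ŷ_top), might look as follows:

```python
# Illustrative sketch: optimize a cost C(y) subject to g(y) <= 0 and to
# membership in the achievable set, encoded by an SVM decision function.
import numpy as np
from scipy.optimize import minimize

def cost(y):                       # e.g., trade power against gain
    gain, power = y
    return power - 0.01 * gain

def design_constraints(y):         # g(y) <= 0 rewritten as c(y) >= 0
    gain, power = y
    return np.array([gain - 15.0, 0.05 - power])  # gain >= 15, power <= 50 mW

def optimize_top_level(svm_model, y0):
    cons = [
        {"type": "ineq", "fun": design_constraints},
        # y lies in Y-hat_top iff the SVM decision function is nonnegative
        {"type": "ineq",
         "fun": lambda y: svm_model.decision_function(y.reshape(1, -1))[0]},
    ]
    return minimize(cost, y0, constraints=cons, method="SLSQP")

# Usage (with `model` trained as in the earlier sketch):
# result = optimize_top_level(model, y0=np.array([18.0, 0.03]))
```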
FIGURE 22.8 Bottom-up phase for generating AP. (Flow: define behavioral model; define performance model P; select a topology; derive the ACG and nominal configuration; generate P; the loop is iterated for each new topology.)
FIGURE 22.9 Sample model hierarchy for an LNA platform. The root node provides performance constraints for a generic LNA, which is then refined by more detailed P for specific classes of LNAs (wideband, tuned, active-L, and np-input models, adding parameters such as f−3dB, f0, Q, and IP2 to the generic G, NF, P, and IP3).
The peculiarity of a platform approach to mixed-signal design resides in the accurate performance models P, whose constraints propagate to the top-level, architecture-related constraints. For example, a platform stack can be built where multiple analog implementation architectures are presented at a common level of abstraction together with digital enhancement platforms (possibly including several algorithms and hardware architectures), each component being annotated with its feasible performance space. Solving the system design problem at the top level, where the platforms contain both analog and digital components, allows selecting optimal platform instances in terms of analog and digital solutions, comparing how different digital solutions interact with different analog topologies, and finally selecting the best tradeoff. The final verification step is also greatly simplified by the platform approach since, in the end, the models and performances used in the top-down phase were obtained with a bottom-up scheme. Therefore, a consistency check of models, performances, and composition effects is all that is required at a hierarchical
FIGURE 22.10 Top-down phase for analog design-space exploration. (Flow: build the system with APs; define a formal set of conditions for feasibility; define an objective function for optimization; optimize the system constraining behavioral models to their P; return optimal performances and candidate solutions, refining or adding platforms as needed.)
level, followed by more costly, low-level simulations that check for possible important effects that were neglected when characterizing the platform.
22.6.4 Reconfigurable Platforms

Analog platforms can also be used to model programmable fabrics. In the digital implementation platform domain, FPGAs, for example those including on-chip microprocessors, provide a very intuitive example of a platform. The appearance of Field Programmable Analog Arrays (FPAAs) [33] constitutes a new attempt to build reconfigurable analog platforms. A platform stack can be built by exploiting the software tools that allow mapping complex functionalities (filters, amplifiers, triggers, and so on) directly onto the array. The top-level platform then provides an API to map and configure analog functionalities, exposing analog hardware at the software level. By exploiting this abstraction, not only is design exploration greatly simplified, but new synergies between higher layers and analog components can be leveraged to further increase flexibility/reconfigurability and to optimize the system. From this abstraction level, implementing a functionality with digital signal processing (FPGA) or with analog processing (FPAA) becomes subject to system-level optimization while exposing the same abstract interface. Moreover, very interesting tradeoffs can be explored by exploiting different partitionings between analog and digital components and by leveraging the reconfigurability of the FPAA. For example, limited analog performances can be mitigated by proper reconfiguration of the FPAA, so that a tight interaction between analog and digital subsystems can provide a new optimum from the system-level perspective.
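As a purely hypothetical illustration of the kind of API the text envisions, the sketch below exposes analog functionality mapping and runtime reconfiguration at the software level. Every class and method name here is invented for this example and corresponds to no real FPAA toolchain.

```python
# Hypothetical sketch of a software-level API over a reconfigurable
# analog fabric; all names are invented for illustration.
class AnalogFabric:
    """Abstracts an FPAA: functionalities are mapped, then reconfigured."""

    def __init__(self):
        self.blocks = []

    def map_filter(self, kind, cutoff_hz, order):
        # A real flow would invoke vendor mapping tools and succeed only
        # if the fabric can actually realize the request.
        block = {"kind": kind, "cutoff_hz": cutoff_hz, "order": order}
        self.blocks.append(block)
        return block

    def reconfigure(self, block, **params):
        # Runtime reconfiguration, e.g., retuning a cutoff to compensate
        # for limited analog performance, as discussed above.
        block.update(params)

fabric = AnalogFabric()
lowpass = fabric.map_filter(kind="lowpass", cutoff_hz=2.0e6, order=4)
fabric.reconfigure(lowpass, cutoff_hz=1.8e6)  # driven by system-level optimization
```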
22.7 Concluding Remarks

We defined PBD as an all-encompassing intellectual framework in which scientific research, design tool development, and design practices can be embedded and justified. In our definition, a platform is simply
FIGURE 22.11 Example of architecture selection during the top-down phase (optimization trace of NF versus Pd). In the picture, an LNA is being selected. Circles correspond to architecture 1 instances, crosses to architecture 2 instances. The black circle is the optimal LNA configuration. It can be inferred that after an initial exploration phase alternating between the two topologies, simulated annealing finally focuses on architecture 1 to converge.
an abstraction layer that hides the details of the several possible implementation refinements of the underlying layer. PBD allows designers to trade off various components of manufacturing, NRE, and design costs, while sacrificing as little potential design performance as possible. We presented examples of these concepts at different key articulation points of the design process, including system platforms as composed of two platforms (micro-architecture and API), NPs, and APs. This concept can be used to interpret traditional design steps in ASIC development, such as synthesis and layout. In fact, logic synthesis takes a level of abstraction consisting of an HDL representation (the HDL platform) and maps it onto a set of gates that are defined in a library. The library itself is the gate-level platform. The logic synthesis tools are the mapping methods that select a platform instance (a particular netlist of gates that implements the functionality described at the HDL platform level) according to a cost function defined on the parameters that characterize the quality of the elements of the library in view of the overall design goals. The present difficulties in achieving timing closure in this flow indicate the need for a different set of characterization parameters for the implementation platform. In fact, the gate-level platform does not reflect the cost associated with the selection of a particular interconnection among gates, a major problem since the performance of the final implementation depends critically on it. The present solution of making a larger step across platforms by mixing mapping tools such as logic synthesis, placement, and routing may not be the right one. Instead, a larger pay-off could be had by changing levels of abstraction and including a better parametrization of the implementation platform. We argued in this chapter that the value of PBD can be multiplied by providing an appropriate set of tools and a general framework where platforms can be formally defined in terms of rigorous semantics,
manipulated by appropriate synthesis and optimization tools, and verified. Examples of platforms have been given using the concepts that we have developed. We conclude by mentioning that the Metropolis design environment [34], a federation of integrated analysis, verification, and synthesis tools supported by a rigorous mathematical theory of meta-models and agents, has been designed to provide a general open-domain PBD framework.
Acknowledgments

We gratefully acknowledge the support of the Gigascale Silicon Research Center (GSRC), the Center for Hybrid Embedded System Software (CHESS) supported by an NSF ITR grant, the Columbus Project of the European Community, and the Network of Excellence ARTIST. Alberto Sangiovanni-Vincentelli would like to thank Alberto Ferrari, Luciano Lavagno, Richard Newton, Jan Rabaey, and Grant Martin for their continuous support in this research. We also thank the members of the DOP Center of the University of California at Berkeley for their support and for the atmosphere they created for our work. The Berkeley Wireless Research Center and our industrial partners (in particular Cadence, Cypress Semiconductors, General Motors, Intel, Xilinx, and ST Microelectronics) have contributed with designs and continuous feedback to make this approach more solid. Felice Balarin, Jerry Burch, Roberto Passerone, Yoshi Watanabe, and the Cadence Berkeley Labs team have been invaluable in contributing to the theory of meta-models and the Metropolis framework.
References

[1] K. Keutzer, S. Malik, A.R. Newton, J. Rabaey, and A. Sangiovanni-Vincentelli. System level design: orthogonalization of concerns and platform-based design. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 19(12), 2000.
[2] A.L. Sangiovanni-Vincentelli. Defining platform-based design. In EEDesign, February 2002. Available at www.eedesign.com/story/OEG20020204S0062.
[3] Felice Balarin, Massimiliano Chiodo, Paolo Giusto, Harry Hsieh, Attila Jurecska, Luciano Lavagno, Claudio Passerone, Alberto Sangiovanni-Vincentelli, Ellen Sentovich, Kei Suzuki, and Bassam Tabbara. Hardware–Software Co-Design of Embedded Systems: The POLIS Approach. Kluwer Academic Publishers, Boston/Dordrecht/London, 1997.
[4] Henry Chang, Larry Cooke, Merrill Hunt, Grant Martin, Andrew McNelly, and Lee Todd. Surviving the SOC Revolution: A Guide to Platform Based Design. Kluwer Academic Publishers, Boston/Dordrecht/London, 1999.
[5] A. Ferrari and A.L. Sangiovanni-Vincentelli. System design: traditional concepts and new paradigms. In Proceedings of the International Conference on Computer Design, October 1999, pp. 1–12.
[6] Marco Sgroi. Platform-based design methodologies for communication networks. PhD thesis, Electronics Research Laboratory, University of California, Berkeley, CA, December 2002.
[7] E.A. Lee and A. Sangiovanni-Vincentelli. A framework for comparing models of computation. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 17: 1217–1229, 1998.
[8] E.A. Lee. What's ahead for embedded software? Computer, 33: 18–26, 2000.
[9] J.C. Laprie (Ed.). Dependability: Basic Concepts and Terminology in English, French, German, Italian and Japanese, Vol. 5, Series Title: Dependable Computing and Fault-Tolerant Systems. Springer-Verlag, New York, 1992.
[10] R. Alur, T. Dang, J. Esposito, Y. Hur, F. Ivancic, V. Kumar, I. Lee, P. Mishra, G.J. Pappas, and O. Sokolsky. Hierarchical modeling and analysis of embedded systems. Proceedings of the IEEE, 91: 11–28, 2003.
[11] S. Edwards, L. Lavagno, E. Lee, and A.L. Sangiovanni-Vincentelli. Design of embedded systems: formal methods, validation and synthesis. Proceedings of the IEEE, 85: 266–290, 1997.
[12] J. Eker, J.W. Janneck, E.A. Lee, J. Liu, J. Ludwig, S. Neuendorffer, S. Sachs, and Y. Xiong. Taming heterogeneity — the Ptolemy approach. Proceedings of the IEEE, 91: 127–144, 2003.
[13] A. Benveniste, P. Caspi, S. Edwards, N. Halbwachs, P. Le Guernic, and R. de Simone. The synchronous languages twelve years later. Proceedings of the IEEE, 91: 64–83, 2003.
[14] R. Bannatyne. Time triggered protocol-fault tolerant serial communications for real-time embedded systems. In Wescon/98. Conference Proceedings, 1998.
[15] R. Schwarz and P. Rieth. Global chassis control — integration of chassis systems. Automatisierungstechnik, 51: 300–312, 2003.
[16] H. Kopetz and D. Millinger. The transparent implementation of fault tolerance in the time-triggered architecture. In Dependable Computing for Critical Applications. San Jose, CA, 1999.
[17] C. Pinello, L.P. Carloni, and A.L. Sangiovanni-Vincentelli. Fault-tolerant deployment of embedded software for cost-sensitive real-time feedback-control applications. In Proceedings of the European Design and Test Conference. ACM Press, 2004.
[18] C. Ferdinand, R. Heckmann, M. Langenbach, F. Martin, M. Schmidt, H. Theiling, S. Thesing, and R. Wilhelm. Reliable and precise WCET determination for a real-life processor. Lecture Notes in Computer Science, 2211: 469–485, 2001.
[19] H. Kopetz and G. Grundsteidl. TTP — a protocol for fault-tolerant real-time systems. IEEE Computer, 27: 14–23, 1994.
[20] M. Baleani, A. Ferrari, L. Mangeruca, A. Sangiovanni-Vincentelli, M. Peri, and S. Pezzini. Fault-tolerant platforms for automotive safety-critical applications. In Proceedings of the International Conference on Compilers, Architectures and Synthesis for Embedded Systems. ACM Press, 2003, pp. 170–177.
[21] F.V. Brasileiro, P.D. Ezhilchelvan, S.K. Shrivastava, N.A. Speirs, and S. Tao. Implementing fail-silent nodes for distributed systems. IEEE Transactions on Computers, 45: 1226–1238, 1996.
[22] L. Lamport, R. Shostak, and M. Pease. The Byzantine generals problem. ACM Transactions on Programming Languages and Systems, 4: 382–401, 1982.
[23] H.S. Siu, Y.H. Chin, and W.P. Yang. Reaching strong consensus in the presence of mixed failure types. Transactions on Parallel and Distributed Systems, 9, 1998.
[24] C. Dima, A. Girault, C. Lavarenne, and Y. Sorel. Off-line real-time fault-tolerant scheduling. In Proceedings of Euromicro 2001, Mantova, Italy, February 2001.
[25] T.A. Henzinger, B. Horowitz, and C.M. Kirsch. Embedded control systems development with Giotto. In Proceedings of Languages, Compilers, and Tools for Embedded Systems. ACM Press, 2001, pp. 64–72.
[26] A.J. Wellings, L. Beus-Dukic, and D. Powell. Real-time scheduling in a generic fault-tolerant architecture. In Proceedings of RTSS'98. Madrid, Spain, December 1998.
[27] M. Barborak, M. Malek, and A. Dahbura. The consensus problem in fault-tolerant computing. ACM Computing Surveys, 25: 171–220, 1993.
[28] L. Lamport and P. Melliar-Smith. Byzantine clock synchronization. In Proceedings of the Third ACM Symposium on Principles of Distributed Computing. ACM Press, New York, 1984, pp. 68–74.
[29] F. De Bernardinis, M.I. Jordan, and A.L. Sangiovanni-Vincentelli. Support vector machines for analog circuit performance representation. In Proceedings of the Design Automation Conference, June 2003.
[30] J. Platt. Sequential minimal optimization: a fast algorithm for training support vector machines. Microsoft Research, MSR-TR-98-14, 1998.
[31] P. Bunus and P. Fritzson. A debugging scheme for declarative equation based modeling languages. In Practical Aspects of Declarative Languages: 4th International Symposium, 280, 2002.
[32] F. De Bernardinis, S. Gambini, F. Vinci, F. Svelto, R. Castello, and A. Sangiovanni-Vincentelli. Design space exploration for a UMTS front-end exploiting analog platforms. In Proceedings of the International Conference on Computer-Aided Design, 2004.
[33] I. Macbeth. Programmable analog systems: the missing link. In EDA Vision (www.edavision.com), July 2001. [34] F. Balarin, Y. Watanabe, H. Hsieh, L. Lavagno, C. Passerone, and A. Sangiovanni-Vincentelli. Metropolis: an integrated electronic system design environment. IEEE Computer, 36: 45–52, 2003.
23
Interface Specification and Converter Synthesis

Roberto Passerone
Cadence Design Systems, Inc.

23.1 Introduction
23.2 Related Work
23.3 Automata-Based Converter Synthesis
    Interface Specification • Requirements Specification • Synthesis
23.4 Algebraic Formulation
    Trace-Based Solution • End-to-End Specification
23.5 Conclusions
Acknowledgments
References
23.1 Introduction

Reuse is an established technique in modern design methodologies to reduce the complexity of designing a system. Design reuse complements a design methodology by providing precharacterized components that can be put together to perform the desired function. Together with abstraction and refinement techniques, design reuse is at the basis of methodologies such as platform-based design [1–3]. A platform consists of a set of library elements, or resources, that can be assembled and interconnected according to predetermined rules to form a platform instance. One step in a platform-based design flow involves mapping a function or a specification onto different platform instances, and evaluating its performance. By employing existing components and interconnection structures, reuse in a platform-based design flow shifts the functional verification problem from the verification of the individual elements to the verification of their interaction [4,5]. This technique can be used at all levels of abstraction in a design in order to arrive at a complete implementation. A design process can therefore be simplified by using a methodology that promotes the reuse of existing components, also known as intellectual property, or IPs.¹ However, despite the advantages of precharacterization, the correct deployment of these blocks when the IPs have been developed by different groups inside the same company, or by different companies, is notoriously difficult.

¹The term "intellectual property" is used to highlight the intangible nature of virtual components, which essentially consist of a set of property rights that are licensed, rather than of a physical entity that is sold.
Unforeseen interactions may often make the behavior of the resulting design unpredictable. Design rules have been proposed that try to alleviate the problem by forcing designers to be precise about the behavior of the individual components and to verify this behavior under a number of assumptions about the environment in which they have to operate. While this is certainly a step in the right direction, it is by no means sufficient to guarantee correctness: extensive simulation and prototyping are still needed on the compositions. Several methods have been proposed for hardware and software components that encapsulate the IPs so that their behavior is protected from the interaction with other components. Interfaces are then used to ensure the compatibility between components. Roughly speaking, two interfaces are compatible if they "fit together" as they are. Simple interfaces, typically specified in the type system of a system description language, may describe the types of values that are exchanged between the components. This is the case, for example, of high-level programming languages and hardware description languages. More expressive interfaces, typically specified informally in design documents, may describe the protocol for the component interaction [6–11]. Several formal methodologies have been proposed for specifying the protocol aspects of interfaces in a way that supports automatic compatibility checks [7, 8, 12]. The key elements of these approaches are the interpretation of an interface in the context of its environment, a model-independent formalism, and the use of automata and game-theoretic algorithms for compatibility checking. With these approaches, given interfaces for different IPs, one can check whether these IPs can be composed. When components are taken from legacy systems or from third-party vendors, interface protocols are unlikely to be compatible. However, this does not necessarily mean that the components cannot be combined: approaches have been proposed that adapt the components by constructing a converter among the incompatible communication protocols [10, 13]. We refer to these techniques collectively as interface synthesis or converter synthesis. Thus, informally, two interfaces are adaptable if they "fit together" by communicating through a third component, the adapter. If interfaces specify only value types, then adapters are simply type converters. However, if interfaces specify interaction protocols, then adapters are protocol converters. For instance, a protocol may be defined as a formal language (a set of strings from an alphabet) and can be finitely represented using automata [10]. The problem of converting one protocol into another can then be addressed by considering their conjunction in terms of the product of the corresponding automata and by removing the states and transitions that lead to a violation of one of the two protocols. The converter uses state information to rearrange the communication between the original interfaces, in order to ensure compatibility. A specification in the form of a third component can be used to define which rearrangements are appropriate in a given communication context. For instance, it is possible to specify that the converter can change the timing of messages, but not their order, using an n-bounded buffer, or that some messages may or may not be duplicated. In this work we initially review this methodology, and then introduce a mathematically sound interpretation and generalization that can be applied in several different contexts.
This chapter is organized as follows. First we review some related work in Section 23.2. Then, in Section 23.3, we illustrate with an example the automata-based approach to the synthesis of protocol converters. We then introduce more general frameworks in Section 23.4 and discuss the solution of the protocol conversion problem in Section 23.4.1.
23.2 Related Work

One of the first approaches to interface synthesis was proposed by Borriello [14, 15], who introduces the "event graph" to establish correct synchronization of the operations and to determine the data sequencing. The event graph is constructed at a very low level of abstraction (waveforms), and can be derived from the timing diagrams of a protocol. In this approach, the two protocols must be made compatible by manually assigning labels to the data on both sides in order to establish the correct correspondence. Because the specification is expressed in terms of the real timing of the signals, this approach can handle both synchronous and asynchronous protocols. Sun and Brodersen [16] extend the approach by providing
a library of components that frees the user from considering lower-level details, without, however, lifting the requirement of manually identifying the data correspondence. Another approach is that of Akella and McMillan [17]: the protocols are described as two finite state machines, while a third finite state machine represents the valid transfer of data. The correspondence between the protocols is therefore embedded in this last specification. The synthesis procedure consists of taking the product machine of the two protocols, which is then pruned of the invalid/useless states, according to the specification. In the form proposed by the authors, the procedure, however, does not account for data explicitly, so that the converter is unable to handle data width mismatches in the protocols. A different approach is that taken by Narayan and Gajski [18]: first, the protocol specification is reduced to the combination of five basic operations (data read/write, control read/write, time delay); the protocol description is then broken into blocks (called relations) whose execution is guarded by a condition on one of the control wires or by a time delay; finally the relations of the two protocols are matched into sets that transfer the same amount of data. Because the data is broken up into sets, this algorithm is able to account for data width mismatch between the communicating parties. However, the procedural specification of the protocols makes it difficult to adapt different sequencing (order) of the data, so that only the synchronization problem is solved. Some of the limitations above are addressed by the procedure proposed by Passerone et al. [10]. The specification is simplified by describing the protocols as regular expressions, which more closely match the structure of a protocol, rather than as finite state machines (of course, the two formalisms carry the same expressive power). In addition, typing information is used to automatically deduce the correspondence of data between the communicating parties, so that a third specification for the valid transfers is not necessary. The synthesis procedure then follows the approach proposed by Akella by first translating the regular expressions into automata, then constructing a product machine, and finally pruning it of the illegal states. This approach was then extended to also include a specification of the valid transactions, and was cast in the framework of game theory to account for more complex properties, such as liveness [13]. Recently, Siegmund and Müller [19] have proposed a similar approach where the regular expressions are embedded in the description language, in this case SystemC, through the use of appropriate supporting classes. The advantage is that the interface description can be simulated directly with the existing application. However, in this approach the user is required to describe the converter itself, instead of having it be generated automatically from a description of the communicating protocols. In other words, issues of synchronization and data sequencing must be solved upfront. Register transfer level code for the interface can then be generated automatically from the SystemC specification. More recent work has been focused on studying the above issues in a more general setting, generalizing the approach to modeling interfaces and to synthesis by abstracting away from the particular model of computation. 
De Alfaro and Henzinger propose to use block algebras to describe the relation between components and interfaces [8]. Block algebras are mathematical structures that are used to model a system as a hierarchical interconnection of blocks. Blocks are further classified as components and interfaces. Informally, components are descriptions of blocks that say what the block does. Conversely, interfaces are descriptions of blocks that state the expectations that the block has with respect to its environment. This distinction is based upon the observation that physical components do something in any possible environment, whether they behave well or misbehave. In contrast, interfaces describe for each block the environments that can correctly work with the block. Several different kinds of block algebras have been developed for synchronous models, real-time models, and resource models, each carrying a particular notion of compatibility [7, 20–22]. The authors, however, limit their study to questions of compatibility, and do not address the problem of synthesizing adapters. The solution to the problem of protocol synthesis in an abstract setting will be discussed in more detail in Section 23.4, along with the presentation of the relevant related work. Informally, the problem is formulated as an equation of the form P1 ∥ C ∥ P2 ≤ G, where P1 and P2 are the incompatible protocols, C the protocol converter, and G a global specification that defines the terms of the transactions. The operator ∥ represents the operation of composition, while the relation ≤ expresses the notion of
conformance to the specification. This problem was first addressed by Larsen and Xinxin in the framework of process algebra [23]. The solution is derived constructively by building a special form of transition system. More recently, Yevtushenko et al. [24] present a formulation of the problem in terms of languages (sets of sequences of actions) under various kinds of composition operators. By working directly with languages, the solution can then be specialized to different specific representations, including automata and finite state machines. Finally, Passerone generalizes the solution by representing the models as abstract algebras, and derives the conditions that guarantee the existence of a solution [12].
23.3 Automata-Based Converter Synthesis

We introduce the problem of interface specification and protocol conversion by way of an example. We first set up the conversion problem for send–receive protocols, where the sender and the receiver are specified as automata. A third automaton, the requirement, is also introduced to specify constraints on the converter, such as buffer size and the possibility of message loss. We then solve the protocol conversion by "manually" (of course, the procedure is easy to automate!) deriving an adapter that conforms to both the protocols and the requirements. Section 23.4.1 will discuss an algebraic solution to the same problem.
23.3.1 Interface Specification

A producer and a consumer component wish to communicate some complex data across a communication channel. They both partition the data into two parts. The interface of the producer is defined so that it can wait an unbounded amount of time between the two parts. Because the sender has only outputs, this is equivalent to saying that the interface does not guarantee to its environment that the second part will follow the first within a fixed finite time. On the other hand, the interface of the consumer is defined so that it requires that once the first part has been received, the second is also received during the state transition that immediately follows. Because the receiver has only inputs, this specification corresponds to an assumption that the receiver makes on the set of possible environments that it can work with. Clearly, the two protocols are incompatible. In fact, the sender may elect to send the first part of the data and then wait for some time before sending the second part. Upon receiving the first part, the receiver will, however, assume that the second part will be delivered right away. Since this is not the case, a protocol violation will occur. In other words, the guarantees of the sender are not sufficient to prove that the assumptions of the receiver are always satisfied. Thus a direct composition would result in a possible violation of the protocols. Because no external environment can prevent this violation (the system has no inputs after the composition), an intermediate converter must be inserted to make the communication possible. Below, we illustrate how to synthesize a converter that enables sender and receiver to communicate correctly.

The two protocols can be represented by the automata shown in Figure 23.1. There, the symbols a and b (and their primed counterparts) are used to denote the first and the second part of the data, respectively. The symbol ∅ denotes instead the absence or irrelevance of the data. In other words, it acts as a don't care. Figure 23.1(a) shows the producer protocol. The self-loop in state 1 indicates that the transmission of a can be followed by any number of cycles before b is also transmitted. We call this protocol handshake because it could negotiate when to send the second part of the data. After b is transmitted, the protocol returns to its initial state, and is ready for a new transaction. Figure 23.1(b) shows the receiver protocol. Here state 1 does not have a self-loop. Hence, once a′ has been received, the protocol assumes that b′ is transmitted in the cycle that immediately follows. This protocol is called serial because it requires a′ and b′ to be transferred back-to-back. Similarly to the sender protocol, once b′ is received the automaton returns to its initial state, ready for a new transaction. We have used nonprimed and primed versions of the symbols in the alphabets of the automata to emphasize that the two sets of signals are different and should be connected through a converter. It is the specification (below) that defines the exact relationships that must hold between the elements of the two alphabets. Note that in the definition of the two protocols nothing relates the quantities of one
FIGURE 23.1 (a) Handshake and (b) serial protocols. (From Roberto Passerone, Luca de Alfaro, Thomas A. Henzinger, and Alberto L. Sangiovanni-Vincentelli. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design (ICCAD’02), November 2002. With permission. Copyright 2002 IEEE.)
(a and b) to those of the other (a′ and b′). The symbol a could represent the toggling of a signal, or could symbolically represent the value of, for instance, an 8-bit variable. It is only in the interpretation of the designer that a and a′ actually hold the same value. The specification that we are about to describe does not enforce this interpretation, but merely defines the (partial) order in which the symbols can be presented to and produced by the converter. It is possible to explicitly represent the values passed; this is necessary when the behavior of the protocols depends on the data, or when the data values provided by one protocol must be modified (translated) before being forwarded to the other protocol. The synthesis of a protocol converter would then yield a converter capable of both translating data values and of modifying their timing and order. However, the price to pay for the ability to synthesize data translators is the state explosion in the automata that describe the interfaces and the specification. Observe also that if a and b are symbolic representations of data, some other means must be available in the implementation to distinguish when the actual data corresponds to a or to b. At this level of the description we do not need to be specific: we simply assume that the sender has a way to distinguish whether the symbol a or the symbol b is being produced, and the receiver has a way to distinguish whether a′ or b′ is being provided. Examples of methods include toggling bits, or using data fields to specify message types. However, we do not want to be tied to any particular method at this time.
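To make the discussion concrete, the sketch below encodes the two protocol automata of Figure 23.1 as transition tables. The dictionary representation, the "-" stand-in for the absence-of-data symbol ∅, and the accepts helper are illustrative choices of ours, not part of the original formulation.

```python
# The two protocols of Figure 23.1 as transition tables keyed by
# (state, symbol); "-" stands for the absence-of-data symbol.
handshake = {               # sender: may wait arbitrarily between a and b
    ("0", "-"): "0",
    ("0", "a"): "1",
    ("1", "-"): "1",        # unbounded wait between the two parts
    ("1", "b"): "0",        # transaction complete, ready for a new one
}

serial = {                  # receiver: b' must immediately follow a'
    ("0", "-"): "0",
    ("0", "a'"): "1",
    ("1", "b'"): "0",       # no self-loop in state 1: back-to-back transfer
}

def accepts(automaton, word, state="0"):
    """Check whether a finite symbol sequence is a run of the automaton."""
    for symbol in word:
        if (state, symbol) not in automaton:
            return False    # the protocol is violated
        state = automaton[(state, symbol)]
    return True

print(accepts(handshake, ["a", "-", "-", "b"]))  # True: the sender may wait
print(accepts(serial, ["a'", "-", "b'"]))        # False: b' was not immediate
```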
23.3.2 Requirements Specification

What constitutes a correct transaction? Or, in other words, what properties do we want the communication to have? In the context of this particular example the answer seems straightforward. Nonetheless, different criteria could be enforced depending on the application. Each criterion is embodied by a different specification. One example of a specification is shown in Figure 23.2. The alphabet of the automaton is derived from the Cartesian product of the alphabets of the two protocols for which we want to build a converter. This specification states that no symbols should be discarded or duplicated by the converter, and that symbols must be delivered in the same order in which they were received; moreover, the converter can store at most one undelivered symbol at any time. The three states in the specification correspond to three distinct cases:

• State 0 denotes the case in which all received symbols have been delivered (or no symbol has been received yet).
FIGURE 23.2 Specification automaton. (From Roberto Passerone, Luca de Alfaro, Thomas A. Henzinger, and Alberto L. Sangiovanni-Vincentelli. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design (ICCAD’02), November 2002. With permission. Copyright 2002 IEEE.)
• State a denotes the case in which symbol a has been received, but it has not been output yet.
• Similarly, state b denotes the case in which symbol b has been received, but not yet output.

Note that this specification is not concerned with the particular form of the protocols being considered (or else it would itself function as the converter); for example, it does not require that the symbols a or b be received in any particular order (other than the one in which they are sent). On the other hand, the specification makes precise what the converter can, and cannot, do, ruling out, for instance, converters that simply discard all input symbols from one protocol, never producing any output for the destination protocol. In fact, the specification admits the case in which a and b are transferred in the reversed order. It also does not enforce that a and b always occur in pairs, and admits a sequence of a's without intervening b's (or vice versa). The specification merely asserts that a′ should occur no earlier than a (an ordering relation), and that a′ must occur whenever a new a or b occurs. In fact, we can view the specification as an observer that specifies what can happen (a transition on some symbol is available) and what should not happen (a transition on some symbol is not available). As such, it is possible to decompose the specification into several automata, each one of which specifies a particular property that the synthesized converter should exhibit. This is similar to the monitor-based property specification proposed by Shimizu et al. [11] for the verification of communication protocols. In our work, however, we use the monitors to drive the synthesis so that the converter is guaranteed to exhibit the desired properties (correct-by-construction). A high-level view of the relationship between the protocols and the specification is presented in Figure 23.3. The protocol handshake produces outputs a and b, the protocol serial accepts inputs a′ and b′. The specification accepts inputs a, b, a′, b′, and acts as a global observer that states what properties the converter should have. Once we compose the two protocols and the specification, we obtain a system with outputs a, b, and inputs a′, b′ (Figure 23.3). The converter will have inputs and outputs exchanged: a and b are the converter inputs, and a′, b′ its outputs.
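The specification automaton can be rendered in the same dictionary style. The transition set below is reconstructed from Figure 23.2 and the description above, and the (input, output) tuple labels are an illustrative convention of ours.

```python
# One possible encoding of the specification of Figure 23.2: a one-place
# buffer that never drops, duplicates, or reorders symbols. Each move is
# labeled by a pair (input from the sender, output to the receiver).
spec = {
    # state 0: nothing pending
    ("0", ("-", "-")): "0",
    ("0", ("a", "a'")): "0",   # forward a in the same cycle
    ("0", ("b", "b'")): "0",
    ("0", ("a", "-")): "a",    # store a, to be delivered later
    ("0", ("b", "-")): "b",
    # state a: an undelivered a is buffered
    ("a", ("-", "-")): "a",
    ("a", ("-", "a'")): "0",   # deliver the pending a'
    ("a", ("b", "a'")): "b",   # deliver a' while buffering the new b
    # state b: an undelivered b is buffered
    ("b", ("-", "-")): "b",
    ("b", ("-", "b'")): "0",
    ("b", ("a", "b'")): "a",
}
# Any move absent from this table (e.g., receiving a second symbol while
# one is still buffered and none is delivered) violates the specification.
```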
23.3.3 Synthesis

The synthesis of the converter begins with the composition (product machine) of the two protocols, shown in Figure 23.4. Here the direction of the signals is reversed: the inputs to the protocols become the outputs
FIGURE 23.3 Inputs and outputs of protocols, specification, and converter. (From Roberto Passerone, Luca de Alfaro, Thomas A. Henzinger, and Alberto L. Sangiovanni-Vincentelli. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design (ICCAD'02), November 2002. With permission. Copyright 2002 IEEE.)
FIGURE 23.4 Composition between handshake and serial. (From Roberto Passerone, Luca de Alfaro, Thomas A. Henzinger, and Alberto L. Sangiovanni-Vincentelli. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design (ICCAD’02), November 2002. With permission. Copyright 2002 IEEE.)
of the converter, and vice versa. This composition is also a specification for the converter, since on both sides the converter must comply with the protocols that are being interfaced. However, this specification does not have the notion of synchronization (partial order, or causality constraint) that the specification discussed above dictates. We can ensure that the converter satisfies both specifications by taking the converter to be the composition of the product machine with the specification, and by removing the transitions that violate either protocol or the correctness specification. Figure 23.5 through Figure 23.7 explicitly show the steps that we go through to compute this product. The position of each state reflects the position of the corresponding
FIGURE 23.5 Converter computation, phase 1. (From Roberto Passerone, Luca de Alfaro, Thomas A. Henzinger, and Alberto L. Sangiovanni-Vincentelli. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design (ICCAD’02), November 2002. With permission. Copyright 2002 IEEE.)
state in the protocol composition, while the label inside the state represents the corresponding state in the specification. Observe that the bottom-right state is reached when the specification goes back to state 0. This procedure corresponds to the synthesis algorithm proposed in Reference 10. The approach here is, however, fundamentally different: the illegal states are defined by the specification, and not by the particular algorithm employed.

The initial step is shown in Figure 23.5. The composition with the specification makes the transitions depicted in dotted lines illegal (if taken, the specification would be violated). However, transitions can be removed from the composition only if doing so does not result in an assumption on the behavior of the sender. In Figure 23.5, the transition labeled ∅/a′ leaving state 0 can be removed because the machine can still respond to an ∅ input by taking the self-loop, which is legal. The same applies to the transition labeled b/∅ leaving state a, which is replaced by the transition labeled b/a′. However, removing the transition labeled ∅/b′ leaving the bottom-right state would make the machine unreceptive to the input ∅. Equivalently, the converter would be imposing on the producer the assumption that ∅ does not occur in that state. Because this assumption is not verified, and because we cannot change the producer, we can only avoid the problem by making the bottom-right state unreachable, and removing it from the composition. The result is shown in Figure 23.6. The transitions that are left dangling because of the removal of the state should also be removed, and are now shown in dotted lines. The same reasoning as before applies, and we can only remove transitions that can be replaced by others with the same input symbol. In this case, all illegal transitions can be safely removed. The resulting machine, shown in Figure 23.7, now has no illegal transitions. This machine complies both with the specification and with the two protocols, and thus represents the correct conversion (correct relative to the specification). Notice how the machine at first stores the symbol a without sending it (transition a/∅). Then, when b is received, the machine sends a′, immediately followed in the next cycle by b′, as required by the serial protocol.
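The removal steps just described amount to a small fixed-point computation. The sketch below is our own illustrative reconstruction of the pruning loop, not the authors' published algorithm: an illegal move is dropped only when the state remains receptive to the same input; otherwise the state itself is removed and the moves that enter it become illegal in turn.

```python
# Prune a product machine {state: {(input, output): next_state}} against
# a set `illegal` of (state, input, output) moves that violate a protocol
# or the specification. Illustrative reconstruction of the procedure.
def prune(machine, illegal, initial):
    machine = {s: dict(moves) for s, moves in machine.items()}
    illegal = set(illegal)
    changed = True
    while changed:
        changed = False
        for state in list(machine):
            if state not in machine:
                continue
            for (inp, out) in list(machine[state]):
                if (state, inp, out) not in illegal:
                    continue
                changed = True
                receptive = any(
                    i == inp and (state, i, o) not in illegal
                    for (i, o) in machine[state] if (i, o) != (inp, out)
                )
                if receptive:
                    # A legal move on the same input remains: the converter
                    # stays receptive, so the illegal move is simply dropped.
                    del machine[state][(inp, out)]
                else:
                    # Dropping the move would assume the producer never
                    # issues `inp` here; remove the state instead and mark
                    # every move entering it as illegal.
                    del machine[state]
                    for s2, moves in machine.items():
                        for (i2, o2), nxt in moves.items():
                            if nxt == state:
                                illegal.add((s2, i2, o2))
                    break
    return machine if initial in machine else None
```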
FIGURE 23.6 Converter computation, phase 2. (From Roberto Passerone, Luca de Alfaro, Thomas A. Henzinger, and Alberto L. Sangiovanni-Vincentelli. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design (ICCAD’02), November 2002. With permission. Copyright 2002 IEEE.)
FIGURE 23.7 Converter computation, phase 3. (From Roberto Passerone, Luca de Alfaro, Thomas A. Henzinger, and Alberto L. Sangiovanni-Vincentelli. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design (ICCAD’02), November 2002. With permission. Copyright 2002 IEEE.)
23.4 Algebraic Formulation

The problem of converter synthesis can be seen as a special case of the more general problem of the synthesis of a local specification, shown in Figure 23.8 (also known as the "unknown component" problem). Here, we are given a global specification G and a partial implementation, called a context, which consists of the composition of several modules, such as P1 and P2. The implementation is only partially specified, and is completed by inserting an additional module X to be composed with the rest of the context. The problem consists of finding a local specification L for X, such that if X implements L, then the full implementation I implements the global specification G. If we denote with ≤ the implementation relation and with ∥ the composition operator, then the local specification synthesis problem can be expressed as solving the following inequality for the variable X:

P1 ∥ X ∥ P2 ≤ G

The problem of local specification synthesis is very general and can be applied to a variety of situations. One area of application is, for example, that of supervisory control synthesis [25]. Here a plant is used as the context, and a control relation as the global specification. The problem consists of deriving the appropriate control law to be applied in order for the plant to follow the specification. Engineering changes are another area, where modifications must be applied to part of a system in order for the entire system to satisfy a new specification. This procedure is also known as rectification. Note that the same rectification procedure could be used to optimize a design. Here, however, the global specification is unchanged, while the local specification represents all the possible admissible implementations of an individual component of the system, thus exposing its full flexibility [26]. In the case of converter synthesis, the context consists of the protocols that must be connected, while the specification may simply insist that data be passed from one side to the other within a set of requirements. In this case the local specification describes the additional element in the implementation required to make the communication possible, that is, the converter.

The literature on techniques to solve the local specification synthesis problem is vast. Here we focus on three of the proposed techniques and highlight in particular their differences in scope and aim. Larsen and Xinxin [23] solve the problem of synthesizing the local specification for a system of equations in a process algebra. In order to represent the flexibility in the implementation, the authors introduce the Disjunctive Modal Transition System (DMTS). Unlike traditional labeled transition systems, the DMTS model includes two kinds of transitions: transitions that may exist and transitions that must exist. The transitions that must exist are grouped into sets, and at least one transition from each set must be present in the implementation. In other words, the DMTS is a transition system that admits several possible implementations in terms of traditional transition systems.
FIGURE 23.8 Local specification synthesis.
The system is solved constructively. Given a context and a specification, the authors construct a DMTS whose implementations include all and only the solutions to the equation. To do so, the context is first translated from its original equational form into an operational form where a transition includes both the consumption of an event from the unknown component and the production of an event. The transitions of the context and of the specification are then considered in pairs to deduce whether the implementation may or may not take certain actions. A transition is possible, but not required, in the solution whenever the context does not activate such a transition. In that case, the behavior of the solution may be arbitrary afterwards. A transition is required whenever the context activates the transition, and the transition is used to match a corresponding transition in the specification. A transition is not allowed in the solution (thus it is neither possible nor required) whenever the context activates it and the transition is contrary to the specification.

The solution proposed by Larsen and Xinxin has the advantage that it provides a direct way of computing the set of possible implementations. On the other hand, it is specific to one model of computation (transition systems). Yevtushenko et al. [24] present a more general solution where the local specification is obtained by solving abstract equations over languages under various kinds of composition operators. By working directly with languages, the solution can then be specialized to different kinds of representations, including automata and finite state machines. In the formalism introduced by Yevtushenko et al., a language is a set of finite strings over a fixed alphabet. The particular notion of refinement (or implementation) proposed in this work corresponds to language containment: a language P refines a language Q if and only if P ⊆ Q. If we denote with P̄ the operation of complementation of the language P (i.e., P̄ is the language that includes all the finite strings over the alphabet that are not in P), then the most general solution to the equation in the variable X

A · X ⊆ C

is the language S whose complement is

S̄ = A · C̄

that is, S is the complement of the composition of A with the complement of C. The language S is called the most general solution because a language P is a solution of the equation if and only if P ⊆ S. In the formulas above, the operator "·" can be replaced by different flavors of parallel composition, including synchronous and asynchronous composition. These operators are both constructed as the series of an expansion of the alphabet of the languages, followed by a restriction. For the synchronous composition, the expansion and the restriction do not alter the length of the strings of the languages to which they are applied. Conversely, expansion in the asynchronous composition inserts arbitrary substrings of additional symbols, thus increasing the length of the sequences, while the restriction discards the unwanted symbols, shrinking the strings. The language equations are then specialized to various classes of automata, including finite automata and finite state machines. This provides an algorithmic way of solving the equation for restricted classes of languages (i.e., those that can be represented by the automaton). The problem in this case consists of proving certain closure properties that ensure that the solution can be expressed in the same finite representation as the elements of the equation.
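A tiny finite experiment makes the closed-form solution concrete. The sketch below deliberately simplifies the setting, taking synchronous composition over a common alphabet to be plain intersection (an assumption of ours, not the general operator of [24]), and checks by brute force that the complement of A · C̄ is the largest solution.

```python
# Finite check of S = complement(A . complement(C)) as the most general
# solution of A . X <= C, with "." simplified to set intersection.
from itertools import product

ALPHABET = "ab"
UNIVERSE = {"".join(w) for n in range(3) for w in product(ALPHABET, repeat=n)}

def compose(x, y):        # synchronous composition over one alphabet
    return x & y

def complement(x):
    return UNIVERSE - x

A = {"a", "ab", "bb"}
C = {"a", "ab"}

S = complement(compose(A, complement(C)))

assert compose(A, S) <= C            # S is a solution ...
for bits in product([0, 1], repeat=len(UNIVERSE)):
    X = {w for w, b in zip(sorted(UNIVERSE), bits) if b}
    if compose(A, X) <= C:
        assert X <= S                # ... and every solution refines S
print(sorted(S))
```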
In particular, the authors consider the problem of receptiveness (there called I-progression) and prefix closure. A similar solution is proposed in the framework of Agent Algebra by Passerone et al. [12, 27]. The approach is, however, more general, and does not make any particular assumption about the form that the protocols or the specification can take. In other words, the solution is not limited to protocols represented as languages over an alphabet, or as transition systems. This is similar to the block algebras proposed by de Alfaro and Henzinger (see Section 23.2). There is, however, a fundamental difference in the way interfaces and components interact. In de Alfaro and Henzinger, the distinction between interfaces and components seems to ultimately arise from the fact that components, by making no assumptions, are unable to constrain their environment. For this reason, components are often called input-enabled, or receptive. Interfaces, on the other hand, constrain the environment by failing to respond to some
of their possible inputs. Receptiveness and environment constraints are not, however, mutually exclusive. The two notions coexist, and are particularly well behaved, in the so-called trace-based models such as Dill's trace structures [9] and Negulescu's Process Spaces [28, 29]. We refer to these models as two-set trace models. In two-set trace models, traces, which are individual executions of a component, are classified as either successes or failures. In order for a system to be failure-free, the environment of each component must not exercise the failure traces. Failure traces therefore represent the assumptions that a component makes relative to its environment. However, the combination of failure and success traces makes the component receptive. Agent Algebras generalize these concepts by shifting the notion of compatibility from the individual executions to the components themselves. The interface models proposed by de Alfaro and Henzinger can easily be seen in these terms. For example, interface automata [7] can be explained almost exactly in terms of the prefix-closed trace structures of Dill [9]. In particular, the composition operator in interface automata is an implementation of Dill's autofailure manifestation and failure exclusion. Therefore, Agent Algebras do not distinguish between the notion of an interface and that of a component. Or, to be more precise, the distinction between a component and its interface has only to do with a difference in the level of abstraction, rather than with a difference in their nature. In Agent Algebra, the problem of local specification synthesis, and therefore of protocol conversion, is set up as usual as the equation

proj(A)(P1 ∥ P2 ∥ X) ≤ G

Note that here the operation of restriction on the alphabet is not part of the composition and is made explicit by the operator proj(A), whose effect is to retain only the elements of the alphabet that are contained in the set A. The solution to the equation is expressed in the form

C ≤ mirror(proj(A)(P1 ∥ P2 ∥ mirror(G)))

where mirror is a generalized complementation operation whose form depends on the particular model of computation and on its notion of compatibility. The details of the derivation of this solution are outside the scope of this chapter [12]. Instead, we only concentrate on protocols represented as two-set trace structures.
23.4.1 Trace-Based Solution

Two-set trace structures are particularly well suited to modeling behavioral interfaces and protocols. The set of failure traces, in fact, states the conditions of correct operation of a component. They can therefore be interpreted as assumptions that components make relative to their environment. Two components are compatible whenever they respect those assumptions, that is, they do not engage in behaviors that make the other component fail. Interface protocols can often be described in this way. The transactions that do not comply with the protocol specification are considered illegal, and therefore result in an incorrect operation of the component that implements the protocol.

The solution to the protocol conversion problem described in Section 23.3 requires that we develop a trace-based model of a synchronous system. The model that we have in mind is essentially identical to the synchronous models proposed by Burch [30] and Wolf [31]. For our simple case, an individual execution of a component (a trace) is a sequence of actions from the alphabet A = {⊥, a, b, a′, b′}, where ⊥ denotes the absence of an action. Each component T consists of two sets of traces S and F, corresponding to the successes and the failures, respectively. A projection, or hiding of signals, in a trace can be obtained by replacing everywhere in the trace the actions to be hidden by the special value ⊥, denoting the absence of any action. In this way, while we abstract away the information about the signal, we do retain the cycle count, ensuring that the model is synchronous. For instance,

    proj({a})(⟨a, b, a, ⊥, b, a, b, b, a, …⟩) = ⟨a, ⊥, a, ⊥, ⊥, a, ⊥, ⊥, a, …⟩
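A minimal sketch of this synchronous projection (ours; for convenience the absence symbol ⊥ is written as the string "_"):

    BOTTOM = "_"   # stands for the absence-of-action symbol in the alphabet

    def project(trace, keep):
        # Hide every action not in `keep` by replacing it with the absence
        # symbol; the length (cycle count) of the trace is preserved, which
        # is what keeps the model synchronous.
        return tuple(a if a in keep else BOTTOM for a in trace)

    assert project(("a", "b", "a", "_", "b", "a"), {"a"}) == \
           ("a", "_", "a", "_", "_", "a")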
where the argument of the projection lists the signals that must be retained. The operation of projection is applied to all success and all failure traces of a component.

Parallel composition is more complex. A trace is a possible execution of a component if it is either a success or a failure; it is not a possible execution if it is neither. If T1 and T2 are two components, then their parallel composition should contain all and only those traces that are possible executions of both T1 and T2. One such trace will be a success of the composition if it is a success of both T1 and T2. However, a trace is a failure of the composite if it is a possible trace of one component and it is also a failure of the other component. Note that if a trace is a failure of one component, but it is not a possible trace of the other component (i.e., it is neither a success nor a failure of the other component), then the trace does not appear as a failure of the composite (in fact, it is not a trace of the composition at all). This is because, in the interaction, the particular behavior that results in that failure will never be exercised, as it is ruled out by the other component. Formally, if T1 = (S1, F1) and T2 = (S2, F2), then the parallel composition T = T1 ∥ T2 is given by

    T = (S1 ∩ S2, (F1 ∩ (S2 ∪ F2)) ∪ (F2 ∩ (S1 ∪ F1)))

If the two components do not share the same alphabet, parallel composition must also include an intermediate step of projection or inverse projection to equalize the signals. Because the length of a sequence is retained during a projection, parallel composition results in a lock-step execution of the components.

Because components consist of two sets of executions, the relation of implementation cannot be reduced to a simple set containment. Instead, a component T′ = (S′, F′) implements another component T = (S, F) if all the possible behaviors of T′ are also possible behaviors of T, and if T′ fails less often than T. This ensures that substituting T′ for T does not produce any additional failure in the system. Formally, T′ ⊑ T whenever

    S′ ∪ F′ ⊆ S ∪ F  and  F′ ⊆ F
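The following Python sketch (ours, continuing the toy model above) encodes a two-set trace structure as a pair of trace sets and implements the composition and refinement check directly from these formulas:

    def compose(T1, T2):
        # Parallel composition of two-set trace structures over a shared
        # alphabet: a trace succeeds if it succeeds on both sides; a failure
        # of one side counts only if the other side can exhibit the trace.
        S1, F1 = T1
        S2, F2 = T2
        S = S1 & S2
        F = (F1 & (S2 | F2)) | (F2 & (S1 | F1))
        return (S, F)

    def refines(T_impl, T_spec):
        # T_impl implements T_spec: no new possible behaviors, no new failures.
        Si, Fi = T_impl
        Ss, Fs = T_spec
        return (Si | Fi) <= (Ss | Fs) and Fi <= Fs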
The operation of complementation, or mirroring, must also take successes and failures into account. The complement of T is defined as the most general component that can be composed with T without generating any failure. Given the definitions of composition and of the implementation relation, the mirror of T is defined as

    mirror(T) = (S − F, $\overline{S \cup F}$)

where the bar denotes complementation with respect to the set of all traces. In other words, the possible behaviors of mirror(T) include all behaviors that are not failures of T. Of those, the successes of T are also successes of its mirror. It is therefore easy to verify that the composition of a component with its mirror always has an empty set of failures.

The two protocols and the correctness specification of the example of Section 23.3 are easily represented as two-set models. In fact, sets of traces can be represented using automata as recognizers. However, for each component, we must represent two sets. This can be accomplished in the automaton by adding failure states that accept the failure traces. For the particular example presented in Section 23.3, we can still use the automata shown in Figure 23.1 and Figure 23.2. Note that we do not need to add failures to either the sender protocol or the specification, since they have only outputs and therefore do not constrain the environment in any way. The receiver, on the other hand, must be augmented with a state representing the failure traces. A transition to this additional state is taken from each state on all the inputs for which an action is not already present. In this case, if P1 is the sender protocol, P2 the receiver, C the converter, and G the specification, we may compute the converter by setting up the following local specification synthesis problem:

    P1 ∥ P2 ∥ C ⊑ G

The solution is therefore

    C = mirror(P1 ∥ P2 ∥ mirror(G))
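Continuing the sketch, the mirror needs the universe of all traces to complement against, and the converter is then computed exactly as stated (the projection step of the general formula is omitted here because, as noted next, the alphabets coincide in this example):

    def mirror(T, U):
        # Mirror of T = (S, F) relative to the trace universe U: its possible
        # behaviors are everything that does not make T fail, and its
        # successes are the traces on which T succeeds without also failing.
        S, F = T
        return (S - F, U - (S | F))

    def converter(P1, P2, G, U):
        # C = mirror(P1 || P2 || mirror(G)); every structure refining C is a
        # correct converter.
        return mirror(compose(compose(P1, P2), mirror(G, U)), U)

One can check directly from the definitions that compose(T, mirror(T, U)) always has an empty failure set, which is exactly the compatibility property the mirror is designed to guarantee.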
Note that projections are not needed in this case, since the alphabet is always A = {⊥, a, b, a′, b′}, which is also the alphabet of C. The solution to the problem thus consists of taking the complement of the global specification, composing it with the context (i.e., the two protocols), and complementing the result.

After taking the complementation, the resulting component may not be receptive. This can be avoided by applying the operations of autofailure manifestation and failure exclusion, similarly to the synchronous trace structure algebra of Wolf [31], before computing the mirror. A state is an autofailure if all its outgoing transitions are failures. In that case, the state can be bypassed by directing its incoming transitions to the failure state. Failure exclusion, instead, removes successful transitions whenever they are matched by a corresponding failure transition on the same input in the same state. The complementation can then be most easily done by first making the automaton deterministic (note, however, that this is a potentially expensive computation). For a deterministic and receptive automaton, the mirror can be computed by removing the existing outgoing failure transitions of each state and by adding transitions to a new failure state for each of the input actions that does not already result in a success.

When doing so in the example above, we obtain exactly the result depicted in Figure 23.7, with additional failure transitions that represent the flexibility in the implementation. In particular, the state labeled 0 in Figure 23.7 has failure transitions on input b, the state labeled 1 on input a, and the state labeled 2 on input b. This procedure is explained in more detail below.
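A sketch of this final mirroring step on an explicit automaton representation (our data layout, not the chapter's: a set of states, a transition map from (state, action) pairs to states, and a single distinguished failure sink; autofailure manifestation, failure exclusion, and determinization are assumed to have been applied already):

    def mirror_automaton(states, alphabet, delta, fail):
        # For a deterministic, receptive automaton: keep the success
        # transitions, and send every action that did not already lead to
        # a success into a fresh failure state of the mirror.
        new_fail = "F_mirror"
        out = {}
        for s in states - {fail}:
            for a in alphabet:
                target = delta.get((s, a), fail)
                out[(s, a)] = target if target != fail else new_fail
        for a in alphabet:
            out[(new_fail, a)] = new_fail   # failure states accept everything
        return (states - {fail}) | {new_fail}, out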
23.4.2 End-to-End Specification

A potentially better approach to protocol conversion consists of changing the topology of the local specification problem by providing a global specification that extends end to end from the sender to the receiver, as shown in Figure 23.9. The global specification in this case may be limited to the behavior of the communication channel as a whole, and would be independent of the particular signals employed internally by each protocol. In addition, in a scenario where the sender and the receiver function as layers of two communicating protocol stacks, the end-to-end behavior is likely to be more abstract, and therefore simpler to specify, than the inner information exchange.

We illustrate this case by modifying the previous example. In order to change the topology, the sender and receiver protocols must be modified to include inputs from (for the sender) and outputs to (for the receiver) the environment. This is necessary to let the protocols receive and deliver the data transmitted over the communication channel, and to make it possible to specify a global behavior. In addition to adding connections to the environment, in this example we also explicitly model the data. Thus, unlike the previous example, where the specification only required that a certain ordering relationship on the data be satisfied, we can here express true correctness by specifying that if a value is input to the system, the same value is output by the system at the end of the transaction. Since the size of the state space of the automata increases exponentially with the size of the data, we will limit the example to the communication of a two-bit integer value. Abstraction techniques must be used to handle larger problems. To make the example more interesting, we modify the protocols so that the sender serializes the least significant bit first, while the receiver expects the most significant bit first.
FIGURE 23.9 End-to-end specification.
FIGURE 23.10 The sender protocol.
In this case, the converter will also need to reorder the sequence of the bits received from the sender. All signals in the system are binary valued.

The protocols are simple variations of the ones depicted in Figure 23.1. The inputs to the sender protocol include a signal ft that is set to 1 when data is available, and two additional signals that encode the two-bit integer to be transmitted. The outputs include a signal st that clocks the serial delivery of the data, and one signal sd for the data itself. The sender protocol is depicted in Figure 23.10. We adopt the convention that a signal in the label of a transition is true when it appears with its original name, and false when its name is preceded by an n. Hence, for example, ft implies that ft = 1, and nft that ft = 0. The shaded state labeled F in the automaton accepts the failure traces, while the rest of the states accept the successful traces. Note that the protocol assumes that the environment refrains from sending new data while in the middle of a transfer. In addition, the protocol may delay the transmission of the second bit of the data for as many cycles as desired.

Similarly, the receiver protocol has inputs rt and rd, where rt is used to synchronize the start of the serial transfer with the other protocol; the output tt informs the environment when new data is available. The receiver protocol is depicted in Figure 23.11. The receiver fails if the second bit of the data is not received within the clock cycle that follows the delivery of the first bit.

The automaton for the global specification is shown in Figure 23.12. The global specification has the same inputs as the sender protocol, and the same outputs as the receiver protocol. A trace is successful if a certain value is received on the sender side, and the same value is emitted immediately or after an arbitrary delay on the receiver side. Analogously to the sender protocol, the specification fails if a new data value is received while the old value has not been delivered yet.

Following the same notation as the previous example, the solution to the conversion problem can be stated as

    C = mirror(proj({st, sd, rt, rd})(P1 ∥ P2 ∥ mirror(G)))

The projection is now essential to scope the solution down to only the signals that concern the conversion algorithm. The components must again be receptive; therefore, considerations similar to those expressed before for the computation of the mirror apply. In particular, autofailure manifestation and failure exclusion are applied before computing the mirror. The automaton is also made deterministic if necessary.
FIGURE 23.11 The receiver protocol.

FIGURE 23.12 The global specification.
FIGURE 23.13 The local converter specification.
The result of the computation is shown in Figure 23.13, where, for readability, the transitions that lead to the failure states are displayed as dotted lines. The form of the result is essentially identical to that of Figure 23.7. Note how the converter switches the position of the most and the least significant bit of the data during the transfer. In this way the converter makes sure that the correct data is transferred from one end to the other. Note, however, that the new global specification (Figure 23.12) had no knowledge whatsoever of how the protocols were supposed to exchange data. Failure traces again express the flexibility in the implementation, and at the same time represent assumptions on the environment. These assumptions are guaranteed to be satisfied (modulo a failure in the global specification), since the environment is composed of the sender and the receiver protocols, which are known variables in the system.

The solution excludes certain states that lead to a deadlock situation. This is in fact an important side effect of our specific choice of synchronous model, and has to do with the possibility of combinational loops that may arise as a result of a parallel composition. When this is the case, the mirror of an otherwise receptive component may not be receptive. This is because it is perfectly admissible in the model to avoid a failure by withholding an input, that is, by constraining the environment not to generate that input. But since the environment cannot actually be constrained, this can only be achieved by "stopping time" before reaching the deadlock state. Since this would be infeasible in any reasonable physical model, we consider deadlock states tantamount to autofailures, and remove them from the final result. This problem can be solved by employing a synchronous model that deals with combinational loops directly. This aspect of the implementation has been extensively studied by Wolf [31], who proposes a three-valued model that includes the usual binary values 0 and 1, plus one additional value to represent the oscillating, or unknown, behavior that results from combinational loops. Exploring the use of this model in the context of protocol specification and converter synthesis is part of our future work.

A similar condition may occur when a component tries to "guess" the future by speculating on the sequence of inputs that will be received in the following steps. If the sequence is not received, the component will find itself in a deadlock situation, unable to roll back to a consistent state. This is again admissible in our model, but would be ruled out if the right notion of receptiveness were adopted. These states and transitions are also pruned as autofailures.
FIGURE 23.14 The optimized converter.
The procedure outlined above has been implemented in a prototype application of approximately 2400 lines of C++ code. In the code, we explicitly represent the states and their transitions, while the formulas in the transitions are represented implicitly using BDDs (obtained from a separate package). This representation obviously suffers from the problem of state explosion. This is particularly true when the value of the data is explicitly handled by the protocols and the specification, as already discussed. A better solution can be achieved if the state space and the transition relation are also represented implicitly using BDDs. Note, in fact, that most of the time the data is simply stored and passed on by a protocol specification, and is therefore not involved in deciding its control flow. The symmetries that result can therefore likely be exploited to simplify the problem and make the computation of the solution more efficient.

Note that the converter we obtain is nondeterministic and could take paths that are "slower" than one would expect. This is evident in particular for the states labeled −0 and −1, which can react to the arrival of the second piece of data by doing nothing, or by transitioning directly to the states *0 and *1, respectively, while delivering the first part of the data. This is because our procedure derives the full flexibility of the implementation, and the specification depicted in Figure 23.12 does not mandate that the data be transferred as soon as possible. A "faster" implementation can be obtained by selecting the appropriate paths whenever a choice is available, as shown in Figure 23.14. In this case, the converter starts the transfer in the same clock cycle in which the last bit from the sender protocol is received. Other choices are also possible. In general, a fully deterministic converter can be obtained by optimizing certain parameters, such as the number of states or the latency of the computation. More sophisticated techniques might also try to enforce properties that were not already included in the global specification.
23.5 Conclusions

Emerging new design methodologies promote reuse of intellectual property as one of the basic techniques to handle complexity in the design process. In a methodology based on reuse, the components are predesigned and precharacterized, and are assembled in the system to perform the desired function.
System verification thus reduces to the verification of the interaction of the components used in the system. In this chapter, we have reviewed and explored techniques that are useful to define the interface that components expose to their environment. These interfaces include not only the basic typing information typical of today's programming and hardware description languages, but also the sequencing and behavioral information that is necessary to verify correct synchronization. The interface specifications of the components are then used to automatically construct adapters if the components do not already satisfy each other's requirements. This technique was first presented in the context of automata theory. We then presented similar, but stronger, results in the context of language theory and algebraic specifications. A simple example was used to illustrate a possible implementation of a converter synthesis algorithm.
Acknowledgments

Several people collaborated on the work described in this chapter, including Jerry Burch, Alberto Sangiovanni-Vincentelli, Luca de Alfaro, Thomas Henzinger, and James Rowson. The author would like to acknowledge their contribution.
References

[1] Henry Chang, Larry Cooke, Merrill Hunt, Grant Martin, Andrew J. McNelly, and Lee Todd. Surviving the SOC Revolution. A Guide to Platform-Based Design. Kluwer Academic Publishers, Norwell, MA, 1999.
[2] Alberto Ferrari and Alberto L. Sangiovanni-Vincentelli. System design: traditional concepts and new paradigms. In Proceedings of the International Conference on Computer Design, ICCD 1999, October 1999, pp. 2–12.
[3] Alberto L. Sangiovanni-Vincentelli. Defining platform-based design. EEdesign, February 2002.
[4] James A. Rowson and Alberto L. Sangiovanni-Vincentelli. Interface-based design. In Proceedings of the 34th Design Automation Conference, DAC 1997, Anaheim, CA, June 9–13, 1997, pp. 178–183.
[5] Marco Sgroi, Michael Sheets, Andrew Mihal, Kurt Keutzer, Sharad Malik, Jan Rabaey, and Alberto Sangiovanni-Vincentelli. Addressing system-on-a-chip interconnect woes through communication-based design. In Proceedings of the 38th Design Automation Conference, DAC 2001, Las Vegas, NV, June 2001, pp. 667–672.
[6] S. Chaki, S.K. Rajamani, and J. Rehof. Types as models: model checking message-passing programs. In Proceedings of the 29th ACM Symposium on Principles of Programming Languages, 2002.
[7] Luca de Alfaro and Thomas A. Henzinger. Interface automata. In Proceedings of the Ninth Annual Symposium on Foundations of Software Engineering, ACM Press, Vienna, Austria, 2001, pp. 109–120.
[8] Luca de Alfaro and Thomas A. Henzinger. Interface theories for component-based design. In Thomas A. Henzinger and Christoph M. Kirsch, Eds., Embedded Software, Vol. 2211 of Lecture Notes in Computer Science. Springer-Verlag, Heidelberg, 2001, pp. 148–165.
[9] David L. Dill. Trace Theory for Automatic Hierarchical Verification of Speed-Independent Circuits. ACM Distinguished Dissertations. MIT Press, Cambridge, MA, 1989.
[10] Roberto Passerone, James A. Rowson, and Alberto L. Sangiovanni-Vincentelli. Automatic synthesis of interfaces between incompatible protocols. In Proceedings of the 35th Design Automation Conference, San Francisco, CA, June 1998.
[11] Kanna Shimizu, David L. Dill, and Alan J. Hu. Monitor-based formal specification of PCI. In Proceedings of the Third International Conference on Formal Methods in Computer-Aided Design, Austin, TX, November 2000.
[12] Roberto Passerone. Semantic Foundations for Heterogeneous Systems. Ph.D. thesis, Department of EECS, University of California, Berkeley, CA, May 2004.
[13] Roberto Passerone, Luca de Alfaro, Thomas A. Henzinger, and Alberto L. Sangiovanni-Vincentelli. Convertibility verification and converter synthesis: two faces of the same coin. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design (ICCAD’02), November 2002.
[14] G. Borriello. A New Interface Specification Methodology and its Applications to Transducer Synthesis. Ph.D. thesis, University of California at Berkeley, Berkeley, CA, 1988.
[15] G. Borriello and R.H. Katz. Synthesis and optimization of interface transducer logic. In Proceedings of the International Conference on Computer Aided Design, November 1987.
[16] J.S. Sun and R.W. Brodersen. Design of system interface modules. In Proceedings of the International Conference on Computer Aided Design, 1992, pp. 478–481.
[17] J. Akella and K. McMillan. Synthesizing converters between finite state protocols. In Proceedings of the International Conference on Computer Design, Cambridge, MA, October 14–15, 1991, pp. 410–413.
[18] S. Narayan and D.D. Gajski. Interfacing incompatible protocols using interface process generation. In Proceedings of the 32nd Design Automation Conference, San Francisco, CA, June 12–16, 1995, pp. 468–473.
[19] Robert Siegmund and Dietmar Müller. A novel synthesis technique for communication controller hardware from declarative data communication protocol specifications. In Proceedings of the 39th Design Automation Conference, New Orleans, LA, 2002, pp. 602–607.
[20] Arindam Chakrabarti, Luca de Alfaro, Thomas A. Henzinger, and Freddy Y.C. Mang. Synchronous and bidirectional component interfaces. In Proceedings of the 14th International Conference on Computer-Aided Verification (CAV), Vol. 2404 of Lecture Notes in Computer Science. Springer-Verlag, Heidelberg, 2002, pp. 414–427.
[21] Arindam Chakrabarti, Luca de Alfaro, Thomas A. Henzinger, and Marielle Stoelinga. Resource interfaces. In Proceedings of the Third International Conference on Embedded Software (EMSOFT), Vol. 2855 of Lecture Notes in Computer Science. Springer-Verlag, Heidelberg, 2003.
[22] Luca de Alfaro, Thomas A. Henzinger, and Marielle Stoelinga. Timed interfaces. In Proceedings of the Second International Workshop on Embedded Software (EMSOFT), Vol. 2491 of Lecture Notes in Computer Science. Springer-Verlag, Heidelberg, 2002, pp. 108–122.
[23] Kim G. Larsen and Liu Xinxin. Equation solving using modal transition systems. In Proceedings of the Fifth Annual IEEE Symposium on Logic in Computer Science (LICS 90), June 4–7, 1990, pp. 108–117.
[24] Nina Yevtushenko, Tiziano Villa, Robert K. Brayton, Alex Petrenko, and Alberto L. Sangiovanni-Vincentelli. Sequential synthesis by language equation solving. Memorandum No. UCB/ERL M03/9, Electronic Research Laboratory, University of California at Berkeley, Berkeley, CA, 2003.
[25] Adnan Aziz, Felice Balarin, Robert K. Brayton, Maria D. Di Benedetto, Alex Saldanha, and Alberto L. Sangiovanni-Vincentelli. Supervisory control of finite state machines. In Pierre Wolper, Ed., Proceedings of Computer Aided Verification: Seventh International Conference, CAV’95, Vol. 939 of Lecture Notes in Computer Science, Liege, Belgium, July 1995. Springer, Heidelberg, 1995.
[26] Jerry R. Burch, David L. Dill, Elizabeth S. Wolf, and Giovanni De Micheli. Modeling hierarchical combinational circuits. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design (ICCAD’93), November 1993, pp. 612–617.
[27] Jerry R. Burch, Roberto Passerone, and Alberto L. Sangiovanni-Vincentelli. Notes on agent algebras. Technical Memorandum UCB/ERL M03/38, University of California, Berkeley, CA, November 2003.
[28] Radu Negulescu. Process Spaces and the Formal Verification of Asynchronous Circuits. Ph.D. thesis, University of Waterloo, Canada, 1998.
[29] Radu Negulescu. Process spaces. In C. Palamidessi, Ed., CONCUR, Vol. 1877 of Lecture Notes in Computer Science. Springer-Verlag, Heidelberg, 2000.
[30] Jerry R. Burch. Trace Algebra for Automatic Verification of Real-Time Concurrent Systems. Ph.D. thesis, School of Computer Science, Carnegie Mellon University, August 1992.
[31] Elizabeth S. Wolf. Hierarchical Models of Synchronous Circuits for Formal Verification and Substitution. Ph.D. thesis, Department of Computer Science, Stanford University, October 1995.
24
Hardware/Software Interface Design for SoC

Wander O. Cesário, TIMA Laboratory
Flávio R. Wagner, UFRGS — Instituto de Informática
A.A. Jerraya, TIMA Laboratory

24.1 Introduction
24.2 SoC Design
    System-Level Design Flow • SoC Design Automation — An Overview
24.3 HW/SW IP Integration
    Introduction to IP Integration • Bus-Based and Core-Based Approaches • Integrating Software IP • Communication Synthesis • IP Derivation
24.4 Component-Based SoC Design
    Design Methodology Principles • Virtual Architecture • Target MPSoC Architecture Model • HW/SW Wrapper Architecture • Design Tools • Defining IP-Component Interfaces
24.5 Component-Based Design of a VDSL Application
    Specification • DFU Abstract Architecture • MPSoC RTL Architecture • Results • Evaluation
24.6 Conclusions
References
24.1 Introduction

Modern system-on-chip (SoC) design shows a clear trend toward the integration of multiple processor cores. The SoC system driver section of the "International Technology Roadmap for Semiconductors" [1] predicts that the number of processor cores will increase fourfold per technology node in order to match the processing demands of the corresponding applications. Typical multiprocessor SoC (MPSoC) applications such as network processors, multimedia hubs, and base-band telecom circuits have particularly tight time-to-market and performance constraints that require a very efficient design cycle.

Our conceptual model of the MPSoC platform is composed of four kinds of components: software tasks, processor and intellectual property (IP) cores, and a global on-chip interconnect IP (see Figure 24.1[a]). Moreover, to complete the MPSoC platform we must also include hardware/software (HW/SW) elements that adapt platform components to each other. MPSoC platforms are quite different from single-master processor SoCs (SMSoCs). For instance, their implementation of system communication is more complicated, since heterogeneous processors may be involved and complex communication protocols and topologies may be used.
FIGURE 24.1 (a) MPSoC platform, (b) software stack, and (c) concurrent development environment.
The hardware adaptation layer must deal with some specific issues:

1. In SMSoC platforms, most peripherals (excluding DMA controllers) operate as slaves with respect to the shared communication interconnect. MPSoC platforms may use many different types of processor cores; in this case, sophisticated synchronization is needed to control shared communication between several heterogeneous masters.
2. While SMSoC platforms use simple master/slave shared-bus interconnections, MPSoC platforms often use several complex system buses or micronetworks as the global interconnect.

In MPSoC platforms, we can separate computation and communication design by using communication coprocessors and profiting from the multimaster architecture. Communication coprocessors/controllers (masters) implement high-level communication protocols in hardware and execute them in parallel with the computation executed on the processor cores.

Application software is generally organized as a stack of layers that runs on each processor core (see Figure 24.1[b]). The lowest layer contains drivers and low-level routines to control/configure the platform. For the middle layer we can use any commercial embedded operating system (OS) and configure it according to the application. The upper layer is an application-programming interface (API) that provides some predefined routines to access the platform. All these layers correspond to the software adaptation layer in Figure 24.1(a); coding application software can then be isolated from the design of the SoC platform (software coding is not the topic of this chapter and will be omitted).

One of the main contributions of this work is to apply this layered approach to the dedicated software (often called firmware) as well. Firmware is the software that controls the platform and, in some cases, executes some non-performance-critical application functions. In this case, it is not realistic to use a generic OS as the middle layer, for code size and performance reasons. A lightweight custom OS supporting an application-specific and platform-specific API is required.

Software and hardware adaptation layers isolate platform components, enabling concurrent development as shown in Figure 24.1(c). With this scheme, the software design team uses APIs for both application and dedicated software development. The hardware design team uses abstract interfaces provided by communication coprocessors/controllers. The SoC design team can concentrate on implementing HW/SW abstraction layers for the selected communication interconnect IP.

Designing these HW/SW abstraction layers represents a major effort, and design tools are lacking. Established EDA tools are not well adapted to this new MPSoC design scenario, and consequently many challenges are emerging; some major issues are:

1. A higher abstraction level is needed: at the register-transfer level (RTL), it is very time consuming to model and verify the interconnection between multiple processor cores.
2. Higher-level programming is needed: MPSoCs will include hundreds of thousands of lines of dedicated software (firmware). This software cannot be programmed at the assembler level, as is done today.
3. Efficient HW/SW interfaces are required: microprocessor interfaces, banks of registers, shared memories, software drivers, and OSs must be optimized for each application.

This chapter presents a component-based design automation approach for MPSoC platforms. Section 24.2 introduces the basic concepts of MPSoC design and discusses some related platform- and component-based approaches. Section 24.3 details IP-based methodologies for HW/SW IP integration. Section 24.4 details our specification model and design flow. Section 24.5 presents the application of this flow to the design of a VDSL circuit and an analysis of the results.
24.2 SoC Design

24.2.1 System-Level Design Flow

This section gives an overview of current SoC design methodologies using a template design flow (see Figure 24.2). The basic theory behind this flow is the separation between communication and computation refinement for platform- and component-based design [2,3]; it has five main design steps:
FIGURE 24.2 System-level design flow for SoC.
1. System specification: system designers and the end-customer must agree on an informal model containing all of the application's functionality and requirements. Based on this model, system designers build a more formal specification that can be validated by the end-customer.
2. Architecture exploration: system designers build an executable model of the specification and iterate through a performance analysis loop to decide the HW/SW partitioning for the SoC architecture. This executable specification uses an abstract platform composed of abstract models for HW/SW components. For instance, an abstract software model can concentrate on I/O execution profiles, most frequent use cases, or worst-case scheduling. Abstract hardware can be described using transaction-level models or behavioral models.
This step produces the "golden" architecture model, that is, the customized SoC platform or a new architecture created by system designers after selecting processors, the global communication interconnect, and other IP components. Once HW/SW partitioning is decided, software and hardware development can proceed concurrently.
3. Software design: since the final hardware platform will not be available during software development, some kind of hardware abstraction layer (HAL) or API must be provided to the software design team.
4. Hardware design: hardware IP designers implement the functionality described by the abstract hardware models at the RTL. Hardware IPs can use specific interfaces for a given platform or standard interfaces as defined by the Virtual Socket Interface Alliance (VSIA) [4].
5. HW/SW IP integration: SoC designers create HW/SW interfaces to the global communication interconnect. The golden architecture model must specify performance constraints to assure a good HW/SW integration. SW/HW communication interfaces are designed to conform to these constraints.
24.2.2 SoC Design Automation — An Overview

Many academic and industrial works propose tools for SoC design automation covering many, but not all, of the design steps presented above. Most approaches can be classified into three groups: system-level synthesis, platform-based design, and component-based design.

System-level synthesis methodologies are top-down approaches: the SoC architecture and software models are produced by synthesis algorithms from a system-level specification. COSY [5] proposes a HW/SW communication refinement process that starts with an extended Kahn Process Network model in design step (1), uses virtual channel connection (VCC) [6] for step (2), callback signals over a standard real-time operating system (RTOS) for the API in step (3), and VSIA interfaces for steps (4) and (5). SpecC [7] starts with an untimed functional specification model written in extended C in design step (1), uses performance estimation for a structural architecture model in step (2), HW/SW interface synthesis based on a timed bus-functional communication model for step (5), synthesized C code for step (3), and behavioral synthesis for step (4).

Platform-based design is a meet-in-the-middle approach that starts with a functional system specification and a predesigned SoC platform. Performance estimation models are used to try different mappings between the set of the application's functional modules and the set of platform components. During these iterations, designers can try different platform customizations and functional optimizations. VCC [6] can produce a performance model using a functional description of the application and a structural description of the SoC platform for design steps (1) and (2). CoWare N2C [8] is a good complement to VCC for design steps (4) and (5). Still, the API for software components and many architecture details must be implemented manually.

Section 24.3 discusses HW/SW IP integration in the context of current IP-based design approaches. Most IP-based design approaches build SoC architectures from the bottom up using predesigned components with standard interfaces and a standard bus. For instance, IBM defined a standard bus called CoreConnect [9], Sonics proposes a standard on-chip network called Silicon Backplane µNetwork [10], and VSIA defined a standard component protocol called VCI. When needed, wrappers adapt incompatible buses and component interfaces. Frequently, internally developed components are tied to in-house (nonpublic) standards; in this case, adopting public standards implies a big effort to redesign interfaces or wrappers for old components.

Section 24.4 introduces a higher-level IP-based design methodology for HW/SW interface design called component-based design. This methodology defines a virtual architecture model composed of HW/SW components and uses this model to automate design step (5), by providing automatic generation of hardware interfaces (4), device drivers, OSs, and APIs (3). Even if this approach does not provide much help in automating design steps (1) and (2), it provides a considerable reduction of design time for design steps (3), (4), and (5), and facilitates component reuse.
The key improvements over other state-of-the-art platform- and component-based design approaches are:

1. Strong support for software design and integration: the generated API completely abstracts the hardware platform and OS services. Software development can be concurrent with, and independent of, platform customization.
2. Higher-level abstractions: the use of a virtual architecture model allows designers to deal with HW/SW interfaces at a high abstraction level. Behavior and communication are separated in the system specification; thus, they can be refined independently.
3. Flexible HW/SW communication: automatic HW/SW interface generation is based on the composition of library elements. It can be used with a variety of IP interconnect components by adding the necessary supporting library.
24.3 HW/SW IP Integration

There are two major approaches to the integration of HW/SW IP components into a given design. In the first one, component interfaces follow a given standard (such as a bus or core interface, for hardware components, or a set of high-level communication primitives, for software components) and can thus be directly connected to each other. In the second approach, components are heterogeneous in nature and their integration requires the generation of HW/SW wrappers. In both cases, an RTOS must be used to provide the services needed for the application software to fit into the SoC architecture. This section describes different solutions to the integration of HW/SW IP components.
24.3.1 Introduction to IP Integration

The design of an embedded SoC starts with a high-level functional specification, which can be validated. This specification must already follow a clear separation between computation and communication [11], in order to allow their concurrent evolution and design. An abstract architecture is then used to evaluate this functionality, based on a mapping that assigns functional blocks to architectural ones. This high-level architectural model abstracts away all low-level implementation details. A performance evaluation of the system is then performed, using estimates of the computation and communication costs. Communication refinement is now possible, with a selection of particular communication mechanisms and a more precise performance evaluation.

According to the platform-based design approach [2], the abstract architecture follows an architectural template that is usually domain specific. This template includes both a hardware platform, consisting of a given communication structure and given types of components (processors, memories, hardware blocks), and a software platform, in the form of a high-level API. The target embedded SoC will be designed as a derivative of this template, where the communication structure, the components, and the software platform are all tailored to fit the particular application needs. The IP-based design approach follows the idea that the architectural template may be implemented by assembling reusable HW/SW IP components, possibly even delivered by third-party companies.

The IP integration step comprises a set of tasks that are needed to assemble predesigned components in order to fulfill system requirements. As shown in Figure 24.3, it takes as inputs the abstract architecture and a set of HW/SW IP components that have been selected to implement the architectural blocks. Its output is a microarchitecture where hardware components are described at the RTL with all the cycle- and pin-accurate details that are needed for further automatic synthesis. Software components are described in an appropriate programming language, such as C, and can be directly compiled to the target processors of the architecture.

In an ideal situation, IP components would fit directly together (or to the communication structure) and exactly match the desired SoC functionality. In a more general situation, the designer may need to adapt each component's functionality (a step called IP derivation) and synthesize HW/SW wrappers to interconnect them.
FIGURE 24.3 The HW/SW IP integration design step.
For programmable components, although adaptation may be easily performed by programming the desired functionality, the designer may still need to develop software wrappers (usually device and bus drivers) to match the application software to the communication infrastructure. The generation of HW/SW wrappers is usually known as interface or communication synthesis. Besides this, application software may also need to be retargeted to the processors and OS of the chosen architecture. In the following subsections, different approaches to IP integration are introduced and their impact on the possible integration subtasks is analyzed.
24.3.2 Bus-Based and Core-Based Approaches

In the bus-based design approach [9,12,13], IP components communicate through one or more buses (interconnected by bus bridges). Since the bus specification can be standardized, libraries of components whose interfaces directly match this specification can be developed. Even if components follow the bus standard, very simple bus interface adapters may still be needed [14]. For components that do not directly match the specification, wrappers have to be built. Companies offer very rich component libraries and specialized development and simulation environments for designing systems around their buses.

A somewhat different approach is core-based design, as proposed by the VSIA VCI standard [4] and by the OCP-IP organization [15]. In this case, IP components are compliant with a bus-independent and standardized interface and can thus be directly connected to each other. Although the standard may support a wide range of functionality, each component may have an interface containing only the functions that are relevant for it. These components may also be interconnected through a bus, in which case standard wrappers can adapt the component interface to the bus. Sonics [13] follows this approach, proposing wrappers to adapt the bus-independent OCP socket to the MicroNetwork bus. For particular needs, the SoC may be built around a sophisticated and dedicated network-on-chip (NoC) [16] that may deliver very high performance for connecting a large number of components. Even in this case, a bus- or core-based approach may be adopted to connect the components to the network.

Bus-based and core-based design methodologies are integration approaches that depend on standardized component or bus interfaces. They allow the integration of homogeneous IP components that follow these standards and can be directly connected to each other, without requiring the development of complex wrappers. The problem we face is that many de facto standards exist, coming from different companies or organizations, thus preventing a real interchange of libraries of IP components developed for different substandards.
24.3.3 Integrating Software IP

Programmable components are important in a reusable architectural platform, since it is very cost-effective to tailor a platform to different applications by simply adapting the low-level software and perhaps only configuring certain hardware parameters, such as memory sizes and peripherals. As illustrated in Figure 24.3, the software view of an embedded system shows three different layers:

1. The bottom layer is composed of services directly provided by hardware components (processor and peripherals), such as instruction sets, memory and peripheral accesses, and timers.
2. The top layer is the application software, which should remain completely independent from the underlying hardware platform.
3. The middle layer is composed of three different sublayers, as seen from bottom to top:
(a) Hardware-dependent software (HdS), consisting for instance of device drivers, boot code, parts of an RTOS (such as context-switching code and configuration code to access the memory management unit [MMU]), and even some domain-oriented algorithms that directly interact with the hardware.
(b) Hardware-independent software, typically high-level RTOS services, such as task scheduling and high-level communication primitives.
(c) The API, which defines a system platform that isolates the application software from the hardware platform and from all basic software layers, and enables their concurrent design. The standardization of this API, which can be seen as a collection of services usually offered by an OS, is essential for software reuse above and below it.

At the application software level, libraries of reusable software IP components can implement a large number of functions that are necessary for developing systems for given application domains. If, however, one tries to develop a system by integrating application software components that do not directly match a given API, software retargeting to the new platform will be necessary. This can be a very tedious and error-prone manual process, which is a candidate for an automatic software synthesis technique.

Nevertheless, reuse can also be obtained below the API. Software components implementing the hardware-independent parts of the RTOS can be more easily reused, especially if the interface between this layer and the HdS layer is standardized. Although the development of reusable HdS may be harder to accomplish, because of the diversity of hardware platforms, it can at least be obtained for platforms aimed at specific application domains.

There are many academic and industrial alternatives providing RTOS services. The problem with most approaches, however, is that they do not consider specific requirements for SoC, such as minimizing memory usage and power consumption. Recent research efforts propose the development of application-specific RTOSs containing only the minimal set of functions needed for a given application [17,18] or including dynamic power management techniques [19]. IP integration methodologies should thus consider the generation of application-specific RTOSs that are compliant with a standard API and optimized for given system requirements.

In recent years, many standardization efforts aimed at hardware IP reuse have been developed. Similar efforts for software IP reuse are now needed. VSIA [4] has recently created working groups to deal with HdS and platform-based design.
24.3.4 Communication Synthesis

Solutions for the automatic synthesis of communication wrappers to connect hardware IP components that have incompatible interfaces have already been proposed. In the PIG tool [20], component interfaces are specified as protocols described as regular expressions, and a finite state machine (FSM) interface for connecting two arbitrary protocols is automatically generated. The Polaris tool [21] generates adapters based on state machines for converting component protocols into a standard internal protocol, together with send and receive buffers and an arbiter.
These approaches, however, do not address the integration of software IP components. The TEReCS tool [18] synthesizes communication software to connect software IP components, given a specification of the communication architecture and a binding of IP components to processors. In the IPChinook environment [22], abstract communication protocols are synthesized into low-level bus protocols according to the target architecture. While the IPChinook environment also generates a scheduler for a given partitioning of processes into processors, the TEReCS approach is associated with the automatic synthesis of a minimal OS, assembled from a general-purpose library of reusable objects that are configured according to application demands and the underlying hardware.

Recent solutions uniformly handle HW/SW interfaces between IP components. In the COSY approach [5], design is performed by an explicit separation between function and architecture. Functions are then mapped to architectural components. Interactions between functions are modeled by high-level transactions and then mapped to HW/SW communication schemes. A library provides a fixed set of wrapper IPs, containing HW/SW implementations for given communication schemes.
24.3.5 IP Derivation

Hardware IP components may come in several forms [23]. They may be hard, when all gates and interconnects are placed and routed; soft, with only an RTL representation; or firm, with an RTL description together with some physical floorplanning or placement. The integration of hard IP components cannot be performed by adapting their internal behavior and structure. While they have the advantage of more predictable performance, in turn they are less flexible and therefore less reusable than adaptable components.

Several approaches for enhancing reusability are based on adaptable components. Although one can think of very simple component configurations (for instance, selecting a bit width), a higher degree of reusability can be achieved by components whose behavior can be more freely modified. Object orientation is a natural vehicle for high-level modeling and adaptation of reusable components [24,25]. This approach, which can be better classified as IP derivation, is adequate not only for firm and soft hardware IP components, but also for software IP [26]. Although component reusability is enhanced by this approach, the system integrator faces a greater design effort, and it becomes more difficult to predict IP performance.

Intellectual property derivation and communication synthesis are different approaches to solving the same problem of integration between heterogeneous IP components, which do not follow standards (or the same substandards). IP derivation is a solution usually based on object-oriented concepts coming from the software community. It can be applied to the integration of application software components and of soft and firm hardware components, but it cannot be used for hard IP components. Communication synthesis, on the other hand, follows the path of the hardware community on automatic logic and high-level synthesis. It is the only solution to the integration of heterogeneous hard IP components, although it can also be used for integrating software IP and soft and firm hardware IP. While IP derivation is essentially a user-guided manual process, communication synthesis is an automatic process, with no user intervention.
24.4 Component-Based SoC Design

This section introduces the component-based design methodology, a high-level IP-based methodology aimed at the integration of heterogeneous HW/SW IP components. It follows an automatic communication synthesis approach, generating both HW and SW wrappers. It also generates a minimal and dedicated OS for programmable components. It uses a high-level API, which isolates the application software from the implementation of a HW/SW solution for the system platform, such that software retargeting is not necessary. This approach enables the automatic integration of heterogeneous IP components (components that do not follow a given bus or core standard) and hard IP components (whose internal behavior or structure is not known).
However, the approach is also very well suited to the integration of homogeneous and soft IP components. The methodology has been conceived to fit any communication structure, such as an NoC [16] or a bus.

The component-based methodology is based on a clear definition of three abstraction levels that are also adopted by other current approaches: system (pure functional), macroarchitecture, and microarchitecture (RTL). These levels constitute clear "interfaces" between design steps, promoting reuse of both components and tools for design tasks at each of these levels.
24.4.1 Design Methodology Principles

The design flow starts with a virtual architecture model that corresponds to the "golden" architecture in Figure 24.2, and allows automatic generation of wrappers, device drivers, OSs, and APIs. The goal is to produce a synthesizable RTL model of the MPSoC platform that is composed of processor cores, IP cores, the communication interconnect IP, and HW/SW wrappers. The latter are automatically generated from the interfaces of virtual components (as indicated by the arrows in Figure 24.4). Software written for the virtual architecture specification runs without modification on the implementation because the same APIs are provided by the generated custom OSs. The input abstract architecture (see Figure 24.4[a]) is composed of virtual modules (VMs), corresponding to processing and memory IPs, connected by any communication structure, itself also encapsulated within a VM. This abstract architecture model clearly separates computation from communication, allowing independent and concurrent implementation paths for components and for communication.
FIGURE 24.4 MPSoC design flow: (a) virtual architecture and (b) target MPSoC platform.
VMs that correspond to processors may be hierarchically decomposed into submodules containing software tasks assigned to this processor. VMs communicate through virtual ports, which are sets of hierarchical internal and external ports through which services are requested and provided. The separation between internal and external ports makes possible the connection of modules described at different abstraction levels.
24.4.2 Virtual Architecture

The virtual architecture represents a system as an abstract netlist of virtual components (see Figure 24.4[a]). It is described in VADeL, a SystemC [27] extension that includes a platform-independent API offering high-level communication primitives. This API abstracts the underlying hardware platform, thus allowing reusable components to be developed independently of the platform. In the abstract architecture model, the interfaces of software tasks are the same for SW/SW and SW/HW connections, even if the software tasks are executed by different processors. Different HW/SW realizations of this API are possible. Architectural design space exploration can thus be achieved without affecting the functional description of the application.

Virtual components use wrappers to adapt accesses from the internal component (a set of software tasks or a hardware function) to the external channels. The wrapper is modeled as a set of virtual ports that contain internal and external ports that can differ in terms of: (1) communication protocol, (2) abstraction level, and (3) specification language. This model is not directly synthesizable or executable because the wrapper's behavior is not described. These wrappers can be generated automatically, in order to produce a detailed architecture that can be both synthesized and simulated. The required SystemC extensions implemented in VADeL are:

1. Virtual module: consists of a module and its wrapper.
2. Virtual port: groups some internal and external ports that have a conversion relationship. The wrapper is the set of virtual ports of a given VM.
3. Virtual channel: groups several channels having a logical relationship (e.g., multiple channels belonging to the same communication protocol).
4. Parameters: used to customize hardware interfaces (e.g., buffer size and physical addresses of ports), OSs, and drivers.

In VADeL, there are also predefined ports with special semantics called service access ports (SAPs). They can be used to access services that are implemented by hardware or software wrapper components. For instance, the timer SAP can be used to request an interrupt from a hardware timer after a given delay.
24.4.3 Target MPSoC Architecture Model

We use a generic MPSoC architecture where processors and other IP cores are connected to a global communication interconnect IP via wrappers (see Figure 24.4[b]). In fact, processors are separated from the physical communication IP by wrappers that act as communication coprocessors or bridges, freeing processors from communication management and enabling parallel execution of computation tasks and communication protocols. Software tasks also need to be isolated from hardware through an OS that plays the role of a software wrapper. When defining this model, our goal was to have a generic model in which both computation and communication may be customized to fit the specific needs of the application. For computation, we may change the number and kind of components; for communication, we can select a specific communication IP and protocols. This architecture model is suitable for a wide range of applications; more details can be found in Reference 28.
24.4.4 HW/SW Wrapper Architecture

Wrappers are automatically generated as point-to-point adapters between each VM and the communication structure, as shown in Figure 24.4(b) [28]. This approach allows the connection of components to standard buses as well as point-to-point connections between cores.
FIGURE 24.5 HW/SW wrapper architecture: (a) software wrapper; (b) hardware wrapper.
Wrappers may have both hardware and software parts. The internal architecture of a wrapper on the hardware side is shown in Figure 24.5(b). It consists of a processor adapter, one or more channel adapters, and an internal bus. The number of channel adapters depends on the number of channels that are connected to the corresponding VM. This architecture allows the easy generation of multipoint, multiprotocol wrappers. The wrapper dissociates communication from computation, since it can be considered a communication coprocessor that operates concurrently with other processing functions. On the software side [17], as shown in Figure 24.5(a), wrappers provide the implementation of the high-level communication primitives (available through the API) used in the system specification, as well as drivers to control the hardware. If required, the wrapper also provides more sophisticated OS services, such as task scheduling and interrupt management, minimally tailored to the particular application. The synthesis of wrappers is based on libraries of basic modules from which hardware wrappers and dedicated OSs are assembled. These libraries may easily be extended with modules that are needed to build wrappers for processors, memories, and other components that follow various bus and core standards.
24.4.5 Design Tools

Figure 24.6 shows an overall view of our design environment, which is called ROSES. The input model may be imported from a specification analysis tool (e.g., Reference 6) or manually coded using our extended SystemC library. All design tools use a unified design model that contains an abstract HW/SW netlist annotated with parameters (Colif [29]). Hardware wrapper generation [28] transforms the input model into a synthesizable architecture. The software wrapper generator [17] produces a custom OS for each processor on the target platform. For validation, we use the cosimulation wrapper generator [30] to produce simulation models. Details about these tools can be found in the references; only their principles are discussed here.

Hardware wrapper generation assembles library components using the architecture template presented before (Figure 24.5[b]) to produce the RTL architecture. This library contains generalized descriptions of hardware components in a macrolanguage (m4-like); it has two parts: the processor library and the protocol library. The former contains local template architectures for processors with four types of elements: processor cores, local buses, local IP components (e.g., local memory, address decoder, coprocessors), and processor adapters. The latter consists of a list of channel adapters. Each channel adapter has simulation, estimation, and synthesis models that are parameterized (by the channel parameters, e.g., direction, storage size, and data type), like the elements in the processor library.

The software wrapper generator produces OSs streamlined and preconfigured for the software module(s) that run(s) on each target processor. It uses a library organized in three parts: APIs, communication/system services, and device drivers. Each part contains elements that will be used in a given software layer of the generated OS. The generated OS provides communication services (e.g., FIFO [first in, first out] communication), I/O services (e.g., AMBA bus drivers), memory services (e.g., cache or virtual memory usage), etc. Services have dependencies among them; for instance, communication services depend on I/O services. Elements of the OS library also have dependency information.
FIGURE 24.6 Design automation tools for MPSoC.
This mechanism is used to keep the size of the generated OS at a minimum; the elements that provide unnecessary services are not included. There are two types of service code: reusable (or existing) code and expandable code. As an example of existing code, AMBA bus-master service code can exist in the OS library in the form of C code. As an example of expandable code, OS kernel functions can exist in the OS library in the form of macrocode (m4-like). Several preemptive schedulers are available in the OS library, such as a round-robin scheduler and a priority-based scheduler. In the case of the round-robin scheduler, time-slicing (i.e., assigning a different CPU load to each task) is supported. To make the OS kernel very small and flexible, (1) the task scheduler can be selected from the requirements of the application code and (2) a minimal amount (less than 10% of the kernel code size) of processor-specific assembly code is used (for context switching and interrupt service routines).

The cosimulation wrapper generator [30] produces an executable model composed of a SystemC simulator that acts as a master for other simulators. A variety of simulators can participate in this cosimulation: SystemC, VHDL, Verilog, and instruction-set simulators. Cosimulation wrappers have the same structure as hardware wrappers (see Figure 24.5[b]), with simulation adapters in the place of processor adapters and simulation models in the place of channel adapters. In the cosimulation wrapper library, there are simulation adapters for the different simulators supported and channel adapters that implement all supported protocols in different languages. In terms of functionality, the cosimulation wrapper transforms channel accesses via internal ports into channel accesses via external ports using the following functional chain: channel interface, channel resolution, data conversion, and module communication behavior. Internal ports use channel functions (e.g., FIFO available, FIFO write) to exchange data. The channel interface provides the implementation of these channel functions. Channel resolution maps the N-to-M correspondence between internal and external ports. Data conversion is required since different abstraction levels can use different data types to represent the same data. Module communication behavior is required to exchange data via the external ports, that is, to call the port functions of the external ports.
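As an illustration of this chain, the implementation of one channel function inside a cosimulation wrapper could be organized roughly as follows. This is purely an illustrative sketch: only the four chain steps come from the description above, and all type and function names (resolve, convert, etc.) are assumptions.

    // Channel interface: implements the channel function called by an
    // internal port of the module under cosimulation.
    void fifo_write(InternalPort& in, const Data& d)
    {
        ExternalPort& out = resolve(in);   // channel resolution: map the N-to-M
                                           // correspondence of internal/external ports
        Packet p = convert(d);             // data conversion between the data types
                                           // of the two abstraction levels
        out.write(p);                      // module communication behavior: call the
    }                                      // port function of the external port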
© 2006 by Taylor & Francis Group, LLC
Hardware/Software Interface Design for SoC
24-13
24.4.6 Defining IP-Component Interfaces

Hardware and software component interfaces must be composed using basic elements of the hardware wrapper and software wrapper generator libraries, respectively. Table 24.1 lists some API functions available for different kinds of software task interfaces and some services provided by the channel adapters available for use in hardware component interfaces. Software tasks must communicate through API functions provided by the software wrapper generator library. For instance, the shared memory (SHM) API provides read/write functions for intertask communication. The guarded shared memory (GSHM) API adds semaphore services to the SHM API by providing lock/unlock functions. Hardware IP components must communicate through communication primitives provided by the channel adapters of the hardware wrapper generator library. For instance, FIFO channel adapters (sender and receiver) implement a buffered two-phase handshake protocol (put/get) and provide full/empty functions for accessing the state of the buffer. ASFIFO channel adapters instead use a single-phase handshake protocol and can generate an interrupt for signaling the full and empty states of the buffer.

A recurrent problem in library-based approaches is library size explosion. In ROSES, this problem is minimized by the use of layered library structures in which a service is factorized so that its implementation uses elements of different layers. This scheme increases reuse of library elements since the elements of the upper layers must use the services provided by the elements in the immediately lower layer. Designers are able to extend the ROSES libraries since they are implemented in an open format. This is an important feature since it enables the support of different standards while reusing most of the basic elements in the libraries. Table 24.2 shows some of the existing HW/SW components in the current ROSES IP library and gives the type of communication they use in their interfaces.

TABLE 24.1 HW/SW Communication APIs
     Basic component interfaces   API functions
SW   Register                     Put/get
     Signal                       Sleep/wakeup
     FIFO                         Put/get
     SHM                          Read/write
     GSHM                         Lock/unlock/read/write
HW   Register                     Put/get
     FIFO                         Put/get/full/empty
     ASFIFO                       Put/get/IT(full/empty)
     Buffer                       BPut/BGet
     Event                        Send/IT(receiver)
     AHB master/slave             Read/write
     Timer                        Set/wait

TABLE 24.2 Sample IP Library

     IP           Description                         Interfaces
SW   host-if      Host PC interface                   Register/signal
     Rand         Random number generator             Signal/FIFO
     mult-tx      Multipoint FIFO data transmission   FIFO
     reg-config   Register configuration              Register/FIFO/SHM
     shm-sync     SHM synchronization                 SHM/signal
     stream       FIFO data streaming                 GSHM/FIFO/signal
HW   ARM7         Processor core                      ARM7 pins
     TX_Framer    Data package framing                17 registers, 1 FIFO
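For instance, a software task that streams values from a guarded shared memory into a FIFO could combine the two APIs as sketched below. The port and helper names are assumptions; the Lock/Unlock and Put calls follow the API functions of Table 24.1 (and mirror the stream IP of Figure 24.7[a]).

    // Copy eight values from a semaphore-guarded shared memory to a FIFO.
    void send_burst(GshmPort& shm, FifoPort& out)
    {
        int* buf = (int*) shm.Lock();   // GSHM: acquire exclusive access to the SHM
        for (int i = 0; i < 8; i++)
            out.Put(buf[i]);            // FIFO: buffered two-phase handshake
        shm.Unlock();                   // GSHM: release the semaphore
    }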
The code excerpt of the stream IP shown in Figure 24.7(a) is the following (the line numbers are referenced in the text below):

 1   void stream::stream_beh() {
 2     long int * P;
 3     ...
 4     for (;;)
 5     {
 6       ...
 7       P = (long int*) P1.Lock();
 8       P2.Sleep();
 9       for (int i = 0; i < 8; i++)
10       {
11         long int val = P3.Get();
12         P4.Put(*(P + i + 8));
13         ... };
14       P1.Unlock();
15     }
16     ...
17   }

FIGURE 24.7 (a) The stream software IP and (b) the TX_Framer hardware IP. (In [a], ports P1, P2, P3, and P4 are of type GSHM, signal, FIFO, and FIFO, respectively; in [b], register ports P1–P17 connect to registers 1–17, and FIFO ports P18–P20 connect to the input FIFO.)
Figure 24.7(a) shows the "stream" software IP and part of its code to demonstrate the utilization of the communication APIs. Its interface is composed of four ports: two for the FIFO API (P3 and P4), one for the signal API (P2), and one for the GSHM API (P1). In line 7 of Figure 24.7(a), the stream IP uses P1 to lock the access to the SHM that contains the data that will be streamed. P2 is used to suspend the task that fills up the SHM (line 8). Then, some header information is read from the input FIFO using P3 (line 11) and streamed to the output FIFO using P4 (line 12). When streaming is finished, P1 is used to unlock the access to the SHM (line 14). Figure 24.7(b) shows the "TX_Framer" hardware IP, which is part of a VDSL modem and is responsible for packaging data into ATM-network-compatible frames. Its interface is composed of 17 configuration registers (P1–P17) and one single-handshake input FIFO (P18–P20). The registers are used to configure the IP functionality and have bit sizes varying from 2 to 11, while the FIFO is used to store data packets that will be inserted into specific places in the output ATM frames. These ports are driven directly by the compatible outputs of the register and ASFIFO channel adapters that are generated by the hardware wrapper generator.
24.5 Component-Based Design of a VDSL Application

24.5.1 Specification

The design presented in this section illustrates the IP integration capabilities of ROSES. We redesigned part of a VDSL modem that was prototyped by Reference 31 using discrete components (the shaded part in Figure 24.8[a]). The block diagram of the modem subset used in the rest of this chapter is shown in Figure 24.8(b). It corresponds to a deframing/framing unit (DFU), composed of two ARM7 processors and the TX_Framer. The TX_Framer is part of the VDSL protocol processor. In this experiment, it is used as a hard IP component described at the RTL. The partition of processors/tasks was suggested by the design team of the VDSL-modem prototype. The processors exchange data using three asynchronous FIFO buffers. The TX_Framer IP has some configuration registers and inputs a data stream through a synchronous FIFO buffer. Tasks use a variety of control and data-transmission protocols to communicate. For instance, a task can block/unblock the execution of other tasks by sending them an OS signal. For data transmission, tasks use a FIFO memory buffer, two shared memories (with or without semaphores), and direct register access. Despite representing only
FIGURE 24.8 (a) VDSL modem prototype and (b) DFU block diagram. (BL-M: bit-loading memory; V-M: variance memory; I-M: interleaver memory; Di-M: de-interleaver memory. The shaded part of [a] is the part redesigned as a multicore SoC.)
a subset of the VDSL modem, the design of the DFU remains quite challenging. In fact, it uses two processors executing parallel tasks. The control over the three modules of the specification is fully distributed. All three modules act as masters when interacting with their environment. Additionally, the application includes multipoint communication channels requiring sophisticated OS services.
24.5.2 DFU Abstract Architecture

Figure 24.9 shows the abstract architecture model that captures the DFU specification with point-to-point communications between the three main IP cores. VM1 and VM2 are two virtual processors, and VM3 corresponds to the TX_Framer function. VM1 and VM2 include several submodules corresponding to software tasks T1 through T9 assigned to these processors. This abstract model can be mapped onto different concrete microarchitectures depending on the selected IP components and on desired performance, area, and power constraints. For instance, the three point-to-point connections (VC1, VC2, and VC3) between VM1 and VM2 can be mapped onto a bus or onto an SHM.
FIGURE 24.9 DFU abstract architecture specification.
TABLE 24.3 HW/SW IP Utilization

     IP           Description                         Use
SW   host-if      Host PC interface                   T1
     Rand         Random number generator             T2, T3
     mult-tx      Multipoint FIFO data transmission   T4, T8
     reg-config   Register configuration              T5
     shm-sync     SHM synchronization                 T6, T9
     stream       FIFO data streaming                 T7
HW   ARM7         Processor core                      VM1, VM2
     TX_Framer    Data package framing                VM3
24.5.3 MPSoC RTL Architecture

For the implementation of the DFU virtual architecture, two hardware IP cores have been selected: an ARM7 processor and the TX_Framer. The application software has been built by reusing several available software IP components for implementing tasks T1 to T9. Table 24.3 lists the selected IP components and indicates their correspondence to the VMs and submodules in the DFU virtual architecture. The interfaces of the selected software IP components in Table 24.3 (see Table 24.2) match the communication types of the software tasks of the virtual architecture in Figure 24.8(b). Figure 24.10 shows the RTL microarchitecture obtained after HW/SW wrapper generation. It is important to notice that, from an abstract target architecture containing an "abstract" ARM7 processor, ROSES automatically generates a "concrete" ARM7 local architecture containing additional IP components, which implement the local memory, local bus, and address decoder. Each software wrapper (custom OS) is customized to the set of software IPs corresponding to the tasks executed by the processor core. For example, software IPs running on VM2 access the custom OS using communication primitives available through the API: register is used to write/read to/from the configuration/status registers inside the TX_Framer block, while SHM and GSHM are used to manage shared-memory communication. Each OS contains a round-robin scheduler (Sched) and resource management services (Sync, IT). The driver layer contains low-level code to access the channel adapters within the hardware wrappers (e.g., Pipe LReg for the HNDSHK channel adapter), and some low-level kernel routines.
FIGURE 24.10 Generated MPSoC Architecture.
TABLE 24.4 Results for OS Generation

OS results   Number of lines   Number of lines   Code size   Data size
             in C              in assembly       (bytes)     (bytes)
VM1          968               281               3829        500
VM2          1872              281               6684        1020

Context switch (cycles)                       36
Latency for interrupt treatment (cycles)      59 (OS) + 28 (ARM7)
System call latency (cycles)                  50
Resume of task execution (cycles)             26
24.5.4 Results

The manual design of a full VDSL modem requires several person-years; the presented DFU was estimated at more than five person-years of effort. When using the ROSES IP integration capabilities, the overall experiment took one person only 4 months, including all validation and verification time (but not counting the effort to develop library components and to debug design tools). This corresponds to a 15-fold reduction in design effort (a more detailed presentation can be found in Reference 32). The application code and the generated OS are compiled and linked together to be executed on each ARM7 processor. The hardware wrapper can be synthesized using RTL synthesis. As can be seen in Table 24.4, most OS code is generated in C; only a small part of it is in assembly and includes some low-level routines (e.g., context switching and processor boot) that are specific to each processor. If we compare the numbers presented in Table 24.4 with commercial embedded OSs, the results are still very good. The minimum size for such OSs is around 4 KB, but with this size few of them could provide the required functionality. Table 24.5 shows the numbers obtained after RTL synthesis of the hardware wrappers using a CMOS (complementary metal oxide semiconductor) 0.35 µm technology. These are good results because the wrappers account for less than 5% of the ARM7 core area and have a critical path that corresponds to less than 15% of the clock cycle for the 25 MHz ARM7 processors used in this case study.
TABLE 24.5 Results for Hardware Wrapper Generation

HW interfaces   Number of gates   Critical path delay (nsec)   Maximum frequency (MHz)
VM1             3284              5.95                         168
VM2             3795              6.16                         162

Latency for read operation (clock cycles)    6
Latency for write operation (clock cycles)   2
Number of code lines (RTL VHDL)              2168
24.5.5 Evaluation

Results show that the component-based approach can generate HW/SW interfaces and OSs that are as efficient as manually coded/configured ones. The HW/SW frontier in the wrapper implementation can easily be displaced by changing some library components. This choice is transparent to the final user since everything that implements the interconnect API is generated automatically (the API does not change, only its implementation does). Furthermore, correctness and coherence can be verified inside tools and libraries against the API semantics, without having to impose fixed boundaries on the HW/SW frontier (in contrast to standardized component interfaces or buses).

The utilization of layered library components provides great flexibility; the design environment can easily be adapted to accommodate different languages for describing system behavior, different task scheduling and resource management policies, different global communication interconnect topologies and protocols, a diversity of processor cores and IP cores, and different memory architectures. In most cases, inserting a new design element in this environment only requires adding the appropriate library components. Layered library components are at the root of the methodology; the principle followed is that each component contains a unique functionality and respects well-defined interfaces that enable easy composition. This layered structure prevents library size explosion, since composition is used to implement complex functionality and to increase component reuse.

As explained in this chapter and illustrated by the design case study, ROSES uses a component-based methodology that presents a unique combination of features:

• It implements a general approach for the automatic integration of heterogeneous and hard IP components, although it easily accommodates the integration of homogeneous and soft IP components.
• It offers an architecture-independent API, integrated into SystemC, containing high-level communication primitives and supporting the development of reusable components. Application software accessing this API does not need to be retargeted to each system implementation.
• It adopts a library-based approach to wrapper generation. As long as components communicate through a known protocol, communication synthesis can be done automatically, without any additional design effort. In a formal-based approach, instead, the designer must describe the component interface by some appropriate formalism, such as an FSM [21] or a regular expression [20].
• It uniformly addresses the generation of the HW/SW parts of communication wrappers for programmable components. While some approaches consider only hardware [20,21] or software [18,22] wrappers, others also consider HW/SW parts but are restricted to predefined wrapper libraries
for given communication schemes [5]. The library-based approach of ROSES, in turn, allows the synthesis of software interfaces for various communication schemes.
• It can be used with any architectural template and communication structure, such as a bus, an NoC, or point-to-point connections between components. It is also configurable to synthesize wrappers for any bus or core standard.
24.6 Conclusions

Reuse of IP components is a major requirement for the design of complex embedded SoCs. However, reuse is a complex process that involves many steps, requires support from specialized tools and methodologies, and influences current design practices. The integration of IP components into a particular design is perhaps the most complex step of the reuse process. Many design approaches, such as bus- and core-based design and platform-based design, are aimed at easier IP integration. Nevertheless, many problems are still open, in particular the automatic synthesis of HW/SW wrappers between heterogeneous and hard IP components. This chapter has shown that the component-based design methodology provides a complete, generic, and efficient solution to the HW/SW interface design problem. Starting from a high-level functional specification and an abstract architecture, design tools can automatically generate the HW/SW wrappers that are necessary to integrate the heterogeneous IP components that have been selected to implement the application. Designers do not need to design any low-level interfacing details manually. The chapter has also shown how HW/SW component interfaces can be decomposed and easily adapted to different communication structures and bus and core standards.
References

[1] ITRS. Available at http://public.itrs.net/.
[2] K. Keutzer, A.R. Newton, J.M. Rabaey, and A. Sangiovanni-Vincentelli. System-Level Design: Orthogonalization of Concerns and Platform-Based Design. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 19: 1523–1543, 2000.
[3] M. Sgroi, M. Sheets, A. Mihal, K. Keutzer, S. Malik, J. Rabaey, and A. Sangiovanni-Vincentelli. Addressing the System-on-Chip Interconnect Woes through Communication-Based Design. In Proceedings of the 38th Design Automation Conference, Las Vegas, NV, June 2001.
[4] VSIA. http://www.vsi.org.
[5] J.-Y. Brunel, W.M. Kruijtzer, H.J.H.N. Kenter, F. Pétrot, L. Pasquier, E.A. de Kock, and W.J.M. Smits. COSY Communication IPs. In Proceedings of the 37th Design Automation Conference, Los Angeles, CA, June 2000.
[6] Cadence Design Systems, Inc. Virtual Component Co-design. http://www.cadence.com/products/vcc.html.
[7] D. Gajski, J. Zhu, R. Domer, A. Gerstlauer, and S. Zhao. SpecC Specification Language and Methodology. Kluwer Academic Publishers, Dordrecht, 2000.
[8] CoWare. http://www.coware.com.
[9] IBM CoreConnect Bus Architecture. http://www.chips.ibm.com/bluelogic.
[10] D. Wingard. MicroNetwork-Based Integration for SOCs. In Proceedings of the 38th Design Automation Conference, Las Vegas, NV, June 2001.
[11] J. Rowson and A. Sangiovanni-Vincentelli. Interface-Based Design. In Proceedings of the 34th Design Automation Conference, 1997.
[12] ARM AMBA. http://www.arm.com.
[13] Sonics SiliconBackplane MicroNetwork. http://www.sonicsinc.com.
[14] R.A. Bergamaschi and W.R. Lee. Designing Systems-on-Chip Using Cores. In Proceedings of the 37th Design Automation Conference, 2000.
[15] Open Core Protocol. http://www.ocpip.org.
[16] G. de Micheli and L. Benini. Networks-on-Chip: A New Paradigm for Systems-on-Chip Design. In Proceedings of the Design, Automation and Test in Europe Conference, 2002.
[17] L. Gauthier, S. Yoo, and A.A. Jerraya. Automatic Generation and Targeting of Application Specific Operating Systems and Embedded Systems Software. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 20(11): 1293–1301, 2001.
[18] C. Böke. Combining Two Customization Approaches: Extending the Customization Tool TEReCS for Software Synthesis of Real-Time Execution Platforms. In Proceedings of the Workshop on Architectures of Embedded Systems (AES2000), Karlsruhe, Germany, January 2000.
[19] L. Benini, A. Bogliolo, and G. De Micheli. Dynamic Power Management of Electronic Systems. In Proceedings of the International Conference on Computer Aided Design, 1998.
[20] R. Passerone, J.A. Rowson, and A. Sangiovanni-Vincentelli. Automatic Synthesis of Interfaces between Incompatible Protocols. In Proceedings of the 35th Design Automation Conference, 1998.
[21] J. Smith and G. De Micheli. Automated Composition of Hardware Components. In Proceedings of the 35th Design Automation Conference, 1998.
[22] P. Chou et al. IPChinook: An Integrated IP-Based Design Framework for Distributed Embedded Systems. In Proceedings of the 36th Design Automation Conference, 1999.
[23] M. Birnbaum and H. Sachs. How VSIA Answers the SoC Dilemma. IEEE Computer, 32: 42–50, June 1999.
[24] C. Barna and W. Rosenstiel. Object-Oriented Reuse Methodology for VHDL. In Proceedings of the Design, Automation and Test in Europe Conference, 1999.
[25] P. Schaumont et al. Hardware Reuse at the Behavioral Level. In Proceedings of the 36th Design Automation Conference, 1999.
[26] F.J. Rammig. Web-based System Design with Components off the Shelf (COTS). In Proceedings of the Forum on Design Languages, Tuebingen, 2000.
[27] SystemC. http://www.systemc.org.
[28] D. Lyonnard, S. Yoo, A. Baghdadi, and A.A. Jerraya. Automatic Generation of Application-Specific Architectures for Heterogeneous Multiprocessor System-on-Chip. In Proceedings of the 38th Design Automation Conference, Las Vegas, NV, June 2001.
[29] W.O. Cesário, G. Nicolescu, L. Gauthier, D. Lyonnard, and A.A. Jerraya. Colif: A Design Representation for Application-Specific Multiprocessor SOCs. IEEE Design & Test of Computers, 18: 8–20, 2001.
[30] S. Yoo, G. Nicolescu, D. Lyonnard, A. Baghdadi, and A.A. Jerraya. A Generic Wrapper Architecture for Multi-Processor SoC Cosimulation and Design. In Proceedings of the International Symposium on HW/SW Codesign (CODES), 2001.
[31] M. Diaz-Nava and G.S. Okvist. The Zipper Prototype: A Complete and Flexible VDSL MultiCarrier Solution. ST Journal Special Issue xDSL, 2(1): 1/3–21/3, September 2001.
[32] W. Cesário, A. Baghdadi, L. Gauthier, D. Lyonnard, G. Nicolescu, Y. Paviot, S. Yoo, M. Diaz-Nava, and A.A. Jerraya. Component-Based Design Approach for Multicore SoCs. In Proceedings of the 39th Design Automation Conference, New Orleans, June 2002.
25
Design and Programming of Embedded Multiprocessors: An Interface-Centric Approach

Pieter van der Wolf, Erwin de Kock, Tomas Henriksson, Wido Kruijtzer, and Gerben Essink
Philips Research

25.1 Introduction
25.2 Related Work
25.3 TTL Interface Requirements
25.4 TTL Interface
     Inter-Task Communication • TTL Multi-Tasking Interface • TTL APIs
25.5 Multiprocessor Mapping
     Source Code Transformation • Automated Transformation
25.6 TTL on an Embedded Multi-DSP
     The Multi-DSP Architecture • TTL Implementation • Implementation Results • Implementation Conclusions
25.7 TTL in a Smart Imaging Core
     The Smart Imaging Core • TTL Shells
25.8 Conclusions
Acknowledgments
References
25.1 Introduction

[Originally published as: P. van der Wolf, E. de Kock, T. Henriksson, W. Kruijtzer, and G. Essink. Proceedings of CODES + ISSS 2004 Conference, Stockholm, September 8–10. ACM Press. Reprinted with permission.]

Modern consumer devices need to offer a broad range of functions at low cost and with low energy consumption. The core of such devices often is a multiprocessor System-on-Chip (MPSoC) that implements the functions as an integrated hardware/software solution. The integration technology used for
building such MPSoCs from a set of hardware and software modules is typically based on low-level interfaces for the integration of the modules. For example, the usual way of working is to use bus interfaces for the integration of hardware devices, with ad-hoc mechanisms based on memory-mapped registers and interrupts to synchronize hardware and software modules. Further, support for reuse is typically poor, and a method for exploring trade-offs is often missing. As a consequence, MPSoC integration is a labor-intensive and error-prone task, and opportunities for reuse of hardware and software modules are limited.

Integration technology for MPSoCs should be based on an abstract interface for the integration of hardware and software modules. Such an abstract interface should help to close the gap between the application models used for specification and the optimized implementation of the application on a multi-processor architecture. The interface must enable mapping technology that supports systematic refinement of application models into optimized implementations. Together, such interface and mapping technology will help to structure MPSoC integration, thereby enhancing both the productivity and the quality of MPSoC design.

We present design technology for MPSoC integration with an emphasis on three contributions:

1. We present TTL, a task-level interface that can be used both for developing parallel application models and as a platform interface for integrating hardware and software tasks on a platform infrastructure. The TTL interface makes services for inter-task communication and multi-tasking available to tasks.
2. We show how mapping technology can be based on TTL to support the structured design and programming of embedded multi-processor systems.
3. We show that the TTL interface can be implemented efficiently on different architectures. We present both a software and a hardware implementation of the interface.

After discussing related work in Section 25.2, we present the requirements for the TTL interface in Section 25.3. The TTL interface is presented in Section 25.4. Section 25.5 discusses the mapping technology, exemplified by several code examples. We illustrate the design technology in Sections 25.6 and 25.7 with two industrial design cases: a multi-DSP solution and a smart-imaging multi-processor. We present conclusions in Section 25.8.
25.2 Related Work

Interface-based design has been proposed as a way to separate communication from behavior so that communication refinement can be applied [1]. Starting from abstract token-passing semantics, communication mechanisms are incrementally refined down to the level of physical interconnects. In References 2 and 3, a library-based approach is proposed for generating hardware and software wrappers for the integration of heterogeneous sets of components. The wrappers provide the glue to integrate components having different (low-level) interfaces. No concrete interface is proposed. In Reference 4, transaction-level models (TLMs) at the device or component level are discussed. In contrast, we present an abstract task-level interface, named TTL, which can be implemented as a platform interface. This interface is the target for the mapping of tasks. Previously, several task-level interfaces and their implementations have been developed at Philips [5–7]. TTL brings these interfaces together in a single framework, unifying them as a set of interoperable interface types.

The data transfer and storage exploration (DTSE) method [8] of IMEC focuses on source code transformation to optimize memory accesses and memory footprint. To our knowledge, the method does not address the mapping of concurrent applications onto multiprocessor platforms. The Task Concurrency Management method [9] focuses on run-time scheduling of tasks on multiprocessor platforms to optimize energy consumption under real-time constraints. The interaction between these tasks is based on low-level primitives such as mutexes and semaphores. As a result, the tasks are less reusable than TTL tasks, and the design and transformation of tasks is more difficult and time-consuming.
The Open SystemC Initiative [10] provides a modeling environment to enable system-level design and IP exchange. Currently, the environment does not standardize the description of tasks at the high level of abstraction that we aim at. However, TTL can be made available as a class library for SystemC in the future.
25.3 TTL Interface Requirements

We present a design method for implementing media processing applications as MPSoCs. A key ingredient of our design method is the Task Transaction Level (TTL) interface. On the one hand, application developers can use TTL to build executable specifications. On the other hand, TTL provides a platform interface for implementing applications as communicating hardware and software tasks on a platform infrastructure. The TTL interface enables mapping technology that automates the refinement of application models into optimized implementations. Using the TTL interface to go from specification to implementation allows the mapping process to be an iterative process, where during each step selected parts of the application model are refined. Figure 25.1 illustrates the basic idea, with the TTL interface drawn as dashed lines.

For the TTL interface to provide a proper foundation for our design method, it must satisfy a number of requirements. First, it must offer well-defined semantics for modeling media processing applications. It must allow parallelism and communication to be made explicit to enable mapping to multi-processor architectures. Further, the TTL interface must be an abstract interface. This makes the interface easy to use for application development because the developer does not have to consider low-level details. An abstract interface also helps to make tasks reusable, as it hides underlying implementation details. For example, if a task uses an abstract interface for synchronization with other tasks, it can be unaware and independent of the implementation of the synchronization with, for example, semaphores or some interrupt-based scheme. The platform infrastructure makes services available to tasks via the TTL platform interface. Specifically, these are services for inter-task communication, multi-tasking, and (re)configuration. Rather than offering a low-level interface and implementing, for example, synchronization as part of all the tasks, we factor out such generic services from the tasks to implement them as part of the platform infrastructure. This implementation is done once for a platform, optimized for the targeted application domain and the underlying multiprocessor architecture. An abstract platform interface provides freedom for implementing the platform infrastructure. It must allow a broad range of platform implementations, including different multiprocessor architectures. For example, both shared memory and message-passing architectures should be supported.
FIGURE 25.1 TTL interface for building parallel application models and implementing them on a platform infrastructure.
FIGURE 25.2 TTL interface as software API and as hardware interface in example architecture.
Further, the abstraction allows critical parts of a platform implementation to be optimized transparently and enables evolution of a platform implementation as technology evolves. For example, smooth transition from bus-based interconnects towards the use of network-on-chip technology should be supported. The TTL interface must allow efficient implementations of the platform infrastructure and the tasks integrated on top of it. To enable integration of hardware and software tasks, the interface must be available both as an API and as a hardware interface. An example of how the TTL interface could manifest itself in a simple multiprocessor architecture is shown in Figure 25.2. In the left part of Figure 25.2 the TTL interface is implemented as an API of a software shell executing on a CPU. Software tasks executing on the CPU can access the platform services via the API. In the right part of Figure 25.2 a task is implemented as an application-specific processor (ASP). The TTL interface for integrating the ASP is available as a hardware interface. A hardware shell implements the platform services on top of a lower interconnect. Such interconnect could, for example, have an interface like AXI [11], OCP [12], or DTL [13].
25.4 TTL Interface

In this section we present the TTL interface. Specifically, we discuss the TTL interface for inter-task communication and multi-tasking services. We do not discuss reconfiguration. In this chapter all task graphs are static.
25.4.1 Inter-Task Communication

We define the following terminology and associated logical model for the communication between tasks. The logical model provides the basis for the definition of the TTL inter-task communication interface. It identifies the relevant entities and their relationships (see Figure 25.3). A task is an entity that performs computations and that may communicate with other tasks. Multiple tasks can execute concurrently to achieve parallelism. The medium through which the data communication takes place is called a channel. A task is connected to a channel via a port. A channel is used to transfer values from one task to another. A variable is a logical storage location that can hold a value. A private variable is a variable that is accessible by one task only. A token is a variable that is used to hold a value that is communicated from one task to another. A token can be either full or empty. Full tokens are tokens that contain a value. Empty tokens do not contain a valid value, but merely provide space for a task to put a value in. We also refer to full and empty tokens as data and room, respectively. Tasks communicate with other tasks by calling TTL interface functions on their ports. Hence, a task has to identify a port when calling an interface function. We focus on streaming communication: communicating tasks exchange sequences of values via channels. A set of communicating tasks is organized as a task graph.
FIGURE 25.3 Logical model for inter-task communication.

TABLE 25.1 TTL Interface Types

Acronym   Full name
CB        Combined blocking
RB        Relative blocking
RN        Relative non-blocking
DBI       Direct blocking in-order
DNI       Direct non-blocking in-order
DBO       Direct blocking out-of-order
DNO       Direct non-blocking out-of-order
25.4.1.1 Interface Types

Considering the varying needs for modeling media processing applications and the great variety in potential platform implementations, it is not likely that a single narrow interface can satisfy all requirements. For example, applications may process tokens of different granularities, where streams of tokens may or may not be processed strictly in order. Platform implementations may have different costs associated with synchronization between tasks, data transfers, and the use of memory. Certain architectures efficiently implement message-passing communication, whereas shared memory architectures offer a single address space for memory-based communication between tasks. In our view, designers are best served if they are offered a palette of communication styles from which they can use the most appropriate one for the problem at hand.

The TTL interface offers support for different communication styles by providing a set of different interface types. Each interface type is easy to use and implement. All interface types are based on the same logical model, which enables interoperability across interface types. A task designer must select an interface type for each port. Different interface types can be used in a single model, even in a single task. This allows models to be refined iteratively, where during each step selected parts of a model are refined.

In defining the interface types, we have to choose which properties to support and which properties to combine in a particular interface type. Some properties hold for all interface types. Specifically, all channels are uni-directional and support reliable and ordered communication. TTL supports arbitrary communication data types, but each individual channel can communicate tokens of a single type only. Multi-cast is supported, that is, a channel has one producing task but can have multiple consuming tasks. The TTL interface types are listed in Table 25.1.
The write function is used by a producer to write a vector of size values into the channel connected to port. The read function is used by a consumer to read a vector of size values from the channel connected to port. The write and read functions are also available as scalar functions that operate
© 2006 by Taylor & Francis Group, LLC
25-6
Embedded Systems Handbook
on a single value at a time. The write and read functions are blocking functions, that is, they do not return until the complete vector has been written or read, respectively. This interface type is based on our earlier work on YAPI [5]. This interface type is the most abstract TTL interface type. Since it hides low-level details from the tasks, it is easy to use and supports reuse of tasks. The write and read functions perform both the synchronization and the data transfer associated with communication. That is, they check for availability of room/data, copy data to/from the channel, and signal the availability of data/room. The length of the communicated vectors may exceed the number of tokens in the channel. The platform implementation may transfer such vectors in smaller chunks, transparent to the communicating tasks [14]. This interface type is named CB as it combines (C) synchronization and data transfer in a single function with blocking (B) semantics. This interface type can be implemented efficiently on message-passing architectures or on shared memory architectures where the processors have local buffers that can hold the values that are read or written. However, on shared memory architectures where the processors do not have such local buffers, this interface type may yield overhead in copying data between private variables, situated in shared memory, and the channel buffer in shared memory. 25.4.1.3 Interface Types RB and RN To provide more flexibility for making trade-offs upon task implementation, the other TTL interface types offer separate functions for synchronization and data transfer. The availability of room or data can be checked explicitly by means of an acquire function and can be signaled by means of a release function. The acquire function can be blocking or non-blocking. A non-blocking acquire function does not wait for data or room to be available, but returns immediately to report success of failure. The functions for the producer are: reAcquireRoom (port, count) tryReAcquireRoom (port, count) store (port, offset, vector, size) releaseData (port, count)
reAcquireRoom is the blocking acquire function and tryReAcquireRoom is the non-blocking acquire function. The acquire and release functions synchronize for vectors of count tokens at a time. The acquire functions are named “reacquire” since they also acquire tokens that have previously been acquired and not yet released. That is, they do not change the state of the channel. This helps to reduce the state saving effort for tasks as the acquire function can simply be issued again upon a next task invocation. This behavior is similar to GetSpace in Reference 6. Data accesses can be performed on acquired room with the store function, which copies a vector of size values to the acquired empty tokens. The store function can perform out-of-order accesses on the acquired empty tokens using a relative reference offset. An offset of 0 refers to the oldest acquired and not yet released token. The store function is also available as a scalar function. The releaseData function releases the count oldest acquired tokens as full tokens on port. The functions for the consumer are: reAcquireData (port, count) tryReAcquireData (port, count) load (port, offset, vector, size) releaseRoom (port, count)
These interface types are named RB and RN with the R of relative, B of blocking, and N of non-blocking. Offering separate functions for synchronization and data transfer allows data transfers to be performed on a different granularity and rate than the related synchronizations. This may, for example, be used to reduce the cost of synchronization by performing synchronization at a coarse grain outside a loop, while performing computations and data transfers at a finer grain inside the loop. This interface type can be used to avoid the overhead of memory copies on shared memory architectures at a lower cost than
© 2006 by Taylor & Francis Group, LLC
Design and Programming of Embedded Multiprocessors
25-7
with CB, as coarse-grain synchronization can be combined with small local buffers, for example, registers, for fine-grain data transfers. Additionally, for some applications the support for out-of-order accesses helps to reduce the cost of private variables that are needed in a task. Further, with this interface type, tasks can selectively load only part of the data from the channel, thereby allowing the cost of data transfers to be reduced. The drawback, compared to CB, is that these interface types are less abstract. 25.4.1.4 Interface Type DBI and DNI The RB and RN interface types hide the memory addresses of the tokens from the tasks. This supports reuse of tasks. However, it may also incur inefficiencies upon data transfers, like function call overhead, accesses to the channel administration, and address calculations. To avoid such inefficiencies, TTL offers interface types that support direct data accesses. In these interface types the acquire functions return a reference to the acquired token in the channel. This reference can subsequently be used by the task to directly access the data/room in the channel without using a TTL interface function. The functions for the producer are: acquireRoom (port, &token) tryAcquireRoom (port, &token) token->field = value releaseData (port)
The functions for the consumer are: acquireData (port, &token) tryAcquireData (port, &token) value = token->field releaseRoom (port)
The acquire and release functions acquire/release a single token at a time. Supporting vector operations for these interface types would result in a complex interface. For example, it would expose the wrap-around in the channel buffer or would require a vector of references to be returned. Since tasks must still be able to acquire more than one token, these acquire functions acquire the first un-acquired token and change the state of the channel, unlike the reacquire functions of RB and RN. The release functions release the oldest acquired token on port. The interface types are named DBI and DNI with the D of direct, B of blocking, N of non-blocking, and I of in-order as tokens are released in the same order as they are acquired. These interface types can be implemented efficiently on shared memory architectures [7] and are suited for software tasks that process coarse-grain tokens. 25.4.1.5 Interface Type DBO and DNO In some cases tasks do not finish the processing of data in the same order as the data was acquired. In particular when large tokens are used, it should be possible to release a token as soon as a task is finished with it. For this purpose TTL offers the DBO and DNO interface types (O for out-of-order). The only difference with the DBI and DNI interface types is in the release functions: releaseData (port, &token) releaseRoom (port, &token)
The token reference allows the task to specify which token should be released. The out-of-order release supports efficient use of memory at the cost of a more complex implementation of the channel.
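As an illustration, a consumer of large tokens might use the out-of-order release roughly as follows. This is only a sketch assuming the DBO functions listed above; the task, port, and token type names are hypothetical.

/* Sketch: out-of-order release with interface type DBO.
   Task, port (CinP), and token type (Frame) are hypothetical. */
void Consumer::main()
  while (true)
    Frame *a, *b;
    acquireData(CinP, a);      /* oldest full token                 */
    acquireData(CinP, b);      /* next full token                   */
    processSmall(b);           /* b is finished first ...           */
    releaseRoom(CinP, &b);     /* ... so its room is released early */
    processLarge(a);
    releaseRoom(CinP, &a);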
25.4.2 TTL Multi-Tasking Interface

To support different forms of multi-tasking, TTL offers different ways for tasks to interact with the scheduler. To this end, TTL supports three task types. The task type process is for tasks that have their own (virtual) thread of execution and that do not explicitly interact with the scheduler. This task type is suited for tasks that have their private processing
resource or that rely on the platform infrastructure to perform task switching and state saving implicitly. For example, this task type is well suited for software tasks executing on an OS. The task type co-routine is for cooperative tasks that interact explicitly with the scheduler at points in their execution where task switching is acceptable. For this purpose TTL offers a suspend function. This task type may be used to reduce the task-switching overhead by allowing the task to suspend itself at points where only little state needs to be saved. The task type actor is for fire-exit tasks that perform a finite amount of computations and then return to the scheduler, similar to a function call. Unless explicitly saved, state is lost upon return. This task type may be used for a set of tasks that have to be scheduled statically.
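For illustration, a co-routine task might be structured as sketched below. The suspend function is the one mentioned above; the task and port names are hypothetical.

/* Sketch: a co-routine task that suspends itself at block boundaries,
   where only little state needs to be saved across the task switch. */
void Filter::main()
  while (true)
    reAcquireData(CinP, 64);
    /* ... process one block of 64 tokens ... */
    releaseRoom(CinP, 64);
    suspend();               /* safe task-switching point */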
25.4.3 TTL APIs

The TTL interface is available both as C++ and as C API. The use of C++ gives cleaner descriptions of task interfaces, due to C++ support for templates and function overloading. We use C to link to software compilers for embedded processors and hardware synthesizers since most of them do not support C++ as input language. For both the C++ API and the C API we offer a generic run-time environment, which can be used for functional modeling and verification of TTL application models.
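To give a flavor of the difference, the sketch below contrasts how a read function might be declared in the two APIs. The identifiers are illustrative only; the actual TTL names differ.

// C++ API: overloading and templates let one name serve all token types.
template <typename T> void read(InPort<T> &port, T &value);
template <typename T> void read(InPort<T> &port, T *vector, int size);

/* C API: a distinct function name per interface type and token type,
   as required by C-only compilers and hardware synthesizers. */
void ttlCb_readInt(TTLPort *port, int *value);
void ttlCb_readIntVector(TTLPort *port, int *vector, int size);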
25.5 Multiprocessor Mapping

In this section we present a systematic approach to map applications efficiently onto multiprocessors. The key advantage of TTL is that it provides a smooth transition from application development to application implementation. In our approach, we rewrite the source code of applications to improve efficiency. We focus on source code transformations for multiprocessor architectures taking into account costs of memory usage, synchronization cycles, data transfer cycles, and address generation cycles. We do not consider algorithmic transformations because these transformations are application-specific. Typically, application developers perform these transformations. We also do not consider code transformations for single target processors because these transformations are processor-specific. We assume that processor-specific compilers and synthesizers support these transformations, although in today’s practice programmers also write processor-specific C. In the remainder of this section we present methods and tools to transform source code. First we present source code transformations to illustrate the advantages of using TTL. Next we present tools that we developed to automate these transformations.
25.5.1 Source Code Transformation

We use a simple example to illustrate the use of TTL. The example consists of an inverse quantization (IQ) task that produces data for an inverse zigzag (IZZ) task; see Figure 25.4. We focus on the interaction between these two tasks. The TTL interface supports different inter-task communication interface types that provide a trade-off between abstraction and efficiency. We illustrate this by means of code fragments. To save space we indicate scopes by means of indentation rather than curly braces.

25.5.1.1 Optimization for Single Interface Types

The most abstract and easy-to-use interface type is CB, which combines synchronization and data transfer in write and read functions.
FIGURE 25.4 IQ and IZZ example.
01 void IQ::main()
02   while (true)
03     for(int j=0; j<vi; j++)
04       for(int i=0; i<hi; i++)
05         for(int k=0; k<64; k++)
06           VYApixel Cin;
07           VYApixel Cout;
08           read(CinP, Cin);
09           Cout = Cin * QuantTable[k];
10           write(CoutP, Cout);

FIGURE 25.5 IQ using interface type CB.

01 void IZZ::main()
02   while (true)
03     VYApixel Cin[64];
04     VYApixel Cout[64];
05     read(CinP, Cin, 64);
06     for (int i=0; i<64; i++)
07       Cout[zigzag[i]] = Cin[i];
08     write(CoutP, Cout, 64);

FIGURE 25.6 IZZ using interface type CB.

01 void IQ::main()
02   while (true)
03     for(int j=0; j<vi; j++)
04       for(int i=0; i<hi; i++)
05         VYApixel Cout[64];
06         for(int k=0; k<64; k++)
07           VYApixel Cin;
08           read(CinP, Cin);
09           Cout[k] = Cin * QuantTable[k];
10         write(CoutP, Cout, 64);
FIGURE 25.7 IQ using interface type CB and vector write.
Figure 25.5 shows a fragment of the IQ task that reads input (Line 08), performs the inverse quantization (Line 09) and writes output using a scalar write operation (Line 10). The write function terminates when the value of variable Cout has been transferred to the channel. This is repeated for all 64 values of a block (Line 05) and for all blocks in a minimum coding unit (Line 03 and 04). Figure 25.6 shows a fragment of the IZZ task using a vector read function (Line 05). The read function terminates when 64 values from the channel have been transferred to the variable Cin. Subsequently, these values are reordered (Line 06 and 07) and written to the output (Line 08). The channel from the IQ task to the IZZ task implements the write and read functions that handle both the synchronization and the data transfer. Note that the length of the communicated vectors is not bounded by the number of tokens in the channel, which makes tasks independent of their environment. A potential performance problem of the IQ task in Figure 25.5 is that for each pixel, the output synchronizes with the input of the IZZ task. In Reference 15 we demonstrated that this is costly in terms of cycles per pixel if the write function is implemented in software. We can solve this problem by calling the write function outside the inner loop as shown in Figure 25.7 in Line 10. To this end, we need to store a block of pixels locally in the IQ task (Line 05). Similar source code transformations to reduce the synchronization rate are possible for the other TTL interface types.

25.5.1.2 Optimization across Interface Types

The disadvantage of the IQ task of Figure 25.7 is the additional local memory requirement. Interface type RB splits synchronization and data transfer in separate functions such that the synchronization rate can be decreased without additional local memory requirements.
01 void IQ::main()
02   while (true)
03     for(int j=0; j<vi; j++)
04       for(int i=0; i<hi; i++)
05         reAcquireRoom(CoutP, 64);
06         for(int k=0; k<64; k++)
07           VYApixel Cin;
08           read(CinP, Cin);
09           store(CoutP, k, Cin * QuantTable[k]);
10         releaseData(CoutP, 64);
FIGURE 25.8 IQ using interface type RB.
01 void IZZ::main()
02   while (true)
03     VYApixel Cout[64];
04     reAcquireData(CinP, 64);
05     for(int i=0; i<64; i++)
06       VYApixel Cin;
07       load(CinP, i, Cin);
08       Cout[zigzag[i]] = Cin;
09     write(CoutP, Cout, 64);
10     releaseRoom(CinP, 64);
FIGURE 25.9 IZZ using interface type RB.
Figure 25.8 shows how to decrease the synchronization rate from pixel rate to block rate at the output of the IQ task. Note that here we assume that the channel can store at least 64 pixels, otherwise the call of the function reAcquireRoom at Line 05 will never terminate. This assumption on the environment is not needed with interface type CB. Hence, the IQ task of Figure 25.8 puts more constraints on its use. Figure 25.9 shows the IZZ task with separate synchronization and data transfer. The IQ task and the IZZ task do not need to store blocks locally to interact with each other. They share the tokens in the channel. If the IQ task and the IZZ task need to execute concurrently, then the channel must be able to contain two blocks, that is, 128 pixels. The load function (Figure 25.9, Line 07) and the store function (Figure 25.8, Line 09) use relative addressing. The advantage of this is that the address generation for the FIFO can be implemented in the load and store functions. Hence, address generation is hidden from the tasks. Interface type DBI uses direct addressing rather than relative addressing. Direct addressing has advantages if the tokens of a channel and the variables of a task are stored in the same memory. In that case the tokens and the variables should be mapped onto the same memory locations to avoid in-place copying in the memory during the transfer of data from and to the tokens. Such copying occurs for instance in Figure 25.9 at Line 07 where a value from the channel is copied into variable Cin. Furthermore, the cost of calling the load and store functions can be avoided. The disadvantage of direct addressing is that the addresses of the tokens are exposed to tasks. To avoid that tasks must take care of wrap-around in the FIFO, only scalar functions are available. Hence, typically it is more efficient to choose larger tokens if the synchronization rate has to be low. Figure 25.10 shows the IQ task using direct addressing on its output. We declare a pointer Cout in Line 04 that is given a value in Line 05. After the room has been acquired, Cout points to a block of 64 pixels. The channel data type is also a block of 64 pixels. The pointer Cout is used to set the value of the pixels in Line 09, avoiding a call of a store function. Similarly, Figure 25.11 shows the IZZ task using direct addressing on its input, avoiding both a call to a load function and a copy operation from the channel to a variable. Note that the granularity of synchronization between the IQ output and the IZZ input must be identical, because only scalar functions are available. For this reason, the IQ task and the IZZ task have become less re-usable.
01 void IQ::main()
02   for(int j=0; j<vi; j++)
03     for(int i=0; i<hi; i++)
04       VYApixel *Cout;
05       acquireRoom(CoutP, Cout);
06       for(int k=0; k<64; k++)
07         VYApixel Cin;
08         read(CinP, Cin);
09         Cout[k] = Cin * QuantTable[k];
10       releaseData(CoutP);
FIGURE 25.10 IQ using interface type DBI.

01 void IZZ::main()
02   while (true)
03     VYApixel *Cin;
04     VYApixel Cout[64];
05     acquireData(CinP, Cin);
06     for(int i=0; i<64; i++)
07       Cout[zigzag[i]] = Cin[i];
08     write(CoutP, Cout, 64);
09     releaseRoom(CinP);
FIGURE 25.11 IZZ using interface type DBI.

01 void IZZ::main()
02   if (!tryReAcquireData(CinP, 64)) return;
03   if (!tryReAcquireRoom(CoutP, 64)) return;
04   for (unsigned int i=0; i<64; i++)
05     VYApixel Cin;
06     load(CinP, i, Cin);
07     store(CoutP, zigzag[i], Cin);
08   releaseRoom(CoutP, 64);
09   releaseData(CinP, 64);
FIGURE 25.12 IZZ using interface type RN.
25.5.1.3 Non-Blocking Interface Types

So far, we only discussed interface types that provide blocking synchronization functions. These interfaces are easy to use because programmers do not have to program what should happen when access to tokens is denied. Sometimes blocking synchronization is not efficient, for instance, if the state of a task is large such that it is costly to save it. In that case it may be more efficient to let the programmer decide what should happen. For this reason, non-blocking synchronization functions are needed. Figure 25.12 shows how the IZZ task can be modeled as an actor. When the actor is fired, it first checks for available data on its input (Line 02) and then for available room on its output (Line 03). If the data is available but the room is not available, then the actor can return without saving its state. In the next firing, it can redo the checks since the tryReAcquire functions do not modify the state of the channels. If both the data and the room are available, it is guaranteed that the actor can complete its execution.

25.5.1.4 Channel and Task Merging and Splitting

Channel and task merging and splitting are important for load balancing. In Reference 15 we applied task merging to reduce the data transfer load, since the cost of data transfer from the IQ task to the IZZ task is large compared to the amount of computation that the IZZ task performs. Figure 25.13 shows how the IQ task and the IZZ task can be merged. The merging of the two tasks is based on the observation that the loop structure of the IZZ task fits in the loop structure of the IQ task. If one wants to merge two arbitrary tasks, this is not always the case.
01 void IQ_IZZ::main()
02   while (true)
03     for(int j=0; j<vi; j++)
04       for(int i=0; i<hi; i++)
05         VYApixel Cout[64];
06         for(int k=0; k<64; k++)
07           VYApixel Cin;
08           read(CinP, Cin);
09           Cout[zigzag[k]] = Cin * QuantTable[k];
10         write(CoutP, Cout, 64);
FIGURE 25.13 Merged IQ and IZZ task.

01 void IQ_IZZ::main()
02   while (true)
03     VYApixel mcu[vi][hi][64];
04     IQ(mcu);
05     for(int j=0; j<vi; j++)
06       for(int i=0; i<hi; i++)
07         IZZ(mcu[j][i]);
FIGURE 25.14 Statically scheduled IQ and IZZ tasks.
A more generic approach to statically schedule the firings of tasks is exemplified in Figure 25.14. The new task IQ_IZZ executes an infinite loop from which it calls the IQ and IZZ task by means of function calls. The communication between the IQ function and the IZZ function does not have to be synchronized explicitly because the calling order of the functions guarantees the availability of data and room. For this reason, we replace the channel by a variable mcu (minimum coding unit) that is declared in Line 03. The blocks in the mcu are passed by reference to the IQ function and the IZZ function.
25.5.2 Automated Transformation

We aim to automate the above-mentioned source code transformations to support the proposed method by tools. It is not the goal to automate the design decision making process, because experiences in high-level synthesis and compilation tools show that it is hard to automate this while maintaining transparency for users. Our goal is to automate the rewriting of the source code according to the design decisions of users. This approach has two advantages. First, design decisions are explicitly formulated rather than implicitly coded in the source code. Second, the source code can be rewritten automatically such that modifications and bug fixes in the original specification can be inserted automatically in architecture-specific versions of the code. In this way a large set of design descriptions can be kept consistent.

25.5.2.1 Parser Generation

The first step in automatic source code transformation is to be able to parse programs and to build data types that support source code transformation. For this purpose, we use an in-house tool called C3PO
in combination with the publicly available parser generator tool ANTLR [16]. C3PO takes a grammar as input and synthesizes data types for the non-terminals in the rules of the grammar as well as input for ANTLR. We use C3PO and ANTLR to generate a C++ parser and a heterogeneous abstract syntax tree (AST). We use the same tools to generate visitors for the AST that transform the code. After transformation, we generate new C++ or C code from the AST. The transformations that we target are typically inter-file transformations. For this reason, we process all source files simultaneously as opposed to the usual single file compilation for single processors.

25.5.2.2 Iterative Transformation

Source code transformation typically is an iterative process in which many versions of the same program are generated. Automatic source code transformation has the advantage that the generated source code is consistently formatted and that the transformations can be repeated if necessary. This makes it possible to keep all versions of a program consistent automatically. For version management we have adopted CVS. Each iteration uses three versions of the source code. The first version is the result of the previous iteration or the original code if it is the first iteration. The second version is manually augmented source code that is the input for the automatic transformation. The augmentation can contain for instance design constraints and design decisions. The third version is the code that is automatically generated. If the original code changes, for instance, due to bug fixes or specification changes, then the changes can be automatically inserted in the second version of the code by the version management tool. The modified second version of the code is then given as input to the transformation tools in order to produce the third version of the code that is the starting point for the next iteration.

25.5.2.3 Automatic Interface Type Refinement

We illustrate automatic interface refinement using the example of IQ and IZZ. The original source code of the tasks is given in Figure 25.5 and Figure 25.6. The resulting code is given in Figure 25.8 and Figure 25.9. The complete code is distributed over six files: a source file and a header file for the definition of each of the two tasks, and a source file and a header file for the definition of the task graph that instantiates and connects the two tasks. All these files require changes if the communication between the two tasks changes. This has been automated in the following way. We augment the source code of the tasks with synchronization constraints. In Figure 25.5 between Line 04 and 05 we add the line ENTRY(P) and at the end of the text we add the line LEAVE(P), both in the scope of the loop in Line 04. This annotation means that we want to synchronize the output of the IQ task on blocks of 64 pixels. Similarly we add synchronization constraints ENTRY(C) and LEAVE(C) to the IZZ task in Figure 25.6 between Line 04 and 05 and at the end of the text, respectively, both in the scope of the loop of Line 02. Assuming that the channel between the two tasks is called iqizzbuf, we provide the transformation tool with the design information shown in Figure 25.15. This information means that we want the iqizzbuf channel to have 64 tokens (Line 01). Furthermore, the channel should be implemented in data type Channel, it should handle tokens of type VYApixel, and it should connect to interface type RB both for output and for input (Line 02).
01 iqizzbuf[“64”]
02 Channel<VYApixel> USING RbIn, RbOut
03 “64”*IZZ::C <= “64”*IQ::P
04 “64”*IQ::P <= “64”*IZZ::C+64
05
06 STORAGE IQ Cout-> ../iqizzbuf TRANSFORMATION T1
07
08 STORAGE IZZ Cin-> ../iqizzbuf TRANSFORMATION T2
09
10 SYNCHRONIZATION IQ, IZZ -> iqizzbuf
FIGURE 25.15 Design constraints and decisions.
Line 03 and 04 denote the synchronization constraints: the amount of consumption should not exceed the amount of production, but the difference between the amount of production and consumption may not exceed the buffer capacity of the channel. Line 06 and 08 denote that the variables Cout and Cin of the IQ task and the IZZ task, respectively, should be mapped on the iqizzbuf channel using Transformation T1 and T2 that are available in a library. This introduces the calls to load and store functions. The result of the call to the load function in the IZZ task is stored in a new variable, also called Cin. Line 10 denotes that the IQ task and the IZZ task should be synchronized using the iqizzbuf channel. This introduces the calls to acquire and release functions at the positions indicated by the ENTRY and LEAVE annotations in the augmented source code. The resulting source code is given in Figure 25.8 and Figure 25.9.

25.5.2.4 Processor and Channel Binding

The last phase of source code transformation is the link to existing compilers and synthesizers in order to map the individual tasks to hardware and software. To this end, programmers specify a binding of tasks to processor types and processor instances. From that information the necessary input, that is, C files and makefiles, for compilation or synthesis to the target processor is generated. Furthermore, the programmer specifies specific implementations of channels. For instance, the same interface type can be implemented differently for intra-processor communication and for inter-processor communication because of efficiency reasons. Each implementation has its own set of names for its interface functions since function overloading is not available in C. The generated C code contains the data types and function calls that correspond to the implementations of the channels that the programmer has chosen.

25.5.2.5 Other Transformations

There are other transformations that are beyond the scope of this chapter. We briefly mention them here. We support structure transformation to change the hierarchy of task graphs. We support instance transformations such that multiple instances of the same task or task graph can be transformed individually. Finally, we plan to support channel and task merging and splitting [15] by connecting to the Compaan tool suite [17].
25.6 TTL on an Embedded Multi-DSP

In this section we present the implementation of TTL on a multi-DSP. The objectives are to show (1) how TTL can be implemented and that a TTL implementation is cheap, (2) trade-offs between the implementation cost and the abstraction level of the TTL interfaces, and (3) how TTL supports the exploration of trade-offs between, for example, memory use and execution cycles. The TTL implementation is done without special hardware support. We first present the multi-DSP architecture. Then we describe how the implementation of five TTL interface types has been done and we present quantitative results. Finally, the results for an implementation of an MP3 decoder application are presented.
25.6.1 The Multi-DSP Architecture

The embedded multi-DSP is a template that allows an arbitrary number of DSPs [18]. Each DSP has its own memory, which in limited ways can be accessed by (some of) the other DSPs. A DSP with memory and peripherals is called a DSP subsystem (DSS), see Figure 25.16. The DSPs do not have a shared address space. Communication between the DSSs is done through memory mapped unidirectional point-to-point links. Thus, two DSPs may refer to a single memory location with different addresses. Data may be routed from one point-to-point link to another and so on until it reaches its destination. In our instance, the DSP Epics7B from Philips Semiconductors was used. The DSP, which is mainly used for audio applications, has a dual Harvard architecture with a 24-bit wide data path and 12-bit coefficients.
FIGURE 25.16 Multi-DSP architecture. Here an instance with 16 DSP subsystems is shown.
25.6.2 TTL Implementation

There are two criteria to decide which TTL interface type to use for a certain application on a certain architecture. First the interface type must match the application characteristics. Second, the implementation of the interface type on the target architecture must be efficient. For audio applications, DBO and DNO are not needed because audio applications do not have large amounts of data that are produced or consumed out-of-order. Therefore, the other five interface types have been implemented on the multi-DSP architecture in order to determine the cost of the implementations. Most of the TTL functions have been implemented in optimized assembly code. It is justifiable to spend the effort because the TTL functions are implemented only once and used by many applications. A TTL channel implementation consists of two parts, the channel buffer and the channel administration. In the multi-DSP architecture, no special-purpose memories exist, so the channel buffer is a circular buffer in a RAM. This is where the tokens are stored. The channel administration is a structure that holds information about the state of the channel. In the multi-DSP architecture, the channel buffer has to be located in the memory of the DSS where the consumer is executed. This is due to the uni-directional point-to-point links in the architecture.

25.6.2.1 Channel Administration

The channel administration keeps track of how many tokens there are in the channel and how many of those are full and empty respectively. It also provides a way to get the next full and the next empty token from the channel. When the channel buffer is implemented as a circular buffer in a RAM, the channel administration can be implemented in two different ways with two variables to keep track of the state of the channel. The first alternative is to use two pointers, one to the start of the empty tokens and one to the start of the full tokens. The second alternative is to have one pointer and one counter, for example, a pointer to the start of the full tokens and a counter telling how many full tokens there are in the channel. This requires atomic increment and decrement operations, which are not supported on the multi-DSP architecture. Therefore the channel administration is implemented with two pointers. The producer updates the pointer to the empty tokens (write_pointer) and the consumer updates the pointer to the full tokens (read_pointer) and thereby no atomic operations are needed [7]. Another method to avoid the need for atomic updates is to use two counters and two pointers. That method is explained in Section 25.7. When the two pointers point to the same memory location, it is not clear if the channel is full or empty unless wrap-around counters are used. Wrap-around counters imply expensive acquire functions. To avoid that problem we have implemented a channel administration that does not allow the pointers to point to the same memory location unless the channel is empty. We thereby have a memory overhead of the size of one token in the channel buffer. In the indirect interfaces the token size is always one word. Both the producer and the consumer need to access the channel administration. In the multi-DSP there are no shared memories, therefore the channel administration has to be duplicated and present in
the two DSSs involved in the communication. The two copies are called the local and remote channel administration. See Figure 25.17.

Producer side: READ_POINTER, WRITE_POINTER, BASE_POINTER, CH_SIZE, BASE_RA, REMOTE_POINTER
Consumer side: WRITE_POINTER, READ_POINTER, BASE_POINTER, CH_SIZE, BASE_RA, REMOTE_POINTER

FIGURE 25.17 Double channel administration for the indirect interface types.

Since the producer and the consumer refer to the channel buffer with different addresses, this must be taken into consideration when updating the remote channel administration. We keep a pointer to the base address in the local address space (base_pointer) and a pointer to the base address in the remote address space (base_ra). These two pointers are used to calculate the pointer value to be stored in the remote channel administration. The channel administrations as well as the channel buffer must be located in memory areas that are accessible via the point-to-point links. As an example of the implementation of the TTL functions, pseudo code for the tryReAcquireData function in RN is shown in Figure 25.18.

01 Boolean tryReAcquireData(port p, uint n) {
02   uint available_data;
03   available_data = (p->write_pointer - p->read_pointer) modulo p->ch_size;
04   if (available_data >= n)
05     return true;
06   else
07     return false; }

FIGURE 25.18 Pseudo code for tryReAcquireData (RN).
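To complement Figure 25.18, the following sketch shows how a release function might update both copies of the administration. It is not the actual Epics7B code; the rendering of the pointer translation via base_pointer and base_ra is only a sketch in the same pseudo-code style.

/* Sketch: producer-side release updating the local administration and
   mirroring the write pointer, translated into the consumer's address
   space, into the remote administration over the point-to-point link. */
void releaseData(port p, uint n) {
  uint offset;
  offset = ((p->write_pointer - p->base_pointer) + n) % p->ch_size;
  p->write_pointer = p->base_pointer + offset;            /* local copy  */
  p->remote_pointer->write_pointer = p->base_ra + offset; /* remote copy */
}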
25.6.3 Implementation Results

The acquire functions for the RN interface type use 9 instructions. The release functions use 15 instructions. The vector load and store functions use a loop unrolling of 2 and achieve 2.5 instructions per data word with an overhead of 24 instructions to set up the data transfer. The scalar load and store functions are in-lined in the task code and each use 10 instructions. The acquire functions for the direct interface type DNI use between 19 and 33 instructions, dependent on the state of the channel. The release functions use between 29 and 38 instructions. No data transfer functions are used. The cost of the data transfers is comparable to the cost of accessing private data structures in the task. For the blocking interface types, it is not as easy to determine the cost in terms of instructions for the individual acquire functions, because they may include task switches. However, an acquire function in RB that does not trigger a task switch uses 18 instructions. The release functions and the data transfer functions for RB have the same cost as those for RN. The same applies to the release functions of DBI with respect to DNI. In CB, synchronization and data transfer is combined into one single function. The cost of the implementation is approximately the sum of the costs of the three corresponding functions in RB.

25.6.3.1 Evaluation Application

An MP3 decoder application has been used for the evaluation of the TTL implementations on the multi-DSP. The MP3 decoder application was available as a sequential C program.
TABLE 25.2 Simulation Results for the Whole Application

TTL IF type    #Cycles       Part in TTL (%)    #Memory words
CB             45,579,603    2.9                12493
RB             45,551,243    2.8                12494
RN             45,505,950    2.2                12365
DBI            45,152,454    1.1                9162
DNI            45,108,086    0.5                9041
The application was converted into a TTL task and additional TTL tasks were added for mimicking the rest of a complete application and for handling the interaction with the simulation environment. The application has been implemented with all five interface types. The RN and DNI implementations use TTL actors and the other types use TTL processes. The application has also been implemented with four different granularities of the communication for the RN interface type. In the implementations with the direct interface types, DNI and DBI, the channel between the input task and the MP3 task uses RN and RB interface types respectively. This is due to the fact that the amount of communicated data on that channel is data dependent.

25.6.3.2 Simulation Results

Table 25.2 shows the results of the various interface types with frame-based communication. All channel buffers have been sized so that they can hold one frame. The memory is the total data memory for the whole application. The number of cycles is the number used by the whole application to decode a test file. The blocking implementations use somewhat more memory and have some cycle overhead compared to the non-blocking implementations, when comparing RB to RN and DBI to DNI. This is due to the fact that the multi-tasking costs both memory for storing register contents and cycles to save and restore the register contents. The DNI and DBI interface types use considerably less memory and fewer cycles than the other interface types. This is because the data in the channels is accessed directly, without copying it to and from private variables. The CB version has similar performance to the RB version. For the DNI implementation, about 0.5% of the cycles are spent in the TTL functions and 99.5% of the cycles are spent in the tasks. This is of course dependent on the application as well as on the implementation of the TTL functions. Figure 25.19 shows the trade-offs that can be made by changing the granularity of the communication. Here a change of a factor of 36 has been pursued on the channel between the MP3 task and the output task. In the MP3 decoder, this is made possible by using a sub-frame decoding method, which allows the MP3 decoder to output blocks smaller than a frame. The memory is reduced both in the channel buffer, in the MP3 task, and in the output task. The channel buffer sizes have been adjusted to match the granularity of the communication. The cycle overhead for the small granularity communication has two reasons. Smaller granularity implies more frequent synchronization calls and smaller buffers imply more frequent task switching. The implementation of CB allows channel buffers to be smaller than the vector sizes used by the tasks. One of the advantages with CB is that the channel buffer size can be reduced to achieve a memory-cycle trade-off without rewriting the tasks themselves. Results for this are shown in Figure 25.20.
FIGURE 25.19 Simulation results for RN, when changing the communication granularity (plot of data memory in words versus cycles, for full-frame, 1/2-frame, 1/4-frame, and 1/36-frame granularity).

FIGURE 25.20 Simulation results for CB, when changing the channel buffer size (plot of data memory in words versus cycles, for full-frame, 1/2-, 1/4-, 1/8-, 1/16-, and 1/32-frame buffers).

25.6.4 Implementation Conclusions

It has been shown that TTL can be implemented efficiently on a multi-DSP architecture. It has also been shown that changing the granularity of the communication of the tasks has great impact on the memory-cycle trade-off. The direct interfaces in TTL provide benefits both regarding the memory usage and the cycle overhead. As expected, the most abstract interface type, CB, is also the most expensive to use. This proves the value of automating transformations between the various implementation alternatives.
25.7 TTL in a Smart Imaging Core

The objective of this section is to show that the implementation of TTL in hardware, software, and mixed hardware/software is possible with reasonable costs. The implementation allows the buffer size and the buffer location to be changed and the channel administration to be relocated. This section first discusses the smart imaging core followed by a detailed description of the TTL implementation including performance results.
25.7.1 The Smart Imaging Core

Smart imaging applications combine image and video capturing with the processing and/or interpretation of the scene contents. An example is a camera that is able to segment a video sequence into objects, track
some of them, and raise an alarm if some of these objects show an unusual behaviour. The smart imaging core described here can be embedded in a camera and is suited for automotive and mobile communication applications. Example applications are pedestrian detection [19], low-speed obstacle detection [20], and face tracking. Each of the smart imaging applications uses low-level pixel processing, typically on image segments, for an abstraction of the scene contents (feature extraction). Furthermore, motion segmentation is used to help in tracking objects in the scene. The applications are structured such that the more control-oriented parts are combined together in a task that fits well on a CPU. All the low-level pixel processing is combined together in a pixel processing task, which is mapped onto a smart imaging coprocessor. Likewise, the main processing part of the motion segmentation is described as an independent task, which is mapped onto a motion estimator coprocessor. The architecture of the smart imaging core is depicted in Figure 25.21. More details of the architecture can be found in Reference 21.

FIGURE 25.21 Architecture of the smart imaging core.

The architecture globally consists of an ARM CPU, a video input unit, and two coprocessors: the motion estimator (ME) and the smart imaging (SI) coprocessor. The tasks on the coprocessors and the ARM communicate with each other using the TTL interface. By adopting the TTL interface for the coprocessors, we expect that the integration of these blocks into future systems will be significantly simplified.
25.7.2 TTL Shells

This subsection presents the TTL shells used for the smart imaging core. These are a full hardware shell for the SI coprocessor and software shells for the ARM and the motion estimator (VLIW) coprocessor.

25.7.2.1 TTL Shell for the SI Coprocessor

The TTL interface type used for the SI is the RB interface type using indirect data access. As already explained in the Multi-DSP section a TTL channel implementation consists of two parts, the channel buffer and the channel administration. In the SI core the channel buffers are always located in main (on-chip) memory. The channel administration can be placed both in the shell and in main memory.
Producer side: BASE_POINTER, CH_SIZE, TOKEN_SIZE, N_WRITTEN, N_READ, WRITE_POINTER, REMOTE_POINTER
Consumer side: BASE_POINTER, CH_SIZE, TOKEN_SIZE, N_WRITTEN, N_READ, READ_POINTER, REMOTE_POINTER

FIGURE 25.22 Channel administration.
FIGURE 25.23 TTL signal interface. (Handshake and control signals between coprocessor and shell: request, acknowledge, port_type, prim_req_type, is_non_blocking, is_granted, port_id, offset, esize(n), wr_req, wr_data, wr_ack, rd_req, rd_data, rd_ack.)
We also use two copies of the channel administration; one at the producer side and another at the consumer side. Figure 25.22 depicts the channel administration structure. To make sure that the channel status is handled correctly by both the producer and consumer, without the need for atomic access to the variables of the channel administration, n_written, n_read, read_pointer, and write_pointer are used. Only the producer modifies n_written and write_pointer. Similarly only the consumer modifies n_read and read_pointer. The equation “n_written − n_read” is used to calculate the amount of data available in the channel while “ch_size − (n_written − n_read)” is used to derive the amount of free room. The hardware implementation includes correct handling of wraparounds. With this approach both the consumer and producer have a conservative view on the channel status. The use of two token counters n_read and n_written instead of two pointers as in the Multi-DSP case is due to the variable token_size that can be handled with this implementation. Using the counters the implementation of the acquire functions is more efficient because no multiplication with token_size is needed. The variable remote_pointer is used to reference the remote channel administration. The variable base_pointer together with the offset parameter provided through the TTL load and store calls is used to calculate the physical address for accessing the channel buffer. The buffer behaves as a FIFO and is implemented as a fixed size (ch_size) circular buffer. This results in the equation “address = base_pointer + (read_pointer + offset * token_size) % ch_size” for each read access. The hardware implementation consists of a hardware shell with two interfaces. A TTL signal interface connects the shell to a coprocessor, see Figure 25.23. A DTL interface connects the shell to the communication interconnect. The request and acknowledge signals are used to handshake a TTL call from the coprocessor to the shell. The shell is able to handle both input (port_type = 0) and output ports (port_type = 1). Both the RB and RN interface types are supported through the is_non_blocking and is_granted signals. The signal is_granted indicates if access is granted in non-blocking acquire operations. The esize signal indicates the vector length (max. 2^ns tokens). The wr_ and rd_ signals are used to handshake the load and store data between the coprocessor and shell. In total 2^np logical ports (port_id) are
handled independently by the shell. The TTL calls for these ports are however handled sequentially as each TTL call has to first finish by reasserting the acknowledge signal, hence only one set of port_id, offset, etc. signals is provided. Usually this limitation is acceptable as the number of concurrent ports that can be handled efficiently is also limited by the physical connection of the shell to the DTL infrastructure (see Figure 25.21). Using one DTL interface with multiple sets of TTL signals complicates the shell significantly. In that case multiple state-machines have to arbitrate for the DTL access. Obviously also multiple DTL interfaces for the shell could be used. This however is similar to just using multiple (simple) shells for one coprocessor as we provide. The shell architecture consists of an indirect addressing unit, a control unit, two ALUs, and a DTL interface unit. Furthermore, it includes a table that maps port_ids to their respective channel administration. The table brings the flexibility to map the channel administration at an arbitrary location (even within the shell itself). For the smart imaging core this flexibility enables the designer to optimize the communication infrastructure per application. For a channel between tasks on the ARM (producer) and SI (consumer), the channel buffer and administration of the producer side are mapped in main memory, while the consumer side administration is mapped in the hardware shell. This mapping minimizes the time spent for the acquire functions as the administrations are distributed and mapped locally to the producer and consumer. The performance of the hardware shell with the channel administration local in the shell is as follows: the acquire functions use 5 cycles, the release functions use 7 cycles, the load function uses 5 + 2n cycles, and the store function uses 5 + n cycles. The parameter n is the number of tokens specified in the TTL call. In total, the implementation of a shell for the SI coprocessor with 2 logical ports, synthesized for 100 MHz, takes ∼0.2 mm² (∼8.3 Kcells) in 0.18 µm CMOS technology.

25.7.2.2 TTL Shell for the ME and ARM

The motion estimator coprocessor is synthesized using a high-level architectural synthesis tool called A|RT Designer. A|RT Designer takes as input a C description of an algorithm and generates a custom VLIW processor, consisting of a data path and a controller. The controller contains an FSM, which determines the next instruction to be executed, and a micro-code ROM memory, where the motion estimation main task is stored. The data path is parameterizable in the number and the type of functional units (FUs) used. The communication to the external system is arranged via input and output FUs that implement a standard DTL interface towards the system and have a simple register file interface inside the VLIW. The output of A|RT Designer is a synthesizable RTL description of the generated VLIW processor. Instead of implementing the TTL shell for the ME completely in hardware as an FU, the implementation choice for the ME is to use the A|RT high-level synthesis tool for the implementation of both the motion estimation main process and the shell. In this case the implementation of the TTL functions is by means of C-code that is executed on the ME VLIW. This is achieved by compiling the TTL C-code together with the design description of the ME. The A|RT high-level synthesis tool adds the TTL implementation into the microcode of the coprocessor, and executes it as part of the VLIW program.
The physical communication is done via the FUs that provide standard DTL interfaces. The ARM software implementation is a simple C-code compilation of the TTL functions. The number of ARM instructions used is: 40 for the acquire functions; 42 for the release functions; 27 for (scalar) load and store functions (6 extra for each element of a vector).
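Whether realized in the SI hardware shell, in the ME microcode, or in ARM software, the channel bookkeeping follows the equations of Section 25.7.2.1. The following C sketch summarizes them; it is our own model, not the shell's RTL, and since the chapter writes both the token count and the byte size of the buffer as ch_size, we take ch_size in tokens here and convert it to bytes for the address wrap-around.

/* Sketch: channel status and read-address calculation performed by the
   shell (field names follow Figure 25.22; p is a channel administration). */
uint dataAvailable(port p) {
  return p->n_written - p->n_read;      /* unsigned counters wrap safely */
}
uint roomAvailable(port p) {
  return p->ch_size - (p->n_written - p->n_read);
}
uint readAddress(port p, uint offset) {
  uint size_bytes = p->ch_size * p->token_size;  /* buffer size in bytes */
  return p->base_pointer +
         (p->read_pointer + offset * p->token_size) % size_bytes;
}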
25.8 Conclusions

We have presented an interface-centric design method based on a task-level interface named TTL. TTL offers a framework of interoperable interface types for application development and for implementing applications on platform infrastructures. We have shown that the TTL interface provides a basis for a method and tools for mapping applications onto multiprocessors. Furthermore, we have demonstrated that TTL can be implemented efficiently on two different architectures, in hardware and in software.
Industry-wide standardization of a task-level interface like TTL can help to establish system-level design technology that supports efficient MPSoC integration with reuse of function-specific hardware and software modules across companies. Future extensions of TTL will be concerned with the modeling of timing constraints and associated verification technology.
Acknowledgments

We acknowledge the contributions of Jeffrey Kang, Ondrej Popp, Dennis Alders, Avneesh Maheshwari, and Ghiath Alkadi from Philips Research and of Victor Reyes from the University of Las Palmas. The work on smart imaging is partly sponsored by the European Commission in the IST-2001-34410 CAMELLIA project.
References

[1] Rowson, J.A. and A. Sangiovanni-Vincentelli. Interface-Based Design. In Proceedings of the 34th Design Automation Conference, Anaheim, June 1997, pp. 178–183.
[2] Cesário, W.O., D. Lyonnard, G. Nicolescu, Y. Paviot, S. Yoo, A.A. Jerraya, L. Gauthier, and M. Diaz-Nava. Multiprocessor SoC Platforms: A Component-Based Design Approach. IEEE Design and Test, 19(6), 52–63, November–December 2002.
[3] Dziri, M.-A., W. Cesário, F.R. Wagner, and A.A. Jerraya. Unified Component Integration Flow for Multi-Processor SoC Design and Validation. In DATE-04, Paris, February 16–20, 2004, pp. 1132–1137.
[4] Cai, L. and D. Gajski. Transaction Level Modeling: An Overview. CODES + ISSS, Newport Beach, October 1–3, 2003, pp. 19–24.
[5] De Kock, E.A., G. Essink, W.J.M. Smits, P. van der Wolf, J.-Y. Brunel, W.M. Kruijtzer, P. Lieverse, and K.A. Vissers. YAPI: Application Modeling for Signal Processing Systems. In Proceedings of the 37th DAC, Los Angeles, June 5–9, 2000, pp. 402–405.
[6] Rutten, M.J., J. van Eijndhoven, E. Jaspers, P. van der Wolf, O.P. Gangwal, A. Timmer, and E.J. Pol. A Heterogeneous Multiprocessor Architecture for Flexible Media Processing. IEEE Design and Test, 19(4), 39–50, July–August 2002.
[7] Nieuwland, A., J. Kang, O.P. Gangwal, R. Sethuraman, N. Busa, K. Goossens, R.P. Llopis, and P. Lippens. C-HEAP: A Heterogeneous Multi-Processor Architecture Template and Scalable and Flexible Protocol for the Design of Embedded Signal Processing Systems. Journal of Design Automation for Embedded Systems, 7(3), 229–266, 2002.
[8] Catthoor, F., K. Danckaert, C. Kulkarni, E. Brockmeyer, P.G. Kjeldsberg, T. Van Achteren, and T. Omnes. Data Access and Storage Management for Embedded Programmable Processors. Kluwer, Dordrecht, 2002.
[9] Wong, C., P. Marchal, and P. Yang. Task Concurrency Management Methodology to Schedule the MPEG4 IM1 Player on a Highly Parallel Processor Platform. In Proceedings of the CODES 2001, Copenhagen, April 25–27, 2001, pp. 170–177.
[10] Grotker, T., S. Liao, G. Martin, and S. Swan. System Design with SystemC. Kluwer, Dordrecht, 2002.
[11] ARM, AMBA AXI Protocol Specification, June 2003.
[12] OCP International Partnership, Open Core Protocol Specification, 2.0 Release Candidate, 2003.
[13] Philips Semiconductors, Device Transaction Level (DTL) Protocol Specification, Version 2.2, July 2002.
[14] Brunel, J.-Y., W.M. Kruijtzer, H.J.H.N. Kenter, F. Petrot, L. Pasquier, E.A. de Kock, and W.J.M. Smits. COSY Communication IP's. In Proceedings of the 37th DAC, Los Angeles, June 5–9, 2000, pp. 406–409.
[15] De Kock, E.A. Multiprocessor Mapping of Process Networks: A JPEG Decoding Case Study. In Proceedings of the International Symposium on System Synthesis (ISSS), Kyoto, October 2–4, 2002, pp. 68–73.
[16] http://www.antlr.org/.
[17] Cimpian, I., A. Turjan, E.F. Deprettere, and E.A. de Kock. Communication Optimization in Compaan Process Networks. In Proceedings of the 4th International Workshop on Systems, Architectures, Modeling and Simulation (SAMOS), 2004.
[18] Schiffelers, R. et al. Epics7B: A Lean and Mean Concept. In Proceedings of International Signal Processing Conference 2003, Dallas, TX, March 31–April 3, 2003.
[19] Abramson, Y. and B. Steux. Hardware-Friendly Pedestrian Detection and Impact Prediction. In IEEE Intelligent Vehicle Symposium 2004, Parma, Italy, June 2004.
[20] Steux, B. and Y. Abramson. Robust Real-Time On-Board Vehicle Tracking System Using Particles Filter. In IFAC Symposium on Intelligent Autonomous Vehicles, July 2004.
[21] Gehrke, W., J. Jachalsky, M. Wahle, W.M. Kruijtzer, C. Alba, and R. Sethuraman. Flexible Coprocessor Architectures for Ambient Intelligent Applications in the Mobile Communication and Automotive Domain. In Proceedings of the SPIE VLSI Circuits and Systems, Vol. 5117, April 2003, pp. 310–320.
26 A Multiprocessor SoC Platform and Tools for Communications Applications

Pierre G. Paulin, Chuck Pilkington, Michel Langevin, Essaid Bensoudane, and Damien Lyonnard
Advanced System Technology, STMicroelectronics

Gabriela Nicolescu
Ecole Polytechnique de Montreal

26.1 Introduction
26.2 Wire-Speed Packet Forwarding Challenges
     Impact on NPU Architectures • Survey of Multiprocessor SoC Platforms
26.3 Platform and Development Environment Overview
26.4 StepNP Architecture Platform
     StepNP Processors • Network-on-Chip • Use of Embedded FPGAs and Embedded Sea-of-Gates • Configurable Processor Implementation • Hardware Processing Elements
26.5 Multiflex MP-SoC Tools Overview
26.6 Multiflex Modeling and Analysis Tools
     Modeling Language • Hardware Multithreaded Processor Models • SOCP Network-on-Chip Channel Interface • Distributed Simulation Using SOCP
26.7 Multiflex Model Control-and-View Support Framework
     SystemC Structural Introspection • SIDL Interface • Instrumentation of an SOCP Socket with SIDL
26.8 Multiflex Programming Models
     Survey of Multiprocessor Programming Models • Multiflex Programming Model Overview
26.9 DSOC Programming Model
     DSOC Message Passing • The DSOC ORB
26.10 SMP Programming Model
     Target SMP Platform • Hardware Support for SMP RTOS • Interoperable DSOC and SMP Programming Models
26.11 An IPv4 Packet Forwarding Application
     Networking Application Framework • StepNP Target Architecture • Simulation Speed • Multiprocessor Compilation and Distribution • IPv4 Results
26.12 A Traffic Manager Application
     Application Reference • DSOC Model • StepNP Target Architecture • DSOC + SMP Model • Experimental Results • Hardware Implementation
26.13 Summary
26.14 Outlook
Acknowledgments
References
26.1 Introduction

The continuing growth in network bandwidth and services, the need to adapt products to rapid market changes, and the introduction of new network protocols has created the need for a new breed of high performance, flexible system-on-chip (SoC) design platforms. Emerging to meet this challenge is the network processor unit (NPU). An NPU is an SoC that includes a highly integrated set of programmable or hardwired accelerated engines, a memory subsystem, high-speed interconnect, and media interfaces to handle packet processing at wire speed [1]. Programmable NPUs preserve customers’ investments by letting them track ongoing specification changes [2]. By developing a programmable NPU as a reusable platform, network designers can amortize a significant design effort over a range of architecture derivatives. They can also meet technical challenges arising from a product’s time-to-market constraints, as well as economic constraints arising from a product’s short time-in-market. Network processors present a whole new set of requirements. In our bandwidth hungry world, OC12 and OC48 network speeds are becoming common. On the horizon is OC192 (10 Gb/sec), which allows for less than 50 nsec of processing time per packet received. It is clear that traditional microprocessors cannot keep up with the speed and programmability requirements of network processors. In the next section, we examine high-speed packet processing requirements, and highlight the resulting NPU challenges. We then describe the StepNP flexible MP-SoC platform and its key architectural components. This is followed by a review of the Multiflex modeling and analysis tools developed to support this platform. We present the distributed message passing and symmetrical multiprocessing (SMP) parallel programming models supported in Multiflex. We describe the approach used for real-time task scheduling and allocation, which automatically maps high-level parallel applications onto the StepNP platform. Finally, we present detailed results of the mapping of IPv4 packet forwarding and traffic management applications onto the StepNP platform, for a range of architectural parameters. We also provide an outlook on the use of this approach for consumer multimedia (audio, video) applications.
26.2 Wire-Speed Packet Forwarding Challenges

Packet forwarding over a network includes the following main tasks [1]:

• Header parsing: This consists of pattern matching of bits in the header field.
• Packet classification: Identification of the packet type (e.g., IP, MPLS, ATM) and attributes (e.g., quality of service requirement, encryption type).
• Lookup: Consists of looking up data based on a key. It is mostly used in conjunction with pattern matching to find a specific entry in a table.
• Computation: This varies widely by application. Examples include checksum, CRC, time-to-live field decrement, and data encryption (see the sketch following this list).
• Data manipulation: Any function that modifies the packet header.
• Queue management: Scheduling and storage of incoming and outgoing packet data units.
• Control processing: Encompasses a large number of different tasks that usually do not need to be performed at wire speed. These are usually performed on a standard fast reduced instruction set computer (RISC) processor linked to the NPU. This so-called control plane is not the focus of this chapter.
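As a concrete illustration of the computation and data manipulation steps listed above, the following C sketch decrements the IPv4 time-to-live field and incrementally patches the one's-complement header checksum in the style of RFC 1141. It is illustrative only, not taken from this chapter, and it ignores byte ordering and the checksum corner case treated in RFC 1624.

#include <stdint.h>

/* Decrement TTL and update the one's-complement header checksum.
   TTL is the high byte of its 16-bit header word, so TTL-1 means the
   stored checksum increases by 0x0100 (with end-around carry). */
void decrement_ttl(uint8_t *ttl, uint16_t *checksum)
{
    uint32_t sum = *checksum + 0x0100u;
    *ttl -= 1;
    *checksum = (uint16_t)(sum + (sum >> 16));  /* fold the carry */
}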
26.2.1 Impact on NPU Architectures

Wire-speed packet forwarding, at rates often exceeding 1 Gb/sec, poses many more challenges than general-purpose data processing. First of all, data locality can be poor in network applications. One packet arrival has little in common, generally, with any other packet arrival. This eliminates much of the utility of the traditional data cache used in general-purpose processors. In network processing, both memory capacity and bandwidth are extremely demanding. Routing lookup tables are not only extremely large, but they must also support high throughputs. The interconnect between processors, memories, and coprocessors must support a very high, cost-effective, scalable bandwidth [2–5]. Furthermore, a variety of bit-oriented data manipulations are needed throughout the packet processing. Therefore, specialized instructions for efficient multibit field manipulation are an important requirement [6]. A key aspect of efficient NPU hardware (HW) use is latency hiding. Three main approaches are used to hide latency:

1. Multithreading
2. Memory prefetching
3. Split-transaction interconnect

The most common latency-hiding approach is multithreading, which efficiently multiplexes a processing element’s (PE’s) hardware. Multithreading lets the hardware process other streams while another thread waits for memory access or coprocessor execution. Most NPUs have separate register banks for different threads, with hardware units that schedule threads and swap them in one cycle. We call this function hardware multithreading.
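The following toy C model, ours and not drawn from any particular NPU microarchitecture, captures the essence of hardware multithreading: one register bank per thread and a scheduler that selects a ready thread in a single cycle when the running thread stalls.

#define NTHREADS 4

/* One hardware context per thread: private registers plus a stall flag. */
typedef struct {
    unsigned pc;         /* program counter                     */
    int      regs[32];   /* private register bank               */
    int      stalled;    /* set while a memory reply is pending */
} HwThread;

static HwThread threads[NTHREADS];

/* Round-robin selection of the next ready thread; if every thread is
   stalled on memory, the pipeline simply bubbles. */
int next_thread(int current)
{
    for (int i = 1; i <= NTHREADS; i++) {
        int t = (current + i) % NTHREADS;
        if (!threads[t].stalled)
            return t;
    }
    return current;
}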
26.2.2 Survey of Multiprocessor SoC Platforms

A number of multiprocessor platforms designed for SoC-scale applications have been described. Daytona [7] was an early attempt to reach high DSP performance through MIMD (multiple instruction, multiple data) PEs. Each PE consists of a 32b GP-RISC and a vector unit with four 16b-MACs. The performance reaches a peak value of 1.6 billion 16b-MACs, assuming no cache misses. Such results are extremely dependent on the instruction locality and require homogeneous data stream rates. This would not be expected for applications that are more control dominated. The PROPHID-based platform [8], namely Eclipse [9], has already been tuned into several dedicated instances. Among them is the well-known Viper [10] that provides set-top box applications with relevant multimedia features. Unfortunately, the use of numerous application-specific hardware accelerators inevitably leads to the high NRE costs of ASIC-style design. The Mescal system developed at U.C. Berkeley [11] allows a platform designer to build a platform instance in a targeted, domain-specific way. This is achieved through a range of activities, spanning PE architecture and microarchitecture design to network topology definition, with the assistance of the Mescal development system [12]. An OSI-like message passing model [13] is used. While this approach may be used to achieve the best cost/performance trade-off, it still implies high design and mask-set NREs. S3E2S [14] is a design environment for heterogeneous multiprocessor architectures based on libraries of components. A sequential model of the application is first translated into a CDFG-like object graph. Then each object is targeted to the most relevant processor selected from the libraries. The design-space exploration addresses the choice of CPU (ranging from GP CPUs and DSPs to highly specific ones) while
taking the local memory accesses into account. Message passing is the only supported communication mechanism, and the NoC topology and implementation are not addressed (nor modeled).
While there are a number of commercial NPU platforms on the market [1], to our knowledge, there are no platforms built specifically as an exploration environment for multiprocessor design methodologies and tools. Ideally, such a platform would be built with only public domain components and be distributable to a large R&D community. For example, the Mescal environment [11] is currently based on a commercial NPU platform, which necessarily has limited distribution rights.
26.3 Platform and Development Environment Overview
In order to explore tool and architectural issues in a range of high-speed communications applications, we have developed a system-level exploration platform for multiprocessor system-on-chip (MP-SoC) architectures and the associated platform automation tools. Our current focus is on packet processing applications used in network infrastructure SoCs. As depicted in Figure 26.1, the MP-SoC environment consists of three main components:
1. An application development framework supporting parallel programming models.
2. The StepNP™ high-level multiprocessor-architecture simulation model.
3. The Multiflex toolset for the programming, analysis, and debug of complex parallel applications.
In developing this MP-SoC exploration platform, we had several objectives. We wanted the platform to be:
• A challenging internal driver for our existing system design technology [15,16] and embedded system [17] developments, as well as a driver for high-level multiprocessor platform methods under development.
FIGURE 26.1 MP-SoC exploration environment: (a) parallel programming models; (b) reference platform; (c) MultiFlex SoC tools.
• A vehicle for long-term research in multiprocessor-architecture exploration, design tools, and methods.
• An open, easily accessible environment, built with public-domain components as much as possible.
Furthermore, the architecture platform must also be representative of real NPU characteristics, or at least serve as a baseline from which realistic NPU architectures could be easily derived. The main characteristics we chose to implement in StepNP are as follows:
• Scalable multiprocessing.
• Exploration of simple processor pipeline configurations.
• Scalable hardware multithreading capability for the embedded processor models.
• Support for a range of interconnect topologies to link the processor array, coprocessors, memories, and I/O.
Finally, the Multiflex tools and the application framework must allow for:
• Rapid development of packet processing applications.
• An extensible tool framework for embedded software (SW) development, and model control, debug, and analysis.
• Support of high-level parallel programming models, based on an appropriate mix of distributed message passing and shared-memory paradigms.
26.4 StepNP Architecture Platform
Commercial NPUs feature a wide range of architectural styles, from fully software programmable to almost completely hardwired [1]. For the StepNP initial platform, or base platform, we use a fully programmable architecture based on standard RISC processors, and a simple but flexible interconnect. The base platform allows easy plug-and-play replacement with more specialized processors, coprocessors, and interconnect.
Figure 26.2 depicts the StepNP flexible MP-SoC architecture platform. The StepNP platform instance used here includes:
• Models of (re)configurable processors
• A network-on-chip (NoC)
• Reconfigurable HW (embedded field programmable gate array [FPGA] or embedded configurable sea-of-gates) and standard HW
• Communication-oriented I/Os
FIGURE 26.2 StepNP flexible communications platform.
Note that, aside from these domain-specific I/Os, StepNP is a general-purpose, flexible multiprocessor platform. It has subsequently been used for consumer audio and video applications, based on a variant of the StepNP platform that uses different I/Os.
26.4.1 StepNP Processors
It is our conviction that the large-scale use of software programmable embedded processors will emerge as the key means to improve flexibility and productivity [18–20]. These processors will come in a wide diversity, from general-purpose RISC to specialized application-specific instruction-set processors (ASIPs), with different trade-offs in time-to-market versus product differentiation (power, performance, cost). Domain- or application-specific processors will play an important role in bridging the gap between the required ease-of-use and high flexibility of general-purpose processors on one end, and the higher speed and lower power of hardware on the other. Configurable processors are one possible means to achieve processor specialization from a RISC-based platform.
A common requirement for all classes of processors is the efficient handling of the latencies of the interconnect, memory, and coprocessors. A variety of approaches can be used, including caching, multithreading, memory prefetching, and split-transaction interconnect. Multithreading lets the processor execute other streams while another thread is blocked on a high-latency operation. A hardware multithreaded processor has separate register banks for different threads, allowing low-overhead switching between threads, often with no disruption to the processor pipeline.
The StepNP simulation framework allows easy integration of a range of general-purpose to application-specific processor models. We have integrated public domain instruction-set models of the most popular RISC processors. The base StepNP architecture platform includes the public-domain models (http://www.fsf.org) of the ARM™ v4, PowerPC™ (versions 603, 603a, and 604), and MIPS™ (32- and 64-bit) instruction-set architectures, as well as the Stanford DLX processor model [21]. In order to explore network-specific instruction-set optimizations, the Tensilica Xtensa™ configurable processor model [22] has been integrated by our academic research partners [6]. Other researchers within ST have demonstrated the use of embedded FPGA to implement user-defined instructions, therefore implementing a reconfigurable processor [23].
For the exploration of ASIPs, we support the inclusion of instruction-set simulator (ISS) models generated from the Coware™/LisaTek instruction-set simulation model generator toolset [24]. As a first demonstrator of this approach, we have developed a LisaTek-based ISS for the Xilinx MicroBlaze™ soft RISC processor, and are currently extending it for hardware multithreading. Researchers can use this as a basis for further architecture extension or specialization.
26.4.2 Network-on-Chip
A key component of an MP-SoC platform is the interconnect technology. The StepNP platform makes a very important assumption on the interconnect topology: namely, it uses a single interconnect channel that connects all I/Os and PEs. An orthogonal, scalable interconnect approach with predictable bandwidth and latency is essential:
1. It provides a regular, plug-and-play methodology for interconnecting various hardwired, reconfigurable, or SW programmable IPs.
2. It supports the high-level communication between processes on multiple processors, and simplifies the automatic mapping onto the interconnect technology. However, it moves the complexity of the effective use of communication resources to the resource allocation tools, which must be tuned to the interconnect topology. This will be explored in further sections.
We advocate the recent so-called NoC approaches currently under development [3,4]. We also strongly support the need for a standard NoC interface definition. Among others, we are currently using the OCP-IP standard [25] in our SoC platform developments [26], as discussed further.
FIGURE 26.3 NoC topologies. Bus-like: low latency; blocking (large contention); not easy to scale, needs hierarchy. Tree-like: medium latency; blocking (blind routing); medium scalability. Ring-like: large latency; can be nonblocking; scalable. Crossbar-like: low latency; nonblocking; costly, poor scalability.
We are exploring a range of NoC topologies, including high-performance buses, rings, tree-based networks, and crossbars. The pros and cons of some of the main topologies are summarized in Figure 26.3.
A common issue with all NoC topologies is communication latency. In 50 nm process technologies, it is predicted that the intrachip propagation delay will be between six and ten clock cycles [4]. A complex NoC could therefore exhibit latencies many times larger. Moreover, the increasing gap between processor clock cycle times and memory access times further increases the need for latency hiding. Finally, coprocessors can introduce additional latencies. Effective latency hiding is therefore key to achieving efficient parallel processing. This is the key reason for the adoption of hardware multithreading processors in the StepNP platform. This implies that the programming tools must be able to automatically exploit this capability. This was achieved using hardware-assisted dynamic task allocation, as described further.
The interconnect model developed in the base StepNP platform is a simple functional model supporting split-transaction communication, as depicted in Figure 26.4. We gave the communication channel's interface definition particular attention, and we describe it in the next sections.
The first planned physical implementation of the StepNP platform will be based on an ST interconnect framework, the STBus [27], which supports a wide range of topologies, including buses, bridges, and crossbars. The STBus protocol supports advanced features similar to OCP-IP, for example, out-of-order and split transactions. Despite the name, STBus is not a bus per se, but is in fact an interconnect generation framework, which supports the automatic generation of a range of interconnect topologies made up of buses, bridges, and crossbars. The STBus toolset generates an RTL-synthesizable implementation. We have integrated the STBus SystemC model into StepNP. Other NoC approaches are also being investigated:
• In cooperation with the UPMC/LIP6 laboratory in Paris, we have developed a 32-port version of the SPIN packet-switched interconnection NoC [28,29], to be implemented using ST's 0.13 micron process.
• A ring-based NoC topology is also under development. This provides high scalability and can be designed as nonblocking, but at the expense of higher latencies.
FIGURE 26.4 Split-transaction communication channel.
Finally, for the emerging 65 nm process technology node and beyond, we are exploring globally asynchronous, locally synchronous approaches. One interesting example of this approach is the Star network [30], which serializes packets and uses plesiochronous clocking regions.
26.4.3 Use of Embedded FPGAs and Embedded Sea-of-Gates
It is our belief that the large majority of end-user SoC product functionality will run on the heterogeneous embedded processors. However, power and performance constraints will dictate partitions where the majority of performance will come from a combination of optimized hardware, embedded sea-of-gates (eSoG), or embedded field programmable gate array (eFPGA), implementing critical inner loops and parallel operations, but of lower functional complexity. Industrial case studies justifying the use of the various types of platform components were described in Reference 20.
Embedded FPGAs are used in the StepNP platform to complement the processors, but with limited scope. The ∼50 times cost and ∼5 times power penalty of eFPGAs restrict more widespread use. Nevertheless, for high-throughput and simple functions, or highly parallel and regular computations, eFPGAs can play an important role. An eFPGA test chip was developed in ST's 0.18 µm CMOS process, and results were presented in Reference 23.
Embedded SoG technology, for example, such as that proposed by eASIC [31], which is configured with one or two masks, is another interesting cost and flexibility compromise that we are also incorporating in StepNP. A test chip including this technology was developed in ST's 0.13 µm CMOS process. The 3 to 3.5 times cost penalty over standard cells is compensated by the lower mask-set NRE, which can be 10 to 30 times lower in cost than a complete mask set.
The StepNP platform uses eFPGAs and eSoG in two roles: for (re)configurable processor customization and for hardware processing elements (HW PEs), as described next.
26.4.4 Configurable Processor Implementation
Embedded FPGA and SoG fabrics are used to implement reconfigurable and configurable processors. ST has developed and manufactured a 1 GOPS reconfigurable signal processing chip [23]. This combines a commercial configurable RISC core with an eFPGA that implements the application-specific instructions.
In the StepNP physical implementation, we extend this approach to use eSoGs to achieve a low-cost, one-time configurable version of these application-specific instructions.
26.4.5 Hardware Processing Elements
The HW PEs of StepNP are implementable using a user-defined combination of eFPGAs and eSoGs. To facilitate interoperability, all PEs communicate with the NoC via a standard protocol. The conversion between the HW PEs' internal data representations and the packet-oriented format of the NoC (as depicted by the "packetization" blocks of Figure 26.2) is performed by hardware wrappers automatically generated by the SIDL compiler described further.
To support applications in the networking space, we have also integrated an ST-proprietary high-performance network packet search engine (NPSE) optimized for IPv4/IPv6 forwarding. This search engine is a pipelined SRAM-based solution to the lookup and classification problems. In comparison with CAM-based lookup methods, this SRAM-based approach is more memory- and power-efficient [32]. The StepNP platform served as a validation environment for this search engine during the architecture design phase.
The key characteristic of the StepNP platform is that, although it is composed of heterogeneous HW–SW PEs, memories, and I/O blocks, the use of a single standardized protocol to communicate with a single global NoC allowed us to build a homogeneous programming environment supporting automatic application-to-platform mapping.
26.5 Multiflex MP-SoC Tools Overview
It is our conviction that the success of an MP-SoC platform will depend mostly on the quality and effectiveness of the platform programming, debug, and analysis tools. In particular, it will depend on the platform's ability to support a high-level parallel programming model, therefore enabling higher productivity tools. Unless abstractions are introduced, MP-SoC devices will be very hard to program. This is due to the different instruction sets of the processors, different data representations, computations that are split between hardware and software blocks, evolving partitioning of computation, etc. The parallel programming model and supporting tools are the key means to bridge the gap between system specifications and the platform capabilities, as discussed in Reference 19.
To support the objective of bridging this gap, we have developed the Multiflex toolset for MP-SoC systems, as depicted in Figure 26.5. Although networking and communications applications are the first key drivers, the toolset and platform are in fact quite general and are being applied to other application domains, for example, advanced image processors and wireless base stations — but this is beyond the scope of this chapter.
The Multiflex environment leverages our existing system-level design tools [15,16] and embedded software technologies [17], but also adds several multiprocessor-oriented capabilities, using StepNP as an MP-SoC driver platform. The environment starts with a SystemC-based [33] configurable model of the multiprocessor platform. It supports interaction with the model via a well-defined application-programming interface (API) called the SystemC interface definition language (SIDL). This not only allows for model input and output, but also allows for orthogonal interactions with the model for various debug and analysis tools.
The Multiflex tools support the automatic dynamic scheduling and allocation of tasks to multiprocessors from two high-level parallel programming models: a distributed system object component (DSOC) message passing model, and a symmetrical multiprocessing (SMP) model using shared memory, as described in Section 26.8. This is supported by a GNU-based multiprocessor source-level debugger and by ST's FlexPerf processor performance analysis tools [17].
The top of Figure 26.5 depicts the top-level control layer that provides interaction, control, and analysis of the architecture model from a domain-specific, high level of abstraction. It allows visualization and
FIGURE 26.5 The Multiflex MP-SoC tool environment.
control of the model's execution from various perspectives (programming, temporal, spatial, and user-defined) for a range of abstraction levels (functional, transaction, and cycle-based). The programming perspective is a multiprocessor version of a conventional source-level debugger. The logical perspective lets the user track a single packet's processing logically, even though the processing is distributed over multiple processor, interconnect, and hardware resources. The temporal perspective lets the user visualize parallel activities on a timeline. The timeline can represent various abstraction levels — for example, the name of the top-level C function running on a processor at a given time, or the signal value on a bus. The spatial perspective allows event tracking in a hierarchical block diagram. StepNP automatically extracts the graphical representation of the hierarchy from the model via an introspective API described later. Finally, an automatic API generator allows easy connection of scripting languages or graphical environments enabling user-defined perspectives, a key requirement for STMicroelectronics' customers.
26.6 Multiflex Modeling and Analysis Tools
This section describes the tools and methods used to model and analyze the StepNP architecture platform described in Section 26.4.
26.6.1 Modeling Language
We chose SystemC 2.0 [33], with its wide range of modeling abstraction capability, as the StepNP architecture platform's main modeling language. A range of abstraction levels are supported: untimed functional, transaction-level modeling [16], and cycle-accurate modeling. Transaction-level modeling is the most useful for our purposes, combining reasonable accuracy and higher simulation speeds than cycle-accurate models [15,16]. Where appropriate, we also included more-specialized languages — for example, Python and tool command language (Tcl) for user scripts, and Java/Forte for graphical user interfaces and user-defined extensions.
26.6.2 Hardware Multithreaded Processor Models
As discussed in Section 26.4, one important requirement for the processors in the StepNP platform is the efficient handling of the latencies of the interconnect, memory, and coprocessors. Multithreading lets the processor execute other streams while another thread is blocked on a high-latency operation. A hardware multithreaded processor has separate register banks for different threads, allowing low-overhead switching between threads.
As models of processors supporting hardware multithreading were not readily available, we chose to use standard monothreaded processors and emulate this capability via modeling. For the standard RISC-style processors supported in StepNP, namely the ARM™, PowerPC™, MIPS™, and DLX models, we encapsulate the functional instruction-set models into a SystemC wrapper. The resulting encapsulation produces a cycle-based model implementing a configurable m-wide hardware multithreading capability and a simple n-stage pipeline. This approach could also be used in principle for the configurable or application-specific processor models.
To achieve this, each thread in the SystemC processor wrapper calls the ISS to implement the thread instructions. The wrapper interleaves the sequencing and timing of these calls to model the execution of a hardware-multithreaded processor. The ISS returns memory reference operations to the wrapper for implementation in SystemC. The wrapper communicates with the rest of the StepNP platform via the SystemC open core protocol (SOCP) communication channel interface described in the next section.
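A minimal sketch of this wrapping scheme follows, for illustration only: the Iss interface (step/commit) is an assumed stand-in for the actual instruction-set simulator API, and the memory-reference handling is reduced to a placeholder.

#include <systemc.h>

// Sketch only, not the actual StepNP code. "Iss" is a hypothetical
// interface to a monothreaded ISS holding one register context per
// hardware thread.
struct MemRef { bool valid; bool write; sc_dt::uint64 addr; unsigned data; };

class Iss {
public:
    virtual ~Iss() {}
    virtual MemRef step(int thread_id) = 0;                   // run one instruction
    virtual void   commit(int thread_id, unsigned rdata) = 0; // complete a load
};

SC_MODULE(MtRiscWrapper) {
    sc_in<bool> clk;
    Iss*        iss;   // the encapsulated functional ISS
    int         m;     // configurable hardware-thread count

    void run() {
        int t = 0;
        for (;;) {
            wait(clk.posedge_event());   // one issue slot per clock
            MemRef r = iss->step(t);     // thread t advances one instruction
            if (r.valid) {
                // The memory reference would be issued here as a
                // split-transaction SOCP request; while it is outstanding,
                // the other threads keep the simple n-stage pipeline busy.
                if (!r.write)
                    iss->commit(t, 0 /* read data from the SOCP response */);
            }
            t = (t + 1) % m;             // round-robin thread interleaving
        }
    }

    SC_CTOR(MtRiscWrapper) : iss(0), m(4) { SC_THREAD(run); }
};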
26.6.3 SOCP Network-on-Chip Channel Interface
A key component of the StepNP platform is the interconnect technology, as discussed in Section 26.4.2. The StepNP platform uses a single NoC that connects all I/Os and PEs (HW and SW). The NoC interface is an important part of the modeling environment. Our goal in creating an interface to the NoC was a standardized API that would enable plug-and-play of different SoC IPs at various abstraction levels. The following requirements motivated the StepNP communication channel interface's design:
• The interface must operate at the functional and transaction levels. The interface should contain no bit-level signals, polarities, clock cycles, or detailed timing. This requirement does not preclude an adapter that, for example, maps the interface to a cycle-accurate internal model.
• The interface-modeling approach should use SystemC 2.0 constructs in a manner as close as possible to their original intent. In other words, the communication between master, channel, and slave should use the SystemC 2.0 port, interface, and channel methodology.
• The interface should make no assumptions about the underlying interconnect architecture. It must support anything from a simple point-to-point connection to an arbitrarily complex, multilevel network on a chip.
• The interface must support split transactions and should not assume that requests and responses are atomic.
• It must support multithreaded masters and slaves.
• If possible, the interface should be compatible with existing interfaces designed for IP reuse.
This channel interface is dubbed SOCP and was fully described in Reference 26. We developed the SOCP channel interface model using the SystemC 2.0 language. The model follows the same high-level semantics as the open core protocol (OCP) [25] and the virtual component interface (VCI) [34] but has no notion of signals or detailed timing. For transaction-level modeling, these standards can be considered functionally identical, and we refer to them interchangeably. Our modeling approach has the following advantages:
• The SOCP channel interface model can inherit semantics of parameters and behavior largely from the OCP/VCI specification.
• The StepNP user can refine the SOCP to transform it to an OCP/VCI lower-level interface or to other interconnect implementations, such as industry-standard buses or complex NoCs.
• The channel interface model achieves higher simulation speeds than OCP/VCI or bus-level channel implementations because of its higher abstraction level.

26.6.3.1 Extending OCP Semantics for High-Level Modeling
Although we followed the base OCP and VCI semantics, the SOCP needed selective extensions for functional- and transaction-level modeling. Data crosses an OCP interface one word at a time. In SOCP, however, we also allow a complete burst transaction in one put request. We do this by specifying an additional length parameter with a value greater than one and allowing pointers for the address-, data-, and byte-enable parameters. We assume that the other parameters are constant for each data item of the transfer, except for the MBurst parameter, which need not be set by the master. However, a cycle-accurate adapter on the other side of the interface can choose to feed the data into a lower-level internal model and generate the burst signal for each word as appropriate. The SOCP interface requires the response length to match the request length.
Calls across an SOCP interface can block the caller. Therefore, the caller should be a SystemC thread construct (SC_THREAD). If a slave IP component blocks a put request from a channel, however, the channel could be blocked until the request is serviced. Therefore, if channel blocking is an issue, slave IP components should avoid blocking a put request (by using the SThreadBusy back-pressure mechanism or by buffering requests). The same applies to a response from the slave (or channel) to the master.
It is possible to implement both master and slave in a purely functional manner. If a purely functional channel model is used, some mechanism should be provided in the channel to schedule new threads; otherwise, one master will dominate the simulation, and no other threads will run.

26.6.3.2 SOCP Outlook
The SOCP work was used to help define the requirements for a new transaction-level modeling interface proposal, which was submitted to the Open SystemC Initiative [33] community and was recently accepted. This new framework supports a range of protocols (e.g., STBus, Amba, and OCP) on top of a generic transport mechanism. The StepNP platform models are currently being adapted to align with this new approach.
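As an illustration of the extensions described in Section 26.6.3.1, one hypothetical way to render a burst-capable, split-transaction put request is sketched below. This is not the actual SOCP definition (which is given in Reference 26); the type and field names are assumptions.

#include <systemc.h>
#include <cstdint>

// Illustrative only. A length greater than one selects a complete burst
// in a single put; pointers avoid copying the address, data, and
// byte-enable streams for each word.
struct SocpRequest {
    int             thread_id;   // issuing hardware thread
    bool            write;
    unsigned        length;      // >1: complete burst in one put request
    const uint64_t* address;
    const uint32_t* data;
    const uint8_t*  byte_enable;
};

class socp_master_if : virtual public sc_interface {
public:
    // May block the caller, so callers should be SC_THREADs; a response
    // of matching length is delivered later on the split-transaction path.
    virtual void put(const SocpRequest& req) = 0;
};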
26.6.4 Distributed Simulation Using SOCP
A benefit of a well-defined communication interface is that it can serve as the basis for partitioning the SystemC simulation model over multiple workstations. To achieve this capability in StepNP, we implemented an SOCP channel called dsim that allows distributed simulation over a pool of workstations. The masters and slaves plug into the dsim channel without any changes. The channel implementation, however, is aware of the workstations involved in the simulation and of the components' addresses and identifiers. If a master sends a request to a local slave, the transaction completes in much the same way as in the functional channel. However, if the slave is on another workstation, the channel implementation packages the request and sends it over the network to the destination slave.
In this approach, either the communication rate or the slowest distributed SystemC model component limits the overall simulation speed. We measured communication rates of about 30 kHz using the dsim channel implemented with the transmission-control and Internet protocols (TCP/IP) over a standard 100-Mbit Ethernet line.
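The core routing decision of such a channel might look as follows; this is a sketch under assumed names and data structures, with the serialization and socket handling elided.

#include <cstdint>
#include <map>

// Sketch of a dsim-style routing decision: local requests take the
// functional-channel path, remote ones are packaged and shipped over
// TCP/IP to the owning workstation.
struct DsimChannel {
    int localNode;
    std::map<uint64_t, int> ownerOfRegion;  // region base address -> node id

    void put(uint64_t addr, uint32_t data) {
        (void)data;
        if (owner(addr) == localNode) {
            // deliver to the local slave, as the functional channel would
        } else {
            // serialize the request and send it to the owning node over a
            // TCP socket; the response completes the transaction the same way
        }
    }

    int owner(uint64_t addr) const {
        // greatest region base that is <= addr owns the address
        std::map<uint64_t, int>::const_iterator it = ownerOfRegion.upper_bound(addr);
        return (it == ownerOfRegion.begin()) ? localNode : (--it)->second;
    }
};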
26.7 Multiflex Model Control-and-View Support Framework
Designers and programmers using the StepNP platform need to debug, verify, understand, initialize, and measure the platform model's execution. This diversity of uses requires a robust methodology for controlling and viewing models. In most cases, the control-and-view components are parts of a process separate from the model itself and are often implemented in another language. For example, scripting languages such as Tcl and Python are used to automate verification. A typical verification scenario could have a script that populates the routing table in the model with known values and then injects packets into a network device in the model.
FIGURE 26.6 Multiflex control-and-view methodology: (a) application testbench, (b) performance analysis tool, (c) SoC tools platform, and (d) model.
The script would then start the model execution and examine the emitted packet for correct header values and processing latency. Given these control-and-view requirements, the Multiflex environment supports the introspective approach illustrated in Figure 26.6. An external control-and-view component — for example, the SoC tools platform — connects to the model’s access API. This component can query the model’s structure, discover components the model uses, and discover the supported interfaces to these components. In Figure 26.6, the external SoC tool component has discovered a SystemC probed signal between two slave model subcomponents, A1 and A2. The Multiflex control-and-view framework automates much of this process.
26.7.1 SystemC Structural Introspection
SystemC provides an API for traversing the model's structural hierarchy. The Multiflex access API enhances this basic support and builds a structural representation of the model. Low-level signals (such as objects of type sc_signal) and simple state representations can use a Multiflex-supplied probe class. This probe class extends signals and state representation variables with functionality that connects to external control-and-view components. These components can use this functionality to discover the model's signals and probed state and to recover the time history as needed. They can also use the probe class for automating sc_trace control, setting breakpoints, or for other functions.
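One way such a probe class could be structured is sketched below; the actual Multiflex probe API is not shown in this chapter, so the class shape and the hook into the access API are assumptions.

#include <systemc.h>

// Illustrative probe: behaves as the signal it wraps while making each
// transition visible to external control-and-view components.
template <typename T>
class probe : public sc_signal<T> {
public:
    explicit probe(const char* nm) : sc_signal<T>(nm) {}

    virtual void write(const T& v) {
        record(this->name(), v);   // expose the new value and its time
        sc_signal<T>::write(v);    // then behave as a normal signal
    }

private:
    static void record(const char* nm, const T& v) {
        (void)nm; (void)v;  // hook to the access API would go here
    }
};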
26.7.2 SIDL Interface
For software written using high-level SystemC 2.0 modeling facilities, it is more difficult to automatically extract state information and allow control access. Therefore, we developed an instrumentation methodology and an associated language, the SIDL. An SIDL interface allows external control-and-view components written in various languages to access a SystemC model object implementing this interface.
An SIDL looks much like a pure virtual C++ class and is patterned after the sc_interface approach in SystemC. For example, the SIDL interface to a simple counter could be:

class CounterCandV {
public:
    virtual int getCount() = 0;
    virtual int setCount(int) = 0;
};
An SIDL compiler parses an SIDL header file and produces all the client-server glue. The glue on the server end connects an object implementing this interface to the low-level access API. The compiler produces the client-end glue in the desired language (e.g., Java, C++, Python, or Tcl). The function parameters can be basic types (integers, floats, strings, and so forth), structures, or vector containers of these types. The SIDL compiler handles all marshaling and remote procedure call issues. In the CounterCandV example, a counter in the SystemC model needs only to inherit the server instance of this class (CounterCandVServer) generated by the compiler, and implement the getCount and setCount methods. The client can call the access API in the server to discover all control-and-view interfaces and the names of these instances. For example, a client might find that the model supports a CounterCandV object, named counter0. The client can then create a CounterCandVClient object, supplying the name counter0 to the constructor. The client can then call the getCount and setCount methods of this object, which transparently calls the getCount and setCount methods in the corresponding SystemC object. At one level, SIDL looks like a distributed object model such as the common object request broker architecture (CORBA) [40]. However, SIDL is more restricted in scope than CORBA; it follows the interface style of SystemC, and is integrated in the SystemC environment.
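A sketch of the model side of this flow is shown below. CounterCandVServer is the glue class normally generated by the SIDL compiler; the pure-virtual shape shown here is inferred from the SIDL interface and is an assumption.

// Stand-in for the compiler-generated server glue (shown as a stub so the
// sketch is self-contained; the generated version also contains the
// marshaling code and the access-API registration):
class CounterCandVServer {
public:
    virtual ~CounterCandVServer() {}
    virtual int getCount() = 0;
    virtual int setCount(int) = 0;
};

// The model-side counter only inherits the server glue and implements
// the two interface methods:
class Counter : public CounterCandVServer {
    int count_;
public:
    Counter() : count_(0) {}
    virtual int getCount() { return count_; }
    virtual int setCount(int c) { count_ = c; return c; }
};

On the client side, the generated CounterCandVClient would be constructed with the discovered instance name (e.g., counter0), after which its getCount and setCount calls are forwarded transparently to the SystemC object.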
26.7.3 Instrumentation of an SOCP Socket with SIDL
It is possible to develop generic control-and-view interfaces for common master and slave components, such as processors and memories. However, the instrumentation of an SOCP interface is of particular interest because instrumentation tools developed for an SOCP interface can be used with any IP block. An IP block with an SOCP interface can be plugged into an object whose function is analogous to an in-circuit emulator (ICE) socket, with no change to the channel object or the device under test, as illustrated for the master module in Figure 26.6(d). The abstract ICE socket can transparently pass master requests and slave responses. However, external software can monitor or generate the transactions using an SIDL interface. The ICE socket can also perform transaction recording and store the transactions in a trace file for viewing by standard CAD tools or, as Figure 26.6(b) shows, by more specialized SoC performance analysis tools such as STMicroelectronics' SysProbe [15] and FlexPerf [17].
26.8 Multiflex Programming Models
The new capabilities of emerging SoC devices will require a radical shift in how we program complex systems [18,19]. Unless abstractions are introduced, MP-SoC devices will be very hard to program. This is due to the different instruction sets of the processors, different data representations, computations that are split between hardware and software blocks, evolving partitioning of computation, etc.
26.8.1 Survey of Multiprocessor Programming Models
The importance of new high-level programmer views of SoC, and the associated languages that support the development of these views, are presented in Reference 35. The capabilities supported by high-level programming models (e.g., hiding hardware complexity, enhancing code longevity and portability, etc.) and different programming model paradigms (e.g., message passing and shared memory) are presented in the context of NoC design [4,36].
A number of multiprocessor programming models designed for SoC-scale applications have been presented. The Mescal approach [11] is based on a programming model defined as a set of APIs abstracting a microarchitecture. It serves the dual function of capturing enough of the application domain to pass down to the compiler, as well as exporting just enough of the architecture to enable the programmer to efficiently program the hardware platform. Based on this definition, the authors of Reference 37 present a programming model inspired by the Click language; however, it is essentially dedicated to the IXP1200 NPU.
Kiran et al. [38] propose a parallel programming model for communication, in the context of behavioral modeling of signal processing applications. This model, named the shared messaging model, integrates the message passing and shared-memory communication paradigms. It exploits the advantages of both paradigms, providing up to an order-of-magnitude improvement in communication latency over a pure message-passing model. However, this approach only addresses communication modeling, and not the subsequent implementation.
Forsell [39] presents a sophisticated programming model that is realized through multithreaded processors, interleaved memory modules, and a high-capacity interconnection network. It is based on the PRAM (parallel random access machine) paradigm. However, it is restricted to a fixed NoC architecture (Eclipse).
26.8.2 Multiflex Programming Model Overview
It is our conviction that programming model development will be evolutionary, rather than revolutionary, and that the trend will be to support established software languages and technologies, rather than the development of entirely new programming paradigms. Currently, and in the foreseeable future, large systems will be written mostly in C++, Java, or languages supported by the Microsoft common language runtime (CLR), such as C#. The Java and CLR-supported languages have established programming models for both tightly coupled and loosely coupled programming. Briefly stated, tightly coupled computing is done with some variant on an SMP model (threads, monitors, conditions, signals), and heterogeneous distributed computing is accomplished with some variant on a component object model (CORBA [40], Enterprise Java Beans, Microsoft DCOM [41] and its evolutions, etc.). Recent proposals for C++ evolution [42] have also called for SMP and component models inside the C++ language specification.
The two SoC parallel programming models presented here are inspired by leading-edge approaches for large system development, but adapted and constrained for the SoC domain:
• DSOC model. This model supports heterogeneous distributed computing, reminiscent of the CORBA and Microsoft DCOM distributed component object models. It is a message-passing model, and it supports a very simple CORBA-like interface definition language.
• SMP model, supporting concurrent threads accessing shared memory. The SMP programming concepts used here are similar to those embodied in Java and Microsoft C#. The implementation performs scheduling, and includes support for threads, monitors, conditions, and semaphores.
Both programming models have their strengths and weaknesses, depending on the application. In the Multiflex system, both can be combined in an interoperable fashion. On one hand, this should be natural for programmers familiar with Java or C#, and on the other hand, be efficient enough (both in terms of execution efficiency and resource requirements) for use in emerging SoC devices. The following sections describe both of the Multiflex programming models in more detail.
In comparison with the systems cited in the survey above, we believe the methodology supported by the Multiflex system has four key contributions:
1. Support of interoperable distributed message passing and SMP programming models.
2. An extremely efficient implementation of these models, achieved using novel hardware accelerators for message passing, context switching, and dynamic task scheduling and allocation.
FIGURE 26.7 DSOC model to platform mapping.
3. Support of homogeneous programming styles for MP-SoC platforms composed of heterogeneous HW–SW PEs. This is achieved via an interface definition language and compiler supporting a neutral message format.
4. All application programming is done using a high-level language (C or C++ combined with high-level calls to the parallel programming model APIs). Moreover, all the Multiflex tools supporting the mapping, compilation, and runtime execution are also written in processor-independent C and C++.
26.9 DSOC Programming Model
The DSOC programming model relies on a high-level representation of parallel communicating objects, as illustrated in the simple example of Figure 26.7, where the four objects represent application functions. DSOC objects exchange messages via a well-defined interface definition language (IDL). The IDL description defines the interface to an object in a language-neutral way. As illustrated in Figure 26.7, the DSOC objects can be assigned to general-purpose processors running a standard operating system (e.g., for Object2), to multiple hardware multithreaded processors (Object1 and Object3), or to HW PEs (Object4). Due to the underlying heterogeneous components involved in the implementation of the interobject communication, a translation to a neutral data format is required. In the Multiflex system, we have implemented an IDL dubbed SIDL.¹ SIDL looks much like a pure virtual C++ class, and is patterned after the sc_interface approach in SystemC. As explained below, the use of SIDL is key to the message passing implementation.

¹Although the SIDL syntax used here is identical to the one used for API generation in the Multiflex control-and-view framework described in Section 26.7, the usage is in a different context, and this SIDL compiler uses a different back-end.
The DSOC programming model relies on three key services. As we are targeting this platform at high-performance applications, such as network traffic management at 2.5 and 10 Gb/sec line-rates, a key design choice is the hardware support for remote object calls:
• The hardware message passing accelerator engine is used to optimize remote object calls. It translates outgoing messages into a portable representation, formats them for transmission on the NoC, and provides the reverse function on the receiving end.
• The hardware object request broker (ORB) engine is used to coordinate object communication. As the name suggests, the ORB is responsible for brokering transactions between clients and servers. Currently, a simple first-come, first-served scheduling mechanism is implemented. The ORB engine allows multiple servers to service a particular service type. This allows client requests to be load balanced over the available servers (see the sketch after this list).
• Hardware thread management coordinates and synchronizes execution threads. All logical application threads are directly mapped onto hardware threads of processing units. No multiplexing of software threads onto hardware threads is done. While the global thread management is performed by the ORB (i.e., synchronizing and matching client threads with server threads), the cycle-by-cycle scheduling of active hardware threads on individual processors is done by the processor hardware, which currently uses a round-robin scheduler. A priority-based thread scheduler is also being explored.
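A software rendering of this brokering policy is sketched below for illustration; in StepNP this logic is implemented by the hardware ORB engine, and the types and method names here are assumptions.

#include <deque>
#include <map>

// Illustrative only: clients queue first-come, first-served per service
// type, and each request is dispatched to the least-loaded server
// registered for that type.
class Orb {
    std::map<int, std::deque<int> >    waiting_; // service type -> client threads
    std::map<int, std::map<int, int> > load_;    // service type -> server -> load
public:
    void registerServer(int type, int server) { load_[type][server] = 0; }
    void request(int type, int clientThread)  { waiting_[type].push_back(clientThread); }

    // Match the oldest waiting client with the least-loaded server;
    // returns false when either side is missing.
    bool dispatch(int type, int& client, int& server) {
        if (waiting_[type].empty() || load_[type].empty()) return false;
        std::map<int, int>& l = load_[type];
        std::map<int, int>::iterator best = l.begin();
        for (std::map<int, int>::iterator it = l.begin(); it != l.end(); ++it)
            if (it->second < best->second) best = it;
        client = waiting_[type].front();
        waiting_[type].pop_front();
        server = best->first;
        ++best->second;   // server is busier until it completes the request
        return true;
    }
};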
26.9.1 DSOC Message Passing
In the Multiflex system, a compiler is used to process the SIDL object interface description and generate the client or server wrappers that are appropriate for the (hardware or software) PE sending or receiving a message. For processors, the SIDL compiler generates the low-level communication software driving the message passing hardware. For HW PEs, the compiler generates the data conversion hardware and links it to the NoC interface.
The end result is that the client wrapper takes client calls and, with the help of the message passing engine, "marshals" the data into a language-neutral, portable representation. This marshaled data is transferred over the NoC to the server wrapper. The server wrapper "unmarshals" the data, and invokes the server object implementation. The return values are then marshaled, sent back to the client, and unmarshaled in a similar way. Due to the hardware support for message passing, the software overhead for the remote invocation is a few dozen instructions. Note that no software context switching is done. If the server method returns a result, the client hardware thread is stalled until the result is ready.
26.9.2 The DSOC ORB
Figure 26.8 illustrates a sample DSOC execution platform. The top of the figure depicts a mix of software DSOC objects, running on one or more multithreaded RISC processors, as depicted on the top left-hand side. It also includes a number of hardware DSOC objects, shown on the top right-hand side of the figure.
The role of the ORB in parallel execution and system scheduling is key. Parallel execution in a DSOC application is achieved using one or more of the following mechanisms:
• Many client objects may execute in parallel.
• A service may be load balanced over a number of resources.
• The programming model inside a DSOC object is not specified, but could involve SMP or message passing.
• In some cases, a client call may return to the client before the server has completed the request, allowing both client and server to execute in parallel.
Simply stated, the approach used here involves replicating object servers over a "server farm." The ORB matches client requests for service with a server, according to some criteria. In the current implementation, the least-loaded server is selected.
FIGURE 26.8 Execution platform for DSOC programming model.
In our approach, logical threads are mapped one-to-one to the physical threads of the hardware multithreaded PEs. This may seem like a limitation, but on the other hand, even fairly small systems we envisage have 64 hardware threads or more (e.g., eight processors with eight threads each). In actual systems we have designed, the limitation on the number of threads has not proved to be an issue (indeed, until recently, some Unix workstations had process table restrictions limiting the number of processes to under 256).
As a result of this mixed HW/SW implementation, the software overhead for a complete DSOC call is very low. For example, a call with a few integer parameters takes less than 50 instructions for a complete round trip between client and server: a half-dozen on the client side and 20 on the server side (assuming client and server tasks are implemented in software — the overhead is lower in the case of the server running in hardware). This includes:
• Call from the client to the server proxy object
• Insertion of the call arguments into the message passing engine
• Retrieval of the arguments at the server side
• Dispatching of the call to the server object
• Insertion of the result into the server message passing engine
• Reading of the results by the server proxy from the client message passing engine
• Return from the server proxy object to the client with the result
The client-side code emitted by the SIDL compiler for the above looks much like a normal function call. However, instead of pushing arguments on a stack, the arguments are pushed into the message passing engine (MPE). Instead of a branch-to-subroutine instruction, a special MPE command is given to trigger the remote call. If the object call returns a result, the client thread is stalled until the request is serviced. No special software
accomplishes the stall; rather, the client immediately reads the return result from the MPE, and this read stalls the client thread until results are ready. All this is inlined, so the client-side DSOC code can be a handful of assembler instructions.
The server-side code is slightly more complex, as it first reads an incoming service identifier and function identifier from the MPE. It then does a table lookup, and branches to the code handling this object method. This is typically implemented in less than a dozen RISC instructions. From there, arguments are read from the MPE, and the object implementation is called. Finally, results (if any) are put in the MPE for transmission back to the client. Again, the overhead for this is roughly the same as a local object method call.
The end result of this HW/SW architecture is that we are able to sustain end-to-end DSOC object calls from one processor to another at a rate of about 35 million per second, using 500 MHz RISC-style processors.
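A minimal sketch of the inlined client-side sequence is shown below. The memory-mapped MPE layout, the register offsets, and the command encoding are assumptions; only the mechanism (push arguments, trigger command, blocking result read) follows the description above.

#include <cstdint>

// Illustrative memory-mapped MPE registers (addresses and layout assumed).
static volatile uint32_t* const MPE = reinterpret_cast<volatile uint32_t*>(0x40000000);
enum { MPE_ARG = 0, MPE_CMD = 1, MPE_RESULT = 2 };

inline int remoteAdd(int serviceId, int methodId, int a, int b) {
    MPE[MPE_ARG] = static_cast<uint32_t>(a);  // arguments go to the MPE,
    MPE[MPE_ARG] = static_cast<uint32_t>(b);  // not onto the stack
    // The command word triggers the remote call (encoding assumed):
    MPE[MPE_CMD] = (static_cast<uint32_t>(serviceId) << 16)
                 | static_cast<uint32_t>(methodId);
    // Reading the result stalls this hardware thread until the response
    // arrives over the NoC; no software accomplishes the stall.
    return static_cast<int>(MPE[MPE_RESULT]);
}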
26.10 SMP Programming Model
Modern languages such as Java and C# support both tightly coupled SMP-style programming (with shared memory, threads, monitors, signals, etc.), as well as support for distributed object models, as described above. Unfortunately, SoC resource constraints make languages such as Java or C# impractical for high-performance embedded applications. For example, in a current STMicroelectronics consumer application, the entire "operating system" budget is less than 1000 instructions. As we have seen in the previous section, DSOC provides an advanced object-oriented programming model that is natural to Java or C# programmers, with essentially no operating system software or language runtime. Next, we will describe how we support a high-level SMP programming model in the same resource-constrained environment.
26.10.1 Target SMP Platform
To address the efficiency requirements of high-performance SoCs, SMP functionality in the Multiflex system is implemented by a combination of a lightweight software layer and a hardware concurrency engine (CE), as depicted in Figure 26.9. The SMP access functions to the CE are provided by a C++ API. It defines classes and methods for threads, monitors (with enter/exit methods), condition variables (with methods for signal and wait), etc.
The CE appears to the processors as a memory-mapped device, which controls a number of concurrency objects. For example, a special address range in the CE could correspond to a monitor, and operations on the monitor are achieved by reading and writing addresses within this address range. Most operations associated with hundreds or thousands of instructions on a conventional SMP operating system are accomplished by a single read or write operation to a location in the CE.
To motivate the need for a hardware CE, consider the traditional algorithm for entering a monitor. This usually consists of the following:
1. Acquire the lock for the monitor control data structures. This is traditionally done with some sort of atomic test-and-set instruction, with a spin and back-off mechanism for heavily contested locks.
2. Look at the busy flag of the monitor. If clear, the thread can enter the monitor (by setting the busy flag, and releasing the monitor lock). If the busy flag is set, the thread must: (1) link itself into a list of threads trying to enter the monitor, (2) release the lock for the monitor, (3) save the state of the calling thread (e.g., CPU registers), and switch to another thread.
This control logic is quite involved. The logic for signaling a condition inside the monitor is even more complex. In contrast, with the Multiflex CE, entering a monitor, or signaling a condition, is done with one memory load instruction at a special address that indicates the monitor object index and the operation type (see the sketch below). Similarly, forking up to 8192 (2¹³) threads at a time can be accomplished with one memory write. The atomic maintenance of the linked lists, busy flag indicators, timeout queues, etc., is done in hardware.
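A sketch of this memory-mapped encoding is shown below; the base address and bit layout are illustrative assumptions, not the actual CE register map.

#include <cstdint>

// Illustrative CE access (addresses and encoding assumed): the address of
// the load encodes both the operation type and the concurrency object index.
static volatile uint32_t* const CE = reinterpret_cast<volatile uint32_t*>(0x50000000);
enum CeOp { MON_ENTER = 0, MON_EXIT = 1, COND_WAIT = 2, COND_SIGNAL = 3 };

inline uint32_t ceIndex(CeOp op, unsigned objIndex) {
    return (static_cast<uint32_t>(op) << 16) | objIndex;
}

inline void monitorEnter(unsigned mon) {
    // One load: if the monitor is busy, the CE withholds the
    // split-transaction response, suspending this hardware thread
    // until the owner exits.
    (void)CE[ceIndex(MON_ENTER, mon)];
}

inline void monitorExit(unsigned mon) {
    (void)CE[ceIndex(MON_EXIT, mon)];  // releases the next queued thread, if any
}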
FIGURE 26.9 Execution platform for SMP programming model.
Any operation that should block the caller (such as entering a busy monitor) will cause the CE to defer the response to the read until the blocking condition is removed (e.g., by the owner of the monitor exiting). This causes suspension of execution of the hardware thread. The split-transaction nature of the StepNP interconnect makes this possible, since the response to a read request can be delivered at any time in the future, and does not block the interconnect. Therefore, the response to a read from a CE location representing a monitor entry will not return a result over the interconnect until the monitor is free.
Notice that no software context switching takes place for concurrency operations. The hardware thread is simply suspended, allowing other hardware threads enabled on the processor to run. This can often be done with no "bubbles" in the processor hardware pipeline. The large number of system hardware threads, on the order of thousands, would make software context switching unnecessary for most applications. Of course, it is possible for an operating system layer to run in the background and, for example, take blocked hardware threads that have been suspended for a long time and context switch them in software to execute another task. However, for our current applications, we have not yet found a need for this.
The CE is also responsible for other tasks such as run queue management, load balancing, etc. Our experiments to date indicate that a simple first-come, first-served task scheduling hardware mechanism results in excellent performance with good resource utilization. Therefore, the Multiflex system provides an SMP programming model with essentially no "operating system" software, in the conventional sense. The C++ classes controlling concurrency are implemented directly with in-line read/write instructions to the CE. A C-based POSIX thread ("p-thread") API is also available, with only slightly lower efficiencies.
This high-performance SMP implementation simplifies programming for the application developer. With conventional SMP implementations, the cost of forking a thread, synchronizing, etc. must be carefully
considered and balanced against the granularity of the task to be executed. Making the task granularity too large can reduce opportunities for parallelism, while making tasks too small can result in poor performance due to SMP overhead. Finding the right balance requires a great deal of trial and error. However, with the high-performance Multiflex SMP implementation, the trade-off analysis is much simplified.
26.10.2 Hardware Support for SMP RTOS
The overhead of a software-only RTOS context switch is typically over 1,000 cycles, and in the context of MP-SoCs with long NoC latencies can exceed 10,000 cycles [43]. The use of hardware engines for message passing, context switching, and task scheduling is therefore essential in both the DSOC and SMP model implementations in order to achieve efficient mapping of medium-grain parallel tasks onto MP architectures. For example, Mooney and coworkers [44] have reported an experiment where an RTOS context switch takes 3,218 cycles and communication takes 18,944 cycles, for a medium-grained computation taking 8,523 cycles. This results in an application execution efficiency of only 30%. They also experimented with partial hardware acceleration for scheduling and lock management, and observe efficiencies approaching 63%. However, they note that the improvement due to hardware acceleration of the scheduling and synchronization was limited by the software context switch overheads. Kohout et al. [45] perform RTOS task scheduling in HW and demonstrate up to 10 times speedup of RTOS processing time (from 10% to 1% overhead). Absolute times for RTOS processing are not given, but the interrupt response time portion was between 1400 and 2200 nsec, for a clock frequency of 200 MHz.
In the Multiflex system running on StepNP (assuming the same 200 MHz clock frequency), context switches occur in 5 nsec (1 clock cycle), message passing in less than 100 nsec (i.e., 15 to 20 instructions typically), and scheduling of DSOC and SMP objects in less than 100 nsec (10 to 20 instructions). More importantly, it is the combination of the three accelerator engines that enables the effective mapping of medium- to fine-grain parallelism onto MP architectures like StepNP. In the traffic manager application examples below, we are dealing with fine-grained parallel tasks that typically represent less than 500 RISC instructions.
26.10.3 Interoperable DSOC and SMP Programming Models
As depicted in Figure 26.10, the Multiflex system supports the mapping of applications expressed in the DSOC and SMP programming models. In this simple example, application objects Object1, Object3, and Object4 communicate via the DSOC message passing model described in the previous section. On the other hand, Object2 is an SMP object that contains three parallel threads that access shared memory. In order to achieve this, the platform makes use of both the DSOC scheduler (ORB) and the SMP scheduler (CE) described above.
26.11 An IPv4 Packet Forwarding Application
To illustrate the concepts discussed in this chapter, we have mapped a Multiflex model of a complete IPv4 fast-path application of Figure 26.11(a) onto the multiprocessor execution platform depicted in Figure 26.11(b).
26.11.1 Networking Application Framework Our application software platform makes use of MIT’s open source Click modular router framework for the rapid development of embedded routing-application software [46]. Figure 26.11(a) depicts sample Click modules performing packet classification, discarding, stripping, and queuing. We have extended the Click IPv4 packet forwarding model to be compatible with the DSOC messagepassing programming model by encapsulating Click functions in SIDL-based interfaces. The granularity
© 2006 by Taylor & Francis Group, LLC
26-22
Embedded Systems Handbook
Shared memory com
Object2 T2 T1 T3
Object1
S/W–S/W com S/W–H/W com
Shared memory Object3
Object4
PE1.n PE1.1
PE2
Fast MT-RISC
H/W PE
Message passing engine
PE3 MP RISC, std. O/S
NoC In
FIGURE 26.10
DSOC ORB
SRAM
SMP CE
Out
Mixed DSOC-SMP model to platform mapping.
of the partitioning is user-defined and can be defined at the intrapacket and interpacket levels. It is the latter that is most naturally exploited here, due to low interpacket dependencies. Furthermore, we have added DSOC objects to the model. For example, the lookup element makes use of a DSOC server object for IPv4 address lookup. Similarly, DSOC server objects are used for packet I/O. SMP primitives are used to fork threads, and to guard updates to shared data, where necessary.
26.11.2 StepNP Target Architecture The target architecture platform used here is depicted in Figure 26.11(b). In order to support wire-speed network processing, a mixed hardware and software architecture is used. For example, at a 10 Gb/sec line-rate, the packet processing time is approximately 40 nsec (for packets of 54 bytes). To achieve this, the packet I/O and network address searches are implemented in hardware. The packet I/O component is implemented using 32 hardware threads. It receives input packets, and controls the DMA to memory. The network search engine model is based on ST’s NPSE search engine [32]. The platform was configured with the following variable parameters: • • • • • • • •
RISC processor ISA Number of processor pipeline stages Processor clock frequency Data/program cache size Number of processors Number of hardware threads (per proc.) One-way NoC latency NoC latency jitter
ARM v4 4 500 MHz 4 KB 1 to 48 4 to 32 0 to 160 nsec ±25%
In this experiment, the NoC latency is a configurable parameter. It represents the total lumped one way delay of a data item sent between one master (or slave) to another. We have not attempted to model the detailed behavior of the NoC interconnect, such as contention, blocking, etc. Since the latency value used in the experiment was varied from 0 to 160 nsec (or 80 clock cycles at 500 MHz), we believe this
© 2006 by Taylor & Francis Group, LLC
Multiprocessor SoC Platform
26-23
(a) 10 Gb/sec IPv4 packet forwarding
DSOC / Click RED
Classifier
Pulltopush Discard Strip
Dynamic mapping (b)
Packet I/O
Pipe1 T1 Tm . ... . PipeP Cache
Pipe1 T1 Tm . ... . PipeP Cache
Pipe1 T1 Tm . . PipeP Cache
RISC 1
RISC 2
RISC 40
Parameterizable NoC (latency Li, j )
DSOC ORB
RISC Network search engine
H/W
Message passing engine
Memory Task scheduling
FIGURE 26.11 Platform for IPv4 application: (a) packet forwarding applications, (b) architecture platform.
is a realistic upper bound on the effect of NoC contention or blocking. Moreover, we include a random jitter on the NoC latency of ±25%. This emulates the effect of out-of-order packet processing. We assume a simple fine-grained multithreaded processor architecture in which there is a different thread in each of the pipeline stage; thus, for the four pipeline stages we considered, a minimum of four threads per processors is necessary to fully use this resource.
26.11.3 Simulation Speed The use of the transaction-level modeling approach described in Section 25.6 leads to fast simulation models for this complex platform. For the configuration above using 40 RISC processors, we have observed simulation speeds of over 7K cycles/sec, running on a 2.4 GHz PC-based platform running Linux.
26.11.4 Multiprocessor Compilation and Distribution Figure 26.11 illustrates the basic compilation and mapping process used in the Multiflex multiprocessor compilation and allocation approach. The three superimposed boxes of Figure 26.11(a) represents the processing of three different packets. The top-level IPv4 application can be partitioned at two different levels: At the interpacket processing level. For IPv4, the inter-packet dependencies are very low. This allows for a high level of natural parallelism. This parallelism is depicted in Figure 26.11(a) via the overlapping boxes. Each packet processing is assignable to a different thread. At the intrapacket processing level. This involves cutting the processing graph at basic block boundaries. This is depicted in with the dotted lines illustrating cut points within a single packet processing.
© 2006 by Taylor & Francis Group, LLC
26-24
Embedded Systems Handbook
300
Forwarding rate (Mb/sec)
250
200
Latency 0 Latency 40 Latency 80
150
Latency 120 Latency 160
100
50
0 4
FIGURE 26.12
8
12
16 20 Number of Threads
24
28
32
IPv4 simulation results (normalized per single processor).
As explained above, the packet I/O and network address search functions are manually assigned to hardware. The remaining IPv4 packet processing is automatically distributed over the RISC processor threads. The hardware ORB load-balances the input from the I/O object to the IPv4 packet forwarding clients executing on the RISC processors. When a RISC processor thread completes the main IPv4 processing, another call is issued to the I/O object for output transmission.
26.11.5 IPv4 Results Some simulation results using a minimum packet length of 54 bytes2 are depicted in Figure 26.12. This depicts the packet processing rate obtained, normalized for one processor, while varying the number of hardware threads supported per processor (from 4 to 32) and the one way NoC latency (from 0 to 160 nsec). Three important observations: • The highest processing rate achievable for a single processor is approximately 260 Mb/sec. This provides a lower bound of 40 processors to target a system performance aggregate throughput of 10 Gb/sec. • Assuming a perfect, zero latency interconnect, we observe that the four threads configuration nearly achieves the highest processing rate. However, realistic latencies cause significant degradation of processor utilization, dropping below 15% for a 160 nsec latency. • As the number of threads is increased from 4 to 32, the NoC latency is increasingly well masked. This results in high processor utilization (as high as 97%), even with NoC latencies as high as 160 nsec. Profiling of the compiled IPv4 application code running on the ARM shows that 623 instructions are required to process one packet. Approximately 1 out of 8 (78 out of 623) of these instructions lead to an NoC access. This high access rate, which is inherent to the IPv4 application code, highlights the importance of masking the effect of communication latency. 2 This
minimum packet size is the worst-case condition as it requires the highest processing rates.
© 2006 by Taylor & Francis Group, LLC
Multiprocessor SoC Platform
26-25
TABLE 26.1 IPv4 Simulation Results for 2.5 and 10 Gb/sec Line-rate (Gb/sec)
Number of ARMs
Number of threads
NoC latency (nsec)
ARM utilization (%)
Packet latency (µsec)
16 48
8 16
40 80
67 86
16 30
2.5 10
This example illustrates the importance of the effective utilization of multiple threads in the presence of high latency interconnects. For example, in the presence of an NoC latency of 160 nsec, the throughput per processor varies between 35 Mb/sec for 4 threads, to 255 Mb/sec for 32 threads. The interprocessor communication represents only 8% of the total packet processing instructions. This very low cost — especially in this high-speed packet processing context — is achieved using the hardwarebased message passing and task scheduling mechanisms described above. In fact, a monoprocessor implementation would require approximately the same number of instructions, since you need to replace the message passing instructions with a regular procedure call.3 Two representative experimental results are summarized in Table 26.1. This provides architecture parameters to achieve 2.5 Gb/sec (OC48) and 10 Gb/sec (OC192) with worst-case traffic. For OC48, a configuration with 16 ARMs with 8 threads each can support the 2.5 Gb/sec line-rate with an NoC latency of 40 nsec. For OC192, a configuration of 48 ARMs with 16 threads each can support a 10 Gb/sec line-rate. Here, we assumed a higher NoC latency (80 nsec instead of 40 nsec) due to the higher number of PE’s. The total packet latency for this configuration is 30 µsec. In comparison with the 2.5 Gb/sec result, a line-rate of 4X is achieved with only 3X processors, in spite of a higher NoC latency (80 nsec instead of 40 nsec). This is a result of the higher processor utilization (86%), which is achieved by using 16 threads instead of 8. Note that even higher processor utilizations are achievable. With 40 processors and 24 threads, a processor utilization of 97% was obtained. However, in practice, this does not offer enough headroom for additional functionality. For both configurations, 50% of the reported latency to process a packet is a consequence of the NoC latency resulting from the required 78 NoC accesses. Note that the StepNP platform instance used for this application makes use of standard RISC processors only (ARM v4 ISA). The use of application-specific instructions is not straightforward in regular IPv4 packet forwarding since there are very few repeated sequences of operations to optimize. However, our academic partners demonstrated that the use of a Tensilica configurable processor optimized for a secure IPv4 packet forwarding application (using encryption) led to speedups up to 4.55x over an unoptimized core [6]. This example better demonstrates the value of configurable processors in the general StepNP platform.
26.12 A Traffic Manager Application To further illustrate the concepts discussed in this chapter, we have mapped a Multiflex model of the IPv4 packet traffic management application of Figure 26.13(a) onto the StepNP multiprocessor platform instance depicted in Figure 26.13(b). In comparison with the packet forwarding application example presented above, this application has the following challenges: • While there is natural interpacket parallelism, there are also numerous interpacket dependencies to account for, due to packet queuing and scheduling, for example. 3 In
theory, you could rewrite the entire application as in-line code and run it on a single processor. This would reduce the communication cost, but leads to unstructured code that is not scalable to multiple processors.
© 2006 by Taylor & Francis Group, LLC
26-26
Embedded Systems Handbook
• The intrapacket processing consists of a dozen tasks with complex data dependencies. Also, these tasks are medium- to small-grain (one to two hundred RISC instructions typically). • All user-provided task descriptions are written in C++. No low-level C or assembly code is used. The DSOC and SMP library code is also entirely written in C++. We will demonstrate that, in spite of these constraints, the Multiflex tools support the efficient mapping of high-level application descriptions onto the StepNP platform. Packet processing at 2.5 Gb/sec implies that for 54 byte packets (worst-case condition), the processing time per packet is less than 100 clock cycles (at 500 MHz). As the traffic manager requires over 750 RISC instructions (compiled from C++), this implies a lower bound of 8 processors (at one instruction/cycle), with 94% utilization.
26.12.1 Application Reference A packet traffic manager is a functional component located between a packet processor and the external world, which can be a switch fabric [47]. We assume the packet processor is performing header validation/modification, as well as address lookup and classification. An SPI (system packet interface) is used to interface with the application, both as input and output. Such interface, as the SPI4.2, can support a bandwidth of 10 Gb/sec, where packets are transmitted as sequence of fixed-size segments interleaved between multiple logical ports (up to 256 ports for SPI4.2). The main functions of the traffic manager are: • • • •
Packet reassembly and queuing from SPI input Packet scheduling per output port Rate-shaping per output port Packet dequeuing and segmentation for SPI output
Typically, the queues are implemented as linked-lists of fixed-size buffers, and large queues are supported using external memories. SRAMs are used to store the links and DRAMs for the buffer content. We assume in the following that both the SPI segment size and buffer size are 64 bytes.
26.12.2 DSOC Model A DSOC model of the traffic manager application is depicted in Figure 26.13(a). This model is composed of the following tasks: • • • • • • • • • •
ingSPI: input SPI protocol ingSeg: temporary buffer for input SPI segment dataMgr: interface to link-buffer data storage queMgr: linked-list management, supporting N lists for packet reassembly and N × C lists for packet queuing, where N is the number of SPI ports, and C the number of traffic classes ingPkt: packet reassembly and queuing schPkt: packet scheduling, implementing strict priority or round-robin per port egrPkt: packet dequeuing and segmentation shPort: output port rate-shaping egrSeg: temporary buffer for output SPI segment egrSPI: output SPI protocol
Each task is a parallel DSOC object (whose internal function is described in C++). The object granularity is user-defined. The arrows in Figure 26.13(a) represent the object calls between the DSOC objects. The object invocations are summarized as follows: Ingress direction: (1) ingSPI invokes ingSeg to buffer segment; (2) at the end-of-segment, ingSPI invokes ingPkt to manage a segment; (3) ingPkt invokes queMgr to push a buffer in the queue associated with the segment input port; (4) ingPkt invokes ingSeg to forward the segment to the address associated with the
© 2006 by Taylor & Francis Group, LLC
Multiprocessor SoC Platform
26-27
(a) DSOC traffic manager dataMgr 11
2
3
12
egrSPI
7
egrSeg
8
shPort
egrPkt
6
schPkt
4
ingPkt
1
10 ingSeg
ingSPI
5
9 queMgr
(b)
ingSPI ingSeg
queMgr
Pipe1 T1 Tm . . PipeP Data$
Pipe1 T1 Tm . . dataMgr PipeP Data$
RISC 1
S/W egrSPI egrSeg
RISC N
Parameterizable NoC (latency Li, j)
DSOC ORB
H/W
Memory
Message passing
SMP CE
Task scheduling
FIGURE 26.13 Application platform. (a) DSOC traffic manager. (b) Parameterizable NoC.
pushed buffer; (5) ingSeg invokes dataMgr to store the segment; (6) at the end-of-packet, ingPkt invokes queMgr to append the packet (input queue) in its associated output queue, and invokes schPkt to inform about the arrival of a new packet. Egress direction: (7) shPort invokes egrPkt to request a segment for an output port; (8) at end-of-packet, egrPkt invokes schPkt to decide from which class a packet needs to be forwarded for a given output port; (9) egrPkt invokes queMgr to pop a buffer from the queue associated with the output port and scheduled class; (10) egrPkt invokes dataMgr to retrieve the buffer content of the pop buffer; (11) dataMgr invokes egrSeg to store the segment; (12) egrSeg invokes egrSPI to output the segment.
26.12.3 StepNP Target Architecture The application described above is mapped on the StepNP platform instance of Figure 26.13(b). In order to support wire-speed network processing, a mixed hardware and software architecture is used. The use of DSOC objects, combined with the SIDL interface compiler, allows easy mapping of tasks to hardware or software. The simple but high-speed ingSPI/ingSeg, egrSeg/egrSPI, queMgr, and dataMgr tasks are mapped onto hardware. A similar partition is used for the Intel IXP network processor [48]. The remaining blocks of the DSOC application model are mapped onto software. Multiple instances of each of these blocks are mapped on processor threads in order to support a given wire-speed requirement. For the output processing, in order to use processor local memory to store output status, each instance of the schPkt, egrPkt, and shPort are associated with a disjoint subset of ports.
© 2006 by Taylor & Francis Group, LLC
26-28
Embedded Systems Handbook
The platform was configured with the following parameters: • • • • •
RISC processor ISA Number of processor pipeline stages Processor clock frequency Number of hardware threads (per processor) One-way NoC latency (jitter)
ARM v4 4 500 MHz 8 40 ± 10 nsec
Using a configuration with 16 ports and 2 classes, and a mixed of packet with random length between 54B and 1024B, simulation shows that a bandwidth of at least 2.5 Gb/sec can be supported with seven ARM processors: two ARMs for ingPkt, two ARMs for egrPkt (one thread per port), two ARMs for schPkt (one thread per port), and one ARM for shPort (one thread per two ports). However, when using short 54B packets (worst case condition), the supported bandwidth drops to 2.1 Gb/sec (as predicted with the eight processor lower bound calculation). Because some of the functional blocks are mapped on a thread-per-port basis, it is not possible to simply increase the number of ARM to support a higher bandwidth. In principle, this could be achieved by increasing the number of ports, but this will restrict the supported application domain. Alternatively, the application could be refined in smaller tasks to permit more freedom in mapping these tasks to threads, but this would require time-consuming code rewriting and balancing of functions on different threads. A simpler way is to relax the constraint on using local memory to store output port status. This is achieved by using shared memory, as described next.
26.12.4 DSOC + SMP Model
queMgr
FIGURE 26.14
DSOC + SMP traffic manager model.
© 2006 by Taylor & Francis Group, LLC
Mem egrSPI
Share mem
H/W
shPort
dataMgr
egrPkt
S/W
schPkt
DSOC + SMP traffic manager
ingPkt
Share mem
ingSPI
A mixed DSOC and SMP model of the traffic manager application is depicted in Figure 26.14. The main differences with the previous model are that (1) shared memory is used to store temporary segments, and (2) port status data is protected with semaphores. The DSOC and SMP model is mapped to the same platform as for the DSOC model (with minor modification of the ingSPI and egrSPI blocks). Using the same configuration parameters as for the reported DSOC experiment, a bandwidth of 2.6 Gb/sec is supported for the case of 54B packet using nine ARMs. In this configuration, one ARM is still used for the shPort, while the other eight ARMs are used to perform any of the ingPkt, schPkt, and egrPkt functions, as scheduled by the ORB. This automatic task balancing by the DSOC ORB, combined with shared-memory management using the SMP support, allows for easy exploration of different application configurations, as described next.
Multiprocessor SoC Platform TABLE 26.2 Number of ports 16 16 64 64 256
26-29 Experimental Results Number of classes
Number of ARMs
2 8 8 32 32
9 9 9 10 10
Bandwidth (Gb/sec) Strict-priority
Round-robin
2.63 2.54 2.55 2.54 2.50
NA 2.33 2.47 2.44 2.45
26.12.5 Experimental Results Experimental results for different number of ports and classes are summarized in Table 26.2, showing the number of ARMs required to support a bandwidth of at least 2.5 Gb/sec when using strict-priority scheduling. All the experiments used the same platform parameters described above, where one ARM is dedicated to the shPort port shaping functionality. We can see that increasing the number of classes requires more processing, while increasing the number of ports has almost no impact. The table also shows the bandwidth achieved using a variant of schPkt functionality supporting three class categories: (1) high-priority, (2) fair-sharing, and (3) best-effort. Fair-sharing classes are scheduled following a round-robin scheme. The table indicates that the processing impact of this “improved” scheduler is more significant when there are less supported ports. There are many variant of scheduler functionality that can be implemented in a packet traffic manager, as well as other features as RED and policing [47], and there are many variant of multiprocessor configurations. The experiments shown here demonstrate that the Multiflex architecture, programming model and tools form a very useful environment to explore and derive an HW/SW multiprocessor solution. The average processor utilization in all the experiments varied from 85 to 91%, allowing us to get close to the eight processor theoretical lower bound. For the most complex scheduler (the round-robin version for 32 classes), the egrPkt + schPkt pair runs in 401 instructions on average. Of these, 87 instructions are needed for 7 DSOC calls, or 22% of instructions. This demonstrates the importance of the fast message passing, task scheduling, and context switching hardware for these medium- to fine-grain tasks.
26.12.6 Hardware Implementation The implementation of the Multiflex hardware O/S accelerator engines required for the traffic manager above requires 58K gates and 18K bytes of memory in total (RTL synthesis result). This includes ORB to support the message passing model, one CE for SMP, and ten message passing engines. The total area for a high-speed implementation of these 12 components (gates and memory) is less than 0.6 mm2 for ST’s 90 nm CMOS technology.
26.13 Summary We have described the StepNP™ flexible SoC platform and the associated Multiflex programming and analysis environment. The StepNP platform consists of multiple configurable hardware multithreaded processors, configurable and reconfigurable HW PEs, shared-memory, and networking-oriented I/O, all connected via a network-on-chip (NoC). The key characteristic of the StepNP platform is that, although it is composed of heterogeneous hardware and software precessing elements (PEs), memories and I/O blocks, the use of a single standardized protocol to communicate with a single global NoC allowed us to build a homogeneous programming environment supporting automatic application-to-platform mapping. The Multiflex MP-SoC simulation and analysis tools are built on a transaction-level model (TLM) in SystemC that supports interaction, instrumentation, and analysis. We demonstrate simulation speeds of
© 2006 by Taylor & Francis Group, LLC
26-30
Embedded Systems Handbook
over 7 kHz for a complex TLM model of 32 RISC hardware multithreaded processors, 4 HW PEs, I/O, and an NoC. The Multiflex MP-SoC programming environment supports two parallel programming models: a DSOC message passing model, and a symmetrical multiprocessing (SMP) model using shared memory. We believe the Multiflex programming models should be immediately familiar and intuitive to software developers exposed to mainstream SMP and distributed software techniques available in languages, such as Java or C#. In addition, the framework allows capture of characteristics of objects in a way that can be exploited by automation tools. The Multiflex programming tools supports the rapid exploration of algorithms written using interoperable DSOC and SMP programming models, automatically mapped to a wide range of parallel architectures. Using this approach, the system-level application development is largely decoupled from the details of a particular target platform mapping. Application objects can be executed on a variety of processors, as well as on configurable or fixed hardware. Moreover, the StepNP platform includes hardware-assisted messaging and dynamic task scheduling and allocation engines that support the platform mapping tools in order to achieve low-cost communication and high processor utilization rates. We presented the results of mapping two packet processing applications onto the StepNP platform, for a range of architectural parameters: (1) an Internet IPv4 packet forwarding application, running at 2.5 and 10 Gb/sec, and (2) a more complex and finer grain traffic management application, running at 2.5 Gb/sec. The HW/SW architecture is able to sustain end-to-end object-oriented message passing from one processor to another, at a rate of about 35 million per second (using 500 MHz RISC-style processors). As a result, the interprocessor communication code cost is kept to a strict minimum. For the IPv4 packet forwarding, which uses the DSOC distributed message-passing parallel programming model and consists of medium-grain parallelism (500 RISC instructions or more), we achieved a utilization rate of the embedded RISC processors as high as 97%, even in presence of NoC interconnect latencies of up to 160 nsec (one-way), while processing worst-case IPv4 traffic at a 10 Gb/sec line-rate. Only 8% of processor instructions are required for interprocessor communication. For the 2.5 Gb/sec traffic management application, we have developed a mixed DSOC and SMP programming model. Processor utilizations of 85 to 91% have been demonstrated. The low granularity of the tasks parallelized (typically <200 RISC instructions) highlight the importance of the efficient hardware engines used for task scheduling, context switching, and message passing. In this case, and in spite of the fine-grain parallelism, interprocessor communication represents only 22% of instructions.
26.14 Outlook The Multiflex technology has been recently applied to the mapping of a high-level MPEG4 video codec (VGA resolution at 30 frames per second) onto a mixed multiprocessor and hardware platform. This application makes heavy use of the SMP programming model. However, message passing is used for the interaction with tasks implemented in hardware. In this case study, we demonstrated that 95% of the MPEG4 functionality (as expressed in lines of MPEG4 application C code) can be mapped onto five simple RISC processors, running at 200 MHz. The average processor utilization rate is 88%. However, this application code mapped to software represents only 20% of the performance. This provides a highly flexible, yet low-cost solution since the remaining 80% of the performance is achieved using simple, regular hardware components. This demonstrates the general nature of the programming models supported. We will be exploring a H.264 application in the next step. We are also currently exploring the mapping of Layer1 modem functions for a 3G basestation. This includes CRC attachment, channel coding, first interleaver, second interleaver and deinterleaver, rate matching, spreading, and despreading. These base functions will be integrated with the public domain 3G stack from the Eurecom engineering school [49]. A physical implementation of the StepNP platform in 90 nm process technology is currently in the definition phase and is planned in the next year. This is for a platform configuration with a large number of
© 2006 by Taylor & Francis Group, LLC
Multiprocessor SoC Platform
26-31
multithreaded processors (16 or more), configurable HW PEs and I/O. This will be the basis for a high-end prototyping platform, from which application-specific platforms can be derived. Future activities on flexible MP-SoC platforms include the evaluation and manufacturing of a range of network-on-chip topologies. Finally, we are working with researchers from the Politecnico di Milano on the integration of their power estimation framework [50].
Acknowledgments We thank our ST colleagues Michele Borgatti, Pierluigi Rolandi, Bhusan Gupta, Naresh Soni, Philippe Magarshack, Frank Ghenassia, Alain Clouard, Miguel Santana, and Serge De Paoli, as well as Profs Guy Bois, David Quinn, Bruno Lavigueur, and Olivier Benny of Ecole Polytechnique de Montreal, Profs Donatella Sciuto and Giovanni Beltrame of Politecnico di Milano, Prof. Alain Greiner of Univ. Paris VI, Profs Martin Bainier and James Kanga Foaleng of Ecole Polytechnique de Grenoble, and Prof. Thomas Philipp of University of Aachen for their contributions to the themes and tools discussed in this chapter.
References [1] N. Shah, Understanding Network Processors, Internal report, Department of Electrical Engineering and Computer Science, University of California, Berkeley, 2001; http://wwwcad.eecs.berkeley.edu/∼niraj/papers/UnderstandingNPs.pdf [2] P.G. Paulin, F. Karim, and P. Bromley, Network Processors: A Perspective on Market Requirements, Processor Architectures and Embedded SW Tools, in Proceedings of the Design, Automation, and Test in Europe (DATE 2001), IEEE CS Press, Los Alamitos, CA, 2001, pp. 420–429. [3] A. Jantsch and H. Tenhunen (Eds.), Networks on Chip, Kluwer Academic Publishers, Dordrecht, 2003. [4] L. Benini and G. De Micheli, Networks on Chip: A New SoC Paradigm, Computer, 35: 70–72, 2002. [5] F. Karim et al., On-Chip Communication Architecture for OC-768 Network Processors, in Proceedings of Design Automation Conference (DAC 01), ACM Press, New York, 2001, pp. 678–683. [6] D. Quinn et al., A System-Level Exploration Platform and Methodology for Network Applications Based on Configurable Processors, in Proceedings of Design Automation and Test in Europe (DATE), Paris, February 2004. [7] B. Ackland et al., A Single Chip 1.6 Billion 16-b MAC/s Multiprocessor DSP, in Proceedings of Custom Integrated Circuits Conference, 1999. [8] J.A.J. Leijten et al., PROPHID: A Heterogeneous Multiprocessor Architecture for Multimedia, in Proceedings of International Conference on Computer Design, 1997. [9] M. Rutten et al., A Heterogeneous Multiprocessor Architecture for Flexible Media Processing, IEEE Design & Test of Computers, 19: 39–50, 2002. [10] S. Dutta et al., Viper: A Multiprocessor SoC for Advanced Set-Top Box and Digital TV Systems, IEEE Design & Test of Computers, 18: 21–31, 2001. [11] K. Keutzer, S. Malik, R. Newton, J. Rabaey, and A. Sangiovanni-Vincentelli, System Level Design: Orthogonalization of Concerns and Platform-Based Design, IEEE Transactions on Computer-Aided Design, 19(12): 1523–1543, 2000. [12] M. Gries, S. Weber, and C. Brooks, The Mescal Architecture Development System (tipi) Tutorial, Technical report, UCB/ERLM03/40, Electronics Research Lab, University of California at Berkeley, October 2003. [13] M. Sgroi, M. Sheets, A. Mihal, K. Keutzer, S. Malik, J. Rabaey, and A. Sangiovanni-Vincentelli, Addressing the System-on-a-Chip Interconnect Woes through Communication Based Design, in Proceedings of Design Automation Conference, June 2001, pp. 667–672. [14] L. Carro, M. Kreutz, F.R. Wagner, and M. Oyamada, System Synthesis for Multiprocessor Embedded Applications, in Proceedings of Design Automation and Test in Europe, March 2000, pp. 697–702.
© 2006 by Taylor & Francis Group, LLC
26-32
Embedded Systems Handbook
[15] A. Clouard et al., Towards Bridging the Gap between SoC Transactional and Cycle-Accurate Levels, in Proceedings of Design, Automation, and Test in Europe — Designer Forum, 2002, pp. 22–29. [16] A. Clouard, K. Jain, F. Ghenassia, L. Maillet-Contoz, and J.-P. Strassen, Using Transactional Level Models in a SoC Design Flow, in SystemC Methodologies and Applications, W. Muller, W. Rosentiel, and J. Ruf, Eds. Kluwer Academic Publishers, Dordrecht, 2003. [17] P.G. Paulin and M. Santana, FlexWare: A Retargetable Embedded-Software Development Environment, IEEE Design & Test of Computers, 19: 59–69, 2002. [18] J. Henkel, Closing the SoC Design Gap, IEEE Computer Magazine, 36(9): 19–121, September 2003. [19] P. Magarshack and P.G. Paulin, System-on-Chip Beyond the Nanometer Wall, in Proceedings of 40th Design Automation Conference (DAC), Anaheim, June 2003. [20] P.G. Paulin et al., Chips of the Future: Soft, Crunchy or Hard, in Proceedings of Design Automation and Test in Europe (DATE), Paris, 2004. [21] J.L. Hennessy et al., Computer Architecture: A Quantitative Approach, Morgan Kaufmann, San Mateo, CA, 1990. [22] See Tensilica web site: http://www.tensilica.com [23] M. Borgatti et al., A 0.18 µm, 1GOPS Reconfigurable Signal Processing IC with Embedded FPGA and 1.2 GB/s, 3-Port Flash Memory Subsystem, in Proceedings of the International Solid-State Circuits Conference (ISSC), San Francisco, February 2003. [24] See LisaTek tools on CoWare web site: http://www.coware.com [25] See OCP-IP web site: http://www.ocpip.org [26] P.G. Paulin, C. Pilkington and E. Bensoudane, StepNP: A System-Level Exploration Platform for Network Processors, IEEE Design & Test of Computers, 19: 17–26, 2002. [27] See STMicroelectronics STBus web site: http://www.stmcu.com/inchtml-pages-STBus_intro.html [28] P. Guerrier and A. Greiner, A Generic Architecture for On-Chip Packet-Switched Interconnections, in Proceedings of Design Automation and Test in Europe (DATE), IEEE CS Press, Los Alamitos, CA, March 2000, pp. 70–78. [29] A. Greiner et al., SPIN: A Scalable, Packet-Switched, On-Chip Micro-Network, in Proceedings of Design Automation and Test in Europe (Designer Forum), Munich, March 2003. [30] S.-J. Lee et al., An 800 MHz Star-Connected On-Chip Network for Application to Systems on a Chip, in Proceedings of International Solid-State Circuits Conference (ISSC), San Francisco, February 2003. [31] See eASIC Corp. web site: http://www.easic.com [32] N. Soni et al., NPSE: A High Performance Network Packet Search Engine, in Proceedings of Design Automation and Test in Europe (Designer Forum), Munich, March 2003. [33] See Open SystemC web site: http://www.systemc.org [34] See VSIA web site: http://www.vsi.org [35] J.M. Paul, Programmers’ Views of SoCs, in Proceedings of CODESS/ISSS, October 2003. [36] G. DeMicheli, Networks on a Chip, in Proceedings of MPSoC 2003, Chamonix, France, 2003. [37] N. Shah, W. Plishker et al., NP-Click: A Programming Model for the Intel IXP1200, in Proceedings of Workshop on Network Processors, International Symposium on High Performance Architecture, February 2003. [38] S. Kiran et al., A Complexity Effective Communication Model for Behavioral Modeling of Signal Processing Application, in Proceedings of 40th Design Automation Conference, Anaheim, June 2003. [39] M. Forsell, A Scalable High-Performance Computing Solution for Network on Chip, IEEE Micro, 22(5): 46–55, September–October 2002. 
[40] Object Management Group, www.omg.org [41] Distributed Component Object Model (DCOM), http://www.microsoft.com/com/tech/ DCOM.asp [42] See http://www.research.att.com/∼bs/C++0x_panel.pdf [43] K. Goossens, Systems on Chip and Networks on Chip: Bridging the Gap with QoS, in Proceedings of Application-Specific Multi-Processor Systems School, Chamonix, July 2003.
© 2006 by Taylor & Francis Group, LLC
Multiprocessor SoC Platform
26-33
[44] J. Lee, V. Mooney et al., A Comparison of the RTU Hardware RTOS with HW/SW RTOS, in Proceedings of the ASP-DAC, January 2003, pp. 683–688. [45] P. Kohout, B. Ganesh, and B. Jacob, Hardware Support for Real-Time Operating Systems, in Proceedings of Codes-ISSS, Newport Beach, CA, October 2003, pp. 45–51. [46] E. Kohler et al., The Click Modular Router, ACM Transactions on Computer System, 18: 263–297, 2000. [47] G. Armitage, Quality of Service in IP Networks, MacMillan Technical Publishing, Indianapolis IN, USA, 2000. [48] E.J. Johnson and A.R. Kunze, IXP2400/2800 Programming, Intel Press, Hillsboro OR, USA, 2003. [49] See web site: http://www.wireless3G4Free.com [50] W. Fornaciari, F. Salice, and D. Sciuto, Power Modeling of 32-bit Microprocessors, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 21: 1306–1316, 2002.
© 2006 by Taylor & Francis Group, LLC
III Testing of Embedded Core-Based Integrated Circuits 27 Modular Testing and Built-In Self-Test of Embedded Cores in System-on-Chip Integrated Circuits Krishnendu Chakrabarty
28 Embedded Software-Based Self-Testing for SoC Design Kwang-Ting (Tim) Cheng
© 2006 by Taylor & Francis Group, LLC
27 Modular Testing and Built-In Self-Test of Embedded Cores in System-on-Chip Integrated Circuits 27.1 Introduction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27-1 Core-Based SOC • Testing a SOC • Built-In Self-Test
27.2 Modular Testing of SOCs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27-5 Wrapper Design and Optimization • TAM Design and Optimization • Test Scheduling • Integrated TAM Optimization and Test Scheduling • Modular Testing of Mixed-Signal SOCs
27.3 BIST Using a Reconfigurable Interconnection Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27-14 Declustering the Care Bits
Krishnendu Chakrabarty Duke University
27.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27-22 Acknowledgments. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27-23 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27-23
27.1 Introduction Integrated circuits (ICs) are widely used in today’s electronic systems, with applications ranging from microprocessors and consumer electronics to safety-critical systems, such as medical instruments and aircraft control systems. In order to ensure the reliable operation of these systems, high-quality testing of ICs is essential. To reduce product cost, it is also necessary to reduce the cost of testing without compromising product quality in any way. Testing of an IC is a process in which the circuit is exercised with test patterns and its resulting response is analyzed to ascertain whether it behaves correctly. Testing can be classified based on a number of criteria. Depending upon the specific purpose of the testing process, it can be categorized into four types: characterization, production test, burn-in, and incoming inspection [1]. Characterization, also known as silicon debug, is performed on a new design before it is sent for mass production. The purpose is to
27-1
© 2006 by Taylor & Francis Group, LLC
27-2
Embedded Systems Handbook 160 140 Relative growth
120 100 80 60 40 20 0 1999
2002
2005
2008
2011
2014
Year
FIGURE 27.1 Projected relative growth in test data volume for ICs [3]. (From L. Li and K. Chakrabarty. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 23, 1289–1305, 2004. With permission.)
diagnose and correct design errors, and measure chip characteristics for setting specifications. Probing of the internal nodes of the chip, which is rarely done in production test, may also be required during characterization. Every fabricated chip is subject to production test, which is less comprehensive than characterization test. The test vectors may not cover all possible functions and data patterns, but must have a high coverage of modeled faults. Since every device must be tested, test data volume and testing time, which directly affects testing cost, must be minimized. Only a pass/fail decision is made and no fault diagnosis is attempted. Even after successfully passing the production test, some devices may fail very quickly when they are put into actual use. These bad chips are driven to actual failure and weeded out by the burn-in process, in which the chips are subject to a combination of production test, high temperature, and over-voltage power supply for a period of time. System manufacturers perform incoming inspection on the purchased devices before integrating them into the system. Incoming inspection can be similar to production testing, or tuned to the specific application. It can also be carried out for a random sample with the sample size depending on the device quality and the system requirement. In this chapter, we focus on the production test process, which may also be used for burn-in process and incoming inspection. Recent advances in VLSI technology have lead to a rapid increase in the density of ICs. As projected in the 2001 edition of the International Technology Roadmap for Semiconductors (ITRS) [2], the density of ICs can reach 2 billion transistors per cm2 and 16 billion transistors per chip are likely in 2014. The increased density results in a tremendous increase in test data volume. Figure 27.1 shows the projected relative growth in test data volume for ICs [3]. The test data volume for ICs in 2014 can be as much as 150 times the test data volume in 1999. Considering the fact that the test data volume for an IC exceed 800 Mbits in 1999 [4], the test data volume in the near future will be prohibitively high. In addition to the increasing density of ICs, today’s system-on-chip (SOC) designs also exacerbate the test data volume problem. SOCs reduce design cycle time since prevalidated embedded cores can be purchased from core vendors and plugged into the design. However, each embedded core in an SOC needs to be tested using a large number of precomputed test patterns, which leads to a dramatic increase in test data volume [5]. The increase in test data volume not only leads to the increase in testing time, but the high test data volume may also exceed the limited memory depth of automatic test equipment (ATE). Multiple ATE reloads are time-consuming since data transfer from a workstation to the ATE hard disk or from the ATE hard disk to ATE channels are very slow. For example, it may take up to 1 h to transfer 32 Gbits (64 channels × 512 Mbits per channel) of test data from the ATE hard disk to the channel, at approximately 9 Mbits/sec [6].
© 2006 by Taylor & Francis Group, LLC
Modular Testing and Built-In Self-Test
27-3
Test access mechanism Source
Test access mechanism Embedded core
Sink
Wrapper
FIGURE 27.2 Overview of the three elements in an embedded-core test approach: (1) test pattern source and sink, (2) test access mechanism, and (3) core test wrapper [5].
27.1.1 Core-Based SOC In modern SOC designs, predesigned, prevalidated, and highly optimized embedded cores are routinely used. Based on the level of detail provided by the core vendors, embedded cores are categorized as soft cores, firm cores, and hard cores. Soft cores come in the form of a synthesizable register-transfer level (RTL) description. Soft cores leave much of the implementation to the system integrator, but are flexible and process independent. Firm cores are supplied as gate-level netlists. Hard cores or legacy cores are available as nonmodifiable layouts. Hard cores are optimized for anticipated performance, with an associated loss of flexibility. In order to protect the intellectual property (IP) of the core vendors, the detailed structural information is not released to system integrators. Core vendors must therefore develop the design for testability (DFT) structures and provide corresponding test patterns. A typical SOC consists of various cores, such as CPU, DSP, embedded memory, I/O controllers, video/audio codec cores, ethernet MAC cores, encryption cores, and analog/digital data converters. By reusing the IP cores across multiple generations, SOC designers greatly shorten the time-to-market and reduce the design cost. Although SOCs offer a number of advantages, production testing of SOCs is more difficult than that for traditional ICs. Due to the absence of structural information about the IP cores, fault simulation and test generation are not feasible. Testing of the IP cores in SOCs must therefore be based on the reuse of precomputed tests obtained from core vendors.
27.1.2 Testing an SOC An SOC test is essentially a composite test comprised of the individual tests for each core, the user-defined logic (UDL) tests, and interconnect tests. Each individual core or UDL test may involve surrounding components and may imply operational constraints (e.g., safe mode, low power mode, bypass mode), which necessitate special isolation modes. System-on-chip test development is challenging due to several reasons. Embedded cores represent IP and core vendors are reluctant to divulge structural information about their cores to users. Thus, users cannot access core netlists and insert DFT hardware that can ease test application from the surrounding logic. Instead, a set of test patterns is provided by the core vendor that guarantees a specific fault coverage. These test patterns must be applied to the cores in a given order, using a specific clocking strategy. Care must often be taken to ensure that undesirable patterns and clock skews are not introduced into these test streams. Furthermore, cores are often embedded in several layers of user-designed or other core-based logic, and are not always directly accessible from chip I/Os. Propagating test stimuli to core inputs may therefore require dedicated test transport mechanisms. Moreover, translation of test data is necessary at the inputs and outputs of the embedded-core into a format or sequence suitable for application to the core. A conceptual architecture for testing embedded-core-based SOCs is shown in Figure 27.2 [5]. It consists of three structural elements: 1. Test pattern source and sink. The test pattern source generates the test stimuli for the embedded cores, and the test pattern sink compares the response(s) to the expected response(s).
© 2006 by Taylor & Francis Group, LLC
27-4
Embedded Systems Handbook
2. Test access mechanism (TAM). The TAM transports test patterns. It is used for on-chip transport of test stimuli from test pattern source to the core under test, and for the transport of test responses from the core under test to a test pattern sink. 3. Core test wrapper. The core test wrapper forms the interface between the embedded core and its environment. It connects the terminals of the embedded core to the rest of the IC and to the TAM. Once a suitable test data transport mechanism and test translation mechanism have been designed, the next major challenge confronting the system integrator is test scheduling. This refers to the order in which the various core tests and tests for user-designed interface logic are applied. A combination of built-in self-test (BIST) and external testing is often used to achieve high fault coverage [7,8], and tests generated by different sources may therefore be applied in parallel, provided resource conflicts do not arise. Effective test scheduling for SOCs is challenging because it must address several conflicting goals: (1) SOC testing time minimization, (2) resource conflicts between cores arising from the use of shared TAMs and on-chip BIST engines, (3) precedence constraints among tests, and (4) power constraints. Finally, analog and mixed-signal cores are increasingly being integrated onto SOCs with digital cores. Testing mixed-signal cores is challenging because their failure mechanisms and testing requirements are not as well modeled as they are for digital cores. It is difficult to partition and test analog cores, because they may be prone to crosstalk across partitions. Capacitance loading and complex timing issues further exacerbate the mixed-signal test problem.
27.1.3 Built-In Self-Test In BIST solutions, test patterns are generated by an on-chip pseudorandom pattern generator, which is usually a linear feedback shift register (LFSR). BIST alleviates a number of problems related to test interfacing, for example, limited signal bandwidth and high pin count of the device-under-test. A typical BIST architecture is shown in Figure 27.3. Since the output patterns of the LFSR are time-shifted and repeated, they become correlated; this reduces the effectiveness for fault detection. Therefore, a phase shifter (a network of XOR gates) is often used to decorrelate the output patterns of the LFSR. The response of the circuit under test (CUT) is usually compacted by a multiple input shift register (MISR) to a small signature, which is compared with a known fault-free signature to determine whether the CUT is faulty. Most BIST techniques rely on the use of a limited number of pseudorandom patterns to detect the random-pattern-testable faults, which is subsequently followed by the application of a limited number of deterministic patterns to detect the random-pattern-resistant faults. Based on the mechanisms that are used to generate the deterministic patterns, logic BIST techniques can be classified into two categories: methods that generate deterministic patterns by controlling the states of the LFSR [9–12], and techniques that modify the patterns generated by the LFSR [13–15]. Linear feedback shift register reseeding [10,12,16–19] is an example of a BIST technique that is based on controlling the LFSR state. LFSR reseeding can be static, that is, the LFSR stops generating patterns while loading seeds, or dynamic, that is, test generation and seed loading can proceed simultaneously. The length of the seeds can be either equal to the size of the LFSR (full reseeding) or less than the size of the LFSR (partial reseeding). In [10], a dynamic reseeding technique that allows partial reseeding is
Scan chain 1 ( l bits) Phase shifter
Scan chain 2 ( l bits)
...
...
LFSR
MISR
Scan chain m ( l bits)
FIGURE 27.3 A generic BIST architecture based on an LFSR, an MISR, and a phase shifter.
© 2006 by Taylor & Francis Group, LLC
Modular Testing and Built-In Self-Test
27-5
proposed to encode test vectors. An LFSR of length r ≥ smax + 20, where smax is the maximum number of specified bits in any deterministic test cube, is used to generate the test patterns. While the length of the first seed is r, the lengths of the subsequent seeds are significantly smaller than r. A set of linear equations is solved to obtain the seeds, and the test vectors are reordered to facilitate the solution of this set of linear equations. A BIST pattern generator based on a folding counter is proposed in [9]. The properties of the folding counter are exploited to find the seeds needed to cover the given set of deterministic patterns. Width compression is combined with reseeding to reduce the hardware overhead. In [11], a twodimensional test data compression technique that combines an LFSR and a folding counter is proposed for scan-based BIST. LFSR reseeding is used to reduce the number of bits to be stored for each pattern (horizontal compression) and folding counter reseeding is used to reduce the number of patterns (vertical compression). Bit-flipping [15], bit-fixing [13,20–24], and weighted random BIST [14,25–28] are examples of techniques that rely on altering the patterns generated by the LFSR to embed deterministic test cubes. In [29], a hybrid BIST method based on weighted pseudorandom testing is presented. A weight of 0, 1, or u (unbiased) is assigned to each scan chain in CUT. The weight sets are compressed and stored on the tester. During test application, an on-chip lookup table is used to decompress the data from the tester and generate the weight sets. A three-weight weighted random scan-BIST scheme is discussed in [14]. The weights in this approach are 0, 0.5, and 1. In order to reduce the hardware overhead, scan cells are carefully reordered and a special ATPG approach is used to generate suitable test cubes. In this chapter, we review recent advances in modular testing of core-based SOCs, and a BIST technique that reduces test data volume and testing time. A comprehensive set of references is also provided for the interested reader.
27.2 Modular Testing of SOCs Modular testing of embedded cores in an SOC is being increasingly advocated to simplify test access and test application [5]. To facilitate modular test, an embedded core must be isolated from surrounding logic, and test access must be provided from the I/O pins of the SOC. Test wrappers are used to isolate the core, while TAMs transport test patterns and test responses between SOCs pins and core I/Os [5]. Effective modular test requires efficient management of the test resources for core-based SOCs. This involves the design of core test wrappers and TAMs, the assignment of test pattern bits to ATE channels, the scheduling of core tests, and the assignment of ATE channels to SOCs. The challenges involved in the optimization of SOC test resources for modular test can be divided into two broad categories: 1. Wrapper/TAM co-optimization. Test wrapper design and TAM optimization are of critical importance during system integration since they directly impact hardware overhead, testing time, and tester data volume. The issues involved in wrapper/TAM design include wrapper optimization, core assignment to TAM wires, sizing of the TAMs, and routing of TAM wires. As shown in [30–32], most of these problems are N P -hard. 2. Constraint-driven test scheduling. The primary objective of test scheduling is to minimize testing time, while addressing one or more of the following issues: (1) resource conflicts between cores arising from the use of shared TAMs and BIST resources, (2) precedence constraints among tests, and (3) power dissipation constraints. Furthermore, testing time can often be decreased further through the selective use of test preemption [33]. As discussed in [7,33], most problems related to test scheduling for SOCs are also N P -hard. In addition, the rising cost of ATE for SOC devices is a major concern [2]. Due to the growing demand for pin counts, speed, accuracy, and vector memory, the cost of high-end ATE for full-pin, at-speed functional test is predicted to rise to over $20 million by 2010 [2]. As a result, the use of low-cost ATE that
© 2006 by Taylor & Francis Group, LLC
27-6
Embedded Systems Handbook
perform structural rather than at-speed functional test is increasingly being advocated for reducing test costs. Multisite testing, in which multiple SOCs are tested in parallel on the same ATE, can significantly increase the efficiency of ATE usage, as well as reduce testing time for an entire production batch of SOCs. The use of low-cost ATE and multisite test involve test data volume reduction and test pin count (TAM width) reduction, such that multiple SOC test suites can fit in ATE memory in a single test session [34,35]. As a result of the intractability of the problems involved in test planning, test engineers have adopted a series of simple ad hoc solutions in the past [34]. For example, the problem of TAM width optimization is often simplified by stipulating that each core on the SOC have the same number of internal scan chains, say W ; thus, a TAM of width W bits is laid out and cores are simply daisy-chained to the TAM. However, with the growing size of SOC test suites and rising cost of ATE, more aggressive test resource optimization techniques that enable effective modular test of highly complex next-generation SOCs using current-generation ATE is critical.
27.2.1 Wrapper Design and Optimization A core test wrapper is a layer of logic that surrounds the core and forms the interface between the core and its SOC environment. Wrapper design is related to the well-known problems of circuit partitioning and module isolation, and is therefore a more general test problem than its current instance (related to SOC test using TAMs). For example, earlier proposed forms of circuit isolation (precursors of test wrappers) include boundary scan and BILBO [36]. The test wrapper and TAM model of SOC test architecture was presented in [5]. In this chapter, three mandatory wrapper operation modes listed were (1) normal operation, (2) core-internal test, and (3) core-external test. Apart from the three mandatory modes, two optional modes are “core bypass” and “detach.” Two proposals for test wrappers have been the “test collar” [37] and TestShell [38]. The test collar was designed to complement the Test Bus architecture [37] and the TestShell was proposed as the wrapper to be used with the TestRail architecture [38]. In [37], three different test collar types were described: combinational, latched, and registered. For example, a simple combinational test collar cell consisting of a 2-to-1 multiplexer can be used for high-speed signals at input ports during parallel, at-speed test. The TestShell described in [38] is used to isolate the core and perform TAM width adaptation. It has four primary modes of operation: function mode, IP test mode, interconnect test mode, and bypass mode. These modes are controlled using a test control mechanism that receives two types of control signals: pseudostatic signals (that retain their values for the duration of a test) and dynamic control signals (that can change values during a test pattern). An important function of the wrapper is to adapt the TAM width to the core’s I/O terminals and internal scan chains. This is done by partitioning the set of core-internal scan chains and concatenating them into longer wrapper scan chains, equal in number to the TAM wires. Each TAM wire can now directly scan test patterns into a single wrapper scan chain. TAM width adaptation directly affects core testing time and has been the main focus of research in wrapper optimization. Note that to avoid problems related to clock skew, internal scan chains in different clock domains must either not be placed on the same wrapper scan chain, or antiskew (lockup) latches must be placed between scan flip–flops belonging to different clock domains. The issue of designing balanced scan chains within the wrapper was addressed in [39]; see Figure 27.4. The first techniques to optimize wrappers for test time reduction were presented in [32]. To solve the problem, the authors proposed two polynomial-time algorithms that yield near-optimal results. The largest processing time (LPT) algorithm is taken from the Multiprocessor Scheduling literature and solves the wrapper design problem in very short computation times. At the expense of a slight increase in computation time, the Combine algorithm yields even better results. It uses LPT as a start solution, followed by a linear search over the wrapper scan chain length with the First Fit Decreasing heuristic.
FIGURE 27.4 Wrapper chains: (a) unbalanced, (b) balanced. (From V. Iyengar, K. Chakrabarty, and E.J. Marinissen. In Proceedings of the IEEE Asian Test Symposium, pp. 320–325, 2002. With permission.)
To perform wrapper optimization, the authors in [31] proposed Design_wrapper, an algorithm based on the Best Fit Decreasing heuristic for the bin packing problem. The algorithm has two priorities: (1) minimizing core testing time and (2) minimizing the TAM width required for the test wrapper. These priorities are achieved by balancing the lengths of the wrapper scan chains and by identifying the number of wrapper scan chains that actually need to be created to minimize testing time. Priority (2) is addressed by the algorithm since it has a built-in reluctance to create a new wrapper scan chain while assigning core-internal scan chains to the existing wrapper scan chains [31]. Wrapper design and optimization continue to attract considerable attention. Recent work in this area has focused on “light wrappers,” that is, the reduction of the number of register cells [40], and the design of wrappers for cores and SOCs with multiple clock domains [41].
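The shared idea behind these heuristics can be conveyed in a few lines of code. The following sketch is only a minimal illustration in the spirit of LPT, not the published LPT, Combine, or Design_wrapper implementations; the function name and the example chain lengths are ours. Core-internal scan chains are assigned, longest first, to the currently shortest wrapper scan chain, so that the wrapper scan chain lengths, and hence the scan-in time, stay balanced for a given TAM width w.

    import heapq

    def partition_scan_chains(chain_lengths, w):
        """Assign core-internal scan chains to w wrapper scan chains,
        longest chain first, always extending the currently shortest
        wrapper scan chain. Returns the assignment and the maximum
        wrapper scan chain length, which dominates the scan-in time."""
        heap = [(0, i) for i in range(w)]   # (current length, wrapper chain index)
        heapq.heapify(heap)
        assignment = [[] for _ in range(w)]
        for length in sorted(chain_lengths, reverse=True):
            cur, idx = heapq.heappop(heap)  # shortest wrapper chain so far
            assignment[idx].append(length)
            heapq.heappush(heap, (cur + length, idx))
        max_len = max(cur for cur, _ in heap)
        return assignment, max_len

    # Four internal scan chains of lengths 8, 4, 2, and 2 on a 2-bit TAM:
    # the result is two balanced wrapper chains of length 8 each, instead
    # of a single 16-bit chain when everything is daisy-chained to one wire.
    print(partition_scan_chains([8, 4, 2, 2], 2))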
27.2.2 TAM Design and Optimization

Many different TAM designs have been proposed in the literature. TAMs have been designed based on direct access to cores multiplexed onto the existing SOC pins [42], reuse of the on-chip system bus [43], transparent paths through and around neighboring modules [44–46], and 1-bit boundary scan rings around cores [47,48]. Recently, the most popular appear to be the dedicated, scalable TAMs such as Test Bus [37] and TestRail [38]. Although their dedicated wiring adds to the area costs of the SOC, their flexible nature and guaranteed test access have proven successful. Three basic types of such scalable TAMs have been described in [49] (see Figure 27.5): (1) the Multiplexing architecture, (2) the Daisychain architecture, and (3) the Distribution architecture. In the Multiplexing and Daisychain architectures, all cores get access to the total available TAM width, while in the Distribution architecture, the total available TAM width is distributed over the cores.

In the Multiplexing architecture, only one core wrapper can be accessed at a time. Consequently, this architecture supports only serial schedules, in which the cores are tested one after the other. An even more serious drawback is that testing the circuitry and wiring between cores is difficult, since interconnect test requires simultaneous access to multiple wrappers. The other two basic architectures do not have these restrictions; they allow for both serial and parallel test schedules, and also support interconnect testing.
FIGURE 27.5 The (a) Multiplexing, (b) Daisychain, and (c) Distribution architectures [49]. (From V. Iyengar, K. Chakrabarty, and E.J. Marinissen. In Proceedings of the IEEE Asian Test Symposium, pp. 320–325, 2002. With permission.)
FIGURE 27.6 The (a) fixed-width Test Bus architecture, (b) fixed-width TestRail architecture, and (c) flexible-width Test Bus architecture [51]. (From V. Iyengar, K. Chakrabarty, and E.J. Marinissen. In Proceedings of the IEEE Asian Test Symposium, pp. 320–325, 2002. With permission.)
The Test Bus architecture [37] (see Figure 27.6[a]) is a combination of the Multiplexing and Distribution architectures. A single Test Bus is in essence the same as the Multiplexing architecture; cores connected to the same Test Bus can only be tested sequentially. The Test Bus architecture allows for multiple Test Buses on one SOC that operate independently, as in the Distribution architecture. Cores connected to the same Test Bus suffer from the same drawback as in the Multiplexing architecture, that is, their wrappers cannot be accessed simultaneously, making core-external testing difficult or impossible.

The TestRail architecture [38] (see Figure 27.6[b]) is a combination of the Daisychain and Distribution architectures. A single TestRail is in essence the same as the Daisychain architecture: scan-testable cores connected to the same TestRail can be tested simultaneously as well as sequentially. A TestRail architecture allows for multiple TestRails on one SOC, which operate independently, as in the Distribution architecture. The TestRail architecture supports serial and parallel test schedules, as well as hybrid combinations of the two.

In most TAM architectures, the cores assigned to a TAM are connected to all wires of that TAM. We refer to these as fixed-width TAMs. A generalization of this design is one in which the cores assigned to
a TAM each connect to a (possibly different) subset of the TAM wires [50]. The core–TAM assignments are made at the granularity of TAM wires, instead of considering the entire TAM bundle as one inseparable entity. We call these flexible-width TAMs. This concept can be applied both to Test Bus and to TestRail architectures. Figure 27.6(c) shows an example of a flexible-width Test Bus architecture.

Most SOC test architecture optimization algorithms proposed to date have concentrated on fixed-width Test Bus architectures and assume cores with fixed-length scan chains. In [30], the author describes a Test Bus architecture optimization approach that minimizes testing time using integer linear programming (ILP). ILP is replaced by a genetic algorithm in [52]. In [53], the authors extend the optimization criteria of [30] with place-and-route and power constraints, again using ILP. In [54,55], Test Bus architecture optimization is mapped to the well-known problem of two-dimensional bin packing, and a Best Fit algorithm is used to solve it.

Wrapper design and TAM design both influence the SOC testing time, and hence their optimization needs to be carried out in conjunction to achieve the best results. The authors in [31] were the first to formulate the problem of integrated wrapper/TAM design; despite its NP-hard character, it is addressed using ILP and exhaustive enumeration. In [56], the authors presented efficient heuristics for the same problem. Idle bits exist in test schedules when parts of the test wrapper and TAM are under-utilized, leading to idle time in the test delivery architecture. In [57], the authors first formulated the testing time minimization problem both for cores having fixed-length and cores having flexible-length scan chains. Next, they presented lower bounds on the testing time for the Test Bus and TestRail architectures, and then examined three main reasons for under-utilization of TAM bandwidth, leading to idle bits in the test schedule and testing times higher than the lower bound [57]. The problem of reducing the amount of idle test data was also addressed in [58].

The optimization of a flexible-width Multiplexing architecture (i.e., for one TAM only) was proposed in [50]. This work again assumes cores with fixed-length scan chains, and describes a heuristic algorithm for co-optimization of wrappers and Test Buses based on rectangle packing. In [50], the same authors extended this work by including precedence, concurrency, and power constraints, while allowing a user-defined subset of the core tests to be preempted. Fixed-width TestRail architecture optimization was investigated in [59]. Heuristic algorithms have been developed for the co-optimization of wrappers and TestRails; the algorithms work both for cores with fixed-length and flexible-length scan chains. TR-Architect, the tool presented in [59], is currently in industrial use.
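The bin-packing view of [54,55] can also be sketched compactly. The code below is a deliberately simplified "shelf" packing, not the published Best Fit algorithm, and the core parameters in the example are invented: each core test is a rectangle whose width is its TAM-width requirement and whose height is its testing time, and rectangles are packed into a bin whose width is the total available TAM width.

    def shelf_schedule(tests, tam_width):
        """tests: list of (name, tam_width_needed, test_time) rectangles.
        Tests placed on one shelf run concurrently on disjoint TAM wires;
        shelves run one after another, so the total testing time is the
        sum of the shelf heights."""
        shelves = []                      # each shelf: [used_width, height, names]
        for name, w, t in sorted(tests, key=lambda x: -x[2]):  # tallest first
            for shelf in shelves:
                if shelf[0] + w <= tam_width:
                    shelf[0] += w         # shelf height was set by its tallest test
                    shelf[2].append(name)
                    break
            else:
                shelves.append([w, t, [name]])
        return sum(s[1] for s in shelves), shelves

    # Three core tests on a 16-bit TAM: A and B share a shelf, C follows.
    print(shelf_schedule([("A", 8, 100), ("B", 8, 60), ("C", 16, 40)], 16))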
27.2.3 Test Scheduling

Test scheduling for SOCs involving multiple test resources and cores with multiple tests is especially challenging, and even simple test scheduling problems for SOCs have been shown to be NP-hard [7]. In [8], a method for selecting tests from a set of external and BIST tests (that run at different clock speeds) was presented, and test scheduling was formulated as a combinatorial optimization problem. Reordering tests to maximize defect detection early in the schedule was explored in [60]. The entire test suite was first applied to a small sample population of ICs; the fault coverage obtained per test was then used to arrange tests that contribute to high fault coverage earlier in the schedule. The authors used a polynomial-time algorithm to reorder tests based on the defect data as well as the execution time of the tests [60]. A test scheduling technique based on the defect probabilities of the cores has recently been reported [61].

Macro Test is a modular testing approach for SOC cores in which a test is broken down into a test protocol and a list of test patterns [62]. A test protocol is defined at the terminals of a macro and describes the necessary and sufficient conditions to test the macro [63]. The test protocols are expanded from the macro level to the SOC pins and can either be applied sequentially to the SOC, or scheduled to increase parallelism. In [63], a heuristic scheduling algorithm based on pairwise composition of test protocols was presented. The algorithm determines the start times for the expanded test protocols in the schedule, such that no resource conflicts occur and test time is minimized [63].
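The reordering of [60] rests on a simple greedy rule that is easy to state in code. The sketch below assumes that per-test defect-detection counts and execution times are available from the sample population; the published algorithm is more elaborate, and the names and numbers here are invented for illustration.

    def reorder_tests(tests):
        """tests: list of (name, defects_detected, exec_time) measured on
        a sample population of ICs. Applying tests in decreasing order of
        defects detected per unit time pushes defect detection toward the
        start of the schedule, shortening the average time to first fail."""
        return sorted(tests, key=lambda t: t[1] / t[2], reverse=True)

    order = reorder_tests([("scan", 950, 40), ("bist", 400, 10), ("iddq", 120, 30)])
    print([name for name, _, _ in order])   # ['bist', 'scan', 'iddq']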
Systems-on-chip in test mode can dissipate up to twice the amount of power they do in normal mode, since cores that do not normally operate in parallel may be tested concurrently [64]. Power-constrained test scheduling is therefore essential to limit the amount of concurrency during test application and ensure that the maximum power budget of the SOC is not exceeded. In [65], a method based on approximate vertex cover of a resource-constrained test compatibility graph was presented. In [66], the use of list scheduling and tree-growing algorithms for power-constrained scheduling was discussed. The authors presented a greedy algorithm to overlay tests such that the power constraint is not violated; a constant additive model is employed for power estimation during scheduling [66]. The issue of reorganizing scan chains to trade off testing time against power consumption was investigated in [67]. The authors presented an optimal algorithm to parallelize tests under power and resource constraints. The design of test wrappers to allow for multiple scan chain configurations within a core was also studied.

In [33], an integrated approach to test scheduling was presented. Optimal test schedules with precedence constraints were obtained for reasonably sized SOCs. For precedence-based scheduling of large SOCs, a heuristic algorithm was developed. The proposed approach also includes an algorithm to obtain preemptive test schedules in O(n^3) time, where n is the number of tests [33]. Parameters that allow only a certain number of preemptions per test can be used to prevent excessive BIST and sequential circuit test preemptions. Finally, a new power-constrained scheduling technique was presented, whereby power constraints can be easily embedded in the scheduling framework in combination with precedence constraints, thus delivering an integrated approach to the SOC test scheduling problem.
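As a rough illustration of power-constrained list scheduling in the spirit of [66], the sketch below greedily starts the longest pending test whose power, under the constant additive model, fits the remaining budget, and otherwise advances time to the next test completion. Resource conflicts and precedence constraints, which the published algorithms handle, are ignored, and all names and numbers are ours.

    def power_constrained_schedule(tests, p_max):
        """tests: list of (name, time, power). Returns (name, start_time)
        pairs such that the summed power of concurrently running tests
        never exceeds p_max."""
        assert all(p <= p_max for _, _, p in tests), "each test must fit the budget alone"
        pending = sorted(tests, key=lambda t: -t[1])    # longest first
        running, schedule, now = [], [], 0              # running: (end_time, power)
        while pending or running:
            used = sum(p for _, p in running)
            started = next((t for t in pending if used + t[2] <= p_max), None)
            if started:
                pending.remove(started)
                running.append((now + started[1], started[2]))
                schedule.append((started[0], now))
            else:
                # nothing fits: advance to the earliest completion time
                now = min(end for end, _ in running)
                running = [r for r in running if r[0] > now]
        return schedule

    # Power budget of 8 units: A and C overlap, B waits for C to finish.
    print(power_constrained_schedule([("A", 100, 5), ("B", 80, 4), ("C", 60, 3)], 8))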
27.2.4 Integrated TAM Optimization and Test Scheduling

Both TAM optimization and test scheduling significantly influence the testing time, test data volume, and test cost for SOCs. Furthermore, TAMs and test schedules are closely related. For example, an effective schedule developed for a particular TAM architecture may be inefficient or even infeasible for a different TAM architecture. Integrated methods that perform TAM design and test scheduling in conjunction are therefore required to achieve low-cost, high-quality test.

In [68], an integrated approach to test scheduling, TAM design, test set selection, and TAM routing was presented. The SOC test architecture was represented by a set of functions involving test generators, response evaluators, cores, test sets, power and resource constraints, and start and end times in the test schedule, modeled as Boolean and integral values [68]. A polynomial-time algorithm was used to solve these equations and determine the test resource placement, TAM design and routing, and test schedule, such that the specified constraints are met.

The mapping between core I/Os and SOC pins during the test schedule was investigated in [54]. TAM design and test scheduling were modeled as two-dimensional bin packing, in which each core test is represented by a rectangle. The height of each rectangle corresponds to the testing time, the width corresponds to the core I/Os, and the weight corresponds to the power consumption during test. The objective is to pack the rectangles into a bin of fixed width (SOC pins), such that the bin height (total testing time) is minimized, while power constraints are met. A heuristic method based on the Best Fit algorithm was presented to solve the problem [54]. The authors next formulated constraint-driven pin mapping and test scheduling as the chromatic number problem from graph theory and as a dependency matrix partitioning problem [55]. Both problem formulations are NP-hard. A heuristic algorithm based on clique partitioning was proposed to solve the problem.

The problem of TAM design and test scheduling with the objective of minimizing the average testing time was formulated in [69]. The problem was reduced to one of minimum-weight perfect bipartite graph matching, and a polynomial-time optimal algorithm was presented. A test planning flow was also presented.

In [50], a new approach for wrapper/TAM co-optimization and constraint-driven test scheduling using rectangle packing was described. Flexible-width TAMs that are allowed to fork and merge were designed. Rectangle packing was used to develop test schedules that incorporate precedence and power constraints,
while allowing the SOC integrator to designate a group of tests as preemptable. Finally, the relationship between TAM width and tester data volume was studied to identify an effective TAM width for the SOC.

The work reported in [50] was extended in [70] to address the minimization of ATE buffer reloads and to include multisite test. The ATE is assumed to contain a pool of memory distributed over several channels, such that the memory depth assigned to each channel does not exceed a maximum limit, and the sum of the memory depths over all channels equals the total pool of ATE memory. Idle bits appear on ATE channels whenever there is idle time on a TAM wire. These bit positions are filled with don't-cares if they appear between useful test bits; however, if they appear only at the end of the useful bits, they do not need to be stored in the ATE. The SOC test resource optimization problem for multisite test was stated as follows: given the test set parameters for each core, and a limit on the maximum memory depth per ATE channel, determine the wrapper/TAM architecture and test schedule for the SOC, such that (1) the memory depth required on any channel is less than the maximum limit, (2) the number of TAM wires is minimized, and (3) the idle bits appear only at the end of each channel. A rectangle packing algorithm was developed to solve this problem.

A new method for representing SOC test schedules using k-tuples was discussed in [71]. The authors presented a p-admissible model for test schedules that is amenable to several solution methods, such as local search, two-exchange, simulated annealing, and genetic algorithms, that cannot be used in a rectangle-representation environment.

Finally, recent work on TAM optimization has focused on the use of ATEs with port scalability features [72–74]. In order to address the test requirements of SOCs, ATE vendors have recently announced a new class of testers that can simultaneously drive different channels at different data rates. Examples of such ATEs include the Agilent 93000 series tester, based on port scalability and the test processor-per-pin architecture [75], and the Tiger system from Teradyne [76], in which the data rate can be increased through software for selected pin groups to match SOC test requirements. However, the number of tester channels with high data rates may be constrained in practice due to ATE resource limitations, the power rating of the SOC, and scan frequency limits for the embedded cores. Optimization techniques have been developed to ensure that the high data-rate tester channels are efficiently used during SOC testing [74]. The availability of dual-speed ATEs was also exploited in [72,73], where a technique was presented to match ATE channels with high data rates to core scan chain frequencies using virtual TAMs. A virtual TAM is an on-chip test data transport mechanism that does not directly correspond to a particular ATE channel. Virtual TAMs operate at scan chain frequencies; however, they interface with the higher-frequency ATE channels using bandwidth matching. Moreover, since the virtual TAM width is not limited by the ATE pin count, a larger number of TAM wires can be used on the SOC, thereby leading to lower testing times. A drawback of virtual TAMs, however, is the need for additional TAM wires on the SOC, as well as frequency division hardware for bandwidth matching.
In [74], the hardware overhead is reduced through the use of a smaller number of on-chip TAM wires; ATE channels with high data rates directly drive SOC TAM wires, without requiring frequency division hardware.
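The bandwidth matching used by virtual TAMs is, at its core, a ratio computation. The sketch below, with invented example rates, shows how many slower virtual TAM wires a single high-data-rate ATE channel can feed through a 1-to-k serial-to-parallel (frequency division) stage.

    def virtual_tam_fanout(f_ate_mhz, f_scan_mhz):
        """Number of virtual TAM wires one ATE channel can drive: the
        channel streams bits at f_ate, and a demultiplexing register
        releases them to k wires clocked at f_scan, so k = f_ate // f_scan."""
        k = f_ate_mhz // f_scan_mhz
        assert k >= 1, "the ATE channel must be at least as fast as the scan clock"
        return k

    # Hypothetical rates: a 200-MHz ATE channel and 50-MHz core scan
    # chains give 4 virtual TAM wires per channel.
    print(virtual_tam_fanout(200, 50))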
27.2.5 Modular Testing of Mixed-Signal SOCs

Prior research on modular testing of SOCs has focused almost exclusively on the digital cores in an SOC. However, most SOCs in use today are mixed-signal circuits containing both digital and analog cores [77–79]. Increasing pressure on consumer products for small form factors and extended battery life is driving single-chip integration and blurring the lines between analog and digital design types. As indicated in the 2001 International Technology Roadmap for Semiconductors [2], the combination of these circuits on a single die compounds the test complexities and challenges for devices that fall in an increasingly commodity-driven market. Therefore, an effective modular test methodology should be capable of handling both digital and analog cores, and it should reduce test cost by enabling test reuse for reusable embedded modules.
FIGURE 27.7 On-chip digitization of analog test data for uniform test access. (From A. Sehgal, S. Ozev, and K. Chakrabarty. In Proceedings of the IEEE International Conference on CAD, pp. 95–99, 2003. With permission.)
In traditional mixed-signal SOC testing, tests for analog cores are applied either from chip pins through direct test access methods, for example, via multiplexing, or through a dedicated analog test bus [80,81], which requires the use of expensive mixed-signal testers. For mid- to low-frequency analog applications, the data is often digitized at the tester, where it is affordable to incorporate high-quality data converters. In most mixed-signal ICs, analog circuitry accounts for only a small part of the total silicon (“big-D/small-A”). Nevertheless, the total production testing cost is dominated by analog testing costs, because expensive mixed-signal testers are employed for extended periods of time, resulting in high overall test costs.

A natural solution to this problem is to implement the data converters on-chip. Since most SOC applications do not push the operational frequency limits, the design of such data converters on-chip appears to be feasible. Until recently, such an approach has not been deemed desirable due to its high hardware overhead. However, as the cost of on-chip silicon decreases and the functionality and the number of cores in a typical SOC increase, the addition of on-chip data converters for testing analog cores now promises to be cost efficient. These data converters eliminate the need for expensive mixed-signal test equipment.

Recently, results have been reported on the optimization of a unified test access architecture that is used for both digital and analog cores [82]. Instead of treating the digital and analog portions separately, a global test resource optimization problem is formulated for the entire SOC. Each analog core is wrapped by a DAC–ADC pair and a digital configuration circuit. Results show that for “big-D/small-A” SOCs, the testing time and test cost can be reduced considerably if the analog cores are wrapped, and the test access and test scheduling problems for the analog and digital cores are tackled in a unified manner. Each analog core is provided a test wrapper whose test information includes only digital test patterns, the clock frequency, the test configuration, and pass/fail criteria. This analog test wrapper converts the analog core to a virtual digital core with strictly sequential test patterns, which are the digitized analog signals. To utilize test resources efficiently, the analog wrapper must provide sufficient flexibility to cover all the test needs of the analog core. One way to achieve this uniform test access scheme for analog cores is to provide an on-chip ADC–DAC pair that serves as an interface between each analog core and its digital surroundings, as shown in Figure 27.7.

Analog test signals are expressed in terms of a signal shape, such as sinusoidal or pulse, and signal attributes, such as frequency, amplitude, and precision. These tests are provided by the core vendor to the system integrator. In the case of analog testers, these signals are digitized at the high-precision ADCs and DACs of the tester. In the case of on-chip digitization, the analog wrapper needs to include the lowest-cost data converters that can still provide the required frequency and accuracy for applying the core tests. Thus, on-chip conversion of each analog test to digital patterns imposes requirements on the frequency and resolution of the data converters of the analog wrapper; these converters must be designed to accommodate all the test requirements of the analog core.
Analog tests may also have a high variance in their frequency and test time requirements. While tests involving low-frequency signals require low bandwidth and long test times, tests involving high-frequency signals require high bandwidth and short test times. Keeping the bandwidth assigned to the analog core constant therefore results in under-utilization of precious test resources. The variance of analog test needs has to be fully exploited in order to achieve an efficient test plan; thus, the analog test wrapper has to be designed to accommodate multiple configurations with varying bandwidth and frequency requirements.
FIGURE 27.8 Block diagram of the analog test wrapper. (From A. Sehgal, S. Ozev, and K. Chakrabarty. In Proceedings of the IEEE International Conference on CAD, pp. 95–99, 2003. With permission.)
Figure 27.8 shows the block diagram of an analog wrapper that can accommodate all the abovementioned requirements. The control and clock signals generated by the test control circuit are highlighted in this figure. The registers at each end of the data converters are written and read in a semiserial fashion, depending on the frequency requirement of each test. For example, for a digital TAM clock of 50 MHz, 12-bit DAC and ADC resolution, and an analog test requirement of 8 MHz sampling frequency, the input and output registers can be updated with a serial-to-parallel ratio of 6; the bandwidth requirement of this particular test is thus only 2 bits. The digital test control circuit selects the configuration for each test. This configuration includes the divide ratio of the digital TAM clock, the serial-to-parallel conversion rate of the input and output registers of the data converters, and the test modes.

27.2.5.1 Analog Test Wrapper Modes

In the normal mode of operation, the analog test wrapper is completely bypassed; the analog circuit operates on its analog I/O pins. During testing, the analog wrapper has two modes, a self-test mode and a core-test mode. Before running any tests on the analog core, the wrapper data converters have to be characterized for their conversion parameters, such as the nonlinearity and the offset voltage. The self-test mode is enabled through the analog multiplexer at the input of the wrapper ADC, as shown in Figure 27.8. The parameters of the DAC–ADC pair are determined in this mode and are used to calibrate the measurement results. Once the self-test of the test wrapper is complete, core test can be enabled by turning off the self-test bits. For each analog test, the encoder has to be set to the corresponding serial-to-parallel conversion ratio (cr), with which it shifts the data from the corresponding TAM inputs into the register of the ADC; similarly, the decoder shifts data out of the DAC register. The update frequency of the input and output registers, f_update = f_s × cr, is always less than the TAM clock rate, f_TAM. For example, if the test bandwidth requirement is 2 bits and the resolution of the data converters is 12 bits, the input and output registers of the data converters are clocked at a rate 6 times lower than the clock of the encoder, and the input data is shifted into the encoder and out of the decoder at a rate of 2 bits per cycle.

The complexity of the encoder and the decoder depends on the number of distinct bandwidth and TAM assignments (the number of possible test configurations). For example, for a 12-bit resolution, the bandwidth assignments may include 1, 2, 3, 4, 6, and 12 bits, where in each case the data may come from distinct TAMs. Clearly, in order to limit the complexity of the encoder–decoder pair, the number of such distinct assignments has to be limited. This requirement can be imposed in the test scheduling optimization algorithm. The analog test wrapper transparently converts the analog test data to the digital domain through efficient utilization of the resources, thus obviating the need for analog testers.
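The serial-to-parallel ratios and TAM bandwidths quoted in the examples above follow directly from the data rates involved; the small sketch below reproduces the arithmetic (the function name is ours).

    import math

    def analog_wrapper_config(f_tam_mhz, resolution_bits, f_s_mhz):
        """Compute the TAM bandwidth (bits per TAM cycle) and the
        serial-to-parallel conversion ratio cr for streaming samples of
        the given resolution at sampling rate f_s over a TAM clocked at
        f_tam. Each sample must be delivered within one sampling period,
        i.e., within f_tam / f_s TAM clock cycles."""
        bw = math.ceil(resolution_bits * f_s_mhz / f_tam_mhz)
        cr = resolution_bits // bw          # TAM cycles per register update
        f_update = f_s_mhz * cr             # must stay below f_tam
        return bw, cr, f_update

    # The example from the text: 50-MHz TAM clock, 12-bit converters,
    # 8-MHz sampling -> 2-bit bandwidth, cr = 6, f_update = 48 MHz < 50 MHz.
    print(analog_wrapper_config(50, 12, 8))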
The processing of the collected data can be done in the tester by adding appropriate algorithms, such as the FFT algorithm. Further details and experimental results can be found in [82].
27.3 BIST Using a Reconfigurable Interconnection Network

In this section, we review a recent deterministic BIST approach in which a reconfigurable interconnection network (RIN) is placed between the outputs of the LFSR and the inputs of the scan chains in the CUT [83]. The RIN, which consists only of multiplexer switches, replaces the phase shifter that is typically used in pseudorandom BIST to reduce correlation between the test data bits that are fed into the scan chains. The connections between the LFSR and the scan chains can be dynamically changed (reconfigured) during a test session. In this way, the RIN is used to match the LFSR outputs to the test cubes in a deterministic test set. The control data bits used for reconfiguration ensure that all the deterministic test cubes are embedded in the test patterns applied to the CUT. The proposed approach requires very little hardware overhead, only a modest amount of CPU time, and fewer control bits compared to the storage required for reseeding techniques or for hybrid BIST. Moreover, as a nonintrusive BIST solution, it does not require any circuit redesign and has minimal impact on circuit performance.

In the generic LFSR-based BIST approach shown in Figure 27.3, the output of the LFSR is fed to a phase shifter to reduce the linear dependency between the data shifted into different scan chains. The phase shifter is usually a linear network composed of exclusive-OR gates. In the proposed approach, illustrated in Figure 27.9(a), the phase shifter is replaced by an RIN that connects the LFSR outputs to the scan chains. The RIN consists of multiplexer switches, and it can be reconfigured by applying appropriate control bits to it through the inputs D_0, D_1, . . . , D_{g−1}. The parameter g refers to the number of configurations used during a BIST session, and it is determined using a simulation procedure. The control inputs D_0, D_1, . . . , D_{g−1} are provided by a d-to-g decoder, where d = ⌈log2 g⌉. A d-bit configuration counter is used to cycle through all possible 2^d input combinations for the decoder. The configuration counter is triggered by the BIST pattern counter, which is preset for each configuration with the binary value corresponding to the number of test patterns for that configuration. Although the elimination of the phase shifter may reduce the randomness of the pseudorandom patterns, complete fault coverage is guaranteed by the RIN synthesis procedure described later.

As shown in Figure 27.9(b), the multiplexers in the RIN are implemented using tristate buffers with fully decoded control inputs. While the multiplexers can also be implemented in other ways, we use tristate buffers here because of their ease of implementation in CMOS. The outputs of the tristate buffers are connected at the output of the multiplexer, and each input I_i of a multiplexer is connected to the input of a tristate buffer, which is controlled by the corresponding control signal. While the number of multiplexers can at most be equal to the number of scan chains, in practice it is sometimes smaller, because not all scan chains need to be driven by different LFSR cells. The number of tristate gates in each multiplexer is at most the smaller of the number of configurations and the number of LFSR cells; once again, in practice, the actual number of tristate gates is smaller than this upper limit. We next describe the test application procedure during a BIST session.
First, the configuration counter is reset to the all-0 pattern, and the pattern counter is loaded with the binary value corresponding to the number of patterns that must be applied to the CUT in the first configuration. The pattern counter is decremented each time a test pattern is applied to the CUT. When the content of the pattern counter becomes zero, it is loaded with the number of patterns for the second configuration, and it triggers the configuration counter, which is incremented. This leads to a corresponding change in the outputs of the decoder, and the RIN is reconfigured appropriately. This process continues until the configuration counter has passed through all g configurations. The total number of test patterns applied to the CUT is therefore Σ_{i=1}^{g} n_i, where n_i is the number of patterns corresponding to configuration i, 1 ≤ i ≤ g. The BIST design procedure described next is tailored to embed a given set of deterministic test cubes in this sequence of Σ_{i=1}^{g} n_i patterns.
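The interplay of the two counters can be mirrored in a few lines; the sketch below abstracts the decoder, the RIN, and the pattern source into callbacks (all names are ours) and simply walks through the g configurations, applying n_i patterns in configuration i, for a total of Σ_{i=1}^{g} n_i patterns.

    def run_bist_session(pattern_counts, configure, apply_pattern):
        """pattern_counts: [n_1, ..., n_g], the stored control data.
        configure(i) stands for the configuration counter driving the
        decoder, which reconfigures the RIN for configuration i;
        apply_pattern() scans one pseudorandom pattern into the CUT."""
        total = 0
        for i, n_i in enumerate(pattern_counts):
            configure(i)
            for _ in range(n_i):        # the pattern counter counts down from n_i
                apply_pattern()
                total += 1
        return total                    # equals the sum of the n_i

    # e.g., run_bist_session([4, 3], configure=print, apply_pattern=lambda: None)
    # applies 7 patterns across 2 configurations.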
FIGURE 27.9 (a) Proposed logic BIST architecture. (b) RIN for m = 2 and g = 4. (From L. Li and K. Chakrabarty. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 23, 1289–1305, 2004. With permission.)
During test application, pseudorandom patterns that do not match any deterministic test cube are also applied to the CUT. These pseudorandom patterns can potentially detect nonmodeled faults, but they increase the testing time. A parameter called MaxSkipPatterns, defined as the largest number of pseudorandom patterns that are allowed between the matching of two deterministic cubes, is used in the design procedure to limit the testing time. We first need to determine, for each configuration, the number of patterns as well as the interconnections between the LFSR outputs and the scan chains; the simulation procedure described next solves this problem.
FIGURE 27.10 An illustration of converting a test cube to multiple scan chain format (m = 4, l = 5): the cube t = xxx10 xx01x 1xx1x 0xxx1 is reformatted into the vectors t1 = 0xxx1, t2 = 1xx1x, t3 = xx01x, and t4 = xxx10. (From L. Li and K. Chakrabarty. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 23, 1289–1305, 2004. With permission.)
We start with an LFSR of length L, a predetermined seed, and a known characteristic polynomial. Let TD = {c_1, c_2, . . . , c_n} be the set of deterministic test cubes that must be applied to the CUT. The set TD can either target all the single stuck-at faults in the circuit, or only the hard faults that cannot be detected by a small number of pseudorandom patterns. As illustrated in Figure 27.10, each deterministic test cube c in the test set is converted into the multiple scan chain format as a set of m l-bit vectors {t_1, t_2, . . . , t_m}, where m is the number of scan chains and l is the length of each scan chain. The bits in a test cube are ordered such that the least significant bit is shifted into the scan chain first. We use Conn_j^(i) to denote the set of LFSR taps that are connected to scan chain j in configuration i, where i = 1, 2, . . . , g and j = 1, 2, . . . , m. The steps of the simulation procedure are as follows:

1. Set i = 1.
2. Set Conn_j^(i) = {1, 2, . . . , L} for j = 1, 2, . . . , m; that is, initially, each scan chain can be connected to any tap of the LFSR.
3. Driving the LFSR for the next l clock cycles, obtain the output of the LFSR as a set of L l-bit vectors {O_k | k = 1, 2, . . . , L}, where vector O_k is the output stream of the kth flip–flop of the LFSR for the l clock cycles.
4. Find a test cube c* in TD that is compatible with the outputs of the LFSR under the current connection configuration Conn_j^(i); that is, for all j = 1, . . . , m, there exists k ∈ Conn_j^(i) such that t_j* is compatible with O_k, where c* has already been reformatted for m scan chains as the set of vectors {t_1*, t_2*, . . . , t_m*}. (A vector u_1, u_2, . . . , u_r and a vector v_1, v_2, . . . , v_r are mutually compatible if for every position p, 1 ≤ p ≤ r, one of the following holds: [i] u_p and v_p are both care bits and u_p = v_p; [ii] u_p is a don't-care bit; [iii] v_p is a don't-care bit.)
5. If no test cube is found in Step 4, go to Step 6 directly. Otherwise, remove the test cube c* found in Step 4 from TD, and narrow down the connection configuration as follows: for each j = 1, 2, . . . , m, let U ⊂ Conn_j^(i) be such that for any k ∈ U, O_k is not compatible with t_j*; then set Conn_j^(i) = Conn_j^(i) − U.
6. If in the previous MaxSkipPatterns + 1 iterations at least one test cube was found in Step 4, go to Step 3. Otherwise, the simulation for the current configuration is concluded. The patterns that are applied to the circuit under this configuration are those obtained in Step 3.
7. Match the remaining cubes in TD to the test patterns for the current configuration; that is, if any test cube in TD is compatible with any pattern for the current configuration, remove it from TD.
8. If no pseudorandom pattern for the current configuration is compatible with a test cube, the procedure fails and exits. Otherwise, increase i by 1, and go to Step 2 to begin the iteration for the next configuration, until TD is empty.

Figure 27.11 shows a flowchart corresponding to the above procedure, where the variable skip_patterns records the number of consecutive patterns that are not compatible with any deterministic test cube, and all_randoms indicates whether all the patterns for the current configuration are pseudorandom patterns.
FIGURE 27.11 Flowchart illustrating the simulation procedure. (From L. Li and K. Chakrabarty. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 23, 1289–1305, 2004. With permission.)
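Steps 4 and 5 are the heart of the procedure. The sketch below implements the compatibility test and the narrowing of the connection sets, with test cubes and LFSR output streams represented as strings over '0', '1', and 'x'; it is a minimal illustration of these two steps only, not the full procedure.

    def compatible(u, v):
        """Two l-bit vectors are mutually compatible if they agree on
        every position where both carry care bits ('x' matches anything)."""
        return all(a == b or 'x' in (a, b) for a, b in zip(u, v))

    def try_embed(cube, outputs, conn):
        """cube: [t_1, ..., t_m] in multiple scan chain format;
        outputs: {k: O_k}, the LFSR tap streams for the next l cycles;
        conn: [Conn_1, ..., Conn_m], the current sets of candidate taps.
        If every t_j is compatible with some tap still in Conn_j (Step 4),
        narrow each Conn_j to its compatible taps (Step 5) and succeed."""
        narrowed = []
        for t_j, c_j in zip(cube, conn):
            keep = {k for k in c_j if compatible(t_j, outputs[k])}
            if not keep:
                return False            # the cube cannot be matched
            narrowed.append(keep)
        for c_j, keep in zip(conn, narrowed):
            c_j &= keep                 # commit the narrowing in place
        return True

    # Toy check with two scan chains and two LFSR taps (l = 4):
    outputs = {1: "0110", 2: "1001"}
    conn = [{1, 2}, {1, 2}]
    print(try_embed(["x1x0", "1xx1"], outputs, conn), conn)   # True [{1}, {2}]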
An example of the simulation procedure is illustrated in Figure 27.12. A 4-bit autonomous LFSR with characteristic polynomial x^4 + x + 1 is used to generate the pseudorandom patterns. There are four scan chains, and the length of each scan chain is 4 bits. The parameter MaxSkipPatterns is set to 1. The output of the LFSR is divided into patterns p_i, i = 1, 2, . . ., each consisting of four 4-bit vectors. The procedure that determines the connections is shown as Step (Init) to Step (f). Step (Init) is the initialization step, in which all the connections Conn_j^(1), j = 1, 2, 3, 4, are set to {1, 2, 3, 4}. In Step (a), the first pattern p1 is matched with the test cube c1, and the connections are shown for each scan chain: scan chain 1 can be connected to x1 or x4, both scan chain 2 and scan chain 3 can only be connected to x2, and scan chain 4 can be connected to x1, x2, or x4. In Step (c), none of the cubes is compatible with p3. When neither p5 nor p6 matches any cube in Step (e), the iterations for the current configuration are terminated. The patterns that are applied to the CUT in this configuration are p1, p2, . . . , p6. We then compare the remaining cube c4 with the six patterns and find that it is compatible with p2, so c4 is also covered by the test patterns for the current configuration. The connections for this configuration are therefore: scan chain 1 is connected to x4, both scan chain 2 and scan chain 3 are connected to x2, and scan chain 4 is connected to x1. Since p5 and p6 are not compatible with any deterministic cubes, the number of patterns for this configuration is set to 4. If there are test cubes remaining to be matched, the iteration for the next configuration starts from p5.
27.3.1 Declustering the Care Bits

The simulation procedure to determine the number of patterns and the connections for each configuration can sometimes fail to embed the test cubes in the LFSR sequence.
FIGURE 27.12 An illustration of the simulation procedure. (From L. Li and K. Chakrabarty. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 23, 1289–1305, 2004. With permission.) The LFSR has characteristic polynomial x^4 + x + 1 with taps x1–x4. Its output streams for patterns s1–s6, and the test cubes t1–t4, are:

x1: s1 = 0001  s2 = 1001  s3 = 0101  s4 = 1111  s5 = 1000  s6 = 1100
x2: s1 = 1000  s2 = 1100  s3 = 1010  s4 = 0111  s5 = 0100  s6 = 0110
x3: s1 = 0100  s2 = 0110  s3 = 1101  s4 = 0011  s5 = 0010  s6 = 1011
x4: s1 = 0010  s2 = 1011  s3 = 1110  s4 = 0001  s5 = 1001  s6 = 0101

t1: 00xx 1xx0 10xx x0xx
t2: 0xxx xx1x 01xx 11xx
t3: xx11 x10x x1x0 10xx
t4: xx11 1xxx 01xx x001

Determination of connections:

            (Init)     (a) s1:t1  (b) s2:t3  (c) s3:none  (d) s4:t2  (e) s5,s6:none  (f) s2:t4
Conn_1^(1)  {1,2,3,4}  {1,4}      {4}        {4}          {4}        {4}             {4}
Conn_2^(1)  {1,2,3,4}  {2}        {2}        {2}          {2}        {2}             {2}
Conn_3^(1)  {1,2,3,4}  {2}        {2}        {2}          {2}        {2}             {2}
Conn_4^(1)  {1,2,3,4}  {1,2,4}    {1,4}      {1,4}        {1}        {1}             {1}
This can happen if MaxSkipPatterns is too small, or if the test cubes are hard to match with the outputs of the LFSR. During our experiments, we found that it was very difficult to embed the test cubes for the s38417 benchmark circuit. On closer inspection, we found that the care bits in some of the test cubes for s38417 are highly clustered, even though the percentage of care bits in TD is small. When these test cubes are converted into a multiple scan chain format, most of the vectors contain very few care bits, but a few vectors contain a large number of care bits. These vectors with many care bits are hard to embed in the output sequence of the LFSR.

In order to embed test cubes with highly clustered care bits, we propose two declustering strategies. The first is to reorganize the scan chains such that the care bits are scattered across many scan chains, and each scan chain contains only a few care bits. The second is based on the use of additional logic to interleave the data that are shifted into the different scan chains. The first strategy requires reorganization of the scan chains, but no extra hardware; care must be taken in the scan chain redesign to avoid timing closure problems. The interleaving method does not modify the scan chains, but it requires additional hardware and control mechanisms.

The method of reorganization of scan chains is illustrated in Figure 27.13.
FIGURE 27.13 An illustration of the reorganization of scan chains: a 30-cell test cube whose six care bits (100110) all fall in a single vector is redistributed so that no reorganized vector carries more than two care bits. (From L. Li and K. Chakrabarty. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 23, 1289–1305, 2004. With permission.)
As shown in the figure, before the reorganization, all the care bits of the given test cube are grouped in the second vector, which is hard to match with the output of the LFSR. After the reorganization, the care bits are scattered across all the vectors, and the largest number of care bits in a vector is only two. This greatly increases the probability that each vector can be matched to an output pattern of the LFSR. Note that the concept of reorganization of scan chains is also used in [9]. However, the reorganization used in [9] changes the scan chain structure and makes it unsuitable for response capture; a separate solution is needed in [9] to circumvent this problem. In our approach, the basic structure of the scan chains is maintained, and the usual scan test procedure of pattern shift-in, response capture, and shift-out can be used.

The scan cells in the CUT can be indexed as c_{i,j}, i = 0, 1, . . . , m − 1, j = 0, 1, . . . , l − 1, where m is the number of scan chains and l is the length of a scan chain. (We start the indices from 0 to facilitate the description of the scan chain reorganization procedure.) The ith scan chain consists of the l scan cells c_{i,j}, j = 0, 1, . . . , l − 1. We use c*_{i,j} to denote the reorganized scan cells, in which the ith scan chain consists of the l scan cells c*_{i,j}, j = 0, 1, . . . , l − 1. For each j = 0, 1, . . . , l − 1, the m cells c_{0,j}, c_{1,j}, . . . , c_{m−1,j} constitute a vertical vector. The reorganized scan cell structure is obtained by rotating each such vertical vector upwards by d positions, where d = j mod m; that is, c*_{i,j} = c_{k,j}, where k = (i + d) mod m.

An alternative method for declustering, based on the interleaving of the inputs to the scan chains, is shown in Figure 27.14. We insert an extra stage of multiplexers between the outputs of the RIN and the inputs of the scan chains. From the perspective of the RIN, the logic that follows it, that is, the combination of the multiplexers for interleaving and the scan chains, is simply a reorganized scan chain with an appropriate arrangement of the connections between the two stages of multiplexers. For a CUT with m scan chains, m multiplexers are used for reconfiguration, and m multiplexers are inserted for interleaving. Each of the multiplexers used for interleaving has m inputs, which are selected in ascending order during the shifting in of a test pattern; that is, the first input is selected for the first scan clock cycle, the second input is selected for the second scan clock cycle, and so on. After the mth input is selected, the procedure is repeated with the first input. We use A_i to denote the output of the ith multiplexer for reconfiguration and B_{i,j} to denote the jth input of the ith multiplexer for interleaving, where i, j = 1, 2, . . . , m.
FIGURE 27.14 An illustration of interleaving of the inputs of scan chains. (From L. Li and K. Chakrabarty. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 23, 1289–1305, 2004. With permission.)
The interleaving is carried out by connecting the inputs of the multiplexers for interleaving to the outputs of the multiplexers for reconfiguration such that

B_{i,j} = A_{i−j+1} if i ≥ j, and B_{i,j} = A_{i−j+1+m} if i < j.
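Both declustering strategies reduce to small index computations. The following sketch simply evaluates the two mappings defined above, using 0-based indices for the scan-cell rotation and 1-based indices for the multiplexer connections, as in the text.

    def reorganized_cell(i, j, m):
        """c*_{i,j} = c_{k,j}: column j is rotated upwards by d = j mod m
        positions, so k = (i + d) mod m."""
        d = j % m
        return (i + d) % m

    def interleave_source(i, j, m):
        """Index x of the reconfiguration-stage output A_x that feeds
        input B_{i,j} of the ith interleaving multiplexer (1 <= i, j <= m)."""
        return i - j + 1 if i >= j else i - j + 1 + m

    m = 5
    # Column j = 1 is rotated up by one position: rows 0..4 map to 1..4, 0.
    print([reorganized_cell(i, 1, m) for i in range(m)])          # [1, 2, 3, 4, 0]
    # The first interleaving multiplexer cycles through A_1, A_5, A_4, A_3, A_2.
    print([interleave_source(1, j, m) for j in range(1, m + 1)])  # [1, 5, 4, 3, 2]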
In order to control the multiplexers for interleaving, an architecture similar to the control logic for the reconfigurable interconnection network can be used. However, for the interleaving we need neither the stored control bits nor the pattern counter; a bit counter counting up to m − 1 (where m is the number of scan chains) replaces the configuration counter. The bit counter is reset to 0 at the start of the shifting in of each pattern, and it returns to 0 after counting to m − 1. Consider the test cube shown in Figure 27.13. After adding the second stage of multiplexers and connecting the inputs of the multiplexers for interleaving with the outputs of the multiplexers for reconfiguration, as shown in Figure 27.14 (only the connections related to the first RIN multiplexer are shown for clarity), the output of the first multiplexer for reconfiguration should match “xxxx1x,” the same string as in the scan-cell reorganization method. Note that the above reorganization and interleaving procedures yield the same set of test cubes.

Detailed simulation results for benchmark circuits are presented in [83]. Here we discuss the influence of the initial seed on the effectiveness of test set embedding. Experiments were carried out with 20 randomly selected initial seeds for the test set from [9] targeting all faults, with scan chain reorganization and 32 scan chains. The statistics on the number of configurations are listed in Table 27.1(A). We also carried out the same experiments for the test set from [9] targeting random-pattern-resistant faults and list the results in Table 27.1(B). The results show that the number of configurations depends on the initial seed; however, the dependency is not very significant, due in part to the reconfigurability of the interconnection network.

In order to evaluate the effectiveness of the proposed approach for large circuits, we applied the method to test sets for two production circuits from IBM, namely CKT1 and CKT2. CKT1 is a logic core consisting of 51,082 gates, and its test set provides 99.80% fault coverage. CKT2 is a logic core consisting of 94,340 gates, and its test set provides 99.76% fault coverage. The number of scan chains is fixed at 64 and 128 for each of these two circuits. We modified the simulation procedure such that the configuration of the interconnection network can be changed during the shifting in of a test cube, and we set the parameter
TABLE 27.1 Statistics on the Number of Configurations with Random Seeds for Test Sets from [9] Targeting (A) All Faults and (B) Random-Pattern-Resistant Faults, with Scan Chain Reorganization (Assuming 32 Scan Chains for Each Circuit)

(A) All faults

Circuit   Minimum   Maximum   Mean     Standard deviation
s5378     6         9         7.2      0.83
s9234     29        33        30.5     1.10
s13207    10        14        11.85    1.18
s15850    16        20        17.5     1.24
s38417    180       192       185.9    3.23
s38584    9         12        9.8      0.89

(B) Random-pattern-resistant faults

Circuit   Minimum   Maximum   Mean     Standard deviation
s5378     3         5         3.55     0.60
s9234     32        36        33.7     1.38
s13207    5         8         5.95     0.89
s15850    19        25        21.8     1.51
s38417    118       129       121.45   3.07
s38584    9         12        10.8     0.83

Source: From L. Li and K. Chakrabarty. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 23, 1289–1305, 2004. With permission.
TABLE 27.2 Results for Test Cubes for Circuits from IBM

Circuit  No. of test cubes  No. of scan chains  Scan chain length (bits)  No. of configs  Testing time (clock cycles)  Hardware overhead (GEs, %)  Storage (bits)  Encoding efficiency  CPU time
CKT1     17,176             64                  192                       1,792           1,351,104                    68,145.5 (8.52%)            21,504          46.79                1 h 37 min
CKT1     12,256             128                 96                        1,079           566,496                      75,579.5 (9.45%)            12,948          77.71                1 h 26 min
CKT2     43,079             64                  348                       3,221           4,051,764                    124,062.5 (7.26%)           38,652          26.03                6 h 35 min
CKT2     22,216             128                 174                       1,828           2,338,005                    128,009.5 (7.49%)           21,936          45.87                6 h 06 min

Source: From L. Li and K. Chakrabarty. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 23, 1289–1305, 2004. With permission.
TABLE 27.3 The Number of Reconfigurations per Pattern for Test Sets from IBM

                                                                No. of reconfigurations per pattern
Circuit  No. of scan chains  Test cube length per scan chain (bits)  Minimum  Maximum  Mean  Standard deviation
CKT1     64                  192                                     0        3        0.11  0.0210
CKT1     128                 96                                      0        3        0.07  0.0354
CKT2     64                  348                                     0        3        0.11  0.0204
CKT2     128                 174                                     0        15       0.06  0.3018

Source: From L. Li and K. Chakrabarty. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 23, 1289–1305, 2004. With permission.
MaxSkipPatterns to 0. Accordingly, in the proposed BIST architecture shown in Figure 27.9(a), the stored control bits are the number of bits per configuration instead of the number of patterns per configuration, and the pattern counter is replaced by a bit counter that counts the number of bits that have been shifted into the scan chains. Table 27.2 lists the results for these two industrial circuits. The hardware overhead is less than 10%, and very high encoding efficiency (up to 77.71) is achieved for both circuits. As mentioned above, we allow the configuration of the interconnection network to be changed during the shifting in of a test cube. Table 27.3, Figure 27.15, and Figure 27.16 present statistics on the number of reconfigurations per test cube; the number of intrapattern reconfigurations is small for both circuits.
FIGURE 27.15 The number of patterns versus the number of reconfigurations needed for CKT1. (From L. Li and K. Chakrabarty. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 23, 1289–1305, 2004. With permission.)
FIGURE 27.16 The number of patterns versus the number of reconfigurations r needed for CKT2: (a) 0 ≤ r ≤ 1, (b) r ≥ 2. (From L. Li and K. Chakrabarty. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 23, 1289–1305, 2004. With permission.)
27.4 Conclusions

Rapid advances in test development techniques are needed to reduce the test cost of million-gate SOC devices. This survey has described a number of state-of-the-art techniques for reducing test time and test data volume, thereby decreasing test cost. Modular test techniques for digital, mixed-signal, and hierarchical SOCs must develop further to keep pace with design complexity and integration density. The test data bandwidth needs of analog cores are significantly different from those of digital cores; unified top-level testing of mixed-signal SOCs therefore remains a major challenge.

Most SOCs today include embedded cores that operate in multiple clock domains. Since the forthcoming P1500 standard does not address wrapper design for at-speed testing of such cores, research is needed to develop wrapper design techniques for multifrequency cores. There is also a pressing need for test planning methods that can efficiently schedule tests for these multifrequency cores; the work reported in [41] is a promising first step in this direction. In addition, compression techniques for embedded cores need to be developed and refined. Of particular interest are techniques that can combine TAM optimization and
test scheduling with test data compression. Some preliminary studies on this problem have been reported recently [84,85].

We have also reviewed a new approach for deterministic BIST based on the use of an RIN. The RIN is placed between the outputs of a pseudorandom pattern generator, for example, an LFSR, and the scan inputs of the CUT. It consists only of multiplexer switches, and it is designed using a synthesis procedure that takes as inputs the pseudorandom sequence from the LFSR and the deterministic test cubes for the CUT. As a nonintrusive BIST solution, the proposed approach does not require any circuit redesign, and it has minimal impact on circuit performance.
Acknowledgments

This survey is based on joint work and papers published with several students and colleagues. In particular, the author acknowledges Anshuman Chandra, Vikram Iyengar, Lei Li, Erik Jan Marinissen, Sule Ozev, and Anuja Sehgal.
References

[1] M.L. Bushnell and V.D. Agrawal. Essentials of Electronic Testing. Kluwer Academic Publishers, Norwell, MA, 2000.
[2] Semiconductor Industry Association. International Technology Roadmap for Semiconductors, 2001 Edition. http://public.itrs.net/Files/2001ITRS/Home.htm
[3] A. Khoche and J. Rivoir. I/O bandwidth bottleneck for test: is it real? In Test Resource Partitioning Workshop, 2002.
[4] G. Hetherington, T. Fryars, N. Tamarapalli, M. Kassab, A. Hassan, and J. Rajski. Logic BIST for large industrial designs: real issues and case studies. In Proceedings of the International Test Conference, pp. 358–367, 1999.
[5] Y. Zorian, E.J. Marinissen, and S. Dey. Testing embedded-core-based system chips. IEEE Computer, 32, 52–60, 1999.
[6] O. Farnsworth. IBM Corp., personal communication, April 2003.
[7] K. Chakrabarty. Test scheduling for core-based systems using mixed-integer linear programming. IEEE Transactions on Computer-Aided Design, 19, 1163–1174, 2000.
[8] M. Sugihara, H. Date, and H. Yasuura. A novel test methodology for core-based system LSIs and a testing time minimization problem. In Proceedings of the International Test Conference, pp. 465–472, 1998.
[9] S. Hellebrand, H.-G. Liang, and H.-J. Wunderlich. A mixed-mode BIST scheme based on reseeding of folding counters. In Proceedings of the International Test Conference, pp. 778–784, 2000.
[10] C.V. Krishna, A. Jas, and N.A. Touba. Test vector encoding using partial LFSR reseeding. In Proceedings of the International Test Conference, pp. 885–893, 2001.
[11] H.-G. Liang, S. Hellebrand, and H.-J. Wunderlich. Two-dimensional test data compression for scan-based deterministic BIST. In Proceedings of the International Test Conference, pp. 894–902, 2001.
[12] J. Rajski, J. Tyszer, and N. Zacharia. Test data decompression for multiple scan designs with boundary scan. IEEE Transactions on Computers, 47, 1188–1200, 1998.
[13] N.A. Touba and E.J. McCluskey. Altering a pseudo-random bit sequence for scan-based BIST. In Proceedings of the International Test Conference, pp. 167–175, 1996.
[14] S. Wang. Low hardware overhead scan based 3-weight weighted random BIST. In Proceedings of the International Test Conference, pp. 868–877, 2001.
[15] H.-J. Wunderlich and G. Kiefer. Bit-flipping BIST. In Proceedings of the International Conference on Computer-Aided Design, pp. 337–343, 1996.
[16] A.A. Al-Yamani and E.J. McCluskey. Built-in reseeding for serial BIST. In Proceedings of the VLSI Test Symposium, pp. 63–68, 2003.
[17] A.A. Al-Yamani and E.J. McCluskey. BIST reseeding with very few seeds. In Proceedings of the VLSI Test Symposium, pp. 69–74, 2003.
[18] S. Chiusano, P. Prinetto, and H.-J. Wunderlich. Non-intrusive BIST for systems-on-a-chip. In Proceedings of the International Test Conference, pp. 644–651, 2000.
[19] S. Hellebrand, S. Tarnick, J. Rajski, and B. Courtois. Generation of vector patterns through reseeding of multiple-polynomial linear feedback shift registers. In Proceedings of the International Test Conference, pp. 120–129, 1992.
[20] M.F. AlShaibi and C.R. Kime. Fixed-biased pseudorandom built-in self-test for random pattern resistant circuits. In Proceedings of the International Test Conference, pp. 929–938, 1994.
[21] M.F. AlShaibi and C.R. Kime. MFBIST: a BIST method for random pattern resistant circuits. In Proceedings of the International Test Conference, pp. 176–185, 1996.
[22] S. Pateras and J. Rajski. Cube-contained random patterns and their application to the complete testing of synthesized multi-level circuits. In Proceedings of the International Test Conference, pp. 473–482, 1991.
[23] N.A. Touba and E.J. McCluskey. Synthesis of mapping logic for generating transformed pseudorandom patterns for BIST. In Proceedings of the International Test Conference, pp. 674–682, 1995.
[24] N.A. Touba and E.J. McCluskey. Transformed pseudo-random patterns for BIST. In Proceedings of the VLSI Test Symposium, pp. 410–416, 1995.
[25] M. Bershteyn. Calculation of multiple sets of weights for weighted random testing. In Proceedings of the International Test Conference, pp. 1031–1040, 1993.
[26] F. Brglez, G. Gloster, and G. Kedem. Built-in self-test with weighted random pattern hardware. In Proceedings of the International Conference on Computer Design, pp. 161–166, 1990.
[27] F. Muradali, V.K. Agarwal, and B. Nadeau-Dostie. A new procedure for weighted random built-in self-test. In Proceedings of the International Test Conference, pp. 660–669, 1990.
[28] I. Pomeranz and S.M. Reddy. 3-weight pseudo-random test generation based on a deterministic test set for combinational and sequential circuits. IEEE Transactions on Computer-Aided Design, 12, 1050–1058, 1993.
[29] A. Jas, C.V. Krishna, and N.A. Touba. Hybrid BIST based on weighted pseudo-random testing: a new test resource partitioning scheme. In Proceedings of the VLSI Test Symposium, pp. 2–8, 2001.
[30] K. Chakrabarty. Optimal test access architectures for system-on-a-chip. ACM Transactions on Design Automation of Electronic Systems, 6, 26–49, 2001.
[31] V. Iyengar, K. Chakrabarty, and E.J. Marinissen. Test wrapper and test access mechanism co-optimization for system-on-chip. Journal of Electronic Testing: Theory and Applications, 18, 213–230, 2002.
[32] E.J. Marinissen, S.K. Goel, and M. Lousberg. Wrapper design for embedded core test. In Proceedings of the International Test Conference, pp. 911–920, 2000.
[33] V. Iyengar and K. Chakrabarty. System-on-a-chip test scheduling with precedence relationships, preemption, and power constraints. IEEE Transactions on Computer-Aided Design of ICs and Systems, 21, 1088–1094, 2002.
[34] E.J. Marinissen and H. Vranken. On the role of DfT in IC–ATE matching. In International Workshop on TRP, 2001.
[35] E. Volkerink et al. Test economics for multi-site test with modern cost reduction techniques. In Proceedings of the VLSI Test Symposium, pp. 411–416, 2002.
[36] M. Abramovici, M.A. Breuer, and A.D. Friedman. Digital Systems Testing and Testable Design. Computer Science Press, New York, 1990.
[37] P. Varma and S. Bhatia. A structured test re-use methodology for core-based system chips. In Proceedings of the International Test Conference, pp. 294–302, 1998.
[38] E.J. Marinissen et al. A structured and scalable mechanism for test access to embedded reusable cores. In Proceedings of the International Test Conference, pp. 284–293, 1998.
[39] T.J. Chakraborty, S. Bhawmik, and C.-H. Chiang. Test access methodology for system-on-chip testing. In Proceedings of the International Workshop on Testing Embedded Core-Based System-Chips, pp. 1.1-1–1.1-7, 2000.
[40] Q. Xu and N. Nicolici. On reducing wrapper boundary register cells in modular SOC testing. In Proceedings of the International Test Conference, pp. 622–631, 2003.
[41] Q. Xu and N. Nicolici. Wrapper design for testing IP cores with multiple clock domains. In Proceedings of the Design, Automation and Test in Europe (DATE) Conference, pp. 416–421, 2004.
[42] V. Immaneni and S. Raman. Direct access test scheme — design of block and core cells for embedded ASICs. In Proceedings of the International Test Conference, pp. 488–492, 1990.
[43] P. Harrod. Testing re-usable IP: a case study. In Proceedings of the International Test Conference, pp. 493–498, 1999.
[44] I. Ghosh, S. Dey, and N.K. Jha. A fast and low cost testing technique for core-based system-on-chip. In Proceedings of the Design Automation Conference, pp. 542–547, 1998.
[45] K. Chakrabarty. A synthesis-for-transparency approach for hierarchical and system-on-a-chip test. IEEE Transactions on VLSI Systems, 11, 167–179, 2003.
[46] M. Nourani and C. Papachristou. An ILP formulation to optimize test access mechanism in system-on-chip testing. In Proceedings of the International Test Conference, pp. 902–910, 2000.
[47] L. Whetsel. An IEEE 1149.1 based test access architecture for ICs with embedded cores. In Proceedings of the International Test Conference, pp. 69–78, 1997.
[48] N.A. Touba and B. Pouya. Using partial isolation rings to test core-based designs. IEEE Design and Test of Computers, 14, 52–59, 1997.
[49] J. Aerts and E.J. Marinissen. Scan chain design for test time reduction in core-based ICs. In Proceedings of the International Test Conference, pp. 448–457, 1998.
[50] V. Iyengar, K. Chakrabarty, and E.J. Marinissen. Test access mechanism optimization, test scheduling and tester data volume reduction for system-on-chip. IEEE Transactions on Computers, 52, 1619–1632, 2003.
[51] V. Iyengar, K. Chakrabarty, and E.J. Marinissen. Recent advances in TAM optimization, test scheduling, and test resource management for modular testing of core-based SOCs. In Proceedings of the IEEE Asian Test Symposium, pp. 320–325, 2002.
[52] Z.S. Ebadi and A. Ivanov. Design of an optimal test access architecture using a genetic algorithm. In Proceedings of the Asian Test Symposium, pp. 205–210, 2001.
[53] V. Iyengar and K. Chakrabarty. Test bus sizing for system-on-a-chip. IEEE Transactions on Computers, 51, 449–459, 2002.
[54] Y. Huang et al. Resource allocation and test scheduling for concurrent test of core-based SOC design. In Proceedings of the Asian Test Symposium, pp. 265–270, 2001.
[55] Y. Huang et al. On concurrent test of core-based SOC design. Journal of Electronic Testing: Theory and Applications, 18, 401–414, 2002.
[56] V. Iyengar, K. Chakrabarty, and E.J. Marinissen. Efficient test access mechanism optimization for system-on-chip. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 22, 635–643, 2003.
[57] E.J. Marinissen and S.K. Goel. Analysis of test bandwidth utilization in test bus and TestRail architectures in SOCs. Digest of Papers of DDECS, pp. 52–60, 2002.
[58] P.T. Gonciari, B. Al-Hashimi, and N. Nicolici. Addressing useless test data in core-based system-on-a-chip test. IEEE Transactions on Computer-Aided Design of ICs and Systems, 22, 1568–1590, 2003.
[59] S.K. Goel and E.J. Marinissen. Effective and efficient test architecture design for SOCs. In Proceedings of the International Test Conference, pp. 529–538, 2002.
[60] W. Jiang and B. Vinnakota. Defect-oriented test scheduling. In Proceedings of the VLSI Test Symposium, pp. 433–438, 1999.
[61] E. Larsson, J. Pouget, and Z. Peng. Defect-aware SOC test scheduling. In Proceedings of the VLSI Test Symposium, pp. 359–364, 2004.
[62] F. Beenker, B. Bennetts, and L. Thijssen. Testability Concepts for Digital ICs — The Macro Test Approach. Frontiers in Electronic Testing, Vol. 3. Kluwer Academic Publishers, Boston, MA, 1995.
[63] E.J. Marinissen et al. On IEEE P1500's standard for embedded core test. Journal of Electronic Testing: Theory and Applications, 18, 365–383, 2002.
[64] Y. Zorian. A distributed BIST control scheme for complex VLSI devices. In Proceedings of the VLSI Test Symposium, pp. 6–11, 1993.
[65] R.M. Chou, K.K. Saluja, and V.D. Agrawal. Scheduling tests for VLSI systems under power constraints. IEEE Transactions on VLSI Systems, 5, 175–184, 1997.
[66] V. Muresan, X. Wang, and M. Vladutiu. A comparison of classical scheduling approaches in power-constrained block-test scheduling. In Proceedings of the International Test Conference, pp. 882–891, 2000.
[67] E. Larsson and Z. Peng. Test scheduling and scan-chain division under power constraint. In Proceedings of the Asian Test Symposium, pp. 259–264, 2001.
[68] E. Larsson and Z. Peng. An integrated system-on-chip test framework. In Proceedings of the DATE Conference, pp. 138–144, 2001.
[69] S. Koranne. On test scheduling for core-based SOCs. In Proceedings of the International Conference on VLSI Design, pp. 505–510, 2002.
[70] V. Iyengar, S.K. Goel, E.J. Marinissen, and K. Chakrabarty. Test resource optimization for multi-site testing of SOCs under ATE memory depth constraints. In Proceedings of the International Test Conference, pp. 1159–1168, 2002.
[71] S. Koranne and V. Iyengar. A novel representation of embedded core test schedules. In Proceedings of the International Test Conference, pp. 539–540, 2002.
[72] A. Sehgal, V. Iyengar, M.D. Krasniewski, and K. Chakrabarty. Test cost reduction for SOCs using virtual TAMs and Lagrange multipliers. In Proceedings of the IEEE/ACM Design Automation Conference, pp. 738–743, 2003.
[73] A. Sehgal, V. Iyengar, and K. Chakrabarty. SOC test planning using virtual test access architectures. IEEE Transactions on VLSI Systems, 12, 1263–1276, 2004.
[74] A. Sehgal and K. Chakrabarty. Efficient modular testing of SOCs using dual-speed TAM architectures. In Proceedings of the IEEE/ACM Design, Automation and Test in Europe (DATE) Conference, pp. 422–427, 2004.
[75] Agilent Technologies. Winning in the SOC market, available online at: http://cp.literature.agilent.com/litweb/pdf/5988-7344EN.pdf
[76] Teradyne Technologies. Tiger: advanced digital with silicon germanium technology. http://www.teradyne.com/tiger/digital.html
[77] T. Yamamoto, S.-I. Gotoh, T. Takahashi, K. Irie, K. Ohshima, and N. Mimura. A mixed-signal 0.18-µm CMOS SoC for DVD systems with 432-MSample/s PRML read channel and 16-Mb embedded DRAM. IEEE Journal of Solid-State Circuits, 36, 1785–1794, 2001.
[78] H. Kundert, K. Chang, D. Jefferies, G. Lamant, E. Malavasi, and F. Sendig. Design of mixed-signal systems-on-a-chip. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 19, 1561–1571, 2000.
[79] E. Liu, C. Wong, Q. Shami, S. Mohapatra, R. Landy, P. Sheldon, and G. Woodward. Complete mixed-signal building blocks for single-chip GSM baseband processing. In Proceedings of the IEEE Custom Integrated Circuits Conference, pp. 11–14, 1998.
[80] A. Cron. IEEE P1149.4 — almost a standard. In Proceedings of the International Test Conference, pp. 174–182, 1997.
[81] S.K. Sunter. Cost/benefit analysis of the P1149.4 mixed-signal test bus. IEE Proceedings — Circuits, Devices and Systems, 143, 393–398, 1996.
[82] A. Sehgal, S. Ozev, and K. Chakrabarty. TAM optimization for mixed-signal SOCs using test wrappers for analog cores. In Proceedings of the IEEE International Conference on CAD, pp. 95–99, 2003.
[83] L. Li and K. Chakrabarty. Test set embedding for deterministic BIST using a reconfigurable interconnection network. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 23, 1289–1305, 2004.
[84] V. Iyengar, A. Chandra, S. Schweizer, and K. Chakrabarty. A unified approach for SOC testing using test data compression and TAM optimization. In Proceedings of the IEEE/ACM Design, Automation and Test in Europe (DATE) Conference, pp. 1188–1189, 2003.
[85] P.T. Gonciari and B. Al-Hashimi. A compression-driven test access mechanism design approach. In Proceedings of the European Test Symposium, pp. 100–105, 2004.
28
Embedded Software-Based Self-Testing for SoC Design

Kwang-Ting (Tim) Cheng
University of California at Santa Barbara

28.1 Introduction
28.2 Embedded Processor Self-Testing
      Stuck-At Fault Testing
28.3 Test Program Synthesis Using VCCs
28.4 Delay Testing
28.5 Embedded Processor Self-Diagnosis
28.6 Self-Testing of Buses and Global Interconnects
28.7 Self-Testing of Other Nonprogrammable IP Cores
28.8 Instruction-Level DfT/Test Instructions
28.9 Self-Test of On-Chip ADC/DAC and Analog Components Using DSP-Based Approaches
28.10 Conclusions
Acknowledgments
References
The increasing heterogeneity and programmability associated with system-on-chip (SoC) architectures, together with ever-increasing operating frequencies and technology changes, are demanding fundamental changes in integrated circuit (IC) testing. At-speed testing of high-speed circuits with external testers is becoming increasingly difficult owing to the growing gap between design and tester performance, the growing cost of high-performance testers, and the increasing yield loss caused by inherent tester inaccuracy. Therefore, empowering the chip to test itself seems like a sensible solution. Hardware-based self-testing techniques (known as built-in self-test, or BIST) have limitations owing to performance, area, and design time overhead, as well as problems caused by the application of nonfunctional patterns (which may result in higher power consumption during testing, over-testing, yield loss, etc.). The embedded software-based self-testing technique has recently become the focus of intense research. One guiding principle of this embedded self-test paradigm is to utilize on-chip programmable resources (such as embedded microprocessors and digital signal processors, DSPs) for on-chip test generation, test delivery, signal acquisition, response analysis, and even diagnosis.
After the programmable components have been self-tested, they can be reused for testing on-chip buses, interfaces, and other nonprogrammable components. Embedded test techniques based on this principle reduce the need for dedicated test hardware and enable easier application and more accurate analysis of at-speed test signals on-chip. In this chapter, we give a survey and outline of the roadmap of this emerging embedded software-based self-testing paradigm.
28.1 Introduction
System-on-chip has become a widely accepted architecture for highly complex systems on a single chip. Short time-to-market and rich functionality requirements have driven design houses to adopt the SoC design flow. A SoC contains a large number of complex, heterogeneous components that can include digital, analog, mixed-signal, radio frequency (RF), micromechanical, and other systems on a single piece of silicon. As the lines gradually fade between traditional digital, analog, RF, and mixed-signal devices; as operational frequencies rapidly increase; and as feature sizes shrink, testing is facing a whole new set of challenges. Figure 28.1 shows the cost of silicon manufacturing versus the cost of testing given in the SIA and ITRS roadmaps [1,2]. The top curve shows the reduction in fabrication capital per transistor (Moore's law). The bottom curve shows the test capital per transistor (Moore's law for test). From the ITRS roadmap it is clear that unless fundamental changes to test are made, it may eventually cost more to test a chip than to manufacture it [2]. Figure 28.1 also shows the historical trend in test paradigms. On one hand, the high cost of manually developed functional tests and the difficulty of translating embedded component tests to the chip boundary, where the automatic test equipment (ATE) interface exists, are making these tests infeasible even for very high-volume products. On the other hand, even if automatically developed structural tests (such as scan tests) are available, their application using ATEs poses challenges because tester performance is increasing at a slower rate than device speed. This translates into increasing yield loss with external testing, since guard-banding to cover tester errors results in the loss of more and more good chips. In addition, high-speed and high-pin-count testers are very costly. Design-for-testability (DfT) and BIST have been regarded as possible solutions for changing the direction of the bottom curve in Figure 28.1. BIST solutions eliminate the need for high-speed testers and show greater accuracy in applying and analyzing at-speed test signals on-chip.
FIGURE 28.1 Fab versus test capital. (Plot of cost in cents per transistor versus year, 1982–2012, based on 1997 SIA roadmap and 1999 ITRS roadmap data. The upper curve shows fab capital per transistor (Moore's law); the lower curve shows test capital per transistor (Moore's law for test). The accompanying test-paradigm timeline runs from functional testing (manual TG), to structural testing (scan, ATPG), to built-in self-test (embedded hardware tester), to embedded software-based self-test (embedded software tester).)
Existing BIST techniques belong to the class of structural BIST. Structural BIST approaches, such as scan-based BIST techniques [3–5], offer good test quality but require the addition of dedicated test circuitry (such as full scan, linear-feedback shift registers [LFSRs] for pattern generation, multiple-input signature analyzers [MISRs] for response analysis, and test controllers). Therefore, they incur nontrivial area, performance, and design time overhead. Moreover, structural BIST applies nonfunctional, high-switching random patterns and thus causes much higher power consumption than normal system operation. Also, to apply at-speed tests that detect timing-related faults, existing structural BIST needs to resolve various complex timing issues related to multiple clock domains, multiple frequencies, and test clock skews that are unique to the test mode. A new embedded software-based self-testing paradigm [6–8] has the potential to alleviate the problems caused by using external testers as well as the structural BIST problems described earlier. In this testing strategy, it is assumed that programmable components of the SoC (such as processor, DSP, and FPGA components) are first self-tested by running an automatically synthesized test program that can achieve high fault coverage. Next, the programmable component is used as a pattern generator and response analyzer to test on-chip buses, interfaces between components, and other components, including digital, mixed-signal, and analog components. This self-test paradigm is sometimes referred to as functional self-testing. The concept of embedded software-based self-testing is illustrated in Figure 28.2 using a bus-based SoC. In this illustration, the IP cores in the SoC are connected to a standard bus via the virtual component interface (VCI) [9]. The VCI acts as a standard communication interface between an IP core and the on-chip bus. First, the microprocessor tests itself by executing a set of instructions. Next, the processor can be used for testing the bus as well as other nonprogrammable IP cores in the SoC. To support the self-testing methodology, each IP core is encased in a test wrapper. The test wrapper contains the test support logic needed to control shifting of the scan chain, buffers to store scan data and support at-speed test, etc. In this example, the on-chip bus is a shared bus, and the arbiter controls access to the bus. There are several advantages to the embedded software-based self-test approach. First, it allows reuse of programmable resources on SoCs for test purposes. In other words, this strategy views testing as an application of the programmable components in the SoC and thus minimizes the need for additional dedicated test circuitry for self-test or DfT. Second, in addition to eliminating the need for expensive high-speed testers, it can also reduce the yield loss owing to tester inaccuracy. Self-testing offers the ability to apply and analyze at-speed test signals on-chip with accuracy greater than that obtainable with a tester.
FIGURE 28.2 Embedded software-based self-testing for SoC. (The figure shows a bus-based SoC in which a CPU and a DSP connect to the on-chip bus through VCI bus-interface master wrappers, with a bus arbiter, system memory, and scan-wrapped IP cores — each with test support logic, a scan data buffer, and an interface wrapper — attached through bus-interface target wrappers; an external tester loads the test program into main memory and unloads the response signatures.)
Third, while hardware-based self-test must be applied in a nonfunctional BIST mode, software-based self-test can be applied in the normal operational mode of the design — that is, the tests are applied by executing instruction sequences as in regular system operation. This eliminates the problems created by the application of nonfunctional patterns, which can result in excessive power consumption when hardware BIST is used. Functional self-test can also alleviate many of the over-testing and yield loss problems caused by the application of nonfunctional patterns during structural testing for delay faults and cross-talk faults (through at-speed scan or BIST). Experiments have shown that many structurally testable delay faults in microprocessors can never be sensitized in the functional mode of the circuit [7]. This is because no functionally applicable vector sequence can excite these delay faults and propagate the fault effects to destination outputs/flip-flops at-speed. Defects corresponding to these faults do not affect circuit performance, so testing for them is unnecessary. However, if the circuit is tested by applying nonfunctional patterns, these defects could be detected and the chip could be identified as faulty, resulting in yield loss. Software-based fault localization tools are on the high-priority list according to the ITRS roadmap [2]. In addition to self-testing, functional information can also be used to guide diagnostic self-test program synthesis. Testing of analog and mixed-signal circuits has been an expensive process because of the limited access to the analog parts and the testers required to perform functional testing. The situation has become worse owing to the trend of integrating various digital, mixed-signal, and analog components into the SoC, with the result that testing the analog and mixed-signal parts becomes the bottleneck of production testing. Most of these problems can be alleviated by self-testing of on-chip ADC/DAC and analog components using DSP-based approaches that utilize on-chip programmable resources. In the rest of the chapter, we present some representative methods on this subject. We start by discussing processor self-test methods targeting stuck-at faults and delay faults. We also give a brief description of a processor self-diagnosis method. Next, we continue with a discussion of methods for self-testing of buses and global interconnects as well as other nonprogrammable IP cores on the SoC. We also describe instruction-level DfT methods based on the insertion of test instructions to increase fault coverage and reduce test application time and test program size. Finally, we summarize DSP-based self-test for analog/mixed-signal components.
28.2 Embedded Processor Self-Testing
Embedded software-based self-test methods for processors [6–9] consist of two steps: the test preparation step and the self-testing step. The test preparation step involves generation of realizable tests for components of the processor. Realizable tests are those that can be delivered using instructions; therefore, to avoid producing undeliverable test patterns, the tests are generated under the constraints imposed by the processor instruction set. The tests can then be either stored or generated on-chip, depending on which method is more efficient for a particular case. A low-speed tester can be used to load the self-test signatures or the predetermined tests into the processor memory prior to the application of tests. Note that the inability to apply every conceivable input pattern to a microprocessor component does not necessarily translate into low fault coverage. If a fault can be detected only by test patterns outside the allowed input space, then by definition the fault is redundant in the normal operational mode of the processor. Thus, there is no need to test for this type of fault in production testing, even though we may still want to detect and locate it in the debugging and diagnosis phase. The self-testing step, illustrated in Figure 28.3, involves the application of these tests using a software tester. The software tester can also compress the responses into self-test signatures that can then be stored in memory. The signatures can later be unloaded and analyzed by an external tester. Here, the assumption is that the processor memory has already been tested with standard techniques, such as memory BIST, before the application of the test, and so the memory is assumed to be fault-free.
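As a concrete illustration of the response-analysis step, the sketch below shows how a self-test program might compact test responses into a signature in software, mimicking what a hardware MISR does. The 32-bit feedback polynomial, the routine names, and the fixed signature location are illustrative assumptions, not details of any of the cited methods.

    #include <stdint.h>

    #define SIG_POLY 0xEDB88320u  /* assumed 32-bit feedback polynomial */

    /* Fold one response word into the running signature: XOR in the new
     * response, then shift with polynomial feedback, as a MISR would.   */
    static uint32_t fold_response(uint32_t sig, uint32_t response)
    {
        sig ^= response;
        for (int bit = 0; bit < 32; bit++)
            sig = (sig >> 1) ^ ((sig & 1u) ? SIG_POLY : 0u);
        return sig;
    }

    /* Compact an array of test responses (e.g., results written back by
     * the test program) into one signature stored at a fixed memory
     * location, from which a low-speed external tester can unload it.  */
    void compact_responses(const uint32_t *resp, int n,
                           volatile uint32_t *sig_addr)
    {
        uint32_t sig = 0xFFFFFFFFu;  /* nonzero seed avoids the all-zero trap */
        for (int i = 0; i < n; i++)
            sig = fold_response(sig, resp[i]);
        *sig_addr = sig;
    }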
FIGURE 28.3 Embedded processor self-testing. (An external tester loads the on-chip test application program and the test response analysis program into the instruction memory; the CPU applies the test stimuli over the processor bus and stores the response signature in data memory.)
In the following, we describe the embedded software-based self-test methods for testing stuck-at [6,9] and path delay faults [7,8] in microprocessors.
28.2.1 Stuck-At Fault Testing
The method proposed by Chen and Dey [6] targets stuck-at faults in a processor core using a divide-and-conquer approach. First, it determines the structural test needs of subcomponents in the processor (e.g., the ALU or the program counter) that are much less complex than the full processor and hence more amenable to random pattern testing. Next, the component tests are either stored or generated on-chip and then, at the processor level, delivered to their target components using predetermined instruction sequences. To make sure that the test patterns generated for a subcomponent under test can be delivered by instructions, the test preparation step precedes the self-test step.
28.2.1.1 Test Preparation
To derive the realizable component tests (i.e., tests deliverable by instructions), the instruction-imposed constraints must first be derived for each component. These constraints can be divided into input and output constraints. The input constraints define the input space of the component allowed by instructions. They describe the correlation among the inputs to the component and can be expressed in the form of Boolean equations (for example, a constraint might state that two control inputs of an ALU can never be set to 1 simultaneously by any instruction). The output constraints define the subset of component outputs observable by instructions. To obtain a close prediction of fault coverage in component-level fault simulation, errors propagating to component outputs that are unobservable at the processor level are regarded as unobserved. Also, the constraints imposed by the processor instruction set can be divided into those that can be specified in a single time frame (spatial constraints) and those that span several time frames (temporal constraints). Temporal constraints are used to account for the loss of fault coverage owing to fault aliasing in cases where the application of one test pattern involves multiple passes through a fault inside the component. If component tests are generated by automatic test pattern generation (ATPG), the spatial constraints can be specified during test generation with the aid of the ATPG tool. Alternatively, they can be specified with virtual constraint circuits (VCCs) as proposed in [10] (details of this alternative are described in Section 28.3). Similarly, temporal constraints can be modeled with sequential VCCs. Unlike the case of ATPG, if random tests are used for components, random patterns can be applied only to independent inputs. Component-level fault simulation is used for evaluating the preliminary fault coverage of these tests. The final fault coverage can be evaluated with processor-level fault simulation once the entire self-test program is constructed. Although component tests are generated only for the subset of components that are easily accessible through instructions (e.g., the ALU and the program counter), other components, such as the instruction decoder, are expected to be tested extensively during the application of the self-test program.
28.2.1.2 Self-Test
After the realizable component tests have been derived, the next step is on-chip self-test using an embedded software tester for the on-chip generation of component test patterns, the delivery of component tests, and the analysis of their responses. Component tests can either be stored or generated on-chip. If tests are generated on-chip, the test needs of each component are characterized by a self-test signature, which includes the seed, S, and the configuration, C, of a pseudo-random number generator, as well as the number of test patterns to be generated, N. The self-test signatures can be expanded on-chip into test sets using a pseudo-random number-generation program. Multiple self-test signatures may be used for one component if necessary. Thus, this self-test methodology allows the incorporation of any deterministic BIST technique that encodes a deterministic test set as several pseudo-random test sets [11,12]. Since the component tests are developed under the constraints imposed by the processor instruction set, it is always possible to find instructions for applying the component tests. On the output end, special care must be taken when collecting component test responses. Since data outputs and status outputs have different observability, they should be treated differently during response collection. In general, although there are no instructions for storing the status outputs of a component directly to memory, an image of the status outputs can be created in memory using conditional instructions. This technique can be used to observe the status outputs of any component. Using manually extracted constraints, the above scheme has been applied to the simple Parwan processor [13]. The generated test program achieved high stuck-at fault coverage for this simple processor.
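To make the (S, C, N) encoding concrete, the following sketch expands one self-test signature into patterns with a software LFSR. The 32-bit Fibonacci LFSR, the interpretation of C as a tap mask, and the apply_to_component callback are assumptions made here for illustration; the cited work does not prescribe this particular generator.

    #include <stdint.h>

    /* Self-test signature: seed S, configuration C (assumed here to be
     * the feedback tap mask of the generator), and pattern count N.    */
    struct selftest_sig {
        uint32_t seed;   /* S */
        uint32_t taps;   /* C */
        uint32_t count;  /* N */
    };

    /* XOR-reduce the tapped bits to form the feedback bit. */
    static uint32_t parity32(uint32_t x)
    {
        x ^= x >> 16; x ^= x >> 8; x ^= x >> 4; x ^= x >> 2; x ^= x >> 1;
        return x & 1u;
    }

    /* One step of a 32-bit Fibonacci LFSR. */
    static uint32_t lfsr_step(uint32_t state, uint32_t taps)
    {
        return (state >> 1) | (parity32(state & taps) << 31);
    }

    /* Expand a signature into N patterns, handing each one to the
     * instruction sequence that delivers it to the component under test. */
    void expand_and_apply(const struct selftest_sig *s,
                          void (*apply_to_component)(uint32_t pattern))
    {
        uint32_t state = s->seed;
        for (uint32_t i = 0; i < s->count; i++) {
            apply_to_component(state);
            state = lfsr_step(state, s->taps);
        }
    }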
28.3 Test Program Synthesis Using VCCs
Tupuri et al. proposed an approach in Reference 10 for generating functional tests for processors using a gate-level sequential ATPG tool. It attempts to generate tests for all detectable stuck-at faults under the functional constraints, and then applies these functional test vectors at the system's operational speed. The key idea of this approach lies in synthesized logic that embodies the functional constraints, known as VCCs. After the functional constraints of an embedded module have been extracted, they are described in a hardware description language (HDL) and synthesized into logic gates. A commercial ATPG tool is then used to generate module-level vectors with this constraint circuitry imposed. These module-level vectors are translated into processor-level functional vectors and fault simulated to verify the fault coverage. Figure 28.4 illustrates this hierarchical test generation process using a gate-level test generator for sequential circuits. Chen et al. [9] perform module-level test generation for embedded processors using the concept of VCCs, but with a different utilization: the generated test vector values can be plugged directly into the settable fields (e.g., operands, source, and destination registers) of test program templates. This utilization simplifies the automated generation of test programs for embedded processors. Figure 28.5 shows the overall test program synthesis process proposed in Reference 9, in which the final self-test program is synthesized automatically from (1) a simulatable HDL processor design at the register transfer level (RTL), and (2) the instruction set architecture (ISA) specification of the embedded processor. The goal and the process of each step are as follows:
Step 1. Partition the processor into a collection of combinational blocks, each a module-under-test (MUT); the test program for each MUT is synthesized separately.
Step 2. Systematically construct a comprehensive set of test program templates. Test program templates can be classified into single-instruction templates and multi-instruction templates.
FIGURE 28.4 Use of VCCs for functional test generation. (Top: a commercial ATPG operating on the embedded module together with its surrounding logic, which determines the module constraints, produces tests for faults in the embedded module. Bottom: the extracted constraints are synthesized into logic placed at virtual inputs and outputs of the module; the ATPG then generates tests for faults in the module with constraints, and a translator maps them to processor-level tests.) (From R.S. Tupuri and J.A. Abraham, in Proceedings of the IEEE International Test Conference (ITC), September 1997. With permission.)
Single-instruction templates are built around one key instruction, whereas multi-instruction templates include additional supporting instructions, for example, to trigger pipeline forwarding. Exhausting all possibilities in generating test program templates would be impossible, but generating a wide variety of templates is necessary in order to achieve high fault coverage.
Step 3. Rank templates based on a controllability-/observability-based testability metric obtained through simulation. Templates at the top of the list Tm have high controllability (meaning it is easy to set specific values at the inputs of the MUT) and/or high observability (meaning it is easy to propagate the values at the outputs of the MUT to data registers or to observation points, which can be mapped onto and stored in memory).
Step 4. Derive the input mapping functions for each template t from the template's settable fields (which include operands, source registers, and destination registers) to the inputs of the MUT. Also derive the output mapping functions from the MUT's outputs to the system's observation points. The input mapping functions can be derived by simulating a number of instances of template t to obtain traces, followed by regression analysis to construct the mapping between settable fields and inputs of the MUT. The output mapping functions can be derived by injecting the unknown "X" value at the outputs of the MUT during simulation and observing the propagation of the X values to the template's specified destinations.
Step 5. Synthesize the mapping functions into VCCs. The utilization of VCCs not only enforces the instruction-imposed constraints, but also facilitates the translation from module-level test patterns to instruction-level test programs. First, implement the mapping functions between the settable fields in template t and the inputs of MUT m as the input-side VCC, and attach it to MUT m. Similarly, attach an output-side VCC that embodies the output mapping functions.
FIGURE 28.5 Overview of the scalable software-based self-test methodology. (Flowchart driven by the processor RTL and the ISA: MUT partitioning and extraction of candidate test program templates; template ranking via controllability/observability metrics computed by X-based simulation; derivation of input/output mapping functions by simulating template instances and applying regression analysis; VCC generation; constrained test generation; test program synthesis and cross-compilation into a memory image; processor-level fault simulation; and updating of the undetected fault set with reranking of the remaining templates until acceptable fault coverage is reached for every MUT.) (From L. Chen, S. Ravi, A. Raghunathan, and S. Dey, in Proceedings of the ACM/IEEE Design Automation Conference (DAC), June 2003. With permission.)
FIGURE 28.6 Constrained test generation using VCCs. (The circuit as seen by the test generator: an input-side VCC maps the settable fields of template t onto the constrained inputs of MUT m, and an output-side VCC maps all outputs of m onto its observable destinations.) (From L. Chen, S. Ravi, A. Raghunathan, and S. Dey, in Proceedings of the ACM/IEEE Design Automation Conference (DAC), June 2003. With permission.)
<s>
2. Assignment to settable fields <s>
ef12 0200 1ac0
10 3 8
ef12 0200 1ac0
10 3 8
1002 029a 9213
...
...
...
...
...
3. Test program template t load a<s>, load a, nop; nop; nop; nop add a, a<s>, a store a,
7 2 9
8 12 3
...
...
4. Final test program TPm,t load a10, ef12 load a7, 1002 nop; nop; nop; nop add a8, a10, a7 store a8, ff12 load a3, 0200 load a2, 029a nop; nop; nop; nop add a12, a3, a2 store a12, ff16 ......
FIGURE 28.7 Example of test program synthesis. (From L. Chen, S. Ravi, A. Raghunathan, and S. Dey, in Proceedings of the ACM/IEEE Design Automation Conference (DAC), June 2003. With permission.)
Step 6. Generate module-level tests for the composite circuit comprising the MUT and the input/output virtual constraint components. During constrained test generation, the test generator sees the circuit including MUT m and the two VCCs, as shown in Figure 28.6. Note that faults within the VCCs are eliminated from the fault list and so are not considered for test generation. With this composite model, the pattern generator can generate patterns with values directly specified at the settable fields of instruction template t.
Step 7. Synthesize the target test program from the patterns generated in Step 6. Note that the test patterns of Step 6 assign values to some of the settable fields of each instruction template t; the remaining settable fields are filled with random values. The test program is then synthesized by converting the value of each settable field into its corresponding position in instruction template t (a minimal sketch of this substitution appears after the step list). Figure 28.7 gives an example of the flow for synthesizing the target program.
Step 8. Perform fault simulation on the synthesized test program segment to identify the subset of stuck-at faults detected by the segment.
Step 9. Update the set of undetected faults and rerank the remaining templates in template list Tm to prepare for the next iteration of test program generation.
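The Step 7 substitution is essentially template instantiation. The sketch below reproduces the add template of Figure 28.7 with hypothetical %1–%5 placeholders standing in for the settable fields; the placeholder syntax and function names are assumptions for illustration, not the representation used in Reference 9.

    #include <stdio.h>

    /* Substitute settable-field values into a test program template.
     * Markers %1..%9 stand for settable fields (operands, source and
     * destination registers); fields fixed by the constrained ATPG keep
     * their generated values, the rest arrive pre-filled with randoms. */
    void emit_test_program(const char *tmpl, const char *fields[], FILE *out)
    {
        for (const char *p = tmpl; *p; p++) {
            if (*p == '%' && p[1] >= '1' && p[1] <= '9') {
                fputs(fields[p[1] - '1'], out);  /* substitute the field */
                p++;                             /* skip the digit       */
            } else {
                fputc(*p, out);
            }
        }
    }

    /* One instance of the add template from Figure 28.7. */
    int main(void)
    {
        const char *tmpl =
            "load a%1, %4\n"
            "load a%2, %5\n"
            "nop; nop; nop; nop\n"
            "add a%3, a%1, a%2\n"
            "store a%3, ff12\n";
        const char *fields[] = { "10", "7", "8", "ef12", "1002" };
        emit_test_program(tmpl, fields, stdout);
        return 0;
    }

Running this prints the first program instance of Figure 28.7 (load a10, ef12; load a7, 1002; four nops; add a8, a10, a7; store a8, ff12).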
In Reference 14, the above process is further extended to synthesize test programs for detecting cross-talk faults. Unlike stuck-at faults, signal integrity problems such as cross-talk must be tested by applying a sequence of vectors at operational speed. The requirement of generating multiple specific vectors while simultaneously honoring the instruction-imposed constraints poses challenges for test program synthesis. The semiautomated test program generation framework proposed in Reference 14 combines multiple instruction-level constraints (multiple VCCs) with a structural ATPG algorithm to select the instruction sequences and their corresponding operand values for detecting cross-talk faults. Preliminary results were demonstrated for an industrial processor, Xtensa, from Tensilica Inc.
28.4 Delay Testing
Ensuring that a design meets its performance specification requires the application of delay tests. These tests should be applied at-speed and consist of two-vector patterns, applied to the combinational portion of the circuit under test, that activate faults and propagate the fault effects to registers or other observation points [33]. A software-based self-test method targeting delay faults in processor cores has been proposed by Lai et al. [7,8]. As in the case of stuck-at faults, not all delay faults in a microprocessor can be tested in the functional mode. This is simply because no instruction sequence can produce the desired test sequence that sensitizes the path and captures the fault effect into the destination output/flip-flop at-speed. A fault is said to be functionally testable if there exists a functional test for that fault; otherwise, the fault is functionally untestable. To illustrate functionally untestable faults, consider the portion of a simple processor's datapath shown in Figure 28.8. It contains an 8-bit ALU, an accumulator (AC), and an instruction register (IR). The data inputs of the ALU, A7–A0 and B7–B0, are connected to the internal data bus and the AC, respectively. The control inputs of the ALU are S2–S0; the values on S2–S0 instruct the ALU to perform the desired arithmetic/logic operation. The outputs of the ALU are connected to the inputs of AC and the inputs of IR. It can be shown that, for all possible instruction sequences, whenever a rising transition occurs on signal S1 at the beginning of a clock cycle, AC and IR can never be enabled at the end of the same cycle. Therefore, paths that start at S1 and end at the inputs of IR or AC are functionally untestable, since delay effects on them can never be captured by IR or AC immediately after the vector pair has been applied. The goal of the test preparation step is to identify functionally testable faults and synthesize tests for them. The flow of test program synthesis for self-test of path delay faults in a microprocessor using its instructions consists of four major steps:
1. Given the ISA and the micro-architecture of the processor core, the spatial and temporal constraints, between and at the registers and control signals, are first extracted.
2. A path classification algorithm, extended from [15,16], implicitly enumerates and examines all paths and path segments with the extracted constraints imposed. If a path cannot be sensitized under the imposed constraints, the path is functionally untestable and is therefore eliminated from the fault universe.
FIGURE 28.8 Datapath example. (From W.-C. Lai, A. Krstic, and K.-T. Cheng, in Proceedings of the IEEE VLSI Test Symposium (VTS), April 2000. With permission.)
This helps reduce the computational effort of the subsequent test generation process. The preliminary experimental results in Reference 7 indicate that a nontrivial percentage of the paths in simple processors (such as the Parwan processor [13] and the DLX processor [17]) are functionally untestable but structurally testable.
3. A subset of long paths among the functionally testable paths is selected as targets for test generation. A gate-level ATPG for path delay faults is extended to incorporate the extracted constraints into the test generation process, where it is used to generate test vectors for each target path delay fault. If a test is successfully generated, it not only sensitizes the path but also meets the extracted constraints. Therefore, it is most likely to be deliverable by instructions (if the complete set of constraints has been extracted, delivery by instructions can be guaranteed).
4. In the test program synthesis process that follows, the test vectors specifying the bit values at internal flip-flops are first mapped back to word-level values in registers and values at control signals. These mapped value requirements are then justified at the instruction level. Finally, a predefined propagating routine is used to propagate the fault effects captured in the registers/flip-flops of the path delay fault to memory. This routine compresses the contents of some or all registers in the processor, generates a signature, and stores it in memory. The procedure is repeated until all target faults have been processed.
The test program, which is generated offline, is used to test the microprocessor at-speed. This test program synthesis flow has been applied to the Parwan [13] and DLX [17] processors. On average, 5.3 and 5.9 instructions, respectively, were needed to deliver a test vector, and the achieved fault coverage for testable path delay faults was 99.8% for Parwan and 96.3% for DLX.
28.5 Embedded Processor Self-Diagnosis
In addition to enabling at-speed self-test with low-cost testers, software-based self-test eliminates the use of scan chains and the associated test overhead, making it an attractive solution for testing high-end microprocessors. The elimination of scan chains, on the other hand, poses a significant challenge for fault diagnosis. Though deterministic methods for generating diagnostic tests are available for combinational circuits [18], sequential circuits are much too complex to be handled by the same approach. Consequently, there have been several proposals for generating diagnostic tests for sequential circuits by modifying existing detection tests [19,20]. A prerequisite for these methods is a high-coverage detection test set for the sequential circuit under test; thus, their success depends on the success of sequential test generation techniques. Though current sequential ATPG techniques are not practical enough for handling large sequential circuits, software-based self-test methods are able to generate tests successfully for one particular type of sequential circuit — the microprocessor. If properly modified, these tests might achieve a high diagnostic capability. In addition, functional information (the ISA and micro-architecture) can be used to guide and facilitate diagnosis. An initial investigation of the diagnostic potential of software-based self-test was reported in Reference 21, which attempted to generate test programs geared toward diagnosis. Diagnosis is performed by analyzing the combined test responses to a large number of small diagnostic test programs. To achieve a high diagnostic resolution, the diagnostic test programs are generated such that each test program detects as few faults as possible, while the union of all test programs detects as many faults as possible.
28.6 Self-Testing of Buses and Global Interconnects
In SoC designs, a large amount of core-to-core communication must be realized with long interconnects. As gate delays continue to decrease, the performance of the interconnect is becoming increasingly important for achieving high overall performance [2].
However, owing to the increase in cross-coupling capacitance and mutual inductance, signals on neighboring wires may interfere with each other, causing excessive delay or loss of signal integrity. While many techniques have been proposed to reduce cross-talk, limited design margins and unpredictable process variations mean that cross-talk must also be addressed in manufacturing testing. Owing to the nature of its timing, testing for cross-talk effects should be conducted at the operational speed of the circuit under test. However, at-speed testing of GHz systems requires prohibitively expensive high-speed testers. Moreover, with external testing, hardware access mechanisms are required for applying tests to interconnects deeply embedded in the system. This may lead to unacceptable costs in area or performance overhead. A BIST technique in which a SoC tests its own interconnects for cross-talk defects using on-chip hardware pattern generators and error detectors has been proposed in Reference 22. Although the area overhead may be amortized in large systems, for small systems the relative area overhead may be unacceptable. Moreover, hardware-based self-test approaches, such as the one in Reference 22, may cause over-testing and yield loss, as not all test patterns generated in the test mode are valid in the normal operational mode of the system. The problem of testing system-level interconnects in embedded processor-based SoCs has been addressed in References 23 and 24. In such SoCs, most of the system-level interconnects, such as the on-chip buses, are accessible to the embedded processor core(s). The proposed methodology, being software-based, enables an embedded processor core in the SoC to test for cross-talk effects in these interconnects by executing a software program. The strategy is to let the processor execute a self-test program with which test vector pairs can be applied to the appropriate bus in the normal functional mode of the system. In the presence of cross-talk-induced glitch or delay effects, the second vector in the vector pair becomes distorted at the receiver end of the bus. The processor, however, can store this error effect to memory as a test response, which can later be unloaded by an external tester for off-chip analysis. The maximum aggressor (MA) fault model proposed in Reference 25 is suitable for modeling cross-talk defects on interconnects. It abstracts the cross-talk defects on global interconnects with a linear number of faults, defined by the resulting cross-talk error effects: positive glitch (gp), negative glitch (gn), rising delay (dr), and falling delay (df). For a set of N interconnects, the MA fault model considers the collective aggressor effects on a given victim line Yi while all other N − 1 wires act as aggressors. The transitions required on the aggressor/victim lines to excite the four error types are shown in Figure 28.9. For example, the test for a positive glitch (gp) on victim line Yi, shown in the first column of Figure 28.9, requires that line Yi hold a constant "0" value while the other N − 1 aggressor lines carry a rising transition. Under this pattern, the victim line Yi exhibits a positive glitch owing to the cross-talk effect; if excessive, the glitch results in an error. These patterns, collectively called MA tests, excite the worst-case cross-talk effects on the victim line Yi.
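The MA tests are regular enough to be generated programmatically. The sketch below builds the vector pair (v1, v2) for each fault type on a bus of up to 32 lines; the enum encoding and the function name are assumptions made here for illustration.

    #include <stdint.h>

    enum ma_fault { MA_GP, MA_GN, MA_DR, MA_DF };  /* gp, gn, dr, df */

    /* Build the MA test pair (v1, v2) for victim bit `victim` of an n-bit
     * bus (n <= 32).  For glitch tests the victim holds a constant value
     * while all aggressors switch together; for delay tests the victim
     * switches while all aggressors switch in the opposite direction.   */
    void ma_test_pair(int n, int victim, enum ma_fault f,
                      uint32_t *v1, uint32_t *v2)
    {
        uint32_t all  = (n == 32) ? 0xFFFFFFFFu : ((1u << n) - 1u);
        uint32_t vbit = 1u << victim;
        uint32_t aggr = all & ~vbit;   /* every line except the victim */

        switch (f) {
        case MA_GP: *v1 = 0;    *v2 = aggr; break; /* victim 0, aggressors rise    */
        case MA_GN: *v1 = all;  *v2 = vbit; break; /* victim 1, aggressors fall    */
        case MA_DR: *v1 = aggr; *v2 = vbit; break; /* victim rises, aggressors fall */
        case MA_DF: *v1 = vbit; *v2 = aggr; break; /* victim falls, aggressors rise */
        }
    }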
For a set of N interconnects, there are 4N MA faults, requiring 4N MA tests. It has been shown in Reference 25 that these 4N faults cover all cross-talk defects on any of the N interconnects. In a core-based SoC, the address, data, and control buses are the main types of global interconnects; over them, the embedded processors communicate with memory and with the other cores of the SoC via memory-mapped I/O.
FIGURE 28.9 Maximal aggressor tests for victim Yi. (Four panels, one per fault type: for gp the victim Yi is held at 0 while the surrounding lines Y1–YN rise; for gn the victim is held at 1 while the aggressors fall; for dr the victim rises while the aggressors fall; and for df the victim falls while the aggressors rise.) (From M. Cuviello, S. Dey, X. Bai, and Y. Zhao, in Proceedings of the IEEE International Conference on Computer-Aided Design (ICCAD), November 1999. With permission.)
FIGURE 28.10 Testing the address bus. (Top: with a fault-free address bus, the CPU reads address 0001 and then 1110, and receives the data 0100 stored at 1110. Bottom: with a faulty address bus, the second address is distorted to 1111 and the CPU receives 1001 instead, exposing the error.) (From X. Bai, S. Dey, and J. Rajski, in Proceedings of the ACM/IEEE Design Automation Conference (DAC), June 2000. With permission.)
Li et al. [23] concentrate on testing the data and address buses in a processor-based SoC. The cross-talk effects on the interconnects are modeled using the MA fault model.
Testing the data bus. For a bidirectional bus such as a data bus, cross-talk effects vary as the bus is driven from different directions; thus, cross-talk tests should be conducted in both directions [22]. However, to apply a pair of vectors (v1, v2) in a particular bus direction, the direction of v1 is irrelevant, as long as the logic value on the bus is held at v1. Only v2 needs to be applied in the specified bus direction, because the signal transition triggering the cross-talk effect takes place only when v2 is being applied to the bus. To apply a test vector pair (v1, v2) to the data bus from a SoC core to the CPU, the CPU first exchanges data v1 with the core. The direction of this data exchange is irrelevant; for example, if the core is a memory, the CPU may either read v1 from the memory or write v1 to the memory. The CPU then requests data v2 from the core (a memory read if the core is a memory). Upon the arrival of v2, the CPU writes v2 to memory for later analysis. To apply a test vector pair (v1, v2) to the data bus from the CPU to a SoC core, the CPU first exchanges data v1 with the core. Then, the CPU sends data v2 to the core (a memory write if the core is a memory). If the core is a memory, v2 can be stored directly at an appropriate address for later analysis; otherwise, the CPU must execute additional instructions to retrieve v2 from the core and store it to memory.
Testing the address bus. To apply a test vector pair (v1, v2) to the address bus, which is a unidirectional bus from the CPU to a SoC core, the CPU requests data from the two addresses v1 and v2 in consecutive cycles. In the case of a nonmemory core, since the CPU addresses the core via memory-mapped I/O, v2 must be the address corresponding to the core. If v2 is distorted by cross-talk, the CPU will receive data from a wrong address, v2′, which may be a physical memory address or an address corresponding to a different core. By keeping different data at v2 and v2′ (i.e., mem[v2] ≠ mem[v2′]), the CPU is able to observe the error and store it in memory for analysis. Figure 28.10 illustrates this process. For example, when the CPU communicates with a memory core, to apply the test (0001, 1110) to the address bus, the CPU first reads data from address 0001 and then from address 1110. In a system with a faulty address bus, this second address may become 1111. If different data are stored at addresses 1110 and 1111 (mem[1110] = 0100, mem[1111] = 1001), the CPU receives a faulty value from memory (1001 instead of 0100). This error response can later be stored in memory for analysis. The feasibility of this method has been demonstrated by applying it to test the interconnects of a processor–memory system, with defect coverage evaluated using a system-level cross-talk-defect simulation method.
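In software, the address-bus test reduces to back-to-back reads and a comparison. The routine below is a minimal sketch under assumed conditions — a memory-mapped bus readable through volatile pointers, distinct marker words preloaded at v2 and at every address it could be distorted into, and instruction scheduling tight enough that the two reads occur in consecutive bus cycles.

    #include <stdint.h>

    static inline uint32_t bus_read(uintptr_t addr)
    {
        return *(volatile uint32_t *)addr;  /* uncached, memory-mapped read */
    }

    /* Apply one address-bus test pair (v1, v2): drive v1 onto the address
     * bus, then v2 in the next cycle, and compare the returned word with
     * the marker known to be stored at v2.  If cross-talk distorts v2 into
     * some v2', the read returns mem[v2'] instead, and since the markers
     * differ (mem[v2] != mem[v2']), the distortion becomes observable.   */
    int addr_bus_test(uintptr_t v1, uintptr_t v2, uint32_t marker_at_v2,
                      volatile uint32_t *response_log)
    {
        (void)bus_read(v1);            /* first address of the pair          */
        uint32_t got = bus_read(v2);   /* transition v1 -> v2 under test     */
        *response_log = got;           /* keep response for off-chip analysis */
        return got == marker_at_v2;    /* 1 = pass, 0 = cross-talk error     */
    }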
can never occur during normal system operation owing to constraints imposed by the system. Therefore, testing buses with MA tests might screen out chips that are functionally correct under every pattern that can arise during normal operation. Instead, functionally maximal aggressor (FMA) tests, which meet the system constraints and can be delivered in the functional mode, are proposed in [24]. These tests provide complete coverage of all cross-talk-induced logical and delay faults that can cause errors during the functional mode. Given the timing diagrams of all bus operations, the spatial and temporal constraints imposed on the buses can be extracted and FMA tests can be generated. A covering relationship between vectors extracted from the timing diagrams of the bus commands is used during the FMA test generation process. Since the resulting FMA tests are highly regular, they can be generated algorithmically and clustered into a few groups. The tests in each group are highly similar except that the victim lines differ. Therefore, as with the marching sequences commonly used for testing memory, the tests in each group can be synthesized by a software routine. The synthesized test program is highly modular and very small: experimental results have shown that a test program as small as 3000 to 5000 bytes can detect all cross-talk defects on the bus from the processor core to the target core.

The synthesized test program is applied to the bus from the processor core, and the input buffers of the destination core capture the responses at the other end of the bus. These responses must be read back by the processor core to determine whether any faults occurred on the bus. However, because the input buffers of a nonmemory core cannot be read by the processor core, a DfT scheme is suggested to allow direct observability of the input buffers by the processor core. The DfT circuitry consists of bypass logic added to each I/O core to improve its testability. With the DfT support on the target I/O core, the test generation procedure first synthesizes instructions to set the target core to the bypass mode, and then continues with synthesizing instructions for the FMA tests. The test generation procedure does not depend on the functionality of the target core.
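To make the address-bus procedure concrete, the following fragment sketches how a self-test routine running on the CPU might apply one vector pair and log the error response. It is a minimal sketch, not the exact routine of [22,23]: the addresses, the marker value, and the fail_log location are all hypothetical, and a production test would be written in assembly so that the two reads occupy consecutive bus cycles.

```c
#include <stdint.h>

/* Hypothetical addresses: the pair (V1, V2) creates the desired
 * transition pattern on the address lines under test. */
#define V1 ((volatile uint8_t *)0x00000010u)
#define V2 ((volatile uint8_t *)0x000000E0u)

static volatile uint8_t fail_log;   /* error response kept in memory for analysis */

void addr_bus_test_pair(void)
{
    const uint8_t marker = 0x04;    /* value preloaded at V2; every address that a
                                       corrupted V2 could alias to holds other data */
    (void)*V1;                      /* first read drives address V1 onto the bus     */
    uint8_t got = *V2;              /* back-to-back read creates the V1 -> V2 change */
    if (got != marker)              /* cross-talk redirected the read to some V2'    */
        fail_log = got;             /* store the faulty value for later diagnosis    */
}
```

The volatile qualifiers only keep the compiler from eliminating or reordering the two reads; the consecutive-cycle timing guarantee itself has to come from the instruction sequence.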
28.7 Self-Testing of Other Nonprogrammable IP Cores

Testing nonprogrammable cores on a SoC is a complex problem with many unresolved issues [26]. Industry initiatives such as the IEEE P1500 Working Group [27] provide some solutions for IP core testing; however, they do not address the requirements of at-speed testing. A self-testing approach for nonprogrammable cores on a SoC has been proposed in Reference 26. In this approach, a test program running on the embedded processor delivers test patterns to the other IP cores in the SoC at speed. The test patterns can be generated on the processor itself or fetched from an external ATE and stored in on-chip memory. This eliminates the need for dedicated test circuitry for pattern generation and response analysis. The approach is scalable to large IP cores whose structural netlists are available, and since pattern delivery takes place at the SoC operational speed, it supports delay testing. A test wrapper (shown in Figure 28.11) is placed around each core to support pattern delivery. It contains the test support logic needed to control shifting of the scan chain, buffers to store scan data, buffers to support at-speed test, and so on.

The test flow based on the embedded software self-testing methodology is illustrated in Figure 28.11. It offers tremendous flexibility in the type of tests that can be applied to the IP cores as well as in the quality of the test pattern set, without entailing significant hardware overhead. Again, the flow is divided into a preprocessing phase and a testing phase. In the preprocessing phase, a test wrapper is automatically inserted around the IP core under test and configured to meet the specific testing needs of that core. The IP core is then fault-simulated with different sets of patterns; weighted random patterns generated with multiple weight sets are used in [26], and in [5] multiple capture cycles are used after each scan sequence. Next, a high-level test program is generated. This program synchronizes the software pattern generation, the start of the test, the application of the test, and the analysis of the test response; it can also synchronize testing multiple cores in parallel. The test program is then compiled into processor-specific binary code.
[Figure: the self-test flow. In the preprocessing phase, a test wrapper generator wraps the IP core under test; weights are found for the PIs and PSIs, fault simulation evaluates the pattern sets, and a test code generator, driven by CUT-specific and processor-specific parameters, produces the binary test program. In the test phase, the embedded processor delivers the program over the bus to the wrapped IP core and collects the responses.]
FIGURE 28.11 The test flow for testing nonprogrammable IP cores. (From J.-R. Huang, M.K. Iyer, and K.-T. Cheng, in Proceedings of the IEEE VLSI Test Symposium (VTS), April 2001. With permission.)
In the test phase, the test program is run on the processor core to test the various IP cores. A test packet is sent to the IP core test wrapper, informing it of the test application scheme (single- or multiple-capture cycle). Data packets are then sent to load the scan buffers and the PI/PO buffers. The test wrapper applies the required number of scan shifts and captures the test response for the programmed number of functional cycles. The results of the test are stored in the PI/PO buffers and the scan buffers, from where they are read out by the processor core.
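This test-phase protocol maps naturally onto a small driver routine on the processor core. The sketch below assumes a hypothetical memory-mapped register layout for the wrapper; the WRAP_* addresses, the packet format, and the status bit are all invented for illustration, and the wrapper of Reference 26 defines its own interface.

```c
#include <stdint.h>

/* Hypothetical memory-mapped view of one IP-core test wrapper. */
#define WRAP_BASE   0xA0000000u
#define WRAP_CTRL   (*(volatile uint32_t *)(WRAP_BASE + 0x0))  /* scheme + start */
#define WRAP_STATUS (*(volatile uint32_t *)(WRAP_BASE + 0x4))  /* done flag      */
#define WRAP_SCAN   (*(volatile uint32_t *)(WRAP_BASE + 0x8))  /* scan buffer    */
#define WRAP_PIPO   (*(volatile uint32_t *)(WRAP_BASE + 0xC))  /* PI/PO buffer   */

#define SCHEME_MULTI_CAPTURE  (1u << 1)
#define START                 (1u << 0)
#define DONE                  (1u << 0)

/* Apply one scan test; 'response' must hold scan_words + 1 entries. */
void apply_scan_test(const uint32_t *scan_in, uint32_t scan_words,
                     uint32_t pi, uint32_t *response)
{
    WRAP_CTRL = SCHEME_MULTI_CAPTURE;            /* "test packet": select scheme   */
    for (uint32_t i = 0; i < scan_words; i++)    /* "data packets": fill scan FIFO */
        WRAP_SCAN = scan_in[i];
    WRAP_PIPO = pi;                              /* primary-input values           */
    WRAP_CTRL |= START;                          /* wrapper shifts and captures    */
    while (!(WRAP_STATUS & DONE))                /* at-speed test runs locally     */
        ;
    for (uint32_t i = 0; i < scan_words; i++)    /* read captured response back    */
        response[i] = WRAP_SCAN;
    response[scan_words] = WRAP_PIPO;            /* plus primary-output values     */
}
```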
28.8 Instruction-Level DfT/Test Instructions

Several potential benefits can accrue from self-testing manufacturing defects in a SoC by running test programs on a programmable core. These include at-speed testing, low DfT overhead (owing to the elimination of dedicated test circuitry), and better power and thermal management during testing. However, such a self-test strategy might require a lengthy test program and still might not achieve sufficiently high fault coverage. These problems can be alleviated by applying a DfT methodology based on adding test instructions to an on-chip programmable core such as a microprocessor core. This methodology is called instruction-level DfT.

Instruction-level DfT inserts test circuitry in the form of test instructions. It is a less intrusive approach than gate-level DfT techniques, which attempt to create a separate test mode somewhat orthogonal to the functional mode. If the test instructions are carefully designed such that their microinstructions reuse the datapath of the functional instructions and do not require any new datapath, then the overhead, which occurs only in the controller, should be relatively low. This methodology is also more attractive for applying at-speed tests and for power/thermal management during test than the existing logic BIST approaches.

Instruction-level DfT methods have been proposed in References 28 and 29. The approach in Reference 28 adds instructions to control exceptions such as microprocessor interrupts and reset. With the new instructions, the test program can achieve a fault coverage close to 90% for stuck-at faults. However, this approach cannot achieve a higher coverage because the test program is synthesized using a random approach and cannot effectively control or observe some internal registers that have low testability. The DfT methodology proposed in Reference 29 systematically adds test instructions to an on-chip processor core to improve the self-testability of the processor core, reduce the size of the self-test program, and reduce its runtime (i.e., reduce the test application time). To decide which instructions to add, the testability of the processor is analyzed first. If a register in the processor is identified as hard to access, a test instruction allowing direct access to the register is added. The testability of a register can be determined
based on the availability of data movement instructions between registers and memory. A register is said to be fully controllable if there exists a sequence of data movement instructions that can move the desired data from memory to the register. Similarly, a register is said to be fully observable if there exists a sequence of data movement instructions that can propagate the register data to memory. Given the micro-architecture of a processor core, it is possible to identify the fully controllable and fully observable registers. For registers that are not fully controllable/observable, new instructions can be added to improve their accessibility.

In addition, test instructions can also be added to optimize the test program size and runtime. This is based on the observation that some code segments (called hot segments) appear repeatedly in the synthesized self-test program, so the addition of a few test instructions can reduce the size of the hot segments. Test instructions can also be added to speed up the preparation of test vectors by the processor core, the retrieval of responses from the on-chip core under test, and the analysis of those responses (by the processor core). When adding new instructions, the existing hardware should be reused as much as possible to reduce the area overhead; adding extra buses or registers to implement new instructions is generally unnecessary and should be avoided. In most cases, a new instruction can be added by introducing new control signals to the datapath rather than by adding hardware.

Adding test instructions to the programmable core does not improve the testability of the other, nonprogrammable cores on the SoC; instruction-level DfT therefore cannot increase the fault coverage of the nonprogrammable cores. However, the programs for testing the nonprogrammable cores can be optimized with the new instructions. In other words, the same set of test instructions added for self-testing the programmable core can be used to reduce the size and runtime of the test programs for the other cores. For pipelined designs, instructions can be added to manage the difficult-to-control registers buried deep in the pipeline. Experimental results on two processors (Parwan and DLX) show that test instructions can reduce the program size and runtime by about 20% at the cost of about a 1.6% increase in area overhead.
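The controllability/observability criterion above can be phrased as simple reachability over a graph whose nodes are storage locations (memory plus architectural registers) and whose edges are the data-movement instructions. The sketch below is one possible formulation of that check, not the actual analysis of Reference 29; the node count and the move table are hypothetical.

```c
#include <stdbool.h>

#define NODES 16          /* node 0 = memory; 1..15 = hypothetical registers */
#define MEM   0

/* moves[s][d] is true if some data-movement instruction copies s -> d. */
static bool moves[NODES][NODES];

/* Mark every node reachable from 'src' by following move edges. */
static void reach(int src, bool seen[NODES])
{
    seen[src] = true;
    for (int d = 0; d < NODES; d++)
        if (moves[src][d] && !seen[d])
            reach(d, seen);
}

/* Fully controllable: a move sequence exists from memory to the register. */
bool fully_controllable(int reg)
{
    bool seen[NODES] = { false };
    reach(MEM, seen);
    return seen[reg];
}

/* Fully observable: a move sequence exists from the register to memory.
 * Registers failing either check are candidates for new test instructions. */
bool fully_observable(int reg)
{
    bool seen[NODES] = { false };
    reach(reg, seen);
    return seen[MEM];
}
```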
28.9 Self-Test of On-Chip ADC/DAC and Analog Components Using DSP-Based Approaches

For mixed-signal systems that integrate both analog and digital functional blocks onto the same chip, testing of the analog/mixed-signal parts has become the bottleneck during production testing. Because most analog/mixed-signal circuits are functionally tested, their testing requires expensive ATE for analog stimulus generation and response acquisition. One promising solution to this problem is BIST that utilizes on-chip resources (either shared with functional blocks or dedicated BIST circuitry) to perform on-chip stimulus generation and response acquisition. Under the BIST approach, the demands on the external test equipment are less stringent. Furthermore, stimulus generation and response acquisition are less vulnerable to environmental noise during the test process.

With advances in CMOS technology, DSP-based BIST has become a viable solution for analog/mixed-signal systems, as the signal processing required to make the pass/fail decision can be realized in the digital domain with digital resources. In DSP-based BIST schemes [30,31], on-chip DA and AD converters are used for stimulus generation and response acquisition, and DSP resources (such as CPU or DSP cores) are used for the required signal synthesis and response analysis. The DSP-based BIST scheme is attractive because of its flexibility: various tests, such as AC, DC, and transient tests, can be performed by modifying the software routines without altering the hardware. However, on-chip AD and DA converters are not always available in mixed-signal SoC devices. In Reference 32, the authors propose using a one-bit first-order delta–sigma modulator as dedicated BIST circuitry for on-chip response acquisition when an on-chip AD converter is not available. Owing to its over-sampling nature, the delta–sigma modulator can tolerate relatively large process variations and matching inaccuracies without functional failure, and is therefore particularly suitable for VLSI implementation. This solution is suitable for low-to-medium frequency applications (for example, audio signals).
[Figure: the delta–sigma modulation-based BIST setup. The ATE supplies the test stimuli and specifications to the SoC and receives the pass/fail result. On chip, a software ∆Σ modulator running on a programmable core with memory drives a low-resolution DAC and low-pass filter to stimulate the analog CUT; a one-bit ∆Σ modulator digitizes the response, which the programmable core analyzes.]
FIGURE 28.12 DSP-based self-test for analog/mixed-signal parts. (From J.L. Huang and K.T. Cheng, in Proceedings of the Asia and South Pacific Design Automation Conference, January 2000. With permission.)
Figure 28.12 illustrates the overall delta–sigma modulation-based BIST architecture. It employs the delta–sigma modulation technique for both stimulus generation [33] and response analysis [32]. A software delta–sigma modulator converts the desired signal into a one-bit digital stream. The digital 1's and 0's are then translated into two discrete analog levels by a one-bit DAC followed by a low-pass filter that removes the out-of-band high-frequency modulation noise, thus restoring the original waveform. In practice, we extract a segment from the delta–sigma output bit stream that contains an integer number of signal periods. The extracted pattern is stored in on-chip memory and then periodically applied to the low-resolution DAC and low-pass filter to generate the desired stimulus. Similarly, for response analysis, a one-bit ∆Σ modulator can be inserted to convert the analog DUT output response into a one-bit stream, which is then analyzed by DSP operations performed on on-chip DSP/microprocessor cores.

Among the one-bit ∆Σ modulator architectures, the first-order configuration is the most stable and has the maximal input dynamic range. However, it is not quite practical for high-resolution applications (a rather high over-sampling rate would be needed), and it suffers from intermodulation distortion (IMD). Compared with the first-order configuration, the second-order configuration has a smaller dynamic range but is more suitable for high-resolution applications. Note that the software part of this technique, that is, the software ∆Σ modulator and the response analyzer, can be executed by on-chip DSP/microprocessor cores if abundant on-chip digital programmable resources are available (as indicated in Figure 28.12), or by external digital test equipment.
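The software delta–sigma modulator is straightforward to realize on the programmable core. The following sketch shows one possible first-order, one-bit modulator producing the stimulus bit stream for a single test tone; the sampling rate, tone frequency, amplitude, and buffer length are placeholders, and a real implementation would extract a segment spanning an integer number of signal periods, as described above.

```c
#include <math.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

#define FS   1000000.0   /* hypothetical over-sampling rate (Hz)          */
#define F0   1000.0      /* test-tone frequency; FS/F0 samples per period */
#define N    8192        /* bit-stream length stored in on-chip memory    */

static unsigned char bitstream[N];

/* First-order one-bit delta-sigma modulation of a sine stimulus:
 * integrate (input minus fed-back output), then quantize the sign. */
void synthesize_stimulus(void)
{
    double acc = 0.0;        /* integrator state              */
    double fb  = -1.0;       /* previous one-bit output, +/-1 */
    for (int n = 0; n < N; n++) {
        double x = 0.5 * sin(2.0 * M_PI * F0 * n / FS);  /* keep |x| < 1 */
        acc += x - fb;                       /* delta (x - fb), then sigma  */
        fb = (acc >= 0.0) ? 1.0 : -1.0;      /* one-bit quantizer           */
        bitstream[n] = (fb > 0.0);           /* 1/0 stream for the DAC      */
    }
}
```

Periodically replaying the stored segment through the one-bit DAC and low-pass filter then reconstructs a continuous sine stimulus, with the quantization noise pushed out of band by the modulation.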
28.10 Conclusions

Embedded software-based self-testing has the potential to alleviate problems with many of the current external tester-based and hardware BIST techniques for SoCs. In this chapter, we have summarized recently proposed techniques on this subject. One of the main tasks in applying these techniques is extracting the functional constraints during test program synthesis, that is, deriving tests that can be delivered by processor instructions. Future research in this area must address the problem of automating the constraint extraction process in order to make the proposed solutions fully automatic for general embedded processors. The software-based self-testing paradigm can be further generalized to analog/mixed-signal components through the integration of DSP-based testing techniques, ∆Σ modulation principles, and some low-cost analog/mixed-signal DfT.
Acknowledgments

The authors wish to thank L. Chen and T.M. Mak of Intel, Angela Krstic of Cadence, Sujit Dey of UC San Diego, Larry Lai of Novas, and Li-C. Wang and Charles Wen of UC Santa Barbara for their efforts and contributions to this chapter.
References

[1] Semiconductor Industry Association, The National Technology Roadmap for Semiconductors, 1997.
[2] Semiconductor Industry Association, The International Technology Roadmap for Semiconductors, 2003.
[3] C.-J. Lin, Y. Zorian, and S. Bhawmik, Integration of Partial Scan and Built-In Self-Test, Journal of Electronic Testing: Theory and Applications (JETTA), 7(1–2): 125–137, August 1995.
[4] K.-T. Cheng and C.-J. Lin, Timing-Driven Test Point Insertion for Full-Scan and Partial-Scan BIST, in Proceedings of the IEEE International Test Conference (ITC), Washington, D.C., October 1995.
[5] H.-C. Tsai, S. Bhawmik, and K.-T. Cheng, An Almost Full-Scan BIST Solution — Higher Fault Coverage and Shorter Test Application Time, in Proceedings of the IEEE International Test Conference (ITC), Washington, D.C., October 1998.
[6] L. Chen and S. Dey, Software-Based Self-Testing Methodology for Processor Cores, IEEE Transactions on Computer-Aided Design (TCAD), 20(3): 369–380, March 2001.
[7] W.-C. Lai, A. Krstic, and K.-T. Cheng, On Testing the Path Delay Faults of a Microprocessor Using Its Instruction Set, in Proceedings of the IEEE VLSI Test Symposium (VTS), Montreal, Canada, April 2000.
[8] W.-C. Lai, A. Krstic, and K.-T. Cheng, Test Program Synthesis for Path Delay Faults in Microprocessor Cores, in Proceedings of the IEEE International Test Conference (ITC), Washington, D.C., October 2000.
[9] L. Chen, S. Ravi, A. Raghunathan, and S. Dey, A Scalable Software-Based Self-Test Methodology for Programmable Processors, in Proceedings of the ACM/IEEE Design Automation Conference (DAC), Anaheim, CA, June 2003.
[10] R.S. Tupuri and J.A. Abraham, A Novel Functional Test Generation Method for Processors Using Commercial ATPG, in Proceedings of the IEEE International Test Conference (ITC), Washington, D.C., September 1997.
[11] S. Hellebrand and H.-J. Wunderlich, Mixed-Mode BIST Using Embedded Processors, in Proceedings of the IEEE International Test Conference (ITC), Washington, D.C., October 1996.
[12] R. Dorsch and H.-J. Wunderlich, Accumulator Based Deterministic BIST, in Proceedings of the IEEE International Test Conference (ITC), Washington, D.C., October 1998.
[13] Z. Navabi, VHDL: Analysis and Modeling of Digital Systems, McGraw-Hill, New York, 1997.
[14] X. Bai, L. Chen, and S. Dey, Software-Based Self-Test Methodology for Crosstalk Faults in Processors, in Proceedings of the IEEE High-Level Design Validation and Test Workshop, San Francisco, CA, November 2003, pp. 11–16.
[15] K.-T. Cheng and H.-C. Chen, Classification and Identification of Nonrobustly Untestable Path Delay Faults, IEEE Transactions on Computer-Aided Design (TCAD), 15(8): 845–853, August 1996.
[16] A. Krstic, S.T. Chakradhar, and K.-T. Cheng, Testable Path Delay Fault Cover for Sequential Circuits, in Proceedings of the European Design Automation Conference, Geneva, Switzerland, September 1996.
[17] M. Gumm, VHDL — Modeling and Synthesis of the DLXS RISC Processor, VLSI Design Course Notes, University of Stuttgart, Germany, December 1995.
[18] T. Grüning, U. Mahlstedt, and H. Koopmeiners, DIATEST: A Fast Diagnostic Test Pattern Generator for Combinational Circuits, in Proceedings of the IEEE International Conference on Computer-Aided Design (ICCAD), Santa Clara, CA, November 1991.
[19] X. Yu, J. Wu, and E.M. Rudnick, Diagnostic Test Generation for Sequential Circuits, in Proceedings of the IEEE International Test Conference (ITC), Washington, D.C., October 2000.
[20] I. Pomeranz and S.M. Reddy, A Diagnostic Test Generation Procedure Based on Test Elimination by Vector Omission for Synchronous Sequential Circuits, IEEE Transactions on Computer-Aided Design (TCAD), 19(5): 589–600, May 2000.
[21] L. Chen and S. Dey, Software-Based Diagnosis for Processors, in Proceedings of the ACM/IEEE Design Automation Conference (DAC), New Orleans, LA, June 2002.
[22] X. Bai, S. Dey, and J. Rajski, Self-Test Methodology for At-Speed Test of Crosstalk in Chip Interconnects, in Proceedings of the ACM/IEEE Design Automation Conference (DAC), Los Angeles, CA, June 2000.
[23] L. Chen, X. Bai, and S. Dey, Testing for Interconnect Crosstalk Defects Using On-Chip Embedded Processor Cores, in Proceedings of the ACM/IEEE Design Automation Conference (DAC), Las Vegas, NV, June 2001.
[24] W.-C. Lai, J.-R. Huang, and K.-T. Cheng, Embedded-Software-Based Approach to Testing Crosstalk-Induced Faults at On-Chip Buses, in Proceedings of the IEEE VLSI Test Symposium (VTS), Marina Del Rey, CA, April 2001.
[25] M. Cuviello, S. Dey, X. Bai, and Y. Zhao, Fault Modeling and Simulation for Crosstalk in System-on-Chip Interconnects, in Proceedings of the IEEE International Conference on Computer-Aided Design (ICCAD), San Jose, CA, November 1999.
[26] J.-R. Huang, M.K. Iyer, and K.-T. Cheng, A Self-Test Methodology for IP Cores in Bus-Based Programmable SoCs, in Proceedings of the IEEE VLSI Test Symposium (VTS), Marina Del Rey, CA, April 2001.
[27] IEEE P1500 Web Site, http://grouper.ieee.org/groups/1500/
[28] J. Shen and J.A. Abraham, Native Mode Functional Test Generation for Processors with Applications to Self Test and Design Validation, in Proceedings of the IEEE International Test Conference (ITC), Washington, D.C., October 1998.
[29] W.-C. Lai and K.-T. Cheng, Instruction-Level DFT for Testing Processor and IP Cores in System-on-a-Chip, in Proceedings of the ACM/IEEE Design Automation Conference (DAC), Las Vegas, NV, June 2001.
[30] M.F. Toner and G.W. Roberts, A BIST Scheme for a SNR, Gain Tracking, and Frequency Response Test of a Sigma–Delta ADC, IEEE Transactions on Circuits and Systems-II, 42: 1–15, January 1995.
[31] C.Y. Pan and K.T. Cheng, Pseudo-Random Testing and Signature Analysis for Mixed-Signal Circuits, in Proceedings of the International Conference on CAD (ICCAD), San Jose, CA, November 1995, pp. 102–107.
[32] J.L. Huang and K.T. Cheng, A Sigma–Delta Modulation Based BIST Scheme for Mixed-Signal Circuits, in Proceedings of the Asia and South Pacific Design Automation Conference, Yokohama, Japan, January 2000.
[33] B. Dufort and G.W. Roberts, Signal Generation Using Periodic Single and Multi-Bit Sigma–Delta Modulated Streams, in Proceedings of the IEEE International Test Conference (ITC), Washington, D.C., October 1997.
IV
Networked Embedded Systems

29 Design Issues for Networked Embedded Systems
Sumit Gupta, Hiren D. Patel, Sandeep K. Shukla, and Rajesh Gupta

30 Middleware Design and Implementation for Networked Embedded Systems
Venkita Subramonian and Christopher Gill
29
Design Issues for Networked Embedded Systems

Sumit Gupta, Tallwood Venture Capital
Hiren D. Patel and Sandeep K. Shukla, Virginia Tech
Rajesh Gupta, University of California

29.1 Introduction
29.2 Characteristics of NES
Functionality and Constraints • Distributed Nature • Usability, Dependability, and Availability
29.3 Examples of NES
Automobile: Safety-Critical Versus Telematics • Data Acquisition: Precision Agriculture and Habitat Monitoring • Defense Applications: Battle-Space Surveillance • Biomedical Applications • Disaster Management
29.4 Design Considerations for NES
29.5 System Engineering and Engineering Trade-Offs in NES
Hardware • Software
29.6 Design Methodologies and Tools
29.7 Conclusions
References
29.1 Introduction

Rapid advances in microelectronic technology, coupled with the integration of microelectronic radios on the same board or even on the same chip, have driven the proliferation of a new breed of Networked Embedded Systems (NESs) over the last decade. NES are distributed computing devices with wireline and/or wireless communication interfaces embedded in a myriad of products such as automobiles, medical components, sensor networks, consumer products, and personal mobile devices. These systems have been variously referred to as EmNets (Embedded Network Systems), NEST (Networked Embedded System Technology), and NES [1–3].

NES are often distributed embedded systems that must interact not only with the environment and the user, but also with each other to coordinate computing and communication. And yet, these devices must often operate in very constrained environments related to their size, energy availability, network connectivity, etc. The challenges posed by the design and deployment of NES have captured the imagination of a large number of researchers and galvanized whole new communities into action. The design of
NES requires multidisciplinary, multilevel cooperation and development to address the diverse hardware (processor cores, radios, security cores) and software (applications, middleware, operating systems, networking protocols) needs. In this chapter, we briefly highlight some of the design concerns and challenges in deploying NES. Some examples of NES are wireless data acquisition systems for habitat monitoring [4–7], agriculture and weather monitoring [8], disaster management and civil monitoring, the Cooperative Engagement Capability (CEC) [9] for military use [10], e-textiles [11,12], and consumer products such as cell phones and Personal Digital Assistants (PDAs). Common to all these systems/applications is their ability to mediate interaction between the environment and humans through devices such as sensors for data collection, processors for data computation, and remote storage devices to preserve and collate the information. A good exposition of the characteristics, parameters, examples, and design challenges of NES is presented in Reference 1; we draw heavily on this book for material and examples. Similar surveys and expositions of challenges in the applications, design, and implementation of sensor networks are given in References 13–17.

The rest of this chapter is organized as follows: in Section 29.2, we describe the characteristics of NES, followed by some examples of such systems in Section 29.3. Based on these examples and characteristics, we delve into the design considerations for NES in Section 29.4. In Section 29.5, we explore the engineering trade-offs in designing and deploying NES. Finally, we discuss the design methodologies and tools available for designing NES in Section 29.6 and conclude the chapter with a discussion.
29.2 Characteristics of NES

The realm of possibilities where NES applications can be implemented makes characterizing these systems an inherently difficult task. Nevertheless, we attempt to characterize the basic functionality and constraints, the distributed nature, and the usability, dependability, and availability of such systems, and then describe NES through some examples.
29.2.1 Functionality and Constraints

Networked embedded systems are typically designed to interact with and react to the environment and the people around them. Thus, NES often have sensors that measure temperature, moisture, movement, light, and so on. By definition, NES have a communication mechanism: either a wireline connection or a wireless radio. They also typically have computation engines that can do at least a minimal amount of computing on the data they acquire.

The environment and user needs place constraints on NES such as small size, low weight, harsh working conditions, safety and reliability concerns, low cost, and poor resource availability in terms of low computational ability and limited energy (battery) [18]. NES devices have to be small so that their deployment does not interfere with the environment; that is, they must function almost invisibly within it. For example, animals must not be aware of the habitat-monitoring sensors that are embedded on or around them. This example also demonstrates the need for these systems to be lightweight and able to work under harsh conditions, that is, to tolerate temperature changes, physical abuse, vibration, shock, and corrosion. Since NES are frequently deployed in the field with little or no access to renewable energy sources, they have to live off a limited energy source or battery. Owing to real-time and mission-critical requirements, NES frequently have to meet safety and reliability constraints. For example, the cruise control, antilock braking, and airbag systems in automobiles have to respond within given real-time constraints to meet safety requirements. The small form factor and wide distribution of NES also place a constraint on cost; price fluctuations of even a few cents per device have a big impact as the volume of deployed devices increases.
FIGURE 29.1 U.C. Berkeley NES device — the MICA processor and radio board. (From M. Horton, D. Culler, K. Pister, J. Hill, R. Szewczyk, and A. Woo, Sensors Magazine, 19, 2002. With permission.)
Figure 29.1 shows the Berkeley NES device, which consists of a MICA processor and a radio board [19]. As technology advances, these devices are becoming smaller; in fact, the latest Berkeley "mote" is as small as a coin. This has led to the notion of smart dust: a massively distributed sensor network whose nodes are self-contained, networked, and provide multiple sensing and coordinated computational capabilities [7,20–22].
29.2.2 Distributed Nature

The application spectrum of NES often means that these systems are physically distributed. In fact, the distributed nature of NES extends to distributed functionality and communication as well. Distributed functionality refers to NES components that perform specific roles and work together with other NES components to complete the system. Automotive electronics is a good example, where many function-specific components work in unison: the power control modules, the engine, airbag deployment, cruise control, suspension, etc. Figure 29.2 shows components from Kyocera that are widely used in automotive electronics. Similarly, distributed communication refers to local and global communication between the embedded systems distributed throughout the system. For example, automotive systems have local wires from actuators/sensors to ECUs (Electronic Control Units) and global wires/buses between ECUs [24].
29.2.3 Usability, Dependability, and Availability

Networked embedded systems are becoming an increasingly dominant part of the devices and systems in all aspects of our daily lives, from entertainment, transportation, and personal communications to biomedical devices. The pervasiveness of these systems, however, raises concerns about dependability and availability.

Availability generally means access to the system. In scenarios where some components of the NES fail, there must be a mechanism through which users can interface and interact with the components to investigate and rectify the problems. Access can be provided through PDAs, wired serial access points, infrared technology, etc. Another dimension of availability is the long life expected from NES components. Often, NES do not have access to a renewable energy source, and sometimes it is not possible to change the battery either. For example, sensors deployed to measure traffic on roads or sensor tags placed on animals are inaccessible or difficult to reach after deployment.

Dependability, or reliability, is a major concern that goes hand in hand with the availability requirement. The system must guarantee a certain level of service that the user can depend on. For example, temperature sensors and smoke detectors are critical to the fire safety requirements of any building. Availability and dependability are especially crucial in safety-critical systems such as avionics and biomedical applications. The sensors used in an airplane to monitor cabin pressure, oxygen
[Figure panels (a)–(c): photographs of automotive electronic components, with labels for integrated terminals (lead-offs) and a cross section of a stepped active metal bond.]
FIGURE 29.2 Automotive electronic components by Kyocera. (From Kyocera Website http://global.kyocera.com/ application/automotive/auto_elec/. With permission.)
levels, elevation, relative speed, etc. are all important to maintaining safety. For example, the release of oxygen into the oxygen masks in airplanes is controlled via sensors that monitor the cabin oxygen levels. For biomedical applications such as electronic heart pacemakers, the need for dependability is obvious, since malfunctioning components can be life-threatening. Military personnel monitoring is another biomedical application of NES, where devices are used to transmit the location and vital statistics of personnel.

Since NES consist of diverse hardware and software components that interact with each other, component interoperability becomes another important concern [1]. Today, cars have tens or sometimes even a hundred embedded computing systems that are designed and manufactured by different contractors. Such complex, distributed, interacting embedded systems raise difficult challenges in system integration, component interoperability, and system testing and validation.

These NES characteristics present new challenges and constraints for systems engineers that have not been fully addressed in the design of past networking and distributed systems. Software and hardware tools and techniques are required to satisfy the need for low cost, low power, Quality-of-Service (QoS) guarantees, and fast time-to-market. Formalizing methodologies that ensure functional correctness and efficiency of design is paramount to meeting time-to-market requirements by reducing iterations in the design process. In the following sections, we discuss some interesting applications of NES and then describe NES tools and methodologies, such as programming languages, simulation environments, and performance measurement tools, designed specifically to address the design challenges posed by NES.
29.3 Examples of NES

To demonstrate the characteristics, constraints, and design challenges of NES, we present several examples from current and future NES. These examples are representative of the diverse application domains where NES can be found. We start with three examples from Reference 1.
29.3.1 Automobile: Safety-Critical Versus Telematics

Cars today typically have tens to a hundred microprocessors controlling all aspects of the automobile, from entertainment systems to the emergency airbag release mechanisms. Figure 29.3 shows some of
[Figure: telematics-related technologies in an automobile, including infrared rays, communication, radar technology, power electronics, control technology, optical devices, fuel cells, semiconductors, superconduction, computer and telephone, and monitoring systems.]
FIGURE 29.3 Telematics components in an automobile. (From Mitsubishi Electric, http://www.mitsubishielectric.ca/automotive/. With permission.)
the telematics components in a Mitsubishi car [25]. The microprocessors in charge of this functionality frequently communicate and interact with other processors. For example, the stereo volume is automatically reduced when the driver receives (or answers) a call on his or her cell phone. Thus, a range of devices that perform different tasks are beginning to be organized into sophisticated networks as distributed systems. Broadly speaking, there are two such distributed systems: safety-critical processing systems and telematics systems. Clearly, the safety-critical aspects cannot be sacrificed or compromised in any way. These two systems are an integral part of the design and construction of the automobile and dictate several design parameters.

Since automobile design cycles can span up to five years from concept to final product, the technology used for the telematics and safety-critical components is frequently already outdated by the time the car ships. This is a rising concern, especially for the safety-critical components, because upgrading or switching out components is generally not performed and usually not even feasible. Note that systems that cannot be upgraded or altered after final production are considered closed systems. Conversely, open systems allow plugging in newer components with more capabilities and features, similar to a plug-and-play environment. Thus, to make automobiles open systems, we have to develop technologies that enable automobile designers to construct the safety-critical and telematics systems in an abstracted manner, such that components with a standardized communication protocol and interface can simply be plugged into the final product. This resolves the disparity between the long design cycles of automobiles and the rapid advances in the NES components used in them.

The increasing popularity of wireless technology has spawned interesting applications for automobiles. For example, the OnStar system from General Motors can monitor the location of a car, and customer service staff can remotely unlock the car, detect when airbags have been deployed, and so on. Wireless communication opens up many possibilities, such as automobile service requests and data collection for automobile users, dealers, and manufacturers.
29.3.2 Data Acquisition: Precision Agriculture and Habitat Monitoring

The use of sensor nodes for data acquisition is becoming a useful tool for agricultural and habitat monitoring. In Reference 7, the authors present a study in which wireless sensor networks were used in a real-world habitat monitoring project. The small footprint and weight of modern sensor nodes make them attractive for habitat and agricultural monitoring, since they cause minimal disturbance to the animals, plant population, and other natural elements of the habitats being monitored. This solution also automates some of the menial tasks, such as data collection, for researchers.

Precision agriculture is an important area of research where NES technology is likely to have a big impact [1,26,27]. Precision agriculture envisages sensor deployment to monitor and manage crop productivity, quality, and growth. Besides increasing productivity, better crop quality and crop management are also key aspects of using NES in precision agriculture. Crop management involves monitoring and adjusting the levels of fertilizer, pesticide, and water for particular areas, resulting in better yields with less pollution, fewer emissions, and lower costs. The automation of these functions requires adapting to changing surroundings, such as adjusting water levels when it rains or pesticide levels in seasons when pest problems are more common. This adaptation is an integral aspect of precision agriculture. While models exist that dictate the necessary amounts of fertilizer, water, and nutrients, these models are not always accurate for a specific locale. NES can therefore also perform side data-acquisition functions for reconstructing appropriate models and recalibrating or reconfiguring the sensor metrics accordingly to better suit the specific climate and locale (Figure 29.4). Feedback into such systems is crucial to developing truly automated precision agriculture. Fine-grained tuning of crop management can be done automatically based on these regularly updated models, which can also be monitored by researchers. However, manual adjustments would require appropriate interfaces between the deployed NES and the end-user attempting to make the change. Once again, wireless interfaces can be used for such manual fine-tuning. Configuration and management
[Figure: a tiered data-acquisition architecture. Sensor nodes in a sensor patch form a patch network, connected through a gateway and a transit network to a base station, which serves client data browsing and processing over the Internet via a data service.]
FIGURE 29.4 System architecture of NES for common data acquisition scenarios. (From D. Culler et al., Wireless Sensor Networks for Habitat Monitoring, ACM, 2002. With permission.)
of the network can be handled remotely via handheld devices or even desktops. A practical example of such a deployment is described in Reference 27, and wireless sensor networks for precision agriculture are discussed in Reference 26. A similar two-tiered network that couples wireless and wired networks together has been proposed for structural monitoring of civil structures that can be affected by natural disasters such as earthquakes [28]. The small size and wireless nature of NES sensor nodes enable field researchers to deploy these sensors in small and sensitive locations.
29.3.3 Defense Applications: Battle-Space Surveillance

Networked embedded systems are projected to become crucial for future defense applications, particularly battle-space surveillance and condition and location monitoring of vehicles, equipment, and personnel. A military application called Cooperative Engagement Capability (CEC), developed by Raytheon Systems [9,29], acts as a force multiplier for naval air and missile defense systems by distributing sensor and weapons data to multiple CEC ships and airborne units. Data from each unit is distributed to the other CEC
units, after which the data is filtered and combined via an identical algorithm in each unit to construct a common radar-like aerial picture for missile engagements.

DARPA has funded research in several areas of defense systems under the aegis of its Future Combat Systems program [30]. Manipulating battle environments and critical threat identification are projected uses of NES in such systems. Manipulating battle environments refers to controlling the opposition by detecting their presence and either altering their route or constraining their advance. Threat identification involves identifying a threat early for force protection. A force-protection scenario involves deploying sensors around a perimeter that requires protection, so that forced entry can be identified and automated responses such as alarms can be triggered by a given event. System deployment used to be a concern because sensors were bulky and required manual deployment; with advances in technology, however, entire sensor networks can now be deployed by airdrop, by personnel, or even via artillery. Small node sizes have also enabled NES to be deployed for monitoring vehicles, in a way similar to the automobile example discussed earlier.

A relatively new technology called e-textiles has emerged, whereby sensors or other computation devices are integrated into wearable material [11]. Nakad et al. [11] are investigating the communication requirements between the sensing nodes of an e-textile and the computing elements embedded with them. One key and obvious application of e-textiles is data acquisition for human monitoring, where sensor nodes can be used to track the location and vital statistics of military personnel.
29.3.4 Biomedical Applications

A "civilian" application of e-textiles is monitoring people's health, particularly that of the elderly. A sensor node embedded in an e-textile worn by patients with heart problems can automatically alert doctors or emergency services when the patient suffers heart failure. We have already seen the value of heart pacemakers in helping millions of people around the world maintain a regular heartbeat. Work is in progress to make sensors small and body-friendly enough that they can either be surgically inserted or swallowed for temporary monitoring. These devices can be used to monitor, diagnose, and even correct anomalies in the health of a patient. Of course, surgical insertion or ingestion of microelectronic devices raises several concerns about safety and the body's ability to tolerate foreign objects, which are active areas of research.
29.3.5 Disaster Management

Disaster management scenarios can be seen as data acquisition applications in which certain information is gathered, based on which a response is computed and performed. A good example of an implemented scenario is provided in Reference 31, where remote villages are monitored by four sensors measuring seismic activity and water levels for earthquakes and floods, respectively. Through wireless links, these sensors are connected to the nearest emergency rescue stations, signaling emergency events when the thresholds for maximum water level or seismic activity are crossed. As mentioned earlier, Kottapalli et al. [28] propose a sensor network for structural monitoring of civil structures in case of natural disasters such as earthquakes. Other applications of disaster management systems include severe cold (or heat) monitoring, fire monitoring (smoke detectors, heat sensors), volcano monitoring, etc.
29.4 Design Considerations for NES

The examples presented in Section 29.3 give an idea of the breadth of the application domains in which NES can be deployed. By studying these examples, we can understand the various requirements, constraints, issues, and concerns involved in developing these kinds of systems. Furthermore, as NES proliferate, the true potential of these systems will be realized when they are deployed at a massive scale, on the order of thousands of components or more. Such a large-scale deployment, however, raises
some problems [13,15–17]:

Deployment. Deployment refers to the physical distribution of the nodes in the NES. The first concerns for deployment are safety, durability, and sturdiness: if devices are dropped from the air, they should not damage other objects (people, animals, plants, or material) while landing, and they should not be damaged either. This is clearly important for defense applications, where surveillance sensors may be airdropped into the battle space. The available deployment strategies can be classified as either random or strategic. As the name suggests, random deployment refers to deploying NES nodes in an arbitrary fashion in the field; it is useful when the region being monitored is not accessible for precise placement of sensors [32]. The problem then becomes one of determining region coverage and possibly redeploying or moving nodes to improve coverage. Strategic deployment refers to placing NES nodes at well-planned points so that coverage is maximized, or to placing nodes in a small field of concentration such that they are not easily subject to natural damage (e.g., habitat monitoring). The number of NES nodes deployed must be considered in the trade-off between cost and the performance/quality of monitoring, because some nodes are generally bound to be destroyed by some means; there should be sufficient reserves or fault tolerance in the network to continue the monitoring.

Environment interaction. NES components often need to interact with the environment without human intervention. Thus, a requirement of NES is the ability to work on their own, perhaps with a feedback loop, so that nodes can adapt to changes in the environment (failure of nodes, movement of objects) and continue functioning correctly. Systems such as those used in precision agriculture and chemical and hazardous gas monitoring are designed to interact with and react to changes in the system. In agriculture, for example, the release of water can be tied to the moisture content in the air; a minimal sense-and-actuate loop of this kind is sketched after this list.

Life expectancy of nodes. As discussed earlier, an essential requirement for nodes in a NES is a long life expectancy, because once deployed, it is very difficult to access the nodes and refurbish their batteries. These nodes must also withstand environmental challenges such as inclement weather, and the network must tolerate the unexpected loss of nodes to animal interaction or component failure. Thus, a whole body of work has gone into identifying node failure and subsequently reconfiguring the network to provide some amount of fault tolerance [33].

Communication protocol between devices. A combination of wired and wireless links can be used to establish a NES. Furthermore, the nodes in the network may be stationary or mobile. Mobile nodes bring in a whole range of issues related to dynamic route and neighbor discovery, dynamic routing, etc. The NES should also be able to reconfigure and adjust to tolerate the loss of nodes from a communication point of view; that is, if a node that serves as a relay point fails or dies, the network should be able to use other nodes for relaying instead.

Reconfigurability. In many scenarios, it is not possible to physically reach nodes. However, NES frequently require nodes to be reconfigured after deployment: to add, remove, or change functionality, or to adjust parameters of the functionality. For example, handheld devices or even desktops may be used to reconfigure nodes to fine-tune certain aspects of the system.
For example, the water level can be increased in precision agriculture when the weather report suggests a sudden heat wave for the following few days [26,27].

Security. NES, particularly those that use wireless communication, are prone to malicious attack [34]. This is most evident in military equipment, where communication has to be secure from enemy eavesdropping. Security in handheld devices is an increasing concern with their widespread use in office environments for everything from checking email to exchanging sensitive documents and data. Running security protocols is computationally expensive and hence power hungry, and several researchers have proposed ways to reduce these power requirements for sensor networks and handheld devices [35–37].

Energy constraints. The small form factor, low weight, and deployment of NES nodes in inaccessible and remote regions imply that these nodes have access to only a limited, nonrenewable energy source. Thus, one major focus of the research community is to develop networking protocols, applications, operating
systems, etc. (besides devices) that are energy efficient and utilize robust, high-throughput but low-power communication schemes [13,17].

Operating system. Special or optimized operating systems are needed owing to the stringent hardware constraints (small form factor, limited energy source, limited memory space) and strict application requirements (real-time constraints, adaptability). Several Real-Time Operating Systems (RTOSs) have been proposed for embedded devices, such as eCos [38], LynxOS from LynuxWorks [39], and the QNX RTOS [40].

Adequate design methodologies. Standard design methodologies and design flows have to be modified, or new ones created, to address the special needs of NES. For example, there is a need for design methodologies for low-power system-on-a-chip implementations to enable integration of the large number of diverse components that form a NES device [41].
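As referenced in the environment-interaction item above, the fragment below sketches the kind of duty-cycled sense-and-actuate loop a precision-agriculture node might run. It is illustrative only: all of the driver functions (read_moisture, open_valve, radio_report, deep_sleep_minutes) and the threshold value are hypothetical placeholders for platform-specific code.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical platform drivers; a real node would supply these. */
extern uint16_t read_moisture(void);        /* sensor sample               */
extern void     open_valve(bool on);        /* actuator                    */
extern void     radio_report(uint16_t v);   /* send reading upstream       */
extern void     deep_sleep_minutes(int m);  /* duty cycling to save energy */

#define DRY_THRESHOLD 300u   /* placeholder; refined as field models improve */

void node_main_loop(void)
{
    for (;;) {
        uint16_t m = read_moisture();      /* interact with the environment   */
        open_valve(m < DRY_THRESHOLD);     /* local feedback: irrigate if dry */
        radio_report(m);                   /* data for model recalibration    */
        deep_sleep_minutes(15);            /* sleep dominates the energy budget */
    }
}
```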
29.5 System Engineering and Engineering Trade-Offs in NES

The design considerations presented in Section 29.4 give rise to interesting trade-offs between the hardware and software components of a NES. Whereas area, power, and weight constraints limit the amount of hardware that can be put on a NES node, integration, debugging, and complexity issues hinder increased dependence on software.
29.5.1 Hardware

Rapid advances in silicon technology are ushering in an era of widespread use of smart dust, that is, very small sensor nodes with reasonably complex computational and communication abilities [20–22]. Besides their small size, these nodes are low power and carry a variety of actuators and sensors, along with radio/wireless communication devices and processors for computation. This enables the nodes to move beyond being mere data-acquisition sensors that send their data to a central server: they can also act as computation points that collate and process data before sending it to a server, or even coordinate computation among themselves independently of a central server.

The power and area constraints on NES nodes mean that general-purpose microprocessors cannot be used in them. However, low-power Application Specific Instruction Processors (ASIPs) augmented with Application Specific Integrated Circuits (ASICs) can provide the necessary computational ability at relatively low power. Whereas the ASIPs are easily programmable, the ASICs can be used for executing computationally expensive and/or time-sensitive portions of applications. For example, target identification in defense systems or airbag release mechanisms in cars require ASICs to meet their timing and computational needs. In fact, Henkel and Li [42] and Brodersen et al. [1] have shown that custom-made processors consume less power than general-purpose processors: with custom chips, parallelism can be exploited effectively, and hardwiring the execution of each function eliminates the need for instruction storage and decoding, reducing power further. On the other hand, applications such as habitat monitoring and precision agriculture do not have high timing or computational requirements, so generic microprocessors or ASIPs can be used. The compromise is speed and computational ability versus programmability: ASICs have high design and manufacturing costs and are inflexible compared with programmable processors, so a change in applications or protocols leads to a large redesign effort, while programmable processors can be reused for several generations of an application (provided computational requirements do not increase).

Reconfigurable hardware such as Field Programmable Gate Arrays (FPGAs) provides a middle path between programmable processors and hardwired ASICs. As the name suggests, FPGAs can be reprogrammed after being deployed in the field and hence provide the flexibility of microprocessors together with the hardwired speed of ASICs. In fact, FPGAs can even be configured at runtime, as suggested by
Nitsch and Kebschull [43], who propose storing the functional behavior and structure of applications in an XML format. When a client wants to execute an application, the XML description is analyzed and mapped onto the FPGA. The drawbacks of FPGAs are that they require a large chip area and run at relatively low clock frequencies.
29.5.2 Software

Small memories and processors with limited computational power restrict the size and complexity of the software that can run on NES nodes. Porting commonly used operating systems and applications to NES is difficult because of these hardware limitations; hence, software development for NES nodes is another challenge that embedded systems designers have to overcome.

The Tiny microthreading operating system (TinyOS) [44] has been proposed to address the unique characteristics of NES nodes. TinyOS is a component-based, highly configurable embedded operating system with a small footprint. It has a highly efficient multithreading engine and maintains a two-level First-In-First-Out (FIFO) scheduler. TinyOS consists of a set of interconnected modular components. Each component has tasks, events, and command handlers associated with it (Figure 29.5). Tasks are the processing units of components; they can signal events, issue commands, and execute other tasks. Components are allocated a static area of memory (a frame) to hold the state information of the thread associated with them; TinyOS does not provide dynamic memory allocation, owing to the restrictions imposed by the hardware. The component-based structure allows TinyOS to be a highly application-specific operating system that can be configured by altering configuration files (.comp and .desc files).

Volgyesi and Ledeczi [2] provide a model-based approach to the development of applications based on TinyOS. They present a graphical environment, called GRATIS, through which the application and operating system components are automatically glued together to produce an application. GRATIS provides automatic code generation capability and a graphical user interface to construct the .comp
FIGURE 29.5 Events and commands in TinyOS: each component comprises a frame and tasks, with handled commands and handled events arriving at the component, and issued commands and signaled events leaving it. (From P. Volgyesi and A. Ledeczi. Component-based development of networked embedded applications, Vanderbilt Publication, 2002. With permission.)
and .desc files automatically, thus simplifying the task of component description and wiring for building TinyOS-based applications [2]. This increases design productivity and reconfigurability. Another effort to support programming operating systems for sensor nodes is the development of a programming language framework called nesC [41]. nesC provides a programming paradigm based on event-driven execution, flexible concurrency, and component-based design. TinyOS itself is written in nesC, making it the most prominent example of a widely used sensor network operating system developed with this language. The language successfully integrates concurrency, reactivity to the environment, and communication. The distributed nature of NES means that these systems are inherently concurrent. For example, data processing and event arrival are two processes that need to be executed concurrently on an NES node. Concurrency management then has to ensure that race conditions do not occur. For example, in the emergency airbag release system, the sensor needs to be able to sense the impact as well as react to it based on processing of the collected data. These types of real-time demands, along with the small size and low cost of NES nodes, make concurrency management a challenging task. nesC addresses these issues by drawing upon several existing language concepts. Three of the main contributions of nesC are: the definition of a component model, an expressive concurrency model, and program analysis to improve reliability and reduce code volume. The component model supports event-targeted systems such as sensor nodes, with bidirectional interfaces to ease event communication. It also provides flexible hardware and software boundaries and avoids dynamic component instantiation and the use of virtual functions. The concurrency model is tied to compile-time analysis that detects data races at compile time, allowing comprehensive concurrent behaviors on NES nodes. Reduction of code size and improvement of reliability are natural goals for any programming language. TinyOS influenced the design of nesC owing to the specific features of the operating system. First, TinyOS provides a collection of reusable system components well suited to component-based architectures. The interface connections between components are called the wiring specification, which is independent of the specific implementation of each component. Tasks and events are inherent in TinyOS: tasks are nonpreemptive computation mechanisms, and events are similar to tasks except that they can preempt a task or another event. The event/task-based concurrency scheme in TinyOS makes event-driven execution and expressive concurrency central to the implementation of nesC. Components in nesC are either modules or configurations, where the former contain application code and the latter wire components together through their interfaces. Modules are written in C-style code, and a top-level configuration is used to wire components together. This resembles the VHDL-like component-architecture scheme in which components are defined and the architecture is the top-level model that connects signals between components. The component-based architecture brings flexibility to application implementations and allows users to write highly concurrent programs for a very small-scale platform with limited physical resources.
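To make the module/configuration split concrete, the following C++ sketch mimics the wiring idea: "modules" implement narrow interfaces, and a top-level "configuration" connects a user of an interface to its provider. This is only an analogy under stated assumptions; real nesC components use nesC syntax, and all names here (Timer, TimerModule, SenseModule) are hypothetical.

    #include <iostream>

    // An interface, loosely analogous to a nesC interface with commands.
    class Timer {
    public:
      virtual ~Timer() {}
      virtual void start(int interval_ms) = 0;  // a "command"
    };

    // A "module": provides the Timer interface.
    class TimerModule : public Timer {
    public:
      virtual void start(int interval_ms) {
        std::cout << "timer armed for " << interval_ms << " ms\n";
      }
    };

    // Another "module" that uses the Timer interface.
    class SenseModule {
    public:
      explicit SenseModule(Timer& t) : timer_(t) {}
      void boot() { timer_.start(500); }  // begin periodic sensing
    private:
      Timer& timer_;
    };

    // The top-level "configuration" wires user to provider, much as a
    // nesC configuration wires components together.
    int main() {
      TimerModule timer;
      SenseModule sensor(timer);
      sensor.boot();
      return 0;
    }

The point of the analogy is that SenseModule never names TimerModule directly; only the configuration (here, main) knows the wiring, which is what makes components reusable.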
Fortunately, with the aid of nesC and graphical configuration tools such as GRATIS, the construction of dedicated operating systems based on TinyOS is gradually becoming easier [2]. These tools allow a designer to build his/her own operating system with relative ease, but application-specific functionality still requires implementation at the programming level. Another area of software for NES nodes that has received considerable attention is network protocols [14,17]. Power and energy constraints in NES nodes necessitate efficient network protocols for transmission of sensed data and intermediary communication. Two broad classifications of sensor networks are proactive and reactive. Proactive networks, as the word suggests, periodically send the sensed attribute to the data collection location or base station. The period is known a priori, allowing the sensors to migrate to their idle, sleep, or off modes to conserve energy. Applications that require periodic monitoring are best suited to this type of sensor network. A protocol called Low Energy Adaptive Clustering Hierarchy (LEACH) [45] is one of the many proposed proactive protocols. Reactive networks, on the other hand, continuously sense the environment and transmit their data to the base station only upon sensing that the attribute has exceeded a specified threshold. This type of
network is useful for time-critical data, so that the user or base station receives the sensitive information immediately. One such time-sensitive protocol for reactive systems that has been proposed recently is the Threshold-sensitive Energy Efficient sensor Network (TEEN) protocol [46]. Hybrid networks constitute a third type of sensor network; they combine proactive and reactive behavior in an attempt to overcome the drawbacks of the other two network types [47]. In hybrid networks, sensor nodes send sensed data at periodic intervals and also whenever the sensed data exceeds the set threshold. The periodic interval is generally longer than those found in proactive networks, so that the functionality of the two network types can be incorporated within one. Furthermore, hybrid systems can be made to work in proactive-only or reactive-only mode as well.
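As a concrete illustration of the hybrid policy just described, the sketch below decides when a node should transmit: on a (relatively long) periodic timeout, or immediately when the sensed attribute crosses the threshold. This is a minimal sketch; the names, units, and parameters are hypothetical rather than taken from any of the cited protocols.

    // Hybrid transmit policy: periodic (proactive) reports plus
    // threshold-triggered (reactive) reports. Illustrative only.
    struct HybridPolicy {
      double threshold;      // reactive trigger level
      long   period_ms;      // proactive reporting period (long)
      long   last_send_ms;   // time of last transmission

      // Returns true if the node should transmit its sensed value now.
      bool should_send(double sensed_value, long now_ms) {
        bool periodic  = (now_ms - last_send_ms) >= period_ms;  // proactive
        bool triggered = sensed_value > threshold;              // reactive
        if (periodic || triggered) {
          last_send_ms = now_ms;
          return true;
        }
        return false;
      }
    };

Setting the threshold unreachably high degenerates to a purely proactive network, while an effectively infinite period yields a purely reactive one, mirroring the modes described above.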
29.6 Design Methodologies and Tools

Deploying large-scale distributed NES is inherently a complex and error-prone task. Designers of such systems rely on system-level modeling and simulation tools during the initial architectural definition phase for design space exploration, to come up with candidate architectures that will satisfy the constraints and requirements, and then to verify the functionality of the system. Design verification from the highest levels of abstraction down to final implementation is an important concern with any complex system. With distributed systems, this need becomes even more acute due to the inability of a system designer to foresee all possible events and sequences of events that may occur. Several tools have been developed for simulating NES at the highest level of abstraction as communicating network models made up of basic network models. Network Simulator 2 (NS-2), OPNET, SensorSim, and NESLsim are popular network simulation tools widely used in the community [48–51]. SensorSim and NESLsim are simulation frameworks specifically designed for sensor networks. SensorSim closely ties sensor networks to their power considerations by taking a two-pronged approach to creating the models. The first prong is a sensor functional model that represents the software functions of the sensor, consisting of the network protocol stack, middleware, user applications, and the sensor protocol stack. The second prong is a power model that simulates hardware components, such as the CPU and radio module, on which the sensor functional model executes. The architecture of the SensorSim simulator is shown in Figure 29.6.
FIGURE 29.6 SensorSim architecture. (From S. Park, A. Savvides, and M. Srivastava. Sensorsim: a simulation framework for sensor networks, ACM, 2000. With permission.) The figure shows a microsensor node consisting of a sensor function model (application, middleware, network protocol stack, and sensor protocol stack, attached to a wireless channel and a sensor channel, respectively) and a power model (a battery model supplying radio, CPU, geophone, and microphone models).
This two-pronged model implies that the sensor functional model dictates the execution of tasks to the power model, and the two models work in parallel with each other. An added feature is the sensor channel, which allows sensing devices to detect events. In this way, the sensor channel exposes external signals to sensor modules such as microphones, infrared detectors, etc. The signals transmitted through this channel can be of any available form, such as infrared light, or sound waves for microphones. Every type of signal has different characteristics based on the medium through which it travels, and this is the primary goal of the sensor channel — to simulate these characteristics accurately and to detect and monitor the events in a sensor network. The use of a power model in SensorSim follows from the importance placed on designing low-power NES devices [52]. Efficient power control is a basic requirement for the longevity of these devices. The basis of the power model is that there is a single power supplier, the battery, and all other components or models are energy consumers (as shown in Figure 29.6). The consumers, such as the CPU model and radio model, drain energy from the battery through events. An attractive feature of SensorSim is its capability to perform hybrid simulations. This refers to the ability of SensorSim to behave as a network emulator and interact with real external components such as network nodes and user applications. However, network emulation for sensor networks differs from traditional network emulation. The large number and speed of input/output events in sensor networks mandate readjusting the real-time delays for the events and reordering the events, making the implementation of an emulator for such networks a much more difficult task. SensorSim enables reprogramming the sensor channel to monitor external inputs, thus using real inputs instead of models for these channels. For example, instead of modeling waves traveling through a wired (e.g., coaxial) cable, a microphone can be connected using a sensor channel to send waveforms through a wired link to the simulator. NESLsim is another modeling framework for sensor networks, based on the PARSEC (Parallel Simulation Environment for Complex Systems) simulator [51]. NESLsim abstracts a sensor node into two entities: the node entity and the radio entity. The node entity is responsible for computation tasks such as scheduling, traffic monitoring, and congestion control, whereas the radio entity maintains the communication between the sensor nodes in the NES. A third entity that is not part of the sensor node is the channel entity, which models the wireless medium through which communication is performed. NS-2 is an open-source, C++-based discrete event simulator developed by the Virtual InterNetwork Testbed (VINT) collaborative research project at the University of Southern California and the University of California, Berkeley. It provides substantial support for simulation of the routing and multicast protocols in the TCP/IP networking stack (IP, TCP, UDP, etc.) over both wired and wireless channels. OPNET performs similar tasks but is proprietary software developed by Opnet Technologies. SystemC [53,54] is a system-level description language developed to model both the hardware and software of a behavioral specification. Drago et al.
[55] developed a methodology to combine the simulation environments of NS-2 and SystemC to simulate and test the functionality of NES. They promote the use of NS-2 for modeling the network topology and communication infrastructure, and SystemC for representing and simulating the hardware/software components of the embedded system. Using NS-2 relieves the designer from writing detailed high-level network protocols that already exist and are available for simulation in the network simulator, whereas SystemC allows modeling and simulation of embedded system implementations. Used in unison, these simulation frameworks preserve simulation integrity and reduce the modeling effort with an admissible degradation in simulation performance. Integration of simulators is regarded as a valuable resource for systems designers. However, to be able to perform such a link between simulators, the underlying development platforms must be similar. For example, both NS-2 and SystemC are built on an underlying C++ framework. Also, the basic simulation paradigm in NS-2 and SystemC is similar. NS-2 is a discrete event-driven simulator whose scheduler runs by selecting the next event, executing it to completion, and looping back to execute the next event. Similarly, the SystemC simulator also has a discrete-event based kernel where
processes are executed and signals are updated at clocked transitions, working on the evaluate-update paradigm. Drago et al. [55] use a shared memory queue to pass tokens and packets for communication between the NS-2 kernel and the SystemC kernel. System-level design methodologies for embedded systems are based on a hardware–software codesign approach [56–59]. Hardware–software codesign is a methodology in which the hardware and software are designed and developed concurrently and collaboratively. This leads to a more efficient and optimized implementation of applications. Ramanathan et al. [3,60] present a timing-driven design methodology for NES and explore the need for temporal correctness while designing these systems. Determining the temporal correctness of system models is difficult because these models are not cycle accurate and usually have no notion of hardware and/or software implementation. Determining the correctness of timing constraints after the hardware has been manufactured naturally leads to costly redesign iterations for both the hardware and software subsystems. Instead, the authors propose a solution whereby they specify, explore, and exploit temporal information in high-level network models and bridge the gap between requirement analysis and system design. At each stage of design refinement, timing information modeled in the higher-level network models trickles down to finer, lower-level models. The authors used NS-2 as their modeling framework. This timing-driven design methodology uses high-level network models generated using a rate derivation technique called RADHA-RATAN [60,61]. RADHA-RATAN works on generalized task graphs whose nodes represent functionality or tasks and whose edges represent asynchronous, unidirectional communication channels between producers and consumers. RADHA-RATAN is a collection of algorithms that generate timing budgets for the task graph nodes based on their preset execution (firing) and data production and consumption rates. Along with RADHA-RATAN, network-level models at the highest level are used to represent functionalities such as routing, congestion, and QoS. The designer can specify distributions, protocols, and similar settings to generate the network graph in NS-2. The nodes of the network graph simulate the transfer of packets and tokens among themselves. This enables testing the functionality of the protocols. Further refinement of this high-level network graph by an experienced designer results in network subsystems that capture timing requirements. In a process known as timing-driven task structuring, the designer can then mutate the task graph until the desired timing behavior is achieved, after which partitioning of hardware and software can be performed. The disconnect between requirement analysis and system design is circumvented by this mutation, allowing the mix of the NS-2 modeling paradigm and formal timing analysis techniques to provide a methodology whereby timing requirements seep from high-level network models into low-level synthesis models. Simulation and timing analysis are one part of the hardware–software codesign puzzle. Automated hardware synthesis and software synthesis and compilation techniques are the next step in generating implementations from system-level models. To this end, Gupta et al.
[62,63] have proposed the SPARK parallelizing high-level synthesis framework, which performs automated synthesis of behavioral descriptions specified in C into synthesizable register-transfer-level VHDL. This framework can then be used in a system-level codesign methodology to implement an application on a core-based platform target [64]. Such hardware–software codesign methodologies are crucial for the design and development of NES. Automated or semiautomated methodologies are less error-prone, lead to faster time-to-market, and can help realize hardware–software trade-offs and design exploration that may not be obvious in large systems.
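Returning to the rate-derivation idea above, the sketch below shows how a per-activation timing budget might fall out of a task's firing rate. This is a hypothetical simplification: RADHA-RATAN derives budgets over entire task graphs with production and consumption rates, while this sketch only captures the flavor of the rule that a task firing at rate r has roughly 1/r of time per activation.

    // Hypothetical simplification of rate-derived timing budgets.
    #include <vector>

    struct TaskNode {
      double rate_hz;                // preset execution (firing) rate
      std::vector<TaskNode*> succs;  // asynchronous, unidirectional edges

      // Per-activation budget in milliseconds, assuming the deadline
      // equals the period (a common rate-monotonic-style assumption).
      double budget_ms() const { return 1000.0 / rate_hz; }
    };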
29.7 Conclusions

The increasing interest in NES is quite timely, as evidenced by the several important application areas where such systems can be used. The development process for these applications and systems remains ad hoc. We need methodologies and tools that provide system designers with the flexibility and capability to quickly construct efficient and optimized system designs. In this chapter, we have examined the range of design and verification challenges faced by NES designers. Among existing solutions, we have
presented techniques that promise to increase the efficiency of the design process by raising the level of design abstraction and by enhancing the scope of system models. These include design tools such as GRATIS and nesC for operating system configuration, and NS-2, SystemC, SPARK, NESLsim, and SensorSim for simulation, synthesis, and codesign of embedded systems. There are several open research problems. Reducing device size and power and increasing device speed remain important objectives. There is a need for distributed applications, along with middleware and operating system support, and for network protocols that enable distributed, coordinated collaboration. Continued progress in these technologies will fulfill the promise of NES as ubiquitous computing systems.
References

[1] R.W. Brodersen, A.P. Chandrakasan, and S. Cheng. Low-power CMOS digital design. IEEE Journal of Solid-State Circuits, 27(4): 473–484, 1992.
[2] P. Volgyesi and A. Ledeczi. Component-based development of networked embedded applications. In Proceedings of EuroMicro, 2002.
[3] D. Ramanathan, R. Jejurikar, and R. Gupta. Timing driven co-design of networked embedded systems. In Proceedings of ASPDAC, 2000, pp. 117–122.
[4] H. Wang, J. Elson, L. Girod, D. Estrin, and K. Yao. Target classification and localization in habitat monitoring. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2003), 2003.
[5] S. Simic and S. Sastry. Distributed environmental monitoring using random sensor networks. In Proceedings of the Second International Workshop on Information Processing in Sensor Networks (IPSN 2003), 2003.
[6] B. West, P. Flikkema, T. Sisk, and G. Koch. Wireless sensor networks for dense spatio-temporal monitoring of the environment: a case for integrated circuit, system and network design. In Proceedings of the IEEE CAS Workshop on Wireless Communications and Networking, 2001.
[7] A. Mainwaring, J. Polastre, R. Szewczyk, D. Culler, and J. Anderson. Wireless sensor networks for habitat monitoring. In Proceedings of WSNA '02, 2002.
[8] P. Flikkema and B. West. Wireless sensor networks: from the laboratory to the field. In National Conference for Digital Government Research, 2002.
[9] Cooperative engagement capability. http://www.fas.org/man/dod-101/sys/ship/weaps/cec.htm
[10] T. He, B.M. Blum, J.A. Stankovic, and T.F. Abdelzaher. AIDA: adaptive application-independent data aggregation in wireless sensor networks. ACM Transactions on Embedded Computing Systems (TECS), Special Issue on Dynamically Adaptable Embedded Systems, 3(2): 426–457, 2004.
[11] Z. Nakad, M. Jones, and T. Martin. Communications in electronic textile systems. In Proceedings of the International Conference on Communications in Computing (CIC), 2003.
[12] D. Meoli and T.M. Plumlee. Interactive electronic textile. Journal of Textile and Apparel, Technology and Management, 2: 1–12, 2002.
[13] D. Estrin, R. Govindan, J. Heidemann, and S. Kumar. Next century challenges: scalable coordination in sensor networks. In Proceedings of the International Conference on Mobile Computing and Networking (MobiCom), 1999.
[14] D. Estrin, A. Sayeed, and M. Srivastava. Wireless sensor networks. In Proceedings of the International Conference on Mobile Computing and Networking (MobiCom), 2002.
[15] J. Kahn, R. Katz, and K. Pister. Emerging challenges: mobile networking for "smart dust." Journal of Communication Networks, 2: 188–196, 2000.
[16] I.F. Akyildiz, W. Su, Y. Sankarasubramaniam, and E. Cayirci. A survey of wireless sensor networks. IEEE Communications Magazine, 38(4): 393–422, August 2002.
[17] C.E. Jones, K.M. Sivalingam, P. Agrawal, and J.C. Chen. A survey of energy efficient network protocols for wireless networks. Wireless Networks, 7: 343–358, 2001.
[18] P. Koopman. Embedded system design issues — the rest of the story. In Proceedings of the International Conference on Computer Design, 1996.
[19] M. Horton, D. Culler, K. Pister, J. Hill, R. Szewczyk, and A. Woo. Mica: the commercialization of microsensor motes. Sensors Magazine, 19(4): 40–48, 2002.
[20] K.S.J. Pister, J.M. Kahn, and B.E. Boser. Smart dust: wireless networks of millimeter-scale sensor nodes. Technical report, Highlight Article in 1999 Electronics Research Laboratory Research Summary, 1999.
[21] J.M. Kahn, R.H. Katz, and K.S.J. Pister. Mobile networking for smart dust. In Proceedings of the ACM/IEEE International Conference on Mobile Computing and Networking (MobiCom), 1999.
[22] Crossbow: smarter sensors in silicon. http://www.xbow.com/
[23] Kyocera. http://global.kyocera.com/application/automotive/auto_elec/
[24] G. Leen and D. Heffernan. Expanding automotive electronic systems. IEEE Computer, 35: 88–93, 2002.
[25] Mitsubishi Electric. http://www.mitsubishielectric.ca/automotive/
[26] Y. Li and R. Wang. Precision agriculture: smart farm stations. IEEE 802 Plenary Meeting Tutorials.
[27] Board on Agriculture and Natural Resources. Precision Agriculture in the 21st Century: Geospatial and Information Technologies in Crop Management. National Academy Press, Washington, 1998.
[28] V. Kottapalli, A. Kiremidjian, J. Lynch, E. Carryer, T. Kenny, K. Law, and Y. Lei. Two-tiered wireless sensor network architecture for structural health monitoring. In Proceedings of SPIE's 10th Annual International Symposium on Smart Structures and Materials, 2003.
[29] Raytheon Systems Co. http://www.raytheon.com/
[30] DARPA. http://www.darpa.mil/fcs/index.html
[31] N. Sarwabhotla and S. Seetharamaiah. Intelligent disaster management system for remote villages in India. In Development by Design, Bangalore, India, 2002.
[32] T. Clouqueur, V. Phipatanasuphorn, P. Ramanathan, and K. Saluja. Sensor deployment strategy for target detection. In Proceedings of WSNA '02, 2002.
[33] F. Koushanfar, M. Potkonjak, and A. Sangiovanni-Vincentelli. Fault tolerance in wireless ad hoc sensor networks. In Proceedings of the IEEE International Conference on Sensors, 2002.
[34] C. Karlof and D. Wagner. Secure routing in wireless sensor networks: attacks and countermeasures. In Proceedings of the IEEE International Workshop on Sensor Network Protocols and Applications, 2003.
[35] A. Perrig, R. Szewczyk, J.D. Tygar, V. Wen, and D.E. Culler. SPINS: security protocols for sensor networks. Wireless Networks, 8(5): 521–534, 2002.
[36] H. Cam, S. Ozdemir, D. Muthuaavinashiappan, and P. Nair. Energy-efficient security protocol for wireless sensor networks. In Proceedings of the IEEE VTC Fall 2003 Conference, 2003.
[37] N.R. Potlapally, S. Ravi, A. Raghunathan, and N.K. Jha. Analyzing the energy consumption of security protocols. In Proceedings of the International Symposium on Low Power Electronics and Design, 2003.
[38] eCos: open-source real-time operating system for embedded systems. http://sources.redhat.com/ecos/
[39] LynxOS real-time operating system for embedded systems. http://www.lynuxworks.com/
[40] QNX real-time operating system for embedded systems. http://www.qnx.com/
[41] D. Gay, P. Levis, R. von Behren, M. Welsh, E. Brewer, and D. Culler. The nesC language: a holistic approach to networked embedded systems. In Proceedings of the ACM SIGPLAN 2003 Conference on Programming Language Design and Implementation, 2003.
[42] J. Henkel and Y. Li. Energy-conscious HW/SW-partitioning of embedded systems: a case study on an MPEG-2 encoder. In Proceedings of the International Workshop on Hardware/Software Codesign, 1998.
[43] C. Nitsch and U. Kebschull. The use of runtime configuration capabilities for network embedded systems. In Proceedings of the Design, Automation and Test in Europe Conference and Exhibition, 2002, pp. 1093–2002.
[44] J. Hill, R. Szewczyk, A. Woo, S. Hollar, D. Culler, and K. Pister. System architecture directions for networked sensors. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems, 2000.
[45] W. Ye, J. Heidemann, and D. Estrin. An energy efficient MAC protocol for wireless sensor networks. In Proceedings of INFOCOM 2002: 21st Annual Joint Conference of the IEEE, 2002.
[46] A. Manjeshwar and D.P. Agrawal. TEEN: a protocol for enhanced efficiency in wireless sensor networks. In Proceedings of the International Workshop on Parallel and Distributed Computing Issues in Wireless Networks and Mobile Computing, 2001.
[47] A. Manjeshwar and D.P. Agrawal. APTEEN: a hybrid protocol for efficient routing and comprehensive information retrieval in wireless sensor networks. In Proceedings of the International Workshop on Parallel and Distributed Computing Issues in Wireless Networks and Mobile Computing, 2002.
[48] OPNET. http://www.opnet.com/
[49] Network Simulator 2. http://www.isi.edu/nsnam/ns/
[50] SensorSim. http://nesl.ee.ucla.edu/projects/sensorsim/
[51] NESLsim. http://www.ee.ucla.edu/∼saurabh/NESLsim/
[52] S. Park, A. Savvides, and M. Srivastava. SensorSim: a simulation framework for sensor networks. In Proceedings of MSWiM 2000, 2000.
[53] R.K. Gupta and S.Y. Liao. Using a programming language for digital system design. IEEE Design and Test of Computers, 14(2): 72–80, April 1997.
[54] SystemC. http://www.systemc.org
[55] N. Drago, F. Fummi, and M. Poncino. Modeling network embedded systems with NS-2 and SystemC. In Proceedings of ICCSC: Circuits and Systems for Communication, 2002, pp. 240–245.
[56] R.K. Gupta and G. De Micheli. Hardware–software cosynthesis for digital systems. IEEE Design and Test of Computers, 10(3): 29–41, July 1993.
[57] G. De Micheli and R. Gupta. Hardware/software co-design. Proceedings of the IEEE, 85: 349–365, 1997.
[58] R. Ernst and J. Henkel. Hardware–software codesign of embedded controllers based on hardware extraction. In Proceedings of the International Workshop on Hardware/Software Codesign, 1992.
[59] J. Henkel and R. Ernst. A hardware–software partitioner using a dynamically determined granularity. In Proceedings of the Design Automation Conference, 1997.
[60] A. Dasdan, D. Ramanathan, and R.K. Gupta. A timing-driven design and validation methodology for embedded real-time systems. ACM Transactions on Design Automation of Electronic Systems, 3: 533–553, 1998.
[61] A. Dasdan, D. Ramanathan, and R.K. Gupta. Rate derivation and its applications to reactive, real-time embedded systems. In Proceedings of the Design Automation Conference, 1998.
[62] S. Gupta, R.K. Gupta, N.D. Dutt, and A. Nicolau. SPARK: A Parallelizing Approach to the High-Level Synthesis of Digital Circuits. Kluwer Academic Publishers, Dordrecht, 2004.
[63] S. Gupta, N.D. Dutt, R.K. Gupta, and A. Nicolau. SPARK: a high-level synthesis framework for applying parallelizing compiler transformations. In Proceedings of the International Conference on VLSI Design, 2003.
[64] M. Luthra, S. Gupta, N.D. Dutt, R.K. Gupta, and A. Nicolau. Interface synthesis using memory mapping for an FPGA platform. In Proceedings of the International Conference on Computer Design, October 2003.
30
Middleware Design and Implementation for Networked Embedded Systems

Venkita Subramonian and Christopher Gill
Washington University

30.1 Introduction
    Multiple Design Dimensions • Networked Embedded Systems Middleware • Example Application: Ping-Node Scheduling for Active Damage Detection • Engineering Life-Cycle • Middleware Design and Implementation Challenges
30.2 Middleware Solution Space
30.3 ORB Middleware for Networked Embedded Systems — A Case Study
    Message Formats • Object Adapter • Message Flow Architecture • Time-Triggered Dispatching • Priority Propagation • Simulation Support
30.4 Design Recommendations and Trade-Offs
30.5 Related Work
30.6 Concluding Remarks
Acknowledgments
References
30.1 Introduction

Networked embedded systems support a wide variety of applications, ranging from temperature monitoring to battlefield strategy planning [1]. Systems in this domain are characterized by the following properties:

1. Highly connected networks.
2. Numerous memory-constrained end-systems.
3. Stringent timeliness requirements.
4. Adaptive online reconfiguration of computation and communication policies and mechanisms.

This work was supported in part by the DARPA NEST (contract F33615-01-C-1898) and PCES (contract F33615-03-C-4111) programs.
Networked embedded systems challenge assumptions about resource availability and scale made by classical approaches to distributed computing, and thus represent an active research area with many open questions. For example, advances in Micro Electro Mechanical Systems (MEMS) hardware technology have made it possible to move software closer to physical sensors and actuators to make more intelligent use of their capabilities. To realize this possibility, however, new networked embedded systems technologies are needed: the hardware infrastructure for such systems may consist of a network of hundreds or even thousands of small microcontrollers, each closely associated with local sensors and actuators.
30.1.1 Multiple Design Dimensions

The following four dimensions drive the design choices for the development of many networked embedded systems:

1. Temporal predictability
2. Distribution
3. Feature richness
4. Memory constraints
There is often a contravariant relationship between some of these design forces. For example, the left side of Figure 30.1 illustrates that feature richness may suffer when footprint is reduced. Similarly, a real-time embedded system's temporal performance must be maintained even when more or fewer features are supported, as illustrated by the right side of Figure 30.1. Significant research has gone into each of these individual design dimensions and has resulted in a wide range of products and technologies. Research on the Embedded Machine [2] and Kokyu [3] mainly addresses the real-time dimension. The CORBA Event Service [4], Real-time Publish/Subscribe [5], and Distributable Threads [6] provide alternative programming models that support both one-to-many and one-to-one communication and hence address the distribution dimension. Small-footprint middleware is the main focus of e*ORB [7] and UCI-Core [8]. TAO [9] and ORBexpress RT [10] are general-purpose CORBA implementations that provide real-time and distribution features for a wide variety of application domains.
30.1.2 Networked Embedded Systems Middleware
General-purpose middleware is increasingly taking the role that operating systems held three decades ago. Middleware based on standards such as CORBA [11], EJB [12], COM [13], and Java RMI [14] now caters to the requirements of a broad range of distributed applications such as banking transactions [15,16], online stock trading [17], and avionics mission computing [18]. Different kinds of general-purpose middleware have thus become key enabling technologies for a variety of distributed applications.
FIGURE 30.1 Features, footprint, and performance.
To meet the needs of diverse applications, general-purpose middleware solutions have tended to support a breadth of features. In large-scale applications, layers of middleware have been added to provide different kinds of services [18]. However, simply adding features breaks down for certain kinds of applications. In particular, features are rarely innocuous in applications with requirements for real-time performance or small memory footprint. Instead, every feature of an application and its supporting middleware is likely either to contribute to or detract from the application in those dimensions. Therefore, careful selection of features is crucial for memory-constrained and real-time networked embedded systems. As middleware is applied to a wider range of networked embedded systems, a fundamental tension between breadth of applicability and customization to the needs of each application becomes increasingly apparent. To resolve this tension, special-purpose middleware must be designed to address the following two design forces:

1. The middleware should provide common abstractions that can be reused across different applications in the same domain.
2. It should then be possible to make fine-grained modifications to tailor the middleware to the requirements of each specific application.

In the following section, we describe a motivating example application and the design constraints it imposes. In Section 30.1.4, we describe additional design constraints imposed by the engineering life-cycle for this application.
30.1.3 Example Application: Ping-Node Scheduling for Active Damage Detection

To illustrate how application domain constraints drive the design of special-purpose middleware, we now describe a next-generation aerospace application [19], in which a number of MEMS sensor/actuator nodes are mounted on a surface of a physical structure, such as an aircraft wing. The physical structure may be damaged during operation, and the goal of this application is to detect such damage when it occurs. Vibration sensor/actuator nodes are arranged in a mesh with (wired or wireless) network connectivity to a fixed number of neighboring nodes. To detect possible damage, selected actuators called ping nodes generate vibrations that propagate across the surface of the physical structure. Sensors within a defined neighborhood can then detect possible damage near their locations by measuring the frequencies and strengths of these induced vibrations. The sensors convey their data to other nodes in the system, which aggregate data from multiple sensors, process the data to detect damage, and issue alerts or initiate mitigating actions accordingly. Three restrictions on the system make the problem of damage detection difficult. First, the sensor/actuator nodes are resource-constrained. Second, two vibrations whose strengths are above a certain threshold at a given sensor location will interfere with each other. Third, sensor/actuator nodes may malfunction over time. These constraints therefore require that the actions of two overlapping ping nodes be synchronized so that no interfering vibrations are generated at any sensor location at any time. This damage detection problem can be captured by a constraint model: scheduling the activities of the ping nodes can be formulated as a distributed graph coloring problem. A color in the graph coloring problem corresponds to a specific time slot in which a ping node vibrates. Thus, two adjacent nodes in the graph, each representing an actuator, cannot have the same color, since the vibrations from these actuators would otherwise interfere with each other. The number of colors is therefore the length (in distinct time slots) of a schedule. The problem is to find a shortest schedule such that the ping nodes do not interfere with one another, in order to minimize damage detection and response times. Distributed algorithms [20] have been shown to be effective for solving the distributed constraint satisfaction problem in such large-scale and dynamic¹ networks.

¹ For example, with occasional reconfiguration due to sensor/actuator failures online.
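To illustrate the formulation, the sketch below assigns ping time slots by greedy graph coloring over the interference graph. This is a simple, centralized stand-in for the distributed algorithms cited above, offered only to make the slot/color correspondence concrete; the function and variable names are hypothetical.

    // Greedy coloring of the ping-node interference graph: adjacent
    // ping nodes (whose vibrations would interfere) must get different
    // time slots (colors).
    #include <vector>

    // adj[i] lists the ping nodes whose vibrations interfere with node i.
    // Returns slot[i] for each node.
    std::vector<int> assign_ping_slots(
        const std::vector<std::vector<int> >& adj) {
      const int n = static_cast<int>(adj.size());
      std::vector<int> slot(n, -1);
      for (int v = 0; v < n; ++v) {
        std::vector<bool> used(n, false);
        for (size_t k = 0; k < adj[v].size(); ++k) {
          int u = adj[v][k];
          if (slot[u] >= 0) used[slot[u]] = true;  // slot taken by neighbor
        }
        int s = 0;
        while (used[s]) ++s;  // smallest slot unused by any neighbor
        slot[v] = s;          // node v pings in time slot s
      }
      return slot;
    }

The schedule length is one more than the largest slot assigned; the greedy strategy never needs more slots than one plus the maximum node degree, although finding the true shortest schedule is the harder optimization the distributed algorithms target.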
30.1.4 Engineering Life-Cycle Large-scale networked embedded systems are often expensive and time consuming to develop, deploy, and test. Allowing separate development and testing of the middleware and the target system hardware can reduce development costs and cycle times. However, this separation imposes additional design and implementation challenges for special-purpose middleware. For example, to gauge performance of the distributed ping-scheduling algorithm in the actual system, physical, computational, and communication processes must be simulated for hundreds of nodes at once. For physical processes, tools such as Matlab or Simulink must be integrated within the simulation environment. Computation should be performed using the actual software that will be deployed in the target system. However, that software may be run on significantly different, and often fewer, actual end-systems in the simulation environment than in the target system. Similarly, communication in the simulation environment will often occur over conventional networks, such as switched Ethernet, which may not be representative of the target system’s network. The following issues must be addressed in the design and implementation of middleware that is suitable for both the simulation and target system environments: • We need to use as much of the software that will be used in the target system as possible in the simulation environment. This helps us to obtain relatively faithful metrics about the application and middleware that will be integrated with the target system. • We need to allow arbitrary configurations for the simulation. The hardware and software configuration may be different for each machine used to run the simulation, and different kinds and numbers of target system nodes may be simulated on each machine. • Simple time scaling will not work since it does not guarantee that the nodes are synchronized. First, it is not practical to require that all the computation and communication times are known a priori, since one function of the simulation may be to gauge those times. Moreover, even if we could scale the time to a safe upper bound, the wall-clock time it takes to run the simulation would likely be prohibitively large. • Because of the heterogeneous configuration of the simulation environment, some simulated nodes might run faster than others, leading to causal inconsistencies in the simulation [21,22]. • Additional infrastructure is thus necessary to encapsulate the heterogeneity of different simulation environments and simulate real-time performance on top of general-purpose operating systems and networks, with simulation of physical processes in the loop.
30.1.5 Middleware Design and Implementation Challenges To facilitate exchanges of information between nodes as part of the distributed algorithm, a middleware framework that provides common services, such as remote object method invocation, is needed. Two key factors that motivate the development of ORB (Object Request Broker)-style middleware for networked embedded systems are (1) remote communication and (2) location independence. Remote communication: Even though a fixed physical topology may connect a group of sensor/actuator components, the logical grouping of these components may not strictly follow the physical grouping. Location independence: The behavior of communicating components should be independent of their location to the extent possible. True location independence may not be achievable in all cases, for example, due to timing constraints or explicit coupling to physical sensors or actuators. However, the implementation of object functionality should be decoupled from the question of whether it accesses other objects remotely or locally where appropriate. The programming model provided to the object developer should thus provide a common programming abstraction for both remote and local access.
In summary, the key challenges we faced in the design and implementation of special-purpose middleware to address the application domain constraints described in Sections 30.1.3 and 30.1.4 are to: Reuse existing infrastructure: We want to avoid developing new middleware from scratch. Rather, we want to reuse prebuilt infrastructure to the extent possible. Provide real-time assurances: The performance of middleware itself must be predictable to allow application-level predictability. Provide a robust DOC middleware: We chose the DOC communication paradigm since it offers direct communication among remote and local components, thus increasing location independence. Reduce middleware footprint: The target for this middleware is memory-constrained embedded microcontroller nodes. Support simulation environments: Simulations should be done with the same application software and middleware intended for deployment on the target. The middleware should also be able to deal with heterogeneous simulation testbeds, that is, different processor speeds, memory resources, etc.
30.2 Middleware Solution Space

General-purpose CORBA ORBs, such as TAO [23], offer generic implementations whose feature sets are determined a priori. Furthermore, faithful implementation of the entire CORBA standard increases the number of features supported by an ORB and hence increases the application's footprint. In the case of memory-constrained networked embedded applications, this can become prohibitively expensive. We instead want to get only the features that we need. The selection of features for our special-purpose middleware implementation was strictly driven by the unique requirements of the application domain. Two approaches to developing special-purpose middleware must then be considered:

• Top-down: subdividing existing general-purpose middleware frameworks, for example, TAO [9].
• Bottom-up: composing special-purpose middleware from lower-level infrastructure, for example, ACE [24].

Both approaches seek to balance reuse of features with customization to application-specific requirements. The top-down approach is preferred when the number and kinds of features required are close to those offered by a general-purpose middleware implementation. In this case, the policy and mechanism options provided by the general-purpose middleware can be adjusted to fit the requirements of the application. In general, this has been the approach used to create and refine features for real-time performance in TAO. On the other hand, if the number or kinds of middleware features required differ significantly from those available in general-purpose middleware, as is the case with many networked embedded systems applications, then a bottom-up approach is preferable. This is based largely on the observation that, in our experience, lower-level infrastructure abstractions are less interdependent and thus more easily decoupled than higher-level ones. It is therefore easier to achieve highly customized solutions by composing middleware from primitive infrastructure elements [25,26] than by trying to extract the appropriate subset directly from a general-purpose middleware implementation. Modern software development relies heavily on reuse. Given a problem and a space of possible solutions, we first try to see whether the problem can be solved directly from an existing solution to a similar problem. Taking this view, we compared the challenges described in Section 30.1.5 to existing middleware solutions, as shown in Table 30.1. TAO [9,23] and e*ORB [27,28] appeared to be the most suitable candidate solutions based on the requirements of our target application described in Section 30.1.3. TAO is a widely used standards-compliant ORB built using the Adaptive Communication Environment (ACE) framework [24,29]. In addition to a predictable and optimized [30,31] ORB core [32], protocols [33,34], and dispatching [35,36] infrastructure, TAO offers a variety of higher-level services [37,38]. e*ORB is a customized ORB that offers a space-efficient implementation of a reduced set of features, with a corresponding reduction in footprint.
TABLE 30.1 Challenges and Potential Solutions

Challenge                       Framework
Infrastructure reuse            ACE, TAO
Real-time assurances            Kokyu, TAO
Robust DOC middleware           TAO, e*ORB
Reduced middleware footprint    UCI-Core, e*ORB
Simulated real-time behavior    TAO? Kokyu?
FIGURE 30.2 Reuse from existing frameworks. (nORB draws on ACE for network programming primitives, patterns, and portability; on Kokyu for its dispatching model, real-time QoS assurance, and priority lanes; on TAO for IDL compilation strategies, ORB concurrency patterns, and ORB core mechanisms; and on UCI-Core for a minimum ORB feature set.)
Problem → we get more or less than we need: Unfortunately, faithful implementation of the CORBA standard increases the number of features supported by TAO, e*ORB, and other similar CORBA implementations, and hence results in increased footprint for the application. In the case of memory-constrained applications, this becomes prohibitively expensive. Although ACE reduces the complexity of the programming model for writing distributed object-oriented applications and middleware infrastructure, it does not directly address the challenges of real-time assurances, reduced footprint, or interoperation with standards-based distribution middleware. Kokyu [3] is a low-level middleware framework built on ACE for flexible multiparadigm scheduling [39] and configurable dispatching of real-time operations. Thus Kokyu can supplement the capabilities of another DOC middleware framework, but cannot replace it. The UCI-Core approach supports different DOC middleware paradigms. It offers significant reuse of infrastructure, patterns, and techniques by generalizing features common to multiple DOC middleware paradigms and providing them within a minimal metaprogramming framework, thus also addressing the challenge of reducing middleware footprint. However, it is unsuited to meet the other challenges described in Section 30.1.5; for example, it does not directly support real-time assurances or simulation of real-time behavior.

Solution → use a bottom-up composition approach to get only the features that we need: Figure 30.2 illustrates our approach. The selection of features for our special-purpose middleware implementation was strictly driven by the unique requirements of the application described in Section 30.1.3. We initially considered a top-down approach to avoid creating and maintaining an open-source code base separate from TAO. However, this approach proved infeasible due to several factors. First, the degree of implementation-level interdependence between features in TAO made it difficult to separate them. Second, the scarcity of mature tools to assist in identifying and decoupling needed versus unneeded features made
it unlikely that we would be able to achieve such a top-down decomposition in a reasonable amount of time. Third, due to the absence of better tools, it was also infeasible to validate that during refactoring we had correctly retained functional and real-time properties for the large body of TAO applications deployed outside our DOC middleware research consortium. Therefore, we ultimately took a bottom-up compositional approach, which led to the development of nORB [19]², starting with the ACE framework and reusing as much as possible from it, with transparent refactoring of some ACE classes to avoid unneeded features. By building on ACE, we reduced duplication between the TAO and nORB code bases while achieving a tractable development process. As in TAO, ACE components serve as primitive building blocks for nORB. Communication between nORB end-systems is performed according to the CORBA [11] model: the client side marshals the parameters of a remote call into a request message and sends it to a remote server, which then demarshals the request and calls the appropriate servant object; the reply is then marshaled into a reply message and sent back to the client, where it is demarshaled and its result returned to the caller. Although we did not retain strict compliance with the CORBA specification, wherever possible we reused concepts, interfaces, mechanisms, and formats from TAO's implementation of the CORBA standard.

Problem → no explicit support for temporal coordination in the simulation environment: As the last row of Table 30.1 suggests, none of the potential middleware solutions we found was able to support the kind of temporal coordination between simulated and actual infrastructure that is needed in the simulation environment described in Section 30.1.4. Because TAO is open-source, it would be possible to integrate special-purpose mechanisms that intercept GIOP messages exchanged between simulated nodes and perform accounting of simulation time whenever a message is sent or received. However, TAO does not explicitly address time-triggered dispatching of method invocations, which must also be subjected to simulation time accounting. Kokyu is designed for time-triggered and event-triggered dispatching, but does not intercept messages exchanged through the ORB without transiting a dispatcher. Therefore, neither TAO nor Kokyu alone is able to provide the kind of temporal coordination needed in the simulation environment.

Solution → virtual clock integrated with real-time dispatching and distribution features: In the target system, both time-triggered local method invocations and remote method invocations must be dispatched according to real-time constraints. Those constraints and the corresponding temporal behavior of the application and middleware must be modeled and enforced effectively in the simulation environment as well. To support both the target and simulation environments, we integrated a dispatcher based on the Kokyu model with distribution features based on TAO in nORB. We then used the dispatcher as a single point of real-time enforcement, where both local upcalls and method invocation request and reply messages are ordered according to dispatching policies. Within that single point of control, we then integrated a virtual clock mechanism that is used only in the simulation environment, to enforce both causal consistency and real-time message and upcall ordering on the simulation's logical time-line.
² nORB is freely available as open-source software at http://deuce.doc.wustl.edu/nORB/

30.3 ORB Middleware for Networked Embedded Systems — A Case Study

In this section, we describe the design and implementation of nORB, and the rationale behind our approach, to address the networked embedded system design and implementation challenges described in Section 30.1.5.
FIGURE 30.3 nORB IOR and message formats.
- nORB Request message: request id, two-way flag, object key length, op name, object key, priority, parameters
- nORB Reply message: request id, status, results
- nORB Locate Request message: locate request id, corbaloc-style key
- nORB Locate Reply message: locate reply id, IOR string
- nORB IOR: repository id, object key, and one or more profiles (Profile-1 … Profile-n), each carrying a transport address and priority
30.3.1 Message Formats

We support a limited subset of the messages defined by the CORBA specification, so that we do not incur unnecessary footprint while still supporting the minimum features required by the application. The following messages are supported by nORB: Request, Reply, Locate Request, and Locate Reply. Figure 30.3 shows the formats of the messages supported by nORB. The format of the Request and Reply messages in nORB closely resembles that of the GIOP Request and Reply messages, respectively. We use the Common Data Representation [11] to encode the messages themselves. The nORB client builds a Request message and sends it to the nORB server, which sends a Reply back to the client.
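The following C++ sketch restates the Request and Reply layouts of Figure 30.3 as structs. This is illustrative only: the field names follow the figure, but the field types, widths, and ordering are assumptions, and actual nORB messages are CDR-encoded octet streams rather than in-memory structs.

    // Structural sketch of the nORB messages from Figure 30.3
    // (illustrative; not the wire format).
    #include <string>
    #include <vector>
    #include <cstdint>

    struct RequestMessage {
      uint32_t             request_id;      // matches replies to requests
      bool                 two_way;         // false for one-way calls
      uint32_t             object_key_len;  // length of object_key
      std::string          op_name;         // operation to invoke
      std::string          object_key;      // identifies the target servant
      int16_t              priority;        // propagated client priority
      std::vector<uint8_t> parameters;      // CDR-marshaled arguments
    };

    struct ReplyMessage {
      uint32_t             request_id;      // copied from the request
      uint32_t             status;          // e.g., success or exception
      std::vector<uint8_t> results;         // CDR-marshaled return values
    };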
30.3.2 Object Adapter

In standard CORBA, each server-side ORB may provide multiple object adapters [40]. Servant objects register with an object adapter, which demultiplexes each client request to the appropriate servant. Each object adapter may be associated with a set of policies, for example, for servant threading, retention, and lifespan [41]. Supporting multiple object adapters per ORB allows heterogeneous object policies to be implemented in a client–server environment, which is desirable in applications, such as online banking, where each object on a server may be configured according to the preferences of the server administrator, or even the end-user. In nORB, however, there is no assumption of multiple object adapters. Instead, a single object adapter per ORB is considered preferable for simplicity and footprint reduction. In nORB, the number of objects hosted on an embedded node is expected to be small, which reduces the need for multiple policies and thus for multiple object adapters. Even though the resulting object adapter does not conform to the Portable Object Adapter specification, a significant degree of footprint reduction is achieved because of the reduced object adapter functionality. We have also simplified the process of object registration, to free developers from writing the repetitive code seen in many CORBA programs. In the object adapter, we maintain a lookup table of object ids and pointers to servant implementation objects. The lookup table is synchronized using a Readers/Writer lock. We have also consolidated object registration with other middleware initialization functions, by moving it from the object adapter interface to the ORB interface.
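A minimal sketch of such a servant lookup table, using ACE's readers/writer lock, is shown below. Only ACE_RW_Thread_Mutex and the guard templates are real ACE classes; the ObjectAdapter class, its method names, and the Servant type are hypothetical stand-ins for nORB's actual interfaces.

    #include <map>
    #include <string>
    #include "ace/RW_Thread_Mutex.h"
    #include "ace/Guard_T.h"

    class Servant;  // application-supplied implementation object

    class ObjectAdapter {
    public:
      // Called during ORB initialization to register a servant.
      void register_servant(const std::string& object_id, Servant* s) {
        ACE_Write_Guard<ACE_RW_Thread_Mutex> guard(lock_);  // exclusive
        table_[object_id] = s;
      }

      // Called on each incoming request to demultiplex to the servant.
      Servant* find_servant(const std::string& object_id) {
        ACE_Read_Guard<ACE_RW_Thread_Mutex> guard(lock_);   // shared
        std::map<std::string, Servant*>::iterator it =
            table_.find(object_id);
        return (it == table_.end()) ? 0 : it->second;
      }

    private:
      ACE_RW_Thread_Mutex lock_;               // readers/writer lock
      std::map<std::string, Servant*> table_;  // object id -> servant
    };

The readers/writer lock fits the expected usage: lookups on every request (readers) vastly outnumber registrations during initialization (writers).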
30.3.3 Message Flow Architecture

The message flow architecture in nORB uses strategies and patterns similar to those in TAO. We briefly mention the strategies we used and refer the interested reader to Reference 42, which discusses these strategies in greater detail.

30.3.3.1 Reply Wait Strategy

When a client makes a remote two-way function call, the call is made on the client stub, which marshals the parameters into a request and sends it to the server. Two-way call semantics require the caller thread to block until the reply comes back from the server. There are two different strategies for waiting for the reply in TAO — Wait on Connection and Wait on Reactor [43]. nORB uses the Wait on Connection strategy to wait for the reply.

30.3.3.2 Upcall Dispatch Strategy

On the server side, there are different strategies for processing an incoming request and sending the reply back to the client. Two well-known strategies are Direct Upcall and Queued Upcall [43]. In the Direct Upcall strategy, the upcall is dispatched in the same thread as the I/O thread that listens for incoming requests on the connection stream. The Queued Upcall strategy follows the Half-Sync/Half-Async pattern [44,45]: a network I/O thread is dedicated to receiving requests from clients; once a request is received, it is encapsulated into a command object and placed in a queue, which is shared with another thread in which the upcall dispatch is done. TAO uses the Direct Upcall strategy, whereas nORB uses the Queued Upcall strategy to address the simulation issues presented in Section 30.1.4, as discussed in Section 30.3.6.
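The sketch below illustrates the Queued Upcall (Half-Sync/Half-Async) flow using ACE: the asynchronous I/O thread enqueues each received request, and a separate synchronous thread dequeues and dispatches upcalls in order. The ACE classes and their methods are real; UpcallTask and dispatch_upcall are hypothetical names, and a production implementation would do real demarshaling and reply handling.

    #include "ace/Task.h"
    #include "ace/Message_Block.h"

    class UpcallTask : public ACE_Task<ACE_MT_SYNCH> {
    public:
      int open(void*) {
        return this->activate();  // spawn the synchronous upcall thread
      }

      // Called from the asynchronous I/O thread per received request.
      int enqueue_request(ACE_Message_Block* request) {
        return this->putq(request);  // hand off via the message queue
      }

      // Body of the upcall thread: block on the queue, dispatch in order.
      int svc() {
        ACE_Message_Block* mb = 0;
        while (this->getq(mb) != -1) {
          dispatch_upcall(mb);
          mb->release();
        }
        return 0;
      }

    private:
      void dispatch_upcall(ACE_Message_Block* /*request*/) {
        // demarshal the request, invoke the servant, send the reply
      }
    };

Decoupling I/O from dispatch in this way is what lets nORB interpose its dispatcher (and, in simulation, its virtual clock) between message arrival and upcall execution.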
30.3.4 Time-Triggered Dispatching

In the active damage detection application described in Section 30.1.3, control events must be triggered predictably at various rates. We developed a dispatcher for nORB to trigger these events predictably, based on the Kokyu [39] dispatching model. Kokyu abstracts combinations of fundamental real-time scheduling and dispatching mechanisms to enforce a variety of real-time policies, including well-known strategies such as Rate-Monotonic Scheduling (RMS) [46], Earliest Deadline First (EDF) [46], and Maximum Urgency First [47]. This dispatcher is used to trigger events on the client side and to dispatch upcalls on the server side. A Dispatchable interface is provided, which is to be implemented by any application object that needs to be time triggered; the dispatcher calls the handle_dispatch() method on the Dispatchable object. In the ping-scheduling application, some application objects are both Dispatchable and make remote calls when triggered. Figure 30.4 shows the message flow from the client to the server when a timer expires on the client side, leading to a remote object method call on the server side.
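The shape of this interface, as described in the text, might look as follows; the Dispatchable name and handle_dispatch() method mirror the text, while the example subclass and any surrounding details are assumptions.

// Illustrative time-triggered dispatching interface.
class Dispatchable {
public:
  virtual ~Dispatchable() = default;
  // Called by the Kokyu-style dispatcher each time the object's
  // rate timer fires.
  virtual void handle_dispatch() = 0;
};

// Hypothetical application object: both time triggered and a remote caller.
class PingSensor : public Dispatchable {
public:
  void handle_dispatch() override {
    // sample the sensor and, if needed, make a remote call through the stub
  }
};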
30.3.5 Priority Propagation

nORB implements real-time priority propagation in a way similar to TAO [32,48]. The client ORB uses the priority of the thread in which the remote invocation is made: nORB looks for a matching priority among the set of profiles in the IOR and makes a connection to the corresponding port. We use a cached connection strategy [42] to avoid the overhead of connection setup on every remote invocation. To alleviate priority inversion, each connection endpoint on the server is associated with a thread/reactor pair, thus forming a dispatching lane. The priority of the thread associated with each dispatching lane is set so that a request arriving in a higher-priority lane will be processed before a request arriving in a lower-priority lane. Figure 30.5 shows an example in which RMS [46] was used to assign priorities to the different rates on the client side.
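A minimal sketch of the client-side endpoint selection follows; the Profile structure and function names are assumptions made for the example, not nORB's actual data layout.

#include <string>
#include <vector>

// Illustrative per-priority endpoint entry carried in the IOR.
struct Profile {
  int priority;         // e.g., LO or HI band
  std::string host;
  unsigned short port;  // per-priority dispatching lane on the server
};

// Pick the profile whose priority matches the invoking thread's priority.
const Profile* select_endpoint(const std::vector<Profile>& ior_profiles,
                               int caller_priority) {
  for (const Profile& p : ior_profiles)
    if (p.priority == caller_priority)
      return &p;  // connect here (or reuse a cached connection)
  return nullptr;  // no matching priority band in the IOR
}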
FIGURE 30.4 Kokyu dispatcher in nORB. (Figure: clients and servants communicate through stubs and skeletons; on the server, the object adapter (SOA) feeds Kokyu dispatchers with HI and LO priority lanes inside nORB.)
FIGURE 30.5 Priority propagation in nORB. (Figure: 2 kHz and 100 Hz client threads invoke through stubs and dispatcher lanes; a request with id = 23 carries object key A; the IOR lists per-priority endpoints — Priority = LO, host A:20,000 and Priority = HI, host A:10,000 — and each server port (reactor port 10,000 for HI, 20,000 for LO) feeds a dispatching lane of matching priority.)
The priority of the client thread making the request is propagated in the request. This priority is used on the server side to enqueue the request if necessary, as explained in Section 30.3.6.
30.3.6 Simulation Support

Our solution to the engineering life-cycle issues described in Section 30.1.4 is to have nORB maintain a logical clock [21,49] at each node to support synchronization among nodes. We distinguish between logical times and priorities, which are maintained by the simulation, and actual times and priorities, which are maintained by the operating system. Logical time is divided into discrete frames. Each logical clock is incremented in discrete units by nORB after any incoming message is processed. If there are no incoming messages to be processed, the logical time is incremented to the start of the next frame. At the start of each frame, each node registered for a timed upcall in that frame is triggered, which results in the node waiting for sensor values from a physical simulation tool, such as Matlab or Simulink.
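The following sketch captures the logical-clock bookkeeping described above and in the next paragraphs; the class, the frame representation, and the increment policy are assumptions for illustration, not nORB's actual implementation.

#include <algorithm>

// Illustrative per-node logical clock for the simulation support.
class LogicalClock {
public:
  explicit LogicalClock(long frame_length) : frame_(frame_length) {}

  // Advance by a discrete unit after an incoming message is processed.
  void advance_after_message(long units) {
    if (!frozen_) now_ += units;
  }

  // With no pending messages, jump to the start of the next frame,
  // where registered timed upcalls are triggered.
  void advance_to_next_frame() {
    if (!frozen_) now_ = ((now_ / frame_) + 1) * frame_;
  }

  // "Freeze" while the node waits for simulated sensor values; incoming
  // messages are queued rather than processed during this interval.
  void freeze() { frozen_ = true; }
  void thaw() { frozen_ = false; }

  // Stamp an incoming message: if the local clock is later than the
  // message's release time, the message takes the local clock value.
  long stamp_incoming(long message_time) const {
    return std::max(now_, message_time);
  }

  long now() const { return now_; }

private:
  long now_ = 0;
  long frame_;
  bool frozen_ = false;
};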
FIGURE 30.6 Processing with a logical clock. (Figure: client and server nORB instances, each with a logical clock thread, stub/skeleton, and HI/LO dispatching lanes at 2 kHz and 100 Hz; numbered arrows 1 through 8 correspond to the steps enumerated below, and a request with id = 23 and object key A carries logical priority = LO toward host A:10,000.)
While a node is waiting for the sensor values, we "freeze" the logical clock on that node, that is, we prevent the logical clock from advancing. As a result, no messages are processed on that node while it is waiting, and all incoming messages are queued. The queuing of incoming messages allows messages from the future to be delayed, and then processed after a slower node "catches up." Any request or reply leaving a node carries the current logical clock value of that node. This value is used by the receiving node to sort the incoming messages based on their time of release and the logical clock of the receiving node. If the logical clock of the receiving node is later than that of the incoming message, then the message is stamped with the value of the logical clock at the receiving node.

Each item (dispatchable or request/reply message) carries its logical execution time, which is predefined for each item. When an item is ready and most eligible for execution, the clock thread dequeues it and checks whether it can complete its execution before the earliest more eligible item's release time. If it can complete its execution before another more eligible item must be processed, the clock thread enqueues the current item in the appropriate lane for actual execution on the processor. If not, the clock thread simulates the partial execution of the current item without actually executing it, by (1) storing the remaining logical execution time in the item itself and (2) enqueuing the updated item back into the clock thread's queue so that it can compete with other enqueued items for its next segment of execution eligibility. (A sketch of this check appears after the numbered steps below.)

A lane can be configured to run a single thread or a pool of worker threads. As described in Section 30.3.4, without clock simulation each lane thread is run at its own actual OS priority. In the simulation environment, time and eligibility are accounted for by the logical clock thread, so all the lane threads are run at the same actual OS priority. Each lane still maintains its logical priority in thread-specific storage [45], so that the logical priority can be sent with the request messages and used for eligibility ordering, as it will be in the target system.

Figure 30.6 shows an illustrative sequence of events that results from the addition of the logical clock thread to nORB, assuming for simplicity that all items run to completion rather than being preempted:

1. When the logical clock advances, the clock thread goes through the list of dispatchables to see whether any are ready to be triggered. A ready dispatchable is one whose next trigger time is less than or equal to the current logical clock and whose previous invocation has completed execution. In general, the clock thread determines the earliest time a message or dispatchable will be run, and marks all items with that time as being ready.
2. Any ready dispatchables are released to the clock thread's queues, according to their assigned logical priorities.
3. The clock thread selects the most eligible ready item (message or dispatchable) from among its priority queues. The clock thread then enqueues the selected item in the appropriate priority lane of the Dispatcher, where it will compete with other messages and locally released dispatchables.
4. The corresponding lane thread in the Dispatcher dispatches the enqueued item. The resulting upcall might in turn invoke a remote call to a servant object, which we describe in steps 5 to 8.
5. The logical priority of the dispatchable or the message is propagated to the server side. Currently the application scheduler uses RMS to decide the logical priority of the dispatchable based on its simulated rate. Each lane thread stores its assigned logical priority in thread-specific storage [45]. The actual OS priorities of all the lane threads are kept the same under the clock simulation mechanism.
6. The incoming message is accepted by the server's reactor thread and is enqueued for temporal and eligibility ordering by the clock thread. Note that there is only one reactor thread, which runs at an actual priority level between the clock and the lane threads' actual priority levels. This is different from the previous approach discussed in Section 30.3.5 and is applied only when using the clock simulation. The lane threads are given the highest actual OS priority and the clock thread itself is assigned the lowest actual OS priority. This configuration of actual OS thread priorities reduces simulation times while still ensuring synchronization between nodes.
7. As on the client side, the clock thread then chooses the most eligible item from its queues and enqueues the item on the appropriate lane thread's queue.
8. The lane thread dispatches the enqueued item.
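The core of the clock thread's eligibility check described above can be sketched as follows; the Item structure, the function names, and the two hook functions are assumptions for illustration only.

// Illustrative item carrying its predefined logical execution time.
struct Item {
  long remaining_exec_time;  // logical execution time still owed
  int logical_priority;
};

// Hooks assumed to exist elsewhere in the ORB (declared only, for the sketch).
void enqueue_in_lane(Item& item);
void requeue_in_clock_queue(Item& item);

// Run the item for real only if it can finish before a more eligible item
// is released; otherwise simulate its partial execution.
void schedule_item(Item& item, long now, long next_more_eligible_release) {
  if (now + item.remaining_exec_time <= next_more_eligible_release) {
    enqueue_in_lane(item);  // fits: dispatch in the matching priority lane
  } else {
    // Charge the segment that fits, then requeue the item so it competes
    // for its next segment of execution eligibility.
    item.remaining_exec_time -= (next_more_eligible_release - now);
    requeue_in_clock_queue(item);
  }
}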
30.4 Design Recommendations and Trade-Offs

In this section, we present recommendations and trade-offs that we encountered in designing and developing the nORB special-purpose middleware to meet the challenges described in Section 30.1.5. While we focus specifically on our work on nORB, the same guidelines can be applied to produce middleware tailored to other networked embedded systems:

Use the application to guide the data types to be supported. We used the application domain to guide our choice of the data types that the ORB middleware supports. In the damage detection application, for example, sequences of simple structures are exchanged between sensor/actuator nodes. nORB therefore supports only basic data types, structures, and sequences. This reduces the code base needed to support data types to those necessary for ping-node scheduling. We supported marshaling in nORB to allow application deployment over heterogeneous networked embedded platforms, particularly for simulation environments. For other application domains with homogeneous platforms, support for marshaling could be removed entirely. To support other data types (for example, CORBA Any), the middleware simply has to incorporate the code to marshal and demarshal those data types properly, trading off use of those types against an increase in footprint.

Minimize generality of the messaging protocol. Previous work [50] has shown that optimizations can be achieved by the principle patterns of (1) relaxing system requirements and (2) avoiding unnecessary generality. Protocols in TAO, such as GIOP-Lite [50], are designed according to this principle. Similarly, as described in Section 30.3.1, we support a limited subset of the message types in the CORBA specification, so that we incur only the necessary footprint while still providing all features required by our target application. Because we reduce the number of fields in the header, advanced features of a CORBA ORB, such as Portable Interceptors, are not supported. Providing this support would come at the cost of an increase in both ORB footprint and message sizes.
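To make the trade-off concrete, a reduced request header in the spirit of this recommendation might look as follows. The field layout is an assumption for illustration, not nORB's actual wire format; the point is that the service-context machinery that would support features like Portable Interceptors has no fields to carry it.

#include <cstdint>

// Illustrative reduced request header: only the fields the target
// application needs.
struct RequestHeader {
  uint8_t message_type;     // small subset of CORBA message types
  uint32_t request_id;      // matches replies to outstanding requests
  uint8_t priority;         // propagated end to end (Section 30.3.5)
  uint16_t object_key_len;  // object key bytes follow the header
  uint32_t body_len;        // length of the marshaled parameters
};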
Simplify object life-cycle management. In a networked embedded system, the number of application objects hosted on each node is expected to be very small, which reduces the need for full-fledged life-cycle management. Servant objects are registered when the application starts, and live as long as the application, eliminating the need for more complicated dynamic life-cycle management. The trade-off is that, for large numbers of objects or where objects are dynamically created and destroyed, simplified object life-cycle management will not suffice. For such end-systems, more complex adapters are necessary, albeit at the cost of a larger footprint.

Simplify operation lookup and dispatch. When a remote operation request is received on a server, an ORB must search for the appropriate object in its Object Adapter and then look up the appropriate method in the operations table. nORB uses linear search for that lookup, on the assumption that only a few methods on only a few objects on each node will receive remote calls (see the sketch at the end of this section). The linear search strategy reduces memory footprint while still maintaining real-time properties for small numbers of methods. The trade-off is that for large numbers of methods, real-time performance will suffer, because linear search takes O(n) time to do a lookup, where n is the number of operations. Perfect hashing takes O(1) time, but this alternative would again entail increased footprint due to the code generated to support perfect hashing.

Pay attention to safety and liveness requirements. We described the different messaging and concurrency architecture choices in Section 30.3.3. With nested upcalls for remote method invocations, the Wait on Connection strategy could result in deadlocks [51]. The Wait on Reactor strategy, on the other hand, avoids deadlocks but introduces blocking factors that could hurt real-time performance [51].

Design to support the entire engineering life-cycle. Although many middleware solutions address the run-time behavior of networked embedded systems, few of them address earlier stages of the engineering life-cycle. In particular, for networked embedded systems where simulation is used to gauge performance in the target system prior to system integration, additional special-purpose mechanisms may be needed. The virtual clock described in Section 30.3.6 is a good example of how such special-purpose mechanisms can be provided in middleware so that (1) application software is not modified for use in the simulation environment, and (2) the mechanism, once developed, can be reused for multiple applications.
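The operation-lookup recommendation above amounts to the following kind of code; the table shape and names are illustrative assumptions, not nORB's actual dispatch machinery.

#include <cstring>

// Illustrative linear operation lookup: O(n) search over a small static
// table instead of generated perfect hashing.
typedef void (*UpcallFn)(void* servant, void* request);

struct OperationEntry {
  const char* name;
  UpcallFn upcall;
};

UpcallFn find_operation(const OperationEntry* table, int n, const char* op) {
  for (int i = 0; i < n; ++i)  // linear search: fine for a few methods
    if (std::strcmp(table[i].name, op) == 0)
      return table[i].upcall;
  return nullptr;  // unknown operation
}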
30.5 Related Work

MicroQoSCORBA [52] focuses on footprint reduction through CASE-tool customization of middleware features. Ubiquitous CORBA projects [53], such as LegORB and the CORBA specialization of the Universally Interoperable Core (UIC), focus on a metaprogramming approach to DOC middleware. The UIC contains meta-level abstractions that different middleware paradigms, for example, CORBA, must specialize, while ACE, TAO, and nORB are concrete base-level frameworks. eORB [7] is a commercial CORBA ORB developed for embedded systems, especially in the telecommunications domain. The Time-Triggered Architecture (TTA) [54] is designed for fault-tolerant distributed real-time systems. Within the TTA, all system activities are initiated by the progression of a globally synchronized time-base. This stands in contrast to event-driven systems, in which system activity is triggered by events. The Time-Triggered Message-Triggered Object (TMO) [55,56] architecture facilitates the design and development of real-time systems with syntactically simple but semantically powerful extensions of conventional object-oriented real-time approaches.
30.6 Concluding Remarks

We have described how meeting the constraints of networked embedded systems requires careful analysis of a representative application, as an essential tool for the development of the special-purpose middleware itself. In addition, discovering which settings and features are best for an application requires careful
design a priori. It is therefore important to adopt an iterative approach to middleware development that starts with specific application requirements and takes simulation and experimentation results into consideration. By integrating both real-time middleware dispatching and a virtual clock mechanism used for simulation environments with distribution middleware features, we have shown how to develop special-purpose middleware solutions that address multiple stages of a networked embedded system's engineering life-cycle. We also have empirically verified [57] that with nORB the footprint of a statically linked executable memory image for the ping-node-scheduling application was 30% of the footprint of the same application built with TAO, while still retaining real-time performance similar to TAO's.
Acknowledgments We gratefully acknowledge the support and guidance of the Boeing NEST OEP Principal Investigator Dr. Kirby Keller and Boeing Middleware Principal Investigator Dr. Doug Stuart. We also wish to thank Dr. Weixiong Zhang at Washington University in St. Louis for providing the initial algorithm implementation used in ping scheduling.
References

[1] D. Estrin, D. Culler, K. Pister, and G. Sukhatme. Connecting the physical world with pervasive networks. IEEE Pervasive Computing, 1: 59–69, 2002.
[2] T. Henzinger, C. Kirsch, R. Majumdar, and S. Matic. Time safety checking for embedded programs. In Proceedings of the Second International Workshop on Embedded Software (EMSOFT). LNCS, Springer-Verlag, Heidelberg, 2002.
[3] C.D. Gill, R. Cytron, and D.C. Schmidt. Middleware scheduling optimization techniques for distributed real-time and embedded systems. In Proceedings of the Seventh Workshop on Object-Oriented Real-Time Dependable Systems. IEEE, San Diego, CA, January 2002.
[4] T.H. Harrison, D.L. Levine, and D.C. Schmidt. The design and performance of a real-time CORBA event service. In Proceedings of OOPSLA '97. ACM, Atlanta, GA, October 1997, pp. 184–199.
[5] D.C. Schmidt and C. O'Ryan. Patterns and performance of real-time publisher/subscriber architectures. Journal of Systems and Software, Special Issue on Software Architecture — Engineering Quality Attributes, 66(3): 213–223, 2002.
[6] Y. Krishnamurthy, C. Gill, D.C. Schmidt, I. Pyarali, L. Mgeta, Y. Zhang, and S. Torri. The design and implementation of real-time CORBA 2.0: dynamic scheduling in TAO. In Proceedings of the 10th Real-Time Technology and Application Symposium (RTAS '04). IEEE, Toronto, Canada, May 2004.
[7] PrismTech. eORB. URL: http://www.prismtechnologies.com/English/Products/CORBA/eORB/
[8] M. Roman. Ubicore: Universally Interoperable Core. www.ubi-core.com/Documentation/Universally_Interoperable_Core/universally_interoperable_core.html
[9] Institute for Software Integrated Systems. The ACE ORB (TAO), Vanderbilt University. www.dre.vanderbilt.edu/TAO/
[10] Objective Interface Systems. ORBExpress, 2002. www.ois.com
[11] Object Management Group. The Common Object Request Broker: Architecture and Specification, 3.0.2 ed., December 2002. http://www.omg.org/technology/documents/formal/corba_iiop.htm
[12] Sun Microsystems. Enterprise JavaBeans Specification, August 2001. java.sun.com/products/ejb/docs.html
[13] D. Rogerson. Inside COM. Microsoft Press, Redmond, WA, 1997.
[14] Sun Microsystems, Inc. Java Remote Method Invocation Specification (RMI), October 1998. http://java.sun.com/j2se/1.3/docs/guide/rmi/spec/rmi-title.html
[15] L.R. David. Online banking and electronic bill presentment payment are cost effective. Published online by Online Financial Innovations at www.onlinebankreport.com
[16] K. Kang, S. Son, and J. Stankovic. STAR: secure real-time transaction processing with timeliness guarantees. In 23rd IEEE Real-Time Systems Symposium, Austin, TX, 2002, pp. 3–12.
[17] X. Défago, K. Mazouni, and A. Schiper. Highly available trading system: experiments with CORBA. In IFIP International Conference on Distributed Systems Platforms and Open Distributed Processing (Middleware '98), The Lake District, England, September 15–18, 1998.
[18] D. Corman. WSOA — Weapon Systems Open Architecture demonstration — using emerging open system architecture standards to enable innovative techniques for time critical target (TCT) prosecution. In Proceedings of the 20th IEEE/AIAA Digital Avionics Systems Conference (DASC), October 2001.
[19] C. Gill, V. Subramonian, J. Parsons, H.-M. Huang, S. Torri, D. Niehaus, and D. Stuart. ORB middleware evolution for networked embedded systems. In Proceedings of the Eighth International Workshop on Object-Oriented Real-Time Dependable Systems (WORDS '03). Guadalajara, Mexico, January 2003.
[20] W. Zhang, G. Wang, and L. Wittenburg. Distributed stochastic search for constraint satisfaction and optimization: parallelism, phase transitions and performance. In Proceedings of the AAAI Workshop on Probabilistic Approaches in Search, 2002.
[21] L. Lamport. Time, clocks, and the ordering of events in a distributed system. Communications of the ACM, 21(7): 558–565, 1978.
[22] N.A. Lynch. Distributed Algorithms. Morgan Kaufmann, San Mateo, CA, 1996.
[23] D.C. Schmidt, D.L. Levine, and S. Mungee. The design of the TAO real-time object request broker. Computer Communications, 21(4): 294–324, 1998.
[24] Institute for Software Integrated Systems. The ADAPTIVE Communication Environment (ACE), Vanderbilt University. www.dre.vanderbilt.edu/ACE/
[25] F. Hunleth, R. Cytron, and C. Gill. Building customizable middleware using aspect oriented programming. In The OOPSLA 2001 Workshop on Advanced Separation of Concerns in Object-Oriented Systems. ACM, Tampa Bay, FL, October 2001. www.cs.ubc.ca/~kdvolder/Workshops/OOPSLA2001/ASoC.html
[26] F. Hunleth and R.K. Cytron. Footprint and feature management using aspect-oriented programming techniques. In Proceedings of the Joint Conference on Languages, Compilers and Tools for Embedded Systems. ACM Press, 2002, pp. 38–45.
[27] S. Aslam-Mir. Experiences with real-time embedded CORBA in telecom. In Proceedings of the OMG's First Workshop on Real-Time and Embedded Distributed Object Computing. Object Management Group, Falls Church, VA, July 2000.
[28] J. Garon. Meeting performance and QoS requirements with embedded CORBA. In Proceedings of the OMG's First Workshop on Embedded Object-Based Systems. Object Management Group, Santa Clara, CA, January 2001.
[29] D.C. Schmidt. ACE: an object-oriented framework for developing distributed applications. In Proceedings of the USENIX C++ Technical Conference. USENIX Association, Cambridge, MA, April 1994.
[30] I. Pyarali, C. O'Ryan, D.C. Schmidt, N. Wang, V. Kachroo, and A. Gokhale. Applying optimization patterns to the design of real-time ORBs. In Proceedings of the Fifth Conference on Object-Oriented Technologies and Systems. USENIX, San Diego, CA, May 1999, pp. 145–159.
[31] N. Wang, D.C. Schmidt, and S. Vinoski. Collocation optimizations for CORBA. C++ Report, 11: 47–52, 1999.
[32] D.C. Schmidt, S. Mungee, S. Flores-Gaitan, and A. Gokhale. Alleviating priority inversion and non-determinism in real-time CORBA ORB core architectures. In Proceedings of the Fourth IEEE Real-Time Technology and Applications Symposium. IEEE, Denver, CO, June 1998.
[33] A. Gokhale and D.C. Schmidt. Principles for optimizing CORBA Internet inter-ORB protocol performance. In Proceedings of the Hawaiian International Conference on System Sciences. Hawaii, USA, January 1998.
[34] A. Gokhale and D.C. Schmidt. Optimizing a CORBA IIOP protocol engine for minimal footprint multimedia systems. IEEE Journal on Selected Areas in Communications, Special Issue on Service Enabling Platforms for Networked Multimedia Systems, 17: 1673–1699, 1999.
[35] A. Gokhale and D.C. Schmidt. Evaluating the performance of demultiplexing strategies for real-time CORBA. In Proceedings of GLOBECOM '97. IEEE, Phoenix, AZ, November 1997.
[36] I. Pyarali, C. O'Ryan, and D.C. Schmidt. A pattern language for efficient, predictable, scalable, and flexible dispatching mechanisms for distributed object computing middleware. In Proceedings of the International Symposium on Object-Oriented Real-Time Distributed Computing (ISORC). IEEE/IFIP, Newport Beach, CA, March 2000.
[37] T.H. Harrison, C. O'Ryan, D.L. Levine, and D.C. Schmidt. The design and performance of a real-time CORBA event service. In Proceedings of the 12th ACM SIGPLAN Conference on Object-Oriented Programming Systems, Languages, and Applications (OOPSLA '97), October 5–9, 1997, Atlanta, GA.
[38] C.D. Gill, D.L. Levine, and D.C. Schmidt. The design and performance of a real-time CORBA scheduling service. Real-Time Systems, The International Journal of Time-Critical Computing Systems, Special Issue on Real-Time Middleware, 20: 117–154, 2001.
[39] C. Gill, D.C. Schmidt, and R. Cytron. Multi-paradigm scheduling for distributed real-time embedded computing. Proceedings of the IEEE, Special Issue on Modeling and Design of Embedded Software, 91: 183–197, 2003.
[40] I. Pyarali and D.C. Schmidt. An overview of the CORBA portable object adapter. ACM StandardView, 6: 30–43, 1998.
[41] M. Henning and S. Vinoski. Advanced CORBA Programming with C++. Addison-Wesley, Reading, MA, 1999.
[42] D.C. Schmidt and C. Cleeland. Applying a pattern language to develop extensible ORB middleware. In Design Patterns in Communications, L. Rising, Ed. Cambridge University Press, London, 2000.
[43] D.C. Schmidt, D.L. Levine, and C. Cleeland. Architectures and patterns for developing high-performance, real-time ORB endsystems. In Advances in Computers, M. Zelkovitz, Ed. Academic Press, New York, 1999.
[44] D.C. Schmidt and C.D. Cranor. Half-Sync/Half-Async: an architectural pattern for efficient and well-structured concurrent I/O. In Proceedings of the Second Annual Conference on the Pattern Languages of Programs. Monticello, IL, September 1995, pp. 1–10.
[45] D.C. Schmidt, M. Stal, H. Rohnert, and F. Buschmann. Pattern-Oriented Software Architecture: Patterns for Concurrent and Networked Objects, Vol. 2. John Wiley & Sons, New York, 2000.
[46] C. Liu and J. Layland. Scheduling algorithms for multiprogramming in a hard-real-time environment. Journal of the ACM, 20: 46–61, 1973.
[47] D.B. Stewart and P.K. Khosla. Real-time scheduling of sensor-based control systems. In Real-Time Programming, W. Halang and K. Ramamritham, Eds. Pergamon Press, Tarrytown, NY, 1992.
[48] D.C. Schmidt, S. Mungee, S. Flores-Gaitan, and A. Gokhale. Software architectures for reducing priority inversion and nondeterminism in real-time object request brokers. Journal of Real-Time Systems, Special Issue on Real-Time Computing in the Age of the Web and the Internet, 21: 77–125, 2001.
[49] K.M. Chandy and L. Lamport. Distributed snapshots: determining global states of distributed systems. ACM Transactions on Computer Systems, 3: 63–75, 1985.
[50] I. Pyarali, C. O'Ryan, D.C. Schmidt, N. Wang, V. Kachroo, and A. Gokhale. Using principle patterns to optimize real-time ORBs. IEEE Concurrency Magazine, 8: 16–25, 2000.
[51] V. Subramonian and C. Gill. A generative programming framework for adaptive middleware. In Proceedings of the Hawaii International Conference on System Sciences, Software Technology Track, Adaptive and Evolvable Software Systems Minitrack (HICSS 2003). Honolulu, HI, January 2003.
[52] D. McKinnon, D. Bakken, et al. A configurable middleware framework with multiple quality of service properties for small embedded systems. In Proceedings of the Second IEEE International Symposium on Network Computing and Applications. IEEE, April 2003.
[53] M. Román, R.H. Campbell, and F. Kon. Reflective middleware: from your desk to your hand. IEEE Distributed Systems Online, 2, 2001. http://csdl.computer.org/comp/mags/ds/2001/05/o5001abs.htm
[54] H. Kopetz. Real-Time Systems: Design Principles for Distributed Embedded Applications. Kluwer Academic Publishers, Norwell, MA, 1997.
[55] K. Kim. APIs enabling high-level real-time distributed object programming. IEEE Computer Magazine, Special Issue on Object-Oriented Real-Time Computing, 33(6), June 2000.
[56] K. Kim. Object structures for real-time systems and simulators. IEEE Computer Magazine, 30(8), August 1997.
[57] V. Subramonian, G. Xing, C. Gill, C. Lu, and R. Cytron. Middleware specialization for memory-constrained networked embedded systems. In Proceedings of the 10th IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS), 2004.
V Sensor Networks

31 Introduction to Wireless Sensor Networks S. Dulman, S. Chatterjea, and P. Havinga
32 Issues and Solutions in Wireless Sensor Networks Ravi Musunuri, Shashidhar Gandham, and Maulin D. Patel
33 Architectures for Wireless Sensor Networks S. Dulman, S. Chatterjea, T. Hoffmeijer, P. Havinga, and J. Hurink
34 Energy-Efficient Medium Access Control Koen Langendoen and Gertjan Halkes
35 Overview of Time Synchronization Issues in Sensor Networks Weilian Su
36 Distributed Localization Algorithms Koen Langendoen and Niels Reijers
37 Routing in Sensor Networks Shashidhar Gandham, Ravi Musunuri, and Udit Saxena
38 Distributed Signal Processing in Sensor Networks Omid S. Jahromi and Parham Aarabi
39 Sensor Network Security Guenter Schaefer
40 Software Development for Large-Scale Wireless Sensor Networks Jan Blumenthal, Frank Golatowski, Marc Haase, and Matthias Handy
31 Introduction to Wireless Sensor Networks
S. Dulman, S. Chatterjea, and P. Havinga
University of Twente
31.1 The Third Era of Computing
31.2 What Are Wireless Sensor Networks?
31.3 Typical Scenarios and Applications
31.4 Design Challenges
     Locally Available Resources • Diversity and Dynamics • Needed Algorithms • Dependability
31.5 Conclusions
References
Wireless sensor networks have gained a lot of attention lately. Due to technological advances, building small-sized, energy-efficient, reliable devices capable of communicating with each other and organizing themselves into ad hoc networks has become possible. These devices have brought a new perspective to the world of computers as we know it: they can be embedded into the environment in such a way that the user is unaware of them. There is no need for reconfiguration and maintenance, as the network organizes itself to inform the users of the most relevant detected events or to assist them in their activity. This chapter gives a brief overview of the whole area by introducing the wireless sensor network concepts to the reader. A number of applications and typical scenarios are then presented, to give a better understanding of the fields in which this new emerging technology can be applied. Up to this moment, several main areas of application have been identified; new areas are still to be discovered as the research and products grow more mature. Wireless sensor networks pose many challenges and often contradictory demands from the design point of view. The last part of the chapter is dedicated to highlighting the main directions of research in this field, and serves as a brief introduction to the problems described in the following chapters of the book.
31.1 The Third Era of Computing

Things are changing continuously in the world of computers. Everything started with the mainframe era: some 30 years ago, these huge devices were widely deployed, for example, within universities.
Many users shared a single mainframe computer among themselves. The computation power came at a high cost, with a huge machine requiring a lot of maintenance. Technology advanced as predicted by Moore's law and we stepped into the second era of computing. This period is still with us today, although it is slowly approaching its end. It is the era of personal computers: cheaper, smaller, and increasingly affordable. Quite often, the average user has access to and makes use of more than one computer, these machines being present now in almost any home and workplace.

But in this familiar environment things are starting to change, and the third era of computing gains more terrain each day. Let us take a look at the main trends today. Advances in technology cause personal computers to become smaller and smaller; desktop computers tend to be replaced by laptops and other portable devices. The main factor influencing this new transition is the availability of wireless communication technology. People are rapidly getting used to wireless communicating devices because of the independence they offer from fixed machines. The success and availability of the Internet brought even more independence to the user: data can now be available regardless of the physical location of its owner. The advances in technology did not stop here: processors became small and cheap enough to be found in almost any familiar device around us, starting with an everyday watch and ending with (almost) any home appliance we own. The efforts nowadays are directed at making these devices "talk" to each other and organize themselves into ad hoc networks to accomplish their design goals as fast and reliably as possible.

This is, in fact, the third computing age envisioned two decades ago by Mark Weiser [1]. Several names, such as ubiquitous computing, pervasive computing, ambient intelligence, invisible computing, disappearing computer, etc., were created to indicate different aspects of the new computing age (Mark Weiser himself defined it as "the calm technology, that recedes into the background of our lives"). The ubiquitous computing world reverses the view on the usage of computing power: instead of lots of users gathered around one mainframe computer, each user will now be using the services of several embedded networks. The user will be in the middle of the whole system, surrounded by an invisible intelligent infrastructure. The original functionality of objects and applications will be enhanced, and a continuous interaction will be present in a large variety of areas of daily life.
31.2 What Are Wireless Sensor Networks?

So what are wireless sensor networks, and where is their place in this new environment that is starting to "grow" around us? Wireless sensor networks is the generic name under which a broad range of devices hides. Basically, any collection of devices equipped with a processor, having sensing and communication capabilities, and able to organize themselves into a network created in an ad hoc manner falls into this category. The addition of wireless communication capabilities to sensors has increased their functionality dramatically. Wireless sensor networks bring monitoring capabilities that will forever change the way data is collected from the ambient environment.

Consider, for example, the traditional approach to monitoring a remote location for a given phenomenon, such as recording the geological activity, monitoring the chemical or biological properties of a region, or even monitoring the weather at a certain place. The old approach was the following: rather big and robust devices needed to be built. Besides the sensor pack itself, they had to contain a big power supply and local data storage capabilities. A team of scientists would have to travel together to the location to be monitored, place these expensive devices at predefined positions, and calibrate all the sensors. They would then come back after a certain amount of time to collect the sensed data. If by misfortune some hardware failed, nothing could be done about it, and the information about the phenomenon itself would be lost.
The new approach is to construct inexpensive, small-sized, energy-efficient sensing devices. As hundreds, thousands, or even more of these devices will be deployed, the reliability constraints on each of them are relaxed. No local data storage is needed anymore: the devices process the data locally and then transmit the observed characteristics of the phenomenon by wireless means to one or more access points connected to a computer network. Individual calibration of each sensor node is no longer needed, as it can be performed by localized algorithms [2]. Deployment also becomes easier, by randomly placing the nodes (e.g., simply throwing them from a plane) onto the monitored region.

With this example in mind, we can give a general description of a sensor node. The name sensor node will be used to describe a tiny device that has short-range wireless communication capability, a small processor, and several sensors attached to it. It may be powered by batteries, and its main function is to collect data from a phenomenon, collaborate with its neighbors, and forward its observations (a preprocessed version of the data, or even decisions) to the endpoint if requested. This is possible because its processor additionally contains the code that enables internode communication and the setup, maintenance, and reconfiguration of the wireless network. When referring to wireless communication, we have in mind mainly radio communication (other means, such as ultrasound and visible or infrared light, are also being used [3]). A sensor network is a network made up of large numbers of sensor nodes. By a large number we understand at this moment hundreds or thousands of nodes, but there is no exact upper bound on the number of sensors deployed.

Wireless sensor networks are one of the most important tools of the third era of computing. They are the simplest intelligent devices around, their main purpose being to monitor the environment surrounding us and to alert us of the main events happening. Based on the observations reported by these instruments, humans and machines can make decisions and act on them.
31.3 Typical Scenarios and Applications

At this moment a large variety of sensors exists. Sensors have been developed to monitor almost every aspect of the ambient world: lighting conditions, temperature, humidity, pressure, the presence or absence of various chemical or biological products, detection of presence and movement, etc. By networking large numbers of sensors and deploying them inside the phenomenon to be studied, we obtain a sensing tool far more powerful than any single sensor, one able to do sensing at a superior level. A first classification of wireless sensor networks can be made based on the complexity of the networks involved [4]:

"Intelligent" warehouse. Each item contained inside the warehouse will have a tag attached, monitored by the sensor nodes embedded into the walls and shelves. Based on the read data, knowledge of the spatial positioning of the sensors, and time information, the sensor network will offer information about the traffic of goods inside the building, create automatic inventories, and even perform long-term correlations on the read data. The need for manual product scanning will thus disappear. In this category we can include the scenario of the modern supermarket, where the selected products of the customers will automatically be identified at the exit of the supermarket. This scenario has the minimum complexity. The sensor nodes are placed at fixed positions, in a more or less random manner. The deployment area is easily accessible and some infrastructure (e.g., power supplies and computers) already exists. At the same time, the nodes operate in a "safe" environment, meaning that there are no major external factors that can influence or destroy them.

Environmental monitoring. This is the widest area of application envisioned up to now. A particular application in this category is disaster monitoring. The sensor nodes deployed in the affected areas will help humans estimate the effects of the disaster, build maps of the safe areas, and direct the human actions toward the affected regions. A large number of applications in this category address monitoring of wildlife. This scenario has an increased complexity. The area of deployment is no longer accessible in an easy manner and no longer safe for the sensor nodes. There is hardly any infrastructure present, and the nodes
have to be scattered around in a random manner, and the network might contain moving nodes. A larger number of nodes will also have to be deployed.

Very-large-scale sensor network applications. Consider the scenario of a large city where all the cars have integrated sensors. These sensor nodes will communicate with each other, collecting information about the traffic, routes, and special traffic conditions. On one hand, new information will be available to the driver of each car. On the other hand, a global view of the whole picture will also be available. The two main constraints that characterize this scenario are the large number of nodes and their high mobility. The algorithms employed will have to scale well and deal with a network whose topology changes continuously.

On the other hand, the authors of Reference 5 present a classification of sensor networks based on their area of application. It takes into consideration only the military, environment, health, home, and other commercial areas, and can be extended with additional categories, such as space exploration, chemical processing, and disaster relief.

Military applications. Factors such as rapid deployment, self-organization, and increased fault tolerance make wireless sensor networks a very good candidate for use in the military field. They are suited for deployment in battlefield scenarios due to the large size of the network and the automatic self-reconfiguration at the moment of destruction/unavailability of some sensor nodes [6]. Typical applications are: the monitoring of friendly forces, equipment, and ammunition; battlefield surveillance; reconnaissance of opposing forces and terrain; targeting and battle damage assessment; and nuclear, biological, and chemical attack detection and reconnaissance. A large number of projects have already been sponsored by the Defense Advanced Research Projects Agency (DARPA) [7].

Environmental applications. Several aspects of wildlife are being studied with the help of sensor networks. Existing applications include the following: monitoring the presence and the movement of birds, animals, and even insects; agriculture-related projects observing the conditions of crops and livestock; environmental monitoring of soil, water, and atmosphere contexts and pollution studies; etc. Other particular examples include forest fire monitoring, biocomplexity mapping of the environment, and flood detection. Ongoing projects at this moment include the monitoring of birds on Great Duck Island [8], the zebras in Kenya [9], and the redwoods in California [10]. The number of these applications is continuously increasing, as the first deployed sensor networks show the benefits of easy remote monitoring.

Healthcare applications. An increasing interest is being shown in the elderly population [11]. Sensor networks can help in several areas of the healthcare field. The monitoring can take place both at home and in hospitals. At home, patients can be under permanent monitoring, and the sensor networks will trigger alerts whenever there is a change in the patient's state. Systems that can detect patients' movement behavior at home, detect any fall, or remind them to take their prescriptions are being studied. Inside hospitals, sensor networks can also be used to track the position of doctors and patients (their status, or even errors in the medication), expensive hardware, etc. [12].

Home applications. The home is the perfect application domain for the pervasive computing field.
Imagine all the electronic appliances forming a network and cooperating to fulfill the needs of the inhabitants [13]. They will have to identify each user correctly, remember their preferences and habits, and, at the same time, monitor the entire house for unexpected events. The sensor networks also have an important role here, being the "eyes and the ears" that will trigger the actuator systems.

Other commercial applications. This category includes all the other commercial applications envisioned or already built that do not fit in the previous categories. Basically, they range from simple systems, such as environmental monitoring within an office, to more complex applications, such as managing inventory control and vehicle tracking and detection. Other examples include incorporating sensors into toys, thus detecting the position of the children in "smart" kindergartens [14], monitoring the material fatigue and the tensions inside the walls of a building, etc.

The number of research projects dedicated to wireless sensor networks has increased dramatically over the last years. A lot of effort has been invested in studying all possible aspects of wireless sensor networks.
TABLE 31.1  List of Sensor Networks Related Research Projects

Project name      Research area
CoSense [15]      Collaborative sensemaking (target recognition, condition monitoring)
EYES [16]         Self-organizing, energy-efficient sensor networks
PicoRadio [17]    Develop low cost, energy-efficient transceivers
SensoNet [18]     Protocols for sensor networks
Smart Dust [19]   Cubic millimeter sensor nodes
TinyDB [20]       Query processing system
WINS [21]         Distributed network access to sensors, controls, and processors

TABLE 31.2  Current Sensor Networks Companies List

Company name          Headquarters location   HTTP address
Ambient Systems       The Netherlands         http://www.ambient-systems.net
CrossBow              San Jose, CA            http://www.xbow.com
Dust Networks         Berkeley, CA            http://dust-inc.com
Ember                 Boston, MA              http://www.ember.com
Millennial Net        Cambridge, MA           http://www.millennial.net
Sensoria Corporation  San Diego, CA           http://www.sensoria.com
Xsilogy               San Diego, CA           http://www.xsilogy.com
Please refer to Table 31.1 for a few examples. A number of companies have also been created, most of them start-ups from the universities that perform research in the field. Some of the names in the field, valid at the date of writing this document, are listed in Table 31.2.
31.4 Design Challenges

When designing a wireless sensor network, one faces, on the one hand, the simplicity of the underlying hardware and, on the other hand, the requirements that have to be met. In order to satisfy them, new strategies and new sets of protocols have to be developed [22–24]. In the following paragraphs we address the main challenges present in the wireless sensor network field, together with the research directions involved and the open questions that still need to be answered. To begin with, a high-level description of the current goals for sensor networks can be synthesized as:

Long life. The sensor node should be able to "live" as long as possible using its own batteries. This constraint can be translated to a power consumption <100 µW. The condition arises from the assumption that the sensor nodes will be deployed in a harsh environment where maintenance is either impossible or has a prohibitively high price. It makes sense to maximize the battery lifetime (unless the sensor nodes use some form of energy scavenging). The targeted lifetime of a node powered by two AA batteries is a couple of years. This goal can be achieved only by applying a strict energy policy that makes use of power-saving modes and dynamic voltage scaling techniques.

Small size. The size of the device should be <1 mm³. This constraint gave the sensor nodes the name of smart dust, a name that gives a very intuitive idea about the final design. Recently, the processor and the radio were integrated in a chip of ∼1 mm³. What is left is the antenna, the sensors themselves, and the battery. Advances are required in each of these three fields in order to meet this design constraint.

Inexpensive. The third high-level design constraint concerns the price of these devices. In order to encourage large-scale deployment, this technology must be really cheap, meaning that the targeted prices are in the range of a couple of cents.
31.4.1 Locally Available Resources

Wireless sensor networks consist of thousands of devices working together. Their small size comes with the disadvantage of very limited resource availability (limited processing power, low-rate unreliable wireless communication, small memory footprint, and low energy). This raises the issue of designing a new set of protocols across the whole system.

Energy is of special importance and can by far be considered the most important design constraint. The sensor nodes will be mainly powered by batteries. In most scenarios, due to the environment where they will be deployed, it will be impossible to have a human change their batteries. In some designs, energy-scavenging techniques will also be employed. Still, the amount of energy available to the nodes can be considered limited, and this is why the nodes will have to employ energy-efficient algorithms to maximize their lifetime. Taking a look at the characteristics of the sensor nodes, we notice that energy is spent on three main functions: environment sensing, wireless communication, and local processing. Each of these three components has to be optimized to obtain minimum energy consumption. For the environment-sensing component, the most energy-efficient sensors available have to be used; from this point of view, we can regard this component as a function of a specific application and a given sensor technology.

The energy needed for transmitting data over the wireless channel dominates by far the energy consumption inside a sensor node. More than that, it has been shown that it is more efficient to use a short-range multihop transmission scheme than to send data over large distances [5]. A new strategy, characteristic of sensor networks, was developed based on a trade-off between the last two components (see, e.g., the techniques developed in References 25 and 26). Instead of blindly routing packets through the network, the sensor nodes act based on the content of the packet [27]. Let us suppose that a certain event took place. All nodes that sensed it will characterize the event with some piece of data that needs to be sent to the interested nodes. There will be many similar data packets, or at least some redundancy will exist in the packets to be forwarded. In order to reduce the traffic, each node on the communication path will examine the contents of the packet it has to forward and aggregate all the data related to a particular event into one single packet, eliminating the redundant information (a sketch of this mechanism appears at the end of this subsection). The reduction of traffic by using this mechanism is substantial. Another consequence is that the user will not receive any raw data, but only high-level characterizations of the events. This makes us think of the sensor network as a self-contained tool, a distributed network that collects and processes information.

From an algorithmic point of view, the local strategies employed by sensor nodes have as a global goal to extend the overall lifetime of the network. The notion of network lifetime usually hides one of the following interpretations: the time elapsed between power-on and a particular event, such as the energy depletion of the first node, the energy depletion of 30% of the nodes, or even the moment when the network is split into several subnetworks.
No matter which of these interpretations is used, the nodes will choose to participate in the collaborative protocols following a strategy that maximizes the overall network lifetime. To meet the goal of prolonged lifetime, each sensor node should:
• Spend all the idle time in a deep power-down mode, thus using an insignificant amount of energy.
• When active, employ scheduling schemes that take voltage and frequency scaling into consideration.
It is interesting to note the contradiction between current wireless industry trends and the requirements of wireless sensor nodes. The industry focuses at the moment on acquiring more bits/sec/Hz, while the sensor nodes need more bits/euro/nJ. From the transmission range point of view, the sensor nodes need only a limited transmission range, so that they can operate at an optimally calculated energy consumption, while the industry is interested in delivering higher transmission ranges for the radios. Moreover, the radios
designed nowadays tend to be as reliable as possible, while a wireless sensor network is built on the assumption that failures are a regular event.

Energy is not the only resource the sensor nodes have to worry about. The processing power and memory are also limited. Large local data storage cannot be employed, so strategies need to be developed to store the most important data in a distributed fashion and to report the important events to the outside world. A feature that helps in dealing with these issues is the heterogeneity of the network. There might be several types of devices deployed: resource-poor nodes will be able to ask more powerful nodes to perform complicated computations, and several nodes could associate to perform the computations in a distributed fashion.

Bandwidth is also a constraint when dealing with sensor networks. The low-power communication devices used (most of the time, radio transceivers) can only work in simplex mode. They offer low data rates, also because they function in the free unlicensed bands, where traffic is strictly regulated.
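As an illustration of the content-based aggregation strategy described earlier in this subsection, the following sketch shows a forwarding node merging readings that describe the same event into one packet. The event ids and the merge rule (here: keep the strongest observation) are assumptions made for the example.

#include <map>
#include <vector>

// Illustrative reading describing a detected event.
struct Reading {
  int event_id;  // which detected event this reading describes
  double value;  // e.g., measured intensity
};

// Merge all incoming readings about the same event into a single entry,
// eliminating the redundant information before forwarding.
std::vector<Reading> aggregate(const std::vector<Reading>& incoming) {
  std::map<int, double> best;  // one entry per event
  for (const Reading& r : incoming) {
    auto it = best.find(r.event_id);
    if (it == best.end() || r.value > it->second)
      best[r.event_id] = r.value;  // keep the strongest observation
  }
  std::vector<Reading> out;
  for (const auto& [id, v] : best)
    out.push_back({id, v});
  return out;  // one forwarded packet per event, not one per sensing node
}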
31.4.2 Diversity and Dynamics

As we already suggested, there may be several kinds of sensor nodes present inside a single sensor network. We can talk of heterogeneous sensor nodes from the point of view of both hardware and software. From the hardware point of view, it seems reasonable to assume that the number of devices of a certain kind will be inversely proportional to the capabilities they offer. We can expect a tiered architecture design, where the resource-poor nodes ask more powerful or specialized nodes to make more accurate measurements of a certain detected phenomenon, to perform resource-intensive operations, or even to help in transmitting data over a larger distance. Diversity can also refer to sensing several parameters and then combining them into a single decision, in other words, to performing data fusion. We are talking about assembling information from different kinds of sensors, such as light, temperature, sound, smoke, etc., to detect, for example, whether a fire has started.

Sensor nodes will be deployed in the real world, most probably in harsh environments. This puts them in contact with an environment that is dynamic in many senses and has a big influence on the algorithms that the sensor nodes should execute. First of all, the nodes will be deployed in a random fashion in the environment and, in some cases, some of them will be mobile. Second, the nodes will be subject to failures at random times, and they will also be allowed to change their transmission range to better suit their energy budget. This leads to the full picture of a network topology in continuous change. One characteristic of the algorithms for wireless sensor networks is that they do not require a predefined, well-known topology.

One more consequence of real-world deployment is that there will be many factors influencing the sensors in contact with the phenomenon. Individual calibration of each sensor node will not be feasible, and probably will not help much, as the external conditions will be changing continuously. The sensor network will calibrate itself in response to the changes in the environment conditions. More than this, the network will be capable of self-configuration and self-maintenance.

Another issue we need to mention is the dynamic nature of the wireless communication medium. Wireless links between nodes can periodically appear or disappear due to the particular position of each node. Bidirectional links will coexist with unidirectional ones, and this is a fact that the algorithms for wireless sensor networks need to consider.
31.4.3 Needed Algorithms

For a sensor network to work as a whole, some building blocks need to be developed and deployed in the vast majority of applications. Basically, they are: a localization mechanism, a time synchronization mechanism, and some form of distributed signal processing. A simple justification is that data hardly has any meaning unless position and time values are available with it, and full, complex signal processing done separately at each node is not feasible due to the resource constraints.
The self-localization of sensor nodes has gained a lot of attention lately [28–31]. It came as a response to the fact that global positioning systems are not a solution: they have a high cost (in terms of money and resources) and are unavailable, or provide imprecise positioning information, in special environments such as indoors. Information such as connectivity, distance estimation based on radio signal strength, sound intensity, time of flight, and angle of arrival has been used with success in determining the position of each node to varying degrees of accuracy using only localized computation. The position information, once obtained, was used not only for characterizing the data but also in designing the networking protocols, for example, leading to more efficient routing schemes based on the estimated positions of the nodes [32].

The second important building block is the timing and synchronization block. Nodes will be allowed to function in a sleep mode for long periods of time, so periodic wake-up intervals need to be computed within a certain precision. Moreover, a notion of local time and synchronization with the neighbors is needed for the communication protocols to perform well. Lightweight algorithms have been developed that allow fast synchronization between neighboring nodes using a limited number of messages. Loose synchronization will be used, meaning that each pair of neighboring nodes is synchronized within a certain bound, while nodes situated multiple hops away might not be synchronized at all. A global notion of time might not be needed at all in most applications. Because many applications measure natural phenomena, such as temperature, where delays on the order of seconds can be tolerated, trading latency for energy is preferred.

The last important block is the signal processing unit. A new class of algorithms has to be developed due to the distributed nature of wireless sensor networks. In their vast majority, existing signal processing algorithms are centralized: they require large computation power and the availability of all the data at the same time. Transmitting all the recorded data to all nodes is impossible in a dense network even from a theoretical point of view, not to mention the energy needed for such an operation. The new distributed signal processing algorithms have to take into account the distributed nature of the network, the possible unavailability of data from certain regions due to failures, and the time delays that might be involved.
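To make the radio-signal-strength option concrete, the sketch below inverts the log-distance path-loss model to turn a received power reading into a rough range estimate. This is an illustrative sketch, not a method taken from the references cited above; the reference power, reference distance, and path-loss exponent are assumed values that a real deployment would calibrate.

```c
#include <math.h>
#include <stdio.h>

/* Invert the log-distance model Pr(d) = P0 - 10 * n * log10(d / d0) for d. */
double rssi_to_distance(double rssi_dbm, double p0_dbm, double d0_m, double n)
{
    return d0_m * pow(10.0, (p0_dbm - rssi_dbm) / (10.0 * n));
}

int main(void)
{
    double p0 = -45.0;  /* assumed power measured at reference distance d0 = 1 m */
    double n  = 3.0;    /* assumed path-loss exponent, environment dependent     */
    printf("rssi -75 dBm -> ~%.1f m\n", rssi_to_distance(-75.0, p0, 1.0, n));
    return 0;
}
```

In practice such estimates are noisy, which is why they are typically combined with connectivity, time of flight, or angle of arrival, as the text above notes.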
31.4.4 Dependability

More than any other sort of computer network, wireless sensor networks are subject to failures. Unavailability of services should be considered "a feature" of these networks, or a "regular event," rather than a sporadic and highly improbable occurrence. The probability of something going wrong is at least several orders of magnitude higher than in all other computer networks. All the algorithms have to employ some form of robustness against the failures that might affect them. On the other hand, this comes at the cost of energy, memory, and computation power, so it has to be kept at a minimum.

An interesting issue is that of the system architecture from the protocols' point of view. In traditional computer networks, each protocol layer is designed for the worst-case scenario. This scenario hardly ever happens simultaneously for all the layers, and a combination of lower-layer protocols could eliminate it; layer-by-layer worst-case design thus leads to a lot of redundancy in the sensor node, redundancy that costs important resources. The preferred approach is cross-layer design, studying the sensor node as a whole rather than as separate building blocks. This opens a discussion on what the right architecture for all sensor networks is, and whether a solution that fits all scenarios makes sense at all.

Let us summarize the sources of errors the designer will be facing. Nodes will stop functioning starting even with the (rough) deployment phase. The harsh environment will continuously degrade the performance of the nodes, making them unavailable as time passes. The wireless communication medium will be an important factor disturbing message exchange, affecting the links and implicitly the network topology. Even in a perfect environment, collisions will occur due to imprecise local time estimates and lack of synchronization. Finally, probabilistic scheduling policies and protocol implementations can themselves be regarded as sources of errors.
Another issue that can be addressed as a dependability attribute is security. The communication channel is open and cannot be physically protected. This means that others are able to intercept and disrupt transmissions, or even to inject their own data. In addition to accessing private information, a third party could act as an attacker seeking to disrupt the correct functionality of the network. Security in a sensor network is a hard problem that still needs to be solved. Like almost any other protocol in this sort of network, it has contradictory requirements: the schemes employed should be as lightweight as possible while achieving the best results. The usual protection schemes require too much memory and computation power to be employed (the keys themselves are sometimes too big to fit into the limited available memory).

A real problem is how to control the sensor network itself. The sensor nodes will be too many to be individually accessible to a single user and might also be deployed in an inaccessible environment. By control we mean issues such as deployment and installation, configuration, calibration and tuning, maintenance, discovery, and reconfiguration. Debugging the code running in the network is completely infeasible, as at any point inside it the user has access only to high-level aggregated results. The only real debugging and testing can be done with simulators, which prove to be invaluable resources in the design and analysis of sensor networks.
31.5 Conclusions

This chapter was a brief introduction to the new field of wireless sensor networks, providing a short overview of the main characteristics of this new set of tools that will soon enhance our perception of the ambient world. The major challenges have been identified, some initial steps have been taken, and early prototypes are already working. The following chapters of the book focus on particular issues, giving more insight into the current state of the art in the field. Research in this area will certainly continue, and there may come a time when sensor networks are deployed all around us and become regular instruments available to everyone.
References

[1] Weiser, M. The computer for the 21st century. Scientific American, 265, 66–75, 1991.
[2] Whitehouse, K. and Culler, D. Calibration as parameter estimation in sensor networks. In Proceedings of the ACM International Workshop on Wireless Sensor Networks and Applications (WSNA'02). Atlanta, GA, 2002.
[3] Want, R., Hopper, A., Falcao, V., and Gibbons, J. The active badge location system. ACM Transactions on Information Systems, 10, 91–102, 1992.
[4] Estrin, D., Govindan, R., Heidemann, J., and Kumar, S. Next century challenges: scalable coordination in sensor networks. In Proceedings of the International Conference on Mobile Computing and Networking. ACM/IEEE, Seattle, WA, 1999, pp. 263–270.
[5] Akyildiz, I., Su, W., Sankarasubramaniam, Y., and Cayirci, E. Wireless sensor networks: a survey. Computer Networks Journal, 38, 393–422, 2002.
[6] Brooks, R.R., Ramanathan, P., and Sayeed, A.M. Distributed target classification and tracking in sensor networks. Proceedings of the IEEE, 91, 1163–1171, 2003.
[7] DARPA. http://www.darpa.mil/body/off_programs.html.
[8] Polastre, J., Szewczyk, R., and Culler, D. Analysis of wireless sensor networks for habitat monitoring. In Wireless Sensor Networks, C.S. Ragavendra, K.M. Sivalingam, and T. Znati, Eds. Kluwer Academic Publishers, Dordrecht, 2004.
[9] Juang, P., Oki, H., Wang, Y., Martonosi, M., Peh, L., and Rubenstein, D. Energy-efficient computing for wildlife tracking: design tradeoffs and early experiences with ZebraNet. In Proceedings of the Tenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-X). San Jose, CA, 2002.
[10] Yang, S. Redwoods go high-tech: researchers use wireless sensors to study California's state tree. UC Berkeley News, 2003.
[11] IEEE Computer Society. Pervasive Computing, 3, Successful Aging, 2004.
[12] Baldus, H., Klabunde, K., and Muesch, G. Reliable set-up of medical body-sensor networks. In Proceedings of the First European Workshop on Wireless Sensor Networks (EWSN 2004). Berlin, Germany, 2004.
[13] Basten, T., Geilen, M., and Groot, H. Omnia fieri possent. In Ambient Intelligence: Impact on Embedded System Design. Kluwer Academic Publishers, Dordrecht, 2003, pp. 1–8.
[14] Srivastava, M., Muntz, R., and Potkonjak, M. Smart kindergarten: sensor-based wireless networks for smart developmental problem-solving environments (challenge paper). In Proceedings of the Seventh Annual International Conference on Mobile Computing and Networking. ACM, Rome, Italy, 2001, pp. 132–138.
[15] CoSense. http://www2.parc.com/spl/projects/ecca.
[16] Eyes. http://eyes.eu.org.
[17] PicoRadio. http://bwrc.eecs.berkeley.edu/research/pico_radio.
[18] SensoNet. http://users.ece.gatech.edu/~weilian/sensor/index.html.
[19] SmartDust. http://robotics.eecs.berkeley.edu/pister/smartdust.
[20] TinyDB. http://telegraph.cs.berkeley.edu/tinydb.
[21] WINS. http://www.janet.ucla.edu/wins.
[22] Estrin, D., Culler, D., Pister, K., and Sukhatme, G. Connecting the physical world with pervasive networks. IEEE Pervasive Computing, 1, 59–69, 2002.
[23] Akyildiz, I., Su, W., Sankarasubramaniam, Y., and Cayirci, E. A survey on sensor networks. IEEE Communications Magazine, 40, 102–114, 2002.
[24] Pottie, G.J. and Kaiser, W.J. Wireless integrated network sensors. Communications of the ACM, 43, 51–58, 2000.
[25] Chlamtac, I., Petrioli, C., and Redi, J. Energy-conserving access protocols for identification networks. IEEE/ACM Transactions on Networking, 7, 51–59, 1999.
[26] Schurgers, C., Raghunathan, V., and Srivastava, M.B. Power management for energy-aware communication systems. ACM Transactions on Embedded Computing Systems, 2, 431–447, 2003.
[27] Intanagonwiwat, C., Govindan, R., Estrin, D., Heidemann, J., and Silva, F. Directed diffusion for wireless sensor networks. IEEE/ACM Transactions on Networking, 11, 2003.
[28] Bulusu, N., Heidemann, J., and Estrin, D. GPS-less low-cost outdoor localization for very small devices. IEEE Personal Communications, 2000, pp. 28–34.
[29] Doherty, L., Pister, K., and Ghaoui, L. Convex position estimation in wireless sensor networks. In IEEE INFOCOM. Anchorage, AK, 2001.
[30] Langendoen, K. and Reijers, N. Distributed localization in wireless sensor networks: a quantitative comparison. Computer Networks, Special Issue on Wireless Sensor Networks, 2003.
[31] Evers, L., Dulman, S., and Havinga, P. A distributed precision based localization algorithm for ad hoc networks. In Proceedings of Pervasive Computing (PERVASIVE 2004), 2004.
[32] Zorzi, M. and Rao, R. Geographic random forwarding (GeRaF) for ad hoc and sensor networks: energy and latency performance. IEEE Transactions on Mobile Computing, 2(4), 337–348, 2003.
32 Issues and Solutions in Wireless Sensor Networks

Ravi Musunuri, Shashidhar Gandham, and Maulin D. Patel
University of Texas at Dallas

32.1 Introduction
    Sensor Networks versus Mobile ad hoc Networks
32.2 System Models
    Operational Model • Radio Propagation Model
32.3 Design Issues in Sensor Networks
32.4 MAC Layer Protocols
32.5 Routing
    Flat Routing Protocols • Cluster-Based Routing Protocols
32.6 Other Important Issues
    Security • Location Determination • Lifetime Analysis • Power Management • Clock Synchronization • Reliability • Sensor Placement and Organization for Coverage and Connectivity • Topology Control
32.7 Conclusions
References
32.1 Introduction

Due to advances in integrated circuit (IC) fabrication technology and Microelectromechanical Systems (MEMS) [1,2], it is now commercially feasible to manufacture ICs with sensing, signal processing, memory, and other relevant components built into them. Such ICs, enabled with RF communication, bring forth a new kind of network that is self-organizing and application specific. These networks are referred to as wireless sensor networks. A sensor network is a static ad hoc network consisting of hundreds of sensor nodes deployed on the fly for unattended operation. Each node consists of [3–5] sensors, a processor, memory, a radio, a limited-power battery, and software components such as an operating system and protocols. The architecture of a sensor node depends entirely on the purpose of the deployment, but it can be generalized [2] as shown in Figure 32.1.
[Figure 32.1 Sensor node architecture: a battery with AC/DC converter powering the sensor, the processor (CPU with memory, operating system, and other software), and the radio.]
Sensor nodes are expected to monitor some surrounding environmental phenomenon, process the data obtained, and forward this data toward a base station located on the periphery of the sensor network. Wireless sensor networks have numerous applications in fields such as surveillance, security, environmental monitoring, habitat monitoring, smart spaces, precision agriculture, inventory tracking, and healthcare [4].

The main advantage of sensor networks is their ability to be deployed in almost any kind of remote terrain. Their unattended mode of operation makes them preferable to ground-based radar systems [5]. The spatial distribution of sensor nodes ensures a greater signal-to-noise ratio (SNR) by combining signals from various sensors. Furthermore, a higher level of redundancy allows greater fault tolerance. As sensor nodes are expected to be manufactured at very low cost, they can be deployed in large numbers. As a result, sensor networks can provide a large coverage area through the union of the individual nodes' coverage areas. Since sensor nodes are expected to be deployed close to the object of interest, obstruction of the line of sight for sensing is ruled out.

To illustrate the above-mentioned advantages, consider an example of seismic detection [4]. The earth generates seismic noise, which becomes attenuated and distorted with distance. Hence, to increase the probability of detection, it is advisable to have sensors close to the source. Doing so, however, requires knowing in advance the exact location and time of the seismic activity, which is precisely the goal of deploying the sensors. If a distributed network of sensors is deployed across the entire geographical area of interest, there is no need to pinpoint in advance the locations where sensors must be placed.
32.1.1 Sensor Networks versus Mobile ad hoc Networks

Wireless sensor networks differ significantly from mobile ad hoc networks (MANETs) [6] for the following reasons:

Mode of communication. In MANETs, potentially any node can send data to any other node. In sensor networks, the mode of communication is restricted. In general, the base station broadcasts commands to all sensor nodes in its network, and sensor nodes send sensed data back to the base station. Sometimes sensor nodes may need to forward sensed data through other sensor nodes if the base station is not reachable directly. Depending on the application, some sensor networks employ data aggregation at designated nodes to reduce bandwidth usage. Since most of a sensor network's messages are routed to base stations, sensor nodes need not maintain explicit routing tables.

Node mobility. In MANETs every node can move. In general, sensor nodes are static, though some architectures have mobile base stations [7].

Energy. Nodes in MANETs have a rechargeable source of energy, so energy conservation is of secondary importance. However, sensor networks consist of several hundreds of nodes, which need to
operate in remote terrain. Battery replacement is therefore not possible, which makes energy efficiency critical for sensor networks.

Apart from the above-mentioned differences, sensor nodes have lower computational power and lower cost than MANET nodes. Protocols designed for sensor networks should also be more scalable, since nodes are expected to be deployed in the hundreds.

The remaining part of this chapter is organized as follows. In Section 32.2, we describe system models used in the literature. Section 32.3 presents design issues in sensor networks. In Section 32.4, medium access layer issues and a few solutions proposed in the literature are described. In Section 32.5, we move on to flat routing protocols and hierarchical routing protocols. We then describe other important issues, such as security, location determination, lifetime analysis, power management, and clock synchronization.
32.2 System Models

The various system models proposed in the literature can be classified based on the following factors:
• Mobility of base stations
• Number of base stations
• Method of organization (hierarchical/flat)

System models considered by researchers until now consist of static sensor nodes randomly deployed in a geographical area of interest, often referred to as the sensor field. Most of the models considered have had a single, static base station [6, 8–11]. In Reference 12, the author evaluates the best position at which to locate a base station and proposes to split large sensor networks into small squares, moving the base station to the center of each square to collect data. In Reference 7, the authors propose to deploy multiple, intermittently mobile base stations to increase the lifetime of the sensor network.
32.2.1 Operational Model

Research on sensor networks to date has considered various operational models of the sensor nodes. These models can be broadly classified as follows:

Active. In active sensor networks [6,8–11,13] each sensor node senses its environment continuously. Based on how frequently this sensed data is forwarded toward the base station, such networks can be further classified as:
• Periodic: Based on the application for which the sensor network is deployed, it might be required to gather the data from every sensor node periodically [8,10].
• Event driven: Sensor networks that are deployed for monitoring specific events gather data only when the event of interest occurs [11,13,14]. For example, sensor nodes deployed to monitor seismic activity in a region need to route data only when they detect seismic currents in their proximity.

Passive. In passive sensor networks, data forwarding is triggered by a query from the base station. Passive sensor networks can be further classified as follows:
• Energized on query: In this operation mode, sensor nodes switch off their sensors most of the time. Only when a query is generated for the data does a sensor node switch on its sensor and record the data to be forwarded.
• Always sensing: This category of sensor nodes has its sensors running all the time. As soon as there is a query for data, a sensor node generates a data packet based on the observations so far and forwards the packet.
32.2.2 Radio Propagation Model

Most researchers have assumed that the energy spent in transmission over the wireless medium follows the first-order radio model [8,11]. In this model, the energy required to transmit a signal has a fixed part and a variable part; the variable part is directly proportional to the square of the distance. A constant amount of energy is required to receive a signal at the receiving antenna.
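For reference, the first-order radio model is commonly written in the following form (a sketch of the parameterization usually cited in the literature, not a formula given in this chapter; values such as E_elec = 50 nJ/bit and ε_amp = 100 pJ/bit/m² are often assumed):

```latex
E_{Tx}(k, d) = E_{elec}\,k + \varepsilon_{amp}\,k\,d^{2},
\qquad
E_{Rx}(k) = E_{elec}\,k
```

where k is the message size in bits, d is the transmission distance, E_elec is the per-bit cost of the transmitter and receiver electronics, and ε_amp is the per-bit, per-m² cost of the transmit amplifier.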
32.3 Design Issues in Sensor Networks

Most sensor networks encounter operational challenges [15], such as ad hoc deployment, limited energy supply, dynamic environmental conditions, and unattended mode of operation. Any solution proposed for sensor networks should consider the following design issues:

Energy. Each sensor node is equipped with a limited supply of battery energy. Sensor nodes spend more energy on communication than on local computation.¹ As sensor nodes are deployed in large numbers, it is not feasible to manually recharge the batteries. Thus, sensor nodes should conserve their energy by minimizing the number of messages that are to be transmitted. Based on the energy source, sensor nodes can be classified as below:
• Rechargeable: Sensor nodes equipped with solar cells can recharge their batteries when sunlight is available. For such sensor nodes, the main design criterion is to maximize the number of nodes operational during times when no sunlight is available.
• Nonrechargeable: Sensor nodes equipped with nonrechargeable batteries will cease to operate once they drain their energy. Thus, the main design issue in such sensor networks is to maximize the operational time of every sensor node.

Bandwidth. Sensor nodes need to communicate over the ISM (industrial, scientific, and medical) band. When many nodes attempt to use the same communication frequency, the available bandwidth must be used optimally.

Limited computation power and memory. As the processing power at each sensor node is limited, proposed solutions for sensor networks should not expect sensor nodes to carry out computationally intensive tasks.

Unpredictable reliability and failure models. Sensor networks are expected to be deployed in inaccessible and hostile environments. As a result, it is possible for sensor nodes to crash or malfunction due to external environmental factors. Proposed solutions should be based on failure models that account for such possibilities. Furthermore, failure of a few nodes should not bring down the network.

Scalability. Sensor nodes are expected to be deployed in the thousands. As a result, scalability is a critical issue in the design of sensor networks. Any solution proposed should be scalable to large-sized sensor networks.

Timeliness of action (latency). Latency is an important issue in sensor networks deployed for critical applications, such as security and surveillance. Hence, the time elapsed between the moment an event is detected and the moment it is reported at the base station is to be minimized.

To address these design challenges, several strategies, such as cooperative signal processing, exploiting redundancy, adaptive signal processing, and hierarchical architecture, are going to be key building blocks for sensor networks [3]. In the near future, we believe that sensor networks will find wide acceptance in day-to-day activities, much as computers have. To attain such wide-scale acceptance, sensor nodes should be affordable, easily available, easily configurable (plug and play), and deployable. To accomplish these objectives we need to come up with suitable Medium Access Control (MAC) layer protocols, routing protocols, location discovery algorithms, power-management strategies, and solutions to other relevant problems. Some of these design problems are well studied by researchers. In the following sections, we present a brief overview of existing solutions for each of these design problems.

¹ To take an example of ground-to-ground communication [6]: it takes 3 J of energy to transmit 1 Kb of data over a distance of 100 m. A general-purpose processor with a processing capability of 100 million instructions per second would execute 300 million instructions for the same amount of energy.
32.4 MAC Layer Protocols

The Medium Access Control (MAC) layer provides topology information and channel allocation to the higher layers in the protocol stack. Channel allocation is critical for energy-efficient functioning of the link layer. Energy efficiency and scalability [16] are the main issues in developing MAC protocols for sensor networks. Fairness, latency, and throughput are also important performance measures for channel allocation algorithms. A channel could be a time slot in Time Division Multiple Access (TDMA), a frequency band in Frequency Division Multiple Access (FDMA), or a code in Code Division Multiple Access (CDMA). Channel allocation algorithms should try to avoid energy wastage through:
• Collisions: Two or more nodes, which are in direct transmission range of each other, transmit packets on the same channel.
• Overhearing: Nodes receive data destined for other nodes.
• Idle listening: Unnecessarily listening to the channel when there are no packets to be received.
• Control packets: Bandwidth wastage due to the exchange of too many control packets.

The existing solutions to channel allocation in ad hoc networks can be divided into two categories: contention-based and contention-free methods. In contention-based solutions, the sender continuously senses the medium. IEEE 802.11 Distributed Coordination Function (DCF), MACAW [17], and PAMAS [18] are examples of contention-based protocols. Contention-based schemes are not suitable for sensor networks because of the energy wasted on collisions and idle listening [19]. Sensor networks should instead use organized methods for channel allocation. Organized methods determine the network topology first and then assign channels to the links. A channel assignment should avoid co-channel interference, that is, it should prevent two adjacent links from being assigned the same channel. A sensor network channel allocation algorithm should be distributed, because network-wide synchronization for the calculation of a schedule would be an energy-intensive procedure. Another reason to favor distributed algorithms is that they scale well with an increase in network size and are robust to network partitions and node failures.

In Reference 6, the authors proposed a Self-organizing Medium Access Control for Sensor networks (SMACS) protocol. SMACS is a distributed protocol that enables nodes to discover their neighbors and build a network topology for communication. SMACS builds a flat topology, that is, there are no clusters or cluster heads. In SMACS, each node allocates channels to links between itself and its neighbors within a TDMA frame referred to as a superframe. In a given time slot, every node communicates with only one neighbor to avoid interference. Nodes communicate intermittently and hence can power themselves off when they have no data to send. The superframe schedule is divided into two periods: in the first, bootup, period, nodes try to discover neighbors and rebuild severed links; the second period is reserved for communication between nodes. The authors of Reference 6 also proposed an Eavesdrop and Register (EAR) protocol to handle channel allocation with moving base stations. In Piconet [20], the authors used a periodic sleep cycle to save energy: if a node wants to communicate with a neighbor, it has to wait until it receives a broadcast message from that neighbor. Ye et al. [16] proposed an energy-efficient MAC protocol known as S-MAC, which saves energy by avoiding collisions, overhearing, and idle listening, at the cost of increased latency.
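The sketch below illustrates the general idea of per-neighbor TDMA slots with sleeping in unassigned slots. It is an invented illustration, not the actual SMACS superframe format; the slot count and slot assignments are made up.

```c
#include <stdio.h>

#define SUPERFRAME_SLOTS 16

typedef enum { SLOT_SLEEP = 0, SLOT_TX, SLOT_RX } slot_role_t;

/* Hypothetical per-neighbor schedule filled in during neighbor discovery;
 * every slot not assigned to a link defaults to SLOT_SLEEP. */
static const slot_role_t schedule[SUPERFRAME_SLOTS] = {
    [2]  = SLOT_TX,   /* talk to neighbor A   */
    [7]  = SLOT_RX,   /* listen to neighbor B */
    [11] = SLOT_TX,   /* talk to neighbor C   */
};

slot_role_t role_now(unsigned long tick)
{
    return schedule[tick % SUPERFRAME_SLOTS];   /* radio off in all other slots */
}

int main(void)
{
    for (unsigned long t = 0; t < SUPERFRAME_SLOTS; t++)
        if (role_now(t) != SLOT_SLEEP)
            printf("slot %lu: %s\n", t, role_now(t) == SLOT_TX ? "TX" : "RX");
    return 0;
}
```

The energy saving comes from the default case: a node that owns only a few slots per superframe can keep its radio powered down for the rest of the frame.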
32.5 Routing

As stated earlier, each sensor node is expected to monitor some environmental phenomenon and forward the corresponding data toward the base station. To forward the data packets, each node needs routing information. Note that the flow of packets is mostly directed from sensor nodes toward the base station; as a result, each sensor node need not maintain explicit routing tables. Routing protocols can, in general, be divided into flat and cluster-based routing protocols.
32.5.1 Flat Routing Protocols

In flat routing protocols the nodes in the network are considered homogeneous. Each node participates in route discovery, route maintenance, and forwarding of data packets. Here, we describe a few existing flat routing protocols for sensor networks.

Sequential Assignment Routing (SAR) [6] takes into consideration the energy and Quality of Service (QoS) of each path, and the priority level of each packet, when making routing decisions. Every node maintains multiple paths to the sink to avoid the overhead of route recomputation on node or link failure.

Estrin et al. [21] proposed a diffusion-based scheme for routing queries from the base station to sensor nodes and forwarding the corresponding replies. In directed diffusion, attribute-based naming is used by the sensor nodes: each sensor names the data it generates using one or more attributes. A sink may query for data by disseminating interests, which intermediate nodes propagate. Interests establish gradients that draw data toward the sink that expressed the interest.

The minimum cost forwarding approach proposed by Ye et al. [9] exploits the fact that the dataflow in sensor networks is in a single direction, always toward the fixed base station. Their method requires neither that sensor nodes have unique identities nor that they maintain routing tables. Each node maintains the least-cost estimate from itself to the base station, and each message to be forwarded is broadcast by the node. On receiving a message, a node checks whether it is on the least-cost path between the source sensor node and the base station; if so, it forwards the message by broadcasting it (a sketch of this forwarding test appears at the end of this subsection).

In Reference 7, the authors proposed to model the sensor network as a flow network and presented an ILP (Integer Linear Program)-based routing method. The objective of this method is to minimize the maximum energy spent by any sensor node during a period of time. Through simulation results, the authors showed that their ILP-based routing heuristic increases the lifetime of the sensor network significantly.

Kulik and coworkers [22] proposed a set of protocols to disseminate sensed data from a sensor to other sensor nodes. Sensor Protocols for Information via Negotiation (SPIN) overcomes information implosion and overlap by using negotiation and information descriptors (metadata). The authors proposed different protocols for both point-to-point and broadcast channels.
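The following is a minimal sketch of the least-cost forwarding test described above; the header field names are ours, not taken from [9]. Each node keeps only its own least-cost estimate to the base station, and the message carries the originator's estimate plus the cost consumed so far.

```c
#include <math.h>
#include <stdio.h>

typedef struct {
    float src_min_cost;   /* originator's least-cost estimate to the base station */
    float cost_consumed;  /* cost accumulated over the hops traveled so far       */
} msg_hdr_t;

/* A node rebroadcasts a message only if it lies on a least-cost path: the
 * budget left after this hop must equal its own least-cost estimate. */
int should_forward(msg_hdr_t m, float my_min_cost, float link_cost_from_sender)
{
    float budget_left = m.src_min_cost - (m.cost_consumed + link_cost_from_sender);
    return fabsf(budget_left - my_min_cost) < 1e-3f;  /* tolerate float rounding */
}

int main(void)
{
    msg_hdr_t m = { .src_min_cost = 10.0f, .cost_consumed = 3.0f };
    /* a node with estimate 5 hearing the message over a link of cost 2 forwards */
    printf("forward: %d\n", should_forward(m, 5.0f, 2.0f));   /* prints 1 */
    /* a node with estimate 6 is off the least-cost path and stays silent */
    printf("forward: %d\n", should_forward(m, 6.0f, 2.0f));   /* prints 0 */
    return 0;
}
```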
32.5.2 Cluster-Based Routing Protocols

In cluster-based routing protocols, special nodes referred to as cluster heads discover and maintain routes, and noncluster-head nodes join one of the clusters. All data packets originating in a cluster are forwarded toward the cluster head, which in turn forwards them toward the destination using the routing information. Here, we describe some cluster-based routing protocols from the literature.

Chandrakasan et al. [23] proposed Low Energy Adaptive Clustering Hierarchy (LEACH) as an energy-efficient communication protocol for wireless sensor networks. In LEACH, self-elected cluster heads collect data from all the sensor nodes in their cluster, aggregate the collected data by data fusion methods, and transmit the data directly to the base station (the self-election rule is sketched below).

In Reference 11, the authors classified sensor networks into proactive and reactive networks. Nodes in proactive networks continuously monitor the environment and thus have data to send at a constant rate; LEACH suits such sensor networks in transmitting data efficiently to the base station. In reactive sensor networks, nodes need to transmit data only when an event of interest occurs, so not all nodes have equal amounts of data to transmit. Manjeshwar et al. proposed the Threshold-sensitive Energy-Efficient sensor Network (TEEN) protocol [11] for routing in reactive sensor networks. Estrin et al. [21] proposed a two-level clustering algorithm that can be extended to build a cluster hierarchy.
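The sketch below shows LEACH-style cluster-head self-election in the threshold form commonly stated in the literature (a sketch, not reproduced verbatim from [23]): a node that has not served as cluster head in the last 1/p rounds becomes one with probability T(n) in the current round, so headship rotates and each node serves roughly once per 1/p rounds.

```c
#include <stdio.h>
#include <stdlib.h>

double leach_threshold(double p, int round_no, int served_recently)
{
    if (served_recently)
        return 0.0;                  /* already served within the last 1/p rounds */
    int period = (int)(1.0 / p);     /* cluster-head duty recurs every 1/p rounds */
    return p / (1.0 - p * (round_no % period));
}

int elect_self(double p, int round_no, int served_recently)
{
    double t = leach_threshold(p, round_no, served_recently);
    return ((double)rand() / RAND_MAX) < t;    /* become cluster head w.p. T(n) */
}

int main(void)
{
    double p = 0.05;                 /* desired fraction of cluster heads */
    for (int r = 0; r < 5; r++)
        printf("round %d: threshold %.4f, elected: %d\n",
               r, leach_threshold(p, r, 0), elect_self(p, r, 0));
    return 0;
}
```

Note how the threshold rises within each period, reaching 1 in the last round so that every eligible node serves before the rotation restarts.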
32.6 Other Important Issues

In this section we discuss other important issues, such as security, location determination, lifetime analysis, power management, and clock synchronization. We describe why these issues are paramount for the functioning of sensor networks and present some solutions proposed in the literature.
32.6.1 Security

Security is a critical issue for the envisioned mass deployment of sensor networks. In particular, a strong security framework is a must in battlefield and border-monitoring applications. The security framework in sensor networks should satisfy the following objectives:
• Authentication/nonrepudiation: Each sensor should be able to identify the sender of a message correctly, and no node should be able to deny its previous actions.
• Integrity: Messages sent over the wireless medium should not be altered by unauthorized entities.
• Confidentiality: Messages should be kept secret from unauthorized entities.
• Freshness: Messages received by sensors should be current.

32.6.1.1 Sensor Networks versus ad hoc Networks: Security Perspective

Sensor networks share some similarities with ad hoc networks, but security in sensor networks differs for the following reasons:

Node power. Sensor nodes have limited power supply and low computational capabilities compared to ad hoc nodes. Asymmetric key encryption [24] schemes require large computational power compared to symmetric key encryption [24]. Thus, sensor networks can only use symmetric key encryption, which means the key distribution problem must be addressed.

Mode of communication. As stated earlier, most of the communication in sensor networks is from the sensor nodes to the base station. At times, the base station issues commands to the sensor nodes. In this mode of communication, every node may not need to share keys with every other node in the network. Moreover, it is not practical for every node to store a key shared with every other node.

Node mobility. In ad hoc networks every node can move. In general, sensor nodes are static, and some architectures have mobile base stations, as in Reference 7.

The above differences make the security protocols of ad hoc networks, or of any other traditional network, impractical for sensor networks.

32.6.1.2 Proposed Security Protocols

Recently, there has been some work related to sensor network security. Perrig et al. [25] proposed SPINS: Security Protocols for Sensor Networks. The SPINS framework consists of two protocols that satisfy the security objectives: the Secure Network Encryption Protocol (SNEP) provides data integrity, two-party authentication, and data freshness, while the micro-Timed Efficient Streaming Loss-tolerant Authentication protocol (µTESLA) provides authenticated broadcast. In SNEP, each sensor node and the base station share a unique key, which is bootstrapped. This shared key and an incremental message counter, maintained at both the sensor node and the base station, are used to derive new keys using the RC5 [24] algorithm. In µTESLA, the sender generates a chain of keys using the one-way function MD5 [24]. The important property of the chain is that if the sender authenticates the initial key, then the other keys in the chain are self-authenticating. The sender divides time into equal intervals and assigns each interval a key from the chain; sender and receiver agree upon the key disclosure schedule. The first key of the chain is authenticated using unicast authentication. Thereafter, the receiver authenticates packets after receiving a symmetric key from the sender as per the disclosure schedule. µTESLA thus employs delayed disclosure of symmetric keys to authenticate packets after the first authentication of one key in the chain.
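The one-way key chain at the heart of µTESLA-style broadcast authentication is easy to illustrate. In the sketch below, hash64() is a toy stand-in for a real one-way function such as MD5; the chain logic, not the hash, is the point of the example, and all constants are made up.

```c
#include <stdint.h>
#include <stdio.h>

static uint64_t hash64(uint64_t x)          /* placeholder one-way mixer */
{
    x ^= x >> 33; x *= 0xff51afd7ed558ccdULL;
    x ^= x >> 33; x *= 0xc4ceb9fe1a85ec53ULL;
    return x ^ (x >> 33);
}

#define CHAIN_LEN 8

int main(void)
{
    uint64_t key[CHAIN_LEN];
    key[CHAIN_LEN - 1] = 0xdeadbeefcafef00dULL;      /* sender's secret seed  */
    for (int i = CHAIN_LEN - 2; i >= 0; i--)         /* K_i = F(K_{i+1})      */
        key[i] = hash64(key[i + 1]);

    /* key[0] is distributed authentically (e.g., by unicast). When the sender
     * later discloses key[i], a receiver hashes it i times back to the trusted
     * key[0]; every later key is therefore self-authenticating. */
    uint64_t disclosed = key[3], check = disclosed;
    for (int i = 0; i < 3; i++) check = hash64(check);
    printf("disclosed key %s\n", check == key[0] ? "verified" : "rejected");
    return 0;
}
```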
In Reference 26, the authors proposed a security framework based on broadcast with end-to-end encryption of the data. This scheme avoids traffic analysis and also removes compromised and dead nodes from the network. Slijepcevic et al. [27] divided the messages in sensor networks into three classes depending on the security required; each class of messages is encrypted using a different encryption key. They showed that this multilevel scheme saves resources at the nodes.

In general, the base stations broadcast commands to all the sensor nodes, which makes secure broadcast a very important part of the security framework. In µTESLA, the authentication of the first key in the chain is done using a unicast mechanism, which has a scalability problem. The authors of Reference 28 replaced this unicast-based mechanism with a broadcast-based mechanism, which avoids denial-of-service [24] attacks. In Reference 29, the authors proposed a routing-aware broadcast key distribution algorithm. Karlof and Wagner [30] described possible attacks on the different routing protocols in the literature and suggested countermeasures.

Asymmetric key mechanisms require large computational power, bandwidth, and memory. Therefore, sensor networks employ symmetric key encryption to satisfy the security objectives, which makes key distribution [31–35] another important issue in sensor networks. Eschenauer and Gligor [31] proposed a probabilistic key predistribution scheme in which every sensor node is given a small set of m keys out of a large pool of available keys, such that any two sensor nodes have one common key with a given probability p. This scheme dramatically reduces the number of keys stored in each sensor compared with storing a separate key for every other node in the network. In Reference 31, the authors proposed three extensions to this basic key distribution scheme. In the first, the q-composite keys extension, sensor nodes must share q common keys instead of one key with a given probability p; this improves security against small-scale attacks, such as eavesdropping on one link. The second, the multipath extension, deals with setting up end-to-end path keys between two communicating nodes: path keys are established by sending random keys through every available path between the nodes, and the receiver uses all the random keys received along all the paths to establish a path key. This improves security against large-scale attacks, such as eavesdropping on many links. The third extension, the random pairwise keys scheme, provides node-to-node authentication. In this scheme, unique node identities are generated randomly; every node is randomly paired with m other nodes and m corresponding keys, and every node knows the other node's ID in each pair and the corresponding key. This node ID information is used for node-to-node authentication.
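Returning to the basic scheme, its two parameters trade off as follows: the probability that two nodes share at least one key is one minus the probability that their two m-key rings are disjoint, which a standard counting argument gives as (a sketch, not reproduced from [31]):

```latex
p \;=\; 1 - \frac{\binom{P-m}{m}}{\binom{P}{m}}
  \;=\; 1 - \frac{\big((P-m)!\big)^{2}}{(P-2m)!\;P!}
```

where P is the size of the key pool and m is the number of keys stored per node. The ring size m can thus be chosen to achieve a desired connection probability p for a given pool size.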
32.6.2 Location Determination

Sensor nodes monitor surrounding phenomena, such as temperature, light, seismic currents, chemical leaks, radiation, and other parameters of interest. After detecting an event, sensor nodes forward the sensed data toward the nearest base station. To process any message reported by the sensor network, the base station must know the sender's location. For example, if the sensor network is deployed to detect forest fires, the base station should know the reporting sensor's location. Hence, the base station needs to be aware of the location of every sensor node deployed in the network. In this section we explain different solutions proposed in the literature for location determination in sensor networks.

A locationing algorithm's performance can be measured [36] by the following parameters:
• Resolution: The smallest distance between nodes that can be distinguished by the locationing system.
• Accuracy: The probability of the locationing system finding the correct location.
• Robustness: The ability of the locationing system to find the correct location when subjected to node failures and link failures.

The Global Positioning System (GPS) [37] has been used to locate outdoor nodes. Due to reflection and multipath fading, GPS is not a viable option for indoor locationing. Since sensor nodes can be deployed at indoor locations or on other planets, GPS-based locationing is not advisable. Many non-GPS-based locationing solutions have been proposed by the research community. Most of these solutions are based on either proximity or beacons. In proximity-based solutions, some nodes act as special nodes whose locations are known. Proximity-based solutions can be divided into two types. In the first type [38], beacons are sent by the special nodes, from which the other nodes can approximate their own locations. In the second type [39], beacons are sent by the nonspecial nodes, from which the special nodes can approximate the locations of the nonspecial nodes.

Cricket [38] uses differences in the arrival times of signals from known beacons as the basis for finding the location. In RADAR [40], the authors used the SNR as the basis for finding the location of nodes. The SpotON [41] system finds the location of nodes in three-dimensional space. These solutions can be adapted for location detection in sensor networks. In Reference 42, the authors proposed a location detection scheme that consists of local positioning and global positioning. In local positioning, nodes approximate their relative location from anchor nodes, whose locations are assumed known, using triangulation. Global positioning then finds the global location using a cooperative ranging approach, in which the nodes iteratively converge to a global position by interacting with each other. Ray et al. [36] proposed a robust location detection algorithm for emergency applications. None of the previously mentioned solutions dealt with robustness, which is an important issue in emergency scenarios, such as building collapses; this scheme improves robustness by using identifying codes. Estrin et al. [21] gave an interesting application of their clustering algorithm to pinpoint the location of an illegitimate object. This algorithm is robust to link or node failures, and its overhead is proportional to the local population density and a sublinear function of the total number of nodes.
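To make the anchor-based idea concrete, the sketch below solves 2-D trilateration from three anchors and three range estimates by linearizing the range equations (subtracting one equation from the other two removes the quadratic terms). This is an illustrative sketch, not the algorithm of [42]; the anchor positions and ranges are made-up test values.

```c
#include <math.h>
#include <stdio.h>

typedef struct { double x, y; } pt_t;

int trilaterate(const pt_t a[3], const double d[3], pt_t *out)
{
    /* Subtracting the third range equation from the first two leaves a
     * 2x2 linear system A * [x y]^T = b in the unknown position. */
    double A[2][2], b[2];
    for (int i = 0; i < 2; i++) {
        A[i][0] = 2.0 * (a[2].x - a[i].x);
        A[i][1] = 2.0 * (a[2].y - a[i].y);
        b[i] = d[i]*d[i] - d[2]*d[2]
             - a[i].x*a[i].x + a[2].x*a[2].x
             - a[i].y*a[i].y + a[2].y*a[2].y;
    }
    double det = A[0][0]*A[1][1] - A[0][1]*A[1][0];
    if (fabs(det) < 1e-12) return -1;       /* anchors (nearly) collinear */
    out->x = (b[0]*A[1][1] - b[1]*A[0][1]) / det;
    out->y = (A[0][0]*b[1] - A[1][0]*b[0]) / det;
    return 0;
}

int main(void)
{
    pt_t anchors[3] = { {0, 0}, {10, 0}, {0, 10} }, p;
    double ranges[3] = { 5.0, 8.0622577, 6.7082039 };   /* node at (3, 4) */
    if (trilaterate(anchors, ranges, &p) == 0)
        printf("estimated position: (%.2f, %.2f)\n", p.x, p.y);
    return 0;
}
```

With noisy ranges from more than three anchors, the same linearization is usually solved in a least-squares sense instead.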
32.6.3 Lifetime Analysis

Lifetime refers to the period for which a sensor network is capable of sensing and transmitting the sensed data to the base station(s). In sensor networks, thousands of nodes are powered by a very limited supply of battery power, which makes lifetime analysis an important tool for using the available energy efficiently. In sensor networks using a rechargeable energy source, such as solar energy, lifetime analysis helps the nodes use the energy efficiently between rechargings. Lifetime analysis may include an upper bound on the lifetime and the factors influencing this upper bound; a theoretical upper bound on the lifetime of a sensor network helps in understanding the efficiency of other protocols. Bhardwaj et al. [14] derived a theoretical upper bound on the lifetime of a sensor network deployed for tracking the movement of external objects. In Reference 43, the authors determined the lifetime of a sensor network with hybrid automata modeling; a hybrid automaton is a mathematical model for analyzing systems with both discrete and continuous behaviors. The authors used trace data to analyze the power consumption and to estimate the lifetime of a sensor network.
32.6.4 Power Management

Sensor networks should operate with the minimum possible energy to increase the lifetime of the sensor nodes. This requires power-aware computation/communication component technology, low-energy signaling and networking, and a power-aware software infrastructure. Design challenges encountered in building wireless sensor networks can be broadly classified into hardware, wireless networking, and OS/applications; all three categories should minimize power usage to increase the life of a sensor node. Hardware includes the design activities related to all hardware platforms that make up sensor networks; MEMS, digital circuit design, system integration, and RF are important categories in hardware design. The second aspect includes the design of power-efficient algorithms and protocols; in previous sections, we described a few energy-efficient protocols for MAC and routing. Next, we present a few OS/application-level strategies for power management in sensor nodes.

Once the system is designed, additional power savings can be obtained by using Dynamic Power Management (DPM) [44]. The basic idea behind DPM is to shut down devices (sleep mode) when they are not needed and to wake them up when required. This needs an embedded operating system [45] that is able to support DPM. Switching a node from a sleep state to the active state takes finite time and resources. Each sensor node may be equipped with multiple devices, and the number of devices switched off determines the level of the sleep state. Each sleep state is characterized by its latency and its power consumption: the deeper the sleep state, the lower the power consumption and the greater the latency. This requires careful use of DPM to maximize the life of a sensor node. In many cases it is not known beforehand when a particular device will be required; hence, a stochastic analysis should be applied to predict future events.

Energy can also be conserved by using Dynamic Voltage Scheduling (DVS) [44,46]. DVS minimizes idle processor cycles by using a feedback control system. Energy savings can be obtained by optimizing the sensor nodes' performance in the active state, and DVS is an effective tool for achieving this goal. The main idea behind DVS is to change the power supply to match the workload; this requires tuning the processor to deliver the required throughput while avoiding idle cycles. The crux of the problem lies in the fact that future workloads are nondeterministic, so the efficiency depends on predicting the future workload.

Efficient link layer strategies can also be used to conserve energy at each sensor node. In Reference 47, the authors propose to conserve energy by compromising on the quality of the established link layer. This is possible by maintaining the bit error rate (BER) just below the user requirements. Different error control algorithms, such as Bose–Chaudhuri–Hocquenghem (BCH) coding, convolutional coding, and turbo coding, can be employed for error control; the algorithm with the lowest power consumption that supports the predetermined BER and latency should be chosen.

Local computation and processing [23,45] of sensor data in wireless networks can be made highly energy efficient. Partitioning the computation among multiple sensor nodes and performing the computation in parallel permits greater control over latency and results in energy conservation through frequency scaling and voltage scaling.

Biomedical wireless sensor networks can use power-efficient topologies [48] to save the energy spent in communication. Biomedical sensor nodes include monitors and implantable devices intended for long-term placement in the human body, and the topology is predetermined in these sensor networks. Salhieh et al. proposed the Directional Source-Aware routing Protocol (DSAP) for this class of sensor networks. DSAP incorporates power considerations into the routing tables; the authors explored various topologies to determine the most energy-efficient topology for biomedical sensor networks.
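A minimal DPM policy falls out of the latency/power characterization directly: pick the deepest state whose wakeup latency fits the predicted idle interval and the application's latency budget. The sketch below illustrates this; the state table values are made up for illustration and are not from [44].

```c
#include <stdio.h>

typedef struct {
    const char *name;
    double power_mw;     /* power drawn while in this state       */
    double wakeup_ms;    /* latency to return to the active state */
} sleep_state_t;

/* Hypothetical states, ordered from shallowest to deepest. */
static const sleep_state_t states[] = {
    { "active-idle", 12.0,  0.0 },
    { "cpu-sleep",    1.8,  0.5 },   /* CPU halted, peripherals on  */
    { "radio-off",    0.3,  5.0 },   /* radio also powered down     */
    { "deep-sleep",   0.02, 30.0 },  /* everything but a timer off  */
};

const sleep_state_t *pick_state(double predicted_idle_ms, double max_latency_ms)
{
    const sleep_state_t *best = &states[0];
    for (size_t i = 1; i < sizeof states / sizeof states[0]; i++) {
        /* deeper states save power only if we can afford their wakeup time */
        if (states[i].wakeup_ms <= predicted_idle_ms &&
            states[i].wakeup_ms <= max_latency_ms)
            best = &states[i];
    }
    return best;
}

int main(void)
{
    printf("idle 10 ms  -> %s\n", pick_state(10.0, 8.0)->name);
    printf("idle 200 ms -> %s\n", pick_state(200.0, 50.0)->name);
    return 0;
}
```

The quality of such a policy rests entirely on the idle-time prediction, which is exactly where the stochastic analysis mentioned above comes in.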
32.6.5 Clock Synchronization

Some of the communication algorithms proposed in the literature for wireless sensor networks make the inherent assumption that some mechanism exists through which the local clocks of all the sensor nodes are synchronized. Though this assumption is valid, we need an explicit way of synchronizing the local clocks of all sensor nodes. Apart from the implementation of the communication algorithms, clock synchronization is required for accurate time stamps in cryptographic schemes, for recognizing duplicate detections of the same event by different sensor nodes, for data aggregation algorithms such as beam forming, for ordering logged events, and for many other similar purposes. In this section, the post facto clock synchronization algorithm proposed by Elson et al. [49] is described.

The post facto clock synchronization algorithm discussed here is suitable for applications such as beam forming, duplicate event detection, and other similar localized methods. This algorithm is expected to be implemented on systems similar to the WINS (Wireless Integrated Network Sensors) platform, where the processor has various sleep modes and the capability of powering down high-energy peripherals. Because the sensor node processor can power down a device and power it up only when there is a requirement to sense and transmit data, existing clock synchronization methods for distributed systems are not applicable.

The basic idea behind post facto clock synchronization is that, for certain applications such as data fusion and beam forming, it is sufficient to order the events in a localized fashion. In this scheme, the nodes' clocks are normally unsynchronized. When a stimulus arrives (time to sense and transmit data), each node records the stimulus with respect to its local clock. Immediately following this event, a third party broadcasts a synchronization pulse, and every node receiving this pulse normalizes its stimulus time stamp with respect to the broadcast synchronization pulse. It is essential to note that the time elapsed between recording the stimulus and the arrival of the synchronization pulse needs to be measured accurately. For this reason the algorithm is inappropriate for systems that need to communicate a time stamp over a long distance.
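The normalization step is simple enough to show directly. In the minimal sketch below (our illustration of the idea, not code from [49]), each node keeps an unsynchronized local tick counter, stamps the stimulus locally, and re-expresses the stamp relative to the sync pulse so that stamps from different nodes become comparable.

```c
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint32_t stimulus_local;   /* local clock ticks at stimulus detection */
    uint32_t pulse_local;      /* local clock ticks at sync-pulse arrival */
} node_record_t;

/* How long before the pulse this node saw the stimulus, in its own ticks.
 * Comparing these values across nodes orders the event locally. */
int32_t normalized_stamp(const node_record_t *r)
{
    return (int32_t)(r->pulse_local - r->stimulus_local);
}

int main(void)
{
    node_record_t a = { .stimulus_local = 1000, .pulse_local = 1130 };
    node_record_t b = { .stimulus_local = 7400, .pulse_local = 7520 };
    /* node b detected the stimulus 10 ticks closer to the pulse than node a */
    printf("a: %d ticks before pulse, b: %d ticks before pulse\n",
           normalized_stamp(&a), normalized_stamp(&b));
    return 0;
}
```

The accuracy requirement quoted above maps onto the interval pulse_local minus stimulus_local: clock drift over that interval, and variation in pulse propagation delay, bound the achievable precision.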
32.6.6 Reliability

Reliable transfer of critical sensed data is a very important issue in sensor networks. Reliability can be achieved at the MAC layer, the transport layer, or the application layer. In Reference 50, the authors concluded that reliability at both the MAC and transport layers is important. In sensor networks, the base station uses the data sensed by different sensors to conclude that events have occurred; hence, the reliability of data from the sensors to the base station is critical. In ESRT [51], the sink maintains an application-specific target reliability value, which depends on the reporting frequency of the sensor nodes, and the protocol adaptively adjusts the reporting frequency of the sensors based on the required reliability.
32.6.7 Sensor Placement and Organization for Coverage and Connectivity

Sensor networks are deployed to perform sensing, monitoring, surveillance, and detection tasks. The area in which a sensor node can perform its tasks with reasonable accuracy (i.e., where the sensor readings have at least a threshold level of sensing/detection probability) is known as its coverage area. The union of the coverage areas of the individual sensor nodes is the coverage area of the sensor network. The coverage area can be modeled as a circular disk (a sphere in 3D) with the sensor at its center. Coverage areas can be irregular and location dependent due to obstructions in the terrain, for example, for sensor nodes deployed in indoor applications or in urban and hilly areas [52]. The coverage area may also depend on the target; for example, a bigger target can be detected at a longer distance than a smaller one [53].

The degree of sensing coverage is a measure of the sensing quality provided by the sensor network in a designated area, and the coverage requirement depends on the application. For some applications, covering every location with at least a single sensor node might be sufficient, while other applications might need a higher degree of coverage [54]; for example, to pinpoint the exact location of a target, it might be necessary for every location to be monitored by multiple sensor nodes [55]. Covering every location with multiple sensors can also provide robustness. Some applications may require preferential coverage of critical points; for example, sensitive areas in the sensor field may require more surveillance/monitoring and should be covered by more sensors than other areas [52]. The coverage requirements can also change with time due to changes in environmental conditions; for example, visibility can vary due to fog or smoke, so a low degree of coverage might be sufficient in normal circumstances, but when a critical event is sensed, a high degree of coverage may be desired [54].

It is desirable to achieve the required degree of coverage and robustness with the minimum number of active sensors so as to minimize interference and information redundancy [54,56]. However, due to the limited range of wireless communication, the minimum number of sensors required for coverage may not guarantee the connectivity of the resulting sensor network. The network is said to be connected if any sensor node can communicate with any other sensor node (possibly using other sensor nodes as intermediaries). In some cases, the physical proximity of sensor nodes may guarantee neither connectivity nor coverage, due to obstacles such as buildings, walls, and trees. The connectivity of the sensor nodes also depends on the physical layer technology used for communication; some technologies, for example infrared and ultrasound, require the transmitter and the receiver to be in the line of sight [57]. Maintaining greater connectivity is desirable for good throughput and to avoid network partitioning due to node failures [54]. The sensor nodes can be deployed randomly or deterministically in the sensor field. Next we discuss the issues and proposed strategies for the placement and organization of sensor nodes.
32.6.7.1 Sensor Placement for Connectivity and Coverage

When the sensor nodes are deployed deterministically, a good placement strategy can minimize the cost and the energy consumption, thereby increasing the lifetimes of sensor nodes, while guaranteeing the desired level of coverage, connectivity, and robustness [55]. Chakrabarty et al. [55] and Ray et al. [57] have used the framework of identifying codes to determine sensor placements for target location detection. The identifying code problem, in an undirected graph, finds an optimal covering of vertices such that any vertex in the graph can be uniquely identified by the subset of vertices that cover it. If each location in the sensor field is covered by a unique subset of sensors, then the position of a target can be determined from the subset of sensors that observe it. However, determining the minimum number of sensors that must be deployed to uniquely identify each position of the target is equivalent to constructing an optimal identifying code, which is an NP-complete problem [57]. Ray et al. [57] have proposed a polynomial-time algorithm to compute irreducible identifying codes such that the resulting codes can tolerate up to a given number of errors in the received identifying code packets while still providing position information.

Zou and Chakrabarty [58] have proposed a virtual force algorithm to improve the coverage after an initial random deployment of sensor nodes. Initially, the sensors are deployed randomly in the sensor field. It is assumed that if two sensor nodes are very close to each other (closer than a predefined threshold), they exert (virtual) repulsive forces on each other, and if two sensor nodes are very far apart (farther than a predefined threshold), they exert (virtual) attractive forces on each other. Obstacles exert repulsive forces, and areas of preferential coverage exert attractive forces on a sensor node. The objective is to move sensor nodes from densely concentrated regions to sparsely concentrated regions so as to achieve uniform placement. The sensor nodes do not physically move during the execution of the virtual force algorithm; instead, a sequence of virtual motion paths is determined. After the new positions of the sensors are identified, a one-time movement is carried out to redeploy the sensors at their new positions.

32.6.7.2 Sensor Organization for Connectivity and Coverage

Sensor networks deployed in enemy territories, inhospitable areas, or disaster-struck areas preclude deterministic placement of sensor nodes [53]. Dispersing a large number of sensor nodes over a sensor field from an airplane is one way to deploy sensor networks in such areas. Since the sensor nodes may be scattered arbitrarily, a much larger number of sensor nodes is deployed than would have been needed if deterministic placement were possible. Therefore, it is advantageous to operate only the minimum number of sensor nodes required for sensing coverage and connectivity in the active mode and to keep the remaining nodes in the passive (sleep) mode. The passive nodes can be activated as and when neighboring active nodes deplete their energy or fail, so as to increase the lifetime of the sensor network. When the sensor nodes are deployed randomly, the main challenge is to develop an efficient, distributed, localized strategy for sensor organization that maximizes the lifetime of the network while guaranteeing the coverage and connectivity of the active nodes [54,56].
[54] have proposed a Coverage Configuration Protocol (CCP), which minimizes the number of active nodes required for coverage and connectivity. CCP assumes that the sensing areas and the transmission areas are circular and obstacle-free. The authors have shown that a set of sensor nodes that covers a convex region is also connected if the transmission radius is at least twice the sensing radius. In CCP, each node determines whether it is eligible to become active based on the coverage provided by its active neighbors. It is shown that a set of sensors in a convex region provides the required degree of coverage if (1) all the intersection points between any sensing circles have the required degree of coverage and (2) all the intersection points between any sensing circle and the region's boundary have the required degree of coverage. A sensor node discovers other active sensor nodes, and their locations, within a distance of twice the sensing radius through HELLO messages. It then finds the coverage degree of all the intersection points within its coverage area. A sensor node is not eligible to become active if all the intersection points within its coverage area already have the required degree of coverage. If there are no intersection points within its coverage area, it is ineligible if the required number of active sensors is located at the same position as itself. Each node periodically checks its eligibility, and only eligible
nodes remain active, sense the environment, and communicate with other active nodes. As active nodes deplete their energy, nonactive nodes become eligible and turn active to maintain the required degree of coverage.

Zhang and Hou [56] have proposed an Optimal Geographical Density Control (OGDC) algorithm, which maintains coverage and connectivity by keeping the minimum number of sensors in the active mode. The idea behind OGDC is similar to that of CCP, that is, if all the intersection points are covered by active sensor nodes then the entire area is covered. OGDC minimizes the number of active nodes by selecting nodes such that the overlap of the sensing areas of active nodes is minimal. It is shown that to minimize the overlap, the intersection point of two circles should be covered with a third circle such that the centers of the circles form an equilateral triangle with a side length of √3 r, where r is the radius of the circles. OGDC begins each round by randomly selecting a starting node and a neighbor at a distance of (approximately) √3 r. To cover an intersection point of these two circles, a third node is selected whose position is closest to the optimal position (the centers of the three circles forming an equilateral triangle). This process continues until the entire area is covered. All the selected nodes become active, and the nodes not selected go into sleep mode.
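As a quick coordinate-geometry check of the √3 r spacing (our verification, not part of the original analysis), place the first two circle centers at

\[
A = (0,\,0), \qquad B = (\sqrt{3}\,r,\,0)
\]

Their sensing circles of radius r intersect at

\[
P = \left(\tfrac{\sqrt{3}}{2}r,\ \pm\tfrac{1}{2}r\right)
\]

and the third center completing an equilateral triangle of side √3 r is

\[
C = \left(\tfrac{\sqrt{3}}{2}r,\ \tfrac{3}{2}r\right), \qquad \lVert C - P \rVert = \tfrac{3}{2}r - \tfrac{1}{2}r = r
\]

so the upper intersection point lies exactly on the boundary of the third sensing circle, as the construction requires.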
32.6.8 Topology Control

The topology of the sensor network is induced by the wireless links connecting the sensor nodes. The wireless connectivity of the nodes depends on many parameters, such as the physical layer technology, propagation conditions, terrain, noise, antenna characteristics, and the transmit power [59]. The topology of the network can be controlled by adjusting tunable parameters, such as the power levels of the transmitters [59–63]. The topology affects network performance in many ways: a sparse topology can increase the chances of network partitioning due to node failures and can increase the delay, while a dense topology can limit the capacity due to limited spatial reuse and can increase the interference and the energy consumption [59]. A distributed localized topology control algorithm that adjusts the tunable parameters to achieve the desired level of performance while minimizing the energy consumption is therefore highly desirable.

Wattenhofer et al. [62, 63] have proposed a two-phase distributed Cone-Based Topology Control (CBTC) algorithm. In the first phase, each node broadcasts a neighbor-discovery message with a small radius and records all the acknowledgments and the directions from which they came. The node continues its neighbor discovery process by increasing its transmission power (radius) until either it finds at least one neighbor in every cone of α degrees centered on itself or it reaches its maximum transmission power. The authors have proved that for α ≤ 5π/6, the algorithm guarantees that the resulting network topology is connected. In the second phase, the algorithm eliminates redundant edges without affecting the minimum-power routes of the network.

Li et al. [60] have proposed an MST (Minimum Spanning Tree)-based topology control algorithm, called Local Minimum Spanning Tree (LMST). In the information exchange phase, each node collects the node IDs and positions of all the nodes within its maximum transmission range using HELLO messages. In the topology construction phase, each node independently constructs its local MST using Prim's algorithm, taking the transmission power needed to reach a node as the cost of the edge to that node. The final topology of the network is derived from all the local MSTs by keeping as neighbors only on-tree nodes that are one hop away. To retain only bidirectional links, either all unidirectional links are converted into bidirectional ones or the unidirectional links are deleted. The authors have proved that the resulting topology preserves the network connectivity and that the node degree of any node is bounded by 6.
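A minimal sketch of the topology construction phase of LMST (ours, not the authors' code): node 0 runs Prim's algorithm over itself and its one-hop neighbors, with the edge cost modeled as distance raised to an assumed path-loss exponent ALPHA.

    #include <math.h>
    #include <stdio.h>

    #define MAX_NODES 16
    #define ALPHA     2.0   /* assumed path-loss exponent (2..4) */

    /* Edge cost = transmission power needed to reach a node,
     * modeled here simply as distance^ALPHA. */
    static double cost(double pos[][2], int i, int j)
    {
        double dx = pos[i][0] - pos[j][0];
        double dy = pos[i][1] - pos[j][1];
        return pow(sqrt(dx * dx + dy * dy), ALPHA);
    }

    /* Prim's algorithm over node 0 ("this" node) and its n-1 one-hop
     * neighbors; parent[v] gives the local MST edges afterwards. */
    static void local_mst(double pos[][2], int n, int parent[])
    {
        double key[MAX_NODES];
        int in_tree[MAX_NODES] = {0};

        for (int v = 0; v < n; v++) { key[v] = INFINITY; parent[v] = -1; }
        key[0] = 0.0;                      /* grow the tree from node 0 */

        for (int it = 0; it < n; it++) {
            int u = -1;
            for (int v = 0; v < n; v++)    /* pick cheapest fringe node */
                if (!in_tree[v] && (u < 0 || key[v] < key[u])) u = v;
            in_tree[u] = 1;
            for (int v = 0; v < n; v++) {  /* relax edges out of u */
                double c = cost(pos, u, v);
                if (!in_tree[v] && c < key[v]) { key[v] = c; parent[v] = u; }
            }
        }
    }

    int main(void)
    {
        /* node 0 plus three neighbors within maximum transmission range */
        double pos[][2] = { {0, 0}, {10, 0}, {10, 8}, {0, 7} };
        int parent[MAX_NODES];

        local_mst(pos, 4, parent);
        /* LMST keeps as logical neighbors only on-tree nodes one hop away */
        for (int v = 1; v < 4; v++)
            printf("MST edge: %d -- %d\n", parent[v], v);
        return 0;
    }

In the full protocol each node runs this computation independently, and only the MST edges incident to the node itself are kept as logical links.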
32.7 Conclusions

In this chapter we have presented an overview of wireless sensor networks and described some of their design issues. We discussed various solutions proposed to prolong the lifetime of sensor networks.
Proposed solutions to issues such as MAC-layer design, routing data from the sensor nodes to the base station, power management, location determination, and clock synchronization were discussed.
References

[1] Sohrabi, K. and Pottie, G.J. Performance of a novel self-organization protocol for wireless ad-hoc sensor networks. In Proceedings of the IEEE Vehicular Technology Conference, vol. 2, 1999, pp. 1222–1226.
[2] Min, R., Bhardwaj, M., Cho, S.-H., Shih, E., Sinha, A., Wang, A., and Chandrakasan, A. Low-power wireless sensor networks. In Proceedings of the 14th International Conference on VLSI Design, 2001, pp. 205–210.
[3] Estrin, D., Girod, L., Pottie, G., and Srivastava, M. Instrumenting the world with wireless sensor networks. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 2001, pp. 2033–2036.
[4] Pottie, G.J. Wireless sensor networks. In Information Theory Workshop, 1998, pp. 139–140.
[5] Agre, J. and Clare, L. An integrated architecture for cooperative sensing networks. Computer, 33, 106–108, 2000.
[6] Sohrabi, K., Gao, J., Ailawadhi, V., and Pottie, G.J. Protocols for self-organization of a wireless sensor network. IEEE Personal Communications, 7, 16–27, 2000.
[7] Gandham, S.R., Dawande, M., Prakash, R., and Venkatesan, S. Energy efficient schemes for wireless sensor networks with multiple mobile base stations. IEEE Globecom, 1, 377–381, 2003.
[8] Heinzelman, W., Kulik, J., and Balakrishnan, H. Adaptive protocols for information dissemination in wireless sensor networks. In Proceedings of the Fifth Annual ACM/IEEE International Conference on Mobile Computing and Networking, 1999, pp. 174–185.
[9] Ye, F., Chen, A., Liu, S., and Zhang, L. A scalable solution to minimum cost forwarding in large sensor networks. In Proceedings of the Tenth International Conference on Computer Communications and Networks, 2001, pp. 304–309.
[10] Lindsey, S. and Raghavendra, C.S. PEGASIS: power-efficient gathering in sensor information systems. In Proceedings of the International Conference on Communications, 2001.
[11] Manjeshwar, A. and Agrawal, D.P. TEEN: a routing protocol for enhanced efficiency in wireless sensor networks. In Proceedings of the 15th International Parallel and Distributed Processing Symposium, 2001, pp. 2009–2015.
[12] Gao, J. Analysis of energy consumption for ad hoc wireless sensor networks using the watts-per-meter metric. IPN Progress Report, 42–150, 2002.
[13] Youssef, M.A., Younis, M.F., and Arisha, K.A. A constrained shortest-path energy-aware routing algorithm for wireless sensor networks. In Proceedings of the Wireless Communications and Networking Conference, vol. 2, 2002, pp. 794–799.
[14] Bhardwaj, M., Chandrakasan, A., and Garnett, T. Upper bounds on the lifetime of sensor networks. In Proceedings of the IEEE International Conference on Communications, 2001, pp. 785–790.
[15] Elson, J. and Estrin, D. Time synchronization for wireless sensor networks. In Proceedings of the 15th International Parallel and Distributed Processing Symposium, 2001, pp. 1965–1970.
[16] Ye, W., Heidemann, J., and Estrin, D. An energy-efficient MAC protocol for wireless sensor networks. In Proceedings of the IEEE INFOCOM, 2002.
[17] Bharghavan, V., Demers, A., Shenker, S., and Zhang, L. MACAW: a media access protocol for wireless LANs. In Proceedings of the ACM SIGCOMM Conference, 1994.
[18] Singh, S. and Raghavendra, C.S. PAMAS: power aware multi-access protocol with signalling for ad-hoc networks. ACM Computer Communication Review, 28, 5–26, 1998.
[19] Tanenbaum, A.S. Computer Networks, 3rd ed., Prentice-Hall, New York, 1996.
[20] Bennett, F., Clarke, D., Evans, J.B., Hopper, A., Jones, A., and Leask, D. Piconet: embedded mobile networking. IEEE Personal Communications, 4, 8–15, 1997.
[21] Estrin, D., Govindan, R., Heidemann, J., and Kumar, S. Next century challenges: scalable coordination in sensor networks. In Proceedings of the Fifth Annual ACM/IEEE International Conference on Mobile Computing and Networking, 1999, pp. 263–270.
[22] Heinzelman, W., Kulik, J., and Balakrishnan, H. Negotiation-based protocols for disseminating information in wireless sensor networks. In Proceedings of the Fifth Annual ACM/IEEE International Conference on Mobile Computing and Networking, 1999.
[23] Heinzelman, W.R., Chandrakasan, A., and Balakrishnan, H. Energy-efficient communication protocol for wireless microsensor networks. In Proceedings of the 33rd Annual Hawaii International Conference on System Sciences, 2000, pp. 3005–3014.
[24] Menezes, A.J., van Oorschot, P.C., and Vanstone, S.A. Handbook of Applied Cryptography. CRC Press, Boca Raton, FL, October 1996.
[25] Perrig, A., Szewczyk, R., Wen, V., Culler, D., and Tygar, J.D. SPINS: security protocols for sensor networks. Wireless Networks Journal, 8, 521–534, 2002.
[26] Undercoffer, J., Avancha, S., Joshi, A., and Pinkston, J. Security for sensor networks. In CADIP Research Symposium, 2002.
[27] Slijepcevic, S., Potkonjak, M., Tsiatsis, V., Zimbeck, S., and Srivastava, M.B. On communication security in wireless ad-hoc sensor networks. In Proceedings of the 11th IEEE International Workshops on Enabling Technologies: Infrastructure for Collaborative Enterprises, Pittsburgh, PA, June 10–12, 2002.
[28] Liu, D. and Ning, P. Efficient distribution of key chain commitments for broadcast authentication in distributed sensor networks. In Proceedings of the 10th Annual Network and Distributed System Security Symposium, San Diego, CA, February 2003.
[29] Lazos, L. and Poovendran, R. Secure broadcast in energy-aware wireless sensor networks. In Proceedings of the IEEE International Symposium on Advances in Wireless Communications, Victoria, BC, Canada, September 23–24, 2002.
[30] Karlof, C. and Wagner, D. Secure routing in wireless sensor networks: attacks and countermeasures. In Proceedings of the First IEEE International Workshop on Sensor Network Protocols and Applications, May 2003.
[31] Eschenauer, L. and Gligor, V.D. A key-management scheme for distributed sensor networks. In Proceedings of the ACM Conference on Computer and Communications Security, Washington, DC, 2002, pp. 41–47.
[32] Chan, H., Perrig, A., and Song, D. Random key predistribution schemes for sensor networks. In Proceedings of the 2003 IEEE Symposium on Research in Security and Privacy, 2003.
[33] Carman, D.W., Matt, B.J., and Cirincione, G.H. Energy-efficient and low-latency key management for sensor networks. In Proceedings of the 23rd Army Science Conference, Orlando, FL, December 2–5, 2002.
[34] Law, Y.W., Etalle, S., and Hartel, P.H. Key management with group-wise pre-deployed keying and secret sharing pre-deployed keying. Technical report TR-CTIT-02-25, Centre for Telematics and Information Technology, University of Twente, The Netherlands, July 2002.
[35] Law, Y.W., Corin, R., Etalle, S., and Hartel, P.H. A formally verified decentralized key management architecture for wireless sensor networks. In Proceedings of the 4th IFIP TC6/WG6.8 International Conference on Personal Wireless Communications (PWC), LNCS 2775, Venice, Italy, September 2003, pp. 27–39.
[36] Ray, S., Ungrangsi, R., De Pellegrini, F., Trachtenberg, A., and Starobinski, D. Robust location detection in emergency sensor networks. In Proceedings of the IEEE INFOCOM, 2003.
[37] Hofmann-Wellenhof, B., Lichtenegger, H., and Collins, J. Global Positioning System: Theory and Practice, 4th ed., Springer-Verlag, Heidelberg, 1997.
[38] Priyantha, N.B., Chakraborty, A., and Balakrishnan, H. The cricket location-support system. In Proceedings of the ACM MOBICOM Conference, Boston, MA, 2000.
[39] Want, R., Hopper, A., Falcao, V., and Gibbons, J. The active badge location system. ACM Transactions on Information Systems, 10, 91–102, 1992.
[40] Bahl, P. and Padmanabhan, V.N. RADAR: an in-building RF-based user location and tracking system. In Proceedings of the IEEE INFOCOM Conference, Tel Aviv, Israel, 2000.
[41] Hightower, J., Borriello, G., and Want, R. SpotON: an indoor 3D location sensing technology based on RF signal strength. Technical report 2000-020-02, University of Washington, February 2000.
[42] Savarese, C. and Rabaey, J. Locationing in distributed ad-hoc wireless sensor networks. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 2001, pp. 2037–2040.
[43] Coleri, S., Ergen, M., and Koo, T.J. Lifetime analysis of a sensor network with hybrid automata modelling. In Proceedings of the ACM WSNA Conference, Atlanta, GA, September 2002.
[44] Sinha, A. and Chandrakasan, A. Dynamic power management in wireless sensor networks. IEEE Design and Test of Computers, 18, 62–74, 2001.
[45] Wang, A. and Chandrakasan, A. Energy efficient system partitioning for distributed wireless sensor networks. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 2, 2001, pp. 905–908.
[46] Im, C., Kim, H., and Ha, S. Dynamic voltage scheduling technique for low-power multimedia applications using buffers. In Proceedings of the International Symposium on Low Power Electronics and Design, 2001, pp. 34–39.
[47] Shih, E., Calhoun, B.H., Cho, S.-H., and Chandrakasan, A.P. Energy-efficient link layer for wireless microsensor networks. In Proceedings of the IEEE Computer Society Workshop on VLSI, 2001, pp. 16–21.
[48] Salhieh, A., Weinmann, J., Kochhal, M., and Schwiebert, L. Power efficient topologies for wireless sensor networks. In Proceedings of the International Conference on Parallel Processing, 2001, pp. 156–163.
[49] Elson, J. and Estrin, D. Time synchronization for wireless sensor networks. In Proceedings of the 15th International Parallel and Distributed Processing Symposium, 2001, pp. 1965–1970.
[50] Stann, F. and Heidemann, J. RMST: reliable data transport in sensor networks. In Proceedings of the IEEE International Workshop on Sensor Net Protocols and Applications, May 2003.
[51] Sankarasubramaniam, Y., Akan, O.B., and Akyildiz, I.F. ESRT: event-to-sink reliable transport in wireless sensor networks. In Proceedings of the ACM MobiHoc, June 2003.
[52] Dhillon, S.S. and Chakrabarty, K. Sensor placement for effective coverage and surveillance in distributed sensor networks. In Proceedings of the IEEE Wireless Communications and Networking Conference, March 2003.
[53] Slijepcevic, S. and Potkonjak, M. Power efficient organization of wireless sensor networks. In Proceedings of the IEEE International Conference on Communications, June 2001.
[54] Wang, X., Xing, G., Zhang, Y., Lu, C., Pless, R., and Gill, C. Integrated coverage and connectivity configuration in wireless sensor networks. In Proceedings of the ACM SenSys, November 2003.
[55] Chakrabarty, K., Iyengar, S.S., Qi, H., and Cho, E. Grid coverage for surveillance and target location in distributed sensor networks. IEEE Transactions on Computers, 51(12), 1448–1453, December 2002.
[56] Zhang, H. and Hou, J.C. Maintaining sensing coverage and connectivity in large sensor networks. Technical report UIUCDCS-R-2003-2351, University of Illinois at Urbana-Champaign, June 2003.
[57] Ray, S., Ungrangsi, R., De Pellegrini, F., Trachtenberg, A., and Starobinski, D. Robust location detection in emergency sensor networks. In Proceedings of the IEEE INFOCOM, April 2003.
[58] Zou, Y. and Chakrabarty, K. Sensor deployment and target localization based on virtual forces. In Proceedings of the IEEE INFOCOM, April 2003.
[59] Ramanathan, R. and Rosales-Hain, R. Topology control of multihop wireless networks using transmit power adjustment. In Proceedings of the IEEE INFOCOM, March 2000.
[60] Li, N., Hou, J.C., and Sha, L. Design and analysis of an MST-based topology control algorithm. In Proceedings of the IEEE INFOCOM, April 2003.
[61] Liu, J. and Li, B. Distributed topology control in wireless sensor networks with asymmetric links. In Proceedings of the IEEE GLOBECOM, December 2003.
[62] Wattenhofer, R., Li, L., Bahl, P., and Wang, Y.-M. Distributed topology control for power efficient operation in multihop wireless ad hoc networks. In Proceedings of the IEEE INFOCOM, April 2001.
[63] Li, L., Halpern, J.Y., Bahl, P., Wang, Y.-M., and Wattenhofer, R. Analysis of a cone-based distributed topology control algorithm for wireless multi-hop networks. In Proceedings of the ACM Symposium on Principles of Distributed Computing, August 2001.
33
Architectures for Wireless Sensor Networks

S. Dulman, S. Chatterjea, T. Hoffmeijer, P. Havinga, and J. Hurink
University of Twente

33.1 Sensor Node Architecture
Mathematical Energy Consumption Model of a Node
33.2 Wireless Sensor Network Architectures
Protocol Stack Approach • EYES Project Approach
33.3 Data-Centric Architecture
Motivation • Architecture Description
33.4 Conclusion
References
The vision of ubiquitous computing requires the development of devices and technologies that can be pervasive without being intrusive. The basic component of such a smart environment will be a small node with sensing and wireless communication capabilities, able to organize itself flexibly into a network for data collection and delivery. Building such a sensor network presents many significant challenges, especially at the architectural, protocol, and operating system levels. Although sensor nodes might be equipped with a power supply or energy scavenging means and an embedded processor that makes them autonomous and self-aware, their functionality and capabilities will be very limited. Therefore, collaboration between nodes is essential to deliver smart services in a ubiquitous setting. New algorithms for networking and distributed collaboration need to be developed. These algorithms will be the key to building self-organizing and collaborative sensor networks that show emergent behavior and can operate in a challenging environment where nodes move, fail, and energy is a scarce resource.

The question that arises is how to organize the internal software and hardware components in a manner that will allow them to work properly and adapt dynamically to new environments, requirements, and applications. At the same time, the solution should be general enough to suit as many applications as possible. Architecture definition also includes, at the higher level, a global view of the whole network: the topology, the placement of base stations, beacons, etc. are also of interest.

In this chapter, we present and analyze some characteristics of architectures for wireless sensor networks. We then propose a new dataflow-based architecture that allows, as a new feature, the dynamic reconfiguration of the sensor node software at runtime.
33.1 Sensor Node Architecture

Existing technology already allows the integration of functionalities for information gathering, processing, and communication in a tight packaging or even in a single chip (e.g., Figure 33.1 presents the EYES sensor node [1]).

FIGURE 33.1 EYES sensor node. (From EYES. Eyes European project, http://eyes.eu.org. With permission.)

The four basic blocks needed to construct a sensor node are (see Figure 33.2):

FIGURE 33.2 Sensor node components: sensor platform, processing unit, communication interface, and power source.

Sensor platform. The sensors are the interfaces to the real world. They collect the necessary information and are monitored by the central processing unit (CPU). The platforms may be built in a modular way such that a variety of sensors can be used in the same network. The utilization of a very wide range of sensors (monitoring characteristics of the environment, such as light, temperature, air pollution, pressure, etc.) is envisioned. The sensing unit can also be extended to contain one or more actuation units (e.g., to give the node the possibility of repositioning itself).

Processing unit. The processing unit is the intelligence of the sensor node: it not only collects the information detected by the sensors but also communicates with the rest of the network. The level of intelligence in the sensor node will strongly depend on the type of information that is gathered by its sensors and on the way in which the network operates. The sensed information will be preprocessed to reduce the amount of data to be transmitted via the wireless interface. The processing unit will also have to execute networking protocols in order to forward the results of the sensing operation through the network to the requesting user.

Communication interface. This is the link of each node to the sensor network itself. The focus lies on a wireless communication link, in particular on radio communication, although visible or infrared light, ultrasound, and other means of communication have already been used [2]. The radio transceivers used can usually function in simplex mode only, and can be completely turned off in order to save energy.

Power source. Owing to the application areas of sensor networks, autonomy is an important issue. Sensor nodes are usually equipped with a power supply in the form of one or more batteries. Current studies focus on reducing the energy consumption by using low-power hardware components and advanced networking and data management algorithms. The usage of energy scavenging techniques might make it possible for sensor nodes to be self-powered. No matter which form of power source is used, energy is still a scarce resource, and a series of trade-offs will be employed during the design phase to minimize its usage.

Sensor networks will be heterogeneous from the point of view of the types of nodes deployed. Moreover, whether or not a specific sensor node can be considered part of the network depends only on its correct usage of and participation in the sensor network suite of protocols, and not on the node's specific software or hardware implementation. An intuitive description given in Reference 3 envisions a sea of sensor nodes, some of them mobile and some static, occasionally containing tiny isles of relatively resource-rich devices. Some nodes in the system may execute autonomously (e.g., forming the backbone of the network by executing network and system services, controlling various information retrieval and dissemination functions, etc.), while others will have less functionality (e.g., just gathering data and relaying it to a more powerful node).

Thus, from the sensor node architecture point of view, we can distinguish between several kinds of sensor nodes. An approach that is simple yet sufficient in the majority of cases is to have two kinds of nodes: high-end sensor nodes (nodes that have plenty of resources or superior capabilities; the best candidate for such a node would probably be a fully equipped PDA device or even a laptop) and low-end nodes (nodes that have only the basic functionality of the system and very limited processing capabilities).

The architecture of a sensor node consists of two main components: defining the precise way in which functionalities are needed and how to join them into a coherent sensor node. In other words, sensor node architecture means defining the exact way in which the selected hardware components connect to each other, how they communicate, how they interact with the CPU, etc. A large variety of sensor node architectures have been built so far. As a general design rule, all of them have targeted the following three objectives: energy efficiency, small size, and low cost.

Energy efficiency is by far the most important design constraint, because the lifetime of the sensor nodes depends on their energy consumption. As the typical deployment scenario assumes that the power supplies of nodes are limited and not rechargeable, a series of trade-offs need to be made to decrease the amount of consumed energy. Small node size makes it possible to deploy many nodes to study a certain phenomenon; the ideal size is suggested by the name of one of the first research projects in the area: SmartDust [4]. Very cheap sensor nodes will lead to rapid deployment of such networks and large-scale usage.
33.1.1 Mathematical Energy Consumption Model of a Node

In this section, we present a basic version of an energy model for a sensor node. The aim of the model is to predict the current energy state of the battery of a sensor node based on historical data on the use of the sensor node and on the current energy state of the battery.

In general, a sensor node may consist of several components. The main components are: a radio, a processor, a sensor, a battery, external memory, and periphery (e.g., a voltage regulator, debugging equipment, or periphery to drive an actuator). In the presented model we consider only the first four components. The external memory is neglected at this stage of the research, since its use of energy is rather complex and needs its own energy model if the memory is a relevant part of the functional behavior of the sensor node and not just used for storage. The periphery can be quite different from node to node and thus cannot be integrated in an energy model of a sensor node in a uniform way.

For the battery we assume that the usage of energy by the other components is independent of the current energy state of the battery. This implies that the reduction of the energy state of the battery depends only on the actions of the different components. Furthermore, we do not consider a reactivation of the battery by time or external circumstances. Based on these assumptions, it remains to give models for the energy consumption of the three components radio, processor, and sensor.

The basis of the model for the energy consumption of a component is the definition of a set S of possible states s_1, ..., s_k for the component. These states are defined such that the energy consumption of the component is given by the sum of the energy consumptions within the states s_1, ..., s_k plus the energy needed to switch between the different states. We assume that the energy consumption within a state s_j can be measured using a simple index t_j (e.g., execution time or number of instructions) and that the energy needed to switch between the different states can be calculated on the basis of a state transition matrix st, where st_{ij} denotes the number of times the component has switched from state s_i to state s_j. If P_j denotes the power needed in state s_j and E_{ij} denotes the energy consumed by switching once from state s_i to state s_j, the total energy consumption of the component is given by

\[
E_{\text{consumed}} \;=\; \sum_{j=1}^{k} t_j P_j \;+\; \sum_{\substack{i,j=1 \\ i \neq j}}^{k} st_{ij}\, E_{ij} \qquad (33.1)
\]
In the following, we describe the state sets S and the indices to measure the energy consumption within the states for the radio, the processor, and the sensor:

Radio. For the energy consumption of a radio, four different states need to be distinguished: off, sleep, receiving, and transmitting. For all four states the energy consumption depends on the time the radio has been in the state. Thus, for the radio we need to memorize the times the radio has spent in the four states and the 4 × 4 state transition matrix representing the number of times the radio has switched between them.

Processor. In general, for a processor, four main states can be identified: off, sleep, idle, and active. In sleep mode the CPU and most internal peripherals are turned off; it can be awakened only by an external event (interrupt), upon which the idle state is entered. In idle mode the CPU is still inactive, but some peripherals are now active (such as the internal clock or timers). In the active state the CPU and all peripherals are active; within this state, multiple substates might be identified based on clock speeds and voltages. We assume that the energy consumption depends on the time the processor has been in a certain state.

Sensor. For a (simple) sensor we assume that only the two states on and off exist and that the energy consumption within both states can be measured by time. However, if more powerful sensors are used, it may be necessary to work with more states (similar to the processor or the radio).

The energy model for the complete sensor node now consists of the energy models for the three components radio, processor, and sensor, plus two extra indicators for the battery:
• For the battery, only the energy state E_old at a time t_old in the past is given.
• For each component, the indices I_j characterizing the energy consumption in state s_j since time t_old and the state transition matrix st indicating the transitions since time t_old are specified.
Based on this information, an estimate of the current energy state of the battery can be calculated by subtracting from E_old the sum of the energy consumed by each component, estimated on the basis of Equation (33.1). Since the energy model gives only an estimate of the remaining energy of the battery, in practice it may be a good approach to use the energy model only for limited time intervals: if the difference between the current time and t_old gets larger than a certain threshold, the current energy state of the battery should be estimated on the basis of measurements or other information available on the energy state, and E_old and t_old should be replaced by this new estimate and the current time. Furthermore, the indices characterizing the states and the state transition matrices are reset for all the components of the sensor node.
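A minimal sketch of how a node might track Equation (33.1) in software (our illustration; the structure layout and the sample power figures are invented placeholders):

    #include <stdio.h>

    #define MAX_STATES 4

    /* Per-component energy bookkeeping for Equation (33.1). */
    typedef struct {
        int    k;                           /* number of states              */
        double t[MAX_STATES];               /* time spent in each state (s)  */
        double P[MAX_STATES];               /* power drawn in each state (W) */
        int    st[MAX_STATES][MAX_STATES];  /* state transition counts       */
        double E[MAX_STATES][MAX_STATES];   /* energy per transition (J)     */
    } component_t;

    /* E_consumed = sum_j t_j * P_j + sum_{i != j} st_ij * E_ij */
    static double energy_consumed(const component_t *c)
    {
        double e = 0.0;
        for (int j = 0; j < c->k; j++)
            e += c->t[j] * c->P[j];
        for (int i = 0; i < c->k; i++)
            for (int j = 0; j < c->k; j++)
                if (i != j)
                    e += c->st[i][j] * c->E[i][j];
        return e;
    }

    int main(void)
    {
        /* Radio with states off, sleep, receive, transmit.
         * The power figures below are made-up placeholders. */
        component_t radio = {
            .k = 4,
            .t = { 50.0, 40.0, 6.0, 4.0 },
            .P = { 0.0, 20e-6, 12e-3, 18e-3 },
        };
        radio.st[1][2] = 30; radio.E[1][2] = 1e-4;  /* sleep -> receive */
        radio.st[2][1] = 30; radio.E[2][1] = 2e-5;  /* receive -> sleep */

        double e_old = 100.0;  /* battery estimate at t_old (J) */
        printf("battery estimate: %.4f J\n", e_old - energy_consumed(&radio));
        return 0;
    }

Keeping one such record per component (radio, processor, sensor) and summing the three results gives the estimate that is subtracted from E_old above.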
33.2 Wireless Sensor Network Architectures

A sensor network is a very powerful tool when compared to a single sensing device. It consists of a large number of nodes, equipped with a variety of sensors that are able to monitor different characteristics of a phenomenon. A dense network of such small devices gives the researcher a spatial view of the phenomenon and, at the same time, produces results based on a combination of various sorts of sensed data.

Each sensor node has two basic operation modes: the initialization mode and the operation mode. The network as a whole, however, will function in a smooth way, with the majority of the nodes in the operation mode and only a subset of nodes in the initialization mode. The two modes have the following characteristics:

Initialization mode. A node can be considered in initialization mode if it tries to integrate itself into the network and is not performing its routine function. A node can be in initialization mode, for example, at power-on or when it detects a change in the environment and needs to reconfigure itself. During initialization, the node can pass through different phases such as detecting its neighbors and the network topology, synchronizing with its neighbors, determining its own position, or even performing configuration operations on its own hardware and software. At a higher abstraction level, a node can be considered in initialization mode if it tries to determine which services are already present in the network and which services it needs to provide or can use.

Operation mode. After the initialization phase the node enters a stable state, the regular operation state. It functions based on the conditions determined in the initialization phase. The node can exit the operation mode and pass through an initialization mode if either the physical conditions around it or the conditions related to the network or to itself have changed. The operation mode is characterized by small bursts of node activity (such as reading sensor values, performing computations, or participating in networking protocols) and periods spent in an energy-saving low-power mode.
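The two modes and their transitions can be summarized as a small state machine; the sketch below is ours, with illustrative event names:

    #include <stdbool.h>

    /* Node life cycle: initialization vs. operation mode (schematic). */
    typedef enum { MODE_INIT, MODE_OPERATION } node_mode_t;

    /* One scheduler tick: stay in or leave the current mode.
     * 'configured' and 'environment_changed' are illustrative events. */
    static node_mode_t next_mode(node_mode_t mode, bool configured,
                                 bool environment_changed)
    {
        switch (mode) {
        case MODE_INIT:
            /* neighbor discovery, synchronization, positioning, ... */
            return configured ? MODE_OPERATION : MODE_INIT;
        case MODE_OPERATION:
            /* short activity bursts alternating with low-power sleep */
            return environment_changed ? MODE_INIT : MODE_OPERATION;
        }
        return MODE_INIT;
    }

    int main(void)
    {
        node_mode_t m = MODE_INIT;
        m = next_mode(m, true, false);      /* configured -> operation */
        return m == MODE_OPERATION ? 0 : 1;
    }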
33.2.1 Protocol Stack Approach

A first approach to building a wireless sensor network is to use a layered protocol stack as a starting point, as in the case of traditional computer networks. The main difference between the two kinds of networks is that the building blocks of a sensor network usually span multiple layers, while others depend on several protocol layers. This characteristic of sensor networks comes from the fact that they have to provide functionalities that are not present in traditional networks. Figure 33.3 presents an approximate mapping of the main blocks onto the traditional OSI protocol layers.

The authors of Reference 5 propose an architecture based on the five OSI layers together with three management planes that cut across the whole protocol stack (see Figure 33.4). A brief description of the layers follows: (1) the physical layer addresses mainly the hardware details of the wireless communication mechanism, such as the modulation type and the transmission and receiving techniques; (2) the data-link layer is concerned with the Media Access Control (MAC) protocol that manages communication over the noisy shared channel; (3) the network layer manages routing of the data between the nodes, while the transport layer helps to maintain the dataflow; and (4) finally, the application layer contains (very often) only one single user application.
FIGURE 33.3 Relationship between building blocks and OSI layers. (Building blocks: routing, security, addressing, localization, collaboration, timing, lookup, clustering, and aggregation, spanning the physical, link, network, transport, and application layers.)
FIGURE 33.4 Protocol stack representation of the architecture: the physical, data link, network, transport, and application layers are crossed by the power management, mobility management, and task management planes. (From Akyildiz, I., Su, W., Sankarasubramaniam, Y., and Cayirci, E. IEEE Communication Magazine, 40, 102–114, 2002. With permission.)
In addition to the five network layers, the three management planes have the following functionality. The power management plane coordinates the energy consumption inside the sensor node; it can, for example, based on the available amount of energy, allow the node to take part in certain distributed algorithms or control the amount of traffic it wants to forward. The mobility management plane manages all the information regarding the physical neighbors and their movement patterns, as well as the node's own movement pattern. The task management plane coordinates sensing in a certain region based on the number of nodes and their placement (in very densely deployed sensor networks, energy might be saved by turning certain sensors off to reduce the amount of redundant information sensed).

In the following we give a description of the main building blocks needed to set up a sensor network. The description follows the OSI model; this should not imply that this is the right structure for these networks, it is taken only as a reference point.

Physical layer. The physical layer is responsible for the management of the wireless interface. For a given communication task, it defines a series of characteristics such as operating frequency, modulation type, data coding, and the interface between hardware and software. The large majority of sensor network prototypes built so far, and most of the envisioned application scenarios, assume the use of a radio transceiver as the means of communication. The unlicensed industrial,
scientific, and medical (ISM) band is preferred because it is a free band designed for short-range devices using low-power radios and requiring low data-transmission rates. The modulation scheme is another important parameter to decide upon. Complex modulation schemes might not be preferred because they require significant resources (in the form of energy, memory, and computation power). In the future, advances in integrated-circuit technology (e.g., ASIC, FPGA) will allow the use of modulation techniques such as ultrawide band (UWB) or impulse radio (IR), while if the sensor node is built using off-the-shelf components the choice comes down mainly to schemes such as amplitude shift keying (ASK) or frequency shift keying (FSK). Based on the modulation type and on the hardware used, a specific data encoding scheme will be chosen to assure both the synchronization required by the hardware and a first level of error correction. At the same time, the data frame will also include some carefully chosen initial bytes needed for the conditioning of the receiver circuitry and clock recovery.

It is worth mentioning that the minimum output power required to transmit a radio signal over a certain distance is directly proportional to the distance raised to a power between two and four (the exponent depends on the type of antenna used and its placement relative to the ground, indoor–outdoor deployment, etc.). Under these conditions, it is more efficient to transmit a signal using a multihop network composed of short-range radios rather than using a (power-consuming) long-range link [5], as the calculation below illustrates.

The communication subsystem usually needs a controller hierarchy to create the abstraction for the other layers in the protocol stack (we are referring to the device hardware characteristics and the strict timing requirements). If a simple transceiver is used, some of these capabilities will need to be provided by the main processing unit of the sensor node (this can require a substantial amount of resources for exact timing, execution synchronization, cross-layer distribution of the received data, etc.). The use of more advanced specialized communication controllers is not preferred, as they hide important low-level details.
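A back-of-the-envelope calculation (ours, assuming transmit power proportional to d^α with 2 ≤ α ≤ 4) makes the multihop advantage explicit. Covering a distance d in n equal hops instead of one long hop gives

\[
E_{\text{1 hop}} \propto d^{\alpha}, \qquad
E_{n\ \text{hops}} \propto n\left(\frac{d}{n}\right)^{\alpha} = \frac{d^{\alpha}}{n^{\alpha-1}}
\]

that is, a reduction by a factor of n^{α−1}: a factor of 4 for α = 2 and n = 4, and a factor of 64 for α = 4. The per-hop receive and processing overhead, ignored here, reduces this gain in practice.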
Data-link layer. The data-link layer is responsible for managing most of the communication tasks within one hop (both point-to-point and multicasting communication patterns). The main research issues here are the MAC protocols, the error control strategies, and power consumption control. The MAC protocols make communication between several devices over a shared channel possible by coordinating the sending–receiving actions as a function of time or frequency. Several strategies have already been studied and implemented for mobile telephony networks and for mobile ad hoc networks, but unfortunately none of them is directly applicable. Still, ideas can be borrowed from the existing standards and applications and new MAC protocols can be derived, as proven by the large number of new schemes that specifically target wireless sensor networks. As the radio component is probably the main energy consumer in each sensor node, the MAC protocol must be very efficient. To achieve this, the protocol must, first of all, make use of the power-down state of the transceiver (turn the radio off) as much as possible, because the energy consumption in this state is negligible.

The most important problem comes from the scheduling of the sleep, receive, and transmit states. The transitions among these states also need to be taken into account, as they consume energy and sometimes take long time intervals. Message collisions, overhearing, and idle listening are direct implications of the scheduling used inside the MAC protocol, which in addition influences the bandwidth lost due to control packet overheads.

A second function of the data-link layer is to perform error control of the received data packets. Existing techniques include automatic repeat-request (ARQ) and forward error correction (FEC) codes. The choice of a specific technique comes down to the trade-off between the energy consumed to transmit redundant information over the channel and the energy and computation power needed at the coder/decoder sides.

Additional functions of the data-link layer are creating and maintaining a list of the neighbor nodes (all nodes situated within the direct transmission range of the node in question); extracting and advertising the source and destination as well as the data content of overheard packets; and supplying information related to the amount of energy spent on transmitting, receiving, coding, and decoding the packets, the number of errors detected, the status of the channel, etc.
Network layer. The network layer is responsible for routing the packets inside the sensor network. It is one of the most studied topics in the area of wireless sensor networks and has received a lot of attention lately. The main design constraint for this layer is, as in all the previous cases, energy efficiency.

The main function of wireless sensor networks is to deliver sensed data (or data aggregates) to the base stations requesting it. The concept of data-centric routing has been used to address this problem in an energy-efficient manner, minimizing the amount of traffic in the network. In data-centric routing, each node is assigned a specific task based on the interests of the base stations. In the second phase of the algorithm, the collected data is sent back to the requesting nodes. Interest dissemination can be done in two different ways, depending on the expected amount of traffic and level of events in the sensor network: the base stations can broadcast the interest to the whole network, or the sensor nodes themselves can advertise their capabilities and the base stations subscribe to them. Based on these considerations, the network layer needs to be optimized mainly for two operations: spreading the user queries, generated at one or more base stations, across the whole network, and then returning the sensed data to the requesting node. Individual addressing of each sensor node is not important in the majority of the applications.

Due to the high density of the sensor nodes, a lot of redundant information will be available inside the sensor network. Relaying all this information to a certain base station might easily exceed the available bandwidth, making the sensor network unusable. The solution to this problem is the data aggregation technique, which requires each sensor node to inspect the content of the packets it routes and aggregate the contained information, reducing the high redundancy of the multiple sensed data (a schematic sketch follows at the end of this subsection). This technique has been proven to substantially reduce the overall traffic and to make the sensor network behave as an instrument for analyzing data rather than just a transport infrastructure for raw data [6].

Transport layer. This layer arises from the need to connect the wireless sensor network to an external network, such as the Internet, in order to disseminate its data readings to a larger community [7]. Usually the protocols needed for such interconnections require significant resources and will not be present in all the sensor nodes. The envisioned scenario is to allow a small subset of nodes to behave as gateways between the sensor network and external networks. These nodes will be equipped with superior resources and computation capabilities, and will be able to run the protocols needed to interconnect the networks.

Application layer. The application layer usually links the user's applications with the underlying layers in the protocol stack. Sensor networks are designed to fulfill one single application scenario in each particular case: the whole protocol stack is designed for a specific application and the whole network is seen as an instrument. This makes the application layer distributed along the whole protocol stack rather than appearing explicitly. Still, for the sake of classification, we can consider an explicit application layer that could have one of the following functionalities [5]: sensor management protocol, task management and data advertisement protocol, and sensor query and data dissemination protocol.
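As a schematic illustration of the in-network aggregation idea from the network-layer discussion (ours; the packet layout and the choice of a maximum aggregate are arbitrary):

    #include <stdio.h>

    /* Schematic in-network aggregation: instead of forwarding every
     * overheard reading for the same query, a relay node combines them
     * (here: maximum value) into a single outgoing packet. */
    typedef struct { int query_id; double value; } reading_t;

    /* Returns the number of packets actually forwarded. */
    static int aggregate_and_forward(const reading_t *in, int n, reading_t *out)
    {
        int m = 0;
        for (int i = 0; i < n; i++) {
            int j;
            for (j = 0; j < m; j++)
                if (out[j].query_id == in[i].query_id) {  /* same query: merge */
                    if (in[i].value > out[j].value)
                        out[j].value = in[i].value;
                    break;
                }
            if (j == m) out[m++] = in[i];                 /* new query: keep */
        }
        return m;
    }

    int main(void)
    {
        reading_t in[] = { {7, 21.5}, {7, 23.0}, {9, 4.2}, {7, 22.1} };
        reading_t out[4];
        int m = aggregate_and_forward(in, 4, out);
        printf("forwarded %d of 4 packets\n", m);         /* prints 2 of 4 */
        return 0;
    }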
33.2.2 EYES Project Approach

The approach taken in the EYES project [1] consists of only two key system abstraction layers: the sensor and networking layer and the distributed services layer (see Figure 33.5). Each layer provides services that may be spontaneously specified and reconfigured:

1. The sensor and networking layer contains the sensor nodes (the physical sensor and wireless transmission modules) and the network protocols. Ad hoc routing protocols allow messages to be forwarded through multiple sensor nodes, taking into account the mobility of nodes and the dynamic change of topology. Communication protocols must be energy efficient since sensor nodes have very limited energy supplies. To provide more efficient dissemination of data, some sensors may process data streams, and provide replication and caching.
FIGURE 33.5 EYES project architecture description: applications are built on the distributed services layer (lookup and information services), which in turn rests on the sensors and networking layer.
2. The distributed services layer contains distributed services for supporting mobile sensor applications. Distributed services coordinate with each other to perform decentralized services. These distributed servers may be replicated for higher availability, efficiency, and robustness. We have identified two major services. The lookup service supports mobility, instantiation, and reconfiguration. The information service deals with aspects of collecting data. This service allows vast quantities of data to be easily and reliably accessed, manipulated, disseminated, and used in a customized fashion by applications.

On top of this architecture, applications can be built using the sensor network and distributed services. Communication in a sensor network is data centric, since the identity of the numerous sensor nodes is not important; only the sensed data, together with time and location information, counts. The three main functions of the nodes within a sensor network are directly related to this:

Data discovery. Several classes of sensors will be deployed in the network. Specialized sensors can monitor climatic parameters (humidity, temperature, etc.), motion detection, vision, and so on. A first step of data preprocessing can also be included in this task.

Data processing and aggregation. This task is directly related to performing distributed computations on the sensed data and aggregating several observations into a single one. The goal of this operation is the reduction of energy consumption. Data processing contributes because the transmission of one (raw sensed) data packet costs the equivalent of many thousands of computation cycles on current architectures. Data aggregation keeps the overall traffic low by inspecting the contents of the routed packets and, in general, reducing the redundancy of the data in traffic by combining several similar packets into a single one.

Data dissemination. This task includes the networking functionality comprising routing, multicasting, broadcasting, addressing, etc. The existing network scenarios contain both static and mobile nodes. In some cases, the static nodes can be considered to form a backbone of the network and are more likely to be preferred in certain distributed protocols. Both mobile and static nodes will have to perform data dissemination, so the protocols should be designed to be invariant to node mobility.

The particular hardware capabilities of each kind of sensor node will determine how the previously described tasks are mapped onto them (in principle, all the
nodes could provide all the previous functionalities). During the initialization phase of the network, the functionality of every node will be decided based on both the hardware configuration and the particular environmental conditions.

For a large sensor network to be able to function correctly, a tiered architecture is needed. This means that nodes will have to organize themselves into clusters based on certain conditions. The nodes in each cluster will elect a leader: the node best fitted to perform coordination inside the cluster (this can be, e.g., the node with the highest amount of energy, the node having the most advanced hardware architecture, or just a random node). The cluster leader will be responsible for scheduling the node operations, managing the resources and the cluster structure, and maintaining communication with the other clusters. Several types of clusters can coexist in a single network:

Geographical clustering. This is the basic mode of organizing the sensor network. The clusters are built based on geographical proximity: neighboring nodes (nodes that are in transmission range of each other) organize themselves into groups. This operation can be handled in a completely distributed manner and is a necessity for the networking protocols to keep working as the network scales up.

Information clustering. The sensor nodes can be grouped into information clusters based on the services they can provide. This clustering structure belongs to the distributed services layer and is built on top of the geographical clustering. Nodes using this clustering scheme need not be direct neighbors from the physical point of view.

Security clustering. An even higher hierarchy appears if security is taken into consideration. Nodes can be grouped based on their trust levels, on the actions they are allowed to perform, or on the resources they are allowed to use in the network.

Besides offering increased capabilities to the sensor network, clustering is considered one of the principal building blocks for sensor networks also from the point of view of energy consumption: the energy overhead spent on creating and organizing the clusters is easily recovered in the long term due to the reduced traffic that results.

33.2.2.1 Distributed Services Layer Examples

This section focuses on the distributed services that are required to support applications for wireless sensor networks. We discuss the requirements of the foundation necessary to run these distributed services and describe how various research projects approach this problem area from a multitude of perspectives. A comparison of the projects is also carried out.

One of the primary issues of concern in wireless sensor networks is to ensure that every node in the network is able to utilize energy in a highly efficient manner so as to extend the total network lifetime to a maximum [5, 8, 9]. As such, researchers have been looking at ways to minimize energy usage at every layer of the network stack, starting from the physical layer right up to the application layer. While there is a wide range of methods that can be employed to reduce energy consumption, architectures designed for distributed services generally focus on one primary area: how to reduce the amount of communication required and yet get the main job done, without any significant negative impact, by observing and manipulating the data that flows through the network [6, 10, 11].
This leads us to look at the problem at hand from a data-centric perspective. In conventional IP-style communication networks, such as the Internet, nodes are identified by their end-points and internode communication is layered on an end-to-end delivery service that is provided within the network. At the communication level, the main focus is to get connected to a particular node within the network; thus the addresses of the source and destination nodes are of paramount importance [12]. The precise data that actually flows through the network is irrelevant to IP. Sensor networks, however, have a fundamental difference compared to the conventional communication networks described above, as they are application-specific networks. Thus instead of concentrating on
which particular node a certain data message is originating from, a greater interest lies in the data message itself: what is the data in the data message and what can be done with it? This is where the concept of a data-centric network architecture comes into play. As sensor nodes are envisioned to be deployed by the hundreds and potentially even thousands [8], specific sensor nodes are not usually of any interest (unless of course a particular sensor needs to have its software patched or a failure needs to be corrected). This means that instead of a sensor network application asking for the temperature of a particular node with ID 0315, it might pose a query asking, "What is the temperature in sector D of the forest?" Such a framework ensures that the acquired results are not dependent on just a single sensor: other nodes in sector D can respond to the query even if the node with ID 0315 dies. The outcome is not only a more robust network but, due to the high density of nodes [13], the user of the network is also able to obtain results of a higher fidelity (or resolution). Additionally, as nodes within the network are able to comprehend the meaning of the data passing through them, it is possible for them to carry out application-specific processing within the network, thus resulting in a reduction of the data that needs to be transmitted [14]. In-network processing is particularly important as local computation is significantly cheaper than radio communication [15].

33.2.2.1.1 Directed Diffusion

Directed Diffusion is one of the pioneering data-centric communication paradigms developed specifically for wireless sensor networks [6]. Diffusion is based on a publish/subscribe API (application programming interface), where the details of how published data is delivered to subscribers are hidden from the data producers (sources) and subscribers (sinks). The transmission and arrival of events (interest or data messages) occur asynchronously. Interests describe tasks that are expressed using a list of attribute-value pairs as shown below:

    // detect location of seagull
    type = seagull
    // send back results every 20 ms
    interval = 20ms
    // for the next 15 seconds
    duration = 15s
    // from sensors within rectangle
    rect = [-100, 100, 200, 400]
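A toy rendering of the attribute matching step described next (ours; the actual diffusion Filter API also supports numeric ranges and comparison operators):

    #include <string.h>
    #include <stdio.h>

    /* Toy attribute-value pair used for interest/event matching. */
    typedef struct { const char *key; const char *value; } attr_t;

    /* An event matches a filter if every filter attribute is present
     * in the event with the same value. */
    static int matches(const attr_t *filter, int nf,
                       const attr_t *event, int ne)
    {
        for (int i = 0; i < nf; i++) {
            int found = 0;
            for (int j = 0; j < ne; j++)
                if (strcmp(filter[i].key, event[j].key) == 0 &&
                    strcmp(filter[i].value, event[j].value) == 0) {
                    found = 1;
                    break;
                }
            if (!found) return 0;
        }
        return 1;
    }

    int main(void)
    {
        attr_t filter[] = { {"type", "seagull"} };
        attr_t event[]  = { {"type", "seagull"}, {"location", "D4"} };
        printf("match: %d\n", matches(filter, 1, event, 2));  /* prints 1 */
        return 0;
    }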
A node that receives a data message sends it to its Filter API, which subsequently performs a matching operation according to a list of attributes and their corresponding values. If a match is established between the received data message and the filter residing on the node, the diffusion substrate passes the event to the appropriate application module. Thus the Filter API is able to influence the data that propagates through the network from the source to the sink node, as an application module may perform some application-specific processing on the received event; for example, it may decide to aggregate the data.

Consider a scenario in an environmental monitoring project where the user needs to be notified when the light intensity in a certain area goes beyond a specified threshold. As the density of deployed nodes may be very high, it is likely that a large number of sensors would respond to an increase in light intensity simultaneously. Instead of having every sensor relay this notification to the user, intermediate nodes in the region could aggregate the readings from their neighboring nodes and return only the Boolean result, thus greatly reducing the number of radio transmissions.

Apart from aggregating data by simply suppressing duplicate messages, application-specific filters can also take advantage of named data to decide how to relay data messages back toward the sink node, and what data to cache in order to route future interest messages in a more intelligent and energy-saving manner. Filters also help save energy by ensuring that nodes react to incoming events only if the attribute matching process has been successful.

Diffusion also supports a more complex form of in-network aggregation. Filters allow nested queries, such that one sensor is able to trigger other sensors in its vicinity if the attribute-value matching operation
is successful. It is not necessary for a user to directly query all the relevant sensors. Instead, the user only queries a certain sensor, which in turn queries the other relevant sensors around it if certain conditions are met. In this case, energy savings are obtained in two ways. First, since the user may be geographically distant from the observed phenomenon, the energy spent transmitting data can be reduced drastically by using a triggering sensor. Second, if sampling the triggered (or secondary) sensor consumes a lot more energy than sampling the triggering (initial) sensor, then energy consumption can be reduced greatly by restricting the duty cycle of the secondary sensor to only those periods when certain conditions are met at the initial sensor.

33.2.2.1.2 COUGAR

Building on the same concept, that processing data within the network results in significant energy savings, but deviating from the library-based, lower-level approach used by Directed Diffusion, the COUGAR [10, 16] project envisions the sensor network as an extension of a conventional database, viewing it as a device database system. It makes the network more user-friendly by suggesting the use of a high-level declarative language similar to SQL. Using a declarative language ensures that queries are formulated independently of the physical structure and organization of the sensor network.

Conventional database systems use a warehousing approach [17], in which every sensor that gathers data from the environment relays that data back to a central site, where it is logged for future processing. While this framework is suitable for historical queries and snapshot queries, it cannot be used to service long-running queries [17]. For instance, consider the following query: Retrieve the rainfall level for all sensors in sector A every 30 sec if it is greater than 60 mm. Using the warehousing approach, every sensor would relay its reading back to a central database every 30 sec, regardless of whether it is in sector A or whether its rainfall level is greater than 60 mm. Upon receiving all the readings from the sensors, the database would then carry out the required processing to extract the relevant data. The primary problem with this approach is that excessive resources are consumed at each and every sensor node, as large amounts of raw data need to be transmitted through the network.

As the COUGAR approach is modeled around the concept of a database, the system generally proceeds as follows. It accepts a query from the user, produces a query execution plan (which contains detailed instructions on how exactly the query is to be serviced), executes this plan against the device database system, and produces the answer. The query optimizer generates a number of query execution plans and selects the plan that minimizes a given cost function. The cost function is based on two metrics, namely resource usage (expressed in Joules) and reaction time. In this case, the COUGAR approach selects the query execution plan that pushes the selection (rainfall level > 60 mm) onto the sensor nodes. Only the nodes that meet this condition send their readings back to the central node. Thus, just like in Directed Diffusion, the key idea here is to transfer part of the processing to the nodes themselves, which in turn reduces the amount of data that needs to be transmitted.
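In an SQL-like declarative language, the long-running rainfall query above might be rendered as follows. This is a sketch in the spirit of COUGAR, not its exact syntax; the table name and the EVERY clause are our own. The WHERE predicate is precisely what the optimizer pushes onto the sensor nodes, so only qualifying readings are transmitted:

SELECT sensor_id, rainfall
FROM sensors
WHERE sector = 'A'
  AND rainfall > 60   -- selection pushed onto the nodes
EVERY 30 s            -- long-running: re-evaluate every 30 sec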
33.2.2.1.3 TinyDB

Following in the footsteps of Directed Diffusion and the COUGAR project, TinyDB [11] also advocates in-network processing to increase the efficiency of the network and thus improve network lifetime. However, while TinyDB views the sensor network from the database perspective just like COUGAR, it goes a step further by pushing not only selection operations to the sensor nodes but also the basic aggregation operations that are common in databases, such as MIN, MAX, SUM, COUNT, and AVERAGE.

Figure 33.6 illustrates the obvious advantage of performing such in-network aggregation operations compared to transmitting just raw data. Without aggregation, every node in the network needs to transmit not only its own reading but also those of all its children. This not only causes a bottleneck close to the root node but also results in unequal consumption of energy: the closer a node is to the root node, the larger the number of messages it needs to transmit, which naturally results in higher energy consumption. Thus nodes closer to the root node die earlier. Losing nodes close to the root node can have disastrous consequences for the network due to network partitioning. Using in-network aggregation, however, every intermediate node aggregates its own reading with those of its children and eventually transmits only one combined result.

FIGURE 33.6 The effect of using in-network aggregation. (Left: data transmission without in-network aggregation, with every node relaying raw sensor readings toward the root node/base station; right: data transmission with in-network aggregation, with intermediate nodes combining their own readings with those of their children.)

Additionally, TinyDB has numerous other features, such as communication scheduling, hypothesis testing, and acquisitional query processing, which make it one of the most feature-rich distributed query processing frameworks for wireless sensor networks at the moment.

TinyDB requires users to specify queries injected into the sensor network using an SQL-like language. This language describes what data needs to be collected and how it should be processed upon collection as it propagates through the network toward the sink node. The language used by TinyDB differs from traditional SQL in the sense that its semantics supports queries that are continuous and periodic. For example, a query could state: "Return the temperature reading of all the sensors on Level 4 of the building every 5 min over a period of 10 h." The period of time between successive samples is known as an epoch (in this example it is 5 min). Just like in SQL, TinyDB queries follow the "SELECT - FROM - WHERE - GROUP BY - HAVING" format that supports selection, join, projection, aggregation, and grouping. Just like in COUGAR, sensor data is viewed as a single virtual table with one column per sensor type. Tuples are appended to the table at every epoch. Epochs also allow computation to be scheduled such that power consumption is minimized. For example, the following query specifies that each sensor should report its own identifier and temperature reading once every 60 sec for a duration of 300 sec:

SELECT nodeid, temp
FROM sensors
SAMPLE PERIOD 60s FOR 300s
The virtual table sensors is conceptually an unbounded, continuous data stream of values containing one column for every attribute and one row for every possible instant in time. The table is not actually stored on any device, that is, it is not materialized; sensor nodes only generate the attributes and rows that are referenced in active queries. Apart from the standard query shown above, TinyDB also supports event-based queries and lifetime queries [18]. Event-based queries reduce energy consumption by allowing nodes to remain dormant until some triggering event is detected. Lifetime queries are useful when users are not particularly interested in a specific rate of incoming readings, but rather in the required lifetime of the network. The basic idea is to send out a query stating that sensor readings are required for, say, 60 days. The nodes then decide on the best possible rate at which readings can be sent given the specified network lifetime.
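For illustration, queries of these two kinds might look as follows. They are modeled on examples in the acquisitional query processing work [18]; the exact clause spellings and the bird-detect event are illustrative, not guaranteed to match the shipped TinyDB release:

-- event-based: stay dormant until a detection event fires
ON EVENT bird-detect(loc):
  SELECT AVG(light), AVG(temp)
  FROM sensors AS s
  WHERE dist(s.loc, event.loc) < 10m
  SAMPLE PERIOD 2s FOR 30s

-- lifetime query: the nodes pick the sampling rate themselves
SELECT nodeid, accel
FROM sensors
LIFETIME 60 days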
Queries are disseminated into the network via a routing tree rooted at the base station, which is formed as nodes forward the received query to other nodes in the network. Every parent node can have multiple child nodes, but every child node has only a single parent node. Every node also keeps track of its distance from the root node in terms of the number of hops. This form of communication topology is commonly known as tree-based routing. Upon receiving a query, each node begins processing it. A special acquisition operator at each node acquires readings from the sensors corresponding to the fields or attributes referenced in the query. Similar to the concept of nested queries in Directed Diffusion, where sensors with a low sampling cost are sampled first, TinyDB orders the sampling operations and predicate evaluations by cost. Consider the following query as an example, where a user wishes to obtain readings from an accelerometer and a magnetometer provided certain conditions are met:

SELECT accel, mag
FROM sensors
WHERE accel > c1
AND mag > c2
SAMPLE INTERVAL 1s FOR 60s
Depending on the cost of sampling the accelerometer and the magnetometer, the optimizer will first sample the cheaper sensor to see if its condition is met. It will only proceed to the more costly second sensor if the first condition has been met.

Next we describe how the sampled data is processed within the nodes and subsequently propagated up the network toward the root node. Consider the following query: Report the average temperature on the fourth floor of the building every 30 sec. To service this query, the query plan has three operators: a data acquisition operator, a select operator that checks whether the value of floor equals 4, and an aggregate operator that computes the average temperature from not only the current node but also its children located on the fourth floor. Each sensor node applies the plan once per epoch, and the data stream produced at the root node is the answer to the query. The partial computation of averages is represented as {sum, count} pairs, which are merged at each intermediate node in the query plan to compute a running average as data flows up the tree.

TinyDB uses a slotted scheduling protocol to collect data, in which child nodes send and parent nodes receive data according to the tree-based communication protocol. Each node is assumed to produce exactly one result per epoch, which must be forwarded all the way to the base station. Every epoch is divided into a number of fixed-length intervals, the number of which depends on the depth of the tree. The intervals are numbered in reverse order, such that interval 1 is the last interval in the epoch. Every node in the network is assigned to a specific interval that correlates with its depth in the routing tree. Thus, for instance, if a particular node is two hops away from the root node, it is assigned the second interval. During its own interval, a node performs the necessary computation, transmits its result, and goes back to sleep. In the interval preceding its own, a node sets its radio to "listen" mode, collecting results from its child nodes. Thus data flows up the tree in a staggered manner, eventually reaching the root node during interval 1, as shown in Figure 33.7.
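The merging of partial {sum, count} records is simple enough to sketch in a few lines of C. The type and function names below are ours, not TinyDB's:

/* Partial state for a distributed AVERAGE: a {sum, count} pair.
 * Unlike plain averages, these records can be merged associatively
 * at every intermediate node of the routing tree. */
struct avg_state {
    long sum;
    long count;
};

/* Fold a child's partial result into the local one. */
static void avg_merge(struct avg_state *local, const struct avg_state *child)
{
    local->sum   += child->sum;
    local->count += child->count;
}

/* Add the node's own sensor reading. */
static void avg_accumulate(struct avg_state *local, long reading)
{
    local->sum += reading;
    local->count++;
}

/* Only the root evaluates the final average. */
static double avg_finalize(const struct avg_state *s)
{
    return s->count ? (double)s->sum / (double)s->count : 0.0;
}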
FIGURE 33.7 Communication scheduling in TinyDB using the slotted approach [19]. (An epoch is divided into intervals 5 down to 1; each node transmits during the interval matching its depth in the tree, puts its radio in listen-only mode during the preceding interval to collect its children's results, and sleeps otherwise, so data flows up toward the root in a staggered manner.)

33.2.2.1.4 Discussion

In this section we compare the various projects described above and highlight some of their drawbacks. We also mention other work in the literature that has contributed further improvements to some of these existing projects. Table 33.1 compares some of the key features of the various projects.

TABLE 33.1 Comparison of Data Management Strategies

                            Directed Diffusion           COUGAR          TinyDB
Type                        Non-database                 Database        Database
Platform                    iPAQ class (Mote class       iPAQ class      Mote class
                            for micro-diffusion)
Query language              Application specific,        SQL-based       SQL-based
                            dependent on Filter API
Type of in-network          Suppression of identical     Selection       Selection, aggregation
aggregation                 data messages from           operators       operators and limited
                            different sources                            optimization
Crosslayer features         Routing integrated with      None            Routing integrated with
                            in-network aggregation                       in-network aggregation;
                                                                         communication scheduling
                                                                         also decreases burden on
                                                                         the MAC layer
Caching of data for         Yes                          No              Yes
future routing
Power saving mechanism      Yes — nested queries         None            Yes — acquisitional
while sampling sensors                                                   query processing
Type of optimization        None                         Centralized     Mostly centralized —
                                                                         metadata is occasionally
                                                                         copied to catalogue

As mentioned earlier, Directed Diffusion was a pioneering project in the sense that it introduced the fundamental concept of improving network efficiency by processing data within the sensor network. However, unlike COUGAR and TinyDB, it does not offer a particularly simple interface, a flexible naming system, or any generic aggregation and join operators. Such operators are considered application-specific operators and must always be coded in a low-level language.
A drawback of this approach is that query optimizers are unable to deal with such user-defined operators, as there are no fixed semantics: the optimizer cannot make the necessary cost comparisons between the various user-defined operators. A direct consequence is that, since the system cannot handle optimization tasks autonomously, the arduous responsibility of placing and ordering operators falls on the
user. This naturally is a great hindrance to users of the system (e.g., environmentalists) who are only concerned with injecting queries into the network and obtaining the results — not figuring out the intricacies of energy-efficient mechanisms to extend network lifetime!

While the COUGAR project specifically claims to target wireless sensor networks [20, 21], apart from the feature of pushing down selection operations into the device network, it does not demonstrate any other novel design characteristics that would allow it to run on sensor networks. In fact, the COUGAR project has simulations and implementations using Linux-based iPAQ-class hardware, which has led to certain design decisions that would be unsuitable for sensor networks. For instance, unlike Directed Diffusion [14] and TinyDB [18], COUGAR does not take the cost incurred by sampling sensors into consideration when generating query execution plans. It also does not take advantage of certain inherent properties of radio communication, for example, snooping, and it fails to suggest any method that could link queries to communication scheduling. Additionally, the use of XML to encode messages and tuples makes it inappropriate for sensor networks, given their limited bandwidth and high cost of transmission per bit.

Among the various query processing systems currently described in the literature, TinyDB seems to be the most feature-packed. The TinyDB software has been deployed using Mica2 motes in the Berkeley Botanical Garden to monitor the microclimate in the garden's redwood grove [22]. However, the initial deployment only relays raw readings and does not currently make use of any of the aggregation techniques introduced in the TinyDB literature. While TinyDB may have approached the problem of improving energy efficiency from several angles, it does have a number of inherent drawbacks, the most significant being its lack of adaptability. First, the communication scheduling mentioned above is highly dependent on the depth of the network, which is assumed to be fixed. This makes it unable to react on-the-fly to changes in the network topology, which could easily happen if new nodes are added or certain nodes die. Second, the communication scheduling is also directly dependent on the epoch that is specified in every query injected into the network. With networks expected to span hundreds or even thousands of nodes, it is unlikely that environmentalists using a particular network would inject only one query into the network at any one time. Imagine if the Internet had been designed such that only one person was allowed to use it at any instant! Thus, methods need to be devised to enable multiple queries to run simultaneously in a sensor network.

Although TinyDB greatly reduces the number of transmissions by carrying out in-network aggregation for every long-running query, it keeps transmitting data during the entire duration of the active query, disregarding the temporal correlation in a sequence of sensor readings. Reference 23 takes advantage of this property and ensures that nodes only transmit data when there is a significant enough change between successive readings. In other words, sensors may refrain from transmitting data if the readings remain constant.
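The suppression scheme of Reference 23 boils down to a simple per-sensor filter; a minimal sketch follows, in which the structure and threshold name are ours:

#include <stdlib.h>  /* labs() */

/* Report a reading only when it deviates enough from the last
 * reported value; otherwise stay silent and save a transmission. */
struct suppressor {
    long last_sent;   /* value most recently transmitted */
    long threshold;   /* minimum change worth reporting */
};

static int should_transmit(struct suppressor *s, long reading)
{
    if (labs(reading - s->last_sent) < s->threshold)
        return 0;           /* change too small: suppress */
    s->last_sent = reading; /* remember what the sink will now know */
    return 1;
}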
Another area related to the lack of adaptability, affecting both COUGAR and TinyDB, has to do with the generation of query execution plans. In both projects, the systems assume a global view of the network when it comes to query optimization: network metadata is periodically copied from every node within the network to the root node, and this information is subsequently used to work out the best possible query execution plan. Obviously, the cost of extracting network metadata from every node is highly prohibitive. Also, query execution plans generated centrally may be outdated by the time they reach the designated nodes, as conditions in a sensor network can be highly volatile; for example, the node delegated to carry out a certain task may have run out of power and died by the time instructions arrive from the root node. In this regard, it is necessary to investigate methods where query optimization is carried out using only local information. While the resulting plans may not be as good as plans generated from global network metadata, this will yield significant savings in terms of the number of radio transmissions. Reference 24 looks into creating an adaptive and decentralized algorithm that places operators optimally within a sensor network. However, the preliminary simulation results are questionable, since the overhead incurred during the neighbor exploration phase is not considered. Also, there is no mention of how fast the algorithm responds to changes in network dynamics.
33.3 Data-Centric Architecture

As we previously stated, the layered protocol stack description of the system architecture of a sensing node cannot cover all the aspects involved (such as crosslayer communication, dynamic updates, etc.). In this section we address the problem of describing the system architecture in a more suitable way, and we discuss its implications for the application design.
33.3.1 Motivation

Sensor networks are dynamic from many points of view. Continuously changing behavior can be noticed in several aspects of sensor networks, among them:

Sensing process. The natural environment is dynamic by all means (the basic purpose of sensor networks is to detect, measure, and alert the user to changes in its parameters). The sensor modules themselves can become less accurate, need calibration, or even break down.

Network topology. One of the features of sensor networks is their continuously changing topology. Many factors contribute to this, such as failures of nodes, the unreliable communication channel, mobility of the nodes, variations in transmission ranges, cluster reconfiguration, addition/removal of sensor nodes, etc. Related to this aspect, the algorithms designed for sensor networks need two main characteristics: they need to be independent of the network topology, and they need to scale well with the network size.

Available services. Mobility of nodes, failures, or the availability of certain kinds of nodes might trigger reconfigurations inside the sensor network. The functionality of nodes may depend on services that exist at certain moments; when these are no longer available, the nodes will either reconfigure themselves or try to provide the services themselves.

Network structure. New kinds of nodes may be added to the network. Their different and increased capabilities will change the regular way in which the network functions. Software modules might be improved, or completely new software functionality might be implemented and deployed in the sensor nodes.

Most wireless sensor network architectures currently use a fixed layered structure for the protocol stack in each node. This approach has certain disadvantages for wireless sensor networks. Some of them are:

Dynamic environment. Sensor nodes address a dynamic environment in which nodes have to reconfigure themselves to adapt to changes. Since resources are very limited, reconfiguration is also needed to keep the system efficient (a totally new functionality might have to be used if energy levels drop below certain values). The network can adapt its functionality to a new situation, in order to reduce the use of the scarce energy and memory resources, while maintaining the integrity of its operation.

Error control. Error control normally resides in all protocol layers, so that the worst-case scenario is covered at every layer. For a wireless sensor network this redundancy might be too expensive. Adopting a central view of how error control is performed, together with crosslayer design, will reduce the resources spent on error control.

Power control. Power control is traditionally done only at the physical layer, but since energy consumption in sensor nodes is a major design constraint, it is found in all layers (physical, data-link, network, transport, and application layers).

Protocol place in the sensor node architecture. An issue arises when trying to place certain layers in the protocol stack. Examples include timing and synchronization, localization, and calibration. These protocols might shift their place in the protocol stack as soon as their transient phase is over. The data produced by some of these algorithms might make a different protocol stack more suitable for the sensor node (e.g., a localization algorithm for static sensor networks might enable a better routing algorithm that uses information about the location of the routed data's destination).
Protocol availability. New protocols might become available after the network deployment. At certain moments, under specific conditions, some of the sensor nodes might use a different protocol stack that better suits their goal and the environment.

It is clear from these examples that dynamic reconfiguration of each protocol, as well as dynamic reconfiguration of the active protocol stack, is needed.
33.3.2 Architecture Description

The system we are trying to model is an event-driven system, meaning that it reacts to and processes incoming events and afterwards, in the absence of these stimuli, spends its time in the sleep state (the software components running inside the sensor node are not allowed to perform blocking waits). Let us introduce data as a higher-level abstraction of the event class. Data may encapsulate the information provided by one or more events, has a unique name, and may contain additional information such as deadlines, the identity of the producer, etc. Data will be the means used by the internal mechanisms of the architecture to exchange information between components.

In the following we will refer to any protocol or algorithm that can run inside a sensor node with the term entity (see Figure 33.8). An entity is a software component that is triggered by the availability of one or more data types. While running, each entity is allowed to read available data types (but not to wait for additional data types to become available). As a result of its processing, each software component can produce one or more types of data (usually on its exit). An entity is also characterized by some functionality, meaning the sort of operation it can perform on the input data. Based on their functionality, entities can be classified as being part of a certain protocol layer, as in the previous description. For one given functionality, several entities might exist inside a sensor node; to discern among them, one should take their capabilities into consideration.
FIGURE 33.8 Entity description. (Each entity consumes input data types, applies its functionality subject to its capabilities, and produces output data types; a module manager and a publish/subscribe server mediate the data exchange between entities.)
By capability we understand a high-level description containing the cost for a specific entity to perform its functionality (in terms of energy, resources, time, etc.) and some characteristics indicating the estimated performance and quality of the algorithm.

For a set of components to work together, the way in which they are to be interconnected must be specified. The architectures existing so far in the wireless sensor network field assume a fixed way in which these components can be connected, defined at compile time (except for architectures that, e.g., allow the execution of agents). To change the protocol stack in such an architecture, the user has to download the whole compiled code into the sensor node (via the wireless interface) and then make use of some boot code to replace the old running code. In the proposed architecture we allow this interconnection to be changed at runtime, thus making possible online updates of the code, the selection of a more suitable entity to perform some functionality based on changes in the environment, etc. (in one word, allowing the architecture to become dynamically reconfigurable).

To make this mechanism work, a new entity needs to be implemented; let us call this the data manager. The data manager will monitor the different kinds of data becoming available and will coordinate the dataflow inside the sensor node. At the same time it will select the most fitting entities to perform the work, and it will even be allowed to change the whole functionality of the sensor node based on the available entities and the external environment (see Figure 33.9).

The implementation of these concepts cannot ignore the small amount of resources each sensor node has (energy, memory, computation power, etc.). Going down from the abstraction level to the point where the device is actually working, a compulsory step is implementing the envisioned architecture on a particular operating system (in this case a better term may be system software). A large range of operating systems exists for embedded systems in general [25, 26]. Scaled-down versions with simple schedulers and limited functionality have been developed especially for wireless sensor networks [27]. Usually, the issues of system architecture and operating system are treated separately, both trying to be as general as possible and to cover all possible application cases.

A simplistic view of a running operating system is a scheduler that manages the available resources and coordinates the execution of a set of tasks. This operation is centralized from the point of view of the scheduler, which is allowed to take all the decisions. Our architecture can also be regarded as a centralized system, with the data manager coordinating the dataflow of the other entities. To obtain the smallest possible overhead, there should be a correlation between the function of the central nucleus of our architecture and the function of the scheduler of the operating system. This is why we propose a close relationship between the two concepts, extending the functionality of the scheduler with that of the data manager. The main challenges that arise are keeping the code size and the context-switching time low.

33.3.2.1 Requirements

As mentioned earlier, the general concept of data is used rather than that of events. For decisions based on data to work, some additional requirements have to be met.
First of all, all modules need to declare the name of the data that will trigger their action, the names of the data they will need to read during their action (this can generically incorporate all the shared resources in the system), and the names of the data they will produce. The scheduler needs all this information to take its decisions. From the point of view of the operating system, a new component that takes care of all the data exchange needs to be implemented. This would in fact be an extended message-passing mechanism, with the added feature of notifying the scheduler when new data types become available. The mapping of this module onto the architecture is the constraint imposed on the protocols to send/receive data via, for example, a publish/subscribe mechanism to the central scheduler. An efficient naming system for the entities and the data is needed. Downloading new entities to a sensor node involves issues similar to service discovery. Several entities with the same functionality but with different requirements and capabilities might coexist; the data-centric scheduler has to decide which one is the best. A sketch of these declarations and of the resulting dispatch loop is given below.
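The following C sketch makes the requirements concrete: entities declare their trigger, read, and produce data types, and a data-centric scheduler dispatches on data availability, preferring the cheapest runnable entity. All names and structures are illustrative assumptions of ours, not part of an existing sensor operating system:

#include <stdint.h>
#include <stddef.h>

typedef uint8_t data_type_t;   /* small integer naming a data type (< 32) */

struct entity {
    const char        *name;
    const data_type_t *triggers; size_t n_triggers; /* data that activates the entity */
    const data_type_t *reads;    size_t n_reads;    /* data read while running        */
    const data_type_t *produces; size_t n_produces; /* data published on exit         */
    uint16_t           cost;     /* capability: e.g., estimated energy per run */
    void             (*run)(void);
};

static uint32_t available;     /* bitmap of currently available data types */

/* Called by entities (and interrupt handlers) when new data appears. */
void publish(data_type_t t) { available |= 1u << t; }

/* Here: an entity is runnable when ALL its trigger types are present. */
static int runnable(const struct entity *e)
{
    for (size_t i = 0; i < e->n_triggers; i++)
        if (!(available & (1u << e->triggers[i])))
            return 0;
    return 1;
}

/* Data-centric dispatch: pick the cheapest runnable entity and run it;
 * with nothing to do, the node would enter its low-power sleep state. */
void schedule(struct entity *table, size_t n)
{
    for (;;) {
        struct entity *best = NULL;
        for (size_t i = 0; i < n; i++)
            if (runnable(&table[i]) && (!best || table[i].cost < best->cost))
                best = &table[i];
        if (best) {
            best->run();   /* may publish() new data types */
            for (size_t i = 0; i < best->n_triggers; i++)
                available &= ~(1u << best->triggers[i]); /* consume triggers */
        } else {
            /* sleep until an interrupt publishes new data */
        }
    }
}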
FIGURE 33.9 Architecture transitions. (The same set of entities, each described by its input data types, functionality, capabilities, and output data types, is rewired into different dataflow configurations as conditions change.)

33.3.2.2 Extension of the Architecture

The architecture presented earlier might be extended to groups of sensor nodes. Several data-centric schedulers, together with a small, fixed number of protocols, can communicate with each other and form a virtual backbone of the network. Entities running inside sensor nodes can be activated using data types that become available at other sensor nodes (e.g., imagine one node using its neighbor's routing entity because it needs its own memory to process some other data). Of course, this approach raises new challenges: a naming system for the functionalities and data types, and the reliability of the system (in the face of factors such as mobility, communication failures, node failures, and security attacks) are just a few examples. Related work on these topics already exists (e.g., References 28 and 29).
33.4 Conclusion

In this chapter, we have outlined the characteristics of wireless sensor networks from an architectural point of view. As sensor networks are designed for specific applications, there is no single architecture that fits them all, but rather a common set of characteristics that can be taken as a starting point. The combination of the data-centric features of sensor networks and the need for a dynamically reconfigurable structure has led to a new architecture that provides enhanced capabilities compared to the existing ones. The characteristics of the new architecture and its implementation issues have been discussed, laying the foundations for future work. This area of research is currently in its infancy, and major steps are required in the fields of communication protocols, data processing, and application support to make the vision of Mark Weiser a reality.
References

[1] EYES. Eyes European project, http://eyes.eu.org.
[2] Chu, P., Lo, N.R., Berg, E., and Pister, K.S.J. Optical communication using micro corner cube reflectors. In Proceedings of MEMS97. IEEE, Nagoya, Japan, 1997, pp. 350–355.
[3] Havinga, P. et al. Eyes deliverable 1.1 — system architecture specification.
[4] SmartDust. http://robotics.eecs.berkeley.edu/~pister/SmartDust.
[5] Akyildiz, I., Su, W., Sankarasubramaniam, Y., and Cayirci, E. A survey on sensor networks. IEEE Communications Magazine, 40(8), 102–114, 2002.
[6] Intanagonwiwat, C., Govindan, R., Estrin, D., Heidemann, J., and Silva, F. Directed diffusion for wireless sensor networks. IEEE/ACM Transactions on Networking, 11(1), 2–16, 2003.
[7] Pottie, G.J. and Kaiser, W.J. Embedding the internet: wireless integrated network sensors. Communications of the ACM, 43(5), 51–58, 2000.
[8] Ganesan, D., Cerpa, A., Ye, W., Yu, Y., Zhao, J., and Estrin, D. Networking issues in wireless sensor networks. Journal of Parallel and Distributed Computing, Special Issue on Frontiers in Distributed Sensor Networks, 64(7), 799–814, 2004.
[9] Estrin, D., Govindan, R., Heidemann, J.S., and Kumar, S. Next century challenges: scalable coordination in sensor networks. In Mobile Computing and Networking. IEEE, Seattle, Washington, USA, 1999, pp. 263–270.
[10] Bonnet, P., Gehrke, J., and Seshadri, P. Towards sensor database systems. In Proceedings of the Second International Conference on Mobile Data Management. Springer-Verlag, Heidelberg, 2001, pp. 3–14.
[11] Madden, S., Szewczyk, R., Franklin, M., and Culler, D. Supporting aggregate queries over ad-hoc wireless sensor networks. In Proceedings of the Fourth IEEE Workshop on Mobile Computing Systems and Applications. IEEE, 2002.
[12] Postel, J. Internet protocol, RFC 791, 1981.
[13] Estrin, D., Girod, L., Pottie, G., and Srivastava, M. Instrumenting the world with wireless sensor networks. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing. IEEE, Salt Lake City, Utah, 2001.
[14] Heidemann, J.S., Silva, F., Intanagonwiwat, C., Govindan, R., Estrin, D., and Ganesan, D. Building efficient wireless sensor networks with low-level naming. In Symposium on Operating Systems Principles. ACM, 2001, pp. 146–159.
[15] Pottie, G.J. and Kaiser, W.J. Wireless integrated network sensors. Communications of the ACM, 43(5), 51–58, 2000.
[16] Bonnet, P. and Seshadri, P. Device database systems. In Proceedings of the International Conference on Data Engineering. IEEE, San Diego, CA, 2000.
[17] Bonnet, P., Gehrke, J., and Seshadri, P. Querying the physical world. IEEE Personal Communications, 7, 10–15, 2000.
[18] Madden, S., Franklin, M.J., Hellerstein, J.M., and Hong, W. The design of an acquisitional query processor for sensor networks. In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data. ACM Press, San Diego, CA, 2003, pp. 491–502.
[19] Madden, S. The design and evaluation of a query processing architecture for sensor networks. PhD thesis, University of California, Berkeley, 2003.
[20] Yao, Y. and Gehrke, J. The cougar approach to in-network query processing in sensor networks. SIGMOD Record, 31(3), 2002.
[21] Yao, Y. and Gehrke, J. Query processing for sensor networks. In Proceedings of the Conference on Innovative Data Systems Research. Asilomar, CA, 2003.
[22] Gehrke, J. and Madden, S. Query processing in sensor networks. Pervasive Computing. IEEE, 2004, pp. 46–55.
[23] Beaver, J., Sharaf, M.A., Labrinidis, A., and Chrysanthis, P.K. Power aware in-network query processing for sensor data. In Proceedings of the Second Hellenic Data Management Symposium. Athens, Greece, 2003.
[24] Bonfils, B.J. and Bonnet, P. Adaptive and decentralized operator placement for in-network query processing. In Proceedings of the Second International Workshop on Information Processing in Sensor Networks (IPSN), Vol. 2634 of Lecture Notes in Computer Science. Springer-Verlag, Berlin, Heidelberg, 2003, pp. 47–62.
[25] VxWorks. Wind River, http://www.windriver.com.
[26] Salvo. Pumpkin Incorporated, http://www.pumpkininc.com.
[27] Hill, J., Szewczyk, R., Woo, A., Hollar, S., Culler, D.E., and Pister, K.S.J. System architecture directions for networked sensors. In Architectural Support for Programming Languages and Operating Systems, 2000, pp. 93–104.
[28] Verissimo, P. and Casimiro, A. Event-driven support of real-time sentient objects. In Proceedings of the Eighth IEEE International Workshop on Object-Oriented Real-Time Dependable Systems. IEEE, Guadalajara, Mexico, 2003.
[29] Cheong, E., Liebman, J., Liu, J., and Zhao, F. TinyGALS: a programming model for event-driven embedded systems. In Proceedings of the 2003 ACM Symposium on Applied Computing. ACM Press, Melbourne, Florida, 2003, pp. 698–704.
34
Energy-Efficient Medium Access Control

Koen Langendoen and Gertjan Halkes
Delft University of Technology

34.1 Introduction  34-1
     Contention-Based Medium Access • Schedule-Based Medium Access
34.2 Requirements for Sensor Networks  34-5
     Hardware Characteristics • Communication Patterns • Miscellaneous Services
34.3 Energy Efficiency  34-7
     Sources of Overhead • Trade-Offs
34.4 Contention-Based Protocols  34-11
     IEEE 802.11 • LPL and Preamble Sampling • WiseMAC
34.5 Slotted Protocols  34-12
     Sensor-MAC • Timeout-MAC • Data-Gathering MAC
34.6 TDMA-Based Protocols  34-14
     Lightweight Medium Access
34.7 Comparison  34-17
     Simulation Framework • Micro-Benchmarks • Homogeneous Unicast and Broadcast • Local Gossip • Convergecast • Discussion
34.8 Conclusions  34-27
Acknowledgments  34-27
References  34-28
34.1 Introduction

Managing wireless communication will be the key to effective deployment of large-scale sensor networks that need to operate for years. On the one hand, wireless communication is essential (1) to foster collaboration between neighboring sensor nodes to help overcome the inherent limitations of their cheap, and hence inaccurate, sensors observing physical events, and (2) to report those events back to a sink node connected to the wired world. On the other hand, wireless communication consumes a lot of energy, is error prone, and has limited range, forcing many nodes to participate in relaying information, all of which severely limits the lifetime of the (unattended) sensor network. In typical sensor nodes, such as the Mica2 mote, communicating one bit of information consumes as much energy as executing several hundred
instructions. Therefore, one should "think" twice before actually transmitting a message. Nevertheless, whenever a message should be sent, the protocol stack must operate as efficiently as possible. In this chapter, we will study the medium access layer, which is part of the data link layer (layer 2 of the OSI model) and sits directly on top of the physical layer (layer 1) (see Figure 34.1). Since the medium access layer controls the radio, it has a large impact on the overall energy consumption, and hence, the lifetime of a node.

FIGURE 34.1 Network protocol stack (layer 3: network; layer 2: data link, containing the MAC protocol; layer 1: physical).

A Medium Access Control (MAC) protocol decides when competing nodes may access the shared medium, that is, the radio channel, and tries to ensure that no two nodes are interfering with each other's transmissions. In the unfortunate event of a collision, a MAC protocol may deal with it through some contention resolution algorithm, for example, by resending the message later at a randomly selected time. Alternatively, the MAC protocol may simply discard the message and leave the retransmission — if any — up to the higher layers in the protocol stack.

MAC protocols for wireless networks have been studied since the 1970s, but the successful introduction of wireless LANs (WLANs) in the late 1990s has accelerated the pace of developments; the recent survey by Jurdak et al. [1] reports an exponential growth of new MAC protocols. We will now provide a brief historic perspective on the evolution of MAC, and describe the two major approaches — contention-based and schedule-based — regularly used in wireless communication systems. Readers familiar with medium access in wireless networks may proceed to Section 34.2 immediately.
34.1.1 Contention-Based Medium Access

In the classic (pure) ALOHA protocol [2], developed for packet radio networks in the 1970s, a node simply transmits a packet as soon as it is generated. If no other node is sending at the same time, the data transmission succeeds and the receiver responds with an acknowledgment. In the case of a collision, no acknowledgment is generated, and the sender retries after a random period. The price to be paid for ALOHA's simplicity is its poor use of the channel capacity; the maximum throughput of the ALOHA protocol is only 18% [2]. However, a minor modification to ALOHA increases the channel utilization considerably. In slotted ALOHA, time is divided into slots, and nodes may only start transmitting at the beginning of a slot. This organization halves the probability of a collision and raises the channel utilization to around 35% [3].
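These utilization figures follow from the standard Poisson traffic model; we add the textbook derivation here for reference. With an offered load of $G$ transmission attempts per packet time, a packet in pure ALOHA only survives if no other transmission starts within a vulnerable window of two packet times:

\[ S_{\text{pure}} = G e^{-2G}, \qquad \max_G S_{\text{pure}} = \frac{1}{2e} \approx 18.4\% \quad \text{at } G = \tfrac{1}{2}. \]

Slotting reduces the vulnerable window to a single slot, so

\[ S_{\text{slotted}} = G e^{-G}, \qquad \max_G S_{\text{slotted}} = \frac{1}{e} \approx 36.8\% \quad \text{at } G = 1, \]

which is the roughly doubled utilization quoted above.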
34.1.1.1 Carrier Sense Multiple Access

Instead of curing the effects (retransmissions) after the fact, it is often much better to remove the root of the problem (collisions). The Carrier Sense Multiple Access (CSMA) protocol [4], originally introduced by Kleinrock and Tobagi in 1975, tries to do just that. Before transmitting a packet, a node first listens to the channel for a small period of time. If it does not sense any traffic, it assumes that the channel is clear and starts transmitting the packet. Since it takes some time to switch the radio from receive mode to transmit mode, the CSMA method is not bulletproof and collisions can still occur. In practice, however, CSMA-style MAC protocols can achieve a maximal channel utilization on the order of 50 to 80%, depending on the exact access policy [4].

34.1.1.2 Carrier Sense Multiple Access with Collision Avoidance

When all nodes can sense each other's transmissions, CSMA performs just fine. It took until 1990 before a significant new development in MAC was recorded. The Medium Access with Collision Avoidance (MACA) protocol [5] addresses the so-called hidden terminal problem that occurs in ad hoc (sensor) networks, where the radio range is not large enough to allow communication between arbitrary nodes and two (or more) nodes may share a common neighbor while being out of each other's reach. Consider the situation in Figure 34.2, where nodes A and C both want to transmit a packet to their common neighbor B. Both nodes sense an idle channel and start to transmit their packets, resulting in a collision at B. Note that since node A is hidden from C, any packet sent by C will disrupt an ongoing transmission from A to B, so this type of collision is quite common in ad hoc networks.

The MACA protocol introduces a three-way handshake to make hidden nodes aware of upcoming transmissions, so collisions at common neighbors can be avoided. The sender (node A in Figure 34.2) initiates the handshake by transmitting a short Request-To-Send (RTS) control packet announcing its intended data transmission. The receiver (B) responds with a Clear-To-Send (CTS) packet, which informs all neighbors of the receiver (including hidden nodes like C) of the upcoming transfer. The final DATA transfer (from A to B) is now guaranteed to be collision-free. When two RTS packets collide, which is technically still possible, the intended receiver does not respond with a CTS and both senders back off for some random time. To account for the unreliability of the radio channel, MACA Wireless (MACAW [6]) adds a fourth packet to the control sequence to guarantee delivery. When the data is received correctly, an explicit ACKnowledgment is sent back to the sender. If the sender does not receive the ACK in due time, it initiates a retransmission sequence to account for the corrupted or lost data.

The collision avoidance protocol in MACA (and derivatives) is widely used and is generally known as CSMA/CA (Carrier Sense Multiple Access with Collision Avoidance). It has proved to be very effective in eliminating collisions. In fact, CSMA/CA is too good at it and also silences nodes whose transmissions would not interfere with the data transfer between the sender–receiver pair. The so-called exposed terminal problem is illustrated in Figure 34.3. In principle, the data transmissions B→A and C→D can take place concurrently, since the signals from B cannot disturb the reception at D, and similarly C's signals cannot collide at A. However, since B must be able to receive the CTS from A, all nodes that can hear B's RTS packet must remain silent, even if they are outside the reach of the receiver (A). Node C is thus exposed to B's transmission (and vice versa). Since exposed nodes are prohibited from sending, aggregate throughput may be reduced.
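The resulting sender-side procedure can be summarized in code. The sketch below is illustrative only: the radio primitives (channel_clear, send_rts, etc.), the timeout values, and the retry limit are assumptions of ours, not part of the MACAW specification:

#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical radio/timer primitives assumed by this sketch. */
bool channel_clear(void);
void send_rts(int dst);
void send_data(int dst);
bool wait_for_cts(int dst, int timeout_ms);
bool wait_for_ack(int dst, int timeout_ms);
void sleep_ms(int ms);

#define MAX_RETRIES 5

/* MACAW-style unicast: carrier sense, RTS/CTS handshake, DATA,
 * then wait for the explicit ACK; back off randomly on failure. */
bool macaw_send(int dst)
{
    int backoff = 1;   /* in units of some base backoff slot */
    for (int attempt = 0; attempt < MAX_RETRIES; attempt++) {
        if (channel_clear()) {
            send_rts(dst);
            if (wait_for_cts(dst, 10)) {   /* hidden nodes now silenced */
                send_data(dst);
                if (wait_for_ack(dst, 10))
                    return true;           /* delivery confirmed */
            }
        }
        /* busy channel, RTS collision, or loss: random (binary
         * exponential) backoff before retrying */
        sleep_ms(rand() % (backoff * 10) + 1);
        backoff *= 2;
    }
    return false;
}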
FIGURE 34.2 (a) The hidden terminal problem resolved through (b) Request-To-Send/Clear-To-Send signaling.

FIGURE 34.3 The exposed terminal problem: (a) concurrent transfers are (b) synchronized.
34.1.1.3 IEEE 802.11

In 1999, the IEEE Computer Society published the 802.11 WLAN standard [7], specifying the PHYsical and MAC layers. IEEE 802.11-compliant equipment, usually PC cards operating in the 2.4 or 5 GHz band, can operate in infrastructure mode as well as in ad hoc mode. In both cases, 802.11 implements carrier sense and collision avoidance to reduce collisions (see Section 34.4.1 for details). To preserve the energy of mobile nodes, the 802.11 standard includes a power-saving mechanism that allows nodes to go into sleep mode (i.e., disable their radios) for long periods of time. This mode of operation requires the presence of an access point that records the status of each node and buffers any data addressed to a sleeping node. The access point regularly broadcasts beacon packets indicating for which nodes it has buffered packets. These nodes may then send a poll request to the access point to retrieve the buffered data (or switch back from sleep to active mode). Krashinsky and Balakrishnan report up to 90% energy savings for web browsing applications, but at the expense of considerable delays [8]. Currently, power saving in 802.11's ad hoc mode is only supported when all nodes are within each other's reach, so that a simple, distributed scheme can be used to coordinate actions; the standard does not include a provision for power saving in multihop networks.
34.1.2 Schedule-Based Medium Access

The MAC protocols discussed so far are based on autonomous nodes contending for the channel. A completely different approach is to have a central authority (access point) regulate access to the medium by broadcasting a schedule that specifies when, and for how long, each controlled node may transmit over the shared channel. The lack of contention overhead guarantees that this approach does not collapse under high loads. Furthermore, with the proper scheduling policy, nodes get deterministic access to the medium and can provide delay-bounded services such as voice and multimedia streaming. Schedule-based medium access is, therefore, the preferred choice for cellular phone systems (e.g., GSM) and wireless networks supporting a mix of data and real-time traffic (e.g., Bluetooth).

34.1.2.1 Time-Division Multiple Access

Time-Division Multiple Access (TDMA) is an important schedule-based approach that controls access to a single channel (techniques for handling multiple channels will be discussed in Section 34.3.2.1). In TDMA systems the channel is divided into slots, which are grouped into frames (see Figure 34.4). The access point decides (schedules) which slot is to be used by which node. This decision can be made on a per-frame basis, or it can span several frames, in which case the schedule is repeated. In typical WLAN setups, most traffic is exchanged between the access point and the individual nodes. In particular, communication between nodes rarely occurs. By limiting communication to up- and downlink only, the scheduling problem is greatly simplified.

Figure 34.4 shows a typical frame layout. The first slot in the frame is used by the access point to broadcast traffic control information to all nodes in its cell. This information includes a schedule that specifies when each node must be ready to receive a packet (in the downlink section), and when it may send a packet (in the uplink section). The frame ends with a contention period in which new nodes can register themselves with the access point, so they can be included in future schedules.

FIGURE 34.4 TDMA frame structure: traffic control–downlink–uplink–contention period.
TDMA systems provide a natural way to conserve energy. A node can turn off its radio during all slots in a frame in which it is not engaged in communication to/from the access point. This does require, however, accurate time synchronization between the access point and the individual nodes, to ensure that a node can wake up exactly at the start of "its" slots. In a sensor network, where activity is usually low, a node is then — on average — only awake for one slot each frame, to receive the traffic control information. Enlarging the frame size reduces the energy consumption, but also increases the latency, since a node has to wait longer before its slot comes up. This fundamental energy/latency trade-off is further explored in Section 34.3.2.
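A back-of-the-envelope model makes the trade-off explicit. Assume (our notation) a frame of $N$ slots of duration $T_s$, and an idle node that only wakes for the traffic-control slot; then

\[ \text{duty cycle} \approx \frac{1}{N}, \qquad \bar{P} \approx \frac{P_{\text{rx}} + (N-1)\,P_{\text{sleep}}}{N}, \qquad \bar{L} \approx \frac{N T_s}{2}, \]

so doubling the frame length $N$ roughly halves the idle power (as long as $P_{\text{sleep}} \ll P_{\text{rx}}$), but it also doubles the expected wait $\bar{L}$ before a node's own slot comes up.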
34.2 Requirements for Sensor Networks

The vast majority of MAC protocols described in the literature so far were designed, and optimized, for scenarios involving satellite links (early work) and WLANs (recent developments). The deployment scenarios for wireless sensor networks differ considerably, leading to a different set of requirements. In particular, the unattended operation of sensor networks stresses the importance of energy efficiency and reduces the significance of performance considerations such as low latency, high throughput, and fairness. Nevertheless, there are lessons to be learned from MAC protocols developed for wireless communication systems, especially those targeting ad hoc networks of mobile nodes. The interested reader is referred to a number of recent surveys in this area [1,9,10].

The task of the MAC layer in the context of sensor networks is to use the radio, with its limited resources, as efficiently as possible to send and receive data generated by the upper layers in the protocol stack. It should take into account that data is often routed across multiple hops, and it should be able to handle large-scale networks with hundreds, or even thousands, of (mobile) nodes. To understand the design trade-offs involved, we will discuss the hardware characteristics of the prototype sensor nodes in use today, as well as common traffic patterns that have emerged in preliminary experiences with applications.
34.2.1 Hardware Characteristics

The current generation of sensor nodes, some of which are commercially available, is made up of off-the-shelf components mounted on a small printed circuit board. In the future, we expect single-chip solutions with some of the protocol layers implemented in hardware. At the moment, however, the MAC protocols are running on the main processor, which drives a separate chip that takes care of converting (modulating) bits to/from radio waves. The interface between the processor and the radio chip is at the level of exchanging individual bits or bytes. The advantage of this low-level interface is that the MAC designer has absolute control, which contrasts sharply with 802.11 WLAN equipment, where the MAC is usually included as part of the chipset on the PC card. Popular processors include the 8-bit Atmel ATmega128L CPU used on the Mica motes, the 16-bit Texas Instruments MSP430 used on the Eyes nodes, and the PIC-16 from Microchip. The exact specifications vary, but the processors typically run at a frequency in the 1–10 MHz range, and are equipped with 2–4 KB of RAM. The processing capabilities provide ample headroom to drive the radio, but the limited amount of storage space for local data puts a strong constraint on the memory footprint of the MAC protocol. Since the focus of sensor node development is on energy consumption and form factor, we do anticipate that future generations will still be quite limited in their processing and memory resources.

Table 34.1 provides details on the characteristics of two low-power radios employed in various state-of-the-art sensor nodes. For reference, the specifications of a typical 802.11 PC card are included. Several important observations can be made. First, the energy consumed when sending or receiving data is two to three orders of magnitude more than keeping the radio in a low-power standby state. Thus, the key to effective energy management will be in switching the radio off and on. Second, the time needed to switch from standby to active mode is considerable (518 µsec to 2.0 msec), and the time needed to switch the radio between transmit and receive mode is also nonnegligible. Therefore, the number of mode switches should be kept to a minimum. Finally, the WaveLAN card (including the MAC) outperforms the other
© 2006 by Taylor & Francis Group, LLC
34-6
Embedded Systems Handbook TABLE 34.1
Characteristics of Typical Radios in State-of-the-Art Sensor Nodes RFM TR 1001 [11]
CC1000 [12]
Lucent WaveLAN PC “Silver” card [13]
Operating frequency Modulation scheme Bit rate
868 MHz ASK 115.2 kbps
868 MHza FSK 76.8 kbps
2.4 GHz DSSS 11 Mbps
Energy consumption Transmit
12 mA (1.5 dBm)
284 mA
Receive Standby
3.8 mA 0.7 µA
8.6 mA (−20 dBm) 25.4 mA (5 dBm) 11.8 mA 30 µA
Switch times Standby-to-transmit Receive-to-transmit Standby-to-receive Transmit-to-receive Transmit-to-standby Receive-to-standby
16 µsec 12 µsec 518 µsecb 12 µsec 10 µsec 10 µsec
190 mA 10 mA
2.0 msec 270 µsec 2.0 msec 250 µsec
a The CC1000 radio supports any frequency in the 300 to 1000 MHz range; the quoted numbers are
for 868 MHz. b Time needed to fully initialize receive circuitry; a simple carrier sense can be performed in 30 µsec.
radios in terms of energy per bit (77 versus 312 µJ/bit); future nodes should include radios with higher frequencies and more complex modulation schemes.
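The energy-per-bit figures quoted above follow directly from the transmit current, supply voltage, and bit rate in Table 34.1. The small C program below reproduces them; the 3 V supply is an assumption (the same value is used by the simulator in Section 34.7).

    /* Energy per transmitted bit: E/bit = (I * V) / bitrate. */
    #include <stdio.h>

    static double nj_per_bit(double current_ma, double volts, double kbps) {
        double power_mw = current_ma * volts;   /* mA * V = mW        */
        return power_mw * 1000.0 / kbps;        /* mW / kbps = nJ/bit */
    }

    int main(void) {
        printf("RFM TR1001: %.0f nJ/bit\n", nj_per_bit(12.0, 3.0, 115.2));
        printf("WaveLAN:    %.0f nJ/bit\n", nj_per_bit(284.0, 3.0, 11000.0));
        return 0;
    }

Running this prints roughly 312 nJ/bit for the RFM TR1001 and 77 nJ/bit for the WaveLAN card, confirming that the high-speed radio is the more efficient one per bit despite its much larger current draw.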
34.2.2 Communication Patterns

In the rapidly emerging field of wireless sensor networks there is little experience with realistic, long-running applications, which is unfortunate since a good characterization of the workload (in terms of network traffic) is mandatory for designing a robust and efficient MAC protocol, or any other part of the network stack for that matter. It is, however, clear that the nature of the traffic for sensor networks has a few remarkable characteristics that set it apart from typical WLAN traffic. From the various proposed deployment scenarios, usually in the area of remote monitoring, and the limited data from preliminary studies, such as the Great Duck Island [14] and vehicle tracking system [15], it becomes clear that data rates are very low: typically in the order of 1–200 bytes per second, with message payload sizes around 20–25 bytes. Furthermore, two distinct communication patterns (named convergecast and local gossip in Reference 16) appear to be responsible for generating the majority of network traffic:

Convergecast. In many monitoring applications, information needs to be periodically transmitted to a sink node so it can be processed at a central location or simply stored in a database for future use. Since these individual reports are often quite small and need to travel across the whole network, the overhead is quite large. Aggregating messages along the spanning tree to the sink node therefore pays off. At the very least, two (or more) packets can be coalesced to share a common header. At the very best, two (or more) messages can be combined into one, for example, when reporting the maximum room temperature.

Local gossip. When a sensor node observes a physical event, so do its neighbors, since the node density in a sensor network is expected to be high. This allows a node to check with the nodes in its vicinity whether they observed the same event or not, and in the latter case to derive that its sensor is probably malfunctioning. If its neighbors do observe the same event (e.g., a moving target) they can collaborate to obtain a better estimate of the event (location and speed) and report that back to the sink. Besides improving the quality of the reported information, the collaboration also avoids n duplicate messages traveling all the way back to the sink. Depending on the situation, neighbors may be addressed individually (unicast) or collectively (broadcast). In any case, by sharing (gossiping) their sensor readings (rumors) nodes can reduce the likelihood of false positives, and efficiently report significant events.
The important implication of these two communication patterns is that traffic is not distributed evenly over the network. The amount of data varies both in space and in time. Nodes in the vicinity of the sink relay much more traffic than nodes at the edges of the network due to the convergecast pattern. The fluctuation in time is caused by the physical events triggering outbursts of local gossip. In the extreme case of a forest fire detection system, nodes may be dormant for years before finally reporting an event. MAC protocols should be able to handle these kinds of fluctuations.
34.2.3 Miscellaneous Services

Often the MAC layer is expected to provide some network-related services not directly associated with data transfer. Localization and time-synchronization algorithms often need precise information about the moment of the physical transmission of a packet to factor out any time spent by the MAC layer in contention resolution. The routing layer needs to be informed of any local changes in network topology, for example, it needs to know when mobile nodes move in and out of radio range. Since the MAC layer sits directly on top of the radio it can perform these services at no extra cost. Neighborhood discovery, for example, must be carried out to ensure the proper operation of TDMA-based MAC protocols. We will not consider these miscellaneous requirements in the remainder of this chapter, but concentrate on the MAC protocols' ability to transfer data as efficiently as possible.
34.3 Energy Efficiency

The biggest challenge for designers of sensor networks is to develop systems that will run unattended for years. This calls for robust hardware and software, but most of all for careful energy management, since energy is and will continue to be a limited resource. The current generation of sensor nodes is battery powered, so lifetime is a major constraint; future generations powered by ambient energy sources (sunlight, vibrations, etc.) will provide very low currents, so energy consumption is heavily constrained.

It is important to realize that the failure of individual nodes may not harm the overall functioning of a sensor network, since neighboring nodes can take over, provided that the node density is high enough (which can be guaranteed at rollout). Therefore, the key parameter to optimize is network lifetime, that is, the time until the network gets partitioned. The MAC layer operates on a local scale (all nodes within reach) and lacks the global information to optimize for network lifetime. This is therefore best accomplished at the upper layers of the protocol stack, in particular the routing and transport (data aggregation) layers, which do have a global overview. This works most effectively when the MAC layer ensures that the energy it spends is directly related to the amount of traffic that it handles. Thus, the MAC layer should optimize for energy efficiency.

In contrast to typical WLAN protocols, MAC protocols designed for sensor networks usually trade off performance (latency, throughput, fairness) for cost (energy efficiency, reduced algorithmic complexity). It is, however, not clear-cut what the best trade-off is, and various designs differ significantly, as will become apparent in Section 34.3.2 where we review the basic design choices made by 20 WSN-specific MAC protocols. Before that, we consider the major sources of overhead that render WLAN-style (contention-based) MAC protocols ineffective in the context of sensor networks.
34.3.1 Sources of Overhead

When running a contention-based MAC protocol on an ad hoc network with little traffic, much energy is wasted due to the following sources of overhead:

Idle listening. Since a node does not know when it will be the receiver of a message from one of its neighbors, it must keep its radio in receive mode at all times. This is the major source of overhead, since typical radios consume two orders of magnitude more energy in receive mode (even when no data is arriving) than in standby mode (cf. Table 34.1).

Collisions. If two nodes transmit at the same time and interfere with each other's transmission, packets are corrupted. Hence, the energy used during transmission and reception is wasted. The RTS/CTS handshake effectively resolves the collisions for unicast messages, but at the expense of protocol overhead.

Overhearing. Since the radio channel is a shared medium, a node may receive packets that are not destined for it; it would have been more efficient to have turned off its radio.

Protocol overhead. The MAC headers and control packets used for signaling (ACK/RTS/CTS) do not contain application data and are therefore considered overhead; these overheads can be significant since many applications only send a few bytes of data per message.

Traffic fluctuations. A sudden peak in activity raises the probability of a collision; hence, much time and energy are spent on waiting in the random backoff procedure. When the load approaches the channel capacity, the performance can collapse, with little or no traffic being delivered while the radio, sensing for a clear channel, is consuming a lot of energy.

Switching to a schedule-based protocol (i.e., TDMA) has the great advantage of avoiding all energy waste due to collisions, idle listening, and overhearing, since TDMA is inherently collision free and the schedule notifies each node when it should be active and, more importantly, when not. The price to be paid is in fixed costs (i.e., broadcasting traffic schedules) and reduced flexibility to handle traffic fluctuations and mobile nodes. The usual solution is to resort to some form of overprovisioning and choose a frame size that is large enough to handle peak loads. Dynamically adapting the frame size is another approach, but this greatly increases the complexity of the protocol and, hence, is considered to be an unattractive option for resource-limited sensor nodes. Table 34.2 compares the impact of the various sources of overhead on the performance and cost (energy efficiency) of contention-based and schedule-based MAC protocols.

TABLE 34.2 Impact of Overhead on Contention-Based Protocols (C) and Schedule-Based Protocols (S)

  Source                  Performance (latency,      Cost (energy
                          throughput, fairness)      efficiency)
  Collisions              C                          C
  Protocol overhead       C, S                       C, S
  Idle listening                                     C
  Overhearing                                        C
  Traffic fluctuations    C, S                       C, S
  Scalability/mobility    S                          S
34.3.2 Trade-Offs

Different MAC protocols make different choices regarding the performance–energy trade-off, and also between sources of overhead (e.g., signaling versus collisions). A survey of 20 medium access protocols specially designed for sensor networks, and hence optimized for energy efficiency, revealed that they can be classified according to three important design decisions:

1. The number (and nature) of the physical channels used.
2. The degree of organization (or independence) between nodes.
3. The way in which a node is notified of an incoming message.

Table 34.3 provides a comprehensive protocol classification based on these three issues. Given that the protocols are listed chronologically based on their publication date, we observe that there is no clear trend indicating that medium access for wireless sensor networks is converging toward a unique, best solution. On the contrary, new combinations are still being "invented," showing that additional information (from simulations and practical experience) is needed to decide on the best approach. Section 34.7 provides a simulation-based head-to-head comparison of four protocols representing very distinctive choices in the design space. We will not discuss all individual MAC protocols listed in Table 34.3 in detail, but rather review three fundamental design choices that MAC designers will encounter while crafting a protocol best matching their envisioned deployment scenario.

TABLE 34.3 Protocol Classification

  Protocol                  Published   Channels         Organization        Notification
  SMACS [17]                2000        FDMA             Frames              Schedule
  PACT [18]                 2001        Single           Frames              Schedule
  PicoRadio [19]            2001        CDMA + tone      Random              Wake-up
  STEM [20]                 2002        Data + control   Random              Wake-up
  Preamble sampling [21]    2002        Single           Random              Listening
  Arisha [22]               2002        Single           Frames              Schedule
  S-MAC [23]                2002        Single           Slots               Listening
  PCM [24]                  2002        Single           Random              Listening
  Low Power Listening [25]  2002        Single           Random              Listening
  Sift [26]                 2003        Single           Random              Listening
  EMACs [27]                2003        Single           Frames              Schedule (per node)
  T-MAC [28]                2003        Single           Slots               Listening
  TRAMA [29]                2003        Single           Frames              Schedule (per node)
  WiseMAC [30]              2003        Single           Random              Listening
  B-MAC [31]                2003        Single           Random              Listening
  BMA [32]                  2004        Single           Frames              Schedule
  Miller [33]               2004        Data + tone      Random              Wake-up + listening
  DMAC [34]                 2004        Single           Slots (per level)   Listening
  SS-TDMA [16]              2004        Single           Frames              Schedule
  LMAC [35]                 2004        Single           Frames              Listening

34.3.2.1 Use Multiple Channels, or Not?

[Design-space taxonomy: channels — single; double (data + tone, data + control); multiple (FDMA, CDMA).]
The first design choice that we discuss is whether or not the radio should be capable of dividing the available bandwidth into multiple channels. Two common techniques for doing so are Frequency-Division Multiple Access (FDMA) and Code-Division Multiple Access (CDMA). FDMA partitions the total bandwidth of the channel into a number of small frequency bands, called subcarriers, on which multiple nodes can transmit simultaneously without collision. CDMA, on the other hand, uses a single carrier in combination with a set of orthogonal codes. Data packets are XOR-ed with a specific code by the sender before transmission, and then XOR-ed again by the receiver with the same code to retrieve the original data. Receivers using another code perceive the transmission as (pseudo) random noise. This allows the simultaneous and collision-free transmission of multiple messages.

The absence of collisions in a multiple-channel system is attractive, hence its popularity in early proposals, such as SMACS (FDMA) and PicoRadio (CDMA). It requires, however, a rather complicated radio consuming considerable amounts of energy, so most MAC protocols are designed for a simple radio providing just a single channel. An interesting alternative is to use a second, extremely low-power radio for signaling an intended receiver to wake up and turn on its primary radio to receive a data packet. In the simplest, most energy-efficient case, the second radio is only capable of emitting a fixed "tone" waking up all neighboring nodes (including the intended receiver). Miller and Vaidya [33] discuss several policies to minimize the number of false wake-ups by overhearing nodes. STEM uses a full-blown second radio to control exactly which node responds on the primary channel.
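To make the XOR-based encoding concrete, the following sketch spreads a single bit with a short chip code. The 8-chip codes and function names are purely illustrative; real CDMA radios use much longer, carefully constructed orthogonal code sets and analog correlation.

    #include <stdio.h>
    #include <stdint.h>

    #define CODE_LEN 8

    /* Spread one data bit into CODE_LEN chips by XOR-ing it with the code. */
    static void spread(int bit, const uint8_t code[], uint8_t chips[]) {
        for (int i = 0; i < CODE_LEN; i++)
            chips[i] = (uint8_t)(bit ^ code[i]);
    }

    /* Despread: XOR with the same code and take a majority vote. With the
     * right code all chips agree; with a mismatched code roughly half do,
     * i.e., the transmission looks like (pseudo) random noise. */
    static int despread(const uint8_t chips[], const uint8_t code[]) {
        int ones = 0;
        for (int i = 0; i < CODE_LEN; i++)
            ones += chips[i] ^ code[i];
        return ones > CODE_LEN / 2;
    }

    int main(void) {
        const uint8_t code_a[CODE_LEN] = {0,1,0,1,1,0,1,0}; /* sender's code  */
        const uint8_t code_b[CODE_LEN] = {1,1,0,0,1,1,0,0}; /* unrelated code */
        uint8_t chips[CODE_LEN];

        spread(1, code_a, chips);
        printf("matching code recovers:   %d\n", despread(chips, code_a));
        printf("mismatched code decodes:  %d\n", despread(chips, code_b));
        return 0;
    }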
34.3.2.2 Get Organized, or Not?
[Design-space taxonomy: organization — random, slots, frames.]
The second design choice that we discuss is whether, and how much, the nodes in the network should be organized to act together at the MAC layer. The CSMA and TDMA protocols discussed before represent the two extremes in the degree of organization: from completely random to frame-based access. The advantages of contention-based protocols (random access) are the low implementation complexity, the ad hoc nature, and the flexibility to accommodate mobile nodes and traffic fluctuations. The major advantage of frame-based TDMA protocols is the inherent energy efficiency due to the lack of collisions, overhearing, and idle-listening overheads. Since the advantages of random access are the drawbacks of frame-based access, and vice versa, some MAC protocols have chosen to strike a middle ground between these two extremes and organize the sensor nodes in a slotted system (much like slotted ALOHA). The Sensor-MAC (S-MAC) protocol was the first to propose that nodes agree on a common slot structure, allowing them to implement an efficient duty-cycle regime; nodes are awake in the first part of each slot and go to sleep in the second part, which significantly reduces the energy waste due to idle-listening. The protocol classification in Table 34.3 shows that the research community is divided on what degree of organization to apply: we find nine contention-based, three slotted, and eight TDMA-based protocols. Since we view the organizational design decision as the most critical, we will detail the main protocols from each class in Sections 34.4 to 34.6.

34.3.2.3 Get Notified, or Not?
[Design-space taxonomy: notification — listening, wake-up, schedule.]
The third and final design issue is how the intended receiver of a message transfer gets notified. In schedule-based protocols, the actual data transfers are scheduled ahead of time, so receiving nodes know exactly when to turn on the radio. Such knowledge is not available in contention-based protocols, so receiving nodes must be prepared to handle an incoming transfer at any moment. Without further assistance from the sender, the receiver has no other option than to listen continuously. To eliminate the resulting idle-listening overhead completely, senders may actively send a wake-up signal (tone) over a second, very low-power radio. Although the wake-up model matches well with the low packet rates of sensor network applications, all contention-based protocols except PicoRadio, STEM, and Miller's proposal are designed for nodes with a single radio. The general approach to reduce the inherent idle listening in these nodes is to enforce some kind of duty cycle by periodically switching the radio on for a short time. This can be arranged individually per node (Low-Power Listening [LPL] and preamble sampling, Section 34.4.2) or collectively per slot (S-MAC, Section 34.5.1). An alternative is to circumvent the idle-listening problem, as the Sift protocol does, by restricting the network to a cellular topology where access points collect data from nearby sensor nodes.

We would like to point out that the choice for a particular notification policy depends largely on the available hardware channels and the organizational model discussed before. Schedule-based notification matches with TDMA frames; wake-up is only possible on dual-channel nodes. The Lightweight Medium ACcess (LMAC) protocol (Section 34.6.1), however, is the exception to the rule and combines TDMA frames with listening, striking a different balance between flexibility and energy efficiency.
34.4 Contention-Based Protocols

We now proceed with describing in detail some of the medium access protocols developed for sensor networks according to their particular choice of organizational model (see Table 34.3). In this section, we review contention-based protocols in which nodes can start a transmission at any random moment and must contend for the channel. The main challenge with contention-based protocols is to reduce the energy consumption caused by collisions, overhearing, and idle listening. CSMA/CA protocols effectively deal with collisions and can be easily adapted to avoid a lot of overhearing overhead (i.e., switch off the radio for the duration of another transmission's sequence). We also discuss the familiar IEEE 802.11 protocol, even though it was not developed specifically for sensor networks. It does, however, form the basis of the energy-efficient derivatives discussed in this section (LPL and WiseMAC), as well as the slotted protocols (S-MAC and Timeout-MAC [T-MAC]) discussed in the next section.
34.4.1 IEEE 802.11

The MAC in the IEEE 802.11 standard [7] is based on carrier sensing (CSMA) and collision detection (through acknowledgments). A node wanting to transmit a packet must first test the radio channel to check if it is free for a specified time called the Distributed Inter Frame Space (DIFS). If so, a DATA packet¹ is transmitted, and the receiver waits a Short Inter Frame Space (SIFS) before acknowledging the reception of the data by sending an ACK packet. Since the SIFS interval is set shorter than the DIFS interval, the receiver takes precedence over any other node attempting to send a packet. If the sender does not receive the acknowledgment, it assumes that the data was lost due to a collision at the receiver and enters a binary exponential backoff procedure. At each retransmission attempt, the length of the contention window (CW) is doubled. Since contending nodes randomly select a time from their CW, the probability of a subsequent collision is reduced by half. To bound access latency somewhat, the CW is no longer doubled once a certain maximum (CWmax) has been reached.

To account for the hidden terminal problem in ad hoc networks, the 802.11 standard defines a virtual carrier sense mechanism based on the collision avoidance handshake of the MACA protocol. The RTS/CTS control packets include a time field in their header that specifies the duration of the upcoming DATA/ACK sequence. This allows neighboring nodes overhearing the control packets to set their network allocation vector (NAV) and defer transmission until it expires (see Figure 34.5). To save energy, the radio can be switched off for the duration of the NAV. Thus CSMA/CA effectively eliminates collisions and overhearing overhead for unicast packets. Broadcast and multicast packets are always transmitted without an RTS/CTS reservation sequence (and without an ACK), so they are susceptible to collisions.
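As an illustration of the backoff rule, the sketch below draws a random wait from a contention window that doubles on every retransmission attempt. The CWmin/CWmax slot counts are placeholders; the actual values in the 802.11 standard depend on the PHY in use.

    #include <stdio.h>
    #include <stdlib.h>

    #define CW_MIN 31      /* placeholder; PHY-dependent in the standard */
    #define CW_MAX 1023

    /* Size of the contention window after a number of failed attempts:
     * doubled on every retry, capped at CW_MAX to bound access latency. */
    static int contention_window(int retries) {
        int cw = CW_MIN;
        while (retries-- > 0 && cw < CW_MAX)
            cw = 2 * cw + 1;            /* 31, 63, 127, ..., 1023 */
        return cw;
    }

    /* Random backoff: a uniform draw from [0, cw] slot times. */
    static int backoff_slots(int retries) {
        return rand() % (contention_window(retries) + 1);
    }

    int main(void) {
        for (int r = 0; r < 6; r++)
            printf("retry %d: cw = %4d, wait %d slots\n",
                   r, contention_window(r), backoff_slots(r));
        return 0;
    }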
FIGURE 34.5 IEEE 802.11 access control.

¹The 802.11 standard defines the transmission protocol in terms of frames, but we use the term packet instead to avoid confusion with the framing structure of TDMA protocols.

34.4.2 LPL and Preamble Sampling

FIGURE 34.6 LPL: a long preamble allows periodic sampling at the receiver.

The major disadvantage of CSMA/CA is the energy wasted by idle-listening. Both Hill and Culler [25] and El-Hoiydi [21] independently developed a low-level carrier sense technique that effectively duty cycles
the radio, that is, turns it off repeatedly, without losing any incoming data. This technique operates at the physical layer and concerns the layout of the PHY header prepended to each radio packet. This header starts off with a preamble that is used to notify receivers of the upcoming transfer and allows them to adjust (train) their circuitry to the current channel conditions; next follows the start symbol, signaling the true beginning of the data transfer. The basic idea behind the efficient carrier-sense technique is to shift the cost from the receiver (the frequent case) to the transmitter (the rarer case) by increasing the length of the preamble. This allows the receiver to periodically turn on the radio to sample for incoming data, and detect if a preamble is present or not. If it detects a preamble, it will continue listening until the start symbol arrives and the message can be properly received (see Figure 34.6). If no preamble is detected, the radio is turned off again until the next sample.

This efficient carrier-sense method can be applied to any contention-based MAC protocol. El-Hoiydi combined it with ALOHA and named it preamble sampling [21]. Hill and Culler combined it with CSMA and named it Low-Power Listening [25]. Neither implementation includes collision avoidance, to save on protocol overhead. The energy savings depend on the duty cycle, which in turn depends on the switching times of the radio. LPL, for example, was implemented as part of TinyOS running on Mica motes equipped with an RFM 1000 radio capable of performing a carrier sense in just 30 µsec (cf. Table 34.1). The carrier is sensed every 300 µsec, yielding a duty cycle of 10%, effectively reducing the idle-listening overhead by a factor of ten. The energy savings come at a slight increase in latency (the length of the preamble is doubled to 647 µsec), and a minor reduction in throughput. In the recently proposed B-MAC implementation (part of TinyOS 1.1.3) the preamble length is provided as a parameter to the upper layers, so they can select the optimal trade-off between energy savings and performance [31].
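A minimal sketch of the receiver side of this scheme is shown below, assuming hypothetical radio primitives (radio_on, carrier_sensed, receive_packet, sleep_us). The 30/300 µsec timings are the ones quoted above and yield the 10% duty cycle.

    #include <stdbool.h>

    #define SAMPLE_PERIOD_US 300   /* time between channel samples  */
    #define SENSE_TIME_US     30   /* duration of one carrier sense */

    /* Hypothetical radio/OS primitives. */
    extern void radio_on(void);
    extern void radio_off(void);
    extern bool carrier_sensed(int usec);  /* true if energy detected          */
    extern void receive_packet(void);      /* wait for start symbol, read data */
    extern void sleep_us(int usec);

    /* Receiver loop: a 30 usec sample every 300 usec gives a 10% duty
     * cycle; the sender's stretched preamble guarantees that an ongoing
     * transmission cannot fall between two consecutive samples. */
    void lpl_listen_loop(void) {
        for (;;) {
            radio_on();
            if (carrier_sensed(SENSE_TIME_US))
                receive_packet();          /* preamble present: stay on */
            radio_off();
            sleep_us(SAMPLE_PERIOD_US - SENSE_TIME_US);
        }
    }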
34.4.3 WiseMAC

El-Hoiydi has refined his preamble sampling one step further by realizing that long preambles are not necessary when the sender knows the sampling schedule of the intended receiver. The sender can then simply wait until the moment the receiver is about to sample the channel, and send a packet with an ordinary preamble. This not only saves energy at the sender, who waits instead of emitting an extended preamble, but also at the receiver, since the time until the start symbol occurs is reduced considerably. In WiseMAC [30] nodes maintain the schedule offsets of their neighbors through information piggybacked on the ACKnowledgments of the underlying CSMA protocol. Whenever a node needs to send a message to a specific neighbor n, it uses n's offset to determine when to start transmitting the preamble; to account for any clock drift, the preamble is extended with a time proportional to the length of the interval since the last message exchange. The overall effect of these measures is that WiseMAC adapts automatically to traffic fluctuations. Under low load, WiseMAC uses long preambles and consumes low power (receiver costs dominate); under high loads, WiseMAC uses short preambles and operates energy efficiently (overheads are minimized). Finally, note that WiseMAC's preamble length optimization is not very effective for broadcast messages, since the preamble must span the sampling points of all neighbors and account for drift, so it is quite often stretched to full length.
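The drift-proportional preamble extension can be sketched as follows. The 4·θ·L bound, the 30 ppm crystal tolerance, and the sampling period are our assumptions for illustration; the WiseMAC paper [30] derives the precise expression.

    #include <stdint.h>

    #define THETA_PPM        30         /* assumed crystal tolerance      */
    #define SAMPLE_PERIOD_US 100000UL   /* assumed receiver sample period */

    /* Preamble needed to hit a neighbor's sample point that was learned
     * us_since_sync microseconds ago. Both clocks may drift by theta in
     * opposite directions, and the sender must start early and end late
     * to cover the full uncertainty window, hence the factor 4. Never
     * longer than one sample period, beyond which plain preamble
     * sampling is cheaper. */
    static uint32_t wisemac_preamble_us(uint32_t us_since_sync) {
        uint64_t len = (4ULL * THETA_PPM * us_since_sync) / 1000000ULL;
        if (len > SAMPLE_PERIOD_US)
            len = SAMPLE_PERIOD_US;
        return (uint32_t)len;
    }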
34.5 Slotted Protocols

The three slotted protocols (S-MAC, T-MAC, and Data-gathering MAC [DMAC]) listed in Table 34.3 are all derived from classical contention-based protocols. They address the inherent idle-listening overhead by synchronizing the nodes, and implementing a duty cycle within each slot. At the beginning of a slot, all nodes wake up and any node wishing to transmit a message must contend for the channel. This synchronized behavior increases the probability of collision in comparison to the random organization of the energy-efficient CSMA protocols discussed in the previous section. To mitigate the increased collision overheads, S-MAC and T-MAC include an RTS/CTS handshake, but DMAC does without to save on protocol overhead. The three slotted protocols also differ in the way they decide when and how to switch back from active to sleep mode, as will become apparent in the following discussions.

FIGURE 34.7 Slot structure of S-MAC with built-in duty cycle.
34.5.1 Sensor-MAC

The S-MAC protocol developed by Ye et al. [23] introduces a technique called virtual clustering to allow nodes to synchronize on a common slot² structure (Figure 34.7). To this end nodes regularly broadcast SYNC packets at the beginning of a slot, so other nodes receiving these packets can adjust their clocks to compensate for drift. The SYNC packets also allow new (mobile) nodes to join the ad hoc network. In principle, the whole network runs the same schedule, but due to mobility and bootstrapping a network may comprise several virtual clusters. For the details of the synchronization procedure that resolves the rare occasion of two clusters meeting each other, please refer to Reference 36.

An S-MAC slot starts off with a small synchronization phase, followed by a fixed-length active period, and ends with a sleep period in which nodes turn off their radio. Slots are rather large, typically in the order of 500 msec to 1 sec. The energy savings of S-MAC's built-in duty cycle are under control of the application: the active part is fixed³ to 300 msec, while the slot length can be set to any value. Besides addressing the idle-listening overhead, S-MAC includes collision avoidance (RTS/CTS handshake) and overhearing avoidance. Finally, S-MAC includes message passing support to reduce protocol overhead when streaming a sequence of message fragments.

The application's explicit control over the idle-listening overhead is a mixed blessing. On the one hand, the application is in control of the energy-performance trade-off, which is good. On the other hand, the duty cycle must be decided upon before starting S-MAC, which is bad since the optimal setting depends on many factors, including the expected occurrence rate of events observed after the deployment of the nodes, and may even change over time.

²The S-MAC protocol is defined in terms of frames, but we use the term slot instead to avoid confusion with the framing structure of TDMA protocols.
³A recent enhancement of S-MAC, which is called adaptive listening, includes a variable-length active part to reduce multihop latency [36]. Since the timeout policy of the T-MAC protocol behaves similarly and was designed to handle traffic fluctuations as well, we do not discuss adaptive listening further.
34.5.2 Timeout-MAC

The T-MAC protocol by van Dam and Langendoen [28] introduces an adaptive duty cycle to improve on S-MAC in two respects. First, T-MAC frees the application from the burden of selecting an appropriate duty cycle. Second, T-MAC automatically adapts to the traffic fluctuations inherent to the local gossip and convergecast patterns, while S-MAC's slot length must be chosen conservatively to handle worst-case traffic. T-MAC borrows the virtual clustering method of S-MAC to synchronize nodes. In contrast to S-MAC, it operates with fixed-length slots (615 msec) and uses a timeout mechanism to dynamically determine the end of the active period. The timeout value (15 msec) is set to span a small contention period and an RTS/CTS exchange. If a node does not detect any activity (an incoming message or a collision) within the timeout interval, it can safely assume that no neighbor wants to communicate with it and goes to sleep. On the other hand, if the node engages in or overhears a communication, it simply starts a new timeout after that communication finishes. To save energy, a node turns off its radio while waiting for other communications to finish (overhearing avoidance).

The adaptive duty cycle allows T-MAC to automatically adjust to fluctuations in network traffic. The downside of T-MAC's rather aggressive power-down policy, however, is that nodes often go to sleep too early: when a node s wants to send a message to r, but loses contention to a third node n that is not a common neighbor, s must remain silent and r goes to sleep. After n's transmission finishes, s will send out an RTS to the sleeping r and receive no matching CTS; hence, s must wait until the next frame to try again. T-MAC includes two measures to alleviate this so-called early-sleeping problem (for details refer to Reference 28), but the results in Section 34.7 show that it strongly favors energy savings over performance (latency/throughput).
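The timeout rule itself fits in a few lines. The hook names and the millisecond clock below are our own scaffolding; TA is the 15 msec value used in this chapter's T-MAC setup.

    #define TA_MS 15   /* spans a contention period plus an RTS/CTS exchange */

    /* Hypothetical runtime hooks. */
    extern unsigned long now_ms(void);
    extern void sleep_until_next_slot(void);

    static unsigned long deadline_ms;

    /* Called at the start of the active period and after every observed
     * activity: a sent or received packet, an overheard packet, or even
     * a collision. */
    void tmac_renew_timeout(void) {
        deadline_ms = now_ms() + TA_MS;
    }

    /* Polled from the MAC main loop; TA msec of silence means no neighbor
     * wants to communicate with us, so the active period can end. */
    void tmac_poll_timeout(void) {
        if (now_ms() >= deadline_ms)
            sleep_until_next_slot();
    }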
34.5.3 Data-Gathering MAC

The DMAC protocol by Lu et al. [34] is the third slotted protocol that we discuss. For energy efficiency and ease of use, DMAC includes an adaptive duty cycle like T-MAC. In addition, it provides low node-to-sink latency, which is achieved by supporting one communication paradigm only: convergecast. DMAC divides time into rather short slots (around 10 msec) and runs CSMA (with acknowledgments) within each slot to send or receive at most one message. Each node repeatedly executes a basic sequence of one receive, one send, and n sleep slots. At setup DMAC ensures that the sequences are staggered to match the structure of the convergecast tree rooted at the sink node (see Figure 34.8). This arrangement allows a single message from a node at depth d in the tree to arrive at the sink with a latency of just d slot times, which is typically in the order of tens of milliseconds.

DMAC includes an overflow mechanism to handle multiple messages in the tree. In essence, a node will stay awake for one more slot after relaying a message, so in the case of two children contending for their parent's receive slot, the one losing will get a second chance. To account for interference, the overflow slot is not scheduled back to back with the send slot; instead, receive slots are scheduled five slots apart. The overflow policy automatically takes care of adapting to the traffic load, much like T-MAC's extension of the active period. The results reported in Reference 34 show that DMAC outperforms S-MAC in terms of latency (due to the staggered schedules), throughput, and energy efficiency (due to the adaptivity). It remains to be seen whether DMAC can be enhanced to support communication patterns other than convergecast equally well.

FIGURE 34.8 Convergecast tree with matching, staggered DMAC slots.
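The staggering amounts to a simple offset computation. The sketch below is our formulation of the idea, with the slot length and the depth bookkeeping as assumptions.

    #define SLOT_MS 10u   /* DMAC uses short slots, around 10 msec */

    /* A node at depth d skews its receive slot d slots ahead of the sink's,
     * so its send slot (the next slot) coincides with its parent's receive
     * slot and a packet climbs one level of the tree per slot. */
    unsigned dmac_recv_lead_ms(unsigned depth) {
        return depth * SLOT_MS;        /* wake-up lead relative to the sink */
    }

    /* Node-to-sink latency of a single message sent from the given depth. */
    unsigned dmac_latency_ms(unsigned depth) {
        return depth * SLOT_MS;        /* one slot time per hop */
    }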
34.6 TDMA-Based Protocols

The major attractions of a schedule-based MAC protocol are that it is inherently collision free and that idle listening can be ruled out since nodes know beforehand when to expect incoming data. The challenge is to adapt TDMA-based protocols to operate efficiently in ad hoc sensor networks without any infrastructure (i.e., access points). We will now briefly discuss the different approaches taken by the frame-based protocols listed in Table 34.3:

Sink-based scheduling. The approach taken by Arisha et al. [22] is to partition the network into large clusters, in which multihop traffic is possible. The traffic within each cluster is scheduled by a sink node that is connected to the wired backbone network, and hence equipped with increased resources. The goal is to optimize network lifetime, and the sink therefore takes the energy levels of each node into account when deciding (scheduling) which nodes will sense, which nodes will relay, and which nodes may sleep. The TDMA schedule is periodically refreshed to adapt to changes. It is required that all nodes can directly communicate with the sink node (at maximum transmit power), which clearly limits the scalability. Furthermore, the TDMA frame is of fixed length, so the maximum number of nodes must be known before deployment.

Static scheduling. The Self-Stabilizing TDMA (SS-TDMA) protocol by Kulkarni and Arumugam [16] uses a fixed schedule throughout the lifetime of the network, which completely removes the need for a centralized (or distributed) scheduler. SS-TDMA operates on regular topologies, such as square and hexagonal grids, and synchronizes traffic network-wide in rounds: all even rows transmit a north-bound message, all odd rows transmit a south-bound message, and so on. The authors show that such static schedules can result in acceptable performance for typical communication patterns (broadcast, convergecast, and local gossip), but the constraints on the location of the nodes render the protocol impractical in many deployment scenarios.

Rotating duties. When the node density is high, the costs of serving as an access point may be amortized over multiple nodes by rotating duties among them. The PACT [18] protocol uses passive clustering to organize the network into a number of clusters connected by gateway nodes; the rotation of the cluster heads and gateways is based on information piggybacked on the control messages exchanged during the traffic control phase of the TDMA schedule. The BMA protocol [32] uses the LEACH approach [37] to manage cluster formation and rotation. At the start of a TDMA frame, each node broadcasts one bit of information to its cluster head stating whether or not the node has data to send. Based on this information, the cluster head determines the number of data slots needed, computes the slot assignment, and broadcasts that to all nodes under its control. Note that the bit-level traffic announcements require very tight time synchronization between the nodes in the cluster.

Partitioned scheduling. In the EMACs protocol by van Hoesel et al. [27] the scheduling duties are partitioned according to slot number. Each slot serves as a mini-TDMA frame and consists of a contention phase, a traffic control section, and a data section. An active node that owns a slot always transmits in its own slot. Therefore, a node n must listen to the traffic control sections of all its neighbors, since n may be the intended receiver of any of them. The contention phase is included to serve passive nodes that do not own a slot; the idea being that only some nodes need to be active to form a backbone network ready to be used by passive nodes when they detect an event. In many scenarios, events occur rarely, so the energy spent in listening for requests forms a major source of overhead. The LMAC protocol by the same authors therefore simply does without a contention interval. This improved protocol is discussed in detail below. In comparison to other TDMA-based protocols, both EMACs and LMAC have the advantage of supporting node mobility, which significantly increases their scope of deployment. The results in Section 34.7 show that, performance-wise, partitioned scheduling is also an attractive option.

Replicated scheduling. The approach taken by Rajendran et al. [29] in the TRAMA protocol is to replicate the scheduling process over all nodes within the network. Nodes regularly broadcast information about (long-running) traffic flows routed through them and the identities of their one-hop neighbors. This results in each node being informed about the demands of its one-hop neighbors and the identity of its two-hop neighbors. This information is sufficient to determine a collision-free slot assignment by means of a distributed hash function that computes the winner (i.e., sender) of each slot based on the node identities and slot number. During execution the schedule may be adapted to match actual traffic conditions; nodes with little traffic may release their slot for the remainder of the frame for use by other (overloaded) nodes. Although TRAMA achieves high channel utilization, it does so at the expense of considerable latency and high algorithmic complexity.
From the discussion above, it becomes apparent that extending TDMA to ad hoc networks is rather complicated and requires major compromises on deployment scenario (SS-TDMA and Arisha's protocol), algorithmic complexity (TRAMA), flexibility/adaptivity (EMACs and LMAC), and latency (all protocols). Although TDMA is inherently free of collision and idle-listening overheads, PACT and BMA rely on the higher layers to amortize the overheads of the TDMA scheduler over rotating cluster heads. Note that the partitioned and replicated scheduling approaches are most similar to contention-based and slotted protocols in the sense that nodes operate autonomously, making them easy to install and operate, and robust to node failures. The algorithmic complexity of the TRAMA protocol (replication) is beyond the scope of this chapter, so we will only detail the LMAC protocol (partitioning).
34.6.1 Lightweight Medium Access

With the LMAC protocol [35], nodes organize time into slots, grouped into fixed-length frames. A slot consists of a traffic control section (12 bytes) and a fixed-length data section. The scheduling discipline is extremely simple: each active node is in control of a slot. When a node wants to send a packet, it waits until its time-slot comes around, broadcasts a message header in the control section detailing the destination and length, and then immediately proceeds with transmitting the data. Nodes listening to the control header turn off their radio during the data part if they are not an intended receiver of the broadcast or unicast message. In contrast to all other MAC protocols, the receiver of a unicast message does not acknowledge the correct reception of the data; LMAC leaves the issue of reliability to the upper layers.

The LMAC protocol ensures collision-free transmission by having nodes select a slot number that is not in use within a two-hop neighborhood (much like frequency reuse in cellular communication networks). To this end, the information broadcast in the control section includes a bit set detailing which slots are occupied by the one-hop neighbors of the sending node (i.e., the slot owner). New nodes joining the network listen for a complete frame to all traffic control sections. By OR-ing the occupancy bit sets, they can determine which slots are still free (Figure 34.9). A new node randomly selects a free slot and claims it by transmitting control information in that slot. Collisions in slot selection result in garbled control sections. A node observing such a collision broadcasts the involved slot number in its control section, which will be overheard by the unfortunate new nodes, who will then back off and repeat the selection process.

The drawback of LMAC's contention-based slot-selection mechanism is that nodes must always listen to the control sections of all slots in a frame — even the unused ones — since other nodes may join the network at arbitrary moments. The resulting idle-listening overhead is minimized by taking one sample of the carrier in an unused slot to sense any activity (cf. preamble sampling in Section 34.4.2). If there was activity, the slot is included in the occupancy bit set and listened to completely in the next frame. The end result is that LMAC combines a frame-based organization with notification by listening.
FIGURE 34.9 Slot-selection by LMAC. Nodes are marked with slot number and occupancy bit set; the new node ORs the bit sets of its neighbors to find the slots that are still free.
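In code, the slot-selection step reduces to a bitwise OR followed by a random pick among the free slots. The 32-bit mask mirrors LMAC's maximum of 32 slots per frame; the function names are our own.

    #include <stdint.h>
    #include <stdlib.h>

    #define SLOTS_PER_FRAME 32

    /* Union of the occupancy bit sets overheard from all one-hop neighbors;
     * bit i set means slot i is taken somewhere in the two-hop neighborhood. */
    uint32_t lmac_occupied(const uint32_t neighbor_sets[], int n) {
        uint32_t occupied = 0;
        for (int i = 0; i < n; i++)
            occupied |= neighbor_sets[i];
        return occupied;
    }

    /* Pick a random free slot, or -1 if the frame is fully occupied. Two
     * joiners picking the same slot garble each other's control section;
     * a neighbor reports the collision and both retry. */
    int lmac_pick_slot(uint32_t occupied) {
        int free_slots[SLOTS_PER_FRAME], nfree = 0;
        for (int s = 0; s < SLOTS_PER_FRAME; s++)
            if (!(occupied & (1u << s)))
                free_slots[nfree++] = s;
        return nfree > 0 ? free_slots[rand() % nfree] : -1;
    }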
34.7 Comparison

In the previous sections we reviewed 20 energy-efficient MAC protocols especially developed for sensor networks. We discussed the qualitative merits of the different organizations: contention-based, slotted, and TDMA-based protocols. When available, we reported quantitative results published by the designers of the protocol at hand. Unfortunately, results from different publications are difficult to compare due to the lack of a "standard" benchmark, making it hard to draw any final conclusions. This section addresses the need for a quantitative comparison by presenting the results of a study into the performance and energy efficiency of four MAC protocols (LPL, S-MAC, T-MAC, and LMAC) on top of a common simulation platform. For reference we also report on the classic IEEE 802.11 protocol (in ad hoc mode). The workload used to evaluate the protocols ranges from standard micro-benchmarks (latency and throughput tests) to communication patterns specific to sensor networks (local gossip and convergecast).
34.7.1 Simulation Framework

The discrete-event simulator developed at Delft University of Technology includes a detailed model of the popular RFM TR1001 low-power radio (discussed in Section 34.2.1), taking turnaround and wake-up times (12 and 518 µsec, respectively) into account. Energy consumption is based on the amount of energy the radio uses; we do not take into account the protocol processing costs on the CPU driving the radio. The simulator records the amount of time spent in the various states (standby, transmit, and receive/idle); transitions between states are modeled as time spent in the most energy-consuming state. At the end of a run the simulator computes the average energy consumed for each node in the network using the current drawn by the radio in each state (Table 34.1) and an input voltage of 3 V.
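This bookkeeping boils down to a time-in-state tally. The sketch below is our reconstruction of it, with the state set and the RFM TR1001 currents taken from Table 34.1 and the function names invented for illustration.

    enum radio_state { STANDBY, TRANSMIT, RECEIVE_IDLE, NUM_STATES };

    /* Current draw per state for the RFM TR1001 (Table 34.1), in mA. */
    static const double current_ma[NUM_STATES] = {
        0.0007,   /* standby: 0.7 uA */
        12.0,     /* transmit        */
        3.8,      /* receive/idle    */
    };

    static double time_in_state_s[NUM_STATES];

    /* Called by the radio model; a state transition is charged as time
     * spent in the most energy-consuming of the two states involved. */
    void account(enum radio_state s, double seconds) {
        time_in_state_s[s] += seconds;
    }

    /* Total energy in millijoule: mA x V x s = mJ, with a 3 V supply. */
    double consumed_mj(void) {
        double mj = 0.0;
        for (int s = 0; s < NUM_STATES; s++)
            mj += current_ma[s] * 3.0 * time_in_state_s[s];
        return mj;
    }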
The five MAC protocols under study are implemented as a class hierarchy on top of the physical layer, which is a thin layer encapsulating the RFM radio model. The physical layer takes care of low-level synchronization (preambles, start/stop bits) and proper channel coding. We now briefly discuss the implementation details of the five MAC protocols:

802.11. The IEEE 802.11 (CSMA/CA) protocol was implemented using an 8 byte header encoding the message type (RTS/CTS/DATA/ACK), source and destination ID (2 bytes each), sequence number, data length, and CRC. The payload of the DATA packet can be up to 250 bytes. The sequence number serves for detecting duplicate packets; retransmissions are triggered upon detection of a missing CTS or ACK packet.

LPL. The LPL protocol (CSMA with acknowledgments) was implemented with the DATA and ACK packets from the 802.11 implementation. LPL was set to sample the radio with a 10% duty cycle: 30 µsec carrier sense, 300 µsec sample period. The preamble was stretched with one sample period to 647 µsec. Since hidden nodes make CSMA susceptible to collisions, LPL's initial contend time is set somewhat larger than for 802.11 (9.15 versus 3.05 msec).

S-MAC. The implementation of the S-MAC protocol extends the 802.11 model with SYNC packets (8 byte header + 2 byte timestamp) to divide time into slots of 610 msec (20,000 ticks of a 32 kHz crystal). Like LPL, S-MAC is set to operate with a 10% duty cycle; hence, the active period is set to 61 msec. This differs from the original implementation, to account for the different radio bitrate in our simulator and to bring the frame length in line with T-MAC. Since traffic is grouped into bursts in the active period, S-MAC deviates from the 802.11 backoff scheme and uses a fixed contend time of 9.15 msec. To reduce idle-listening overhead we chose to remove the synchronization section from the original S-MAC protocol; SYNC packets are transmitted in the active period of a slot. To reduce interference with other packets, a node transmits a SYNC packet only once every 90 sec on average. In our grid topology with eight neighbors within radio range, that amounts to receiving a SYNC message every 11 sec.

T-MAC. The implementation of the T-MAC protocol enhances the S-MAC model with a variable-length active period controlled by a 15 msec timeout value, which is set to span the contention period (9.15 msec), an RTS (1.83 msec), the radio turnover period (12 µsec), and the start of a CTS. This timeout value causes T-MAC to operate with a 2.5% duty cycle in an empty network. In a loaded network the duty cycle will increase as the active period is adaptively extended. All options for mitigating the early-sleeping problem are included; see Reference 28 for details.

LMAC. The LMAC protocol was implemented from scratch on top of the physical layer. It was set to operate with the maximum of 32 slots per frame to ensure that all nodes within a two-hop neighborhood can own a slot for typical node densities (up to ten neighbors). The slot size was set to 76 bytes (12 byte header + 63 byte data section + 1 byte CRC) to support a reasonable range of application-dependent message sizes. We short-circuited LMAC's collision-based registration procedure by randomly selecting a slot number for each node at the start of a simulation run. A node listens to the 12 byte control sections of all slots owned by its one-hop neighbors; it polls the other slots in the frame with the short, 30 µsec carrier sense function to detect new nodes joining the network (which never happens during the experiments).

For convenience Table 34.4 lists the key parameters of the MAC protocols used in our comparison.

TABLE 34.4 Implementation Details of the Simulator

  PHYsical layer
    Channel coding        8-to-16 bit coding
    Effective bit rate    46 kbps
    Prelude               433 µsec (347 µsec preamble + startbyte)
    Carrier sense         30 µsec
  802.11 [extends PHY]
    Control packets       8 bytes
    DATA packets          8 byte header and 0–250 byte payload
    Contend time          3.05–305 msec
  LPL [extends 802.11]
    Sample period         300 µsec
    Contend time          9.15–305 msec
  S-MAC [extends 802.11]
    SYNC packets          10 bytes
    Slot time             610 msec
    Active period         61 msec
    Contend time          9.15 msec
  T-MAC [extends S-MAC]
    Activity timeout      15 msec
  LMAC [extends PHY]
    Slot time             14.3 msec (76 bytes)
    Frame time            456 msec (32 slots)

Note that the LMAC implementation includes a certain overprovisioning, since the experiments involve just 24 two-hop neighbors (<32 slots) and messages with a 25 byte payload (<63 bytes). This is the price to be paid for LMAC's simplicity; other protocols, however, pay in terms of overhead (RTS/CTS signaling). Another important characteristic of LMAC is that it does not try to correct any transmission errors, while the others automatically do so through their retransmission policy for handling collisions. This difference also shows up in the estimated memory footprint (i.e., RAM usage) and code complexity of the MAC protocols listed in Table 34.5. All protocols except LMAC maintain information about the last sequence number seen from each neighbor to filter out duplicates.
TABLE 34.5 Code Complexity and Memory Usage

                            802.11   LPL   S-MAC   T-MAC   LMAC (a)
  Code complexity (lines)   400      325   625     825     250
  RAM usage (bytes)         51       49    78      80      15

(a) The LMAC protocol leaves acknowledgments and retransmissions to the higher layers, adding about 75 lines of code and 40 bytes of RAM to those layers.
Our experiments use a static network with a grid topology. The radio range was set so that the nonedge nodes all have eight neighbors. Concurrent transmissions are modeled to cause collisions if the radio ranges (circles) of the senders intersect; nodes in the intersection receive a garbled packet with a failing CRC check. The application is modeled by a traffic generator at every node. The generator is parameterized to send messages with a 25 byte payload either to direct neighbors (i.e., nodes within the radio range of the sender), or to the sink node, which is located in the bottom-left corner of the grid. To route the latter messages to the sink, we use a randomized shortest-path routing method: for each message, the possible next hops are enumerated. Next hops are eligible if they have a shorter path to the final destination than the sending node. From these next hops, a random one is chosen. Thus messages flow in the correct direction, but do not use the same path every time. No control messages are exchanged for this routing scheme: nodes automatically determine the next hop. By varying the message interarrival times, we can study how the protocols perform under different loads.
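A sketch of this next-hop rule for the grid topology is given below, with the sink at (0, 0). The Chebyshev distance reflects the eight-neighbor radio range; all names are our own, and the simulator may implement the details differently.

    #include <stdlib.h>

    struct node { int x, y; };   /* grid coordinates; sink at (0, 0) */

    /* Hop distance to the sink. With eight neighbors in radio range,
     * diagonal moves are allowed, so this is the Chebyshev distance. */
    static int hops_to_sink(struct node a) {
        int dx = abs(a.x), dy = abs(a.y);
        return dx > dy ? dx : dy;
    }

    /* Enumerate the neighbors strictly closer to the sink and pick one
     * uniformly at random; messages flow in the right direction without
     * following the same path every time. */
    struct node next_hop(struct node self, const struct node nbrs[], int n) {
        struct node eligible[8];
        int ne = 0;
        for (int i = 0; i < n; i++)
            if (hops_to_sink(nbrs[i]) < hops_to_sink(self))
                eligible[ne++] = nbrs[i];
        return ne > 0 ? eligible[rand() % ne] : self;  /* self: no route */
    }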
34.7.2 Micro-Benchmarks

To determine the organizational overhead associated with each protocol we ran the simulator with an empty workload. The resulting energy consumption is shown in Table 34.6. This table also shows the effective duty cycle relative to the performance of the 802.11 protocol, which keeps all nodes listening all the time. The contention-based LPL protocol wastes no energy on organizing nodes, and achieves its target duty cycle of 10%. The slotted protocols (S-MAC and T-MAC) spend some energy on sending and receiving SYNC packets, but the impact is limited, as the effective duty cycles only marginally exceed the built-in active/sleep ratios (10 and 2.5%). Finally, note that the overhead of the TDMA-based LMAC protocol is remarkably low (6.6%), which is largely due to the efficient carrier sense at the physical layer. If the nodes were to listen to all traffic control sections completely, the overhead would grow to about 16% (12 control bytes per 76 byte slot).

TABLE 34.6 Base Performance with an Empty Network

                              802.11   LPL    S-MAC   T-MAC   LMAC
  Energy consumption (mW)     11.4     1.14   1.21    0.37    0.75
  Effective duty cycle (%)    100      10     11      3.2     6.6

Our second experiment measured the multihop latency in an empty network, which we expect to be significant for slotted and schedule-based protocols. The results in Figure 34.10 confirm this: S-MAC, T-MAC, and LMAC show end-to-end latencies that are much higher than those obtained by 802.11 and LPL. In the case of LMAC a node that wants to send or relay a packet must wait until its slot turns up. On average, this means that packets are delayed by half the length of a frame, or 236 msec, which is an order of magnitude more than the one-hop latency under 802.11 (13.2 msec). With T-MAC and S-MAC the source node must wait for the next active period to show up before it can transfer the message with an RTS/CTS/DATA/ACK sequence. This accounts for the initial offset of 263 msec. Then, in the case of T-MAC, the second node may immediately relay the message since the third node is again awake due to overhearing the first CTS packet. The fourth node, however, did not receive that same CTS and, for lack of activity, went to sleep. Therefore, the third node's attempt to relay will fail, and it has to wait until the start of the next slot. This accounts for T-MAC's staircase pattern in Figure 34.10. S-MAC is less aggressive in putting nodes to sleep, and messages can travel about 3 to 4 hops during one active period. The exact number depends on the random numbers selected from the contention interval prior to each RTS, and may be different for each data packet and active period. These numbers get averaged over multiple messages, and this explains the "erosion" of the staircase pattern when traveling more hops.
FIGURE 34.10 Multihop latency in an empty network.
Finally, observe that LPL outperforms 802.11 because it does not include an RTS/CTS handshake, but sends the DATA immediately.

FIGURE 34.11 Throughput in a 3 × 3 grid.

The third experiment that we carried out measured the maximum throughput that a single node can handle (channel utilization). We selected a 3 × 3 section of the network grid, and arranged the 8 edge nodes to repeatedly send a message (25 byte payload) to the central node. By increasing the sending rate we were able to determine the maximum throughput each MAC protocol can handle, and whether or not it collapses under high loads. Figure 34.11 shows the results of this stress test. LPL performs very poorly because of the collisions generated by hidden nodes; in the 3 × 3 configuration each sending node senses only the communications by its direct neighbors on the edge, but the other nodes are hidden from it. The repeated retransmissions issued to resolve the collisions cause the internal queues to overflow, and packets to be dropped. The RTS/CTS handshake eliminates most collisions, and 802.11 achieves a maximum throughput of around 70 packets per second, which is about 30% of the effective
bitrate (46 kbps, or 230 packets/sec) offered by the physical layer. The signaling overhead (33 bytes of MAC control + physical layer headers + radio turnaround times) already reduces this capacity to 85 packets/sec; the remaining loss is caused by the contention period prior to each RTS. S-MAC runs at a 10% duty cycle, and its throughput is therefore reduced by a factor of 10. T-MAC, on the other hand, adapts its duty cycle and is able to follow the 802.11 curve at much higher loads than the other protocols. It flattens off abruptly (around 45 packets/sec) due to its fixed contention window (9.15 msec), which is much shorter than the maximum length of 802.11's binary backoff period (305 msec). The throughput of LMAC is limited by two factors: (1) only 8 out of 32 slots in each frame are used, and (2) only 25 bytes out of each 76 byte slot are used. Consequently, LMAC's throughput is maximized at 8% of the channel capacity (8/32 × 25/76 ≈ 0.08).
34.7.3 Homogeneous Unicast and Broadcast

The micro-benchmarks discussed in the previous section studied the behavior of the MAC protocols in isolation. In this section, we report on experiments involving all nodes in the network. The results in this section provide a stepping stone for understanding the performance of the complex local gossip and convergecast patterns common to sensor network applications.

In our first network-wide experiment, we had all 100 nodes in a 10 × 10 grid repeatedly send a message (25 byte payload) to a randomly selected neighbor. The intensity of this homogeneous load on the network was controlled by adjusting the sending rate of the nodes. The topmost graph in Figure 34.12 shows the delivery ratio with increasing load. It reveals that S-MAC, T-MAC, and LPL collapse at some point, while the performance of the LMAC and 802.11 protocols degrades gracefully. When comparing the order in which the protocols break down (S-MAC, T-MAC, LPL, LMAC, 802.11) with that of the corresponding throughput benchmark (LPL, S-MAC, LMAC, T-MAC, 802.11) we see some striking differences. First, LPL does much better, because nodes are throttled back by eight neighbors instead of just a few, reducing the probability of a collision with a hidden node. Second, T-MAC does much worse, because the RTS/CTS signaling in combination with T-MAC's power-down policy silences nodes too early. Third, the gap between LMAC and 802.11 for high loads has shrunk considerably, which is mainly caused by 802.11 now suffering from exposed nodes not present in the micro-benchmark.

The middle graph in Figure 34.12 plots the energy consumption of each MAC protocol when intensifying the homogeneous load. Again we observe a few remarkable facts. First, the energy consumption of the 802.11 protocol decreases for higher loads. This is caused by the overhearing avoidance mechanism that shuts down the radio during communications in which a node is not directly involved. Second, the energy consumption of T-MAC and LPL initially increases linearly, then jumps to ∼11 mW. The jumps correspond with the breakdowns of the message delivery rates, showing that most energy is spent on retransmissions due to collisions. The difference in gradient is caused by T-MAC spending additional energy on the RTS/CTS handshake and the early-sleeping problem. Third, the energy consumption curves of LMAC and S-MAC cross at about 50 bytes/node/sec, but while LMAC still delivers more than 97% of the messages, S-MAC's delivery rate is down to just 10%. This significant difference in price/performance ratio is shown in the bottom graph of Figure 34.12, which plots the energy spent per data bit delivered. These energy-efficiency curves clearly show the collapse of the (slotted) contention-based protocols.

In our second network-wide experiment we had all 100 nodes repeatedly send a broadcast message (25 byte payload) to their neighbors. Figure 34.13 shows the delivery rates, energy consumption, and energy-efficiency metrics. When comparing these results with those for unicast (Figure 34.12), some interesting differences and similarities emerge.

Consider the LMAC protocol first. For broadcast it achieves the same delivery rate as for unicast, which is no surprise given that LMAC guarantees collision-free communications. The energy consumption to handle broadcast traffic, on the other hand, is about twice the amount needed for unicast under high loads.
This is a consequence of each node processing more incoming data; instead of one neighbor with its radio set to listen for unicast, all neighbors have to listen for a broadcast packet. This effect also explains why the energy per received bit of information is reduced by a factor of about six for light loads: all neighbors (6.84 on average) receive useful data at little extra cost, and the energy is calculated per received bit.
FIGURE 34.12 Performance under homogeneous unicast traffic: delivery rate (top), energy consumption (middle, avg. mW/node), and energy efficiency (bottom, energy per bit in mJ), each plotted against payload (bytes/node/sec) for 802.11, LPL, S-MAC, T-MAC, and LMAC.
FIGURE 34.13 Performance under homogeneous broadcast traffic: (a) delivery rate, (b) energy consumption (avg. mW/node), and (c) energy efficiency (energy per bit, mJ), each plotted against payload (bytes/node/sec) for 802.11, LPL, S-MAC, T-MAC, and LMAC.
When considering the other protocols, we find that the delivery rates degrade for light loads with respect to the unicast experiments, but improve dramatically for high loads. In particular, we find no breakdown points as with unicast traffic. The reason is twofold: (1) there are no retransmissions that clog up the network, and (2) even when collisions occur, some of the neighbors still receive data. Note that the delivery ratio should be interpreted as the fraction of neighbors receiving a broadcast message, not as the probability that the message is received by all neighbors. The slotted protocols (S-MAC and T-MAC) perform considerably worse than the contention-based protocols (802.11 and LPL). The reason for this is that by grouping all traffic into a rather short active period, the probability of a collision is increased considerably. The reason that 802.11 outperforms LPL is that the latter uses a longer preamble, and although this increases the length of the DATA by only about 5%, the probability of a collision is raised enough to make a difference in delivery rate.

The energy-efficiency curves show that all protocols except S-MAC spend less energy per bit when the intensity of the broadcast traffic increases. In particular, the contention-based protocols do not suffer from a collapse as with unicast. The reason that the energy spent per bit increases with S-MAC is threefold: (1) it suffers from considerably more collisions in its small active period, (2) the fraction of time spent in transmitting steadily increases, especially since no time is spent waiting during a handshake as for unicast, and (3) overhearing avoidance is no longer applicable, forcing the radio to be on all the time during S-MAC's active period. The latter reason also explains why 802.11's energy consumption does not go down with increasing load as it did for unicast traffic.
34.7.4 Local Gossip

The first communication pattern specific to sensor network applications that we studied was local gossip. We designated a 5 × 5 area in the middle of the grid as the event region, in which nodes would repeatedly send a message (25-byte payload) to a randomly selected neighbor. In essence, local gossip is a mixture of 75% empty workload (Table 34.6) and 25% homogeneous workload (Figure 34.12). The delivery rates associated with local gossip, as shown in Figure 34.14, are completely determined by the homogeneous unicast component of the workload, and therefore resemble the curves in Figure 34.12 to a large extent. The LMAC curve is identical; the others are shifted to the right because collisions occur less frequently due to the relatively large number of edge nodes with inactive neighbors (16/25 versus 36/100). The energy consumption numbers, which are averages over the whole network, are diluted by the empty workload component (cf. Figure 34.12 and Figure 34.14). In contrast, the energy-efficiency numbers, not shown for brevity, are raised, since the energy spent by passive nodes (idle listening) is amortized over the limited traffic in the 5 × 5 region.
34.7.5 Convergecast

In our final experiment we studied the convergecast communication pattern. All 100 nodes in the network periodically send a message (25-byte payload) to the sink in the bottom-left corner of the grid. To maximize the load on the MAC protocols, messages are not aggregated at intermediate nodes. Figure 34.15 shows the delivery rates and energy efficiencies for the convergecast pattern. The shapes of these curves show a large similarity with the homogeneous unicast pattern. Note that the generated load that can be handled is much lower than with homogeneous unicast, since each injected message needs to travel 6.15 hops on average. The performance results, however, do not simply scale with the path-length factor. The breakdown points on the delivery curves for convergecast are shifted far more to the left than a factor of six, and the order in which the protocols break down also changes significantly. In particular, the LMAC protocol cannot handle the heavy loads around the sink, since each node can only use the capacity of one slot, as demonstrated by the throughput micro-benchmark. T-MAC and LPL handle the high loads around the sink much better than LMAC, with LPL being slightly more efficient. Both suffer from a collapse, however, when the load is increased, causing the energy consumed per bit to rocket upwards. Furthermore, note that energy efficiency degrades by more than a factor of six compared with that for unicast under comparable load.
FIGURE 34.14 Performance under local gossip: delivery rate (top) and energy consumption (bottom, avg. mW/node), each plotted against payload (bytes/node/sec) for 802.11, LPL, S-MAC, T-MAC, and LMAC.
Apparently, even the adaptive T-MAC protocol finds it impossible to select the right duty cycle for each node.
34.7.6 Discussion

When reviewing the simulation results we find that no MAC protocol outperforms the others in all experiments. Each protocol has its strong and weak points, reflecting its particular choice of how to trade off performance (latency, throughput) for cost (energy consumption). Some general observations, however, can be made:

Communication grouping considered harmful. The slotted protocols (S-MAC and T-MAC) organize nodes to communicate during small periods of activity. The advantage is that very low duty cycles can be obtained, but at the expense of high latency and a collapse under high loads. T-MAC's automatic adaptation of the duty cycle allows it to handle higher loads; S-MAC's fixed duty cycle bounds the energy consumption under a collapse.
FIGURE 34.15 Performance under convergecast: delivery rate (top) and energy efficiency (bottom, energy per bit in mJ), each plotted against payload (bytes/node/sec) for 802.11, LPL, S-MAC, T-MAC, and LMAC.
The TDMA-based LMAC protocol also limits the moments at which nodes may communicate, and therefore incurs high latencies in general and reduced throughput under high load. In contrast to T-MAC, its energy consumption does not deteriorate; LMAC is rather robust and its performance degrades gracefully under higher loads. The LPL protocol is the most flexible, since it puts only minor restrictions on when nodes can communicate (i.e., once every 300 µsec). Its sampling approach, however, critically depends on the radio's ability to switch on quickly. This is the case for the RFM radio at hand, but preliminary experiments with the Chipcon radio show that LPL's advantage weakens when operating with a corresponding 2 out of 20 msec duty cycle.

Collision avoidance considered prohibitive. On the one hand, the RTS/CTS handshake prevents collisions due to hidden nodes, which is good. On the other hand, the RTS/CTS handshake reduces the effective channel capacity, since a communication takes more time (11.68 versus 8.31 msec), which lowers the packet transfer rate at which the network collapses. Given that typical messages in sensor networks are small,
the overheads associated with collision avoidance prove to be prohibitive, especially in combination with communication grouping; a quick capacity calculation is sketched at the end of this discussion.

Adaptivity considered essential. The results for the local gossip and convergecast communication patterns show that MAC protocols must be able to adapt to local traffic demands. Static protocols either consume too much energy under low loads (e.g., S-MAC), or throttle throughput too much under high loads (e.g., LMAC). The current generation of adaptive protocols (e.g., T-MAC and LPL), however, is not the final answer, since these protocols suffer from contention collapse, forcing applications to be aware of this and take precautions.
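The cost of collision avoidance can be made concrete with the per-message channel occupancy times quoted above; the sketch below is illustrative arithmetic only.

```python
# Effective channel capacity with and without the RTS/CTS handshake,
# using the per-message occupancy times quoted in the text.
T_WITH_HANDSHAKE = 11.68e-3   # sec per unicast message incl. RTS/CTS
T_WITHOUT = 8.31e-3           # sec per message without the handshake

rate_with = 1 / T_WITH_HANDSHAKE     # ~86 packets/sec
rate_without = 1 / T_WITHOUT         # ~120 packets/sec
loss = 1 - rate_with / rate_without  # ~29% of capacity spent on avoidance

print(f"{rate_with:.0f} vs {rate_without:.0f} packets/sec ({loss:.0%} lost)")
```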
34.8 Conclusions

Medium access protocols for wireless sensor networks trade off performance (latency, throughput, and fairness) for cost (energy consumption). They do so by turning off the radio for significant amounts of time, reducing the energy wasted by idle listening, which dominates the cost of typical WLAN-based MAC protocols. Other sources of overhead include collisions, overhearing, protocol overhead, and traffic fluctuations. Different protocols take different approaches to reduce (some of) these overheads. They can be classified according to three important design decisions: (1) the number of channels used (single, double, or multiple), (2) the way in which nodes are organized (random, slotted, frames), and (3) the notification method used (listening, wake-up, schedule). Given that the current generation of sensor nodes is equipped with one radio, most protocols use a single channel. The organizational choice, however, is not so easily decided on, since it reflects the fundamental trade-off between flexibility and energy efficiency.

Contention-based protocols like CSMA are extremely flexible regarding the time, location, and amount of data transferred by individual nodes. This gives them the advantage of handling the traffic fluctuations present in typical monitoring applications running on wireless sensor networks. Contention-based protocols can be made energy efficient by implementing a duty cycle at the physical level, provided that the radio can be switched on and off rapidly. The idea is to stretch the preamble, which allows potential receivers to sample the carrier at a low rate. Slotted protocols organize nodes to synchronize on a common slot structure. They reduce idle listening by implementing a duty cycle within each slot. This duty cycle need not be fixed, and can be adapted automatically to match demands. TDMA-based protocols have the advantage of being inherently free of idle listening, since nodes are informed up front, by means of a schedule, when to expect incoming traffic. To control the overheads associated with computing the schedule and distributing it through the network, TDMA-based protocols must either limit the deployment scenario (e.g., single hop) or hard-code some parameters (e.g., the maximum number of two-hop neighbors), compromising on flexibility.

A head-to-head comparison of sample protocols from each class revealed that there is no single, best MAC protocol that outperforms all others. What did become apparent, however, is that adaptivity is mandatory to handle the generic local gossip and convergecast communication patterns, which display traffic fluctuations in both time and space. Considering the speed at which protocols have been developed so far, we expect a number of new protocols to appear that will strike yet another balance between flexibility and energy efficiency. Other future developments may include cross-layer optimizations with routing and data aggregation protocols, and an increased level of robustness to handle practical issues such as asymmetric links and node failures.
Acknowledgments

We thank Tijs van Dam for his initial efforts in designing the T-MAC protocol and putting the issue of energy-efficient MAC protocols on the Delft research agenda. We thank Ivaylo Haratcherev, Tom Parker, and Niels Reijers for proofreading this chapter, correcting numerous mistakes, filtering out jargon, and rearranging material, all of which greatly enhanced the readability of the text.
35 Overview of Time Synchronization Issues in Sensor Networks

Weilian Su
Naval Postgraduate School

35.1 Introduction
35.2 Design Challenges
35.3 Factors Influencing Time Synchronization
35.4 Basics of Time Synchronization
35.5 Time Synchronization Protocols for Sensor Networks
35.6 Conclusions
Acknowledgment
References
35.1 Introduction

In the near future, small intelligent devices will be deployed in homes, plantations, oceans, rivers, streets, and highways to monitor the environment [1]. Events such as target tracking, speed estimation, and ocean current monitoring require knowledge of the time between the sensor nodes that detect the events. In addition, sensor nodes may have to time-stamp data packets for security reasons. With a common view of time, voice and video data from different sensor nodes can be fused and displayed in a meaningful way at the sink. Also, medium access schemes such as Time Division Multiple Access (TDMA) require the nodes to be synchronized, so that the nodes can be turned off to save energy.

The purpose of any time synchronization technique is to maintain a similar time within a certain tolerance throughout the lifetime of the network, or among a specific set of nodes in the network. Combined with the criteria that sensor nodes have to be energy efficient, low-cost, and small in a multihop environment, this requirement makes time synchronization a challenging problem to solve. In addition, the sensor nodes may be left unattended for a long period of time, for example, in deep space or on an ocean floor. When messages are exchanged using short-distance multihop broadcast, the software and medium access times and the variation of the access time may contribute the most to time fluctuations and differences in the path delays. Also, the time difference between sensor nodes may become significant over time due to the drifting of the local clocks.
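To give a feel for how quickly clock drift accumulates, the short sketch below computes the worst-case divergence of two free-running clocks per day, using the quartz drift-rate range of 10⁻⁴ to 10⁻⁶ quoted in Section 35.3; the calculation is illustrative only.

```python
# Worst-case divergence of two free-running clocks: each clock may gain
# or lose rho seconds per second, so two clocks can drift apart at 2*rho.
SECONDS_PER_DAY = 24 * 3600

for rho in (1e-4, 1e-6):          # quartz oscillator drift-rate range
    divergence = 2 * rho * SECONDS_PER_DAY
    print(f"rho = {rho:.0e}: up to {divergence:.2f} sec/day apart")
```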
In this chapter, the background of time synchronization is provided to enable new developments or enhancements of timing techniques for sensor networks. The design challenges and factors influencing time synchronization are described in Sections 35.2 and 35.3, respectively. In addition, the basics of time synchronization for sensor networks are explained in Section 35.4. Afterwards, different types of timing techniques are discussed in Section 35.5. Finally, the chapter is concluded in Section 35.6.
35.2 Design Challenges

In the future, many low-end sensor nodes will be deployed to minimize the cost of the sensor networks. These nodes may work collaboratively to provide time synchronization for the whole sensor network. The precision of the synchronized clocks depends on the needs of the applications. For example, a sensor network requiring TDMA service may require microsecond-level agreement among the neighbor nodes, while a data gathering application for sensor networks requires only milliseconds of precision. As sensor networks are application driven, the design challenges of a time synchronization protocol are also dictated by the application. These challenges provide an overall guideline and requirement when considering the features of a time synchronization protocol for sensor networks; they are: robustness, energy awareness, server-less operation, light weight, and tunable service.

Robust. Sensor nodes may fail, and the failures should not have a significant effect on the time synchronization error. If sensor nodes depend on a specific master to synchronize their clocks, a failure or anomaly of the master's clock may create a cascade effect in which nodes in the network become unsynchronized. So, a time synchronization protocol has to handle the unexpected or periodic failures of the sensor nodes. If failures do occur, the errors caused by these failures should not be propagated throughout the network.

Energy aware. Since each node is battery limited, the use of resources should be evenly spread and controlled. A time synchronization protocol should use the minimum number of messages to synchronize the nodes in the shortest time. In addition, the load for time synchronization should be shared, so that some nodes in the network do not fail earlier than others. If some parts of the network fail earlier than others, the partitioned networks may drift apart from each other and become unsynchronized.

Server-less. A precise time server may not be available. In addition, time servers may fail when placed in the sensor field. As a result, sensor nodes should be able to synchronize to a common time without precise time servers. When precise time servers are available, the quality of the synchronized clocks as well as the time needed to synchronize the clocks of the network should be much better. This server-less feature also helps to address the robustness challenge stated earlier.

Light-weight. The complexity of the time synchronization protocol has to be low in order for it to be programmed into the sensor nodes. Besides being energy limited, the sensor nodes are memory limited as well. The synchronization protocol may be programmed into a field programmable gate array (FPGA) or designed into an ASIC. By having the time synchronization protocol tightly integrated with the hardware, the delay and variation of the processing may be smaller. Higher precision, however, comes at a higher sensor node cost.

Tunable service. Some services, such as medium access, may require time synchronization to be always on, while others only need it when there is an event. Since time synchronization can consume a lot of energy, a tunable time synchronization service is applicable for some applications. Nevertheless, there are needs for both types of synchronization protocols.

The above challenges provide a guideline for developing various types of time synchronization protocols that are applicable to sensor networks. A time synchronization protocol may have a mixture of these design features.
In addition, some applications in sensor networks may not require the time synchronization protocol to meet all these requirements. For example, a data gathering application may require the tunable service and light-weight features more than the server-less capability. The tunable service and light-weight features allow the application to gather precise data when the users require it. In addition, the nodes that are not part of this data gathering process may not have to be synchronized.
Also, the precision of the time does not need to be high, because the users may only need millisecond precision to satisfy their needs. As these design challenges are important for guiding the development of a time synchronization protocol, the influencing factors that affect the quality of the synchronized clocks also have to be discussed. Although the influencing factors are similar to those in existing distributed computer systems, they occur at different extremes. These influencing factors are discussed in Section 35.3.
35.3 Factors Influencing Time Synchronization

Regardless of the design challenges that a time synchronization protocol wants to address, the protocol still needs to address the inherent problems of time synchronization. In addition, small and low-end sensor nodes may exhibit device behaviors that are much worse than those of large systems such as personal computers (PCs). As a result, time synchronization with these nodes presents a different set of problems. Some of the factors influencing time synchronization in large systems also apply to sensor networks [2]; they are temperature, phase noise, frequency noise, asymmetric delays, and clock glitches.

Temperature. Since sensor nodes are deployed in various places, the temperature variation throughout the day may cause the clock to speed up or slow down. For a typical PC, the clock drifts a few parts per million (ppm) during the day [3]. For low-end sensor nodes, the drift may be even worse.

Phase noise. Some of the causes of phase noise are access fluctuation at the hardware interface, response variation of the operating system to interrupts, and jitter in the network delay. The jitter in the network delay may be due to medium access and queueing delays.

Frequency noise. The frequency noise is due to the instability of the clock crystal [4]. A low-end crystal may experience large frequency fluctuation, because the frequency spectrum of the crystal has large sidebands on adjacent frequencies. The drift rate ρ values for quartz oscillators are between 10⁻⁴ and 10⁻⁶ [5].

Asymmetric delay. Since sensor nodes communicate with each other through the wireless medium, the delay of the path from one node to another may differ from that of the return path. As a result, an asymmetric delay may cause an offset to the clock that cannot be detected by a variance-type method [2]. If the asymmetric delay is static, the time offset between any two nodes is also static. The asymmetric delay is bounded by one-half the round-trip time between the two nodes [2].

Clock glitches. Clock glitches are sudden jumps in time. They may be caused by hardware or software anomalies such as frequency and time steps.

Since sensor nodes are randomly deployed and their broadcast ranges are small, the influencing factors may shape the design of the time synchronization protocol. In addition, the links between the sensor nodes may not be reliable. As a result, the influencing factors may have to be addressed differently. In the following section, the basics of time synchronization for sensor networks are discussed.
35.4 Basics of Time Synchronization

As the factors described in Section 35.3 influence the error budget of the synchronized clocks, the purpose of a time synchronization protocol is to minimize the effects of these factors. Before developing a solution to address these factors, some basics of time synchronization for sensor networks need to be discussed. These basics provide the fundamentals for designing a time synchronization protocol.

If a better clock crystal is used, the drift rate ρ may be much smaller. Usually, the hardware clock time H(t) at real-time t is within a linear envelope of the real-time, as illustrated in Figure 35.1. Since the clock drifts away from real-time, the time difference between two events measured with the same hardware clock may have a maximum error of ρ(b − a) [5], where a and b are the times of occurrence of the first and second events, respectively. For modern computers, the clock granularity may be negligible, but it may contribute a significant portion to the error budget if the clock of a sensor node is really coarse, running at the kHz range instead of the MHz range.
FIGURE 35.1 Drifting of hardware clock time: the hardware clock time H(t) stays within a linear envelope (slopes 1 − ρ and 1 + ρ around the ideal time) of the real-time t.
In certain applications, a sensor node may have a volume of one cubic centimeter [6], so a fast oscillator may not be possible or suitable for such a size. Regardless of the clock granularity, the hardware clock time H(t) is usually translated into a virtual clock time by adding an adjustment constant to it. Normally, it is the virtual clock time that we read from a computer. Hence, a time synchronization protocol may adjust the virtual clock time and discipline the hardware clock to compensate for the time difference between the clocks of the nodes. Either approach has to deal with the factors influencing time synchronization as described earlier.

When an application issues a request to obtain the time, the time is returned after a certain delay. This software access delay may fluctuate according to the loading of the system. This type of fluctuation is nondeterministic and may be lessened if a real-time operating system and hardware architecture are used. For low-end sensor nodes, the software access time may be on the order of a few hundred microseconds. For example, a Mica mote runs at 4 MHz [7] and has a clock granularity of 0.25 µsec. If the node is 80% loaded and it takes 100 cycles to obtain the time, the software access time is around 125 µsec; a sketch of this calculation is given below.

In addition to the software access time, the medium access time also contributes to the nondeterministic delay that a message experiences. If carrier-sense multiple access (CSMA) is used, the back-off window size as well as the traffic load affect the medium access time [8–10]. Once the sensor node obtains the channel, the transmission and propagation times are fairly deterministic, and they can be estimated from the packet size, transmission rate, and speed of light.
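A minimal sketch of this estimate follows. The model, that the 100 read cycles must be served from the 20% of CPU time left free by the 80% load, is our assumption; the text does not spell it out.

```python
# Worked example of the software access delay estimate from the text:
# a 4 MHz Mica mote, 80% loaded, needing 100 cycles to read the time.
CLOCK_HZ = 4_000_000
READ_CYCLES = 100
LOAD = 0.80                       # fraction of CPU busy with other work

granularity = 1 / CLOCK_HZ                      # 0.25 usec per cycle
access_time = READ_CYCLES * granularity / (1 - LOAD)

print(f"clock granularity: {granularity * 1e6:.2f} usec")
print(f"software access time: {access_time * 1e6:.0f} usec")   # ~125 usec
```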
FIGURE 35.2 Round-trip time. Node A sends a message at t1 (incurring S1, M1, T1, P12); node B receives it at t2, processes it (R2, S2), and sends an acknowledgment at t3 (S3, M3, T3, P34), which node A receives at t4 (R4, S4). S = software access time, M = medium access time, T = transmission time, P = propagation time, R = reception time.
In summary, the delays experienced when sending a message at real-time t1 and receiving an acknowledgment (ACK) at real-time t4 are shown in Figure 35.2. The message from node A incurs the software access, medium access, transmission, and propagation times. These times are represented by S1, M1, T1, and P12. Once the message is received by node B at t2, it will incur extra delays through receiving and processing. After the message is processed, an ACK is sent to node A at t3. The total delay at node B is the summation of R2, S2, (1 ± ρB)(t3 − t2), S3, M3, and T3, where ρB is the drift rate at node B and the difference (t3 − t2) accounts for the waiting time before an ACK is sent to node A by node B. After node B sends the ACK, the ACK propagates through the wireless medium and arrives at node A. Afterwards, node A processes the ACK. The path delays for sending and receiving the ACK from node B to A are P34, R4, and S4. The round-trip time in real-time t for sending a message and receiving an ACK is calculated by

t4 − t1 = S1 + M1 + T1 + P12 + R2 + S2 + (1 ± ρB)(t3 − t2) + S3 + M3 + T3 + P34 + R4 + S4    (35.1)

where S, M, T, P, and R are the software access, medium access, transmission, propagation, and reception times, respectively. If the round-trip time is measured using the hardware clock of node A, it has to be adjusted by the drift rate ρA of node A. If the granularity of the hardware clock is coarse, the error δ contributed by the granularity should be accounted for. As a result, the round-trip time measured with the hardware clock is bounded by an error associated with the clock drift and granularity, as determined by

(1 − ρA)(t4 − t1) ≤ H(t4) − H(t1) < (1 + ρA)(t4 − t1) + δ    (35.2)
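The bound of Equation (35.2) can be sketched directly; the function and example values below are illustrative only.

```python
# Bound of Equation (35.2) on the round-trip time as measured by node A.
def measured_rtt_bounds(true_rtt: float, rho_a: float, delta: float):
    """Interval [lower, upper) containing H(t4) - H(t1)."""
    lower = (1 - rho_a) * true_rtt
    upper = (1 + rho_a) * true_rtt + delta
    return lower, upper

# Example: 2 msec true round-trip, 10 ppm drift, 0.25 usec granularity.
lo, hi = measured_rtt_bounds(2e-3, 10e-6, 0.25e-6)
print(f"measured RTT within [{lo * 1e6:.3f}, {hi * 1e6:.3f}) usec")
```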
The bound on the round-trip time fluctuates with respect to time, since the software and medium access fluctuate according to the load at the node and in the channel. Although the transmission, propagation, and reception times may be deterministic, they may contribute to the asymmetric delay that can cause a time offset between nodes A and B. In the following section, different types of time synchronization protocols are described. Each of them tries to minimize the effect of the nondeterministic and asymmetric delays. For sensor networks, it is best to minimize the propagation delay variation. For example, the delays and jitters between two nodes may be different in the forward and return paths. In addition, the jitters may vary significantly due to frequent node failures, since the messages are relayed hop-by-hop between the two nodes.
The synchronization protocols in the following section focus on synchronizing nodes hop-by-hop, so the propagation time and its variation do not have too much effect on the error of the synchronized clocks. Although the sensor nodes are densely deployed and can take advantage of the short distances, the medium and software access times may contribute the most to the nondeterminism of the path delay during a one-hop synchronization. The way to provide time synchronization for sensor networks may differ between applications. The current timing techniques that are available for different applications are described in the following section.
35.5 Time Synchronization Protocols for Sensor Networks

There are three types of timing techniques, as shown in Table 35.1, and each of these types has to address the design challenges and factors affecting time synchronization as mentioned in Sections 35.2 and 35.3, respectively. In addition, the timing techniques have to address the mapping between the sensor network time and the Internet time, for example, coordinated universal time. In the following paragraphs, examples of these types of timing techniques are described, namely the Network Time Protocol (NTP) [11], the Timing-sync Protocol for Sensor Networks (TPSN) [12], Reference-Broadcast Synchronization (RBS) [13], and the Time-Diffusion Synchronization Protocol (TDP) [14].

TABLE 35.1 Three Types of Timing Techniques

(1) Relies on fixed time servers to synchronize the network: the nodes are synchronized to time servers that are readily available. These time servers are expected to be robust and highly precise.

(2) Translates time throughout the network: the time is translated hop-by-hop from the source to the sink. In essence, it is a time translation service.

(3) Self-organizes to synchronize the network: the protocol does not depend on specialized time servers. It automatically organizes and determines the master nodes as the temporary time servers.

In the Internet, NTP is used to discipline the frequency of each node's oscillator. The accuracy of NTP synchronization is on the order of milliseconds [15]. It may be useful to use NTP to discipline the oscillators of the sensor nodes, but the connection to the time servers may not be possible because of frequent sensor node failures. In addition, disciplining all the sensor nodes in the sensor field may be a problem due to interference from the environment and the large variation of delay between different parts of the sensor field. The interference can temporarily partition the sensor field into multiple smaller fields, causing undisciplined clocks among these smaller fields. The NTP protocol may be considered as type (1) of the timing techniques. In addition, it has to be refined in order to address the design challenges presented by sensor networks. As of now, NTP is very computationally intensive and requires a precise time server to synchronize the nodes in the network. In addition, it does not take into account the energy consumption required for time synchronization. As a result, NTP does not satisfy the energy-aware, server-less, and light-weight design challenges of sensor networks. Although NTP can be robust, it may suffer large propagation delays when sending timing messages to the time servers. In addition, the nodes are synchronized in a hierarchical manner, and some time servers in the middle of the hierarchy may fail, leaving nodes unsynchronized. Once these nodes fail, it is hard to reconfigure the network, since the hierarchy is manually configured.

Another time synchronization technique that adopts some concepts from NTP is TPSN. The TPSN requires the root node to synchronize all or part of the nodes in the sensor field. The root node synchronizes the nodes in a hierarchical way.
FIGURE 35.3 Two-way message handshake: node A sends a synchronization pulse at g1, node B receives it at g2 and returns an acknowledgment at g3, which node A receives at g4.
Before synchronization, the root node constructs the hierarchy by broadcasting a level_discovery packet. The first level of the hierarchy is level 0, which is where the root node resides. The nodes receiving the level_discovery packet from the root node are the nodes belonging to level 1. Afterwards, the nodes in level 1 broadcast their level_discovery packet, and neighbor nodes receiving the level_discovery packet for the first time are the level 2 nodes. This process continues until all the nodes in the sensor field have a level number. The root node sends a time_sync packet to initialize the time synchronization process. Afterwards, the nodes in level 1 synchronize to level 0 by performing the two-way handshake shown in Figure 35.3. This type of handshake is used by NTP to synchronize the clocks of distributed computer systems. At the end of the handshake at time g4, node A obtains the times g1, g2, and g3 from the acknowledgment packet. The times g2 and g3 are obtained from the clock of sensor node B, while g1 and g4 are from node A. After processing the acknowledgment packet, node A readjusts its clock by the clock drift value Δ, where Δ = ((g2 − g1) − (g4 − g3))/2. At the same time, the level 2 nodes overhear this message handshake and wait for a random time before synchronizing with level 1 nodes. This synchronization process continues until all the nodes in the network are synchronized. Since TPSN enables time synchronization from one root node, it is type (1) of the timing techniques.

The TPSN is based on a sender-receiver synchronization model, where the receiver synchronizes with the time of the sender according to the two-way message handshake shown in Figure 35.3. It tries to provide a light-weight and tunable time synchronization service. On the other hand, it requires a time server and does not address the robustness and energy-awareness design goals. Since the design of TPSN is based on a hierarchical methodology similar to NTP, nodes within the hierarchy may fail and cause other nodes to become unsynchronized. In addition, node movements may render the hierarchy useless, because nodes may move out of their levels. Hence, nodes at level i cannot synchronize with nodes at level i − 1, and synchronization may fail throughout the network.
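The handshake arithmetic is compact enough to sketch. The clock_offset function below implements the Δ formula above; the one_way_delay helper is our addition (it assumes symmetric path delays, as does the offset formula) and is not part of the TPSN description.

```python
# Offset computation of the two-way handshake (Figure 35.3): g1 and g4
# are read from node A's clock, g2 and g3 from node B's clock.
def clock_offset(g1: float, g2: float, g3: float, g4: float) -> float:
    """Offset of node B's clock relative to node A's (the Delta above)."""
    return ((g2 - g1) - (g4 - g3)) / 2

def one_way_delay(g1: float, g2: float, g3: float, g4: float) -> float:
    """Estimated one-way delay (hypothetical helper; assumes symmetry)."""
    return ((g4 - g1) - (g3 - g2)) / 2

# Node B's clock runs 5 units ahead; the true one-way delay is 2 units.
print(clock_offset(g1=10, g2=17, g3=18, g4=15))    # -> 5.0
print(one_way_delay(g1=10, g2=17, g3=18, g4=15))   # -> 2.0
```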
Although this time synchronization service is tunable and light-weight, there may not be translation nodes on the route path that the message is relayed. As a result, services may not be available on some routes. In addition, this protocol is not suitable for medium access scheme, such as TDMA, since the clocks of all the nodes in the network are not adjusted to a common time.
FIGURE 35.4 Illustration of the RBS: transmitter nodes (A) broadcast reference messages to receiver nodes (B); nodes within the broadcast regions of two transmitters serve as translation nodes (C).
FIGURE 35.5 TDP concept: master nodes C and G diffuse time-stamped messages up to 3 hops through diffused leader nodes (among them M, N, D, E, and F).
Another emerging timing technique is the TDP. The TDP is used to maintain the time throughout the network within a certain tolerance. The tolerance level can be adjusted based on the purpose of the sensor network. The TDP automatically self-configures by electing master nodes to synchronize the sensor network. In addition, the election process is sensitive to the energy requirements as well as the quality of the clocks. The sensor network may be deployed in unattended areas, and the TDP still synchronizes the unattended network to a common time. It is considered as type (3) of the timing techniques.

The TDP concept is illustrated in Figure 35.5. The elected master nodes are nodes C and G. First, the master nodes send a message to their neighbors to measure the round-trip times. Once the neighbors receive the message, they self-determine whether they should become diffused leader nodes.
The ones elected to become diffused leader nodes reply to the master nodes and start sending a message to measure the round-trip time to their neighbors. As shown in Figure 35.5, nodes M, N, and D are the diffused leader nodes of node C. Once the replies are received by the master nodes, the round-trip time and the standard deviation of the round-trip time are calculated. The one-way delay from the master nodes to the neighbor nodes is half of the measured round-trip time. Afterwards, the master nodes send a time-stamped message containing the standard deviation to the neighbor nodes. The time in the time-stamped message is adjusted with the one-way delay. Once the diffused leader nodes receive the time-stamped message, they broadcast it after adjusting the time in the message with their measured one-way delay and inserting their standard deviation of the round-trip time. This diffusion process continues for n times, where n is the number of hops from the master nodes. In Figure 35.5, the time is diffused 3 hops from the master nodes C and G. The nodes D, E, and F are the diffused leader nodes that diffuse the time-stamped messages originated from the master nodes.

Nodes that have received more than one time-stamped message originating from different master nodes use the standard deviations carried in the time-stamped messages to weight each message's contribution to their new time; a sketch of this weighted combination is given below. In essence, the nodes weight the times diffused by the master nodes to obtain a new time. This process provides a smooth time variation between the nodes in the network. The smooth transition is important for some applications, such as target tracking and speed estimation. The master nodes are autonomously elected, so the network is robust to failures. Although some of the nodes may die, there are still other nodes in the network that can self-determine to become master nodes. This feature also enables the network to become server-less if necessary and to reach an equilibrium time. In addition, the master and diffused leader nodes are self-determined based on their own energy level. Also, the TDP is light-weight, but it may not be as tunable as the RBS.
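The weighted combination can be sketched as follows. Weighting each diffused time by the inverse of its accumulated standard deviation is our reading of the description above; the exact TDP weighting may differ.

```python
# TDP-style weighted combination of times diffused by several masters.
def combine_times(times: list[float], std_devs: list[float]) -> float:
    """Weighted average of diffused times; smaller deviation, larger weight."""
    weights = [1.0 / s for s in std_devs]
    return sum(w * t for w, t in zip(weights, times)) / sum(weights)

# Two masters diffused 100.4 and 100.0; the first path is noisier, so
# the combined time ends up closer to 100.0.
print(combine_times([100.4, 100.0], [2.0, 0.5]))   # -> 100.08
```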
In summary, these timing techniques may be used for different types of applications, and each of them has its own benefits. All of these techniques try to address the factors influencing time synchronization while being designed according to the challenges described in Section 35.2. Depending on the types of services required by the applications or the hardware limitations of the sensor nodes, some of these timing techniques may be applied.

35.6 Conclusions

The design challenges and factors influencing time synchronization for sensor networks are described in Sections 35.2 and 35.3, respectively. They provide guidelines for developing time synchronization protocols. The requirements of sensor networks are different from those of traditional distributed computer systems. As a result, new types of timing techniques are required to address the specific needs of the applications. These techniques are described in Section 35.5. Since the range of applications of sensor networks is wide, new timing techniques are encouraged for different types of applications. This is to provide optimized schemes tailored for unique environments and purposes.
Acknowledgment

The author wishes to thank Dr. Ian F. Akyildiz for his encouragement and support.
References

[1] Akyildiz, I.F. et al., Wireless Sensor Networks: A Survey. Computer Networks Journal, 38, 393–422, 2002.
[2] Levine, J., Time Synchronization Over the Internet Using an Adaptive Frequency-Locked Loop. IEEE Transactions on Ultrasonics, Ferroelectrics, and Frequency Control, 46, 888–896, 1999.
[3] Mills, D.L., Adaptive Hybrid Clock Discipline Algorithm for the Network Time Protocol. IEEE/ACM Transactions on Networking, 6, 505–514, 1998. [4] Allan, D., Time and Frequency (Time-Domain) Characterization, Estimation, and Prediction of Precision Clocks and Oscillators. IEEE Transactions on Ultrasonics, Ferroelectrics, and Frequency Control, 34, 647–654, 1987. [5] Cristian, F. and Fetzer, C., Probabilistic Internal Clock Synchronization. In Proceedings of the Thirteenth Symposium on Reliable Distributed Systems. Dana Point, CA, October 1994, pp. 22–31. [6] Pottie, G.J. and Kaiser, W.J., Wireless Integrated Network Sensors. Communications of the ACM, 43, 551–558, 2000. [7] MICA Motes and Sensors, http://www.xbow.com. [8] Bianchi, G., Performance Analysis of the IEEE 802.11 Distributed Coordination Function. IEEE Journal on Selected Areas in Communications, 18, 535–547, 2000. [9] Crow, B.P. et al., Investigation of the IEEE 802.11 Medium Access Control (MAC) Sublayer Functions. In Proceedings of the IEEE INFOCOM’97. Kobe, Japan, April 1997, pp. 126–133. [10] Tay, Y.C. and Chua, K.C., A Capacity Analysis for the IEEE 802.11 MAC Protocol. ACM Wireless Networks Journal, 7, 159–171, 2001. [11] Mills, D.L., Internet Time Synchronization: The Network Time Protocol. Global States and Time in Distributed Systems. IEEE Computer Society Press, Washington, 1994. [12] Ganeriwal, S., Kumar, R., and Srivastava, M.B., Timing-Sync Protocol for Sensor Networks. In ACM SenSys 2003. Los Angeles, CA, November 2003 (to appear). [13] Elson, J., Girod, L., and Estrin, D., Fine-Grained Network Time Synchronization Using Reference Broadcasts. In Proceedings of the Fifth Symposium on Operating Systems Design and Implementation (OSDI 2002). Boston, MA, December 2002. [14] Su, W. and Akyildiz, I.F., Time-Diffusion Synchronization Protocol for Wireless Sensor Networks. IEEE/ACM Transaction on Networking, 13(2), April 2005. [15] IEEE 1588, Standard for a Precision Clock Synchronization Protocol for Networked Measurement and Control Systems, 2002.
36 Distributed Localization Algorithms

Koen Langendoen and Niels Reijers
Delft University of Technology

36.1 Introduction
36.2 Localization Algorithms
    Generic Approach • Phase 1: Distance to Anchors • Phase 2: Node Position • Phase 3: Refinement
36.3 Simulation Environment
    Standard Scenario
36.4 Results
    Phase 1: Distance to Anchors • Phase 2: Node Position
36.5 Discussion
    Phases 1 and 2 Combined • Phase 3: Refinement • Communication Cost • Recommendations
36.6 Conclusions
    Future Work
Acknowledgments
References

Reprinted from K. Langendoen and N. Reijers. Elsevier Computer Networks, 43: 499–518, 2003. With permission.
36.1 Introduction

New technology offers new opportunities, but it also introduces new problems. This is particularly true for sensor networks, where the capabilities of individual nodes are very limited. Hence, collaboration between nodes is required, but energy conservation is a major concern, which implies that communication should be minimized. These conflicting objectives require unorthodox solutions for many situations. A recent survey by Akyildiz et al. [1] discusses a long list of open research issues that must be addressed before sensor networks can become widely deployed. The problems range from the physical layer (low-power sensing, processing, and communication hardware) all the way up to the application layer (query and data dissemination protocols). In this chapter we address the issue of localization in ad-hoc sensor networks. That is, we want to determine the location of individual sensor nodes without relying on external infrastructure (base stations, satellites, etc.).

The localization problem has received considerable attention in the past, as many applications need to know where objects or persons are, and hence various location services have been created. Undoubtedly,
the Global Positioning System (GPS) is the most well-known location service in use today. The approach taken by GPS, however, is unsuitable for low-cost, ad-hoc sensor networks since GPS is based on extensive infrastructure (i.e., satellites). Likewise, solutions developed in the area of robotics [2–4] and ubiquitous computing [5] are generally not applicable to sensor networks as they require too much processing power and energy. Recently a number of localization systems have been proposed specifically for sensor networks [6–14]. We are interested in truly distributed algorithms that can be employed on large-scale ad-hoc sensor networks (100+ nodes). Such algorithms should be:

1. Self-organizing (i.e., do not depend on global infrastructure).
2. Robust (i.e., be tolerant to node failures and range errors).
3. Energy efficient (i.e., require little computation and, especially, communication).

These requirements immediately rule out some of the proposed localization algorithms for sensor networks. In this chapter, we carry out a thorough sensitivity analysis on three algorithms that do meet the above requirements to determine how well they perform under various conditions. In particular, we study the impact of the following parameters: range errors, connectivity (density), and anchor fraction. These algorithms differ in their position accuracy, network coverage, induced network traffic, and processor load. Given the (slightly) different design objectives for the three algorithms, it is no surprise that each algorithm outperforms the others under a specific set of conditions. Under each condition, however, even the best algorithm leaves much room for improving accuracy and/or increasing coverage. In this chapter we will:

1. Identify a common, 3-phase, structure in the selected distributed localization algorithms.
2. Identify a generic optimization applicable to all algorithms.
3. Provide a detailed comparison on a single (simulation) platform.
4. Show that there is no algorithm that performs best, and that there exists room for improvement in most cases.
Section 36.2 discusses the selection, generic structure, and operation of three distributed localization algorithms for large-scale ad-hoc sensor networks. These algorithms are compared on a simulation platform, which is described in Section 36.3. Section 36.4 presents intermediate results for the individual phases, while Section 36.5 provides a detailed overall comparison and an in-depth sensitivity analysis. Finally, we give conclusions in Section 36.6.
36.2 Localization Algorithms

Before discussing distributed localization in detail, we first outline the context in which these algorithms have to operate. A first consideration is that the requirement for sensor networks to be self-organizing implies that there is no fine control over the placement of the sensor nodes when the network is installed (e.g., when nodes are dropped from an airplane). Consequently, we assume that nodes are randomly distributed across the environment. For simplicity and ease of presentation we limit the environment to two dimensions, but all algorithms are capable of operating in 3D. Figure 36.1 shows an example network with 25 nodes; pairs of nodes that can communicate directly are connected by an edge.

The connectivity of the nodes in the network (i.e., the average number of neighbors) is an important parameter that has a strong impact on the accuracy of most localization algorithms (see Sections 36.4 and 36.5). It is initially determined by the node density and radio range, and in some cases it can be adjusted dynamically by changing the transmit power of the RF radio. In some application scenarios, nodes may be mobile. In this chapter, however, we focus on static networks, where nodes do not move, since this is already a challenging condition for distributed localization.

We assume that some anchor nodes have a priori knowledge of their own position with respect to some
FIGURE 36.1 Example network topology.
global coordinate system. Note that anchor nodes have the same capabilities (processing, communication, energy consumption, etc.) as all other sensor nodes with unknown positions; we do not consider approaches based on an external infrastructure with specialized beacon nodes (access points) as used in, for example, the GPS-less [6], Cricket [15], and RADAR [16] location systems. Ideally the fraction of anchor nodes should be as low as possible to minimize the installation costs. Our simulation results show that, fortunately, most algorithms are rather insensitive to the number of anchors in the network.

The final element that defines the context of distributed localization is the capability to measure the distance between directly connected nodes in the network. From a cost perspective it is attractive to use the RF radio for measuring the range between nodes, for example, by observing the signal strength. Experience has shown, however, that this approach yields poor distance estimates [17–19]. Much better results are obtained by time-of-flight measurements, particularly when acoustic and RF signals are combined [14,19,20]; accuracies of a few percent of the transmission range are reported. However, this requires extra hardware on the sensor boards.

Several different ways of dealing with the problem of inaccurate distance information have been proposed. The APIT [10] algorithm by He et al. only needs distance information accurate enough for two nodes to determine which of them is closest to an anchor. GPS-less [6] by Bulusu et al. and DV-hop [11] by Niculescu and Nath do not use distance information at all, and are based on topology information only. Ramadurai and Sichitiu [12] propose a probabilistic approach to the localization problem: not only the measured distance, but also the confidence in the measurement is used.

It is important to realize that the three main context parameters (connectivity, anchor fraction, and range errors) are interdependent. Poor range measurements can be compensated for by using many anchors and/or a high connectivity. This chapter provides insight into the complex relation between connectivity, anchor fraction, and range errors for a number of distributed localization algorithms.
36.2.1 Generic Approach

From the known localization algorithms specifically proposed for sensor networks, we selected the three approaches that meet the basic requirements for self-organization, robustness, and energy-efficiency:

1. Ad-hoc positioning by Niculescu and Nath [11]
2. N-hop multilateration by Savvides et al. [14]
3. Robust positioning by Savarese et al. [13]

The other approaches often include a central processing element (e.g., "convex optimization" by Doherty et al. [9]), rely on an external infrastructure (e.g., "GPS-less" by Bulusu et al. [6]), or induce too much communication (e.g., "GPS-free" by Capkun et al. [7]). The three selected algorithms are fully distributed and use local broadcast for communication with immediate neighbors. This last feature allows them to be executed before any multi-hop routing is in place; hence, they can support efficient location-based routing schemes like GAF [21].

Although the three algorithms were developed independently, we found that they share a common structure. We were able to identify the following generic, 3-phase approach¹ for determining the individual node positions:

1. Determine the distances between unknowns and anchor nodes.
2. Derive for each node a position from its anchor distances.
3. Refine the node positions using information about the range (distance) to, and positions of, neighboring nodes.

The original descriptions of the algorithms present the first two phases as a single entity, but we found that separating them provides two advantages. First, we obtain a better understanding of the combined behavior by studying intermediate results. Second, it becomes possible to mix-and-match alternatives for both phases to tailor the localization algorithm to the external conditions. The refinement phase is optional and may be included to obtain more accurate locations.

In the remainder of this section we will describe the three phases (distance, position, and refinement) in detail. For each phase we will enumerate the alternatives as found in the original descriptions. Table 36.1 gives the breakdown into phases of the three approaches.

TABLE 36.1 Algorithm Classification

Phase            Ad-hoc positioning [11]   Robust positioning [13]   N-hop multilateration [14]
1. Distance      Euclidean                 DV-hop                    Sum-dist
2. Position      Lateration                Lateration                Min-max
3. Refinement    No                        Yes                       Yes

When applicable we also discuss (minor) adjustments to (parts of) the individual algorithms that were needed to ensure compatibility with the alternatives. During our simulations we observed that we occasionally operated (parts of) the algorithms outside their intended scenarios, which deteriorated their performance. Often, small improvements brought their performance back in line with the alternatives.
¹Our three phases do not correspond to the three of Savvides et al. [14]; our structure allows for an easier comparison of all algorithms.

36.2.2 Phase 1: Distance to Anchors

In this phase, nodes share information to collectively determine the distances between individual nodes and the anchors, so that an (initial) position can be calculated in Phase 2. None of the Phase 1 alternatives engages in complicated calculations, so this phase is communication bounded. Although the three distributed localization algorithms each use a different approach, they share a common communication pattern: information is flooded into the network, starting at the anchor nodes.

A network-wide flood by some anchor A is expensive, since each node must forward A's information to its (potentially) unaware neighbors. This implies a scaling problem: flooding information from all anchors to all nodes will become much too expensive for large networks, even with low anchor fractions. Fortunately, a good position can be derived in Phase 2 with knowledge (position and distance) from a limited number of anchors. Therefore nodes can simply stop forwarding information when enough anchors have been "located." This simple optimization, presented in the Robust positioning approach, proved to be highly effective in controlling the amount of communication (see Section 36.5.3). We modified the other two approaches to include a flood limit as well.
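To make the shared flooding pattern and the flood limit concrete, the following C++ sketch shows one plausible reading of how a node handles an incoming Phase 1 message; all names (AnchorMsg, FLOOD_LIMIT, broadcast) are ours, not the simulator's. The alternatives described below differ mainly in the cost accumulated per hop (a measured range for Sum-dist, a hop count for DV-hop), while Euclidean uses a more involved forwarding rule.

#include <map>

struct AnchorMsg {            // flooded through the network, starting at the anchors
    int    anchorId;
    double ax, ay;            // anchor position
    double cost;              // Sum-dist: summed ranges; DV-hop: hop count
};

struct AnchorInfo { double ax, ay, cost; };

// Provided by the network layer (assumed): local broadcast to all neighbors.
void broadcast(const AnchorMsg& m);

const unsigned FLOOD_LIMIT = 4;      // default flood limit of the standard scenario
std::map<int, AnchorInfo> anchors;   // best cost seen so far, per anchor

// Called for every Phase 1 message received from a direct neighbor;
// hopCost is the measured range of the incoming link (Sum-dist) or 1 (DV-hop).
void onAnchorMsg(AnchorMsg m, double hopCost) {
    m.cost += hopCost;
    auto it = anchors.find(m.anchorId);
    if (it == anchors.end()) {
        // New anchor: store and forward it, unless enough anchors are "located".
        if (anchors.size() >= FLOOD_LIMIT) return;
        anchors[m.anchorId] = {m.ax, m.ay, m.cost};
        broadcast(m);
    } else if (m.cost < it->second.cost) {
        // Shorter path to a known anchor: update and forward the improvement.
        it->second.cost = m.cost;
        broadcast(m);
    }
}

The messages sent whenever a shorter path is found correspond to the update traffic discussed in Section 36.5.3.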
36.2.2.1 Sum-Dist

The simplest solution for determining the distance to the anchors is to add the ranges encountered at each hop during the network flood. This is the approach taken by the N-hop multilateration approach, but it remained nameless in the original description [14]; we name it Sum-dist in this chapter. Sum-dist starts at the anchors, which send a message including their identity, position, and a path length set to 0. Each receiving node adds the measured range to the path length and forwards (broadcasts) the message if the flood limit allows it to do so. Another constraint is that when the node has received information about the particular anchor before, it is only allowed to forward the message if the current path length is less than the previous one. The end result is that each node will have stored the position and minimum path length to at least flood limit anchors.

36.2.2.2 DV-Hop

A drawback of Sum-dist is that range errors accumulate when distance information is propagated over multiple hops. This cumulative error becomes significant for large networks with few anchors (long paths) and/or poor ranging hardware. A robust alternative is to use topological information by counting the number of hops instead of summing the (erroneous) ranges. This approach was named DV-hop by Niculescu and Nath [11], and Hop-TERRAIN by Savarese et al. [13]. Since the results of DV-hop were published first, we will use this name.

DV-hop essentially consists of two flood waves. After the first wave, which is similar to Sum-dist, nodes have obtained the position and minimum hop count to at least flood limit anchors. The second calibration wave is needed to convert hop counts into distances such that Phase 2 can compute a position. This conversion consists of multiplying the hop count with an average hop distance. Whenever an anchor a1 infers the position of another anchor a2 during the first wave, it computes the distance between them, and divides that by the number of hops to derive the average hop distance between a1 and a2. When calibrating, an anchor takes all remote anchors into account that it is aware of. When information on extra anchors is received later, the calibration procedure is repeated. Nodes forward (broadcast) calibration messages only from the first anchor that calibrates them, which reduces the total number of messages in the network.

36.2.2.3 Euclidean

A drawback of DV-hop is that it fails for highly irregular network topologies, where the variance in actual hop distances is very large. Niculescu and Nath have proposed another method, named Euclidean, which is based on the local geometry of the nodes around an anchor. Again anchors initiate a flood, but forwarding the distance is more complicated than in the previous cases. When a node has received messages from two neighbors that know their distance to the anchor, and to each other, it can calculate the distance to the anchor. Figure 36.2 shows a node ("Self") that has two neighbors n1 and n2 with distance estimates (a and b) to an anchor. Together with the known ranges c, d, and e, Euclidean arrives at two possible values (r1 and r2)
FIGURE 36.2 Determining distance using Euclidean.
for the distance of the node to the anchor. Niculescu describes two methods to decide which, if any, distance to use. The neighbor vote method can be applied if there exists a third neighbor (n3) that has a distance estimate to the anchor and that is connected to either n1 or n2. Replacing n2 (or n1) by n3 will again yield a pair of distance estimates. The correct distance is part of both pairs, and is selected by a simple voting. Of course, more neighbors can be included to make the selection more accurate. The second selection method is called common neighbor and can be applied if node n3 is connected to both n1 and n2. Basic geometric reasoning leads to the conclusion that the anchor and n3 are on the same or opposite side of the mirroring line through n1 and n2, and similarly whether or not Self and n3 are on the same side. From this it follows whether or not Self and the anchor lie on the same side.

To handle the uncertainty introduced by range errors, Niculescu implements a safety mechanism that rejects ill-formed (flat) triangles, which can easily derail the selection process by "neighbor vote" and "common neighbor." This check verifies that the sum of the two smallest sides exceeds the largest side multiplied by a threshold, which is set to two times the range variance. For example, the triangle Self-n1-n2 in Figure 36.2 is accepted when

$$c + d > (1 + 2\,\mathrm{RangeVar}) \times e$$

Note that the safety check becomes more strict as the range variance increases. This leads to a lower coverage, defined as the percentage of non-anchor nodes for which a position was determined.

We now describe some modifications to Niculescu's "neighbor vote" method that remedy the poor selection of the location for Self in important corner cases. The first problem occurs when the two votes are identical because, for instance, the three neighbors (n1, n2, and n3) are collinear. In these cases it is hard to select the right alternative. Our solution is to leave equal-vote cases unsolved, instead of picking an alternative and propagating an error with 50% chance. We filter all indecisive cases by adding the requirement that the standard deviation of the votes for the selected distance must be at most one third of the standard deviation of the other distance. The second problem that we address is that of a bad neighbor with inaccurate information spoiling the selection process by voting for two wrong distances. This case is filtered out by requiring that the standard deviation of the selected distance is at most 5% of that distance.

To achieve good coverage, we use both the neighbor vote and common neighbor methods. If both produce a result, we use the result from the modified "neighbor vote" because we found it to be the more accurate of the two. If both fail, the flooding process stops, leading to the situation where certain nodes are not able to establish the distance to enough anchor nodes. Sum-dist and DV-hop, on the other hand, never fail to propagate the distance and hop count, respectively.
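The two candidate values r1 and r2 of Figure 36.2 follow from elementary circle intersection. The sketch below computes them in a local coordinate frame; the descriptive variable names replace the single letters a–e of the figure, since the exact letter-to-edge mapping is not reproduced here.

#include <cmath>

// Inputs known to Self: the sides of the triangle Self-n1-n2, and the
// anchor distances reported by the two neighbors n1 and n2.
struct EuclideanInput {
    double dSelfN1, dSelfN2, dN1N2;    // ranges measured among Self, n1, n2
    double dN1Anchor, dN2Anchor;       // anchor distances propagated by n1, n2
};

// Computes the two candidate Self-to-anchor distances (r1, r2).
// Returns false when the geometry is inconsistent, which corresponds
// to the ill-formed triangles rejected by the safety check above.
bool candidateDistances(const EuclideanInput& in, double& r1, double& r2) {
    const double w = in.dN1N2;
    if (w <= 0.0) return false;
    // Local frame: n1 = (0, 0), n2 = (w, 0).
    // Place Self from its two ranges; we pick the solution with y >= 0.
    double xs  = (in.dSelfN1 * in.dSelfN1 - in.dSelfN2 * in.dSelfN2 + w * w) / (2 * w);
    double ys2 = in.dSelfN1 * in.dSelfN1 - xs * xs;
    // The anchor may lie on either side of the line n1-n2, giving two
    // mirrored solutions and hence two candidate distances.
    double xa  = (in.dN1Anchor * in.dN1Anchor - in.dN2Anchor * in.dN2Anchor + w * w) / (2 * w);
    double ya2 = in.dN1Anchor * in.dN1Anchor - xa * xa;
    if (ys2 < 0.0 || ya2 < 0.0) return false;   // triangle inequality violated
    double ys = std::sqrt(ys2), ya = std::sqrt(ya2);
    r1 = std::hypot(xs - xa, ys - ya);          // anchor on the same side as Self
    r2 = std::hypot(xs - xa, ys + ya);          // anchor on the opposite side
    return true;
}

The neighbor vote and common neighbor methods then decide which of the two candidates, if any, to keep.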
36.2.3 Phase 2: Node Position

In the second phase nodes determine their position based on the distance estimates to a number of anchors provided by one of the three Phase 1 alternatives (Sum-dist, DV-hop, or Euclidean). The Ad-hoc positioning and Robust positioning approaches use lateration for this purpose. N-hop multilateration, on the other hand, uses a much simpler method, which we named Min-max. In both cases the determination of the node positions does not involve additional communication.

36.2.3.1 Lateration

The most common method for deriving a position is lateration, which is a form of triangulation. From the estimated distances (d_i) and known positions (x_i, y_i) of the anchors we derive the following system of equations:
$$(x_1 - x)^2 + (y_1 - y)^2 = d_1^2$$
$$\vdots$$
$$(x_n - x)^2 + (y_n - y)^2 = d_n^2$$

where the unknown position is denoted by (x, y). The system can be linearized by subtracting the last equation from the first n − 1 equations:

$$x_1^2 - x_n^2 - 2(x_1 - x_n)x + y_1^2 - y_n^2 - 2(y_1 - y_n)y = d_1^2 - d_n^2$$
$$\vdots$$
$$x_{n-1}^2 - x_n^2 - 2(x_{n-1} - x_n)x + y_{n-1}^2 - y_n^2 - 2(y_{n-1} - y_n)y = d_{n-1}^2 - d_n^2$$

Reordering the terms gives a proper system of linear equations in the form Ax = b, where

$$A = \begin{bmatrix} 2(x_1 - x_n) & 2(y_1 - y_n) \\ \vdots & \vdots \\ 2(x_{n-1} - x_n) & 2(y_{n-1} - y_n) \end{bmatrix}, \qquad b = \begin{bmatrix} x_1^2 - x_n^2 + y_1^2 - y_n^2 + d_n^2 - d_1^2 \\ \vdots \\ x_{n-1}^2 - x_n^2 + y_{n-1}^2 - y_n^2 + d_n^2 - d_{n-1}^2 \end{bmatrix}$$

The system is solved using a standard least-squares approach: $\hat{x} = (A^T A)^{-1} A^T b$. In exceptional cases the matrix inverse cannot be computed and Lateration fails. In the majority of the cases, however, we succeed in computing a location estimate $\hat{x}$. We run an additional sanity check by computing the residue between the given distances (d_i) and the distances to the location estimate $\hat{x}$:

$$\mathrm{residue} = \frac{1}{n}\sum_{i=1}^{n}\left(\sqrt{(x_i - \hat{x})^2 + (y_i - \hat{y})^2} - d_i\right)$$
A large residue signals an inconsistent set of equations; we reject the location $\hat{x}$ when the length of the residue exceeds the radio range.

36.2.3.2 Min-Max

Lateration is quite expensive in the number of floating point operations that is required. A much simpler method is presented by Savvides et al. as part of the N-hop multilateration approach. The main idea is to construct a bounding box for each anchor using its position and distance estimate, and then to determine the intersection of these boxes. The position of the node is set to the center of the intersection box. Figure 36.3 illustrates the Min-max method for a node with distance estimates to three anchors. Note that the estimated position by Min-max is close to the true position computed through Lateration (i.e., the intersection of the three circles).

The bounding box of anchor a is created by adding and subtracting the estimated distance (d_a) from the anchor position (x_a, y_a):

$$[x_a - d_a, \; y_a - d_a] \times [x_a + d_a, \; y_a + d_a]$$

The intersection of the bounding boxes is computed by taking the maximum of all coordinate minimums and the minimum of all maximums:

$$[\max(x_i - d_i), \; \max(y_i - d_i)] \times [\min(x_i + d_i), \; \min(y_i + d_i)]$$

The final position is set to the average of both corner coordinates. As for Lateration, we only accept the final position if the residue is small.
FIGURE 36.3 Determining position using Min-max.
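Both Phase 2 methods are purely local computations. The following sketch implements them for the two-dimensional case, including the residue-based sanity check; since A has only two columns, the normal equations reduce to a 2 × 2 system that can be inverted directly. The structure and names are ours, not the simulator's.

#include <algorithm>
#include <cmath>
#include <vector>

struct Anchor   { double x, y, d; };        // anchor position and estimated distance
struct Position { double x, y; bool valid; };

// Average signed residue between the given distances and the estimate (x, y).
static double residue(const std::vector<Anchor>& a, double x, double y) {
    double sum = 0.0;
    for (const Anchor& ai : a)
        sum += std::hypot(ai.x - x, ai.y - y) - ai.d;
    return sum / a.size();
}

// Lateration: least-squares solution x_hat = (A^T A)^{-1} A^T b of the
// linearized system, using the last anchor to eliminate the quadratic terms.
Position laterate(const std::vector<Anchor>& a, double radioRange) {
    const size_t n = a.size();
    if (n < 3) return {0, 0, false};
    const Anchor& an = a[n - 1];
    double m11 = 0, m12 = 0, m22 = 0, v1 = 0, v2 = 0;   // M = A^T A, v = A^T b
    for (size_t i = 0; i + 1 < n; i++) {
        double a1 = 2 * (a[i].x - an.x), a2 = 2 * (a[i].y - an.y);
        double bi = a[i].x * a[i].x - an.x * an.x
                  + a[i].y * a[i].y - an.y * an.y
                  + an.d * an.d - a[i].d * a[i].d;
        m11 += a1 * a1;  m12 += a1 * a2;  m22 += a2 * a2;
        v1  += a1 * bi;  v2  += a2 * bi;
    }
    double det = m11 * m22 - m12 * m12;
    if (std::fabs(det) < 1e-12) return {0, 0, false};   // singular: Lateration fails
    double x = ( m22 * v1 - m12 * v2) / det;
    double y = (-m12 * v1 + m11 * v2) / det;
    bool ok = std::fabs(residue(a, x, y)) < radioRange; // reject inconsistent sets
    return {x, y, ok};
}

// Min-max: intersect the bounding boxes [xi-di, yi-di] x [xi+di, yi+di]
// and take the center of the intersection box.
Position minmax(const std::vector<Anchor>& a, double radioRange) {
    double xlo = -1e30, ylo = -1e30, xhi = 1e30, yhi = 1e30;
    for (const Anchor& ai : a) {
        xlo = std::max(xlo, ai.x - ai.d);  ylo = std::max(ylo, ai.y - ai.d);
        xhi = std::min(xhi, ai.x + ai.d);  yhi = std::min(yhi, ai.y + ai.d);
    }
    double x = (xlo + xhi) / 2, y = (ylo + yhi) / 2;
    bool ok = std::fabs(residue(a, x, y)) < radioRange;
    return {x, y, ok};
}

Note that the main loop of minmax() contains no square roots at all, which is part of the computational advantage of Min-max over Lateration mentioned above.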
36.2.4 Phase 3: Refinement The objective of the third phase is to refine the (initial) node positions computed during Phase 2. These positions are not very accurate, even under good conditions (high connectivity, small range errors), because not all available information is used in the first two phases. In particular, most ranges between neighboring nodes are neglected when the node-anchor distances are determined. The iterative Refinement procedure proposed by Savarese et al. [13] does take into account all inter-node ranges, when nodes update their positions in a small number of steps. At the beginning of each step a node broadcasts its position estimate, receives the positions and corresponding range estimates from its neighbors, and performs the Lateration procedure of Phase 2 to determine its new position. In many cases the constraints imposed by the distances to the neighboring locations will force the new position towards the true position of the node. When, after a number of iterations, the position update becomes small, Refinement stops and reports the final position. The basic iterative refinement procedure outlined above proved to be too simple to be used in practice. The main problem is that errors propagate quickly through the network; a single error introduced by some node needs only d iterations to affect all nodes, where d is the network diameter. This effect was countered by (1) clipping undetermined nodes with non-overlapping paths to less than three anchors, (2) filtering out difficult symmetric topologies, and (3) associating a confidence metric with each node and using them in a weighted least-squares solution (wAx = wb). The details (see Reference 17) are beyond the scope of this chapter, but the adjustments considerably improved the performance of the Refinement procedure. This is largely due to the confidence metric, which allows filtering of bad nodes, thus increasing the (average) accuracy at the expense of coverage. The N-hop multilateration approach by Savvides et al. [14] also includes an iterative refinement procedure, but it is less sophisticated than the Refinement discussed above. In particular, they do not use weights, but simply group nodes into so-called computation subtrees (over-constrained configurations) and enforce nodes within a subtree to execute their position refinement in turn in a fixed sequence to enhance convergence to a pre-specified tolerance. In the remainder of this chapter we will only consider the more advanced Refinement procedure of Savarese et al.
36.3 Simulation Environment

To compare the three original distributed localization algorithms (Ad-hoc positioning, Robust positioning, and N-hop multilateration) and to try out new combinations of Phase 1, 2, and 3 alternatives, we extended the simulator developed by Savarese et al. [13]. The underlying OMNeT++ discrete event simulator [22] takes care of the semi-concurrent execution of the specific localization algorithm. Each sensor node
"runs" the same C++ code, which is parameterized to select a particular combination of Phase 1, 2, and 3 alternatives. Our network layer supports localized broadcast only, and messages are simply delivered at the neighbors within a fixed radio range (circle) from the sending node; a more accurate model should take radio propagation effects into account (see future work). Concurrent transmissions are allowed if the transmission areas (circles) do not overlap. If a node wants to broadcast a message while another message in its area is in progress, it must wait until that transmission (and possibly other queued messages) are completed. In effect we employ a CSMA policy. Furthermore, we do not consider message corruption, so all messages sent during our simulation are delivered (after some delay).

At the start of a simulation experiment we generate a random network topology according to some parameters (#nodes, #anchors, etc.). The nodes are randomly placed, with a uniform distribution, within a square area. Next we select which nodes will serve as an anchor. To this end we superimpose a grid on top of the square, and designate to each grid point its closest node as an anchor. The size of the grid is chosen as the maximal number s that satisfies s × s ≤ #anchors; any remaining anchors are selected randomly. The reason for carefully selecting the anchor positions is that most localization algorithms are quite sensitive to the presence, or absence, of anchors at the edges of the network. (Locating unknowns at the edges of the network is more difficult because nodes at the edge are less well connected, and positioning techniques like lateration perform best when anchors surround the unknown.) Although anchor placement may not be feasible in practice, the majority of the nodes in large-scale networks (1000+ nodes) will generally be surrounded by anchors. By placing anchors we can study the localization performance in large networks with simulations involving only a modest number of nodes.

The measured range between connected nodes is blurred by drawing a random value from a normal distribution having a parameterized standard deviation and having the true range as the mean. We selected this error model based on the work of Whitehouse and Culler [23], which shows that, although individual distance measurements tend to overshoot the real distance, a proper calibration procedure yields distance estimates with a symmetric error distribution. The connectivity (average number of neighbors) is controlled by specifying the radio range.

At the end of a run the simulator outputs a large number of statistics per node: position information, elapsed time, message counts (broken down per type), etc. These individual node statistics are combined and presented as averages (or distributions), for example, as an average position error. Nodes that do not produce a position are excluded from such averaged metrics. To account for the randomness in generating topologies and range errors, we repeated each experiment 100 times with different seeds and report the averaged results. To allow for easy comparison between different scenarios, range errors as well as errors on position estimates are normalized to the radio range (i.e., a 50% position error means that the distance between the real and estimated positions is half the radio range).
36.3.1 Standard Scenario

The experiments described in the subsequent sections share a standard scenario, in which certain parameters are varied: radio range (connectivity), anchor fraction, and range errors. The standard scenario consists of a network of 225 nodes placed in a square with sides of 100 units. The radio range is set to 14, resulting in an average connectivity of about 12. We use an anchor fraction of 5%, hence, 11 anchors in total, of which 9 (3 × 3) are placed in a grid-like position. The standard deviation of the range error is set to 10% of the radio range. The default flood limit for Phase 1 is set to 4 (Lateration requires a minimum of 3 anchors). Unless specified otherwise, all data will be based on this standard scenario.
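The sketch below instantiates the setup just described with the standard scenario's parameters (225 nodes in a 100 × 100 square, 11 anchors, radio range 14, range error standard deviation of 10% of the range). The placement of the grid points within the square is our assumption; the simulator's exact choice is not specified in the text.

#include <cmath>
#include <random>
#include <vector>

struct Node { double x, y; bool anchor = false; };

std::vector<Node> makeTopology(std::mt19937& rng, int nNodes = 225,
                               int nAnchors = 11, double side = 100.0) {
    std::uniform_real_distribution<double> uni(0.0, side);
    std::vector<Node> nodes(nNodes);
    for (Node& n : nodes) { n.x = uni(rng); n.y = uni(rng); }

    // Grid-based anchor selection: the largest s with s*s <= #anchors
    // (s = 3 for 11 anchors); each grid point claims its closest node.
    int s = (int)std::floor(std::sqrt((double)nAnchors));
    for (int i = 0; i < s; i++) {
        for (int j = 0; j < s; j++) {
            double gx = side * (i + 0.5) / s;     // assumed: cell centers
            double gy = side * (j + 0.5) / s;
            int best = 0; double bestD = 1e30;
            for (int k = 0; k < nNodes; k++) {
                double d = std::hypot(nodes[k].x - gx, nodes[k].y - gy);
                if (!nodes[k].anchor && d < bestD) { bestD = d; best = k; }
            }
            nodes[best].anchor = true;
        }
    }
    // The remaining nAnchors - s*s anchors would be selected at random.
    return nodes;
}

// Measured range: the true distance blurred with a normal distribution whose
// standard deviation is a fraction of the radio range (10% in the standard
// scenario), following the error model of Whitehouse and Culler [23].
double measureRange(std::mt19937& rng, double trueDist,
                    double radioRange = 14.0, double rangeVariance = 0.1) {
    std::normal_distribution<double> blur(trueDist, rangeVariance * radioRange);
    return blur(rng);
}

Two nodes are then connected whenever their true distance falls within the radio range, which is what controls the average connectivity of about 12 quoted above.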
36.4 Results

In this section we present results for the first two phases (anchor distances and node positions). We study each phase separately and show how alternatives respond to different parameters. These intermediate
results will be used in Section 36.5, where we will discuss the overall performance and compare complete localization algorithms. Throughout this section we will vary one parameter in the standard scenario (radio range, anchor fraction, range error) at a time to study the sensitivity of the algorithms. The reader, however, should be aware that the three parameters are not orthogonal.
36.4.1 Phase 1: Distance to Anchors

Figure 36.4 shows the performance of the Phase 1 alternatives for computing the distances between nodes and anchors under various conditions. There are two metrics of interest: first, the bias in the estimate, measured here using the mean of the distance errors, and second, the precision of the estimated distances, measured here using the standard deviation of the distance errors. Therefore, Figure 36.4 plots both the average error, relative to the true distance, and the standard deviation of that relative error. We will now discuss the sensitivity of each alternative: Sum-dist, DV-hop, and Euclidean.

36.4.1.1 Sum-Dist

Sum-dist is the cheapest of the three methods, both with respect to computation and communication costs. Nevertheless it performs quite satisfactorily, except for large range errors (≥0.1). There are two opposite tendencies affecting the bias of Sum-dist. First, without range errors, the sum of the ranges along a multi-hop path will always be larger than the actual distance, leading to an overestimation of the distance. Second, the algorithm searches for the shortest path, forcing it to select links that underestimate the actual distance when range errors are present. The combined effect shows non-intuitive results. A small range error reduces the bias of Sum-dist: initially, the detour effect leads to an overshoot, but the shortest-path effect takes over when the range errors increase, leading to a large undershoot. When the radio range (connectivity) is increased, more nodes can be reached in a single hop. This leads to straighter paths (less overshoot), and provides more options for selecting an (incorrect) shortest path (higher undershoot). Consequently, increasing the connectivity is not necessarily a good thing for Sum-dist.

36.4.1.2 DV-Hop

The DV-hop method is a stable and predictable method. Since it does not use range measurements, it is completely insensitive to this source of errors. The low relative error (5%) shows that the calibration wave is very effective. DV-hop searches for the path with the minimum number of hops, causing the average hop distance to be close to the radio range. The last hop on the path from an anchor to a node, however, is usually shorter than the radio range, which leads to a slight overestimation of the node-anchor distance. This effect is more pronounced for short paths, hence the increased error for larger radio ranges and higher anchor fractions (i.e., fewer hops).

36.4.1.3 Euclidean

Euclidean is capable of determining the exact anchor-node distances, but only in the absence of range errors and in highly connected networks. When these conditions are relaxed, Euclidean's performance rapidly degrades. The curves in Figure 36.4 show that Euclidean tends to underestimate the distances. The reason is that the selection process is forced to choose between two options that are quite far apart, and that in many cases the shortest distance is incorrect. Consider Figure 36.2 again, where the shortest distance r2 falls within the radio range of the anchor. If r2 were the correct distance, then the node would be in direct contact with the anchor, avoiding the need for a selection. Therefore nodes simply have more chance to underestimate distances than to overestimate them in the face of (small) range errors. This error can then propagate to nodes that are multiple hops away from the anchor, causing them to underestimate the distance to the anchor as well.
We quantified the impact of the selection bias towards short distances. Figure 36.5 shows the distribution of the errors, relative to the true distance, on the standard scenario for Euclidean’s selection mechanism (solid line) and an oracle that always selects the best distance (dashed line). The oracle’s distribution is
FIGURE 36.4 Sensitivity of Phase 1 methods: distance error (solid lines) and standard deviation (dashed lines) of the relative distance error, for varying range variance, radio range (average connectivity), and anchor fraction.
FIGURE 36.5 The impact of incorrect distance selection on Euclidean: probability density of the relative range error for Euclidean's selection mechanism and for an oracle that always selects the best distance.
nicely centered around zero (no error) with a sharp peak. Euclidean's distribution, in contrast, is skewed by a heavy tail at the left, signalling a bias towards underestimation.

Euclidean's sensitivity to connectivity is not immediately apparent from the accuracy data in Figure 36.4. The main effect of reducing the radio range is that Euclidean will not be able to propagate the anchor distances. Recall that Euclidean's selection methods require at least three neighbors with a distance estimate to advance the anchor distance one hop. In networks with low connectivity, two parts connected only by a few links will often not be able to share anchors. This leads to problems in Phase 2, where fewer node positions can be computed. The effects are quite pronounced, as will become clear in Section 36.5 (see the coverage curves in Figure 36.10).
36.4.2 Phase 2: Node Position

To obtain insight into the fundamental behavior of the Lateration and Min-max algorithms, we now report on some experiments with controlled distance errors and anchor placement. The impact of actual distance errors as produced by the Phase 1 methods will be discussed in Section 36.5.

36.4.2.1 Distance Errors

Starting from the standard scenario we select for each node the five nearest anchors, and add some noise to the real distances. This noise is generated by first taking a sample from a normal distribution with the actual distance as the mean and a parameterized percentage of the distance as the standard deviation. The result is then multiplied by a bias factor. The ranges for the standard deviation and bias factor follow from the Phase 1 measurements.

Figure 36.6 shows the sensitivity of Lateration and Min-max when the standard deviation percentage was varied from 0 to 0.25, with the bias factor fixed at zero. Lateration outperforms Min-max for precise distance estimates, but Min-max takes over for large standard deviations (≥0.15).

Figure 36.7 shows the effect of adding a bias to the distance estimates. The curves show that Lateration is very sensitive to a bias factor, especially for precise estimates (std. dev. = 0). Min-max is rather insensitive to bias, because stretching the bounding boxes has little effect on the position of the center. For precise distance estimates and a small bias factor Lateration outperforms Min-max, but the bottom graph in Figure 36.7 shows that Min-max is probably the preferred technique when the standard deviation rises above 10%.

Although Min-max is not very sensitive to bias, we do see that Min-max performs better for a positive range bias (i.e., an overshoot). This is a consequence of the error introduced by Min-max using a bounding box instead of a circle around anchors. For simplicity we limit the explanation to the effects on
FIGURE 36.6 Sensitivity of Phase 2 to precision: position error (%r) of Lateration and Min-max versus the standard deviation of the range error.
FIGURE 36.7 Sensitivity of Phase 2 to bias: position error (%r) of Lateration and Min-max versus the bias factor, for range error standard deviations of 0 and 0.1.
FIGURE 36.8 Min-max scenario.
the x-coordinate only. Figure 36.8 shows that Anchor1, making a small angle with the x-axis, yields a tight bound (to the right of the true location), and that the large angle of Anchor2 yields a loose bound (to the left of the true location). The estimated position is off in the direction of the loose bound (to the left). Adding a positive bias to the range estimates causes the two bounds to shift proportionally. As a
FIGURE 36.9 Node locations computed for the network topology of Figure 36.1, using Min-max and Lateration (dashed lines indicate the position errors).
consequence, the center of the intersection moves in the direction of the bound with the longest range (to the right), and the estimated coordinate moves closer to the true coordinate. The opposite will happen if the anchor with the largest angle has the longest distance. Min-max selects the strongest bounds, leading to a preference for small angles and small distances, which favors the number of "good" cases where the coordinate moves closer to the true coordinate if a positive range bias is added.

36.4.2.2 Anchor Placement

Min-max has the advantage of being computationally cheap and insensitive to errors, but it requires a good constellation of anchors; in particular, Savvides et al. recommend to place the anchors at the edges of the network [14]. If the anchors cannot be placed deliberately and are uniformly distributed across the network, the accuracy of the node positions at the edges is rather poor. Figure 36.9 illustrates this problem graphically. We applied Min-max and Lateration to the example network presented in Figure 36.1. In the case of Min-max, all nodes that lie outside the convex envelope of the four anchor nodes are drawn inwards, yielding considerable errors (indicated by the dashed lines); the nodes within the envelope are located adequately. Lateration, on the other hand, performs much better. Nodes at the edges are located less accurately than interior nodes, but the magnitude of and variance in the errors is smaller than for Min-max.

The differences in sensitivity to anchor placement between Lateration and Min-max can be considerable. For instance, for DV-hop/Min-max in the standard scenario, the average position accuracy degrades from 43 to 77% when anchors are randomly distributed instead of placed grid-based. The accuracy of DV-hop/Lateration also degrades, but only from 42 to 54%.
36.5 Discussion

Now that we know the behavior of the individual Phase 1 and 2 components, we can turn to the performance effects of concatenating both phases, followed by applying Refinement in Phase 3. We will study the sensitivity of various combinations to connectivity, anchor fraction, and range errors, using both the resulting position error and coverage.
36.5.1 Phases 1 and 2 Combined

Combining the three Phase 1 alternatives (Sum-dist, DV-hop, and Euclidean) with the two Phase 2 alternatives (Lateration and Min-max) yields a total of six possibilities. We will analyze the differences in terms of coverage (Figure 36.10) and position accuracy (Figure 36.11). When fine-tuning localization
algorithms, the trade-off between accuracy and coverage plays an important role; dropping difficult cases increases average accuracy at the expense of coverage.

36.5.1.1 Coverage

Figure 36.10 shows the coverage of the six Phase 1/Phase 2 combinations for varying range error (top), radio range (middle), and anchor fraction (bottom). The solid lines denote the Lateration variants; the dashed lines denote the Min-max variants.

The first observation is that Sum-dist and DV-hop are able to determine the range to enough anchors to position all the nodes, except in cases when the radio range is small (≤11), or equivalently when the connectivity is low (≤7.5). In such sparse networks, Lateration provides a slightly higher coverage than Min-max. This is caused by the sanity check on the residue. A consistent set of anchor positions and distance estimates leads to a low residue, but the reverse does not hold: occasionally, if Lateration is used with an inconsistent set, an outlier is produced with a small residue, which is accepted. Min-max does not suffer from this problem because the positions are always constrained by the bounding boxes and thus cannot produce such outliers. Lateration's higher coverage results in higher errors; see the accuracy curves in Figure 36.11.

The second observation is that Euclidean has great difficulty in achieving a reasonable coverage when conditions are non-ideal. The combination with Min-max gives the highest coverage, but even that combination only achieves acceptable results under ideal conditions (range variance ≤0.1, connectivity ≥15, anchor fraction ≥0.1). The reason for Euclidean's poor coverage is twofold. First, the triangles used to propagate anchor distances are checked for validity (see Section 36.2.2); this constraint becomes more strict as the range variance increases, hence the significant drop in coverage. Second, Euclidean can only forward anchor distances if enough neighbors are present (see Section 36.4.1), resulting in many nodes "locating" only one or two anchors. Lateration requires at least three anchors, but Min-max does not have this requirement. This explains why the Euclidean/Min-max combination yields a higher coverage. Again, the price is paid in terms of accuracy (cf. Figure 36.11).

36.5.1.2 Accuracy

Figure 36.11 gives the average position error of the six combinations under the same varying conditions as for the coverage plots. To ease the interpretation of the accuracies we filtered out anomalous cases whose coverage is below 50%, which mainly concerns Euclidean's results.

The most striking observation is that the Euclidean/Lateration combination clearly outperforms the others in the absence of range errors: 0% error versus at least 29% (Sum-dist/Min-max). This follows from the good performance of both Euclidean and Lateration in this case (see Section 36.4). The downside is that both components were also shown to be very sensitive to range errors. Consequently, the average position error increases rapidly if noise is added to the range estimates; at just 2% range variance, Euclidean/Lateration loses its advantage over the Sum-dist/Min-max combination. When the range variance exceeds 10%, DV-hop performs best. In this scenario DV-hop achieves comparable accuracies for both Lateration and Min-max. Which Phase 2 algorithm is most appropriate depends on anchor placement, and on whether the higher computation cost of Lateration is important.
Notice that Sum-dist/Lateration actually becomes more accurate when a small amount of range variance is introduced, while the errors of Sum-dist/Min-max increase. This matches the results found in Sections 36.4.1 and 36.4.2: adding a small range error causes Sum-dist to yield more accurate distance estimates (cf. Figure 36.4). Lateration benefits greatly from a reduced bias, but Min-max is not that sensitive and even deteriorates slightly (cf. Figure 36.7). The combined effect is that Sum-dist/Lateration benefits from small range errors; Sum-dist/Min-max does not show this unexpected behavior.

All six combinations are quite sensitive to the radio range (connectivity). A minimum connectivity of 9.0 is required (at radio range 12) for DV-hop and Sum-dist, in which case Sum-dist slightly outperforms DV-hop and the difference between Lateration and Min-max is negligible. Euclidean does not perform well because of the 10% range variance in the standard scenario.
FIGURE 36.10 Coverage of phase 1/2 combinations, for varying range variance, radio range (average connectivity), and anchor fraction.
The sensitivity to the anchor fraction is quite similar for all combinations. More anchors ease the localization task, especially for Euclidean, but there is no hard threshold like for the sensitivity to connectivity.
36.5.2 Phase 3: Refinement

For brevity we do not report the effects of refining the initial positions produced by all six phase 1/2 combinations, but limit the results to the three combinations proposed in the original papers: Sum-dist/Min-max, Euclidean/Lateration, and DV-hop/Lateration (cf. Table 36.1). Figure 36.12 shows the coverage
FIGURE 36.11 Accuracy of phase 1/2 combinations: average position error (%r) for varying range variance, radio range (average connectivity), and anchor fraction.
with (solid lines) and without (dashed lines) Refinement for the three selected combinations. Figure 36.13 shows the average position error, but only if the coverage exceeds 50%.

The most important observation is that Refinement dramatically reduces the coverage for all combinations. For example, in the standard case (10% range variance, radio range 14, and 5% anchors) the coverage for Sum-dist/Min-max and DV-hop/Lateration drops from 100% to a mere 51%. For the nodes that are not rejected, Refinement results in a better accuracy: the average error decreases from 42 to 23% for DV-hop, and from 38 to 24% for Sum-dist. Other tests have revealed that Refinement does not improve accuracy by merely filtering out bad nodes; the initial positions of good nodes are improved as well. A second observation is that Refinement equalizes the performance of Sum-dist and DV-hop.
FIGURE 36.12 Coverage after refinement, for varying range variance and radio range (average connectivity).
FIGURE 36.13 Accuracy after refinement: position error (%r) for varying range variance and radio range (average connectivity).
As a consequence the simpler Sum-dist is to be preferred in combination with Refinement to save on computation and communication.
36.5.3 Communication Cost

The network simulator maintains statistics about the messages sent by each node. Table 36.2 presents a breakdown per message type of the three original localization combinations (with Refinement) on the standard scenario. The number of messages in Phase 1 (Flood + Calibration) is directly controlled by the flood limit parameter, which is set to 4 by default. Figure 36.14 shows the message counts in Phase 1 for various flood limits. Note that Sum-dist and DV-hop scale almost linearly; they level off slightly because information on multiple anchors can be combined in a single message. Euclidean, on the other hand, levels off completely because of the difficulties in propagating anchor distances, especially along long paths.

For Sum-dist and DV-hop we expect nodes to transmit a message per anchor. Note, however, that for low flood limits the message count is higher than expected. In the case of DV-hop, the count also includes the calibration messages. With some fine-tuning the number of calibration messages can be limited to one, but the current implementation needs about as many messages as the flooding itself. A second factor that increases the number of messages for DV-hop and Sum-dist is the update information to be sent
TABLE 36.2 Average Number of Messages Per Node

Type          Sum-dist   DV-hop   Euclidean
Flood         4.3        2.2      3.5
Calibration   —          2.6      —
Refinement    32         29       20

FIGURE 36.14 Sensitivity to flood limit: messages per node in Phase 1 and position error (%r), as a function of the flood limit, for the three combinations.
when a shorter path is detected, which happens quite frequently for Sum-dist. Finally, all three algorithms are self-organizing, and nodes send an extra message when discovering a new neighbor that needs to be informed of the current status.

Although the flood limit is essential for crafting scalable algorithms, it affects the accuracy; see the bottom graph in Figure 36.14. Note that using a higher flood limit does not always improve accuracy. In the case of Sum-dist, there is a trade-off between using few anchors with accurate distance information, and using many anchors with less accurate information. With DV-hop, on the other hand, the distance estimates become more accurate for longer paths (last-hop effect, see Section 36.4.1). Euclidean's error only increases with higher flood limits because it starts with a low coverage, which also increases with higher flood limits. DV-hop and Sum-dist reach almost 100% coverage at flood limits of 2 (Min-max) or 3 (Lateration).

With the flood limit set to 4, the nodes send about 4 messages during Phase 1 (cf. Table 36.2). This is comparable to the three messages needed by a centralized algorithm: set up a spanning tree, collect range information, and distribute node positions. Running Refinement in Phase 3, on the other hand, is extremely expensive, requiring 20 (Euclidean) to 32 messages (Sum-dist) per node. The problem is that Refinement takes many iterations before local convergence criteria decide to terminate. We added a limit to the number of Refinement messages a node is allowed to send; the effect of this is shown in Figure 36.15. A Refinement limit of 0 means that no refinement messages are sent, and Refinement is skipped completely.

The position errors in Figure 36.15 show that most of the effect of Refinement takes place in the first few iterations, so hard limiting the iteration count is a valid option. For example, the accuracy obtained by DV-hop without Refinement is 42%, and it drops to 28% after two iterations; an additional 4% drop can be achieved by waiting until Refinement terminates based on the local stopping criteria, but this requires another 27 messages (29 in total). Thus the communication cost of Refinement can effectively be reduced to less than the cost of Phase 1. Nevertheless, the poor coverage of Refinement limits its practical use.
36.5.4 Recommendations

From the previous discussion it follows that no single combination of Phase 1, 2, and 3 alternatives performs best under all conditions; each combination has its strengths and weaknesses. The results
FIGURE 36.15 Effect of refinement limit: coverage and position error (%r) as a function of the refinement limit, for the three combinations with Refinement.

TABLE 36.3 Comparison; Anchor Fraction Fixed at 5%, No Refinement

                                       Radio range (avg. connectivity)
Range variance   16 (15.5)              14 (12.1)              12 (9.0)            10 (6.4)            8 (4.2)
0                Euclidean/Lateration   Euclidean/Lateration   Sum-dist/Min-max    Sum-dist/Min-max    DV-hop/Lateration
0.025            Sum-dist/Lateration    Sum-dist/Min-max       Sum-dist/Min-max    Sum-dist/Min-max    DV-hop/Lateration
0.05             Sum-dist/Lateration    Sum-dist/Lateration    Sum-dist/Min-max    Sum-dist/Min-max    DV-hop/Lateration
0.1              Sum-dist/Min-max       Sum-dist/Lateration    Sum-dist/Min-max    Sum-dist/Min-max    DV-hop/Lateration
0.25             DV-hop/Lateration      DV-hop/Min-max         DV-hop/Min-max      DV-hop/Min-max      DV-hop/Lateration
0.5              DV-hop/Lateration      DV-hop/Min-max         DV-hop/Min-max      DV-hop/Min-max      DV-hop/Lateration
presented in Section 36.5 follow from changing one parameter (radio range, range variance, and anchor fraction) at a time. Since the sensitivity of the localization algorithms may not be orthogonal in the three parameters, it is difficult to derive general recommendations. Therefore, we conducted an exhaustive search for the best algorithm in the three-dimensional parameter space. For readability we do not present the raw outcome, a 6 × 6 × 5 cube, but show a two-dimensional slice instead. We found that the localization algorithms are the least sensitive to the anchor fraction, so Table 36.3 presents the results of varying the radio range and the range variance, while keeping the anchor fraction fixed at 5%. In each case we list the algorithm that achieves the best accuracy (i.e., the lowest average position error) under the condition that its coverage exceeds 50%. Since Refinement often results in very poor coverage, we only examine Phases 1 and 2 here.

The exhaustive parameter search, and basic observations about Refinement, lead to the following recommendations:

1. Euclidean should always be used in combination with Lateration, but only if distances can be measured very accurately (range variance <2%) and the network has a high connectivity (≥12). When the anchor fraction is increased, Euclidean captures some more entries in the upper-left corner of Table 36.3, and the conditions on range variance and connectivity can be relaxed slightly. Nevertheless, the window of opportunity for Euclidean/Lateration is rather small.

2. DV-hop should be used when there are no or only poor distance estimates, for example, those obtained from the signal strength (cf. the bottom rows in Table 36.3). Our results show that DV-hop outperforms the other methods when the range variance is large (>10% in this slice). The presence of Lateration in the last column, that is, with a very low connectivity, is an artifact caused by the filtering on coverage: DV-hop/Min-max has a coverage of 49% in this case (versus 56% for DV-hop/Lateration), but also a much lower error. Regarding the issue of combining DV-hop with Lateration or Min-max, we observe that, overall, Min-max is the preferred choice. Recall, however, its sensitivity to anchor placement, leading to large errors at the edges of the network.
3. Sum-dist performs best in the majority of cases, especially if the anchor fraction is (slightly) increased above 5%. Increasing the number of anchors reduces the average path length between nodes and anchors, limiting the accumulation of range errors along multiple hops. Except for a few corner cases, Sum-dist performs best in combination with Min-max. In scenarios with very low connectivity and a low anchor fraction, Sum-dist tends to overestimate the distance significantly. Therefore DV-hop performs better in the far right column of Table 36.3.

4. Refinement can be used to improve the accuracy of the node positions when the range estimates between neighboring nodes are quite accurate. The best results are obtained in combination with DV-hop or Sum-dist, but at a significant (around 50%) drop in coverage. This renders the usage of Refinement questionable, despite its modest communication overhead.

A final important observation is that the localization problem is still largely unsolved. In ideal conditions Euclidean/Lateration performs fine, but in all other cases it suffers from severe coverage problems. Although Refinement uses the extra information of the many neighbor-to-neighbor ranges and reduces the error, it too suffers from coverage problems. Under most conditions, there is still significant room for improvement.
36.6 Conclusions

This chapter addressed the issue of localization in ad-hoc wireless sensor networks. From the known localization algorithms specifically proposed for sensor networks, three approaches were selected that meet the basic requirements of self-organization, robustness, and energy-efficiency: Ad-hoc positioning [11], Robust positioning [13], and N-hop multilateration [14]. Although these three algorithms were developed independently, they share a common structure. We were able to identify a generic, 3-phase approach to determine the individual node positions, consisting of the steps below:

1. Determine the distances between unknowns and anchor nodes.
2. Derive for each node a position from its anchor distances.
3. Refine the node positions using information about the range to, and positions of, neighboring nodes.

We studied three Phase 1 alternatives (Sum-dist, DV-hop, and Euclidean), two Phase 2 alternatives (Lateration and Min-max), and an optional Refinement procedure for Phase 3. To this end the discrete event simulator developed by Savarese et al. [13] was extended to allow for the execution of an arbitrary combination of alternatives.

Section 36.4 dealt with Phase 1 and Phase 2 in isolation. For the Phase 1 alternatives, we studied the sensitivity to range errors, connectivity, and the fraction of anchor nodes (with known positions). DV-hop proved to be stable and predictable; Sum-dist and Euclidean showed tendencies to underestimate the distances between anchors and unknowns. Euclidean was found to have difficulties in propagating distance information under non-ideal conditions, leading to low coverage in the majority of cases. The results for Phase 2 showed that Lateration is capable of obtaining very accurate positions, but also that it is very sensitive to the accuracy and precision of the distance estimates. Min-max is more robust, but is sensitive to the placement of anchors, especially at the edges of the network.

In Section 36.5 we compared all six phase 1/2 combinations under different conditions. No single combination performs best; which algorithm is to be preferred depends on the conditions (range errors, connectivity, anchor fraction, and placement). The Euclidean/Lateration combination [11] should be used only in the absence of range errors (variance <2%) and requires a high node connectivity. The DV-hop/Min-max combination, which is a minor variation on the DV-hop/Lateration approach proposed in [11] and [13], performs best when there are no or only poor distance estimates, for example, those obtained from the signal strength. The Sum-dist/Min-max combination [14] is to be preferred in the majority of other conditions. The benefit of running Refinement in Phase 3 is considered to be questionable, since in many cases the coverage dropped by 50%, while the accuracy only improved significantly in the case of small
range errors. The communication overhead of Refinement was shown to be modest (2 messages per node) in comparison to the controlled flooding of Phase 1 (4 messages per node).
36.6.1 Future Work

Regarding the future, the ultimate distributed localization algorithm is yet to be devised. Under ideal circumstances Euclidean/Lateration performs fine, but in all other cases there is significant room for improvement. Furthermore, additional effort is needed to bridge the gap between simulations and real-world localization systems. For instance, we need to gather more data on the actual behavior of sensor nodes, particularly with respect to physical effects like multipath, interference, and obstruction.
Acknowledgments This work was first published in Elsevier Computer Networks [24]. We thank Elsevier for giving us permission to reproduce the material. We also thank Andreas Savvides and Dragos Niculescu for their input and for sharing their code with us.
References

[1] I. Akyildiz, W. Su, Y. Sankarasubramaniam, and E. Cayirci. A survey on sensor networks. IEEE Communications Magazine, 40: 102–114, 2002.
[2] S. Atiya and G. Hager. Real-time vision-based robot localization. IEEE Transactions on Robotics and Automation, 9: 785–800, 1993.
[3] J. Leonard and H. Durrant-Whyte. Mobile robot localization by tracking geometric beacons. IEEE Transactions on Robotics and Automation, 7: 376–382, 1991.
[4] R. Tinos, L. Navarro-Serment, and C. Paredis. Fault tolerant localization for teams of distributed robots. In IEEE International Conference on Intelligent Robots and Systems, Vol. 2, Maui, HI, October 2001, pp. 1061–1066.
[5] J. Hightower and G. Borriello. Location systems for ubiquitous computing. IEEE Computer, 34: 57–66, 2001.
[6] N. Bulusu, J. Heidemann, and D. Estrin. GPS-less low-cost outdoor localization for very small devices. IEEE Personal Communications, 7: 28–34, 2000.
[7] S. Capkun, M. Hamdi, and J.-P. Hubaux. GPS-free positioning in mobile ad-hoc networks. Cluster Computing, 5: 157–167, 2002.
[8] J. Chen, K. Yao, and R. Hudson. Source localization and beamforming. IEEE Signal Processing Magazine, 19: 30–39, 2002.
[9] L. Doherty, K. Pister, and L. El Ghaoui. Convex position estimation in wireless sensor networks. In IEEE Infocom 2001, Anchorage, AK, April 2001.
[10] T. He, C. Huang, B. M. Blum, J. A. Stankovic, and T. Abdelzaher. Range-free localization schemes for large scale sensor networks. In ACM International Conference on Mobile Computing and Networking (Mobicom), San Diego, CA, September 2003, pp. 81–95.
[11] D. Niculescu and B. Nath. Ad-hoc positioning system. In IEEE GlobeCom, San Antonio, TX, November 2001, pp. 2926–2931.
[12] V. Ramadurai and M. Sichitiu. Localization in wireless sensor networks: a probabilistic approach. In International Conference on Wireless Networks (ICWN), Las Vegas, NV, June 2003, pp. 275–281.
[13] C. Savarese, K. Langendoen, and J. Rabaey. Robust positioning algorithms for distributed ad-hoc wireless sensor networks. In USENIX Technical Annual Conference, Monterey, CA, June 2002, pp. 317–328.
[14] A. Savvides, H. Park, and M. Srivastava. The bits and flops of the n-hop multilateration primitive for node localization problems. In Proceedings of the First ACM International Workshop on Wireless Sensor Networks and Application (WSNA), Atlanta, GA, September 2002, pp. 112–121.
[15] N. Priyantha, A. Chakraborty, and H. Balakrishnan. The cricket location-support system. In Proceedings of the 6th ACM International Conference on Mobile Computing and Networking (Mobicom), Boston, MA, August 2000, pp. 32–43.
[16] P. Bahl and V. Padmanabhan. RADAR: an in-building RF-based user location tracking system. In Infocom, Vol. 2, Tel Aviv, Israel, March 2000, pp. 575–584.
[17] J. Hightower, R. Want, and G. Borriello. SpotON: an indoor 3D location sensing technology based on RF signal strength. UW CSE 00-02-02, University of Washington, Department of Computer Science and Engineering, Seattle, WA, February 2000.
[18] J. Zhao and R. Govindan. Understanding packet delivery performance in dense wireless sensor networks. In Proceedings of the First International Conference on Embedded Networked Sensor Systems (SenSys), Los Angeles, CA, November 2003, pp. 1–13.
[19] A. Savvides, C.-C. Han, and M. Srivastava. Dynamic fine-grained localization in ad-hoc networks of sensors. In Proceedings of the 7th ACM International Conference on Mobile Computing and Networking (Mobicom), Rome, Italy, July 2001, pp. 166–179.
[20] L. Girod and D. Estrin. Robust range estimation using acoustic and multimodal sensing. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Maui, HI, October 2001.
[21] Y. Xu, J. Heidemann, and D. Estrin. Geography-informed energy conservation for ad-hoc routing. In Proceedings of the 7th ACM International Conference on Mobile Computing and Networking (Mobicom), Rome, Italy, 2001, pp. 70–84.
[22] A. Varga. The OMNeT++ discrete event simulation system. In European Simulation Multiconference (ESM'2001), Prague, Czech Republic, June 2001.
[23] K. Whitehouse and D. Culler. Calibration as parameter estimation in sensor networks. In Proceedings of the 1st ACM International Workshop on Wireless Sensor Networks and Application (WSNA), Atlanta, GA, September 2002, pp. 59–67.
[24] K. Langendoen and N. Reijers. Distributed localization in wireless sensor networks: a quantitative comparison. Elsevier Computer Networks, 43: 499–518, 2003.
37
Routing in Sensor Networks

Shashidhar Gandham and Ravi Musunuri
University of Texas at Dallas

Udit Saxena
Microsoft Corporation

37.1 Introduction
37.2 Routing
    Flat Routing Protocols • Cluster-Based Routing Protocols
37.3 Conclusions
References
37.1 Introduction

Sensor networks are expected to be deployed in large numbers for applications such as environmental monitoring, surveillance, security, and precision agriculture [1–4]. Each sensor node consists of a sensing device, a processor with limited computational capabilities, memory, and a wireless transceiver. These nodes are typically deployed in inaccessible terrain in an ad hoc manner. Once deployed, each sensor node is expected to periodically monitor its surrounding environment and detect the occurrence of some predetermined events. For example, in a sensor network deployed for monitoring forest fires, any sudden surge in the temperature of the surrounding area would be an event of interest. Similarly, in a sensor network deployed for surveillance, any moving object in the surroundings would be an event of interest. On detecting an event, a sensor node is expected to report the details of the event to a base station associated with the sensor network. In most cases, the base station might not be in direct reach of the reporting nodes. Hence, sensor nodes need to form a multihop wireless network to reach the base station. A medium access control protocol and a routing protocol are essential in setting up such a wireless network. In this chapter we present an overview of design challenges associated with routing in sensor networks and some existing routing protocols. It is known that each sensor node is powered by limited battery-supplied energy. Nodes drain their energy in carrying out local tasks and in communicating with neighboring nodes. The amount of energy spent in communication is known to be orders of magnitude higher than the amount spent in local tasks [5]. As stated earlier, sensor nodes are expected to be deployed in inaccessible terrain and it might not be feasible to replenish the energy available at each node. Thus, the energy available at sensor nodes is an important design constraint in routing.
37.2 Routing

Each sensor node is expected to monitor some environmental phenomenon and forward the corresponding data toward the base station. To forward data packets, each node needs to have routing information. Here, we would like to point out that the flow of packets is mostly directed from sensor nodes toward the base station. As a result, each sensor node need not maintain explicit routing tables. Routing protocols can, in general, be divided into flat routing and cluster-based routing protocols.
37.2.1 Flat Routing Protocols

In flat routing protocols, the nodes in the network are considered to be homogeneous. Each node in the network participates in route discovery, route maintenance, and the forwarding of data packets. Here, we describe a few existing flat routing protocols for sensor networks.

37.2.1.1 Sequential Assignment Routing

Sequential Assignment Routing (SAR) [5] takes into consideration the energy and Quality of Service (QoS) of each path, and the priority level of each packet, when making routing decisions. Every node maintains multiple paths to the sink to avoid the overhead of route recomputation due to node or link failure. Multiple paths are maintained by building multiple trees rooted at the one-hop sink neighbors. Each tree is grown outward by successively adding more nodes to the one-hop sink neighbors, while avoiding nodes with low QoS and energy reserves. Each sensor node can control which of its neighbors may be used for relaying a message. Each node associates two parameters, an additive QoS metric and an energy measure, with every path. Energy is measured by estimating the maximum number of packets that can be routed without the node's energy being depleted, assuming that the node uses the path exclusively. SAR then calculates a weighted QoS metric as the product of the additive QoS metric and a weighted coefficient associated with the priority level of the packet. The SAR algorithm attempts to minimize the average weighted QoS metric over the lifetime of the network. A periodic recomputation of paths is triggered by the sink to account for any changes in the topology. Failure recovery is done by a handshaking procedure between neighbors.

37.2.1.2 Directed Diffusion

Estrin et al. [2] proposed a diffusion-based scheme for routing queries from the base station to the sensor nodes and forwarding the corresponding replies. In directed diffusion, attribute-based naming is used by the sensor nodes. Each sensor names the data that it generates using one or more attributes. A sink may query for data by disseminating interests. Intermediate nodes propagate these interests. Interests establish gradients of data toward the sink that expressed the interest. For example, a seismic sensor may generate data with the attributes: type = seismic, id = 12, location = NE, time stamp = 01.01.01, footprint = vehicle/wheeled/over 40 tons. A sink may send an interest of the form: type = seismic, location = NE. The intermediate nodes then propagate the interest for vehicle data in the NE quadrant toward the approximate direction of the source. The strength of the gradient may be different toward different neighbors, resulting in different amounts of information flow.

37.2.1.3 Minimum Cost Forwarding Algorithm for Large Sensor Networks

The minimum cost forwarding approach proposed by Ye et al. [6] exploits the fact that the data flow in sensor networks is in a single direction and is always toward the fixed base station. Their method requires sensor nodes neither to have unique identities nor to maintain routing tables in order to forward messages. Each node maintains the least-cost estimate from itself to the base station. Each message to be forwarded is broadcast by the node. On receiving a message, a node checks whether it is on the least-cost path between the source sensor node and the base station; if so, it forwards the message by broadcasting it in turn. In principle, the concept behind minimum cost forwarding is similar to the gravity field that drives waterfalls from the top of a mountain to the ground.
At each point, water flows from a higher position to a lower position along the shortest path. For this algorithm to work, each node needs to have the least-cost estimate from itself to the base station. The base station broadcasts an advertisement message with the cost set to
zero. Every node initially has its estimate set to infinity. On receiving an advertisement message, every node checks whether the estimate in the message plus the cost of the link on which it was received is less than its current estimate. If so, both the node's current estimate and the estimate carried in the advertisement message are updated. If the received advertisement message was updated with a new cost estimate, it is forwarded; otherwise it is purged. Because advertisement messages are forwarded immediately after updating, the authors noticed that some nodes will get multiple updates and perform multiple forwards as ever-lower cost estimates flow in. Furthermore, the nodes far away from the base station get more updates than those close to the base station. To avoid this instability during the setup phase, a back-off algorithm was proposed. According to this back-off algorithm, on updating the current cost estimate, the advertisement message is not forwarded for A × Cnode units of time, where A is a constant determined through simulations and Cnode is the cost of the link on which the advertisement message was received.

37.2.1.4 Flow-Based Routing Protocol

In Reference 7, the authors proposed to model the sensor network as a flow network and presented an Integer Linear Program (ILP) [8] based routing method. The objective of this ILP-based method is to minimize the maximum energy spent by any sensor node during a period of time. Through simulation results, the authors showed that their ILP-based routing heuristic increases the lifetime of the sensor network significantly. In the above-mentioned study, the authors observed that the sensor nodes that are one hop away from a base station (sink) drain their energy much earlier than other nodes in the network. As a result, the base station is disconnected from the network. To address this problem, the deployment of multiple, intermittently mobile base stations was proposed. The operation time of the sensor network was split into equal periods of time referred to as rounds. Modifications were made to the flow network model and the ILP such that solving it gives the locations of the base stations in addition to the routing information. The modified ILP is given below:
$$\text{Minimize} \quad E_{\max}$$

subject to

$$\sum_{j \in N(i)} x_{ij} - \sum_{k \in N(i)} x_{ki} = T, \qquad i \in V_s \tag{37.1}$$

$$E_t \sum_{j \in N(i)} x_{ij} + E_r \sum_{k \in N(i)} x_{ki} \le \alpha \, RE_i, \qquad i \in V_s \tag{37.2}$$

$$\sum_{l \in V_f} y_l \le K_{\max} \tag{37.3}$$

$$\sum_{i \in V_s} x_{ik} \le T \, |V_s| \, y_k, \qquad k \in V_f \tag{37.4}$$

$$E_t \sum_{j \in N(i)} x_{ij} + E_r \sum_{k \in N(i)} x_{ki} \le E_{\max}, \qquad i \in V_s \tag{37.5}$$

$$y_k \in \{0, 1\}, \; k \in V_f; \qquad x_{ij} \ge 0, \; i \in V_s, \; j \in V \tag{37.6}$$
In formulating the above ILP, the sensor network is represented as a graph G(V, E) where (1) V = Vs ∪ Vf, with Vs representing the sensor nodes and Vf the feasible base-station sites; and (2) E ⊆ V × V represents the set of wireless links. The 0–1 integer variables yl are defined such that for each l ∈ Vf, yl = 1 if a base station is located at feasible site l, and 0 otherwise. N(i) = {j: (i, j) ∈ E}. Kmax and REi represent the maximum number of base stations available and the residual energy of node i, respectively; Et and Er denote the energy spent by a node in transmitting and receiving one packet, respectively. Given G(V, E), α, and Kmax, the above ILP, denoted by BSLmm(G, α, Kmax), minimizes the maximum energy spent, Emax, by a sensor node in a round. For a detailed explanation of the ILP, we refer the readers to Reference 7. Apart from increasing the lifetime of the network, the authors argued that multiple, mobile base stations would decrease the average hop length taken by each packet and increase the robustness of the system.
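To make the formulation concrete, the sketch below encodes constraints (37.1) to (37.6) for a small made-up instance using the PuLP modeling library. The topology, energy constants, and parameter values are illustrative assumptions and are not taken from Reference 7.

```python
# Hypothetical toy instance of the base-station location ILP (37.1)-(37.6).
# The graph, energy constants, and parameter values are illustrative only.
import pulp

Vs = [0, 1, 2]                    # sensor nodes
Vf = ["A", "B"]                   # feasible base-station sites
E = [(0, 1), (1, 0), (1, 2), (2, 1), (0, "A"), (1, "A"), (2, "B")]
T, Et, Er, alpha, Kmax = 10, 1.0, 0.5, 1.0, 1
RE = {i: 1000.0 for i in Vs}      # residual energy of each sensor node

x = pulp.LpVariable.dicts("x", E, lowBound=0)     # packets sent on each link
y = pulp.LpVariable.dicts("y", Vf, cat="Binary")  # base station opened at site?
Emax = pulp.LpVariable("Emax", lowBound=0)

prob = pulp.LpProblem("BSL_minmax", pulp.LpMinimize)
prob += Emax                                       # objective: minimize Emax
for i in Vs:
    out = pulp.lpSum(x[e] for e in E if e[0] == i)
    inn = pulp.lpSum(x[e] for e in E if e[1] == i)
    prob += out - inn == T                         # (37.1) flow conservation
    prob += Et * out + Er * inn <= alpha * RE[i]   # (37.2) energy budget
    prob += Et * out + Er * inn <= Emax            # (37.5) defines Emax
prob += pulp.lpSum(y[s] for s in Vf) <= Kmax       # (37.3) number of stations
for s in Vf:                                       # (37.4) flow only into open sites
    prob += pulp.lpSum(x[e] for e in E if e[1] == s) <= T * len(Vs) * y[s]

prob.solve()
print("Emax =", pulp.value(Emax),
      "| open sites:", [s for s in Vf if y[s].value() == 1])
```

Solving one such ILP per round, with REi updated from the previous round, would reproduce the overall scheme: the chosen sites and flows change from round to round as nodes deplete their energy.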
37.2.1.5 Sensor Protocols for Information via Negotiation

Kulik and coworkers [9] proposed a set of protocols to disseminate individual sensor information to all the sensor nodes. Sensor Protocols for Information via Negotiation (SPIN) overcomes information implosion and overlap by using negotiation and information descriptors (metadata). Classic flooding suffers from the problem of implosion, in that the information is sent to all nodes regardless of whether they have already seen it or not. Another problem is that of overlap of information, where two pieces of information might have some components in common, so it might be sufficient to just forward the information after removing the common part. SPIN uses three kinds of messages to communicate:

• ADV — when a node has data to send, it advertises it using this message.
• REQ — a node sends this message when it wishes to receive some data.
• DATA — data messages contain the data with a metadata header.

The details are as follows:

1. SPIN-PP. This protocol is designed for point-to-point communication, assuming that two nodes can communicate with each other without interfering with other nodes' communication. This protocol also assumes that energy is not a constraint and that packets are never lost. This protocol works on a hop-by-hop basis. A node that has information to send advertises it by sending an ADV to its neighboring nodes. The nodes that are interested in receiving this information express their interest by sending an REQ. The originator of the ADV then sends the data to the nodes that sent an REQ. These nodes then send ADV messages to their neighbors and the process repeats itself.
2. SPIN-EC. This protocol adds an energy heuristic to the previous protocol. A node participates in the process only if it can complete all the stages in the protocol without going below a low-energy threshold.
3. SPIN-BC. This protocol was defined for broadcast channels. The advantage is that all nodes within hearing range can hear a broadcast, while the disadvantage is that nodes have to desist from transmitting if the channel is already in use. Another difference from the previous protocols is that nodes do not immediately send out REQ messages on hearing an ADV. Each node sets a random timer and, on expiry of that timer, sends out the REQ message. The other nodes whose timers have not yet expired cancel them on hearing the request, thus preventing redundant copies of the request from being sent.
4. SPIN-RL. This protocol was designed for lossy broadcast channels by incorporating two adjustments. First, each node keeps track of the advertisements it receives and re-requests data if a response from the requested node is not received within a specified time interval. Second, nodes limit the frequency with which they will resend data. Every node waits for a predetermined time period before servicing requests for the same piece of data again.

Multihop flat routing can also be subdivided according to the signal processing technique employed. There are two types of cooperative signal processing techniques: noncoherent and coherent. For noncoherent processing, raw data is preprocessed at the node itself before being forwarded to the Central Node (CN) for further processing. For coherent processing, the data is forwarded to the CN after minimum processing. The processing at the node involves operations such as time stamping.
Thus, for energy efficiency, algorithmic techniques assume importance for noncoherent processing, since the data traffic is low, while path optimality is important for coherent processing.

37.2.1.6 Geographic Routing Protocols

Geographic routing protocols are based on the assumption that each node is aware of the geographical location of its neighbors and of the destination node. There are many known location determination algorithms [10,11] that would enable sensor nodes to learn their location once deployed. On determining its position, each node can inform its neighbors about its location. In addition, the dataflow in sensor networks is mostly directed toward a base station, whose position can be sent to the nodes on deployment. The basic idea in geographic routing protocols is to forward packets to a neighbor that is closer to the destination. Every node employs the same forwarding strategy until the packet reaches the destination node. It is known that this simple packet-forwarding strategy suffers from the local minimum phenomenon [12].
Packets might reach a node whose neighbors are all further away from the destination. Thus, packets can get stuck, with no further node to which they can be forwarded. Karp and Kung [12] proposed the right-hand rule to overcome the local minimum phenomenon. They assume that the underlying connectivity graph is planar. When a packet gets stuck at a node, they propose to forward the packet along the face of the graph in the counterclockwise direction. Face routing is employed until the packet reaches a node that is closer to the destination. Fang et al. [13] show that the local minimum phenomenon can be addressed in nonplanar graphs too. (A small code sketch of the basic greedy rule appears at the end of this section.)

37.2.1.7 Parametric Probabilistic Routing

In the parametric probabilistic routing protocol proposed by Barrett et al. [14], each node forwards a packet based on a probability density function. Barrett et al. proposed two variations of their protocol. In the first variation, referred to as the Destination Attractor, the probability with which a packet is forwarded to a neighbor depends on the number of hops the source node is from the destination and the number of hops the current node is from the destination. The basic idea behind this variation is to increase the probability of retransmission if the packet is approaching the destination, and to decrease the probability of retransmission if the packet is moving away from the destination. The second variation, referred to as Directed Transmission, uses the number of hops already traversed by the packet in addition to the two parameters used by the Destination Attractor. In Directed Transmission, nodes on the shortest path to the destination retransmit with higher probability.

37.2.1.8 Min–MinMax, an Energy Aware Routing Protocol

Gandham et al. [15] formulated energy aware routing during a round as described below: the sensor network is represented as a graph G(V, E) where

1. V = Vs ∪ Vb, where Vs is the set of sensor nodes and Vb is the set of base station(s).
2. E ⊆ V × V represents the set of wireless links.

A round is assumed to consist of T time frames and each sensor node generates one packet of data in every time frame. At the beginning of a round, the residual energy at a sensor node i is represented by REi. During a round, the total energy spent by sensor node i can be at most α × REi, where α (0 < α ≤ 1) is a parameter. The goal is to determine routing information so as to minimize the total energy spent in the network such that the maximum energy spent by a node in a round is minimized. It is known that the energy spent by a node is directly proportional to the amount of flow (number of packets) passing through the node. Thus, minimizing the maximum energy spent by a node is the same as minimizing the maximum flow through a node. Exploiting this fact, energy aware routing is cast as a variant of the maximum flow problem [16]. In the maximum flow problem [16], we are given a directed graph G(V, E), a supply node Ss, a demand node Sd, and a capacity uij for each link (i, j) ∈ E; we need to determine the flow xij on each arc (i, j) ∈ E such that the net outflow from the supply node is maximized. We refer the readers to Reference 15 for the details.
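Returning to the geographic protocols of Section 37.2.1.6, the greedy forwarding rule is simple enough to sketch directly. The code below is a hypothetical illustration — the node coordinates, radio range, and topology are made up — and it merely reports failure at a local minimum, the point where GPSR [12] would switch to face routing.

```python
# Hypothetical sketch of greedy geographic forwarding (Section 37.2.1.6).
# Coordinates, radio range, and topology are illustrative assumptions.
import math

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def greedy_route(pos, radio_range, src, dst):
    """Forward to the neighbor closest to dst; stop at a local minimum."""
    path, current = [src], src
    while current != dst:
        neighbors = [n for n in pos
                     if n != current and dist(pos[n], pos[current]) <= radio_range]
        if not neighbors:
            return path, False
        nxt = min(neighbors, key=lambda n: dist(pos[n], pos[dst]))
        # Local minimum: no neighbor is closer to the destination than we are.
        # GPSR would planarize the graph and fall back to face routing here.
        if dist(pos[nxt], pos[dst]) >= dist(pos[current], pos[dst]):
            return path, False
        path.append(nxt)
        current = nxt
    return path, True

pos = {"s": (0, 0), "a": (1, 1), "b": (2, 1), "d": (3, 2)}
print(greedy_route(pos, 1.6, "s", "d"))   # (['s', 'a', 'b', 'd'], True)
```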
37.2.2 Cluster-Based Routing Protocols

In cluster-based routing protocols, special nodes, referred to as cluster heads, discover and maintain routes, and non-cluster-head nodes join one of the clusters. All the data packets originating in the cluster are forwarded toward the cluster head. The cluster head in turn forwards these packets toward the destination using the routing information. Here, we describe some cluster-based routing protocols from the literature.

37.2.2.1 Low-Energy Adaptive Clustering Hierarchy

Chandrakasan and coworkers [17] proposed the Low-Energy Adaptive Clustering Hierarchy (LEACH) as an energy-efficient communication protocol for wireless sensor networks. The authors of LEACH claim that this protocol will extend the life of wireless sensor networks by a factor of 8 when compared with protocols based on multihop routing and static clustering. LEACH is a cluster-based routing algorithm in which self-elected cluster heads collect data from all the sensor nodes in their cluster, aggregate the
collected data by data fusion methods, and transmit the data directly to the base station. These self-elected cluster heads continue to be cluster heads for a period referred to as a round. At the beginning of each round, every node determines if it can be a cluster head during the current round. If it decides to be a cluster head for the current round, it announces its decision to its neighbors. Other nodes that choose not to be cluster heads opt to join one of the cluster heads on listening to these announcements, based on predetermined parameters such as the signal-to-noise ratio. LEACH is proposed for routing data in wireless sensor networks which have a fixed base station to which the recorded data needs to be routed. All the sensor nodes are considered to be static, homogeneous, and energy constrained. The sensor nodes are expected to sense the environment continuously and thus have data to send at a fixed rate. This assumption makes LEACH unsuitable for sensor networks where a moving source needs to be monitored. Furthermore, radio channels are assumed to be symmetric; the term symmetric here means that the energy required to transmit a particular message between two nodes is the same in either direction. A first-order radio model is assumed to describe the transmission characteristics of the sensor nodes. In this model, the energy required to transmit a signal has a fixed part and a variable part, with the variable part directly proportional to the square of the distance. Some constant energy is required to receive a signal by any receiving antenna. Based on these assumptions, it is clear that having too many intermediate nodes route the data might consume more energy, from a global perspective, when compared to direct transmission to the base station. This argument supports the decision to transmit the aggregated data directly from cluster head to base station. The key features of LEACH are localized coordination for cluster setup and operation, randomized rotation of cluster heads, and local fusion of data to reduce global communication costs. LEACH is organized into rounds, where each round starts with a setup phase followed by a longer steady-state data transfer phase. Here we describe the various subphases involved in both these phases:

1. Advertisement phase. A predetermined fraction of nodes, say p, elect themselves as cluster heads. The optimum value of p can be found from a plot of normalized energy dissipation versus the percentage of nodes acting as cluster heads. For a detailed description of this procedure we refer the reader to Reference 18. The decision to be a cluster head is made by choosing a random number between 0 and 1. If the generated number is less than a threshold T(n), then the node will be a cluster head for the current round. The threshold T(n) is given by the expression p/[1 − p(r mod (1/p))], where r is the current round. This ensures that every node will be a cluster head once in every 1/p rounds. (A small code sketch of this election rule appears at the end of Section 37.2.2.) Once the decision is made, cluster heads advertise their IDs, employing a CSMA MAC protocol.
2. Cluster setup phase. On listening to the advertisements of the previous phase, non-cluster-head nodes determine which cluster head to join by comparing the signal-to-noise ratios of the various cluster heads surrounding them. Each node informs the cluster head of the cluster that it decides to join, again employing a CSMA MAC protocol.
3. Schedule creation. On receiving all the messages, the cluster head creates a TDMA schedule and announces it to all the nodes in the cluster.
In order to avoid interference between nodes in adjacent clusters, the cluster head determines the CDMA code to be used by all the nodes in its cluster. This CDMA code to be used in the current round is transmitted along with the TDMA schedule.

4. Data transmission. Once the schedule is known, each node transmits its data during the time slot allocated to it. When the cluster head receives data from all the nodes in its cluster, it runs data fusion algorithms to aggregate the data. The resulting data is transmitted directly to the base station.

37.2.2.2 Threshold Sensitive Energy-Efficient Sensor Network Protocol

In Reference 19, the authors classified sensor networks into proactive networks and reactive networks. Nodes in proactive networks continuously monitor the environment and thus have data to send at a constant rate. LEACH suits such sensor networks in transmitting data efficiently to the base station. In the case of reactive sensor networks, nodes need to transmit data only when an event of interest occurs. Hence, not all nodes in the network have an equal amount of data to transmit. Manjeshwar
and Agrawal [19] proposed the Threshold Sensitive Energy-Efficient Sensor Network (TEEN) protocol for routing in reactive sensor networks. TEEN employs the cluster formation strategy of LEACH but adopts a different strategy in the data transmission phase. TEEN makes use of two user-defined parameters, a hard threshold (Ht) and a soft threshold (St), to determine whether a node needs to transmit its currently sensed value. When the monitored value exceeds Ht for the first time, it is stored in a variable and is transmitted during the node's time slot. Subsequently, if the monitored value exceeds the currently stored value by a magnitude of St, the node transmits the data. This transmitted value is stored for future comparisons. (A sketch of this rule, together with LEACH's election threshold, is given in code at the end of this section.)

37.2.2.3 Two-Level Clustering Algorithm

Estrin et al. [2] proposed a two-level clustering algorithm that can be extended to build a cluster hierarchy. In this algorithm, every sensor at a particular level is associated with a radius, that is, the number of hops that its advertisements will reach. Sensors at a higher level are associated with larger radii. All sensors start at level 0. Each sensor sends out periodic advertisements to other nodes that are within its radius. The advertisements carry its current level, its parent's identity (if any), and its remaining energy. After transmitting the advertisement, each node waits for a time proportional to its radius to receive advertisements from other nodes. At the end of the wait time, all level 0 nodes start a promotion timer that is proportional to their remaining energy reserves and the number of level 0 nodes whose advertisements they received. When the promotion timer expires, the node promotes itself to level 1 and starts sending out periodic advertisements. In these new advertisements it lists its potential children, which are the level 0 nodes that it previously heard. A level 0 node then picks its parent from among the level 1 nodes whose advertisements included its identity. Once a level 0 node picks its parent, it cancels its promotion timer and drops out of the race. At the end, each level 1 node starts a wait timer and waits for its potential children's acknowledgments. If no level 0 node selected it as its parent, or if its energy dropped below a certain level, it demotes itself to a level 0 node. All level 0 and level 1 nodes periodically enter the wait stage to take into account any change in network conditions, and reclustering takes place.
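Two of the local decision rules described in this section — LEACH's randomized cluster-head self-election and TEEN's hard/soft-threshold reporting test — can be sketched in a few lines. The parameter values below are arbitrary, and reading "exceeds by a magnitude of St" as a change of at least St in absolute value is our interpretation.

```python
# Hypothetical sketch of two local decision rules: LEACH's cluster-head
# election threshold and TEEN's Ht/St reporting test. Parameters are arbitrary.
import random

def leach_elect(p, r, was_head_recently):
    """Become cluster head with probability T(n) = p / (1 - p*(r mod 1/p)).
    Nodes that served as head in the last 1/p rounds are ineligible."""
    if was_head_recently:
        return False
    t = p / (1.0 - p * (r % int(round(1.0 / p))))
    return random.random() < t

class TeenNode:
    """TEEN: report when the sensed value first crosses the hard threshold Ht,
    and afterwards only when it changes by at least the soft threshold St."""
    def __init__(self, ht, st):
        self.ht, self.st, self.last = ht, st, None

    def should_transmit(self, value):
        if value < self.ht:
            return False
        if self.last is None or abs(value - self.last) >= self.st:
            self.last = value
            return True
        return False

node = TeenNode(ht=50.0, st=2.0)
print([node.should_transmit(v) for v in (48, 51, 52, 54, 53)])
# -> [False, True, False, True, False]
```

Note how, at a round r with r mod (1/p) = 1/p − 1, the threshold reaches 1 and every remaining eligible node elects itself, which is what guarantees one headship per node per 1/p rounds.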
37.3 Conclusions

In this chapter we presented a brief overview of some known routing algorithms for wireless sensor networks. Both flat and cluster-based routing algorithms were discussed.
References

[1] Estrin, D., Girod, L., Pottie, G., and Srivastava, M. Instrumenting the world with wireless sensor networks. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, IEEE, 2001, pp. 2033–2036.
[2] Estrin, D., Govindan, R., Heidemann, J., and Kumar, S. Next century challenges: scalable coordination in sensor networks. In Proceedings of the 5th Annual ACM/IEEE International Conference on Mobile Computing and Networking, IEEE, 1999, pp. 263–270.
[3] Pottie, G.J. and Kaiser, W.J. Wireless integrated network sensors. Communications of the ACM, 43, 51–58, 2000.
[4] Pottie, G.J. Wireless sensor networks. In Proceedings of the Information Theory Workshop, 1998, pp. 139–140.
[5] Sohrabi, K., Gao, J., Ailawadhi, V., and Pottie, G.J. Protocols for self-organization of a wireless sensor network. IEEE Personal Communications, 7, 16–27, 2000.
[6] Ye, F., Chen, A., Liu, S., and Zhang, L. A scalable solution to minimum cost forwarding in large sensor networks. In Proceedings of the 10th International Conference on Computer Communications and Networks, 2001, pp. 304–309.
[7] Gandham, S.R., Dawande, M., Prakash, R., and Venkatesan, S. Energy efficient schemes for wireless sensor networks with multiple mobile base stations. In Proceedings of the IEEE Globecom, IEEE, 2003.
[8] Nemhauser, G.L. and Wolsey, L.A. Integer and Combinatorial Optimization. John Wiley & Sons, New York, 1988.
[9] Heinzelman, W., Kulik, J., and Balakrishnan, H. Negotiation-based protocols for disseminating information in wireless sensor networks. In Proceedings of the 5th Annual ACM/IEEE International Conference on Mobile Computing and Networking, IEEE, 1999.
[10] Ray, S., Ungrangsi, R., De Pellegrini, F., Trachtenberg, A., and Starobinski, D. Robust location detection in emergency sensor networks. In Proceedings of the INFOCOM, IEEE, 2003.
[11] Bulusu, N., Heidemann, J., and Estrin, D. GPS-less low cost outdoor localization for very small devices. Technical Report 00-729, USC/ISI, April 2000.
[12] Karp, B. and Kung, H. GPSR: greedy perimeter stateless routing for wireless networks. In Proceedings of the Mobicom, ACM, 2000.
[13] Fang, Q., Gao, J., and Guibas, L.J. Locating and bypassing routing holes in sensor networks. In Proceedings of the INFOCOM, IEEE, 2004.
[14] Barrett, C.L., Eidenbenz, S.J., Kroc, L., Marathe, M., and Smith, J.P. Parametric probabilistic sensor network routing. In Proceedings of the WSNA'03, 2003.
[15] Gandham, S., Dawande, M., and Prakash, R. An integral flow-based energy-efficient routing algorithm for wireless sensor networks. In Proceedings of the WCNC, IEEE, 2004.
[16] Ahuja, R.K. and Orlin, J.B. A fast and simple algorithm for the maximum flow problem. Operations Research, 37, 748–759, 1989.
[17] Heinzelman, W.R., Chandrakasan, A., and Balakrishnan, H. Energy-efficient communication protocol for wireless microsensor networks. In Proceedings of the 33rd Annual Hawaii International Conference on System Sciences, 2000, pp. 3005–3014.
[18] Heinzelman, W., Kulik, J., and Balakrishnan, H. Adaptive protocols for information dissemination in wireless sensor networks. In Proceedings of the 5th Annual ACM/IEEE International Conference on Mobile Computing and Networking, IEEE, 1999, pp. 174–185.
[19] Manjeshwar, A. and Agrawal, D.P. TEEN: a routing protocol for enhanced efficiency in wireless sensor networks. In Proceedings of the 15th International Parallel and Distributed Processing Symposium, 2001, pp. 2009–2015.
38
Distributed Signal Processing in Sensor Networks

Omid S. Jahromi
Bioscrypt Inc.

Parham Aarabi
University of Toronto

38.1 Introduction
38.2 Spectrum Estimation Using Sensor Networks
    Background • Mathematical Formulation of the Problem
38.3 Inverse and Ill-Posed Problems
    Ill-Posed Linear Operator Equations • Regularization Methods for Solving Ill-Posed Linear Operator Equations
38.4 Spectrum Estimation Using Generalized Projections
38.5 Distributed Algorithms for Calculating Generalized Projection
    The Ring Algorithm • The Star Algorithm
38.6 Concluding Remark
Acknowledgments
References
38.1 Introduction Sensors are vital means for scientists and engineers to observe physical phenomena. They are used to measure physical variables such as temperature, pH, velocity, rotational rate, flow rate, pressure, and many others. Most modern sensors output a discrete-time (digitized) signal that is indicative of the physical variable they measure. Those signals are often imported into digital signal processing (DSP) hardware, stored in files or plotted on a computer display for monitoring purposes. In recent years there has been an emergence of a number of new sensing concepts which advocate connecting a large number of inexpensive and small sensors in a sensor network. The trend to network many sensors together has been reinforced by the widespread availability of cheap embedded processors and easily accessible wireless networks. The building blocks of a sensor network, often called “Motes,” are self-contained, battery-powered computers that measure light, sound, temperature, humidity, and other environmental variables (Figure 38.1). Motes can be deployed in large numbers providing enhanced spatio-temporal sensing coverage in ways that are either prohibitively expensive or impossible using conventional sensing assets. For example, they can allow monitoring of land, water, and air resources for environmental monitoring. They can also be used to monitor borders for safety and security. In defence applications, sensor networks can provide
enhanced battlefield situational awareness, which can revolutionize a wide variety of operations from armored assault on open terrain to urban warfare. Sensor networks have many potential applications in biomedicine, factory automation, and control of transportation systems as well.

FIGURE 38.1 A wireless sensor node or "Mote" made by Crossbow Technology, Inc. in San Jose, CA.

In principle, a distributed network of sensors can be highly scalable, cost-effective, and robust with respect to the failure of individual Motes. However, there are many technological hurdles that must be overcome for sensor networks to become viable. For instance, Motes are inevitably constrained in processing speed, storage capacity, and communication bandwidth. Additionally, their lifetime is determined by their ability to conserve power. These constraints require new hardware designs and novel network architectures. Sensor networks raise nontrivial theoretical issues as well. For example, new networking protocols must be devised to allow the sensor nodes to spontaneously create an impromptu network, dynamically adapt to device failure, manage movement of sensor nodes, and react to changes in task and network requirements. From a signal processing point of view, the main challenge is the distributed fusion of sensor data across the network. This is because individual sensor nodes are often not able to provide useful or comprehensive information about the quantity under observation. Furthermore, the following constraints must be considered while designing the information fusion algorithm:

1. Each sensor node is likely to have limited power and bandwidth capabilities to communicate with other devices. Therefore, any distributed computation on the sensor network must be very efficient in utilizing the limited power and bandwidth budget of the sensor devices.
2. Owing to the variable environmental conditions in which sensor devices may be deployed, one can expect a fraction of the sensor nodes to be malfunctioning. Therefore, the underlying distributed algorithms must be robust with respect to device failures.

Owing to the large and often ad hoc nature of sensor networks, it would be a formidable challenge to develop distributed information fusion algorithms without first developing a simple, yet rigorous and flexible, mathematical model. The aim of this chapter is to introduce one such model. We advocate that information fusion in sensor networks should be viewed as a problem of finding a "solution point" in the intersection of some "feasibility sets." The key advantage of this viewpoint is that the solution can be found using a series of projections onto the individual sets. The projections can be computed locally at each sensor node, allowing the fusion process to be done in a parallel and distributed fashion. To maintain clarity and simplicity, we will focus on solving a benchmark signal processing problem (spectrum estimation) using sensor networks. However, the fusion algorithms that result from our formulations are very general and can be used to solve other sensor network signal processing problems as well.
Notation: Vectors are denoted by capital letters. Boldface capital letters are used for matrices. Elements of a matrix A are referred to as [A]ij. We denote the set of real M-tuples by RM and use the notation R+ for positive real numbers. The expected value of a random variable x is denoted by E{x}. The linear convolution operator is denoted by ∗. The spaces of Lebesgue-measurable functions are represented by L1(a, b), L2(a, b), etc. The end of an example is indicated using the symbol ♦.
38.2 Spectrum Estimation Using Sensor Networks

38.2.1 Background

Spectrum estimation is concerned with determining the distribution in frequency of the power of a random process. Questions such as "Does most of the power of the signal reside at low or high frequencies?" or "Are there resonance peaks in the spectrum?" are often answered as a result of a spectral analysis. Spectral analysis finds frequent and extensive use in many areas of physical sciences. Examples abound in oceanography, electrical engineering, geophysics, astronomy, and hydrology. Let x(n) denote a zero-mean Gaussian wide-sense stationary (WSS) random process. It is well known that a complete statistical description of such a process is provided by its autocorrelation sequence (ACS)
$$R_x(k) = E\{x(n)\, x(n+k)\}$$

or, equivalently, by its power spectrum, also known as power spectral density (PSD):

$$P_x(e^{j\omega}) = \sum_{k=-\infty}^{\infty} R_x(k)\, e^{-j\omega k}$$
The ACS sequence is a time-domain description of the second-order statistics of a random process. The power spectrum provides a frequency-domain description of the same statistics. An issue of practical importance is how to estimate the power spectrum of a time series given a finite-length data record. This is not a trivial problem, as reflected in a bewildering array of power spectrum estimation procedures, with each procedure claimed to have or show some optimum property.1 The reader is referred to the excellent texts [3–6] for analysis of empirical spectrum estimation methods. Consider the basic scenario where a sound source (a speaker) is monitored by a collection of Motes placed at various known locations in a room (Figure 38.2). Because of reverberation, noise, and other artifacts, the signal arriving at each Mote location is different. The Motes (which constitute the sensor nodes in our network) are equipped with microphones, sampling devices, sufficient signal processing hardware, and some communication means. Each Mote can process its observed data, come up with some statistical inference about it, and share the result with other nodes in the network. However, to save energy and communication bandwidth, the Motes are not allowed to share their raw observed data with each other. Now, how should the network operate so that an estimate of the power spectrum of the sound source consistent with the observations made by all Motes is obtained? We will provide an answer to this question in the sections that follow.2
1 The controversy is rooted in the fact that the power spectrum is a probabilistic quantity and these quantities cannot be constructed using finite-size sample records. Indeed, neither the axiomatic theory [1] nor the frequency theory [2] of probability specifies a constructive way for building probability measures from empirical samples.

2 The problem of estimating the power spectrum of a random signal, when the signal itself is not available but some measured signals derived from it are observable, has been studied in Reference 7. The approach developed in Reference 7, however, leads to a centralized fusion algorithm which is not suited to sensor network applications.
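The ACS/PSD pair defined above can be checked numerically. In the sketch below, the AR(1) signal model, the sample size, and the number of lags are arbitrary illustrative choices; a truncated version of the PSD sum, built from estimated autocorrelation coefficients, is compared with the closed-form AR(1) spectrum.

```python
# Numerical check of the ACS/PSD relationship. The AR(1) model, sample size,
# and number of lags below are arbitrary illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_lags = 20000, 32
e = rng.standard_normal(n_samples)
x = np.empty(n_samples)
x[0] = e[0]
for n in range(1, n_samples):              # AR(1): x(n) = 0.8 x(n-1) + e(n)
    x[n] = 0.8 * x[n - 1] + e[n]

# Biased sample estimates of R_x(k) = E{x(n) x(n+k)}, k = 0..n_lags-1.
R = np.array([np.mean(x[:n_samples - k] * x[k:]) for k in range(n_lags)])

# Truncated PSD sum, using R_x(-k) = R_x(k) for a real WSS process:
# P_x(e^{jw}) ~ R(0) + 2 * sum_{k>=1} R(k) cos(k w).
w = np.linspace(-np.pi, np.pi, 256)
k = np.arange(1, n_lags)
P = R[0] + 2.0 * np.sum(R[1:, None] * np.cos(np.outer(k, w)), axis=0)

# Closed-form PSD of this AR(1) process: 1 / |1 - 0.8 e^{-jw}|^2.
P_true = 1.0 / np.abs(1.0 - 0.8 * np.exp(-1j * w)) ** 2
print(np.max(np.abs(P - P_true)))  # small, up to truncation/estimation error
```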
FIGURE 38.2 A sensor network monitoring a stationary sound source in a room.
38.2.2 Mathematical Formulation of the Problem

Let x(n) denote a discrete version of the signal produced by the source and assume that it is a zero-mean Gaussian WSS random process. The sampling frequency fs associated with x(n) is arbitrary and depends on the frequency resolution desired in the spectrum estimation process. We denote by vi(n) the signal produced at the front end of the ith sensor node. We assume that vi(n) is related to the original source signal x(n) by the model shown in Figure 38.3.

FIGURE 38.3 The relation between the signal vi(n) produced by the front end of the ith sensor and the original source signal x(n).

The linear filter Hi(z) in this figure models the combined effect of room reverberations, the microphone's frequency response, and an additional filter which the system designer might want to include. The decimator block that follows the filter represents the (potential) difference between the sampling frequency fs associated with x(n) and the actual sampling frequency of the Mote's sampling device. Here, it is assumed that the sampling frequency associated with vi(n) is fs/Ni, where Ni is a fixed natural number. It is straightforward to show that the signal vi(n) in Figure 38.3 is also a WSS process. The autocorrelation coefficients Rvi(k) associated with vi(n) are given by

$$R_{v_i}(k) = R_{x_i}(N_i k) \tag{38.1}$$

where

$$R_{x_i}(k) = \big(h_i(k) * h_i(-k)\big) * R_x(k) \tag{38.2}$$

and hi(k) denotes the impulse response of Hi(z). We can express Rvi(k) as a function of the source signal's power spectrum as well. To do this, we define Gi(z) = Hi(z)Hi(z−1) and then use it to write (38.2) in
the frequency domain:

$$R_{x_i}(k) = \frac{1}{2\pi} \int_{-\pi}^{\pi} P_x(e^{j\omega})\, G_i(e^{j\omega})\, e^{jk\omega}\, d\omega \tag{38.3}$$

Combining (38.1) and (38.3), we then get

$$R_{v_i}(k) = \frac{1}{2\pi} \int_{-\pi}^{\pi} P_x(e^{j\omega})\, G_i(e^{j\omega})\, e^{jN_i k\omega}\, d\omega \tag{38.4}$$
The above formula shows that Px(ejω) uniquely specifies Rvi(k) for all values of k. However, the reverse is not true. That is, in general, knowing Rvi(k) for some or all values of k is not sufficient for characterizing Px(ejω) uniquely. Recall that vi(n) is a WSS signal, so all the statistical information that can be gained about it is confined to its autocorrelation coefficients. One might use the signal processing hardware available at each sensor node and estimate the autocorrelation coefficients Rvi(k) for some k, say 0 ≤ k ≤ L − 1. Now, we may pose the sensor network spectrum estimation problem as follows:

Problem 38.1 Let Qi,k denote the set of all power spectra which are consistent with the kth autocorrelation coefficient Rvi(k) estimated at the ith sensor node. That is, Px(ejω) ∈ Qi,k if

$$\frac{1}{2\pi} \int_{-\pi}^{\pi} P_x(e^{j\omega})\, G_i(e^{j\omega})\, e^{jN_i k\omega}\, d\omega = R_{v_i}(k),$$

$$P_x(e^{j\omega}) \ge 0, \qquad P_x(e^{j\omega}) = P_x(e^{-j\omega}), \qquad P_x(e^{j\omega}) \in L_1(-\pi, \pi).$$

Define $Q = \bigcap_{i=1}^{N} \bigcap_{k=0}^{L-1} Q_{i,k}$, where N is the number of nodes in the network and L is the number of autocorrelation coefficients estimated at each node. Find a Px(ejω) in Q.
If we ignore measurement imperfections and assume that the observed autocorrelation coefficients Rvi(k) are exact, then the sets Qi,k are nonempty and admit a nonempty intersection Q as well. In this case, Q contains infinitely many solutions Px(ejω). When the measurements vi(n) are contaminated by noise, or Rvi(k) are estimated based on finite-length data records, the intersection set Q might be empty owing to the potential inconsistency of the autocorrelation coefficients estimated by different sensors. Thus, Problem 38.1 has either no solution or infinitely many solutions. Problems that have such undesirable properties are called ill-posed. Ill-posed problems are studied in Section 38.3.
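Although Problem 38.1 itself is ill-posed, the quantities that enter it are easy to produce in simulation. The sketch below — the filter taps, decimation factor, and lag count are arbitrary assumptions — generates vi(n) according to the model of Figure 38.3 and estimates the first L autocorrelation coefficients that node i would report, checking them against (38.1) and (38.2) for a white source.

```python
# Hypothetical simulation of one sensor node's front end (Figure 38.3):
# filter the source by H_i(z), decimate by N_i, and estimate R_{v_i}(k).
# Filter taps, decimation factor, and lag count L are arbitrary choices.
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(100000)          # white source signal, for illustration
h_i = np.array([0.5, 0.3, 0.2])          # impulse response of H_i(z)
N_i, L = 2, 4

x_i = np.convolve(x, h_i, mode="full")[:len(x)]   # x_i(n) = (h_i * x)(n)
v_i = x_i[::N_i]                                  # decimation by N_i

# Biased estimates of R_{v_i}(k), k = 0..L-1 -- what node i would report.
R_vi = np.array([np.mean(v_i[:len(v_i) - k] * v_i[k:]) for k in range(L)])
print(np.round(R_vi, 3))

# Check against (38.1)-(38.2): R_{v_i}(k) = R_{x_i}(N_i k), where for a white
# source R_x(k) = delta(k), so R_{x_i} reduces to the deterministic
# autocorrelation of the filter taps, h_i(k) * h_i(-k).
g = np.correlate(h_i, h_i, mode="full")
c = len(h_i) - 1                                  # index of lag zero in g
R_ref = np.array([g[c + N_i * k] if c + N_i * k < len(g) else 0.0
                  for k in range(L)])
print(np.round(R_ref, 3))                         # ~ [0.38, 0.10, 0.0, 0.0]
```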
38.3 Inverse and Ill-Posed Problems

The study of inverse problems has been one of the fastest-growing areas in applied mathematics in the last two decades. This growth has largely been driven by the needs of applications in both natural sciences (e.g., inverse scattering theory, astronomical image restoration, and statistical learning theory) and industry (e.g., computerized tomography, remote sensing). The reader is referred to References 8 to 11 for detailed treatments of the theory of ill-posed problems and to References 12 and 13 for applications in inverse scattering and statistical inference, respectively. By definition, inverse problems are concerned with determining causes for a desired or an observed effect. Most often, inverse problems are much more difficult to deal with (from a mathematical point of view) than their direct counterparts. This is because they might not have a solution in the strict sense, or the solution might not be unique or might not depend continuously on the data. Mathematical problems having such undesirable properties are called ill-posed problems and cause severe numerical difficulties (mostly because of the discontinuous dependence of solutions on the data). Formally, a problem of mathematical physics is called well-posed, or well-posed in the sense of Hadamard, if it fulfills the following conditions:

1. For all admissible data, a solution exists.
2. For all admissible data, the solution is unique.
3. The solution depends continuously on the data.

A problem for which one or more of the above conditions are violated is called ill-posed. Note that the conditions mentioned above do not constitute a precise definition of well-posedness. To make a precise definition in a concrete situation, one has to specify the notion of a solution, which data are considered admissible, and which topology is used for measuring continuity. The study of concrete ill-posed problems often involves the question "how can one enforce uniqueness by additional information or assumptions?" Not much can be said about this in a general context. However, the aspect of lack of stability and its restoration by appropriate methods, known as regularization methods, can be treated in sufficient generality. The theory of regularization is well developed for linear inverse problems and will be introduced, very briefly, in Section 38.3.1.
38.3.1 Ill-Posed Linear Operator Equations

Let the linear operator equation

$$Ax = y \tag{38.5}$$

be defined by the continuous operator A that maps the elements x of a metric space E1 onto elements y of the metric space E2. In the early 1900s, noted French mathematician Jacques Hadamard observed that under some (very general) circumstances the problem of solving the operator equation (38.5) is ill-posed. This is because, even if there exists a unique solution x ∈ E1 that satisfies the equality (38.5), a small deviation on the right-hand side can cause large deviations in the solution. The following example illustrates this issue.
Example 38.1 Let A denote a Fredholm integral operator of the first kind. Thus, we define
$$(Ax)(s) = \int_a^b K(s, t)\, x(t)\, dt \tag{38.6}$$
The kernel K(s, t) is continuous on [a, b] × [a, b] and maps a function x(t) continuous on [a, b] to a function y(s) also continuous on [a, b]. We observe that the continuous function

$$g_\omega(s) = \int_a^b K(s, t)\, \sin(\omega t)\, dt \tag{38.7}$$
which is formed by means of the kernel K(s, t), possesses the property

$$\lim_{\omega \to \infty} g_\omega(s) = 0, \quad \text{for every } s \in [a, b] \tag{38.8}$$
The above property is a consequence of the fact that the Fourier series coefficients of a continuous function tend to zero at high frequencies (see, e.g., [14, chapter 14, section I]). Now, consider the integral equation

$$Ax = y + g_\omega \tag{38.9}$$
where y is given and gω is defined in (38.7). Since the above equation is linear, it follows using (38.7) that its solution x̂(t) has the form

$$\hat{x}(t) = x^*(t) + \sin(\omega t) \tag{38.10}$$

where x∗(t) is a solution to the original integral equation Ax = y. For sufficiently large ω, the right-hand side of (38.9) differs from the right-hand side of (38.5) only by the small amount gω(s), while its solution differs from that of (38.5) by the amount sin(ωt). Thus, the problem of solving (38.5) where A is a Fredholm integral operator of the first kind is ill-posed. ♦

One can easily verify that the problem of solving the operator equation (38.5) is equivalent to finding an element x∗ ∈ E1 such that the functional
$$R(x) = \|Ax - y\|_{E_2} \tag{38.11}$$

is minimized.3 Note that the minimizing element x∗ ∈ E1 always exists even when the original equation (38.5) does not have a solution. In any case, if the right-hand side of (38.5) is not exact, that is, if we replace y by yδ such that ‖y − yδ‖E2 < δ where δ is a small value, a new element xδ ∈ E1 will minimize the functional

$$R_\delta(x) = \|Ax - y_\delta\|_{E_2} \tag{38.12}$$

However, the new solution xδ is not necessarily close to the first solution x∗ even if δ tends to zero. In other words, limδ→0 ‖x∗ − xδ‖E1 = 0 is not guaranteed when the operator equation Ax = y is ill-posed.

3 To save notation, we write ‖a − b‖E to denote the distance between the two elements a, b ∈ E whether the metric space E is a normed space or not. If E is a normed space, our notation is self-evident; otherwise, it should be interpreted only as a symbol for the distance between a and b.
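The instability exhibited in Example 38.1 survives discretization, so it can be observed directly in a few lines of code. The smooth kernel, grid size, and perturbation below are arbitrary choices made for illustration.

```python
# Numerical illustration of Example 38.1: a discretized Fredholm operator of
# the first kind amplifies a tiny high-frequency perturbation of y enormously.
# The kernel, grid, and perturbation frequency are arbitrary illustrative choices.
import numpy as np

m = 200
t = np.linspace(0.0, 1.0, m)
dt = t[1] - t[0]
A = np.exp(-((t[:, None] - t[None, :]) ** 2) / 0.02) * dt   # smooth kernel K(s,t)

x_true = np.sin(2 * np.pi * t)            # a smooth "cause"
y = A @ x_true                            # the observed "effect"

g = 1e-6 * np.sin(500.0 * t)              # tiny high-frequency perturbation g_w
x_hat = np.linalg.solve(A, y + g)         # naive inversion of A x = y + g_w

print("perturbation size:", np.max(np.abs(g)))               # 1e-6
print("solution error   :", np.max(np.abs(x_hat - x_true)))  # enormous
print("cond(A)          :", np.linalg.cond(A))               # astronomically large
```

The regularization methods introduced next are precisely the tools that restore stability to such inversions.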
38.3.2 Regularization Methods for Solving Ill-Posed Linear Operator Equations

Hadamard [15] thought that ill-posed problems are a pure mathematical phenomenon and that all real-life problems are well-posed. However, in the second half of the 20th century, a number of very important real-life problems were found to be ill-posed. In particular, as we just discussed, ill-posed problems arise when one tries to reverse cause–effect relations to find unknown causes from known consequences. Even if the cause–effect relationship forms a one-to-one mapping, the problem of inverting it can be
ill-posed. The discovery of various regularization methods by Tikhonov, Ivanov, and Phillips in the early 60s made it possible to construct a sequence of well-posed solutions that converges to the desired one. Regularization theory was one of the first signs of the existence of intelligent inference. It demonstrated that where the "self-evident" methods of solving an operator equation might not work, the "non-self-evident" methods of regularization theory do. The influence of the philosophy created by the theory of regularization is very deep. Both the regularization philosophy and the regularization techniques became widely disseminated in many areas of science and engineering [10,11].

38.3.2.1 Tikhonov's Method

In the early 60s, it was discovered by A.N. Tikhonov [16,17] that if, instead of the functional Rδ(x), one minimizes

$$R_{\text{reg}}(x) = \|Ax - y_\delta\|_{E_2} + \xi(\delta)\, S(x) \tag{38.13}$$

where S(x) is a stabilizing functional (that belongs to a certain class of functionals) and ξ(δ) is an appropriately chosen constant (whose value depends on the noise level δ), then one obtains a sequence of solutions xδ that converges to the desired one as δ tends to zero. For the above result to be valid, it is required that:

1. The problem of minimizing Rreg(x) be well-posed for fixed values of δ and ξ(δ).
2. ‖x∗ − xδ‖E1 → 0 as δ → 0 when ξ(δ) is chosen appropriately.

Consider a real-valued lower semicontinuous4 functional S(x). We shall call S(x) a stabilizing functional if it possesses the following properties:

1. The solution of the operator equation Ax = y belongs to the domain of definition D(S) of the functional S.
2. S(x) ≥ 0, ∀x ∈ D(S).
3. The level sets {x: S(x) ≤ c}, c = const., are all compact.

It turns out that the above conditions are sufficient for the problem of minimizing Rreg(x) to be well-posed [8, page 51]. Now, the important remaining problem is to determine the functional relationship between δ and ξ(δ) such that the sequence of solutions obtained by minimizing (38.13) converges to the solution of (38.11) as δ tends to zero. The following theorem establishes sufficient conditions on such a relationship:

Theorem 38.1 [13, page 55] Let E1 and E2 be two metric spaces and let A: E1 → E2 be a continuous and one-to-one operator. Suppose that for y ∈ E2 there exists a solution x ∈ D(S) ⊂ E1 to the operator equation Ax = y. Let yδ be an element in E2 such that ‖y − yδ‖E2 ≤ δ. If the parameter ξ(δ) is chosen such that:

(i) ξ(δ) → 0 when δ → 0,
(ii) limδ→0 δ²/ξ(δ) < ∞,

then the elements xδ ∈ D(S) minimizing the functional Rreg(x) = ‖Ax − yδ‖E2 + ξ(δ)S(x) converge to the exact solution x as δ → 0.

If E1 is a Hilbert space, the stabilizing functional S(x) may simply be chosen as ‖x‖², which, indeed, is the original choice made by Tikhonov. In this case, the level sets of S(x) will only be weakly compact. However, the convergence of the regularized solutions will be a strong one in view of the properties of
Hilbert spaces. The conditions imposed on the parameter ξ(δ) are, nevertheless, more stringent than those stated in the above theorem.⁵

⁵In this case, ξ(δ) should converge to zero strictly slower than δ². In more precise terms, lim_{δ→0} δ²/ξ(δ) = 0 must hold.

38.3.2.2 The Residual Method

The results presented above are fundamental to Tikhonov's theory of regularization. Tikhonov's theory, however, is only one of several proposed schemes for solving ill-posed problems. An important variation known as the residual method was introduced by Phillips [18]. In Phillips's method one minimizes the functional

R_P(x) = S(x)

subject to the constraint

||Ax − y_δ||_{E2} ≤ µ

where µ is a fixed constant. The stabilizing functional S(x) is defined as in Section 38.3.2.1.

38.3.2.3 The Quasi-Solution Method

The quasi-solution method was developed by Ivanov [19,20]. In this method, one minimizes the functional
R_I(x) = ||Ax − y_δ||_{E2}

subject to the constraint

S(x) ≤ σ

where σ is a fixed constant. Again, the stabilizing functional S(x) is defined as in Tikhonov's method.

Note that the three regularization methods mentioned each contain one free parameter (ξ in Tikhonov's method, µ in Phillips's method, and σ in Ivanov's method). It has been shown [21] that these methods are all equivalent in the sense that if one of the methods (say Phillips's) for a given value of its parameter (say µ*) produces a solution x*, then there exist corresponding values of the parameters of the other two methods that produce the same solution. We remark in passing that a smart choice of the free parameter is crucial in obtaining a good (fast converging) solution using any of the regularization methods mentioned. There exist several principles for choosing the free parameter in an optimal fashion [10, section 4.3; 11, chapter 2].
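To make Tikhonov's method concrete in the simplest discrete setting, suppose A is a matrix, E1 = E2 = R^n with Euclidean norms, and S(x) = ||x||². Minimizing (38.13) then reduces to solving the normal equations (AᵀA + ξI)x = Aᵀy_δ. The following Python sketch illustrates this on a hypothetical, badly conditioned operator; the matrix, data, and noise level are illustrative assumptions only:

import numpy as np

rng = np.random.default_rng(0)

# A hypothetical ill-conditioned operator: a discretized smoothing kernel.
n = 20
t = np.linspace(0.0, 1.0, n)
A = np.exp(-50.0 * (t[:, None] - t[None, :]) ** 2)

x_true = np.sin(2.0 * np.pi * t)            # the unknown "cause"
delta = 1e-3                                # noise level
y_delta = A @ x_true + delta * rng.standard_normal(n)

def tikhonov(A, y, xi):
    # Minimizer of ||A x - y||^2 + xi * ||x||^2 via the normal equations.
    return np.linalg.solve(A.T @ A + xi * np.eye(A.shape[1]), A.T @ y)

for xi in (1e-12, 1e-8, 1e-4, 1e-1):
    x_hat = tikhonov(A, y_delta, xi)
    print(f"xi = {xi:7.0e}   error = {np.linalg.norm(x_hat - x_true):.3e}")

With ξ too small the noise is amplified by the near-singular operator, while with ξ too large the solution is over-smoothed; this is precisely why the coupling between δ and ξ(δ) demanded by Theorem 38.1 matters.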
38.4 Spectrum Estimation Using Generalized Projections

The sensor network spectrum estimation problem (Problem 38.1) posed in Section 38.2.2 is essentially that of finding a P(e^{jω}) in the intersection of the feasible sets Q_{i,k}. It is easy to verify that the sets Q_{i,k} are closed and convex [7]. The problem of finding a point in the intersection of finitely many closed convex sets is known as the convex feasibility problem and is an active area of research in applied mathematics. An elegant way to solve a convex feasibility problem is to employ a series of generalized projections [22]. A generalized projection is essentially a regularization method with a generalized distance serving as the stabilizing functional. A great advantage of using the generalized projections formulation is that the solution P* ∈ Q can be found using a series of projections onto the intermediate sets Q_{i,k}. These intermediate projections can be computed locally at each sensor node, thus allowing the computations to be done simultaneously and in a highly distributed fashion.

A generalized distance is a real-valued nonnegative function of two vector variables D(X, Y) defined in a specific way such that its value may represent the distance between X and Y in some generalized sense. When defining generalized distances, it is customary not to require the symmetry condition.
Thus, D(X, Y) may not be the same as D(Y, X). Moreover, we do not insist on the triangle inequality that a traditional metric must obey either.

Example 38.2 Let P1(e^{jω}) > 0 and P2(e^{jω}) > 0 be two power spectra in L1(−π, π). The functions

D1(P1, P2) = ∫_{−π}^{π} (P1 − P2)² dω

D2(P1, P2) = ∫_{−π}^{π} ( P1 ln(P1/P2) + P2 − P1 ) dω

D3(P1, P2) = ∫_{−π}^{π} ( P1/P2 − ln(P1/P2) − 1 ) dω

can be used to measure the generalized distance between P1(e^{jω}) and P2(e^{jω}). These functions are nonnegative and become zero if and only if P1 = P2. Note that D1 is simply the Euclidean distance between P1 and P2. The functions D2 and D3 have roots in information theory and statistics. They are known as the Kullback–Leibler divergence and the Burg cross entropy, respectively. ♦

By using a suitable generalized distance, we can convert our original sensor network spectrum estimation problem (Problem 38.1) into the following minimization problem:

Problem 38.2 Let Q be defined as in Problem 38.1. Find P_x*(e^{jω}) in Q such that

P* = arg min_{P ∈ Q} D(P, P0)    (38.14)
where P0(e^{jω}) is an arbitrary power spectrum, say P0(e^{jω}) = 1, −π ≤ ω < π.

When a unique P* exists, it is called the generalized projection of P0 onto Q [23]. In general, a projection of a given point onto a convex set is defined as another point which has two properties: first, it belongs to the set onto which the projection operation is performed and, second, it renders a minimal value to the distance between the given point and any point of the set (Figure 38.4). If the Euclidean distance ||X − Y|| is used in this context, then the projection is called a metric projection. In some cases, such as the spectrum estimation problem considered here, it turns out to be very useful to introduce more general means to measure the "distance" between two vectors. The main reason is that the functional form of the solution will depend on the choice of the distance measure used in the projection. Often, a functional form which is easy to manipulate or interpret (for instance, a rational function) cannot be obtained using the conventional Euclidean metric.

It can be shown that the distances D1 and D2 in Example 38.2 lead to well-posed solutions for P*. The choice D3 will lead to a unique solution given that certain singular power spectra are excluded from the space of valid solutions [24]. It is not known whether D3 will lead to a stable solution. As a result, the well-posedness of Problem 38.2 when D3 is used is not yet established.⁶

⁶Well-posedness of the minimization problem (38.14) when D is the Kullback–Leibler divergence D2 has been established in several works including References 25 to 29. Well-posedness results exist for certain classes of generalized distance functions as well [29,30]. Unfortunately, the Burg cross entropy D3 does not belong to any of these classes. While the Burg cross entropy lacks theoretical support as a regularizing functional, it has been used successfully to resolve ill-posed problems in several applications including spectral estimation and image restoration (see, e.g., [31] and references therein). The desirable feature of the Burg cross entropy in the context of spectrum estimation is that its minimization (subject to the linear constraints P_x*(e^{jω}) ∈ Q) leads to rational power spectra.
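As a quick numerical illustration of the three distance functions of Example 38.2, the following Python sketch evaluates grid-based approximations of D1, D2, and D3 for two arbitrary example spectra; it also shows that D2 is not symmetric in its arguments:

import numpy as np

w = np.linspace(-np.pi, np.pi, 4096, endpoint=False)
dw = w[1] - w[0]

def d1(p, q):   # D1: Euclidean distance criterion of Example 38.2
    return np.sum((p - q) ** 2) * dw

def d2(p, q):   # D2: Kullback-Leibler divergence
    return np.sum(p * np.log(p / q) + q - p) * dw

def d3(p, q):   # D3: Burg cross entropy
    return np.sum(p / q - np.log(p / q) - 1.0) * dw

P1 = 1.0 / (1.3 + np.cos(w))     # two hypothetical positive spectra
P2 = np.ones_like(w)
print(d1(P1, P2), d2(P1, P2), d3(P1, P2))
print(d2(P2, P1))                # generally differs from d2(P1, P2)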
FIGURE 38.4 Symbolic depiction of metric projection (a) and generalized projection (b) of a vector Y into a closed convex set Q. In (a) the projection X ∗ is selected by minimizing the metric ||X − Y || over all X ∈ Q while in (b) X ∗ is found by minimizing the generalized distance D(X , Y ) over the same set.
38.5 Distributed Algorithms for Calculating Generalized Projections

As we mentioned before, a very interesting aspect of the generalized projections formulation is that the solution P* ∈ Q can be found using a series of projections onto the intermediate sets Q_{i,k}. In this section, we first calculate the generalized projection of a given power spectrum onto the sets Q_{i,k} for the sample distance functions introduced in Example 38.2. Then, we propose distributed algorithms for calculating the final solution P* from these intermediate projections.

Let P_{[P1→Q_{i,k}; Dj]} denote the power spectrum resulting from projecting a given power spectrum P1 onto the set Q_{i,k} using a given distance function Dj. That is,
P_{[P1→Q_{i,k}; Dj]} = arg min_{P ∈ Q_{i,k}} Dj(P, P1)    (38.15)
Using standard techniques from the calculus of variations, we can show that the generalized distances D1, D2, and D3 introduced in Example 38.2 result in projections of the form

P_{[P1→Q_{i,k}; D1]} = P1(e^{jω}) − α Gi(e^{jω}) cos(Mkω)

P_{[P1→Q_{i,k}; D2]} = P1(e^{jω}) exp(−β Gi(e^{jω}) cos(Mkω))

P_{[P1→Q_{i,k}; D3]} = ( P1(e^{jω})^{−1} + γ Gi(e^{jω}) cos(Mkω) )^{−1}

where α, β, and γ are parameters (Lagrange multipliers). These parameters should be chosen such that in each case P_{[P1→Q_{i,k}; Dj]} ∈ Q_{i,k}. That is,

(1/2π) ∫_{−π}^{π} P_{[P1→Q_{i,k}; Dj]} Gi(e^{jω}) e^{jMkω} dω = R_{vi}(k)    (38.16)
The reader may observe that the above equation leads to a closed-form formula for α but in general finding β and γ requires numerical methods. The projection formulae developed above can be employed in a variety of iterative algorithms to find a solution in the intersection of Qi,k . We discuss two example algorithms below.
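Before turning to these algorithms, a short numerical sketch of the D1 projection itself may be helpful. Substituting the first projection formula into (38.16) gives the multiplier in closed form: α = (R̂ − R_{vi}(k))/C, where R̂ is the left-hand side of (38.16) evaluated at P1 and C is the same functional evaluated at Gi(e^{jω}) cos(Mkω). The Python sketch below evaluates this on a frequency grid; the example response Gi and the target value are hypothetical:

import numpy as np

# Frequency grid over [-pi, pi); all spectra are sampled on this grid.
W = np.linspace(-np.pi, np.pi, 4096, endpoint=False)
DW = W[1] - W[0]

def constraint(P, G, M, k):
    # Left-hand side of (38.16); real part, since the spectra are real and even.
    return float(np.real(np.sum(P * G * np.exp(1j * M * k * W))) * DW / (2.0 * np.pi))

def project_D1(P1, G, M, k, R_target):
    # Euclidean (D1) projection onto Q_{i,k}: P1 - alpha * G * cos(M k w),
    # with alpha chosen so that the constraint (38.16) holds exactly.
    d = G * np.cos(M * k * W)
    alpha = (constraint(P1, G, M, k) - R_target) / constraint(d, G, M, k)
    return P1 - alpha * d

G = 1.0 / (1.2 + np.cos(W))          # hypothetical squared sensor response G_i
P1 = np.ones_like(W)                 # spectrum to be projected
P_new = project_D1(P1, G, M=4, k=1, R_target=0.05)
print(constraint(P_new, G, 4, 1))    # ~0.05, i.e., P_new lies in Q_{i,1}

For D2 and D3 no such closed form exists, and β or γ must be found by a scalar root search (e.g., bisection) on the same constraint equation, as noted above.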
38.5.1 The Ring Algorithm

The Ring Algorithm is a very simple algorithm: it starts with an initial guess P0 for P_x(e^{jω}) and then calculates a series of successive projections onto the constraint sets Q_{i,k}. Then, it takes the last projection, now called P^{(1)}, and projects it back onto the first constraint set. Continuing this process will generate a sequence of solutions P^{(0)}, P^{(1)}, P^{(2)}, ..., which will eventually converge to a solution P* ∈ ∩_{i,k} Q_{i,k} [22]. Steps of the Ring Algorithm are summarized in the text box below, and a graphical representation of the algorithm is shown in Figure 38.5.
The Ring Algorithm

Input: A distance function Dj(P1, P2), an initial power spectrum P0(e^{jω}), the squared sensor frequency responses Gi(e^{jω}), and the autocorrelation estimates R_{vi}(k) for k = 0, 1, ..., L − 1 and i = 1, 2, ..., N.
Output: A power spectrum P*(e^{jω}).
Procedure:
1. Let m = 0, i = 1, and P^{(m)} = P0.
2. Send P^{(m)} to the ith sensor node. At the ith sensor:
   (i) Let k = 0 and define P̃_0 = P^{(m)}.
   (ii) Calculate P̃_k = P_{[P̃_{k−1}→Q_{i,k}; Dj]} for k = 1, 2, ..., L − 1.
   (iii) If D(P̃_{L−1}, P̃_0) > ε then let P̃_0 = P̃_{L−1} and go back to item (ii). Otherwise, let i = i + 1 and go to Step 3.
3. If (i mod N) = 1 then set m = m + 1 and reset i to 1. Otherwise, set P^{(m)} = P̃_{L−1} and go back to Step 2.
4. Define P^{(m)} = P̃_{L−1}. If D(P^{(m)}, P^{(m−1)}) > ε, go back to Step 2. Otherwise, output P* = P^{(m)} and stop.
Example 38.3 Consider a simple four-sensor network similar to the one shown in Figure 38.5. Assume that the down-sampling ratio in each Mote is equal to four. Thus, N0 = N1 = N2 = N3 = 4. Assume, further, that the transfer functions H0(z) to H3(z), which relate the Motes' front-end outputs vi(n) to the original source signal x(n), are given as follows:

H0(z) = (0.0753 + 0.1656 z^{−1} + 0.2053 z^{−2} + 0.1659 z^{−3} + 0.0751 z^{−4}) / (1.0000 − 0.8877 z^{−1} + 0.6738 z^{−2} − 0.1206 z^{−3} + 0.0225 z^{−4})

H1(z) = (0.4652 − 0.1254 z^{−1} − 0.3151 z^{−2} + 0.0975 z^{−3} − 0.0259 z^{−4}) / (1.0000 − 0.6855 z^{−1} + 0.3297 z^{−2} − 0.0309 z^{−3} + 0.0032 z^{−4})

H2(z) = (0.3732 − 0.8648 z^{−1} + 0.7139 z^{−2} − 0.1856 z^{−3} − 0.0015 z^{−4}) / (1.0000 − 0.5800 z^{−1} + 0.5292 z^{−2} − 0.0163 z^{−3} + 0.0107 z^{−4})

H3(z) = (0.1931 − 0.4226 z^{−1} + 0.3668 z^{−2} − 0.0974 z^{−3} − 0.0405 z^{−4}) / (1.0000 + 0.2814 z^{−1} + 0.3739 z^{−2} + 0.0345 z^{−3} − 0.0196 z^{−4})
The above transfer functions were chosen to show typical low-pass, band-pass, and high-pass characteristics (Figure 38.6). They were obtained using standard filter design techniques. The input signal whose power spectrum is to be estimated was chosen to have a smooth low-pass spectrum. We used the Ring Algorithm with L = 4 and the Euclidean metric D1 as the distance function to estimate the input signal’s spectrum. The results are shown in Figure 38.7. As seen in this figure, the algorithm converges to a solution which is in this case almost identical to the actual input spectrum in less than 100 rounds. ♦
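As an illustration of how such rounds proceed, the following Python sketch simulates the Ring Algorithm with the Euclidean distance D1 on a frequency grid. The node responses, the autocorrelation targets, and all helper names are hypothetical stand-ins for the quantities defined above, and the termination tests simply use the grid-sampled D1 distance:

import numpy as np

W = np.linspace(-np.pi, np.pi, 2048, endpoint=False)
DW = W[1] - W[0]

def c_fun(P, G, M, k):
    # (1/2pi) * integral of P * G * e^{jMkw} dw, cf. (38.16)
    return float(np.real(np.sum(P * G * np.exp(1j * M * k * W))) * DW / (2.0 * np.pi))

def project_D1(P, G, M, k, R):
    d = G * np.cos(M * k * W)
    return P - (c_fun(P, G, M, k) - R) / c_fun(d, G, M, k) * d

def ring_algorithm(G_list, R, M=4, L=4, eps=1e-10, max_rounds=100):
    P = np.ones_like(W)                      # initial guess P0
    for m in range(max_rounds):
        P_round = P.copy()
        for i, G in enumerate(G_list):       # pass the estimate around the ring
            while True:                      # node i projects onto its L sets
                P_in = P.copy()
                for k in range(L):
                    P = project_D1(P, G, M, k, R[i][k])
                if np.sum((P - P_in) ** 2) * DW <= eps:
                    break
        if np.sum((P - P_round) ** 2) * DW <= eps:
            return P, m + 1                  # converged after m+1 rounds
    return P, max_rounds

# Hypothetical two-node setup whose targets are consistent by construction:
G_list = [1.0 / (1.5 + np.cos(W)), 1.0 / (1.5 - np.cos(W))]
P_true = 1.0 / (1.3 + np.cos(W))
R = [[c_fun(P_true, G, 4, k) for k in range(4)] for G in G_list]
P_est, rounds = ring_algorithm(G_list, R)
resid = max(abs(c_fun(P_est, G, 4, k) - R[i][k])
            for i, G in enumerate(G_list) for k in range(4))
print(f"{rounds} rounds, max constraint residual {resid:.2e}")

Note that the D1 projection does not by itself enforce positivity of the spectrum; this is one reason why the choice of distance function matters in practice.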
38.5.2 The Star Algorithm

The Ring Algorithm is completely decentralized. However, it will not converge to a solution if the feasible sets Q_{i,k} do not have an intersection (which can happen owing to measurement noise) or one or more
FIGURE 38.5 Graphical depiction of the Ring Algorithm. For illustrative reasons, only three feasible sets Qi,k are shown in the inside picture. Also, it is shown that the output spectrum P (m) (ejω ) is obtained from the input P (m) (ejω ) only after three projections. In practice, each sensor node has L feasible sets and has to repeat the sequence of projections many times before it can successfully project the input P (m) (ejω ) into the intersection of its feasible sets.
FIGURE 38.6 Frequency response amplitude of the transfer functions used in Example 38.3. The curves show, from left to right, |H0 (ejω )|, |H1 (ejω )|, |H2 (ejω )|, and |H3 (ejω )|.
sensors in the network are faulty. The Star Algorithm is an alternative distributed algorithm for fusing individual sensors’ data. It combines successive projections onto Qi,k with a kind of averaging operation to generate a sequence of solutions P (m) . This sequence will eventually converge to a solution P ∗ ∈ ∩i,k Qi,k if one exists. The Star Algorithm is fully parallel and hence much faster than the Ring Algorithm. It provides
FIGURE 38.7 Ring Algorithm convergence results (the six panels plot the power spectrum estimate P^{(m)}(ω) against radian frequency ω for m = 0, 1, 4, 10, 20, and 100). In each panel, the dashed curve shows the source signal's actual power spectrum while the solid curve is the estimate obtained by the Ring Algorithm after m rounds. A "round" means projections have been passed through all the nodes in the network.
some degree of robustness to individual node failures as well. However, it includes a centralized step which needs to be accounted for when the system's network protocol is being designed. Steps of the Star Algorithm are summarized in the text box below, and a graphical representation of the algorithm is shown in Figure 38.8.
FIGURE 38.8 The Star Algorithm. Again, only three feasible sets Q_{i,k} are shown in the inside picture. In practice, each sensor node has to repeat the sequence of projections and averaging many times before it can successfully project the input P^{(m)}(e^{jω}) supplied by the central node into the intersection of its feasible sets. The projection result, which is called P_i^{(m)}(e^{jω}), is sent back to the central node. The central node then averages all the P_i^{(m)}(e^{jω}) it has received to produce P^{(m+1)}(e^{jω}). This is sent back to the individual nodes and the process repeats.
The Star Algorithm

Input: A distance function Dj(P1, P2), an initial power spectrum P0(e^{jω}), the squared sensor frequency responses Gi(e^{jω}), and the autocorrelation estimates R_{vi}(k).
Output: A power spectrum P*(e^{jω}).
Procedure:
1. Let m = 0 and P^{(0)} = P0.
2. Send P^{(m)} to all sensor nodes. At the ith sensor:
   (i) Let n = 0 and define P̃^{(n)} = P^{(m)}.
   (ii) Calculate P̃_k = P_{[P̃^{(n)}→Q_{i,k}; Dj]} for all k.
   (iii) Calculate P̃^{(n+1)} = arg min_P Σ_k D(P, P̃_k).
   (iv) If D(P̃^{(n+1)}, P̃^{(n)}) > ε, let n = n + 1 and go to item (ii). Otherwise, define P_i^{(m)} = P̃^{(n+1)} and send it to the central unit.
3. Receive P_i^{(m)} from all sensors and calculate P^{(m+1)} = arg min_P Σ_i D(P, P_i^{(m)}).
4. If D(P^{(m+1)}, P^{(m)}) > ε, let m = m + 1 and go to Step 2. Otherwise, stop and output P* = P^{(m+1)}.
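For the Euclidean distance D1, the averaging steps 2(iii) and 3 have a simple closed form: the minimizer of Σ_k ||P − P̃_k||² is the pointwise mean of the P̃_k. A minimal Python sketch of the central node's fusion step under this (assumed) choice of distance, with invented node results:

import numpy as np

def fuse_D1(spectra):
    # arg min_P sum_i ||P - P_i||^2 is the pointwise average of the P_i.
    return np.mean(np.asarray(spectra), axis=0)

# Hypothetical per-node results P_i^{(m)} sampled on a common frequency grid:
grid = np.linspace(-np.pi, np.pi, 8, endpoint=False)
node_results = [1.0 + 0.1 * np.cos(grid),
                1.0 - 0.1 * np.cos(grid),
                np.ones_like(grid)]
print(fuse_D1(node_results))     # the next broadcast estimate P^{(m+1)}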
FIGURE 38.9 Star Algorithm results (the six panels plot the power spectrum estimate P^{(m)}(ω) against radian frequency ω for m = 0, 1, 4, 10, 20, and 100).
Example 38.4 Consider a simple five-sensor network similar to the one shown in Figure 38.8. Assume that the down-sampling ratio in each Mote is equal to four. Thus, again, N0 = N1 = N2 = N3 = 4. Assume, further, that the transfer functions H0 (z) to H3 (z) which relate the Motes’ front-end output vi (n) to the original source signal x(n) are the same as those introduced in Example 38.3. We simulated the Star Algorithm with L = 4 and the Euclidean metric D1 as the distance function to estimate the input signal’s spectrum. The results are shown in Figure 38.9. Like the Ring Algorithm, the Star Algorithm also converges to a solution which is almost identical to the actual input spectrum in less than 100 rounds. ♦
38.6 Concluding Remarks

In this chapter we considered the problem of fusing the statistical information gained by a distributed network of sensors. We provided a rigorous mathematical model for this problem where the solution is obtained by finding a point in the intersection of finitely many closed convex sets. We investigated distributed optimization algorithms to solve the problem without exchanging the raw observed data among the sensors.

The information fusion theory presented in this chapter is by no means complete. Many issues regarding both the performance and implementation of the two algorithms we introduced need to be investigated. Other algorithms for solving the problem of finding the solution in the intersection of the feasible sets are possible as well. We hope that our results point the way toward more complete theories and help to give shape to the emerging field of signal processing for sensor networks. MATLAB codes implementing the algorithms mentioned in this chapter are maintained online at www.multirate.org.
Acknowledgments

The authors would like to thank Mr. Mayukh Roy for his help in drawing some of the figures. They are also very grateful to the Editor, Dr. Richard Zurawski, for his patience and cooperation during the long process of writing this chapter.
References

[1] H. Jeffreys, Theory of Probability, 3rd ed., Oxford University Press, London, 1967.
[2] R. von Mises, Mathematical Theory of Probability and Statistics, Academic Press, New York, 1964.
[3] S.M. Kay, Modern Spectrum Estimation: Theory and Applications, Prentice Hall, Upper Saddle River, NJ, 1988.
[4] D.B. Percival and A.T. Walden, Statistical Digital Signal Processing and Modeling, Cambridge University Press, London, 1993.
[5] M.H. Hayes, Statistical Signal Processing and Modeling, John Wiley and Sons, New York, 1996.
[6] B. Buttkus, Spectral Analysis and Filter Theory in Applied Geophysics, Springer-Verlag, Berlin, 2000.
[7] O.S. Jahromi, B.A. Francis, and R.H. Kwong, Spectrum estimation using multirate observations, IEEE Transactions on Signal Processing, 52(7), 1878–1890, July 2004. Preprint available from www.multirate.org.
[8] A.N. Tikhonov and V.Y. Arsenin, Solutions of Ill-Posed Problems, V.H. Winston & Sons, Washington, DC, 1977.
[9] V.V. Vasin and A.L. Ageev, Ill-Posed Problems with A Priori Information, VSP, Utrecht, The Netherlands, 1995.
[10] H.W. Engl, M. Hanke, and A. Neubauer, Regularization of Inverse Problems, Kluwer Academic Publishers, Dordrecht, The Netherlands, 1996.
[11] A.N. Tikhonov, A.S. Leonov, and A.G. Yagola, Nonlinear Ill-Posed Problems, Chapman & Hall, London, 1998, 2 vols.
[12] K. Chadan, D. Colton, L. Päivärinta, and W. Rundell, An Introduction to Inverse Scattering and Inverse Spectral Problems, SIAM, Philadelphia, 1997.
[13] V. Vapnik, Statistical Learning Theory, John Wiley & Sons, New York, 1999.
[14] F. Jones, Lebesgue Integration on Euclidean Space, Jones and Bartlett Publishers, Boston, MA, 1993.
[15] J. Hadamard, Lectures on Cauchy's Problem in Linear Partial Differential Equations, Yale University Press, New Haven, CT, 1923.
[16] A.N. Tikhonov, On solving ill-posed problems and the method of regularization, Doklady Akademii Nauk SSSR, 151, 501–504, 1963 (in Russian); English translation in Soviet Math. Dokl.
[17] A.N. Tikhonov, On the regularization of ill-posed problems, Doklady Akademii Nauk SSSR, 153, 49–52, 1963 (in Russian); English translation in Soviet Math. Dokl.
[18] D.L. Phillips, A technique for numerical solution of certain integral equations of the first kind, Journal of the Association for Computing Machinery, 9, 84–97, 1962.
[19] V.K. Ivanov, Integral equations of the first kind and the approximate solution of an inverse potential problem, Doklady Akademii Nauk SSSR, 142, 997–1000, 1962 (in Russian); English translation in Soviet Math. Dokl.
[20] V.K. Ivanov, On linear ill-posed problems, Doklady Akademii Nauk SSSR, 145, 270–272, 1962 (in Russian); English translation in Soviet Math. Dokl.
[21] V.V. Vasin, Relationship of several variational methods for approximate solutions of ill-posed problems, Mathematical Notes, 7, 161–166, 1970.
[22] Y. Censor and S.A. Zenios, Parallel Optimization: Theory, Algorithms, and Applications, Oxford University Press, Oxford, 1997.
[23] H.H. Bauschke and J.M. Borwein, On projection algorithms for solving convex feasibility problems, SIAM Review, 38, 367–426, 1996.
[24] J.M. Borwein and A.S. Lewis, Partially-finite programming in L1 and the existence of maximum entropy estimates, SIAM Journal on Optimization, 3, 248–267, 1993.
[25] M. Klaus and R.T. Smith, A Hilbert space approach to maximum entropy regularization, Mathematical Methods in Applied Sciences, 10, 397–406, 1988.
[26] U. Amato and W. Hughes, Maximum entropy regularization of Fredholm integral equations of the first kind, Inverse Problems, 7, 793–808, 1991.
[27] J.M. Borwein and A.S. Lewis, Convergence of best maximum entropy estimates, SIAM Journal on Optimization, 1, 191–205, 1991.
[28] P.P.B. Eggermont, Maximum entropy regularization for Fredholm integral equations of the first kind, SIAM Journal on Mathematical Analysis, 24, 1557–1576, 1993.
[29] M. Teboulle and I. Vajda, Convergence of best φ-entropy estimates, IEEE Transactions on Information Theory, 39(1), 297–301, 1993.
[30] A.S. Leonov, A generalization of the maximal entropy method for solving ill-posed problems, Siberian Mathematical Journal, 41, 716–724, 2000.
[31] N. Wu, The Maximum Entropy Method, Springer-Verlag, Berlin, 1997.
39 Sensor Network Security
Guenter Schaefer Fachgebiet Telematik/Rechnernetze Technische Universitaet Ilmenau Berlin
39.1 Introduction and Motivation
39.2 DoS and Routing Security
39.3 Energy Efficient Confidentiality and Integrity
39.4 Authenticated Broadcast
39.5 Alternative Approaches to Key Management
39.6 Secure Data Aggregation
39.7 Summary
References
This chapter gives an introduction to the specific security challenges in wireless sensor networks and some of the approaches to overcome them that have been proposed so far. As this area of research is very active at the time of writing, it is to be expected that more approaches are going to be proposed as the field gets more mature, so this chapter should be understood as a snapshot rather than a definitive account of the field.

When thinking of wireless sensor network security, one major question that comes to mind is: what are the differences between security in sensor networks and "general" network security? Well, in both cases one usually aims to ensure certain security objectives (also called security goals). In general, the following objectives are pursued: authenticity of communicating entities and messages (data integrity), confidentiality, controlled access, availability of communication services, and nonrepudiation of communication acts [1]. And basically, these are the same objectives that need to be ensured also in wireless sensor networks (with maybe the exception of nonrepudiation, which is of less interest at the level on which sensor networks operate). Also, in both cases cryptographic algorithms and protocols [2] are the main tools to be deployed for ensuring these objectives. So, from a high-level point of view, one could come to the conclusion that sensor network security does not add much to what we already know from network security in general, and thus the same methods could be applied in sensor networks as in classical fixed or wireless networks.

However, closer consideration reveals various differences that have their origins in specific characteristics of wireless sensor networks, so that a straightforward application of known techniques is not appropriate. In this chapter we, therefore, first point out these characteristics and give an overview of the specific threats and security challenges in sensor networks. The remaining sections of the chapter then deal in more detail with the identified challenges, which are: Denial of Service (DoS) and routing security, energy efficient confidentiality and integrity, authenticated broadcast, alternative approaches to key management, and secure data aggregation.
39.1 Introduction and Motivation

The main characteristics of wireless sensor networks can be summarized by saying that they are envisaged to be:

• Formed by tens to thousands of small, inexpensive sensors that communicate over a wireless interface.
• Connected via base stations to traditional networks/hosts running applications interested in the sensor data.
• Using multi-hop communications among sensors in order to bridge the distance between sensors and base stations.
• Considerably resource constrained owing to limited energy availability.

To get an impression of the processing capabilities of a wireless sensor node, one should have the following example of a sensor node in mind: a node running an 8-bit CPU at 4 MHz clock frequency, with 4 KB free of 8 KB flash read-only memory, 512 bytes of SRAM main memory, and a 19.2 Kbit/sec radio interface, the node being powered by battery. Typical applications envisaged for wireless sensor networks are environment monitoring (earthquake or fire detection, etc.), home monitoring and convenience applications, site surveillance (intruder detection), logistics and inventory applications (tagging and locating goods, containers, etc.), as well as military applications (battleground reconnaissance, troop coordination, etc.).

The fundamental communication pattern to be used in such a network consists of an application demanding some named information in a specific geographical area. Upon this request, one or more base stations broadcast the request, and wireless sensors relay the request and generate answers to it if they contribute to the requested information. The answers are then processed and aggregated as they flow through the network toward the base station(s). Figure 39.1 shows an exemplary sensor network topology as currently envisaged for such applications.

The sensor network itself consists of one or more base stations that may be able to communicate among each other over some high-bandwidth link (e.g., IEEE 802.11). The base stations furthermore communicate with sensor nodes over a low-bandwidth link. As not all sensor nodes can communicate directly with the base station, multi-hop communication is used in the sensor network to relay queries or commands sent
FIGURE 39.1 A general sensor network topology example.
by the base station to all sensors, as well as to send back the answers from sensor nodes to the base station. If multiple sensors contribute to one query, partial results may be aggregated as they flow toward the base station. In order to communicate results or report events to an application residing outside the sensor network, one or more base stations may be connected to a classical infrastructure network.

As the above description already points out, there are significant differences between wireless sensor and so-called ad hoc networks, to which they are often compared. Both types of networks can be differentiated more specifically by considering the following characteristics [3]:

• Sensor networks show distinctive application-specific characteristics; for example, depending on its application, a sensor network might be very sparse or dense.
• The interaction of the network with its environment may cause rather bursty traffic patterns. Consider, for example, a sensor network deployed for detecting/predicting earthquakes or for fire detection. Most of the time, there will be little traffic, but if an incident happens the traffic load will increase heavily.
• The scale of sensor networks is expected to vary from tens to thousands of sensors.
• Energy is even more scarce than in ad hoc networks, as sensors will be either battery powered or powered by environmental phenomena (e.g., vibration).
• Self-configurability will be an important feature of sensor networks. While this requirement also exists for ad hoc networks, its importance is even higher in sensor networks, as, for example, human interaction during configuration might be prohibitive, the geographic position of sensor nodes has to be learnt, etc.
• Regarding dependability and Quality-of-Service (QoS), classical QoS notions such as throughput, jitter, etc., are of little interest in sensor networks, as the main requirement in such networks is the plain delivery of requested information, and most envisaged applications only pose low-bandwidth requirements.
• As sensor networks follow a data-centric model, sensor identities are of little interest, and new addressing schemes, for example, based on semantics or geography, are more appealing.
• The required simplicity of sensor nodes in terms of operating system, networking software, memory footprint, etc., is much more constraining than in ad hoc networks.

So far, we have mainly described sensor networks according to their intrinsic characteristics, and regarding their security, we have only stated that principally the same security objectives need to be met as in other types of networks. This leads to the question: what makes security in sensor networks a genuine area of network security research? To give a short answer, there are three main reasons for this. First, sensor nodes are deployed under particularly harsh conditions from a security point of view, as there will often be a high number of nodes distributed in a (potentially hostile) geographical area, so that it has to be assumed that at least some nodes may get captured and compromised by an attacker. Second, the severe resource constraints of sensor nodes in terms of computation time, memory, and energy consumption demand highly optimized implementations of security services, and also lead to a very unfair power balance between a potential attacker (e.g., equipped with a notebook) and the defender (a cheap sensor node).
Third, the specific property of sensor networks to aggregate (partial) answers to a request as the information flows from the sensors toward the base station calls for new approaches to ensuring the authenticity of sensor query results, as established end-to-end security approaches are not appropriate for this. Consequently, the following security objectives prove to be challenging in wireless sensor networks:

Avoiding and coping with sensor node compromise. This includes measures to partially "hide" the location of sensor nodes at least on the network layer, so that an attacker should ideally not be able to use network layer information in order to locate specific sensor nodes. Furthermore, sensor nodes should as far as possible be protected from compromise through tamper-proofing measures, where this is economically feasible. Finally, as node compromise cannot be ultimately prevented, other sensor network security mechanisms should degrade gracefully in case of single node compromises.
Maintaining availability of sensor network services. This requires a certain level of robustness against so-called DoS attacks, protection of sensor nodes from malicious energy draining, and ensuring the correct functioning of message routing.

Ensuring confidentiality and integrity of data. Data retrieved from sensor networks should be protected from eavesdropping and malicious manipulation. In order to attain these goals in sensor networks, both efficient cryptographic algorithms and protocols as well as an appropriate key management are required, and furthermore the specific communication pattern of sensor networks (including data aggregation) has to be taken into account.

In the following sections, we will discuss these challenges in more detail and present first approaches that have been proposed to meet them.
39.2 DoS and Routing Security

Denial of Service attacks aim at denying or degrading a legitimate user's access to a service or network resource, or at bringing down the servers offering such services themselves. From a high-level point of view, DoS attacks can be classified into the two categories resource destruction and resource allocation. In a more detailed examination, the following DoS attacking techniques can be identified:

1. Disabling services by:
   • Breaking into systems ("hacking")
   • Making use of implementation weaknesses such as buffer overruns, etc.
   • Deviation from proper protocol execution
2. Resource depletion by causing:
   • Expensive computations
   • Storage of state information
   • Resource reservations (e.g., bandwidth)
   • High traffic load (requires high overall bandwidth from the attacker)

Generally speaking, these attacking techniques can be applied to protocol processing functions at different layers of the protocol architecture of communication systems. While some of the attacking techniques can be defended against by a combination of established means of good system management, software engineering, monitoring, and intrusion detection, the attacking techniques of protocol deviation and resource depletion require dedicated analysis for specific communication protocols.

In sensor networks, two aspects raise specific DoS concerns: first, breaking into sensor nodes is facilitated by the fact that it might be relatively easy for an attacker to physically capture and manipulate some of the sensor nodes distributed in an area, and second, energy is a very scarce resource in sensor nodes, so any opportunity for an attacker to cause a sensor node to wake up and perform some processing functions is a potential DoS vulnerability.

In 2002, Wood and Stankovic [4] published an article on DoS threats in sensor networks in which they mainly concentrated on protocol functions of the first four Open System Interconnection (OSI) layers. Table 39.1 gives an overview of their findings and the potential countermeasures proposed.

On the physical layer, jamming of the wireless communication channel represents the principal attacking technique. Spread-spectrum techniques are by nature more resistant against this kind of attack, but can nevertheless not guarantee the availability of physical layer services. In case the bandwidth available in an area is reduced by a DoS attack, giving priority to more important messages could help to maintain at least basic operations of a sensor network. While jamming mainly disturbs the ability of sensor nodes to communicate, it has a second DoS-relevant side effect: as a consequence of worse channel conditions, sensor nodes need more energy to exchange messages. Depending on the protocol implementation, this could even lead to energy exhaustion of some nodes, if they tirelessly tried to send their messages instead of waiting for better channel conditions. Therefore, from a DoS avoidance point of view, lower
TABLE 39.1 DoS-Threats in Wireless Sensor Networks [4]

Network layer   Attacks              Countermeasures
Physical        Tampering            Tamper-proofing, hiding
                Jamming              Spread-spectrum, priority messages, lower duty cycle, region mapping, mode change
Link            Collision            Error-correcting code
                Exhaustion           Rate limitation
                Unfairness           Small frames
Network         Neglect and greed    Redundancy, probing
                Homing               Encryption (only partial protection)
                Misdirection         Egress filtering, authorization, monitoring
                Black holes          Authorization, monitoring, redundancy
Transport       Flooding             Client puzzles
                Desynchronization    Data origin authentication
duty cycles could be a beneficial protocol reaction to bad channel conditions. Furthermore, the routing protocol (see also below) should avoid directing messages into jammed areas, and ideally, cooperating sensor nodes located at the edge of a jammed area could collaborate to map jamming reports and reroute traffic around this area. If sensor nodes possess multiple modes of communication (e.g., wireless and infrared communications), changing the mode is also a potential countermeasure. Finally, even if not directly related to communications, capturing and tampering with sensor nodes can also be classified as a physical layer threat. Tamper-proofing of nodes is one obvious measure to avoid further damage resulting from misuse of captured sensor nodes. A traditional preventive measure to at least render capturing of nodes more difficult is to hide them.

On the link layer, Wood and Stankovic identify (malicious) collisions and unfairness as potential threats and propose as classical measures the use of error-correcting codes and small frames. While one could argue that both threats (and the respective countermeasures) are not actually security specific but also known as conventional problems (with established strategies for overcoming them), their deliberate exploitation for DoS attacks could nevertheless lead to temporal unavailability of communication services, and ultimately to exhaustion of sensor nodes. For the latter threat the authors propose rate limitation as a potential countermeasure (basically the same idea as the lower duty cycle mentioned in the physical layer discussion).

Considering the network layer, threats can be further subdivided into forwarding- and routing-related threats. Regarding forwarding, the main threats are neglect and greed, that is, sensor nodes that might only be interested in getting their own packets transferred in the network without correctly participating in the forwarding of other nodes' packets. Such behavior could potentially be detected by the use of probing packets and circumvented by using redundant communication paths. However, both measures increase the network overhead and thus do not come for free. If packets contain the geographical position of nodes in cleartext, this could be exploited by an attacker for homing in on (locating) specific sensor nodes in order to physically capture and compromise them. As a countermeasure against this threat, Wood and Stankovic propose encryption of message headers and content between neighboring nodes. Regarding routing-related threats, deliberate misdirection of traffic could lead to higher traffic load, as a consequence to higher energy consumption in a sensor network, and potentially also to unreachability of certain network parts. Potential countermeasures against this threat are egress filtering, that is, checking the direction in which messages will be routed, authorization verification of routing-related messages, monitoring of the routing and forwarding behavior of nodes by neighboring nodes, and redundant routing of messages over multiple paths that in the ideal case do not share common intermediate nodes. The same countermeasures can also be applied in order to defend against so-called black hole attacks, in which one node or part of the network attracts a high amount of traffic (e.g., by announcing short routes to the base station) but does not forward this traffic.
On the transport layer, the threats of flooding with connection requests and desynchronization of sequence numbers are identified in Reference 4. Both attack techniques are known from classical Internet communications and might also potentially be applied to sensor networks, in case such networks are going to make use of transport layer connections. Established countermeasures to defend against them are so-called client puzzles [5] and authentication of communication partners.

Recapitulating the given discussion, it can be seen that especially the network layer exhibits severe DoS vulnerabilities and proves to be the most interesting layer for potential attackers interested in degrading the availability of sensor network services. This is mostly owing to the fact that in this layer the essential forwarding and routing functionality is realized, so that an attacker can cause significant damage with rather moderate means (e.g., in comparison to jamming a large area). In the following, we will, therefore, further elaborate on this layer and at the same time extend our discussion to general threats on forwarding and routing functions, including attacks beyond pure DoS interests.

In Reference 6, Karlof and Wagner give an overview of attacks and countermeasures regarding secure routing in wireless sensor networks. From a high-level point of view they identify the following threats:

• Insertion of spoofed, altered, or replayed routing information with the aim of loop construction, attracting, or repelling traffic, etc.
• Forging of acknowledgments, which may trick other nodes into believing that a link or node is either dead or alive when in fact it is not.
• Selective forwarding, which may be realized either "in path" or "beneath path" by deliberate jamming, and which allows an attacker to control which information is forwarded and which information is suppressed.
• Creation of so-called "sinkholes," that is, attracting traffic to a specific node, for example, to prepare selective forwarding.
• Simulating multiple identities ("Sybil attacks"), which allows an attacker to reduce the effectiveness of fault-tolerant schemes like multi-path routing.
• Creation of so-called "wormholes" by tunneling messages over alternative low-latency links, for example, to confuse the routing protocol, create sinkholes, etc.
• Sending of so-called "hello floods" (more precisely: "hello shouting"), in which an attacker sends or replays a routing protocol's hello packets with more energy in order to trick other nodes into the belief that they are neighbors of the sender of the received messages.

In order to give an example of such attacks, Figure 39.2 [7] illustrates the construction of a breadth-first spanning tree, and Figure 39.3 [6] shows the effect of two attacks on routing schemes that use the breadth-first search tree idea to construct their forwarding tables. One example of a sensor network operating system that builds a breadth-first spanning tree rooted at the base station is TinyOS. In such networks, an attacker with one or two laptops at his disposal can either send out forged routing information or launch a wormhole attack. As can be seen in Figure 39.3, both attacks lead to entirely different routing trees and can be used to prepare further attacks such as selective forwarding, etc.

In order to defend against the abovementioned threats, Karlof and Wagner discuss various methods.
Regarding forging of routing information or acknowledgments, data origin authentication and confidentiality of link layer PDUs (Protocol Data Units) can serve as an effective countermeasure. While the first naive approach of using a single group key for this purpose exhibits the rather obvious vulnerability that a single node compromise would result in complete failure of the security, a better, still straightforward approach is to let each node share a secret key with a base station and to have base stations act as trusted third parties in key negotiation (e.g., using the Otway–Rees protocol [8]).

Combined with an appropriate key management, the abovementioned link layer security measures could also limit the threat potential of the attack of simulating multiple identities: by reducing the number of neighbors a node is allowed to have, for example, through enforcement during key distribution, authentic sensor nodes could be protected from accepting too many neighborhood relations. Additionally, by keeping track of authentic identities and associated keys, the ability of potentially compromised nodes to simulate multiple identities could be restricted. However, the latter idea requires some kind of global
FIGURE 39.2 Breadth-first search: (a) BS sends beacon, (b) first answers to beacon, (c) answers to first answers, and (d) resulting routing tree.
FIGURE 39.3 Attacks on breadth-first search routing (from left to right: an example routing tree, the tree resulting from forging routing updates, and the tree resulting from a wormhole attack).
knowledge that often can only be realized efficiently by a centralized scheme which actively involves a base station in the key distribution protocol.

When it comes to hello shouting and wormhole/sinkhole attacks, however, pure link layer security measures cannot provide sufficient protection, as they cannot completely protect against replay attacks. Links should, therefore, be checked in both directions before making routing decisions in order to defend against simple hello shouting attacks. Detection of wormholes actually proves to be difficult, and a first approach to this problem requires rather tight clock synchronization [9]. Sinkholes might be avoided by deploying routing schemes, like geographical routing, that do not rely on constructing forwarding tables according to the distance measured in hops to the destination. Selective forwarding attacks might be countered with multi-path routing. However, this requires redundancy in the network and results in higher network overhead.
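To illustrate why forged beacons are so effective, the following Python sketch builds a breadth-first routing tree in the way described above (every node adopts the sender of the first beacon it hears as its parent) and then rebuilds it after a wormhole has added a fake low-latency link; the topology and node names are invented for the example:

from collections import deque

def build_routing_tree(neighbors, base):
    # Flood a beacon from the base station; each node's parent is the
    # neighbor from which it first heard the beacon (breadth-first order).
    parent = {base: None}
    queue = deque([base])
    while queue:
        node = queue.popleft()
        for nb in sorted(neighbors[node]):
            if nb not in parent:
                parent[nb] = node
                queue.append(nb)
    return parent

neighbors = {
    "BS": {"a", "b"},
    "a": {"BS", "c"},
    "b": {"BS", "d"},
    "c": {"a", "d"},
    "d": {"b", "c", "e"},
    "e": {"d"},
}
print(build_routing_tree(neighbors, "BS"))   # honest tree: e routes via d

# A wormhole tunnels beacons between the attacker's two radios, which the
# protocol perceives as a new link, here between "a" and "e":
neighbors["a"].add("e")
neighbors["e"].add("a")
print(build_routing_tree(neighbors, "BS"))   # now e attaches via the tunnel at a

In a real deployment the tunneled beacon arrives earlier than the legitimate multi-hop one, so nodes near the far wormhole endpoint attach to the attacker's branch, which is exactly the sinkhole-preparation effect described by Karlof and Wagner.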
39.3 Energy Efficient Confidentiality and Integrity

The preceding discussion of potential countermeasures against DoS attacks and general attacks on routing in wireless sensor networks has shown that the security services confidentiality and integrity prove to be valuable mechanisms against various attacks. Obviously, they are also effective measures to protect application data (e.g., commands and sensor readings) against unauthorized eavesdropping and manipulation,
respectively. In this section, we will therefore examine their efficient implementation in resource-restricted sensor networks.

In their paper SPINS: Security Protocols for Sensor Networks, Perrig et al. [10] discuss the requirements and propose a set of protocols for realizing efficient security services for sensor networks. The main challenges in the design of such protocols arise out of tight implementation constraints in terms of instruction set, memory, and CPU speed, a very small energy budget in low-powered devices, and the fact that some nodes might get compromised. These constraints rule out some well-established alternatives: asymmetric cryptography [11–13] is generally considered to be too expensive as it results in high computational cost and long ciphertexts and signatures (sending and receiving is very expensive!). Especially, public key management based on certificates exceeds the sensor nodes' energy budget, and key revocation is almost impossible to realize under the restricted conditions in sensor networks. Even the implementation of symmetric cryptography turns out to be nonstraightforward owing to architectural limitations and energy constraints. Furthermore, the key management for authenticating broadcast-like communications calls for new approaches, as simple distribution of one symmetric group key among all receivers would not allow coping with compromised sensor nodes. Perrig et al. therefore propose two main security protocols:

• The Sensor Network Encryption Protocol (SNEP) for realizing efficient end-to-end security between nodes and base stations.
• A variant of the Timed Efficient Stream Loss-Tolerant Authentication Protocol (TESLA), called µTESLA, for authenticating broadcast communications, that will be further discussed in Section 39.4.

The main goal in the development of SNEP was the efficient realization of end-to-end security services for two-party communication. SNEP provides the security services data confidentiality, data origin authentication, and replay protection. The considered communication patterns are node to base station (e.g., sensor readings) and base station to individual nodes (e.g., specific requests). Securing messages from a base station to all nodes (e.g., routing beacons, queries, reprogramming of the entire network) is the task of the µTESLA protocol to be discussed in Section 39.4.

The main design decisions in the development of SNEP were to avoid the use of asymmetric cryptography, to construct all cryptographic primitives out of a single block cipher, and to exploit common state in order to reduce communication overhead where this is possible. SNEP's basic trust model assumes that two communicating entities A and B share a common master key X_{A,B}. Initially, the base station shares a master key with all nodes, and node-to-node master keys can be negotiated with the help of the base station (see below). From such a master key, two confidentiality keys CK_{A,B}, CK_{B,A} (one per direction), two integrity keys IK_{A,B}, IK_{B,A}, and a random seed RK_{A,B} are derived according to the following equations:

CK_{A,B} = F_{X_{A,B}}(1)    (39.1)
CK_{B,A} = F_{X_{A,B}}(2)    (39.2)
IK_{A,B} = F_{X_{A,B}}(3)    (39.3)
IK_{B,A} = F_{X_{A,B}}(4)    (39.4)
RK_{A,B} = F_{X_{A,B}}(5)    (39.5)
The principal cryptographic primitive of SNEP is the RC5 algorithm [14]. Three parameters of this algorithm can be configured: the word length w [bits], the number of rounds r, and the key size b [bytes]; the resulting instantiation of the algorithm is denoted as RC5-w/r/b. What makes RC5 specifically suitable for implementation in sensor nodes is the fact that it can be programmed with a few lines of code and that the main algorithm makes use of only three simple and efficient instructions: two's complement addition + of words (mod 2^w), bit-wise XOR ⊕ of words, and cyclic rotation <<<. Figure 39.4 illustrates
// Algorithm: RC5-Encryption
// Input:  A, B = plaintext stored in two words
//         S[0..2r+1] = an array filled by a key setup procedure
// Output: A, B = ciphertext stored in two words
// (All additions are mod 2^w; <<< denotes cyclic left rotation.)

A := A + S[0];
B := B + S[1];
for i := 1 to r {
    A := ((A ⊕ B) <<< B) + S[2i];
    B := ((B ⊕ A) <<< A) + S[2i + 1];
}
FIGURE 39.4 The RC5 encryption algorithm.
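For readers who want to experiment, the following Python transcription of the algorithm in Figure 39.4, together with Rivest's key setup routine referred to in the text, may serve as a reference sketch; the all-zero key and the parameter choice RC5-32/8/8 in the usage lines are arbitrary illustrations:

W_BITS = 32                               # word size w (RC5-32: 64-bit blocks)
MASK = (1 << W_BITS) - 1
P32, Q32 = 0xB7E15163, 0x9E3779B9         # RC5 "magic" constants for w = 32

def rotl(x, s):
    s %= W_BITS
    return ((x << s) | (x >> (W_BITS - s))) & MASK

def key_setup(key, r):
    # Expand the b-byte key into the table S[0 .. 2r+1].
    c = max(1, (len(key) + 3) // 4)
    L = [0] * c
    for i, byte in enumerate(key):        # load key bytes little-endian
        L[i // 4] |= byte << (8 * (i % 4))
    t = 2 * (r + 1)
    S = [(P32 + i * Q32) & MASK for i in range(t)]
    A = B = i = j = 0
    for _ in range(3 * max(t, c)):        # mix the key into the table
        A = S[i] = rotl((S[i] + A + B) & MASK, 3)
        B = L[j] = rotl((L[j] + A + B) & MASK, A + B)
        i, j = (i + 1) % t, (j + 1) % c
    return S

def encrypt(A, B, S, r):
    # One-block RC5 encryption of the two words A and B (cf. Figure 39.4).
    A = (A + S[0]) & MASK
    B = (B + S[1]) & MASK
    for i in range(1, r + 1):
        A = (rotl(A ^ B, B) + S[2 * i]) & MASK
        B = (rotl(B ^ A, A) + S[2 * i + 1]) & MASK
    return A, B

S = key_setup(b"\x00" * 8, r=8)           # hypothetical all-zero 8-byte key
print([hex(v) for v in encrypt(0x01234567, 0x89ABCDEF, S, r=8)])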
TABLE 39.2 Plaintext Requirements for Differential Attacks on RC5

Number of rounds                          4      6      8      10     12     14     16
Differential attack (chosen plaintext)    2^7    2^16   2^28   2^36   2^44   2^52   2^61
Differential attack (known plaintext)     2^36   2^41   2^47   2^51   2^55   2^59   2^63
the encryption function. The corresponding decryption function can be easily obtained by basically "reading the code in reverse." Prior to en- or decryption with RC5, an array S[0..2r+1] has to be filled by a key preparation routine that is a little bit more tricky, but also uses only simple instructions.

Regarding the security of the RC5 algorithm, Kaliski and Yin [15] reported in 1998 that the best known attacks against RC5 with a block length of 64 bits have plaintext requirements as listed in Table 39.2. According to the information given in Reference 10 (RAM requirements, etc.), Perrig et al. seem to plan for RC5 with 8 rounds and 32-bit words (leading to a block length of 64 bits), so that a differential cryptanalysis attack would require about 2^28 chosen plaintexts or about 2^47 known plaintexts, and CPU effort in the same order of magnitude. Taking into account progress in PC technology, this should be considered on the edge of being secure (if an attacker can collect that many plaintexts). Nevertheless, by increasing the number of rounds, the required effort could be raised to 2^61 or 2^63, respectively. Even higher security requirements can in principle only be ensured by using a block cipher with a larger block size.

In SNEP, encryption of messages is performed by using the RC5 algorithm in an operational mode called counter mode, which XORs the plaintext with a pseudo-random bit sequence that is generated by encrypting increasing counter values (see also Figure 39.5). The encryption of message Msg with key K and counter value Counter is denoted as:

{Msg}_{K, Counter}

For computing Message Authentication Codes (MACs), SNEP uses the well-established Cipher Block Chaining Message Authentication Code (CBC-MAC) construction. This mode encrypts each plaintext block P1, ..., Pn with an integrity key IK, XORing the ciphertext of the last encryption result C_{i−1} with the plaintext block P_i prior to the encryption step. The result of the last encryption step is then taken as the message authentication code (see also Figure 39.6).

Depending on whether encryption of message data is required or not, SNEP offers two message formats:

1. The first format appends an RC5-CBC-MAC computed with the integrity key IK_{A,B} over the message data:

A → B: Msg | RC5-CBC(IK_{A,B}, Msg)
FIGURE 39.5 Encryption in counter mode.

FIGURE 39.6 Computing a MAC in cipher block chaining mode.
2. The second format encrypts the message and appends a MAC in whose computation the counter value is also included:

A → B: {Msg}_{CK_{A,B}, Counter} | RC5-CBC(IK_{A,B}, Counter, {Msg}_{CK_{A,B}, Counter})

Please note that the counter value itself is not transmitted in the message, so that common state between sender and receiver is exploited in order to save transmission energy and bandwidth. Furthermore, random numbers are generated by encrypting a (different) counter, and the RC5-CBC construction is also used for key derivation, as the key-deriving function mentioned above is realized as:

F_{X_{A,B}}(n) := RC5-CBC(X_{A,B}, n)

In order to be able to successfully decrypt a message, the receiver's decryption counter needs to be synchronized with the sender's encryption counter. An initial counter synchronization can be achieved by the following protocol, in which both entities A and B communicate their individual counter values for encryption, C_A and C_B, to the other party, and authenticate both values by exchanging two MACs computed with their integrity keys IK_{A,B} and IK_{B,A}, respectively:

A → B: C_A
B → A: C_B | RC5-CBC(IK_{B,A}, C_A, C_B)
A → B: RC5-CBC(IK_{A,B}, C_A, C_B)

In case of a message loss, counters get out of synchronization. By trying out a couple of different counter values, a few message losses can be tolerated. However, as this consumes energy, after trying out a couple of succeeding values, an explicit resynchronization dialog is initiated by the receiver A of a message. The dialog consists of sending a freshly generated random number N_A to B, who answers with his current
The dialog consists of A sending a freshly generated random number NA to B, who answers with his current counter CB and a MAC computed with his integrity key over both the random number and the counter value:
A → B: NA
B → A: CB | RC5–CBC(IKB,A, NA, CB)

As encrypted messages are only accepted by a receiver if the counter value used in their MAC computation is higher than the last accepted value, the implementation of the confidentiality service in SNEP to a certain degree also provides replay protection. If, for a specific request Req, an even stronger freshness guarantee is needed, the request can also contain a freshly generated random number NA that is included in the computation of the MAC of the answer message containing the response Rsp:
A → B: NA, Req
B → A: {Rsp}CKB,A,CB | RC5–CBC(IKB,A, NA, CB, {Rsp}CKB,A,CB)

In order to establish a shared secret SKA,B between two sensor nodes A and B with the help of a base station BS, SNEP provides the following protocol:
A → B: NA | A
B → BS: NA | NB | A | B | RC5–CBC(IKB,BS, NA | NB | A | B)
BS → A: {SKA,B}KBS,A | RC5–CBC(IKBS,A, NA | B | {SKA,B}KBS,A)
BS → B: {SKA,B}KBS,B | RC5–CBC(IKBS,B, NB | A | {SKA,B}KBS,B)

In this protocol, A first sends a random number NA and its name to B, who in turn sends both values together with its own random number NB and name B to the base station. The base station then generates a session key SKA,B and sends it to both sensor nodes in two separate messages, which are encrypted with the respective key the base station shares with each node. The random numbers NA and NB allow both sensor nodes to verify the freshness of the returned message and the key contained in it. Regarding the security properties of this protocol, however, it has to be remarked that, in a strict sense, the protocol as formulated in Reference 10 allows neither A nor B to perform concurrent key negotiations with multiple entities, as in such a case they would not be able to securely relate the answers to the correct protocol run (please note that the name of the peer entity is not transmitted in the returned message but only included in the MAC computation). Furthermore, neither A nor B knows whether the other party received the key and trusts in its suitability, which is commonly regarded as an important objective of a key management protocol [16]. Finally, the base station cannot deduce anything about the freshness of the messages it receives and can therefore not differentiate between fresh and replayed requests for a session key.
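The receiver-side counter handling described above, accepting only counters strictly greater than the last accepted value and trying a small window of successive values after message loss, can be sketched by reusing the helpers from the previous listing; the window size is an arbitrary illustrative choice.

```python
# Receiver-side counter handling, sketched under the same assumptions
# as above. On MAC failure the receiver tries a small window of counter
# values (tolerating a few lost messages); counters at or below the
# last accepted value are never tried, which rejects replays.
MAX_TRIES = 4  # illustrative window before an explicit resynchronization

def receive(ck, ik, last_counter, ct, mac):
    for c in range(last_counter + 1, last_counter + 1 + MAX_TRIES):
        expected = cbc_mac(ik, c.to_bytes(BLOCK, "big") + ct)
        if expected == mac:                  # counter found, MAC valid
            return c, ctr_crypt(ck, c, ct)   # new counter state, plaintext
    return last_counter, None                # give up: trigger explicit resync
```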
39.4 Authenticated Broadcast

Authenticated broadcast is required if a message needs to be sent to all (or many) nodes in a sensor network and the sensor nodes have to be able to verify the authenticity of the message. Examples of this communication pattern are authenticated query messages, routing beacon messages, or commands to reprogram an entire network. As it has to be ensured that recipients of such a message cannot use their verification key to forge authenticated messages, an asymmetric mechanism has to be deployed. Classical asymmetric cryptography, however, is considered too expensive in terms of computation, storage, and communication requirements for sensor nodes. One basic idea for obtaining asymmetry while deploying a symmetric cryptographic algorithm is to send a message that has been authenticated with a key Ki and to disclose this key at a later
point in time, so that the authenticity of the message can then be verified. Of course, from the moment the key disclosure message has been sent, a potential attacker could use the key to create MACs for forged messages. Therefore, it is important that all receivers have at least loosely synchronized clocks and only use a key Ki to verify messages that were received before the key disclosure message was sent. However, it must also be ensured that a potential attacker cannot succeed in tricking genuine nodes into accepting bogus authentication keys of his own making. One elegant way to achieve this is the inverse use of a chain of hash codes for obtaining integrity keys, basically a variation of the so-called one-time password idea [17].

The TESLA protocol uses a reversed chain of hash values to authenticate broadcast data streams [18]. The µTESLA protocol proposed for use in sensor networks is a minor variation of the TESLA protocol, the basic difference being the cryptographic scheme used to authenticate the initial key. While TESLA uses asymmetric cryptography for this, µTESLA deploys the SNEP protocol, so that the base station calculates for each sensor node one individual MAC that authenticates the initial key K0. Furthermore, while TESLA discloses the key in every packet, µTESLA discloses the key only once per time interval in order to reduce protocol overhead, and only base stations authenticate broadcast packets, because sensor nodes are not capable of storing entire key chains.

In order to set up a sender, first the length n of the key chain to be computed is chosen and the last key of the key chain, Kn, is randomly generated. Second, the entire hash key chain is computed according to the equation Ki−1 := H(Ki) and stored at the sender, and the key K0 is communicated and authenticated to all participating sensor nodes. For this, each sensor node A sends a random number NA to the base station, and the base station answers with a message containing its current time TBS, the currently disclosed key Ki (in the initial case i = 0), the time period Ti in which Ki was valid for authenticating messages, the interval length TInt, the number of intervals δ the base station waits before disclosing a key, and a MAC computed with the integrity key IKBS,A over these values:
A → BS: NA | A
BS → A: TBS | Ki | Ti | TInt | δ | RC5–CBC(IKBS,A, NA | TBS | Ki | Ti | TInt | δ)

After this preparatory phase, broadcasting authenticated packets is realized as follows:
• Time is divided into uniform-length intervals Ti, and all sensor nodes are loosely synchronized to the clock of the base station.
• In time interval Ti, the sender authenticates packets with key Ki.
• The key Ki is disclosed in time interval i + δ (e.g., δ = 2).

Figure 39.7 illustrates this reverse use of the chain of hash values for authenticating packets. In order to check the authenticity of a received packet, a sensor node first has to store the packet together with Ti and wait until the respective key has been disclosed by the base station. Upon disclosure of the appropriate key Ki, the authenticity of the packet can be checked. Of course, for this scheme to be secure it is crucial to discard all packets that have been authenticated with an already disclosed key. This requires at least a loose time synchronization with an appropriate value of δ that is selected in accordance with the maximum clock drift.
[Figure 39.7: An example of TESLA operation. The keys K1, …, K8 of a reversed hash chain (Ki−1 = H(Ki)) are assigned to the time intervals T1, …, T8; packets Pi sent in interval Ti carry a MAC computed with key Ki, and each key Ki is disclosed δ intervals later (δ = 2 in the example).]
However, as nodes cannot store many packets, key disclosure cannot be postponed for long, so the maximum tolerable clock drift must not be too large. If a sensor node needs to send a broadcast packet, it sends a SNEP-protected packet to the base station, which in turn sends an authenticated broadcast packet. The main reason for this is that sensor nodes do not have enough memory to store key chains and can, therefore, not authenticate broadcast packets on their own.
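The reversed hash chain at the heart of TESLA/µTESLA can be sketched as follows. SHA-256, the chain length, and all names are illustrative choices; in µTESLA the anchor K0 would be delivered inside an authenticated SNEP message rather than simply assumed authentic.

```python
# Minimal sketch of a reversed hash chain as used by TESLA/µTESLA.
import hashlib

def H(k: bytes) -> bytes:
    return hashlib.sha256(k).digest()

def make_chain(k_n: bytes, n: int):
    """Compute K_n, K_{n-1}, ..., K_0 with K_{i-1} = H(K_i)."""
    chain = [k_n]
    for _ in range(n):
        chain.append(H(chain[-1]))
    chain.reverse()          # chain[i] == K_i
    return chain

def authentic_key(k0: bytes, k_i: bytes, i: int) -> bool:
    """A node holding the authenticated anchor K0 verifies a disclosed
    key K_i by hashing it i times back to K0."""
    for _ in range(i):
        k_i = H(k_i)
    return k_i == k0

chain = make_chain(b"random chain seed", 8)
assert authentic_key(chain[0], chain[5], 5)   # disclosed K5 checks out
```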
39.5 Alternative Approaches to Key Management

Key management is often said to be the hardest part of implementing secure communications: on one hand, legitimate entities need to hold or be able to agree on the required keys; on the other hand, a suite of security protocols cannot offer any protection if the keys fall into the hands of an attacker. The SNEP protocol suite as described in Section 39.3 includes a simple and rather traditional key management protocol that enables two sensor nodes to obtain a shared secret key with the help of a base station. In this section, we will treat the subject of key management in more depth and review alternative approaches to it. Key management comprises the following tasks [1]:

• Key generation is the creation of the keys that are used. This process must be executed in a random or at least pseudo-random-controlled way, because attackers will otherwise be able to execute the process themselves and, in a relatively short time, discover the key that was used. Pseudo-random-controlled key generation means that keys are created according to a deterministic procedure, but each possible key has the same probability of being produced by it. Pseudo-random generators must be initialized with a truly random value so that they do not always produce the same keys. If the process of key generation is not reproducible, it is referred to as "really random" key generation.
• The task of key distribution consists of deploying generated keys in the places in a system where they are needed. In simple scenarios the keys can be distributed through direct (e.g., personal) contact. If larger distances are involved and symmetric encryption algorithms are used, the communication channel again has to be protected through encryption. Therefore, a key is needed for distributing keys. This necessity leads to the introduction of what are called key hierarchies.
• When keys are stored, measures are needed to make sure that they cannot be read by unauthorized users. One way to address this requirement is to ensure that the key is regenerated from an easy-to-remember but sufficiently long password (usually an entire sentence) before each use, and is therefore only stored in the memory of the respective user. Another possibility for storage is manipulation-safe crypto-modules, which are available on the market in the form of processor chip cards at a reasonable price.
• Key recovery is the reconstruction of keys that have been lost. The simplest approach is to keep a copy of all keys in a secure place. However, this creates a possible security problem, because an absolute guarantee is needed that the copies of the keys will not be tampered with. The alternative is to distribute the storage of the copies to different locations, which minimizes the risk of fraudulent use as long as all parts of the copies are required to reconstruct the keys.
• Key invalidation is an important task of key management, particularly with asymmetric cryptographic methods. If a private key becomes known, the corresponding public key needs to be marked as invalid. In sensor networks, key invalidation is expected to be a quite likely operation, as sensor nodes may be relatively easy to capture and compromise.
• The destruction of no longer required keys is aimed at ensuring that messages enciphered with them also cannot be decrypted by unauthorized persons in the future. It is important to make sure that all copies of the keys have really been destroyed.
In modern operating systems this is not a trivial task, since memory content is regularly transferred to hard disk by automatic storage management, and deletion in memory gives no assurance that copies of the keys no longer exist. In the case of magnetic disk storage devices and so-called EEPROMs (Electrically Erasable Programmable Read-Only Memory), the media have to be overwritten more than once or physically destroyed to guarantee that the keys stored on them can no longer be read, even with sophisticated technical means.

Of the listed tasks, most key management protocols address key distribution and sometimes also key generation. Approaches to distributing keys in traditional networks, however, do not work well in wireless sensor networks. Methods based on asymmetric cryptography require very resource-intensive computations and are, therefore, often judged inappropriate for sensor networks. Arbitrated key management based on predetermined keys, such as the key management protocol of SNEP, on the other hand, assumes predetermined keys at least between the base station and the sensor nodes. This requires predistribution of these keys before deployment of the sensor network and also has security implications in case of node compromise.

There are a couple of particular requirements on key management schemes for sensor networks that result from their specific characteristics [19]:

Vulnerability of nodes to physical capture and node compromise. Sensor nodes may be deployed in difficult-to-protect or hostile environments and can therefore fall into the hands of an attacker. Because of tight cost constraints, nodes will often not be tamper-proof, so that cryptographic keys might be captured by an attacker. This leads to the requirement that compromise of some nodes and keys should not compromise the overall network's security ("graceful degradation").

Lack of a priori knowledge of the deployment configuration. In some applications, sensor networks will be installed via random scattering (e.g., from an airplane), so that neighborhood relations are not known a priori. Even with manual installation, preconfiguration of sensor nodes would be expensive in large networks. This leads to the requirement that sensor network key management should support "automatic" configuration after installation.

Resource restrictions. As mentioned earlier, the nodes of a sensor network possess only limited memory and computing resources, as well as very limited bandwidth and transmission power. This puts tight constraints on the design of key management procedures.

In-network processing. Over-reliance on a base station as the sole source of trust may result in inefficient communication patterns (cf. data aggregation in Section 39.6). It also turns base stations into attractive targets (which they are in any case!). Therefore, centralistic approaches like the key management protocol of SNEP should be avoided.

Need for later addition of sensor nodes. Compromise, energy exhaustion, or limited material/calibration lifetime may make it necessary to add new sensors to an existing network. Legitimate nodes that have been added to a sensor network should be able to establish secure relationships with existing nodes. Erasure of master keys after initial installation (cf. the LEAP approach described later) does not allow this.

In the following, we will describe two new alternatives to traditional key management approaches that have been proposed for sensor networks: the neighborhood-based initial key exchange protocol LEAP (Localized Encryption and Authentication Protocol), and the approach of probabilistic key distribution.
LEAP [20] enables "automatic" and efficient establishment of security relationships in an initialization phase after installation of the nodes. It supports key establishment for various trust relationships between:
• the base station and each sensor, with so-called "individual keys"
• sensors that are direct neighbors, with "pairwise shared keys"
• sensors that form a cluster, with "cluster keys"
• all sensors of a network, with a "group key"
In order to establish the individual keys prior to deployment, every sensor node u is preloaded with an individual key Ku^m known only to the node and the base station. The base station s generates these keys
from a master key Ks^m and the node identity u according to the equation Ku^m := f(Ks^m, u). Generating all node keys from one master key is supposed to save memory at the base station, as the individual keys need not be stored there but can be generated on the fly when they are needed.

In scenarios in which pairwise shared keys cannot be preloaded into sensor nodes because of installation by random scattering, but neighborhood relationships remain static after installation, LEAP provides a simple key establishment procedure for neighboring nodes. For this, it is assumed that there is a minimum time interval Tmin during which a node can resist attacks. After being scattered in the field, sensor nodes establish neighborhood relations during this time interval based on an initial group key KI that has been preconfigured into all sensor nodes before deployment. First, every node u computes its master key Ku = f(KI, u). Then, every node discovers its neighbors by sending a message with its identity u and a random number ru and collecting the answers:
u → ∗: u | ru
v → u: v | MAC(Kv, ru | v)

As u can also compute Kv, it can directly check this MAC, and both nodes compute the common shared secret Ku,v := f(Kv, u). After expiration of the timer Tmin, all nodes erase the initial group key KI and all computed master keys, so that only the pairwise shared keys are kept. This scheme can be augmented by having all nodes also forward the identities of their neighbors, enabling a node to compute pairwise shared keys with nodes that are one hop away.

In order to establish a cluster key with all its immediate neighbors, a node randomly generates a cluster key Kuc and sends it individually encrypted to all neighbors v1, v2, …:
u → vi: E(Ku,vi, Kuc)

All nodes vi decrypt this message with their pairwise shared key Ku,vi and store the obtained cluster key. When a node is revoked, a new cluster key is distributed to all remaining nodes.

If a node u wants to establish a pairwise shared key with a node c that is multiple hops away, it can do so by using other nodes it knows as proxies. In order to detect suitable proxy nodes vi, u broadcasts a query message with its own node id and that of c. Nodes vi knowing both u and c answer this message:
u → ∗: u | c
vi → u: vi

Assuming that node u has received m answers, it then generates m shares sk1, …, skm of the secret key Ku,c to be established with c and sends them individually over the respective nodes vi:
u → vi: E(Ku,vi, ski) | f(ski, 0)
vi → c: E(Kvi,c, ski) | f(ski, 0)

The value f(ski, 0) allows the nodes vi and c to verify whether the creator of such a message actually knew the key share ski, as otherwise it would not have been able to compute this value (the function f needs to be a one-way function for this to be secure). After receiving all values ski, node c computes Ku,c := sk1 ⊕ · · · ⊕ skm.
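The neighbor handshake of LEAP described above can be sketched as follows, with HMAC-SHA256 standing in for both the keyed pseudo-random function f and the MAC; all identifiers are our own, not LEAP's.

```python
# Sketch of LEAP's neighbor key establishment during the initial time
# window T_min. HMAC-SHA256 is an illustrative stand-in for f and MAC.
import hashlib
import hmac
import os

def f(key: bytes, data: bytes) -> bytes:
    return hmac.new(key, data, hashlib.sha256).digest()

K_I = b"preconfigured initial group key"

def master_key(node_id: bytes) -> bytes:
    return f(K_I, node_id)                      # K_u = f(K_I, u)

# u broadcasts (u, r_u); neighbor v answers with a MAC under K_v.
u, v = b"node-u", b"node-v"
r_u = os.urandom(8)
reply_mac = f(master_key(v), r_u + v)           # v -> u: v | MAC(K_v, r_u | v)

# u recomputes K_v from K_I, checks the MAC, and derives the pair key.
assert hmac.compare_digest(reply_mac, f(master_key(v), r_u + v))
K_uv = f(master_key(v), u)                      # K_{u,v} = f(K_v, u)
# After T_min expires, both nodes erase K_I and all master keys,
# keeping only the pairwise shared keys (not shown).
```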
In order to establish a new group key Kg, the base station s randomly generates a new key and sends it encrypted with its own cluster key to its neighbors:
s → vi: E(Ksc, Kg)

All nodes receiving such a message forward the new group key, encrypted with their own cluster key, to their neighbors. Node revocation is performed by the base station and uses µTESLA. All nodes, therefore, have to be preloaded with an authentic initial key K0, and loose time synchronization is needed in the sensor network. In order to revoke a node u, the base station s broadcasts the following message in time interval Ti using the µTESLA key Ki valid for that interval:
s → ∗: u | f(Kg, 0) | MAC(Ki, u | f(Kg, 0))

The value f(Kg, 0) later allows all nodes to verify the authenticity of a newly distributed group key Kg. The revocation becomes valid after disclosure of the TESLA key Ki.

A couple of remarks on security aspects of LEAP have to be made at this point. As every node u knowing KI may compute the master key Kv of every other node v, there is little additional security to be expected from distinguishing between these different "master keys." In particular, all nodes need to hold KI during the discovery phase in order to be able to compute the master keys of answering nodes. The authors of Reference 20 give no reasoning as to why this differentiation of master keys should attain any additional security. As any MAC construction that deserves its name should not leak information about KI in a message authentication code MAC(KI, ru | v), it is hard to see any benefit in this (is it "crypto snake oil"?). The synchronization of the time interval for pairwise key negotiation is also critical; however, the authors of Reference 20 give no hint on how the nodes should know when this time interval starts, whether there should be a signal, and, if so, what to do if a node misses this signal or "sleeps" during the interval. It is clear that if any node is compromised before the erasure of KI, the approach fails to provide protection against disclosure of pairwise shared keys. Furthermore, it does not become clear what the purpose of the random value (nonce) in the pairwise shared key establishment dialog is. Pairwise shared keys are only established during Tmin, and most probably all neighbors will answer the first message anyway (including the same nonce from this message). This random value is not even included in the computation of Ku,v, so the only thing it defends against is an attacker that sends replayed replies during Tmin; but these would not result in additional storage of keys Ku,v or in anything worse than having to parse and discard the replays. The cluster key establishment protocol does not allow a node to check the authenticity of the received key, as an attacker could send arbitrary binary data that decrypts to "something." This would overwrite an existing cluster key Kuc with garbage, leading to a DoS vulnerability. By appending a MAC this could be avoided; however, an additional replay protection would then be required in order to avoid overwriting with old keys. Finally, after expiration of the initial time interval Tmin, it is no longer possible to establish pairwise shared keys among neighbors, so that the LEAP approach does not support later addition or exchange of sensor nodes.
In 2002, Eschenauer and Gligor [21] proposed a probabilistic key management scheme that is based on the simple observation that, on one hand, sharing one key KG among all sensors leads to weak security, while on the other hand, sharing individual keys Ki,j among all pairs of nodes i, j requires too many keys in large sensor networks (n^2 − n keys for n nodes). The basic idea of probabilistic key management is to randomly give each node a so-called "key ring" containing a relatively small number of keys from a large key pool, and to let neighboring nodes discover the keys they share with each other. By properly adjusting the sizes of the key pool and the key rings, a "sufficient" degree of shared-key connectivity for a given network size can be attained. The basic scheme published in Reference 21 consists of three phases:
• Key predistribution
• Shared key discovery
• Path key establishment
The key predistribution consists of five steps that are processed offline. First, a large key pool P with about 2^17 to 2^20 keys and accompanying key identifiers is generated. Then, for each sensor, k keys are randomly selected out of P without replacement in order to establish the sensor's key ring. Every sensor is loaded with its key ring comprising the selected keys and their identifiers. Furthermore, all sensor identifiers and the key identifiers of their key rings are loaded into a controller node. Finally, a shared key for secure communication with each sensor s is loaded into the controller node ci according to the following rule: if K1, …, Kk denote the keys on the key ring of sensor s, the shared key Kci,s is computed as Kci,s := E(K1 ⊕ · · · ⊕ Kk, ci).

The main purpose of the key predistribution is to enable any two sensor nodes to identify a common key with a certain probability. The probability that two key rings KR1 and KR2 share at least one common key can be computed as follows:

Pr(KR1 and KR2 share at least one key) = 1 − Pr(KR1 and KR2 share no key)

The number of possible key rings is

$$\binom{P}{k} = \frac{P!}{k!\,(P-k)!}$$

The number of possible key rings after k keys have been drawn from the key pool without replacement is

$$\binom{P-k}{k} = \frac{(P-k)!}{k!\,(P-2k)!}$$

Thus, the probability that no key is shared is the ratio of the number of key rings without a match to the total number of key rings, so the probability of at least one common key is

$$\Pr(\text{at least one common key}) = 1 - \frac{k!\,(P-k)!\,(P-k)!}{P!\,k!\,(P-2k)!} = 1 - \frac{((P-k)!)^2}{P!\,(P-2k)!}$$
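The formula just derived can be transcribed directly using exact binomial coefficients, which makes it easy to explore how pool and ring sizes trade off against shared-key connectivity:

```python
# Direct transcription of the shared-key probability using exact
# binomial coefficients; P and k are the pool and ring sizes.
from math import comb

def p_share(P: int, k: int) -> float:
    """Probability that two rings of k keys, drawn without replacement
    from a pool of P keys, share at least one key."""
    return 1 - comb(P - k, k) / comb(P, k)

print(p_share(10_000, 75))  # e.g., a 10,000-key pool with 75-key rings
```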
After being installed, all sensor nodes start discovering their neighbors within wireless communication range, and any two nodes wishing to find out whether they share a key simply exchange the lists of key ids on their key rings. Alternatively, each node s could broadcast a challenge list:
s → ∗: α | E(K1, α) | · · · | E(Kk, α)

A node receiving such a list would then have to try all its keys in order to find matching keys (with high probability). This hides from an attacker which node holds which key ids, but requires more computational effort from each sensor node. The shared key discovery establishes a (random graph) topology in which links exist between nodes that share at least one key. It might happen that one key is used by more than one pair of nodes.

In the path key establishment phase, path keys are assigned to pairs of nodes (s1, sn) that do not share a key but are connected by two or more links, so that there is a sequence of nodes which share keys and "connect" s1 to sn. The article [21], however, does not contain any clear information on how path keys are computed or distributed. It only states that they do not need to be generated by the sensor nodes. Furthermore, it is mentioned that "the design of the DSN ensures that, after the shared key discovery phase is finished, a number of keys on any ring are left unassigned to any link." However, it does not become clear from Reference 21 how two nodes can make use of these unused keys for establishing a path key.

If a node is detected to be compromised, all keys on its ring need to be revoked. For this, the controller node generates a signature key Ke and sends it individually to every sensor node si, encrypted with the key Kci,si:
ci → si: E(Kci,si, Ke)
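The challenge-based variant of shared key discovery can be sketched as follows; a keyed hash emulates E(K, α), so "decrypting correctly" simply means recomputing the same tag. Pool and ring sizes are toy values, not the recommendations of Reference 21.

```python
# Sketch of privacy-preserving shared key discovery: a node broadcasts
# a challenge alpha "encrypted" under every key on its ring, and a
# receiver finds matches by trial recomputation.
import hashlib
import hmac
import os
import random

def tag(key: bytes, alpha: bytes) -> bytes:
    return hmac.new(key, alpha, hashlib.sha256).digest()

pool = [os.urandom(16) for _ in range(200)]       # toy key pool
ring_s = random.sample(pool, 20)                  # sender's key ring
ring_r = random.sample(pool, 20)                  # receiver's key ring

alpha = os.urandom(8)
broadcast = [tag(k, alpha) for k in ring_s]       # s -> *: alpha | E(K1,alpha) | ...

shared = [k for k in ring_r if tag(k, alpha) in set(broadcast)]
print(f"{len(shared)} shared key(s) discovered")
```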
Afterwards, it broadcasts a signed list of all identifiers of the keys that have to be revoked:
ci → ∗: id1 | id2 | · · · | idk | MAC(Ke, id1 | id2 | · · · | idk)

Every node receiving this list has to delete all listed keys from its key ring. This removes all links to the compromised node, plus some further links of the random graph. Every node that had to remove some of its links tries to reestablish as many of them as possible by starting a new shared key discovery and path key establishment phase.

Chan et al. [19] proposed a modification to the basic random predistribution scheme described so far that requires multiple shared keys to be combined. In this variant, two nodes are required to share at least q keys on their rings in order to establish a link. So, if K1, …, Kq′ are the common keys of nodes u and v (with q′ ≥ q), the link key is computed as follows:

Ku,v := h(K1, …, Kq′)

On one hand, this scheme makes it harder for an attacker to make use of one or more key rings obtained by node compromise, and this increase in difficulty is exponential in q. On the other hand, the size of the key pool |P| has to be decreased in order to have a high enough probability that two nodes share enough keys on their rings to establish a link. This gives an attacker a higher percentage of compromised keys per compromised node. In Reference 19, a formula is derived for computing the key pool size such that any two nodes share enough keys with a given probability of at least p. This scheme is called the q-composite scheme.

In the same paper, Chan et al. proposed a second scheme, called multi-path key reinforcement. The basic idea of this scheme is to "strengthen" an already established key by combining it with random values that are exchanged over alternative secure links. After the discovery phase of the basic scheme has been completed and enough routing information has been exchanged, so that a node u knows all (or at least enough) disjoint paths p1, …, pj to a node v, node u generates j random values v1, …, vj and sends each value along a different path to node v. After having received all j values, node v computes the new link key:

K′u,v := Ku,v ⊕ v1 ⊕ · · · ⊕ vj
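Both refinements by Chan et al. reduce to a few lines of code; the hash choice, the canonical key ordering, and the value sizes are our own illustrative assumptions:

```python
# Sketch of the q-composite link key (hashed over all q' >= q shared
# keys) and of multi-path key reinforcement (XORing random values
# received over disjoint paths into the link key).
import hashlib
import os
from functools import reduce

def qcomposite_link_key(shared_keys):            # K_{u,v} = h(K1, ..., Kq')
    h = hashlib.sha256()
    for k in sorted(shared_keys):                # canonical order on both sides
        h.update(k)
    return h.digest()

def reinforce(link_key: bytes, path_values) -> bytes:
    """K'_{u,v} = K_{u,v} xor v1 xor ... xor vj."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)),
                  path_values, link_key)

k = qcomposite_link_key([os.urandom(16) for _ in range(3)])   # q' = 3
k_reinforced = reinforce(k, [os.urandom(32) for _ in range(2)])  # j = 2 paths
```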
Clearly, the more paths are used, the harder it gets for an attacker to eavesdrop on all of them. However, the probability that an attacker is able to eavesdrop on a path increases with the length of the path, so that utilizing more but longer paths does not necessarily increase the overall security attained by the scheme. In Reference 19, the special case of 2-hop multi-path key reinforcement is analyzed probabilistically. Furthermore, the paper also describes a third approach, called the random pairwise key scheme, which hands out keys to pairs of nodes that also store the identity of the respective peer node holding the same key. The motivation behind this approach is to allow for node-to-node authentication (see Reference 19 for details).

Concerning security, the following remarks on probabilistic key management should be noted. The nice property of having a rather high probability that any two given nodes share at least one key (e.g., p ≈ 0.5 if 75 keys out of 10,000 keys are given to every node) also plays into the hands of an attacker who compromises a node; and an attacker that has compromised more than one node has an even higher probability of sharing at least one key with any given node. This problem also exists with the q-composite scheme, as the key pool size is reduced in order to ensure a high enough probability that any two nodes share at least q keys. This especially concerns the attacker's ability to perform active attacks; eavesdropping attacks are less probable, because the probability that the attacker holds exactly the key that two other nodes are using is rather small (and a lot smaller still in the q-composite scheme). Furthermore, the keys of compromised nodes are supposed to be revoked; but how to detect compromised nodes is still an open question, and so is how a sensor network should know which nodes and keys to revoke.
[Figure 39.8: Aggregating data in a sensor network.]

Finally, the presented probabilistic schemes do not support node-to-node authentication (with the exception of the random pairwise key scheme).
39.6 Secure Data Aggregation

As already mentioned in the introduction, data from different sensors is supposed to be aggregated on its way toward the base station (see also Figure 39.8). This raises the question of how to ensure authenticity and integrity of aggregated data. If every sensor added a MAC to its answer in order to ensure data origin authentication, all (answer, MAC) tuples would have to be sent to the base station to enable checking of their authenticity. This shows that individual MACs are not suitable for data aggregation. However, if only the aggregating node added one MAC, a subverted node could send arbitrary data regardless of the data sent by the sensors.

At GlobeCom'03, Du et al. [22] proposed a scheme that allows a base station to "check the integrity" of an aggregated value based on endorsements provided by so-called witness nodes. The basic idea of this scheme is that multiple nodes perform the data aggregation and compute a MAC over their result. This requires individual keys shared between each node and the base station. In order to allow for aggregated sending of data, some nodes act as so-called data fusion nodes, aggregating sensor data and sending it toward the base station. As a data fusion node could be a subverted or malicious node, its result needs to be endorsed by witness nodes. For this, neighboring nodes receiving the same sensor readings compute their own aggregated result, compute a MAC over this result, and send it to the data fusion node. The data fusion node computes a MAC over its own result and sends it together with all received MACs to the base station. Figure 39.9 illustrates this approach. In more detail, the scheme described in Reference 22 is as follows:
1. The sensor nodes S1, S2, …, Sn collect data from their environment and make binary decisions b1, b2, …, bn (e.g., fire detected) based on some detection rules.
2. Every sensor node sends its decision to the data fusion node F, which computes an aggregated decision SF.
3. Neighboring witness nodes w1, w2, …, wm also receive the sensor readings and compute their own fusion results s1, s2, …, sm. Every wi computes a message authentication code MACi with the key ki it shares with the base station, MACi := h(si, wi, ki), and sends it to the data fusion node.
[Figure 39.9: Overview of the witness-based approach [22]. Sensors S1, …, Sn observe a phenomenon (readings x1, …, xn) and report binary decisions b1, …, bn to the data fusion node; witness nodes compute their own fusion results and contribute MAC1, …, MACm, and the data fusion node forwards the endorsed result to the base station.]
4. Concerning the verification at the base station, Du et al. proposed two variants. The first one is an m + 1 out of m + 1 voting scheme and works as follows:
• The data fusion node F computes its message authentication code: MACF := h(SF, F, kF, MAC1 ⊕ MAC2 ⊕ · · · ⊕ MACm)
• F sends to the base station: (SF, F, w1, …, wm, MACF).
• The base station computes all MAC′i = h(SF, wi, ki) and the authentication code to be expected from F: MAC′F := h(SF, F, kF, MAC′1 ⊕ MAC′2 ⊕ · · · ⊕ MAC′m)
The base station then checks whether MAC′F = MACF and otherwise discards the message. If the set (w1, …, wm) remains unchanged, the identifiers of the wi need to be transmitted only with the first MACF in order to save transmission bandwidth. There is, however, one major drawback to this scheme: if one witness deliberately sends a wrong MACi, the aggregated data is refused by the base station (representing a DoS vulnerability).
5. In order to overcome the DoS vulnerability of the first scheme, Du et al. [22] also proposed an n out of m + 1 voting scheme:
• F sends to the base station: (SF, F, MACF, w1, MAC1, …, wm, MACm).
• The base station checks whether at least n out of the m + 1 MACs match, that is, whether at least n − 1 of the MACi match MACF.

This scheme is more robust against erroneous or malicious witness nodes, but requires a higher communication overhead, as m MACs must be sent to the base station. Du et al. [22] analyzed the minimum length of the MACs needed to ensure a certain tolerance probability 2^−δ that an invalid result is accepted by the base station. For this, they assume that each MAC has length k, that there are m witnesses, that no witness colludes with F, and that F needs to guess the endorsements MACi of at least n − 1 witnesses. As the probability of correctly guessing one MACi is p = 1/2^k, the authors compute the chance of correctly guessing at least n − 1 values as
$$P_S = \sum_{i=n-1}^{m} \binom{m}{i} p^i (1-p)^{m-i}$$
After some computation they obtain m(k/2 − 1) ≥ δ. From this, Du et al. conclude that it is sufficient if mk ≥ 2(δ + m), and they give an example of how to apply this: if δ = 10, so that the probability of accepting an invalid result is 1/1024, and there are m = 4 witnesses, then k should be chosen so that k ≥ 7. This observation is supposed to enable economizing on transmission effort. In case a data fusion node is corrupted, Du et al. propose to obtain a result as follows: if the verification at the base station fails, the base station polls witness nodes as data fusion nodes, and continues trying until the n out of m + 1 scheme described above succeeds. Furthermore, the expected number of polling messages T(m + 1, n) to be transmitted before the base station receives a valid result is computed.

Regarding the security of the proposed scheme, however, it has to be asked whether an attacker actually needs to guess MACs in order to get an invalid result accepted. As all messages are transmitted in the clear, an eavesdropper E could easily obtain valid message authentication codes MACi = h(si, wi, ki). If E later wants to act as a bogus data fusion node sending an (at that time) incorrect result si, it can replay MACi to support this value. As Reference 22 assumes a binary decision result, an attacker only needs to eavesdrop until he has received enough MACi supporting either value of si. Thus, the scheme completely fails to provide adequate protection against attackers forging witness endorsements. The main reason for this vulnerability is the missing verification of the actuality of a MACi at the base station. One could imagine, as a quick fix, letting the base station regularly send out random numbers rB that have to be included in the MAC computations. In such a scheme, every rB should only be accepted for one result, requiring the generation and exchange of large random numbers. A potential alternative could make use of timestamps, which would require synchronized clocks.

However, there are more open issues with this scheme. For example, it is not clear what should happen if some witness nodes cannot receive enough readings. Also, it is not clear why the MACi are not sent directly from the witness nodes to the base station; this would at least allow for a direct n out of m + 1 voting scheme, avoiding the polling procedure described earlier in case of a compromised data fusion node. Furthermore, the suffix-mode MAC construction h(message, key) selected by the authors is considered vulnerable [2, note 9.65]. A further issue is how to defend against an attacker flooding the network with "forged" MACi ("forged" meaning arbitrary garbage that looks like a MAC); this would allow an attacker to launch a DoS attack, as an honest fusion node could not know which values to choose. One more "hotfix" for this could be using a local MAC among neighbors to authenticate the MACi. Nevertheless, this would imply further requirements (e.g., shared keys among neighbors, replay protection), and the "improved scheme" would still not appear mature enough to be relied upon.

Some more general conclusions can be drawn from this: first, optimization (e.g., economizing on MAC size or message length) can be considered one of the attacker's best friends; and second, in security we often learn (more) from failures. Nevertheless, the article of Du et al. allows us to discuss the need for, and the difficulties of, constructing a secure data aggregation scheme that does not consume too many resources and is efficient enough to be deployed in sensor networks. As such, it can be considered a valuable contribution despite its security deficiencies.
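For concreteness, the m + 1 out of m + 1 endorsement check can be sketched as follows. A keyed hash replaces the suffix-mode construction criticized above; as just discussed, the mechanism itself remains vulnerable to replayed endorsements, so this illustrates the protocol flow only.

```python
# Sketch of the witness-based m+1 out of m+1 endorsement check.
import hashlib
import hmac
from functools import reduce

def mac(key: bytes, data: bytes) -> bytes:
    return hmac.new(key, data, hashlib.sha256).digest()

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def fusion_mac(k_f, result, witness_macs):
    # MAC_F := h(S_F, F, k_F, MAC_1 xor ... xor MAC_m)
    return mac(k_f, result + b"F" + reduce(xor, witness_macs))

def base_station_check(keys, k_f, result, witness_ids, mac_f) -> bool:
    # Recompute every witness MAC from the reported result SF, derive
    # the expected MAC'_F, and accept only on an exact match.
    expected = [mac(keys[w], result + w) for w in witness_ids]
    return hmac.compare_digest(mac_f, fusion_mac(k_f, result, expected))

keys = {b"w1": b"k1", b"w2": b"k2"}                  # witness/base-station keys
witness_macs = [mac(keys[w], b"\x01" + w) for w in (b"w1", b"w2")]
mf = fusion_mac(b"kF", b"\x01", witness_macs)        # binary decision S_F = 1
assert base_station_check(keys, b"kF", b"\x01", [b"w1", b"w2"], mf)
```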
39.7 Summary

Wireless sensor networks are an upcoming technology with a wide range of promising applications. As in other networks, however, security is crucial for any serious application. Prevalent security objectives in wireless sensor networks are confidentiality and integrity of data, as well as availability of sensor network services, which is threatened by DoS attacks, attacks on routing, and the like. Severe resource constraints
in terms of memory, time, and energy, together with an "unfair" power balance between attackers and sensor nodes, make attaining these security objectives particularly challenging. Approaches proposed for wireless ad hoc networks that are based on asymmetric cryptography are generally considered too resource-consuming. This chapter has reviewed basic considerations on protection against DoS and attacks on routing, and has given an overview of the first approaches proposed so far. For ensuring confidentiality and integrity of data, the SNEP and µTESLA protocols were discussed; concerning key management, the LEAP protocol and probabilistic key management were reviewed. At present there are only a few works on how to design security functions suitable for the specific communication patterns in sensor networks (especially with respect to data aggregation). The witness-based approach described in Reference 22, with its flaws, reveals the difficulties in designing an appropriate protocol for this.
References

[1] Schäfer, G. Security in Fixed and Wireless Networks. John Wiley & Sons, New York, 2003.
[2] Menezes, A., van Oorschot, P., and Vanstone, S. Handbook of Applied Cryptography. CRC Press LLC, Boca Raton, FL, 1997.
[3] Karl, H. and Willig, A. A Short Survey of Wireless Sensor Networks. TKN Technical Report Series, TKN-03-018, Technical University, Berlin, Germany, 2003.
[4] Wood, A. and Stankovic, J. Denial of Service in Sensor Networks. IEEE Computer, 35, 54–62, 2002.
[5] Aura, T., Nikander, P., and Leiwo, J. DOS-Resistant Authentication with Client Puzzles. In Proceedings of the Security Protocols Workshop 2000, Vol. 2001 of Lecture Notes in Computer Science. Springer, Cambridge, UK, April 2000.
[6] Karlof, C. and Wagner, D. Secure Routing in Wireless Sensor Networks: Attacks and Countermeasures. Ad Hoc Networks Journal, 1, 293–315, 2003.
[7] Wood, A. Security in Sensor Networks. Sensor Networks Seminar, University of Virginia, USA, 2001.
[8] Otway, D. and Rees, O. Efficient and Timely Mutual Authentication. ACM Operating Systems Review, 21(1), 8–10, 1987.
[9] Hu, Y., Perrig, A., and Johnson, D. Wormhole Detection in Wireless Ad Hoc Networks. Technical Report TR01-384, Rice University, USA, June 2002.
[10] Perrig, A., Szewczyk, R., Tygar, J., Wen, V., and Culler, D. SPINS: Security Protocols for Sensor Networks. Wireless Networks, 8, 521–534, 2002.
[11] Diffie, W. and Hellman, M.E. New Directions in Cryptography. IEEE Transactions on Information Theory, IT-22, 644–654, 1976.
[12] Rivest, R.L., Shamir, A., and Adleman, L.A. A Method for Obtaining Digital Signatures and Public Key Cryptosystems. Communications of the ACM, 21(2), 120–126, 1978.
[13] ElGamal, T. A Public Key Cryptosystem and a Signature Scheme Based on Discrete Logarithms. IEEE Transactions on Information Theory, 31, 469–472, 1985.
[14] Baldwin, R. and Rivest, R. The RC5, RC5–CBC, RC5–CBC–Pad, and RC5–CTS Algorithms. RFC 2040, IETF, Status: Informational, October 1996. ftp://ftp.internic.net/rfc/rfc2040.txt
[15] Kaliski, B.S. and Yin, Y.L. On the Security of the RC5 Encryption Algorithm. RSA Laboratories Technical Report TR-602, Version 1.0, 1998.
[16] Gong, L., Needham, R.M., and Yahalom, R. Reasoning About Belief in Cryptographic Protocols. In Proceedings of the IEEE Symposium on Research in Security and Privacy. IEEE Computer Society Press, Washington, May 1990, pp. 234–248.
[17] Haller, N., Metz, C., Nesser, P., and Straw, M. A One-Time Password System. RFC 2289, IETF, Status: Draft Standard, February 1998. ftp://ftp.internic.net/rfc/rfc2289.txt
[18] Perrig, A. and Tygar, J.D. Secure Broadcast Communication in Wired and Wireless Networks. Kluwer Academic Publishers, Dordrecht, 2003.
[19] Chan, H., Perrig, A., and Song, D. Random Key Predistribution Schemes for Sensor Networks. In Proceedings of the IEEE Symposium on Security and Privacy. Berkeley, CA, USA, 2003, pp. 197–213.
[20] Zhu, S., Setia, S., and Jajodia, S. LEAP: Efficient Security Mechanisms for Large-Scale Distributed Sensor Networks. In Proceedings of the 10th ACM Conference on Computer and Communication Security. Washington, DC, USA, 2003, pp. 62–72.
[21] Eschenauer, L. and Gligor, V.D. A Key Management Scheme for Distributed Sensor Networks. In Proceedings of the 9th ACM Conference on Computer and Communication Security. Washington, DC, USA, 2002, pp. 41–47.
[22] Du, W., Deng, J., Han, Y., and Varshney, P. A Witness-Based Approach for Data Fusion Assurance in Wireless Sensor Networks. In Proceedings of the IEEE 2003 Global Communications Conference (Globecom 2003). San Francisco, CA, USA, 2003, pp. 1435–1439.
40 Software Development for Large-Scale Wireless Sensor Networks

Jan Blumenthal, Frank Golatowski, Marc Haase, and Matthias Handy
University of Rostock

40.1 Introduction
40.2 Preliminaries
Architectural Layer Model • Middleware and Services for Sensor Networks • Programming Aspect versus Behavioral Aspect
40.3 Current Software Solutions
TinyOS • MATÉ • TinyDB • SensorWare • MiLAN • EnviroTrack • SeNeTs
40.4 Simulation, Emulation, and Test of Large-Scale Sensor Networks
TOSSIM — A TinyOS SIMulator • EmStar • Sensor Network Applications (SNA) — Test and Validation Environment
40.5 Summary
References
40.1 Introduction

The increasing miniaturization of electronic components and advances in modern communication technologies enable the development of high-performance, spontaneously networked, and mobile systems. Wireless microsensor networks promise novel applications in several domains. Forest fire detection, battlefield surveillance, and telemonitoring of human physiological data are only the vanguard of the many improvements encouraged by the deployment of microsensor networks. Hundreds or thousands of collaborating sensor nodes form a microsensor network. Sensor data is collected from the observed area, locally processed or aggregated, and transmitted to one or more base stations. Sensor nodes can be spread out in dangerous or remote environments, whereby new application fields can be opened. A sensor node combines the abilities to compute, communicate, and sense. Figure 40.1 shows the structure of a typical sensor node, consisting of a processing unit, a communication module (radio interface), and sensing and actuator devices. Figure 40.2 shows a scenario taken from the environmental application domain: leakage detection in dykes. During floods, sandbags are used to reinforce dykes.
[Figure 40.1: Structure of a sensor node. A central unit (processor, memory) is connected to a communication module, sensors, an actuator, and a battery.]
[Figure 40.2: Example of a sensor network application: leakage detection. Sandbags equipped with sensors along a river report to a base station.]
Piled along hundreds of kilometers around lakes or rivers, sandbag dykes keep waters at bay and bring relief to residents. Sandbags are stacked against sluice gates and parts of broken dams to block off the tide. To detect spots of leakage, each sandbag is equipped with a moisture sensor and transmits sensor data to a base station next to the dyke. Thus, leakages can be detected earlier and reinforcement actions can be coordinated more efficiently. Well-known research activities in the field of sensor networks are UCLA's WINS [1], Berkeley's Smart Dust [2], WEBS [3], and PicoRadio [4]. An example of European research activities is the EYES project [5]. Detailed surveys on sensor networks can be found in [6] and [7]. This chapter focuses on innovative architectures and basic concepts of current software development solutions for wireless sensor networks.
40.2 Preliminaries

The central unit of a sensor node is a low-power microcontroller that controls all functional parts. Software for such a microcontroller has to be resource-aware on one hand. On the other hand, several Quality-of-Service (QoS) aspects have to be met by sensor node software, such as latency, processing time for data fusion or compression, or flexibility regarding routing algorithms or MAC techniques. Conventional software development for microcontrollers usually covers a hardware abstraction layer (HAL), operating system and protocols, and an application layer. Often, software for microcontrollers is limited to an application-specific monolithic software block that is optimized for performance and resource usage. Abstracting layers, such as the HAL or the operating system, are often omitted due to resource constraints and low-power requirements. Microcontrollers are often developed and programmed for a specific, well-defined task. This limitation of the application domain leads to high-performance embedded systems even under strict resource constraints. Development and programming of such systems, however, requires considerable effort. Furthermore, an application developed for one microcontroller is in most cases not portable to any other one, so that it has to be reimplemented from scratch. Microcontroller and application form an inseparable unit. If the application domain of an embedded system changes, often the whole microcontroller is replaced instead of writing and downloading a new program. For sensor nodes, application-specific microcontrollers are preferred over general-purpose microprocessors because of the small size and the low energy consumption of these controllers. However, the requirements on a sensor node exceed the main characteristics of a conventional microcontroller
and its software. The main reason for this is the dynamic character of a sensor node's task. Sensor nodes can adopt different tasks, such as sensor data acquisition, data forwarding, or information processing. The task assigned to a node at its deployment is not fixed until the end of its life-cycle. Depending on, for instance, the location, energy level, or neighborhood of a sensor node, a task change can become advantageous or even necessary. Additionally, software for sensor nodes should be reusable: an application running on a certain sensor node should not be tied to a specific microcontroller but should, to some extent, be portable onto different platforms, to enhance the interoperability of sensor nodes with different hardware platforms. Not limited to software development for wireless sensor networks is the general requirement for straightforward programmability and, as a consequence, a short development time.

It is quite hard or even impossible to meet the requirements mentioned above with a monolithic application. Hence, at present there is much research effort in the areas of middleware and service architectures for wireless sensor networks. A middleware for wireless sensor networks should encapsulate the required functionality in a layer between operating system and application. Incorporating a middleware layer has the advantage that applications get smaller and are not tied to a specific microcontroller. At the same time, the development effort for sensor node applications (SNAs) is reduced, since a significant part of the functionality moves from the application into the middleware. Another research domain tends toward service architectures for wireless sensor networks. A service layer is based on the mechanisms of a middleware layer and makes its functionality more usable.
40.2.1 Architectural Layer Model

As in other networking systems, the architecture of a sensor network can be divided into different layers (see Figure 40.3). The lowest layers are the hardware and the HAL. The operating system layer and protocols sit above the hardware-related layers. The operating system provides basic primitives, such as multithreading, resource management, and resource allocation, that are needed by higher layers. Access to the radio interface and input/output operations on sensing devices are also supported by basic operating system primitives. Usually, in node-level operating systems these primitives are rudimentary, and there is no separation between user and kernel mode. On top of the operating system layer reside the middleware, service, and application layers. In recent years, much work has been done to develop sensor network node devices (e.g., Berkeley motes [8]), operating systems, and algorithms, for example, for location awareness, power reduction, data aggregation, and routing. Today researchers are working on extended software solutions including middleware and service issues for sensor networks. The main focus of these activities is to simplify the application development process and to support dynamic programming of sensor networks. The overall development process of sensor node software usually ends with a manual download of an executable image over a direct wired connection or an over-the-air interface to the target node.
[Figure 40.3: Layered software model. From bottom to top: hardware, hardware abstraction layer, operating systems and protocols, middleware, services, applications.]
After deployment of the nodes, it is nearly impossible to deploy improved or adapted programs on the target nodes. But this feature is necessary in future wireless sensor networks in order to adapt the behavior of the sensor network dynamically through newly injected programs or capsules, a possibility that exists in MATÉ [24].
40.2.2 Middleware and Services for Sensor Networks

In sensor networks, the design and development of solutions for higher-level middleware functionality and the creation of service architectures are open research issues. Middleware for sensor networks has two primary goals:
• Support of acceptable middleware application programming interfaces (APIs), which abstract and simplify low-level APIs to ease application software development and to increase portability.
• Distributed resource management and allocation.

Besides the native network functions, such as routing and packet forwarding, future software architectures are required to enable the location and utilization of services. A service is a program that can be accessed through standardized functions over a network. Services allow cascading without previous knowledge of each other, and thus enable the solution of complex tasks. A typical service used during the initialization of a node is the localization of a data sink for sensor data. Gateways or neighboring nodes can provide this service. To find services, nodes use a service discovery protocol.
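A sink-discovery step of this kind can be illustrated with a toy in-memory model; this is a generic pattern, not any particular sensor-network discovery protocol, and all names are ours:

```python
# Toy illustration of service discovery during node initialization:
# a node queries its neighbors for a named service (here, a data sink)
# and uses the first provider it finds.
class Node:
    def __init__(self, name, services=()):
        self.name = name
        self.services = set(services)   # services this node provides
        self.neighbors = []

    def discover(self, service):
        """Return the first neighbor offering the service, else None."""
        for n in self.neighbors:
            if service in n.services:
                return n
        return None

gateway = Node("gateway", services={"data-sink"})
sensor = Node("sensor-17")
sensor.neighbors = [Node("sensor-12"), gateway]
assert sensor.discover("data-sink") is gateway
```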
40.2.3 Programming Aspect versus Behavioral Aspect

Wireless sensor networks do not have to consist of homogeneous nodes. In reality, a network composed of several groups of different sensor nodes is conceivable. This fact changes the software development approach and raises challenges that are well known from the distributed systems domain. In an inhomogeneous wireless sensor network, nodes contain different low-level system APIs, however with similar functions. From a developer's point of view it is hard to create programs, since the APIs are mostly incompatible. To overcome the mentioned problems of heterogeneity and complexity, new software programming techniques are required. One attempt to address this is the definition of an additional API or an additional class library on top of each system API. But all such attempts are limited in some way or other, for example, regarding platform independence, flexibility, quantity, or programming language. All approaches aiming to achieve an identical API on different systems are covered by the programming aspect (Figure 40.4).
[Figure 40.4: Two aspects of software for wireless sensor networks. Programming aspect: system-wide API; splitting the complexity of APIs; hiding the heterogeneity of distributed systems; separation of interface and implementation; optimization of interfaces. Behavioral aspect: access to remote resources without previous knowledge; adaptation of software to dynamic changes; task change; evolution of the network over time.]
The programming aspect enables the developer to easily create programs on different hardware and software platforms. But an identical API on all platforms does not necessarily take the dynamics of the distributed system into account. Ideally, the application does not notice any dynamic system changes. This decoupling is termed the behavioral aspect and covers:
• Access to remote resources without previous knowledge, for example, remote procedure calls (RPCs) and discovered services.
• Adaptations within the middleware layer to dynamic changes in the behavior of a distributed system, caused by incoming or leaving resources, mobility of nodes, or changes of the environment.
• The ability of the network to evolve over time, including modifications of the system's task, exchange or adaptation of running software parts, and mobile agents.
40.3 Current Software Solutions

This section presents five important software solutions for sensor networks. It starts with the most mature development, TinyOS, and its dependent software packages. It continues with SensorWare, followed by two promising concepts, MiLAN and EnviroTrack. The section concludes with an introduction to SeNeTs, which features interface optimization.
40.3.1 TinyOS

TinyOS is a component-based operating system for sensor networks developed at UC Berkeley. TinyOS can be seen as an advanced software framework [8] that has a large user community due to its open-source character and its promising design. The framework contains numerous prebuilt sensor applications and algorithms, for example, multihop ad hoc routing, and supports different sensor node platforms. Originally it was developed for Berkeley's Mica Motes. Programmers experienced with the C programming language can easily develop TinyOS applications written in a proprietary language called NesC [9]. The design of TinyOS is based on the specific sensor network characteristics: small physical size, low-power consumption, concurrency-intensive operation, multiple flows, limited physical parallelism and controller hierarchy, diversity in design and usage, and robust operation to facilitate the development of reliable distributed applications. The main intention of the TinyOS developers was "retaining energy, computational and storage constraints of sensor nodes by managing the hardware capabilities effectively, while supporting concurrency-intensive operation in a manner that achieves efficient modularity and robustness" [10]. Therefore, TinyOS is optimized in terms of memory usage and energy efficiency. It provides defined interfaces between components that reside in neighboring layers. A layered model is shown in Figure 40.5.

FIGURE 40.5 Software architecture of TinyOS (layers: Main, including the scheduler; application/user components; sensing, acting, and communication; hardware abstraction).

40.3.1.1 Elemental Properties

TinyOS utilizes an event model instead of a stack-based threaded approach, which would require more stack space and multitasking support for context switching, to handle high levels of concurrency in a very small amount of memory. Event-based approaches are the favored solution for achieving high
performance in concurrency-intensive applications. Additionally, the event-based approach uses CPU resources more efficiently and thereby conserves the most precious resource, energy. An event is serviced by an event handler. More complex event handling can be done by a task; in that case the event handler is responsible for posting the task to the task scheduler. Event and task scheduling is performed by a two-level scheduling structure. This kind of scheduling ensures that events, which involve only a small amount of processing, are handled immediately, while longer-running tasks can be interrupted by events. Tasks are scheduled promptly, but no blocking or polling is permitted. The TinyOS system is designed to scale with technology trends, supporting both smaller designs and the crossover of software components into hardware. The latter provides a straightforward integration of software components into hardware.

40.3.1.2 TinyOS Design

The architecture of a TinyOS system configuration is shown in Figure 40.6. It consists of the tiny scheduler and a graph of components. Components satisfy the demand for modular software architectures. Every component consists of four interrelated parts: a command handler, an event handler, an encapsulated fixed-size and statically allocated frame, and a bundle of simple tasks. The frame represents the internal state of the component. Tasks, commands, and handlers execute in the context of the frame and operate on its state. In addition, the component declares the commands it uses and the events it signals. Through this declaration, modular component graphs can be composed. The composition process creates layers of components. Higher-layer components issue commands to lower-layer components, and these signal events to higher-layer components. To provide an abstract definition of the interaction of two components via commands and events, the bidirectional interface is introduced in TinyOS.
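The following C sketch illustrates this component pattern — a statically allocated frame, a command handler that records its parameter in the frame and posts a task, and a FIFO task queue in which tasks run to completion. All names are invented for this illustration; real TinyOS components are written in NesC, not plain C.

    #include <stdint.h>

    typedef void (*task_fn)(void);

    #define TASK_QUEUE_LEN 8
    static task_fn task_queue[TASK_QUEUE_LEN];
    static unsigned head, tail, count;

    static int post_task(task_fn t) {           /* enqueue in FIFO order */
        if (count == TASK_QUEUE_LEN) return 0;  /* queue full: reject    */
        task_queue[tail] = t;
        tail = (tail + 1) % TASK_QUEUE_LEN;
        count++;
        return 1;
    }

    static struct {                             /* the component's frame */
        uint16_t pending_value;
    } frame;

    static void process_task(void) {            /* longer-running work   */
        /* ... operate on frame.pending_value ... */
    }

    /* Command handler: store the parameter in the frame, post a task,
       and return immediately with status information. */
    int cmd_set_value(uint16_t value) {
        frame.pending_value = value;
        return post_task(process_task);
    }

    void run_scheduler(void) {                  /* tasks run to completion */
        while (count) {
            task_fn t = task_queue[head];
            head = (head + 1) % TASK_QUEUE_LEN;
            count--;
            t();
        }
    }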
FIGURE 40.6 TinyOS architecture in detail.
Commands are nonblocking requests made to lower-layer components. A command provides feedback to its caller by returning status information. Typically, the command handler puts the command parameters into the frame and posts a task into the task queue for execution. Whether the command was successful can later be signaled by an event. Event handlers are invoked by events of lower-layer components or, when directly connected to the hardware, by interrupts. As with commands, the frame is modified and tasks are posted. Both commands and event handlers perform a small, fixed amount of work, similar to interrupt service routines; tasks perform the primary work. Tasks are atomic, run to completion, and can only be preempted by events. They are queued in a First In First Out (FIFO) task scheduler so that event and command handling routines can return immediately. Due to the FIFO scheduling, tasks are executed sequentially and should be short. As an alternative to the FIFO task scheduler, priority-based or deadline-based schedulers can be implemented in the TinyOS framework.

TinyOS distinguishes three categories of components. Hardware abstraction components map physical hardware into the component model. Mostly, these components export commands to the underlying hardware and handle hardware interrupts. Synthetic hardware components extend the functionality of hardware abstraction components by simulating the behavior of advanced hardware functions, for example, bit-to-byte transformation. For future hardware releases, these components can be cast directly into hardware. High-level software components perform application-specific tasks, for example, control, routing, data transmission, calculation on data, and data aggregation.

An interesting aspect of the TinyOS framework is the similarity of the component description to the description of hardware modules in hardware description languages such as VHDL or Verilog. A hardware module, for example in VHDL, is defined by an entity with input and output declarations, status registers to hold the internal state, and a finite state machine controlling the behavior of the module. In comparison, a TinyOS component contains commands and events, the frame, and a behavioral description. These similarities simplify casting TinyOS components into hardware modules. Future sensor node generations can benefit from this similarity in describing hardware and software components.

40.3.1.3 TinyOS Application

A TinyOS application consists of one or more components. These components are separated into modules and configurations. Modules implement application-specific code, whereas configurations wire different components together. By means of a top-level configuration, the wired components can be compiled and linked to form an executable. The interfaces between the components declare a set of commands and events, which provide an abstract description of the components. The application developer has to implement the appropriate handling routines in the components. Figure 40.7 shows the component graph of a simple TinyOS application that turns an LED on and off depending on the clock. The top-level configuration contains the application-specific components (ClockC, LedsC, BlinkM) and an operating-system-specific component providing the tiny task scheduler and initialization functions. The Main component separates the TinyOS-specific components from the application. StdControl, Clock, and Leds are the interfaces used in this application.
While BlinkM contains the application code, ClockC and LedsC are again configurations encapsulating further component graphs that control the hardware clock and the LEDs connected to the controller. TinyOS provides a variety of additional extensions, such as the virtual machine (VM) MATÉ and the database TinyDB for cooperative data acquisition.
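To make this structure concrete, the C sketch below mimics the wiring of the Blink example just described: BlinkM uses a Clock and a Leds interface, and a top-level configuration connects them to concrete providers. This is only an illustration in plain C — in TinyOS the module and the wiring would be expressed in NesC — and all function names here are invented.

    #include <stdio.h>

    struct clock_if { void (*set_rate)(int ticks_per_s); };
    struct leds_if  { void (*red_toggle)(void); };

    /* Providers, standing in for the ClockC and LedsC configurations. */
    static void clockc_set_rate(int tps) { printf("clock at %d Hz\n", tps); }
    static void ledsc_red_toggle(void)   { printf("LED toggled\n"); }

    /* BlinkM: the application module, using the two interfaces. */
    static struct clock_if *clock_if_;
    static struct leds_if  *leds_if_;

    static void blinkm_start(void)      { clock_if_->set_rate(1); }
    static void blinkm_clock_fire(void) { leds_if_->red_toggle(); }

    int main(void) {
        /* Top-level configuration: wire BlinkM to ClockC and LedsC. */
        static struct clock_if clockc = { clockc_set_rate };
        static struct leds_if  ledsc  = { ledsc_red_toggle };
        clock_if_ = &clockc;
        leds_if_  = &ledsc;

        blinkm_start();
        blinkm_clock_fire();  /* in TinyOS, a clock event would fire this */
        return 0;
    }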
FIGURE 40.7 Simple TinyOS application (component graph: the Main component is wired via StdControl to BlinkM, which uses the Clock interface of ClockC and the Leds interface of LedsC; below the TinyOS component graph, ClockC and LedsC drive the hardware clock and LED).

40.3.2 MATÉ

MATÉ [24] is a byte-code interpreter for TinyOS. It is a tiny communication-centric VM designed as a component within the TinyOS system architecture. In the component graph, MATÉ is located on top of several system components: sensor components, a network component, a timer component, and a nonvolatile storage component. The developers' motivation for MATÉ was to solve novel problems in sensor network management and programming that arise when tasks change, for example, the exchange of the data aggregation function
or routing algorithm. However, the inevitable reprogramming of hundreds or thousands of nodes that this implies is constrained by the energy and storage resources of the sensor nodes. Furthermore, the network is limited in bandwidth, and network activity is a large energy draw. MATÉ attempts to overcome these problems by propagating so-called code capsules through the sensor network. The MATÉ VM makes it possible to compose a wide range of sensor network applications from a small set of higher-level primitives. In MATÉ, these primitives are one-byte instructions, and they are stored in capsules of 24 instructions together with identifying and versioning information.

40.3.2.1 MATÉ Architecture

MATÉ is a stack-based architecture, which allows a concise instruction set. The use of these instructions hides the asynchronous character of native TinyOS programming, because instructions are executed one after another as a sequence of TinyOS tasks. The MATÉ VM shown in Figure 40.8 has three execution contexts — Clock, Send, and Receive — which can run concurrently at instruction granularity. Clock corresponds to timer events and Receive to message receive events, signaled from the underlying TinyOS components. Send can only be invoked from the Clock or Receive context. Each context holds an operand stack for handling data and a return stack for subroutine calls. Subroutines allow programs to be more complex than a single capsule can express; therefore, MATÉ has four spaces for subroutine code. The code for the contexts and the subroutines is installed dynamically at runtime by code capsules. One capsule fits into the code space of a context or subroutine. The capsule installation process supports self-forwarding of capsules to reprogram a whole sensor network. It is the task of the sensor network operator to inject code capsules in order to change the behavior of the network. Program execution in MATÉ starts with a timer event or a packet receive event. The program counter jumps to the first instruction of the corresponding context (Clock or Receive) and executes until it reaches the Halt instruction. Each context can call subroutines for extended functionality. The Send context is invoked from the other contexts to send a message in response to a sensor reading or to route an incoming message.
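The C sketch below models a MATÉ-style capsule and dispatch loop, based only on the figures given in the text: one-byte instructions, 24 per capsule, identifying and versioning information, and execution on an operand stack until a Halt instruction. The struct layout and opcode names are assumptions for illustration, not MATÉ's actual encoding.

    #include <stdint.h>

    #define CAPSULE_LEN 24

    enum opcode { OP_HALT = 0, OP_PUSHC, OP_ADD /* ... */ };

    struct capsule {
        uint8_t type;               /* target context or subroutine slot */
        uint8_t version;            /* newer versions replace older ones */
        uint8_t code[CAPSULE_LEN];  /* one-byte instructions             */
    };

    struct context {                /* e.g., Clock, Send, or Receive */
        int16_t operand_stack[16];
        uint8_t sp;
        uint8_t pc;
    };

    void run(struct context *ctx, const struct capsule *cap) {
        ctx->pc = 0;
        while (ctx->pc < CAPSULE_LEN) {
            uint8_t insn = cap->code[ctx->pc++];
            switch (insn) {
            case OP_PUSHC:          /* here: next byte is the constant */
                ctx->operand_stack[ctx->sp++] = cap->code[ctx->pc++];
                break;
            case OP_ADD:            /* pop two operands, push the sum  */
                ctx->sp--;
                ctx->operand_stack[ctx->sp - 1] += ctx->operand_stack[ctx->sp];
                break;
            case OP_HALT:
            default:
                return;             /* execution ends at Halt */
            }
        }
    }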
FIGURE 40.8 MATÉ architecture (the three execution contexts Clock, Send, and Receive, each with code space, operand stack, return stack, and program counter; four subroutine code spaces Sub 0 to Sub 3 reachable via subroutine calls; a single shared variable; MATÉ sits on top of the TinyOS framework components Logger, Sensor, Timer, and Network).
The MATÉ architecture provides separation of contexts: one context cannot access the state of another context. There is only a single shared variable among the three contexts, which can be accessed by special instructions. This context separation qualifies MATÉ to fulfill the traditional role of an operating system. Compared with native TinyOS applications, the source code of MATÉ applications is much shorter.
40.3.3 TinyDB

TinyDB is a query processing system for extracting information from a network of TinyOS sensor nodes [11]. TinyDB provides a simple, SQL-like interface to specify the kind of data to be extracted from the network along with additional parameters, for example, the data refresh rate. The primary goal of TinyDB is to spare the user from writing embedded C programs for sensor nodes or composing instruction capsules as in MATÉ. The TinyDB framework allows data-driven applications to be developed and deployed much more quickly than by developing, compiling, and deploying a TinyOS application. Given a query specifying the data of interest, TinyDB collects the data from sensor nodes in the environment, filters and aggregates the data, and routes it to the user autonomously. The network topology in TinyDB is a routing tree. Query messages flood down the tree, and data messages flow back up the tree, where they can take part in more complex query processing such as aggregation.
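As an impression of such a query, the snippet below wraps an SQL-like request in a small C client. The query text only approximates the SQL-style interface described here, and send_query() is a hypothetical helper; neither is taken verbatim from the TinyDB API.

    #include <stdio.h>

    static void send_query(const char *q) {
        /* would serialize the query and inject it at the bridging node */
        printf("injecting query:\n%s\n", q);
    }

    int main(void) {
        const char *query =
            "SELECT nodeid, temperature "
            "FROM sensors "
            "WHERE temperature > 30 "
            "SAMPLE PERIOD 2048";      /* refresh-rate parameter */
        send_query(query);
        return 0;
    }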
The TinyDB system is divided into two subsystems: the sensor node software and a Java-based client interface on a PC. The sensor node software is the heart of TinyDB, running on each sensor node. It consists of:

• A sensor catalog and schema manager responsible for tracking the set of attributes, or types of readings, and the properties available on each sensor.
• A query processor, utilizing the catalog to fetch the values of local attributes, receive sensor readings from neighboring nodes, combine and aggregate these values, filter them, and output the values to parent nodes.
• A small, handle-based dynamic memory manager.
• A network topology manager that deals with the connectivity of nodes and effectively routes data and query subresults through the network.

The sensor node part of TinyDB is installed on top of TinyOS on each sensor node as an application. The Java-based client interface is used to access the network of TinyDB nodes from a PC physically connected to a bridging sensor node. It provides a simple graphical query builder and a result display. The Java API simplifies writing PC applications that query and extract data from the network.
40.3.4 SensorWare

SensorWare is a software framework for wireless sensor networks that provides querying, dissemination, and fusion of sensor data as well as coordination of actuators [12]. A SensorWare platform has less stringent resource restrictions; the initial implementation runs on iPAQ handhelds (1 MB ROM/128 KB RAM). The authors intended to develop a software framework unconstrained by present sensor node limitations. SensorWare, developed at the University of California, Los Angeles, aims at the programmability of an existing sensor network after its deployment. The functionality of sensor nodes can be dynamically modified through autonomous mobile agent scripts. SensorWare scripts can be injected into the network nodes as queries and tasks. After injection, scripts can replicate and migrate within the network.

The motivation for the SensorWare development was the observation that the distribution of updates and the download of complete images to sensor nodes are impractical for the following reasons. First, in a sensor network, a particular sensor node may not be addressable because of missing node identifiers. Second, the distribution of complete images through a sensor network is highly energy consuming; besides that, other nodes are affected by a download when multihop connections are necessary. Updating complete images does not fit the low-power requirements of sensor networks. As a consequence, it is more practicable to distribute only small scripts. In the following section, the basic architecture and concepts of SensorWare are described in detail.

40.3.4.1 Basic Architecture and Concepts

SensorWare consists of a scripting language and a runtime environment. The language contains various basic commands that control and execute specific tasks of sensor nodes. These tasks include, for example, communication with other nodes, collaboration on sensor data, sensor data filtering, and moving scripts to other nodes. The language comprises the constructs necessary to express appropriate control flows. SensorWare utilizes Tcl as its scripting language but extends Tcl's core commands. These core extension commands are grouped into several APIs, such as a Networking API, a Sensor API, and a Mobility API (see Figure 40.9).

FIGURE 40.9 SensorWare scripting language.

SensorWare is event based. Events are connected to special event handlers. If an event is signaled, an event handler serves the event according to its inherent state. Furthermore, an event handler is able to generate new events and to alter its current state by itself.

FIGURE 40.10 SensorWare runtime environment (a script manager, for example for state tracking and creating new scripts; admission control and policing of resource usage; and resource handling for radio/networking, CPU and time service, and sensing).

FIGURE 40.11 Sensor node architecture (scripts injected by the user migrate between nodes; on each node, applications, services, and scripts run on top of SensorWare, an RTOS, a hardware abstraction layer, and the hardware).

The runtime environment shown in Figure 40.10 contains fixed and platform-specific tasks. Fixed tasks are part of each SensorWare application; platform-specific tasks can be added depending on specific application needs. The script manager task receives new scripts and forwards requests to the admission
control task. The admission control task is responsible for script admission decisions and checks the overall energy consumption. Resource handlers manage the different resources of the network. Figure 40.11 shows the architecture of sensor nodes with the SensorWare software included. The SensorWare layer uses operating system functions to provide the runtime environment and to control scripts. Static node applications coexist with mobile scripts. To realize dynamic programmability of a deployed sensor network, a transient user can inject scripts into the network. After injection, scripts are replicated within the network and the script code migrates between different nodes. SensorWare ensures that no script is loaded twice onto a node during the migration process.
40.3.5 MiLAN

Middleware Linking Applications and Networks (MiLAN) is a middleware concept introduced by Mark Perillo and Wendi B. Heinzelman from the University of Rochester [13,14]. The main idea is to exploit the redundancy of information provided by sensor nodes. The performance of a cooperative algorithm in a distributed sensor network application depends on the number of involved nodes. Because of the inherent redundancy of a sensor network, where several sensor nodes provide similar or even equal information, evaluating all possible sensor nodes leads to high energy and network costs. Therefore, a sensor network application has to choose an appropriate set of sensor nodes to fulfill the application demands. Each application should have the ability to adapt its behavior to the available set of components and the bandwidth within the network. This can be achieved by a parameterized sensor node selection process with different cost values, described by the following cost equations:

• Application performance: The minimum requirements for network performance are calculated from the needed reliability of the monitored data. The application feasible set is

    F_R = { S_i : ∀ j ∈ J, R(S_i, j) ≥ r_j }

where F_R is the allowable set of possible sensor node combinations, S_i represents a combination of available sensor nodes, and R(S_i, j) is the reliability it provides for monitored variable j, which must reach the required reliability r_j.

• Network costs: These define the subset of sensor node combinations that meet the network constraints. The network feasible set is

    F_N = { S_i : N(S_i) ≤ n_0 }

where N(S_i) represents the total network cost and n_0 the maximal data rate the network can support.

• Application performance and network costs are combined into the overall feasible set F = F_R ∩ F_N.

• Energy: This describes the energy dissipation of the network:

    C_P(S_i) = Σ_{s_j ∈ S_i} C_P(s_j)

where C_P(s_j) is the power cost of node s_j.

It is up to the application to decide how these equations are weighted; the resulting node-selection process itself is completely hidden from the application. Thus, the development process is simplified significantly. MiLAN uses two strategies to balance QoS and energy costs:

• Turning off nodes that provide redundant information
• Using energy-efficient routing

The MiLAN middleware is located between the network and application layers. It can interface with a great variety of underlying network protocols, such as Bluetooth and 802.11. MiLAN uses an API to abstract from the network layer but gives the application access to low-level network components. A set of commands identifies and configures the network layer.
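The check behind these equations can be made concrete with a small C sketch: a candidate sensor set is feasible if it satisfies every reliability requirement (F_R) and the network's supported data rate (F_N), and among the feasible sets the one with the lowest power cost is preferred. The data layout and function names are assumptions for illustration, not MiLAN code.

    #include <stdbool.h>
    #include <stddef.h>

    struct node_set {
        double reliability[4];  /* R(S_i, j) per monitored variable j */
        double data_rate;       /* N(S_i): total network cost         */
        double power_cost;      /* C_P(S_i): summed node power costs  */
    };

    bool feasible(const struct node_set *s,
                  const double required[], size_t nvars,
                  double max_rate) {
        for (size_t j = 0; j < nvars; j++)      /* membership in F_R */
            if (s->reliability[j] < required[j])
                return false;
        return s->data_rate <= max_rate;        /* membership in F_N */
    }

    /* Among all feasible sets, pick the one with minimal power cost. */
    const struct node_set *select_set(const struct node_set *sets, size_t n,
                                      const double required[], size_t nvars,
                                      double max_rate) {
        const struct node_set *best = NULL;
        for (size_t i = 0; i < n; i++)
            if (feasible(&sets[i], required, nvars, max_rate) &&
                (!best || sets[i].power_cost < best->power_cost))
                best = &sets[i];
        return best;
    }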
40.3.6 EnviroTrack

EnviroTrack is a TinyOS-based application developed at the University of Virginia that addresses a fundamental distributed computing problem: the tracking of mobile entities in the environment [25]. To this end, EnviroTrack provides a convenient way to program sensor network applications that track activities in their physical environment. The programming model of EnviroTrack integrates objects living in physical time and space into the computational environment of the application through virtual objects, called tracking objects. A tracking object is represented by a group of sensor nodes in its vicinity and is addressed by a context label. If an object moves in the physical environment, the corresponding virtual object moves too, because it is not bound to a dedicated sensor node. EnviroTrack does not assume any cooperation from the tracked entity. Before a physical object or phenomenon can be tracked, the programmer has to specify its activities and the corresponding actions. This specification enables the system to discover and tag those activities and to instantiate tracking objects. For example, to track an object warmer than 100°C, the programmer specifies a Boolean function (temperature > 100°C) and a critical number, or mass, of sensor nodes that must fulfill the Boolean function within a certain time (often referred to as the freshness of the information). These parameters of a tracking object are called its aggregate state. All sensor nodes matching this aggregate
state join a group. The network abstraction layer assigns a context label to this group. Using this label, different groups can be addressed independently of the set of nodes currently assigned to them. If the tracked object moves, nodes join or leave the group because of the changed aggregate state, but the label persists. This group management enables context-specific computation. The EnviroTrack programming system consists of:

EnviroTrack compiler. In EnviroTrack programs, a list of context declarations is defined. Each definition includes an activation statement, an aggregate state definition, and a list of objects attached to the definition. The EnviroTrack compiler includes C program templates. The whole project is then built using the TinyOS development tools.

Group management protocol. All sensors associated with a group are maintained by this protocol. A group leader is selected from the group members when the critical mass of nodes and the freshness of the approximate aggregate state are reached. The group management protocol ensures that only a single group leader per group exists. The leader sends a periodic heartbeat to inform its members that the leader is alive. Additionally, the heartbeat signal is used to synchronize the nodes and to inform nodes that are not part of the group but fulfill the sensing condition.

Object naming and directory services. These services maintain all active objects and their locations. The directory service provides a way to retrieve all objects of a given context type. It also assigns names to groups so they can be accessed easily, and it handles the dynamic joining and leaving of group members.

Communication and transport services. The Migration Transport Protocol (MTP) is responsible for the transportation of data packets between nodes. All messages are routed via group leader nodes. Group leader nodes identify the context group of the target node and the position of its leader using the directory service. The packet is then forwarded to the leader of the destination group. All leadership information provided by MTP packets is stored in the leaders on a least-recently-used basis to keep the leaders up-to-date and to reduce directory lookups.

EnviroTrack enables the construction of an information infrastructure for the tracking of environmental conditions. It manages dynamic groups of redundant sensor nodes and attaches computation to external events in the environment. Furthermore, EnviroTrack implements uninterrupted communication between dynamically changing physical locales defined by environmental events.
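A rough C rendering of such an aggregate-state specification is given below: an activation predicate, a critical mass, and a freshness window. The struct layout, the sensor-read helper, and the concrete values are invented for this illustration; in EnviroTrack itself, such definitions come from context declarations processed by its compiler.

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical sensor access, assumed to exist on the node. */
    extern int read_temperature_c(void);

    typedef bool (*sense_predicate)(void);

    struct aggregate_state {
        sense_predicate condition;  /* e.g., temperature > 100 C       */
        uint8_t  critical_mass;     /* nodes that must satisfy it      */
        uint32_t freshness_ms;      /* readings older than this expire */
    };

    static bool hotter_than_100c(void) {
        return read_temperature_c() > 100;
    }

    /* A group forms (and a context label is assigned) once at least
       critical_mass nodes satisfy the condition within freshness_ms. */
    const struct aggregate_state fire_tracker = {
        .condition     = hotter_than_100c,
        .critical_mass = 3,
        .freshness_ms  = 2000,
    };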
40.3.7 SeNeTs

SeNeTs is a middleware architecture for wireless sensor networks developed at the University of Rostock [15]. The SeNeTs middleware is primarily designed to support the developer of a wireless sensor network during the predeployment phase (programming aspect). SeNeTs supports the creation of small and energy-saving programs for heterogeneous networks. One of its key features is the optimization of APIs; the required configuration, optimization, and compilation of software components is handled by a development environment. Besides the programming aspect, the middleware supports the behavioral aspect as well, such as task changes and evolution over time.

40.3.7.1 SeNeTs Architecture

SeNeTs is based on the software layer model introduced in Chapter 2. To increase the flexibility and enhance the scalability of sensor node software, it separates small functional blocks, as shown in Figure 40.12. In addition, the operating system layer is separated into a node-specific operating system and a driver layer, which contains at least one sensor driver and several hardware drivers, such as a timer driver and an RF driver. The node-specific operating system handles device-specific tasks, for example, boot-up, initialization of hardware, memory management, and process management as well as scheduling. The host middleware is the topmost software layer. Its main task is to organize the cooperation of the distributed nodes in the network. The middleware core handles four optional components, which can be implemented and exchanged according to the node's task. Modules are additional components that increase the functionality of the middleware; typical modules are routing or security modules. Algorithms describe the behavior of modules.
FIGURE 40.12 Structure of a node application (host middleware consisting of algorithms, modules, services, and a VM on top of the middleware core; below, the node-specific operating system, hardware drivers, and sensor driver on the hardware with its sensor).

FIGURE 40.13 Structure of a sensor network (an administration terminal attaches to the sensor network application; the distributed middleware spans nodes A, B, and C, each built from host middleware, operating system, and hardware).
For example, the behavior of a security module can vary in case the encryption algorithm changes. The services component contains the software required to perform local and cooperative services; this component usually cooperates with the service components of other nodes to fulfill its task. VMs enable the execution of platform-independent programs installed at runtime.

Figure 40.13 shows the expansion of the proposed architecture to a whole sensor network from the logical point of view. Nodes can only be contacted through services of the middleware layers. The distributed middleware coordinates the cooperation of services within the network. It is logically located in the network layer but physically exists in the nodes. All layers together, in conjunction with their configuration, compose the sensor network application. Thus, nodes do not perform any individual tasks. The administration terminal is an external entity used to configure the network and evaluate results. It can be connected to the network at any location. All functional blocks of the described architecture are represented by components containing real source code and an XML description of dependencies, interfaces, and parameters. One functional block can be realized by alternative components. All components are predefined in libraries.

40.3.7.2 Interface Optimization

FIGURE 40.14 Interfaces within the software-layer model (from the application through middleware, node-specific operating system, and drivers down to the hardware; the degree of abstraction of the interfaces decreases, and their hardware dependence increases, toward the lower layers).

FIGURE 40.15 Interface optimization (the generic interfaces between middleware, node-specific operating system modules, and sensor drivers are specialized for a concrete processor, for example an ARM, and a concrete sensor, for example a temperature sensor).

One of the key features of SeNeTs is interface optimization. Interfaces are the descriptions of functions between two software parts. As illustrated in Figure 40.14, higher-level applications using services and
middleware technologies require abstract software interfaces. The degree of hardware dependence of the interfaces increases in the lower software layers. Hardware-dependent interfaces are characterized by parameters that configure hardware components directly, in contrast to abstract software interfaces, whose parameters describe abstractions of the underlying system. Software components require a static software interface to the application in order to minimize the customization effort for other applications and to support compatibility. The use of identical components in different applications leads to a higher number of complex interfaces in these components. This is caused by component programming aimed at supporting as many use cases of all conceivable applications as possible, whereby each application uses only a subset of a component's functionality. Reducing the resulting overhead is the objective of generic software and can be achieved by interface optimization at compile time. Interface optimization results in proprietary interfaces within a node (Figure 40.15): parts of the software can no longer be exchanged without considerable effort. In a sensor node, however, the software is mostly static, except for programs for VMs. Accordingly, static linking is preferred. Statically linked software in conjunction with interface optimization leads to faster and smaller programs. In SeNeTs, interfaces are customized to the application, in contrast to common approaches used in desktop computer systems, which are characterized by huge adaptation layers. The interface optimization can be propagated through all software layers and therefore saves resources. As an example of such an optimization, consider a function OpenSocket(int name, int mode) that identifies the network interface with its first parameter and the opening mode with its second parameter. A node that has only one interface, opened with a constant mode once or twice, does not need these parameters. Consequently, knowledge of this information at compile time can be used for optimization (see the sketch below), for example, by:

• Inlining the function
• Eliminating both parameters from the delivery process
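A minimal C illustration of the OpenSocket example follows. The function bodies are placeholders, and OpenSocket_opt is a hypothetical name for what a code generator might emit.

    /* Before optimization: the generic interface carries both
       parameters on every call. */
    int OpenSocket(int name, int mode) {
        /* ... select interface `name`, configure `mode`, open device ... */
        (void)name; (void)mode;
        return 0;
    }

    void app_generic(void) {
        OpenSocket(0, 1);              /* parameters delivered each call */
    }

    /* After optimization: the node has a single interface that is always
       opened in the same mode, so both parameters are folded into the
       function, which can additionally be inlined. */
    static inline int OpenSocket_opt(void) {
        enum { NAME = 0, MODE = 1 };   /* constants, no delivery cost */
        /* ... same open sequence, specialized for NAME and MODE ... */
        return 0;
    }

    void app_optimized(void) {
        OpenSocket_opt();              /* no parameter delivery */
    }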
TABLE 40.1 Types of Interface Optimization

Parameter elimination — Parameters that are not used in any of the called subfunctions can be removed.

Static parameters — If a function is always called with the same parameters, these parameters can be defined as constants or static variables in the global namespace. Thus, the parameter delivery to the function can be removed.

Parameter ordering — The sequence order of parameters is optimized in order to pass parameters through cascading functions with the same or similar parameters. This is particularly favorable in systems using processor registers instead of the system stack to deliver parameters to subfunctions.

Parameter aggregation — In embedded systems, many data types are not byte-aligned, for example, bits that configure hardware settings. If a function has several non-byte-aligned parameters, these parameters may be combined.
Another possibility is to change the semantics of data types. A potential use case is defining the accuracy of addresses, which results in a changed data type width. Several types of interface optimization are proposed in SeNeTs; they are given in Table 40.1. Some optimizations, such as "static parameters," are sometimes counterproductive, in particular if register-oriented parameter delivery is used. This is caused by the use of offset addresses once at parameter delivery instead of absolute addresses embedded in the "optimized" function. Consequently, the introduced optimizations strongly depend on:

• Processor and processor architecture
• Type of parameter delivery (stack or register oriented)
• Memory management (small, huge, size of pointers)
• Objective of optimization (memory consumption, energy consumption, compact code, etc.)
• Sensor network application
40.3.7.3 Development Process

Figure 40.16 shows the development process of sensor node software in SeNeTs. First, for each functional block, the components have to be identified and included in the project. During the design phase, the chosen components are interconnected and configured according to the developer's settings. Then, interface as well as parameter optimization is performed. The final source code is generated, and logging components can be included to monitor runtime behavior. The generated source code is compiled and linked into an executable. During the evaluation phase, the created node application can be downloaded to the node and executed. Considering the monitoring results, a new design cycle can be started to improve the project settings. As a result of the design flow, optimized node application software is generated: the node application now consists only of specially tailored parts needed by the specific application of the node. Optionally, the software components in a node can be linked together statically or dynamically. Static linking facilitates the optimization of interfaces between components within a node. A dynamic link process is used for components exchanged at runtime, for example, algorithms downloaded from other nodes; this procedure results in system-wide interfaces with significant overhead and prevents interface optimization.
FIGURE 40.16 Development process of node software (components → design and edit → interface optimization and source code generation → compiling and linking → evaluation → node software).

40.4 Simulation, Emulation, and Test of Large-Scale Sensor Networks

Applications and protocols for wireless sensor networks require novel programming techniques and new approaches for the validation and test of sensor network software. In practice, sensor nodes have to operate
in an unattended manner. The key to this mode of operation is to separate unnecessary information from important information as early as possible in order to avoid communication overhead. In contrast, during the implementation and test phases, developers need to obtain as much information as possible from the network; a test and validation environment for sensor network applications has to ensure this. Consider a sensor network with thousands of sensor nodes, and suppose we are developing a data fusion and aggregation algorithm that collects sensor information from the nodes and transmits it to a few base stations. During validation and test, developers often have to change application code, recompile, and upload a new image onto the nodes. These updates typically result in flooding the network over the wireless channel, which dissipates a lot of time and energy. And how could we ensure that every node runs the most recent version of our application? Pure simulation produces important insights, but modeling the wireless channel is difficult. Simulation tools often employ simplified propagation models in order to reduce the computational effort for large-scale networks. Widely used simulation tools, such as NS2 [16], use simplified network protocol stacks and do not simulate at bit level. Furthermore, code used in simulations often cannot be reused on real sensor node hardware; why should developers implement applications and protocols twice? In contrast to simulation, implementation on a target platform is often complicated. The targeted hardware itself may still be in the development stage. Perhaps there are a few prototypes, but developers need hundreds of them for realistic test conditions. Moreover, prototype hardware is very expensive and far away from the targeted "1 cent/node." Consequently, a software environment is required that combines the scaling power of simulation with real application behavior. Moreover, the administration of the network must not affect the sensor network applications. In the following, three current software approaches are presented.
40.4.1 TOSSIM — A TinyOS SIMulator

Fault analysis of distributed sensor networks or of their particular components is quite expensive and time-consuming, especially when sensor networks consist of hundreds of nodes. For this purpose, a simulator providing examination of several layers (e.g., the communication layer or the routing layer) is an efficient tool for sensor application development.
TinyOS SIMulator (TOSSIM) is a simulator for wireless sensor networks based on the TinyOS framework. As described in References 17 and 18, the objectives of TOSSIM are scalability, completeness, fidelity, and bridging. Scalability means TOSSIM's ability to handle large sensor networks with many nodes in a wide range of configurations. The reactive nature of sensor networks requires not only the simulation of algorithms but also the simulation of complete sensor network applications; TOSSIM therefore aims at completeness by covering as many system interactions as possible. TOSSIM is able to simulate thousands of nodes running entire applications. The simulator's fidelity becomes important for capturing subtle timing interactions on a sensor node and between nodes; a significant attribute is the revealing of unanticipated events or interactions. Therefore, TOSSIM simulates the TinyOS network stack down to the bit level. Finally, TOSSIM bridges the gap between an academic algorithm simulation and a real sensor network implementation by providing testing and verification of application code that will run on real sensor node hardware. This avoids programming algorithms and applications twice, once for simulation and once for deployment.

The TOSSIM components are integrated into the standard TinyOS compilation tool chain, which supports the direct compilation of unchanged TinyOS applications into the TOSSIM framework. Figure 40.17 shows a TinyOS application divided into hardware-independent and hardware-dependent components. Depending on the target platform, the appropriate hardware-dependent modules are selected in the compilation step. This permits an easy extension to new sensor node platforms and is, at the same time, the interface to the TOSSIM framework. Compared with a native sensor node platform, TOSSIM is a sensor node emulation platform supporting multiple sensor node instances running on standard PC hardware. Additionally, the TOSSIM framework includes a discrete event queue, a small number of reimplemented TinyOS hardware abstraction components, mechanisms for extensible radio and Analog-to-Digital Converter (ADC) models, and communication services through which external programs interact with a simulation.

The core of the simulator is the event queue. Because TinyOS utilizes an event-based scheduling approach, the simulator is event driven too. TOSSIM translates hardware interrupts into discrete simulator events. The simulator event queue emits all events that drive the execution of a TinyOS application. In contrast to real hardware interrupts, events cannot be preempted by other events and therefore are not nested. The hardware emulation of sensor node components is performed by replacing a small number of TinyOS hardware components; these include the ADC, the clock, the transmit-strength potentiometer, the EEPROM, the boot sequence component, and several components of the radio stack. This enables simulations of a large number of sensor node configurations. The communication services are the interface to PC applications driving, monitoring, and actuating simulations by communicating with TOSSIM over TCP/IP. The communication protocol was designed at an abstract level and enables developers to write their own systems that hook into TOSSIM. TinyViz is an example of a TOSSIM visualization tool that illustrates the possibilities of TOSSIM's communication services.
It is a Java-based graphical user interface providing visual feedback on the simulation state and allowing control of running simulations, for example, by modifying ADC readings and radio loss properties. A plug-in interface for TinyViz allows developers to implement their own application-specific visualization and control code. TOSSIM does not model radio propagation, power draw, or energy consumption, and its fidelity is limited in that interrupts are timed by the event queue and are nonpreemptive. In conclusion, TOSSIM is an event-based simulation framework for TinyOS-based sensor networks. The open-source framework and the communication services permit an easy adaptation or integration of simulation models and the connection to application-specific simulation tools.
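The discrete event queue at TOSSIM's core can be illustrated with a few lines of C: hardware interrupts become timestamped events that are emitted in time order and run to completion without preemption. This is a generic sketch of the mechanism, not TOSSIM source code.

    #include <stdio.h>
    #include <stdlib.h>

    struct event {
        unsigned long long time;     /* simulated time of the event  */
        unsigned mote;               /* which simulated node it hits */
        void (*fire)(unsigned mote); /* handler, runs to completion  */
    };

    static int by_time(const void *a, const void *b) {
        const struct event *x = a, *y = b;
        return (x->time > y->time) - (x->time < y->time);
    }

    static void clock_fire(unsigned mote) { printf("mote %u: clock\n", mote); }
    static void rx_fire(unsigned mote)    { printf("mote %u: receive\n", mote); }

    int main(void) {
        struct event queue[] = {
            { 30, 1, rx_fire },
            { 10, 0, clock_fire },
            { 20, 1, clock_fire },
        };
        size_t n = sizeof queue / sizeof *queue;

        qsort(queue, n, sizeof *queue, by_time); /* emit in time order */
        for (size_t i = 0; i < n; i++)
            queue[i].fire(queue[i].mote);        /* no nesting or preemption */
        return 0;
    }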
FIGURE 40.17 Comparison of TinyOS and TOSSIM system architecture (the same TinyOS application and component graph — e.g., APP, AM, CRC, and BYTE layers over TEMP, PHOTO, ADC, CLOCK, and RFM — is compiled either against the hardware-dependent system components ADC, Clock, and RFM of the MICA or RENE mote hardware platforms, or against TOSSIM's reimplementations, which are driven by an event queue, radio and ADC models, and communication services).

40.4.2 EmStar

EmStar is a software environment for developing and deploying applications for sensor networks consisting of 32-bit embedded Microserver platforms [19,20]. EmStar consists of libraries, tools, and services.
Libraries implement primitives of interprocess communication. Tools support simulation, emulation, and visualization of sensor network applications. Services provide network functionality, sensing, and synchronization. EmStar's target platforms are so-called Microservers, typically iPAQ or Crossbow Stargate devices. EmStar does not support Berkeley Motes as a platform but can easily interoperate with Motes. EmStar consists of various components; Table 40.2 gives the name and a short description of each. The last row of the table contains hypothetical, application-specific components; all others are EmStar core components.

TABLE 40.2 EmStar Components

Emrun — Management and watchdog process (responsible for start-up, monitoring, and shut-down of EmStar modules)

Emproxy — Gateway to a debugging and visualization system

udpd, linkstats, neighbors, MicroDiffusion — Network protocol stack for wireless connections

timehist, syncd, audiod — Time synchronization and audio sampling services

FFT, detect, collab_detect — "Hypothetical modules," responsible for Fast Fourier Transformation and (collaborative) event detection

Figure 40.18 illustrates the cooperation of EmStar components in a sample application for environmental monitoring. The sample application collects data from an audio sensor and uses it to detect the position of an animal in collaboration with neighboring sensor nodes.

FIGURE 40.18 EmStar sample application consisting of EmStar core modules (EmRun, EmProxy, udpd, linkstats, neighbors, MicroDiffusion, syncd, timehist, audiod) and hypothetical application-specific modules (FFT, detect, collab_detect), connected through device files such as sensor/audio/0, link/ls0/neighbors, and sync/params, down to the 802.11 NIC and the ADC with its sensor.

40.4.2.1 EmStar Tools and Services

EmStar provides tools for the simulation, emulation, and visualization of a sensor network and its operation. EmSim runs virtual sensor nodes in a pure simulation environment and models both radio and sensor channels. EmCee runs the EmSim core but uses real radios instead of modeled channels. Both EmSim and EmCee use the same EmStar source code and associated configuration files as
a real deployed EmStar system. This eases the development and debugging of sensor network applications. EmView is a visualization tool for EmStar systems that uses a UDP protocol to request status updates from sensor nodes; to obtain sensor node or network information, EmView queries an EmProxy server that runs as part of a simulation or on a real node. EmRun starts, stops, and manages an EmStar system and supports process respawn, in-memory logging, fast startup, and graceful shutdown. EmStar services comprise link and neighborhood estimation, time synchronization, and routing. The Neighborhood service monitors links and maintains lists of active, reliable nodes; EmStar applications can use these lists to be informed about topology changes. The LinkStats service provides applications with more detailed information about link reliability than the Neighborhood service but produces more packet overhead. Multipath-routing algorithms can benefit from the LinkStats service by weighting their
path choices with LinkStats information. The TimeSync service is used to convert timestamps between different nodes. Additionally, EmStar supports several routing protocols and allows the integration of new routing protocols as well.

40.4.2.2 EmStar IPC Mechanism

Communication between EmStar modules is managed by so-called FUSD-driven devices (FUSD — Framework for User-Space Devices), a microkernel extension to Linux. FUSD allows device-file callbacks to be proxied into user space and implemented by user-space programs instead of kernel code. Besides intermodule communication, FUSD allows interaction between EmStar modules and users. FUSD drivers are implemented in user space but can create device files with the same semantics as kernel-implemented device files. Applications can use FUSD-driven devices to transport data or expose state. Several device patterns that are frequently needed in sensor network applications exist for EmStar systems. Examples comprise a status device pattern exposing the current state of a module, a packet device pattern providing a queued multiclient packet interface, a command device pattern that modifies configuration files and triggers actions, and a query device pattern implementing a transactional RPC mechanism.
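Because FUSD devices keep ordinary device-file semantics, a client can consume, say, a status device with nothing but POSIX calls, as in the sketch below. The device path is a made-up example, not a path defined by EmStar.

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        char buf[256];
        /* open the (hypothetical) status device exposed by a module */
        int fd = open("/dev/linkstats/status", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        ssize_t n = read(fd, buf, sizeof buf - 1); /* current module state */
        if (n > 0) {
            buf[n] = '\0';
            printf("status: %s", buf);
        }
        close(fd);
        return 0;
    }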
40.4.3 Sensor Network Applications (SNA) — Test and Validation Environment

In SeNeTs, SNAs run distributed on independent hosts such as PCs, PDAs, or evaluation boards of embedded devices [21]. This parallel execution decouples the applications from the simulation environment. In conventional simulations, concurrently triggered events are processed quasi-parallel or sequentially, which is disadvantageous compared with real-world programs: it results in a sequenced execution of SNAs that actually work in parallel, and thus in corrupted simulation output. SeNeTs prevents this effect. In short, realistic simulation of sensor networks is complicated.

40.4.3.1 System Architecture

The development and particularly the validation of distributed applications are hard to realize. In particular, systems with additional logging and controlling facilities affect the primary behavior of the applications. Suppose a logging message is transmitted; then an application message may be delayed. Especially in wireless applications with limited channel capacity, the increased communication leads to a modified timing behavior and, as a consequence, to different results. The available per-node channel capacity scales as 1/√n, where n is the number of nodes. Due to this degrading channel capacity in large sensor networks, the transport medium acts as a bottleneck [22]. Thus, in wireless sensor networks with thousands of nodes, the bottleneck effect becomes dominant. To eliminate the bottleneck effect, SeNeTs contains two independent communication channels, as illustrated in Figure 40.19.

FIGURE 40.19 Communication channels in SeNeTs (node applications App 1 to App 8 and a base station share the primary transmission channel, while a separate secondary transmission channel carries administration traffic).

FIGURE 40.20 SeNeTs components using the secondary transmission channel (a network server on host spice connects a graphical user interface, scripts, and visualization/evaluation tools, and distributes the network configuration to application servers on hosts speedy and rtl, which manage node applications 1 to 4 and their application configuration).

The primary communication channel is defined by the sensor network application.
It uses the communication method required by the SNAs, for example, Bluetooth or ZigBee. The secondary communication channel is an administration channel used only by SeNeTs components; it transmits controlling and logging messages. It is independent of the primary communication channel and uses a different communication method, for example, Ethernet or ultrasound. The separation into two communication channels simplifies the decoupling of application modules and administration modules after testing. The parallel execution of applications on different host systems requires a cascaded infrastructure to administrate the network. Figure 40.20 displays the important modules in SeNeTs: node applications, application servers (ASs), a network server (NS), and optional evaluation or visualization modules. All of these modules are connected via the secondary transmission channel.

40.4.3.2 Network Server

The NS administrates sensor networks and the associated sensor nodes: it starts, stops, or queries SNAs. In a SeNeTs network, exactly one NS exists; however, this NS is able to manage several sensor networks simultaneously. Usually, the NS runs as a service of the operating system. An NS opens additional communication ports. External programs, such as scripts, websites, or telnet clients, can connect to these ports to send commands. These commands may be addressed and forwarded to groups or to stand-alone components. Furthermore, the NS receives logging messages from applications containing their current state. Optional components, such as graphical user interfaces, can install callbacks to receive this information.

40.4.3.3 Application Server

The AS manages the instances of node applications on one host (Figure 40.20) and acts as a bridge between the node applications and the NS. Usually, at least one AS exists within the SeNeTs network. Ideally, only one node application should be installed per AS to prevent quasi-parallel effects at runtime. The AS runs independently of the NS. It connects to the NS via a pipe to receive commands; each command is multiplexed to one of the connected node applications. Moreover, if the pipe to the NS breaks, the node applications are not affected beyond losing the logging and controlling facilities, and the NS can establish the pipe again later. Generally, an AS starts as a service together with the host's operating system. At startup, it requires the configuration parameters of the node hardware. With these parameters, the AS assigns hardware to node applications. Suppose a host system comprises two devices representing sensor nodes, as shown in principle in Figure 40.20; then the AS requires the device number, the physical position of the node, and so on, to configure the dynamically installed node applications at runtime.

40.4.3.4 SeNeTs Application

FIGURE 40.21 (a) Software layer model of a sensor node application (algorithms, modules, and services above middleware management, the operating system, the sensor driver, and the hardware). (b) Software layer model of a SeNeTs application (the sensor node application is wrapped by SeNeTs Adaptation components — logging, controlling, debugging, SeNeTs hardware abstraction, and environment emulation — and built for a target type: simulator, node, or SeNeTs).

Applications for wireless sensor nodes are usually designed based on a layered software model, as depicted in Figure 40.21(a) [15]. On top of the node's hardware, a specialized operating system is set up, such as
TinyOS [23]. A sensor driver contains software to initialize the measurement process and to obtain sensor data. Above the operating system and the sensor driver, middleware components are located, containing services to aggregate data or to determine the node's position. This modular design allows:

• Abstraction of hardware, for example, sensors, communication devices, and memory
• Adaptation of the node's operating system
• Addition of optional components, for example, for logging and configuration

The SeNeTs Adaptation is a set of components that are added or exchanged to wrap the SNA. Figure 40.21(b) represents the SeNeTs Adaptation layer, consisting of at least a logging component, a controlling unit, a hardware abstraction layer (HAL), and an optional environment emulation module. These additional components provide substantial and realistic test and controlling facilities. An application composed of an SNA and SeNeTs Adaptation components is called a SeNeTs Application (SeA). The SNA is not changed by the added components; generally, it is not necessary to adapt the SNA to SeNeTs interfaces, although supplementary macros can be added to interact with the linked components. An SeA runs as a process of the host system. Due to the architecture of an SNA with its own operating system, the SeA runs autonomously, without interaction with other processes of the host system. At startup, the SeA opens a pipe to communicate with the AS. After the test phase, all SeNeTs components can be removed easily by recompiling the node applications; SeNeTs-specific components and logging calls are automatically deactivated by compiler switches.

40.4.3.5 Environment Management

Sensor network applications require valid environment data, such as temperature or air pressure. Under laboratory conditions, this information is not, or only partly, available. Therefore, environment data must be emulated. SeNeTs provides these environment data to the node application through the environment emulation module (Figure 40.21[b]). All environment emulation modules are controlled by the environment management of the NS, which contains all predefined or configured data (Figure 40.22). These data comprise positions of other nodes, distances to neighboring nodes, and so on; if required, other data types may be added. In the AS, the environment data cache module stores all environment information required by each node application, in order to reduce network traffic.

FIGURE 40.22 Environment management in SeNeTs (the network server on host spice performs command execution, network management, logging, and environment management — with a reachability list, air pressure, and temperature data — and feeds the environment data cache of the application server on host rtl, which performs application management).

FIGURE 40.23 (a) Physically arranged sensor nodes: all nodes are in transmission range of each other. (b) Virtually arranged nodes with appropriate transmission ranges: nodes are no longer able to communicate without routing.

Optionally, position-based filtering is provided by the environment emulation component of SeNeTs. This filtering approach is essential especially when large topologies of sensor nodes have to be emulated under small-sized laboratory conditions. Provided that the real and virtual positions of the nodes are known, a mapping from physical address to virtual address is feasible. A node application only receives messages from nodes that
All other messages are rejected by the SeNeTs Adaptation components. This is accomplished by setting up a filter in the primary communication channel. One application scenario that illustrates position-based filtering is flood prevention. Here, sensor nodes are deployed in sandbags piled along a dyke of hundreds of meters or even kilometers. These nodes measure the humidity and detect potential leakages. Testing this scenario under real-world conditions is impractical and very expensive; yet evaluating the software with respect to real-world behavior (communication effort, self-organization of the network, routing, and data aggregation) is most important. Figure 40.23 illustrates the difference between the laboratory and the real world. Figure 40.23(a) represents laboratory conditions, where all nodes are in transmission range of each other. Figure 40.23(b) sketches the flood prevention scenario under real conditions. In Figure 40.23(a), the nodes A to D are in transmission range of each other; therefore, in contrast to the real-world scenario, no routing is required. Moreover, data aggregation yields wrong results, because the nodes are not grouped as they would be in reality. Thus, if the physical arrangement of the nodes in the test environment does not meet the requirements of the real world, the results are questionable. Assume node A sends a message to node D; due to the physical vicinity in the test environment, all nodes receive the message (Figure 40.23[a]). Nodes C and D receive the message although they are not in the virtual transmission range of node A; the environment emulation module therefore rejects these messages. As a result, SeNeTs prevents a direct transmission from node A to node D: messages can be transmitted only via the routing nodes B and C (Figure 40.23[b]). In short, the emulation of the sensor network software becomes more realistic.
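A minimal sketch of such a position-based filter is given below; the type and function names are hypothetical, and the virtual positions and transmission range are assumed to be supplied by the environment management:

    /* Hypothetical position-based filter: accept a frame only if the sender's
     * virtual position lies within the receiver's virtual transmission range. */
    #include <math.h>
    #include <stdbool.h>

    struct virtual_pos { double x, y; };   /* virtual coordinates in meters */

    bool senets_accept_frame(struct virtual_pos sender,
                             struct virtual_pos receiver,
                             double tx_range_m)
    {
        double dx = sender.x - receiver.x;
        double dy = sender.y - receiver.y;
        return sqrt(dx * dx + dy * dy) <= tx_range_m;
    }

In the flood prevention example, a frame sent by node A would fail this test at nodes C and D and would be delivered only to node B.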
40.5 Summary

At the present time, TinyOS is the most mature operating system framework for sensor nodes. The component-based architecture of TinyOS allows an easy composition of SNAs, and new components can easily be added to TinyOS to support novel sensing or transmission technologies or upcoming sensor node platforms. MATÉ addresses the requirement to change a sensor node's behavior at runtime by introducing a virtual machine on top of TinyOS; by transmitting capsules containing high-level instructions, a wide range of SNAs can be installed dynamically into a deployed sensor network. TinyDB was developed to simplify data querying from sensor networks. On top of TinyOS, it provides an easy-to-use SQL interface to express data queries and addresses the group of users inexperienced in writing embedded C code for sensor nodes. TOSSIM is a simulator for wireless sensor networks based on the TinyOS framework.

EnviroTrack is an object-based programming model for developing sensor network applications that track activities in the physical environment. Its main feature is the dynamic grouping of nodes depending on environmental changes, described by predefined aggregate functions, critical mass, and freshness horizon. SensorWare is a software framework for sensor networks employing lightweight and mobile control scripts that allow the dynamic deployment of distributed algorithms into a sensor network. In comparison to the MATÉ framework, the SensorWare runtime environment supports multiple applications running concurrently on one SensorWare node. The MiLAN middleware provides a framework that optimizes network performance on the basis of equations relating the needed sensing probability to the energy costs; weighting these equations is the programmer's decision. EmStar is a software environment for developing and deploying applications for sensor networks consisting of 32-bit embedded Microserver platforms. SeNeTs is a new approach to optimizing the interfaces of sensor network middleware; it aims at the development of energy-saving applications and at resolving component dependencies at compile time.
References

[1] G.J. Pottie and W.J. Kaiser, Wireless integrated network sensors, Communications of the ACM, 43, 51–58, 2000.
[2] J.M. Kahn, R.H. Katz, and K.S.J. Pister, Next century challenges: mobile networking for smart dust, in Proceedings of the ACM MobiCom'99, Washington, USA, 1999, pp. 271–278.
[3] D. Culler, E. Brewer, and D. Wagner, A platform for WEbS (wireless embedded sensor actuator systems), Technical report, University of California, Berkeley, 2001.
[4] J. Rabaey et al., PicoRadio supports ad hoc ultra-low power wireless networking, IEEE Computer, 33(7), 42–48, 2000.
[5] EYES — Energy-efficient sensor networks, URL: http://eyes.eu.org
[6] I.F. Akyildiz, W. Su, Y. Sankarasubramaniam, and E. Cayirci, A survey on sensor networks, IEEE Communications Magazine, 40(8), 102–114, 2002.
[7] P. Rentala, R. Musunuri, S. Gandham, and U. Saxena, Survey on sensor networks, Technical report UTDCS-10-03, University of Texas, 2003.
[8] J. Hill et al., System architecture directions for networked sensors, in Proceedings of the Ninth International Conference on Architectural Support for Programming Languages and Operating Systems, Cambridge, MA, USA, November 2000.
[9] D. Gay, P. Levis, R. von Behren, M. Welsh, E. Brewer, and D. Culler, The nesC language: a holistic approach to networked embedded systems, in Proceedings of the Conference on Programming Language Design and Implementation (PLDI), San Diego, CA, June 2003.
[10] D. Culler, TinyOS — a component-based OS for the networked sensor regime, URL: http://webs.cs.berkeley.edu/tos/, 2003.
[11] S. Madden, J. Hellerstein, and W. Hong, TinyDB: in-network query processing in TinyOS, Intel Research, IRB-TR-02-014, October 2002.
[12] A. Boulis and M.B. Srivastava, A framework for efficient and programmable sensor networks, in Proceedings of the Fifth IEEE Conference on Open Architectures and Network Programming (OPENARCH 2002), New York, June 2002.
[13] A. Murphy and W. Heinzelman, MiLAN: middleware linking applications and networks, Technical report, University of Rochester, Computer Science Department, URL: http://hdl.handle.net/1802/305, January 2003.
[14] M. Perillo and W. Heinzelman, Providing application QoS through intelligent sensor management, in Proceedings of the First IEEE International Workshop on Sensor Network Protocols and Applications (SNPA'03), Anchorage, AK, USA, May 2003.
[15] J. Blumenthal, M. Handy, F. Golatowski, M. Haase, and D. Timmermann, Wireless sensor networks — new challenges in software engineering, in Proceedings of the Ninth IEEE International Conference on Emerging Technologies and Factory Automation (ETFA), Lisbon, Portugal, September 2003.
[16] The Network Simulator — ns-2, http://www.isi.edu/nsnam/ns
[17] P. Levis et al., TOSSIM: accurate and scalable simulation of entire TinyOS applications, in Proceedings of the First ACM Conference on Embedded Networked Sensor Systems (SenSys 2003), Los Angeles, November 2003.
[18] TOSSIM: A Simulator for TinyOS Networks — User's Manual, in TinyOS documentation.
[19] L. Girod, J. Elson, A. Cerpa, T. Stathopoulos, N. Ramanathan, and D. Estrin, EmStar: a software environment for developing and deploying wireless sensor networks, in Proceedings of USENIX 04, Boston, June 2004.
[20] EmStar: software for wireless sensor networks, URL: http://cvs.cens.ucla.edu/emstar/, 2004.
[21] J. Blumenthal, M. Handy, and D. Timmermann, SeNeTs — test and validation environment for applications in large-scale wireless sensor networks, in Proceedings of the Second IEEE International Conference on Industrial Informatics INDIN'04, Berlin, June 2004.
[22] J. Li, C. Blake, D.S.J. De Couto, H.I. Lee, and R. Morris, Capacity of ad hoc wireless networks, in Proceedings of MobiCom, Rome, July 2001.
[23] Berkeley WEBS: TinyOS, http://today.cs.berkeley.edu/tos/, 2004.
[24] P. Levis and D. Culler, MATÉ: a tiny virtual machine for sensor networks, in Proceedings of the ACM Conference on Architecture Support for Programming Languages and Operating Systems (ASPLOS), San Jose, California, USA, October 2002.
[25] T. Abdelzaher, B. Blum et al., EnviroTrack: an environmental programming model for tracking applications in distributed sensor networks, Technical report CS-2003-02, University of Virginia, 2003.
VI Embedded Applications
Automotive Networks

41 Design and Validation Process of In-Vehicle Embedded Electronic Systems
Françoise Simonot-Lion and YeQiong Song
42 Fault-Tolerant Services for Safe In-Car Embedded Systems
Nicolas Navet and Françoise Simonot-Lion
43 Volcano — Enabling Correctness by Design
Antal Rajnák
41 Design and Validation Process of In-Vehicle Embedded Electronic Systems

Françoise Simonot-Lion
Institut National Polytechnique de Lorraine

YeQiong Song
Université Henri Poincaré

41.1 In-Vehicle Embedded Applications: Characteristics and Specific Constraints . . . . . 41-1
     Economic and Social Context • Several Domains and Specific Problems • Automotive Technological Standards • A Cooperative Development Process
41.2 Abstraction Levels for In-Vehicle Embedded System Description . . . . . 41-8
     Architecture Description Languages • EAST-ADL for In-Vehicle Embedded System Modeling
41.3 Validation and Verification Techniques . . . . . 41-10
     General View of Validation Techniques • Validation by Performance Evaluation
41.4 Conclusions and Future Trends . . . . . 41-20
41.5 Appendix: In-Vehicle Electronic System Development Projects . . . . . 41-21
References . . . . . 41-22
41.1 In-Vehicle Embedded Applications: Characteristics and Specific Constraints

41.1.1 Economic and Social Context

While automobile production is likely to increase slowly in the coming years (42 million cars were produced in 1999 and only 60 million are planned for 2010), the share of embedded electronics, and more precisely of embedded software, is growing. The cost of electronic systems was $37 billion in 1995 and $60 billion in 2000, with an annual growth rate of 10%. In 2006, the electronic embedded system will represent at least 25% of the total cost of a car, and more than 35% for a high-end model [1]. The reasons for this evolution are technological as well as economical. On the one hand, the cost of hardware components is decreasing while their performance and reliability are increasing. The emergence
of automotive embedded networks such as LIN, CAN, TTP/C, FlexRay, MOST, and IDB-1394 leads to a significant reduction of the wiring cost as well. On the other hand, software technology facilitates the introduction of new functions whose development would be costly, or even infeasible, using only mechanical or hydraulic technology, and therefore allows the end-user requirements in terms of safety and comfort to be satisfied. Well-known examples are electronic engine control, ABS, ESP, active suspension, etc. In short, thanks to these technologies, customers can buy a safe, efficient, and personalized vehicle, while the carmakers are able to master the differentiation of product variants and innovation (analysts have stated that more than 80% of innovation, and therefore of added value, will be obtained thanks to electronic systems [2]).

Another new factor is emerging. A vehicle already includes electronic equipment such as hands-free phones, audio/radio devices, and navigation systems. For the passengers, many entertainment devices, such as video equipment, and communication with the outside world will be available in the very near future. Even if these kinds of applications have little to do with the vehicle's operation itself, they significantly increase the amount of software embedded in a car.

Who is concerned by this evolution? First, the vehicle customer, whose requirements are, on the one hand, increased performance, comfort, and assistance for mobility efficiency (navigation) and, on the other hand, reduced fuel consumption and cost; furthermore, the customer requires a reliable embedded electronic system that ensures safety properties. Second, the stakeholders, carmakers and suppliers, who are interested in the reduction of time-to-market, development, production, and maintenance costs. Finally, this evolution has a strong impact on society: legal restrictions on exhaust emissions, and protection of natural resources and the environment.

The electronic systems presented above do not all have to meet the same level of dependability, so their designs call for different techniques. Nevertheless, common characteristics are their distributed nature and the fact that they have to provide a level of quality of service fixed by the market and by the safety and cost requirements. Therefore, their development and production have to be based on a suitable methodology, including modeling, validation, optimization, and test.
41.1.2 Several Domains and Specific Problems

In-vehicle embedded systems are usually classified in four domains that correspond to different functionalities, constraints, and models [3, 4]. Two of them are specifically concerned with safety: the power train and chassis domains. The third one, body, is emerging and presently integrated in a majority of cars. Finally, the telematic, multimedia, and Human Machine Interface domains take benefit of the continuous progress in the fields of multimedia, wireless communications, and the Internet.

41.1.2.1 Power Train

This domain represents the system that controls the motor according to, on the one hand, requests of the driver, which can be explicit orders (speeding up, slowing down, etc.) or implicit constraints (driving facilities, driving comfort, fuel consumption, etc.) and, on the other hand, environmental constraints (exhaust pollution, noise, etc.). Moreover, this control has to take into account requirements from other parts of the embedded system, such as climate control or ESP (Electronic Stability Program). In this domain, the main characteristics are:

• From a functional point of view: the power train control takes into account different working modes of the motor (slow running, partial load, full load, etc.); this corresponds to different and complex multivariable control laws with different sampling periods (classical sampling periods for signals provided by other systems are 1, 2, or 5 msec, while the sampling of signals on the motor itself is in phase with the motor cycle).
• From a hardware point of view: this domain requires sensors whose specification has to consider the minimization of the cost/resolution ratio, and microcontrollers providing high computation power, thanks to their multiprocessor architectures, dedicated coprocessors (floating-point computations), and high storage capacities.
• From an implementation point of view: the specified functions are implemented as several tasks with different activation rules according to the sampling rules, stringent time constraints imposed on task scheduling, and the need to master safe communications with other systems and with local sensors/actuators.

In this domain, systems involve continuous, sampled, and discrete dynamics. Traditional tools for their functional design and modeling are, for example, Matlab/Simulink and Matlab/Stateflow. Currently, the validation of these systems is mainly done by simulation and, for their integration, by emulation methods and/or tests. Last, as illustrated above, the power train domain includes hard real-time systems, so performance evaluation and timing analysis activities have to be carried out on their implementation models.

41.1.2.2 Chassis

The chassis domain gathers all the systems that control the interaction of the vehicle with the road and the chassis components (wheels, suspension, etc.) according to the requests of the driver (steering, braking, or speed-up orders), the road profile, and the environmental conditions (wind, etc.). These systems have to ensure the comfort of the driver and passengers (suspension) as well as their safety. This domain includes systems such as ABS (Anti-lock Braking System), ESP (Electronic Stability Program), ASC (Automatic Stability Control), and 4WD (4 Wheel Drive). Note that chassis is the domain most critical to the safety of the passengers and of the vehicle itself. Furthermore, X-by-Wire technology, currently applied in avionic systems, is emerging in the automotive industry. X-by-Wire is a generic term used when mechanical and/or hydraulic systems are replaced by "electronic" ones (intelligent devices, networks, and computers supporting software components that implement filtering, control, and diagnosis functionalities). For example, we can cite brake-by-wire and steer-by-wire, which will shortly be integrated in cars for the implementation of critical and safety-relevant functions. The characteristics of the chassis domain and the underlying models are similar to those presented for the power train domain, that is, multivariable control laws, different sampling periods, and stringent time constraints. Unlike in the power train domain, however, the systems controlling chassis components are fully distributed. Therefore, the development of such systems must define a feasible system, that is, one satisfying performance, dependability, and safety constraints. Conventional mechanical and hydraulic systems have stood the test of time and have proved to be reliable; this is not the case for critical software-based systems. In the aerospace/avionic industries, where X-by-Wire technology is currently employed, safety properties are ensured by specific hardware and software components, specific fault-tolerant solutions (heavy and costly redundancies of networks, sensors, and computers), and certified design and validation methods. The challenge now is to adapt these solutions to the automotive industry, which imposes stringent constraints on component cost, electronic architecture cost (minimization of redundancies), and development time.

41.1.2.3 Body

Wipers, lights, doors, windows, seats, and mirrors are more and more controlled by software-based systems. These kinds of functions make up the body domain. They are not subjected to stringent performance constraints, but globally they involve many communications between them and, consequently, a complex distributed architecture.
There is an emerging notion of subsystems or subclusters based on low-cost sensor–actuator-level networks, such as LIN, that connect modules realized as integrated mechatronic systems. On the other side, the body domain integrates a central subsystem, termed the "central body electronic," whose main functionality is to ensure message transfers between different systems or domains. This system is recognized to be a central critical entity. The body domain mainly involves discrete-event applications. Their design and validation rely on state-transition models (such as SDL, Statecharts, UML state transition diagrams, or synchronous models). These models allow, mainly by simulation, the validation of a functional specification. Their implementation implies a distribution over a complex hierarchical hardware architecture. High computation power for the central body electronic entity, fault tolerance, and reliability properties are imposed on body domain systems. A challenge in this context is, first, to be able to perform exhaustive analysis of state transition
diagrams and, second, to ensure that the implementation respects the fault tolerance and safety constraints. The problem here is to achieve a good balance between the time-triggered approach and flexibility.

41.1.2.4 Telematic and Human Machine Interface

The next generation of telematic devices provides a new, sophisticated Human Machine Interface (HMI) to the driver and the other occupants of a vehicle. These devices enable the occupants not only to communicate with other systems inside the vehicle but also to exchange information with the external world. Such devices will be upgradeable in the future, and for this domain a "plug-and-play" approach has to be favored. These applications have to be portable, and the services furnished by the platform (operating system and/or middleware) have to offer generic interfaces and downloading facilities. The main challenge here is to preserve the security of the information flowing from, to, or inside the vehicle. Sizing and validation do not rely on the same methods as in the other domains: here we shift from considering messages, tasks, and deadline constraints to fluid data streams, bandwidth sharing, and multimedia quality of service, and from safety and hard real-time constraints to security of information and soft real-time constraints. Although this domain is more related to entertainment activities, some interactions exist with the other domains. For example, the telematic framework offers a support for future remote diagnostic services. In particular, the standard OBD-3, currently under development, extends OBD-2 (Enhanced On-Board Diagnostics) by adding telemetry. Like its predecessor, it defines a protocol for collecting measures on the power train physical equipment and alerting the driver if necessary, as well as a protocol for the exchanges with a scan tool. Thanks to a technology similar to that already being used for automatic electronic toll collection systems, an OBD-3-equipped vehicle would be able to report the vehicle identification number and any emission problems directly to a regulatory agency.

41.1.3 Automotive Technological Standards

A first way of ensuring some level of interoperability between components developed by different partners is the standardization of the services that share the hardware resources between the application processes. For this reason, in the current section we provide an outline of the main standards used in the automotive industry, in particular the networks and their protocols and the operating systems. Then we introduce some work in progress on the definition of a middleware, which will be a solution for portability and flexibility purposes.

41.1.3.1 Networks and Protocols

Due to the stringent cost, real-time, and reliability constraints, specific communication protocols and networks have been developed to fulfill the needs of ECU (Electronic Control Unit) multiplexing. SAE has defined three distinct protocol classes, named class A, B, and C. A class A protocol is defined for interconnecting actuators and sensors with a low bit rate (about 10 Kbps); an example is LIN. A class B protocol supports a data rate as high as 100 Kbps and is designed to support non-real-time control and inter-ECU communication; J1850 and low-speed CAN are examples of SAE class B protocols. A class C protocol is designed to support real-time and critical applications. Networks like high-speed CAN and TTP/C belong to class C and support data rates of one or several megabits per second. This section outlines the best known of these networks.

41.1.3.1.1 Controller Area Network
Controller Area Network (CAN) [5,6] is without any doubt the most widely used in-vehicle network. CAN was initially designed by the Robert Bosch company at the beginning of the 1980s for multiplexing the increasing number of ECUs in a car. It became an ISO standard in 1994 and is now a de facto standard for data transmission in automotive applications, due to its low cost, robustness, and bounded communication delay. CAN is mainly used in the power train, chassis, and body domains. Further information on CAN-related protocols and developments, including TTCAN, can be found at http://can-cia.org/.
41.1.3 Automotive Technological Standards A way for ensuring some level of interoperability between components developed by different partners is brought at first by the standardization of services sharing the hardware resources between the application processes. For this reason, in the current section, we provide an outline of the main standards used in automotive industry, in particular the networks and their protocols and the operating systems. Then, we introduce some works in progress for the definition of a middleware that will be a solution for portability and flexibility purpose. 41.1.3.1 Networks and Protocols Due to the stringent cost, real-time, and reliability constraints, specific communication protocols and networks have been developed to fulfill the needs of the ECU (Electronic Control Unit) multiplexing. SAE has defined three distinct protocol classes named class A, B, and C. Class A protocol is defined for interconnecting actuators and sensors with a low bit rate (about 10 Kbps). An example is LIN. Class B protocol supports a data rate as high as 100 Kbps and is designed for supporting nonreal-time control and inter ECU communication. J1850 and low speed CAN are examples of SAE class B protocol. Class C protocol is designed for supporting real-time and critical applications. Networks like high speed CAN, TTP/C belong to the class C, which support data rates as high as 1 or several mega bits per second. This section intends to outline the most known of them. 41.1.3.1.1 Controller Area Network Controller Area Network (CAN) [5,6] is without any doubt the mostly used in-vehicle network. CAN was initially designed by “Robert Bosch” company at the beginning of the 1980s for multiplexing the increasing number of ECUs in a car. It became an OSI standard in 1994 and is now a de facto standard for data transmission in automotive applications due to its low cost, robustness, and bounded communication delay. CAN is mainly used in power train, chassis, and body domains. Further information on CAN related protocols and development, including TTCAN, could be found in http://can-cia.org/. Controller Area Network is a priority-based bus that allows to provide a bounded communication delay for each message priority. The MAC (Medium Access Control) protocol of CAN uses CSMA with bit-by-bit
© 2006 by Taylor & Francis Group, LLC
Design and Validation Process TABLE 41.1 CAN Bit
CAN and VAN Frame Format
SOF 1
VAN Bit Time slot
41-5
ID 11/29
RTR (Reserved) 1 2
DLC
Data
CRC
ACK
EOF
IFS
4
0–64
16
2
7
3
SOF
ID
Command
Data
CRC
EOD
ACK
EOF
IFG
— 10
12 15
4 5
(0 – 28)·8 (0 – 28)·10
15 15 + 3
— 2
— 2
— 8
— 4
nondestructive arbitration over the ID field (Identifier). The identifier is coded using 11 bits (CAN2.0A) or 29 bits (CAN2.0B) and it also serves as priority. Up to 8 bytes of data can be carried by one CAN frame and a CRC of 16 bits is used for transmission error detection. CAN uses a NRZ bit encoding scheme for making the bit-by-bit arbitration feasible with a logical AND operation. However the use of bit-wise arbitration scheme intrinsically limits the bit rate of CAN as the bit time must be long enough to cover the propagation delay on the whole network. A maximum of 1 Mbps is specified to a CAN bus not exceeding 40 m. The maximum message transmission time should include the worst-case bit stuffing number (CAN2.0A). This length is given by: 34 + 8 · DLC · τbit Ci = 44 + 8 · DLC + 4
(41.1)
where DLC is the data length in bytes and τbit is the bit time; the term ⌊(34 + 8 · DLC)/4⌋ represents the worst-case overhead due to bit stuffing, a technique implemented by CAN for bit synchronization, which consists in inserting a bit of opposite polarity every time five consecutive bits of the same polarity are encountered. The frame format is given in Table 41.1. We will not detail the meaning of the fields here; note, however, that the Inter Frame Space (IFS) has to be considered when calculating the bus occupation time of a CAN message.

41.1.3.1.2 Vehicle Area Network
Vehicle Area Network (VAN) [7, 8] is quite similar to CAN. It was used by the French carmaker PSA Peugeot-Citroën for the body domain. Although VAN has some technical features that are more interesting than CAN's, it was never largely adopted by the market and has now been abandoned in favor of CAN. Its MAC protocol is also CSMA with bit-by-bit nondestructive arbitration over the ID field (Identifier), coded with 12 bits. Up to 28 bytes of data can be carried by one VAN frame, and a 15-bit CRC is used. The bit rate can reach 1 Mbps. One of the main differences between CAN and VAN is that CAN uses an NRZ code while VAN uses a so-called E-Manchester (Enhanced Manchester) code: a binary sequence is divided into blocks of 4 bits, and the first three bits are encoded using the NRZ code (whose duration is defined as one time slot per bit) while the fourth one is encoded using the Manchester code (two time slots per bit). This means that 4 bits of data are encoded using 5 time slots (TS), which is why this coding is sometimes denoted 4B/5B. Thanks to the E-Manchester coding, VAN, unlike CAN, does not need bit stuffing for bit synchronization. The format of a VAN frame is given in Table 41.1. The transmission duration (or equivalent frame length) of a VAN frame is given by:

Ci = (60 + 10 · DLC) · TS    (41.2)
Note, however, that the Inter Frame Gap (IFG), fixed at 4 TS, has to be considered when calculating the total bus occupation time of a VAN message. Finally, VAN has one feature that is not present in CAN: the in-frame response capability. The same single frame can include the remote message request of the consumer (identifier and command fields) and the immediate response of the producer (data and CRC fields).
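As an illustration (not from the chapter), the following sketch implements equations (41.1) and (41.2):

    /* Worst-case frame transmission times according to equations (41.1) and (41.2).
     * For total bus occupation, add 3 bit times (CAN IFS) or 4 time slots (VAN IFG). */

    /* CAN 2.0A: 44 overhead bits, plus payload, plus worst-case stuff bits. */
    double can_frame_time(unsigned dlc_bytes, double bit_time)
    {
        unsigned stuff = (34 + 8 * dlc_bytes) / 4;   /* integer division = floor */
        return (44 + 8 * dlc_bytes + stuff) * bit_time;
    }

    /* VAN: (60 + 10 * DLC) time slots; the 4B/5B overhead is already included. */
    double van_frame_time(unsigned dlc_bytes, double time_slot)
    {
        return (60 + 10 * dlc_bytes) * time_slot;
    }

For example, at 250 Kbps (τbit = 4 μsec), an 8-byte CAN frame occupies 44 + 64 + ⌊98/4⌋ = 132 bit times, that is, 528 μsec before the IFS is added.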
41.1.3.1.3 J1850
SAE J1850 [9] was developed in North America and has been used by carmakers such as Ford, GM, and DaimlerChrysler. Its MAC protocol follows the same principle as CAN and VAN, that is, it uses CSMA with bit-by-bit arbitration for collision resolution. J1850 supports two data rates: 41.6 Kbps for PWM (Pulse Width Modulation) and 10.4 Kbps for VPW (Variable Pulse Width). The maximum data length is 11 bytes. Typical applications are SAE class B ones, such as instrumentation/diagnostics and data sharing in engine, transmission, and ABS.

41.1.3.1.4 TTP/C
The Time-Triggered Protocol (TTP/C) [10] has been developed at the Vienna University of Technology. Hardware implementations of the TTP/C protocol, as well as software tools for the design of the application, are commercialized by TTTech (www.tttech.com). At the MAC layer, the TTP/C protocol implements a synchronous TDMA scheme: the stations (or nodes) access the bus in a strict deterministic sequential order. Each station possesses the bus for a constant time duration, called a slot, during which it has to transmit one frame. The sequence of slots in which every station has accessed the bus once is called a TDMA round. TTP/C is suitable for SAE class C applications with a strong emphasis on fault tolerance and deterministic real-time behavior. It is now one of the two candidates for X-by-Wire applications. The bit rate is not limited by the TTP/C specification. Today's available controllers (TTP/C C2 chips) support data rates of 5 Mbps in asynchronous mode and 5 to 25 Mbps in synchronous mode.

41.1.3.1.5 FlexRay
The FlexRay protocol (www.flexray.com) is currently being developed by a consortium of major companies from the automotive field. The purpose of FlexRay is, like TTP/C, to provide X-by-Wire applications with deterministic real-time and reliable communication. The specification of the FlexRay protocol was, however, neither publicly available nor finalized at the time of writing of this chapter. The FlexRay network is very flexible with regard to topology and transmission-support redundancy. It can be configured as a bus, a star, or multiple stars, and it is not mandatory that each station possess replicated channels, even though this should be the case for X-by-Wire applications. At the MAC level, FlexRay defines a communication cycle as the concatenation of a time-triggered (or static) window and an event-triggered (or dynamic) window. A different protocol applies to each communication window, whose size is set at design time. The communication cycles are executed periodically. The time-triggered window uses a TDMA protocol. In the event-triggered part of the communication cycle, the protocol is FTDMA (Flexible Time Division Multiple Access): time is divided into so-called mini-slots; each station possesses a given number of mini-slots (not necessarily consecutive) and can start the transmission of a frame inside each of its own mini-slots. A mini-slot remains idle if the station has nothing to transmit.

41.1.3.1.6 Local Interconnect Network
Local Interconnect Network (LIN) (www.lin-subbus.org) is a low-cost serial communication system intended for SAE class A applications, where the use of other automotive multiplex networks such as CAN is too expensive. Typical applications are in the body domain, for controlling doors, windows, seats, the roof, and the climate. Besides the cost consideration, LIN is also a subnetwork solution for reducing the total traffic load on the main network (e.g., CAN) by building a hierarchical multiplex system. For this purpose, many gateways exist, allowing, for example, a LIN subnetwork to be interconnected to CAN. The LIN protocol is based on the master/slave model.
A slave node must wait to be polled by the master before transmitting data. The data length can be 1, 2, 4, or 8 bytes. A master can handle at most 15 slaves (there are 16 identifiers per class of data length). LIN supports data rates up to 20 Kbps (limited for EMI reasons).
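A LIN master typically drives the bus from a static schedule table. The following sketch is purely illustrative; the driver calls lin_send_header() and wait_ms() are hypothetical names, not part of the LIN specification:

    /* Illustrative LIN master loop: the master polls each identifier in turn;
     * the addressed slave answers in the response field of the same frame. */
    void lin_send_header(unsigned char id);   /* hypothetical: break + sync + identifier */
    void wait_ms(unsigned ms);                /* hypothetical timing service */

    struct lin_slot { unsigned char id; unsigned slot_ms; };

    static const struct lin_slot schedule[] = {
        { 0x01, 10 },   /* e.g., door module status   */
        { 0x02, 10 },   /* e.g., window lift position */
        { 0x03, 20 },   /* e.g., seat position        */
    };

    void lin_master_cycle(void)
    {
        for (unsigned i = 0; i < sizeof schedule / sizeof schedule[0]; i++) {
            lin_send_header(schedule[i].id);
            wait_ms(schedule[i].slot_ms);     /* leave room for the slave response */
        }
    }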
41.1.3.1.7 Media Oriented System Transport
Media Oriented System Transport (MOST) (http://mostnet.de/) is a multimedia fiber-optic network developed in 1998 by the MOST Cooperation (a kind of consortium composed of carmakers, set makers, system architects, and key component suppliers). The basic application blocks supported by MOST are audio and video transfer, on which end-user applications like radios, GPS navigation, video displays, amplifiers, and entertainment systems can be built. The MOST protocol defines data channels and control channels. The control channels are used to set up the data channels that the sender and receiver use. Once the connection is established, data can flow continuously, delivering streaming (audio/video) data. The MOST network offers a data rate of 24.8 Mbps.

41.1.3.1.8 IDB-1394
IDB-1394 is an automotive version of IEEE 1394 for in-vehicle multimedia and telematic applications, jointly developed by the IDB Forum (www.idbforum.org) and the 1394 Trade Association (www.1394ta.org). IDB-1394 defines a system architecture/topology that permits existing IEEE-1394 consumer electronics devices to interoperate with embedded automotive-grade devices. The system topology consists of an automotive-grade embedded plastic optical fiber network including cable and connectors, embedded network devices, one or more consumer convenience port interfaces, and the ability to attach hot-pluggable portable devices. The IDB-1394 embedded network supports data rates of 100, 200, and 400 Mbps. The maximum number of embedded devices is limited to 63 nodes. From the point of view of both data rate and interoperability with existing IEEE-1394 consumer electronics devices, IDB-1394 is a serious competitor of the MOST technology.

41.1.3.2 Operating Systems
OSEK/VDX (Offene Systeme und deren Schnittstellen für die Elektronik im Kraftfahrzeug) [11] is a multitask operating system that has become a standard in the European automotive industry. Two types of tasks are supported by OSEK/VDX: basic tasks, without blocking points, and extended tasks, which can include blocking points. This operating system does not allow the dynamic creation/destruction of tasks. It implements a Fixed Priority (FP) scheduling policy combined with the Priority Ceiling Protocol (PCP) [12] to avoid priority inversion or deadlock due to exclusive resource access. OSEK/VDX offers a synchronization mechanism through private events and alarms. A task can be preemptive or nonpreemptive. An implementation of OSEK/VDX has to be compliant with one of four conformance classes — BCC1, BCC2, ECC1, ECC2 — defined according to the supported tasks (basic only, or basic and extended), the number of tasks on each priority level (only one, or possibly several), and the limit of the reactivation counter (one or several). The MODISTARC project (Methods and tools for the validation of OSEK/VDX-based DISTributed ARChitectures) [13] aims to provide the relevant test methods and tools to assess the conformance of OSEK/VDX implementations. OSEK/VDX Com and OSEK/VDX NM complement OSEK/VDX with communication and network management services. Furthermore, the language OSEK/OIL (OSEK Implementation Language) is a basis both for the configuration of an application and for the tuning of the required operating system. In order to ensure dependability and fault tolerance for critical applications, the time-triggered operating system OSEKtime [11] was proposed. It supports static scheduling; offers interrupt handling, dispatching, system time and clock synchronization, local message handling, and error detection mechanisms; and provides predictability and dependability through fault detection and fault tolerance mechanisms. It is compatible with OSEK/VDX and is completed by the FTCom (Fault Tolerant Communication) layer for communication services.
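To make the distinction between the two task types concrete, here is a minimal sketch; the API calls are those of the OSEK OS specification, while the task and event names, and the header name, are illustrative:

    #include "os.h"               /* OSEK OS header; the actual name varies by implementation */

    DeclareEvent(MsgReceived);

    TASK(SamplingTask)            /* basic task: no blocking point, runs to completion */
    {
        /* read sensor, produce message ... */
        TerminateTask();          /* a basic task must terminate itself */
    }

    TASK(ConsumerTask)            /* extended task: may block on private events */
    {
        for (;;) {
            WaitEvent(MsgReceived);   /* blocking point */
            ClearEvent(MsgReceived);
            /* consume the received message ... */
        }
    }

The priorities, activation counters, and task types themselves would be declared statically in the OIL configuration file, consistent with the absence of dynamic task creation.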
Rubus is another operating system tailored for the automotive industry. It is developed by Arcticus Systems [14], with support from the research community, and is used, for example, by Volvo Construction Equipment. The Rubus OS consists of three parts: the Red Kernel, which manages the execution of offline-scheduled time-triggered tasks; the Blue Kernel, dedicated to the execution of event-triggered tasks; and the Green Kernel, in charge of external interrupts. These three operating systems are well suited to the power train, chassis, and body domains, because the number of tasks integrated in these applications is known offline. On the other hand, they do not fit the requirements of telematic applications. For this last domain other systems are available, for example, Windows CE for Automotive, which extends the classical Windows CE operating system with telematic-oriented features.
Finally, an important issue for multipartner development and for the flexibility requirement is the portability of software components. For this purpose, several projects aim to specify an embedded middleware that hides the specific communication system (portability) and supports fault tolerance (see the Titus project [15], the ITEA EAST-EEA project [24], the DECOS project [16], or Volcano [17]). Note that these projects, as well as the Rubus Concept [14], provide not only a middleware or an operating system but also a component-based approach to designing a real-time distributed embedded application.
41.1.4 A Cooperative Development Process

Strong cooperation between suppliers and carmakers in the design process implies the development of a specific concurrent engineering approach. For example, in Europe or Japan, carmakers provide the specification of subsystems to suppliers, who are then in charge of the design and realization of these subsystems, including the software and hardware components and possibly the mechanical or hydraulic parts. The results are furnished to the carmakers, who have to integrate them in the car and test them. The last step consists of calibration activities, that is, tuning some control and regulation parameters to meet the required performances of the controlled systems. This activity is closely related to testing activities. In the United States the process is slightly different, as the suppliers cannot really be considered independent of the carmakers. Nevertheless, the subsystem integration and calibration activities always have to be done and, obviously, any error detected during this integration leads to a costly feedback on the specification or design steps. Therefore, in order to improve the quality of the development process, new design methodologies are emerging. In particular, the different actors of a system development increasingly apply methods and techniques that ensure the correctness of subsystems as early as possible in the design stages, and a new trend is to consider the integration of subsystems at a virtual level [18]. This means that carmakers as well as suppliers will be able to design, prove, and validate the models of each subsystem and of their integration at each level of the development in a cooperative way. This new practice will significantly reduce the cost of development and production of new electronic embedded systems while increasing flexibility in the design of variants.
41.2 Abstraction Levels for In-Vehicle Embedded System Description

As shown in Section 41.1.4, the way to improve the quality and the flexibility of an embedded electronic system while decreasing its development and production cost is to design and validate this system at a virtual level. The first problem is therefore to identify the abstraction levels at which the components and the whole system are to be represented. The second problem, in order to apply validation and verification techniques to the models, consists in specifying which validation and verification activities have to be applied and, consequently, which formalisms support the identified models.
41.2.1 Architecture Description Languages

Two main keywords were introduced above: architectures, which refer to the concept of an Architecture Description Language (ADL), well known in computer science, and components, which lead to modularity principles and the object approach. An ADL is a formal approach for software and system architecture specification [19]. In the avionic context, where the development of embedded systems raises similar problems, MetaH [20] was developed at Honeywell and, in 2001, was chosen as the basis of a standardization effort aiming to define an Avionics Architecture Description Language (AADL) standard under the authority of SAE. This language can describe the standard control and data flow mechanisms used in avionic systems, as well as important nonfunctional aspects such as timing requirements, fault and error behaviors, time and space partitioning, and safety and certification properties. In the automotive industry,
some recent efforts have brought a solution for mastering the design, modeling, and validation of in-vehicle electronic embedded systems. The first result was obtained by the French project AEE (Architecture Embedded Electronic) [21], more specifically through the definition of AIL_Transport (Architecture Implementation Language for Transport). This language, based on UML, allows electronic embedded architectures to be specified within a single framework, from the highest level of abstraction, for the capture of requirements and the functional views, down to the lowest level, for the modeling of an implementation taking into account the services and performances of the hardware supports and the distribution of the software components [22,23].
41.2.2 EAST-ADL for In-Vehicle Embedded System Modeling

Taking AIL_Transport as one of its entry points, the European project ITEA EAST-EEA [24] (July 2001 to June 2004) defined a new language named EAST-ADL. Like AIL_Transport, EAST-ADL offers a support for the unambiguous description of in-vehicle embedded electronic systems at each level of their development. It provides a framework for the modeling of such systems through seven views (see Figure 41.1) [25]:

• The vehicle view describes user-visible features such as anti-lock braking or windscreen wipers.
• The functional analysis architecture level represents the functions realizing the features, their behavior, and their exchanges. There is an n-to-n mapping between vehicle view entities and functional analysis architecture entities, that is, one or several functions may realize one or several features.
• The functional design architecture level models a decomposition or refinement of the functions described at the functional analysis architecture level, in order to meet constraints regarding allocation, efficiency, reuse, supplier concerns, and so on. Again, there is an n-to-n mapping between entities of the functional design architecture and of the functional analysis architecture.
• The logical architecture level is where the class representation of the functional design architecture has been instantiated into a flat software structure suitable for allocation. This level provides an abstraction of the software components to be implemented on the hardware architecture. The logical architecture contains the leaf functions of the functional design architecture; from the logical architecture point of view, the code could be generated automatically in many cases.
FIGURE 41.1 The abstraction layers of the EAST-ADL.
In parallel to the application functionality, the execution environment is modeled through three views:

1. The hardware architecture level includes the description of the ECUs, and more precisely of the microcontroller used, the sensors and actuators, the communication links (serial links, networks), and their connections.
2. The technical architecture level gives the model of the operating system or middleware API and of the services provided (in particular, the behavior of the middleware services, the schedulers, frame packing, and memory management).
3. The operational architecture models the tasks, managed by the operating systems, and the frames, managed by the protocols. On this lowest abstraction level, all implementation details are captured.

A system described on the functional analysis level may be loosely coupled to hardware, based on intuition, on various known constraints, or as a back-annotation from more detailed analysis on lower levels. Furthermore, the structure of the functional design architecture and of the logical architecture is aware of the technical architecture. Finally, EAST-ADL ensures consistency within and between artifacts belonging to the different levels, from both a syntactic and a semantic point of view. This makes an EAST-ADL-based model a strong and unambiguous support for automatically building models suited to optimal configuration and/or validation and verification activities. For each identified objective (simulation or formal analysis at the functional level, optimal distribution, frame packing, round building for TDMA-based networks, formal test sequence generation, timing analysis, performance evaluation, dependability evaluation, etc.), a software tool, specific to the activity, to the related formalism, and to EAST-ADL, extracts the relevant data from the EAST-ADL repository and translates them into the adequate formalism. The concerned activity can then run, using the adequate tools.
41.3 Validation and Verification Techniques

In this section, we briefly introduce, in Section 41.3.1, the validation issues in the automotive industry and the place of these activities in the development process, and we detail, in Section 41.3.2, a specific validation technique that aims to prove that an operational architecture meets its performance properties.
41.3.1 General View of Validation Techniques

The validation of an embedded system consists of proving, on the one hand, that this system implements all the required functionalities and, on the other hand, that it ensures functional and extra-functional properties such as performance and safety. From an industrial point of view, validation and verification activities address two complementary objectives:

1. Validation and verification of all or parts of a system at the functional level, without taking into account the implementation characteristics (e.g., hardware performance). For this purpose, simulation or formal analysis techniques can be used.
2. Verification of properties of all or parts of a system at the operational level. These activities integrate the performances of both the hardware and the technical architectures and the load that is due to a given allocation of the logical architecture. This objective can also be reached through simulation and formal analysis techniques.

Furthermore, according to the level of guarantee required for the system under verification, a designer may need deterministic guarantees or simply probabilistic ones, involving different approaches. The expression "formal analysis" is employed when mathematical techniques can be applied to an abstraction of the system, while "simulation" denotes the possibility of executing a virtual abstraction of it. Obviously, formal analysis leads to an exhaustive analysis of the system (more precisely, of the model that abstracts it) and provides a precise and definitive verdict. Nevertheless, the level of abstraction or the
accuracy of a model is in inverse ratio to its capacity to be treated in a bounded time. So this technique is generally not suitable for large systems at a fine-grained abstraction level, as required, for example, for the verification of performance properties of a large distributed operational architecture; in such a case, the system is modeled by timed automata or queuing systems whose complexity can make their analysis impossible. To address this problem, simulation techniques can be applied. They accept models at almost any level of detail. The drawback, however, is that it is nearly impossible to guarantee that all feasible executions can be simulated. The pertinence of the results is therefore linked to the scenarios and the simulation duration: we can only ensure that a system is correct for a given set of scenarios, which does not imply that the system will remain correct for any scenario. In fact, in the automotive industry, simulation techniques are much more widely used than formal analysis. An exception can be found in the verification of properties to be respected by frames sharing a network: a well-known formal approach, usually named timing analysis, is available for this purpose. Finally, note that some tools are of general interest for the design and validation of electronic embedded systems, for example, Matlab/Simulink or Stateflow [26], Ascet [27], Statemate [28], and SCADE [29]. In some cases, an interface encapsulates these tools in order to adapt them to the automotive context. Moreover, these techniques, which work on virtual platforms, are completed by test techniques in order to ensure that a realization is correct: tests of software components, of logical architectures, and of the implemented embedded system. Note that test activities, like simulation activities, consist of providing a scenario of events and/or data that stimulate the system under test or an executable model of the system; in both techniques, we then observe which events and/or data are produced by the system. The input scenario can be built manually or generated formally; in the latter case, the test or simulation activity is closely linked to a formal analysis technique [30]. At last, one of the main targets of validation and verification activities is the dependability of the electronic embedded systems. As seen in the first section, some of these systems are said to be safety-critical. This fact is accentuated in the chassis domain by the emergence of X-by-Wire applications. In this case, a high dependability level is required: the system has to exhibit a failure rate lower than 10⁻⁹ failures per hour (this means that the system has to work 115,000 years without a failure). For now, this is a challenge, because it is impossible to ensure this property only through the intrinsic reliability of the electronic devices. Moreover, as the application may be sensitive to electromagnetic perturbations, its behavior cannot be entirely predictable. Therefore, the required safety properties can be reached by introducing fault-tolerant strategies.
41.3.2 Validation by Performance Evaluation

The validation of a distributed embedded system requires, at least, proving that all the timing properties are respected. These properties are generally expressed as timing constraints applied to the occurrences of specific events, for example, a bounded jitter on a frame emission, a deadline on a task, or a bounded end-to-end response time between two events. The first way of doing this is analytical, but this means one should be able to establish a model that captures the temporal behavior of the system and that can be mathematically analyzed. Considering the complexity of an actual electronic embedded system, such a model has to be strongly simplified and generally provides only oversized solutions. For instance, the holistic scheduling approach introduced by Tindell and Clark [31] allows just the evaluation of the worst-case end-to-end response time for the periodic activities of a distributed embedded application. Using this holistic scheduling approach, Song et al. [32] studied the end-to-end task response times for an architecture composed of several ECUs interconnected by CAN. Faced with the complexity of this mathematical approach, the simulation of a distributed application is therefore a complementary technique. It allows taking into account a more detailed model as well as the unavoidable perturbations that may affect the foreseen behavior. For example, a simulation-based analysis [33] of the system presented in [32] gave more realistic performance measures than those obtained analytically.
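As background, the node-level building block of the holistic analysis of [31] is the classical fixed-priority response-time recurrence Ri = Ci + Σ(j in hp(i)) ⌈Ri/Tj⌉ · Cj, iterated to a fixed point. The following is a minimal sketch of the preemptive case, with jitter and blocking terms omitted (in the case study below, the tasks are nonpreemptive, so the actual analysis adds a blocking term):

    #include <math.h>

    /* Worst-case response time of task i; tasks 0..i-1 are assumed to have
     * higher priority. C[] holds WCETs, T[] holds periods. */
    double response_time(int i, const double C[], const double T[])
    {
        double r = C[i], prev;
        do {
            prev = r;
            r = C[i];
            for (int j = 0; j < i; j++)
                r += ceil(prev / T[j]) * C[j];
        } while (r != prev);   /* diverges if the load is too high; bound the iterations in practice */
        return r;
    }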
FIGURE 41.2 Hardware architecture.
An outline of these two approaches is given in Sections 41.3.2.2 and 41.3.2.3 by means of a common case study, presented in Section 41.3.2.1; then, in Section 41.3.2.4, the respective results obtained are compared. Finally, we show how a formal architecture description language, as introduced in Section 41.2, is a strong factor for promoting validation and verification on virtual platforms in the automotive industry.

41.3.2.1 Case Study

Figure 41.2 shows the electronic embedded system [34] used in the two following sections as a basis for both the mathematical and the simulation approaches. This system is in fact derived from an actual one presently embedded in a vehicle manufactured by PSA Peugeot-Citroën Automobile Co. [35]. It includes functions related to the power train, chassis, and body domains.

41.3.2.1.1 Hardware Architecture Level (Figure 41.2)

• We consider nine nodes (ECUs) that are interconnected by means of one CAN and one VAN network. The naming of these nodes recalls the global function that they support: Engine controller, AGB (Automatic Gear Box), ABS/VDC (Anti-lock Brake System/Vehicle Dynamic Control), WAS/DHC (Wheel Angle Sensor/Dynamic Headlamp Corrector), and Suspension controller refer to nodes connected to CAN, while X, Y, and Z (named so for confidentiality reasons) refer to nodes connected to VAN. Finally, the ISU (Intelligent Service Unit) node ensures the gateway function between CAN and VAN.
• The communication is supported by two networks: CAN 2.0A (bit rate equal to 250 Kbps) and VAN (time slot rate fixed at 62.5 kTS/sec).
• The different ECUs are connected to these networks by means of network controllers. This case study considers the Intel 82527 CAN network controller (14 transmission buffers), the Philips PCC1008T VAN network controller (one transmission buffer and one First In First Out [FIFO] reception queue with two places), and the MHS 29C461 VAN network controller (handling up to 14 messages in parallel).

41.3.2.1.2 Technical Level

The operating system OSEK [11] runs on each ECU. The scheduling policy is fixed priority. Each OS task is a basic task in the OSEK sense. In the actual embedded system, preemption is not permitted for tasks. In the study presented here, the analytical method is applied strictly to this system, while simulations are run for different configurations, two of which accept preemptible tasks.

41.3.2.1.3 Operational Level

The entities considered at this level are tasks and messages (frames); they are summarized in Figure 41.3, Figure 41.4, Figure 41.5, and Figure 41.7. The mapping of the logical architecture (not presented here) onto the technical/hardware ones produces 44 OSEK OS tasks (in short, tasks, in the following) and 19 messages exchanged between these tasks. Furthermore, we assume that an operating
FIGURE 41.3 Operating system tasks on nodes connected to CAN.

ECU: Engine_Ctrl
  Task        Pi   Ti    Input   Output   Di
  T_Engine1   1    10    —       M1       10
  T_Engine2   4    20    —       M3       20
  T_Engine3   7    20    —       M10      100
  T_Engine4   3    —     M4      —        15
  T_Engine5   2    —     M2      —        14
  T_Engine6   6    —     M8      —        50
  T_Engine7   5    —     M6      —        40

ECU: AGB
  Task     Pi   Ti   Input   Output   Di
  T_AGB1   2    15   —       M4       15
  T_AGB2   3    50   —       M11      50
  T_AGB3   4    —    M8      —        50
  T_AGB4   1    —    M2      —        14

ECU: Suspension
  Task     Pi   Ti   Input   Output   Di
  T_SUS1   4    20   —       M9       20
  T_SUS2   5    —    M5      —        20
  T_SUS3   1    —    M1      —        10
  T_SUS4   2    —    M2      —        14
  T_SUS5   3    —    M7      —        15

ECU: ABS/VDC
  Task     Pi   Ti    Input   Output   Di
  T_ABS1   2    20    —       M5       20
  T_ABS2   5    40    —       M6       40
  T_ABS3   1    15    —       M7       15
  T_ABS4   6    100   —       M12      100
  T_ABS5   3    —     M3      —        20
  T_ABS6   4    —     M9      —        20

ECU: WAS/DHC
  Task     Pi   Ti   Input   Ci   Output   Di
  T_WAS1   1    14   —       2    M2       14
  T_WAS2   2    —    M9      2    —        20
FIGURE 41.4 Operating system tasks on nodes connected to VAN.

ECU: X
  Task   Pi   Ti    Input   Output   Di
  T_X1   2    150   —       M16      150
  T_X2   4    200   —       M17      200
  T_X3   1    —     M15     —        50
  T_X4   3    —     M19     —        150

ECU: Y
  Task   Pi   Ti   Input   Output   Di
  T_Y1   2    50   —       M15      50
  T_Y2   3    —    M13     —        50
  T_Y3   1    —    M14     —        10
  T_Y4   4    —    M18     —        100
  T_Y5   5    —    M16     —        150

ECU: Z
  Task   Pi   Ti    Input   Output   Di
  T_Z1   2    100   —       M18      100
  T_Z2   3    150   —       M19      150
  T_Z3   4    —     M17     —        200
  T_Z4   1    —     M15     —        50
FIGURE 41.5 Operating system tasks distributed on the gateway ECU.

ECU: ISU
  Task     Pi   Ti   Input   Output   Di
  T_ISU1   4    50   —       M8       50
  T_ISU2   5    —    M11     M13      50
  T_ISU3   1    —    M1      M14      10
  T_ISU4   6    —    M10     —        100
  T_ISU5   3    —    M6      —        40
  T_ISU6   2    —    M9      —        20
  T_ISU7   7    —    M12     —        100
In the case study, two kinds of tasks can be identified according to their activation law:
• Tasks activated by the occurrence of the event "reception of a message" (event-triggered tasks), as, for example, T_Engine6 and T_ISU2.
• Tasks that are activated periodically (time-triggered tasks), as T_AGB2.
Each task is characterized by its name and, on the ECU named k onto which it is mapped, by (see Figure 41.3, Figure 41.4, and Figure 41.5):
• T_i^k, its activation period in ms (for time-triggered tasks) or the name M_n of the message whose reception activates it (for event-triggered tasks)
• C_i^k, its WCET (Worst-Case Execution Time) on this ECU (disregarding possible preemption); in the case study, we assume that this WCET is equal to 2 ms for each task
• D_i^k, its relative deadline in ms
• P_i^k, its priority
• M_i, the message it possibly produces (we assume, in this case study, that at most one message is produced by one task; note that the method can be applied even if a task produces more than one message)
FIGURE 41.6 Task response time: the interval, along the time axis t, between task activation and task completion, possibly including preemption (for preemptive tasks only).
(a) Messages transmitted on CAN
Message   Producer task (Task_i)   DLC_i (bytes)   Inherited period T_i (ms)
M1        T_Engine1                8               10
M2        T_WAS1                   3               14
M3        T_Engine2                3               20
M4        T_AGB1                   2               15
M5        T_ABS1                   5               20
M6        T_ABS2                   5               40
M7        T_ABS3                   4               15
M8        T_ISU1                   5               50
M9        T_SUS1                   4               20
M10       T_Engine3                7               100
M11       T_AGB2                   5               50
M12       T_ABS4                   1               100

(b) Messages transmitted on VAN
Message   Producer task (Task_i)   DLC_i (bytes)   Inherited period T_i (ms)
M13       T_ISU2                   8               50
M14       T_ISU3                   10              10
M15       T_Y1                     16              50
M16       T_X1                     4               150
M17       T_X2                     4               200
M18       T_Z1                     20              100
M19       T_Z2                     2               150

FIGURE 41.7 Messages exchanged over networks.
For notational convenience, we assume that, on each ECU named k, priority P_i^k is higher than priority P_{i+1}^k. In the following, a task is simply denoted τ_i if its priority is P_i^k on the ECU named k. The task response time is classically defined as the time interval between the activation of a given task and the end of its execution (Figure 41.6). We denote R_i^j the task response time of the instance j of a task τ_i.

Each message (frame) is characterized by its name and by (Figure 41.7):
• DLC_i, its size, in bytes
• C_i, its transmission duration; this duration is computed using the formulae given in (41.1) and (41.2) (see Sections 41.1.3.1.1 and 41.1.3.1.2)
• Task_i, the name of the task that produces it
• T_i, its inherited period (for time-triggered tasks), assumed in [31] and [32] to be equal to the activation period of its producer task
• P_i, its priority
A message will also be denoted by τ_i if its priority is P_i. The message response time is the time interval between the production of a specific message and its reception by a consumer task (Figure 41.8). We denote R_i^j the message response time of the instance j of a message τ_i.

Finally, in this system, from Figure 41.3, Figure 41.4, Figure 41.5, and Figure 41.7, we identify some "logical chains," that is, causal sequences of tasks and messages. In the case study, the most complex logical chains that can be identified are:

lc1: T_Engine1 - M1 - T_ISU3 - M14 - T_Y3
lc2: T_AGB2 - M11 - T_ISU2 - M13 - T_Y2
FIGURE 41.8 Message response time: the interval, along the time axis t, between message production (the completion of the producer task) and the end of message transmission (the activation of the consumer task).
FIGURE 41.9 Example of logical chain: for lc2, the sequence T_AGB2 - M11 - T_ISU2 - M13 - T_Y2. The completion of T_AGB2 coincides with the production of M11 (which activates T_ISU2), and the completion of T_ISU2 with the production of M13 (which activates T_Y2); the logical chain response time runs from the activation of T_AGB2 to the completion of T_Y2.
Here, the task T_Y3 (respectively T_Y2), running on a VAN-connected node, depends on the message M14 (respectively M13) carried by the VAN bus; M14 (respectively M13) is produced by the task T_ISU3 (respectively T_ISU2) running on the ISU node; T_ISU3 (respectively T_ISU2) is itself activated by the message M1 (respectively M11), which is produced by T_Engine1 (respectively T_AGB2) running on a CAN-connected node.

The logical chain response time, more generally named End-to-End Response Time, is defined for lc1 (respectively lc2) as the time interval between the activation of T_Engine1 (respectively T_AGB2) and the completion of T_Y3 (respectively T_Y2) (Figure 41.9). We denote R_lci^j the logical chain response time of the instance j of the logical chain lci.

41.3.2.1.4 Performance Properties
As presented in Figure 41.3, Figure 41.4, and Figure 41.5, relative deadline constraints are imposed on each task of this application. Furthermore, some other performance properties were required for the given application. Among these properties, we focus on two specific ones:
1. Property A: no message transmitted on CAN or VAN is lost. This means that no message can be overwritten in the network controller buffers or, more formally, that each message is transmitted within its inherited period T_i, considered as the worst case.
2. Property B: this property is expressed on the two logical chains lc1 and lc2 presented above. The logical chain response time for lc1 (respectively lc2) must be as regular as possible for each instance of
lc1 (respectively lc2). More formally, if R1 is the set of logical chain response times obtained for each instance j of lc1 (respectively lc2), the property requires the deviation |R_lc1^j − E[R1]| to remain small for every instance j. This kind of property is commonly required in embedded automatic control applications, where the command elaborated through a logical chain has to be applied to an actuator as regularly as possible.

An embedded system is correct if, at least, it meets the above-mentioned properties. However, the task scheduling policy on each node and the MAC protocols of VAN and CAN unavoidably lead to jitters on task terminations. So, a mathematical approach as well as a simulation approach were applied in order to prove that the proposed operational architecture meets all its constraints. Thanks to a mathematical approach, related to the general techniques named timing analysis, we find, for each entity (task or message) and for each logical chain, lower and upper bounds on their respective response times. These values represent the best and worst cases. In order to handle more detailed and more realistic models, we use a simulation method, which furnishes the minimum, maximum, and mean values of the same response times. Furthermore, several simulations, with different parameter configurations, were performed in order to obtain an architecture meeting the constraints. In fact, we use the mathematical approach for validating the results obtained by simulation.

41.3.2.2 Simulation Approach
We model four different configurations of the presented operational architecture according to the formalism supported by the SES Workbench tool. For each configuration, we use this tool to run a simulation and obtain the corresponding results. Furthermore, as we want to analyze specific response times, we introduce adequate probes into the model. Thanks to this, the log file obtained throughout the simulation process can easily be analyzed by applying an elementary filter that presents the results in a readable way. Three kinds of parameters are considered and can differ from one configuration to another:
• The network controllers, specifically the VAN ones
• Whether tasks can be preempted or not
• The task priorities
Rather than describing a simulation campaign that would exhaustively cover every possible combination of these parameters, we prefer to present it by following an intuitive line of reasoning that starts from a given configuration (configuration 1) and, by modifying one kind of parameter at a time, leads successively to better configurations (configuration 2, then configuration 3), finally reaching a correct configuration that verifies the required properties A and B.

41.3.2.2.1 Configuration 1
As a first simulation attempt:
• As given in the description of the actual embedded system (see Section 41.3.2.1), all the tasks are considered as OSEK basic tasks and are characterized by their local priority. Moreover, their execution is done without preemption.
• We assign the Intel 82527 controller to each node connected to the CAN bus and the Philips PCC1008T controller to the nodes connected to the VAN network. Note that the ISU ECU integrates both network controllers.
In this case, a probe is introduced in the model; it observes the occurrences of message production and message transmission, and detects when a given instance of a message is stored in the buffer of a network controller before the previously produced instance has been transmitted through the network.
For each of these detected events, it writes a specific string in the log file. The filter then consists of extracting from the log file the lines containing this string. A screenshot is given in Figure 41.10, where it can be seen that some messages are overwritten in the single transmission buffer of the VAN controller chosen for this configuration. So, we conclude that property A is not verified.
FIGURE 41.10 Log file filtered for verification of property A.
Logical chain response times (in ms)

                           Simulation results                          Analytic results
Configuration    Chain   minimum   mean    maximum   std. dev.      minimum   maximum
Configuration 2  lc1     9.09      11.03   16.67     1.775          8.992     22.116
                 lc2     9.82      12.82   16.67     1.458          8.576     41.172
Configuration 3  lc1     9.09      9.45    12.62     0.667          8.992     16.116
                 lc2     12.45     14.16   20        2.101          8.576     35.172
Configuration 4  lc1     9.09      9.45    12.62     0.667          8.992     16.116
                 lc2     12.45     12.61   14.11     0.490          8.576     27.172

FIGURE 41.11 Response time evaluation.
41.3.2.2.2 Configuration 2
One possible cause of the nonverification of property A by the previous configuration is that the VAN controller PCC1008T, providing only a single transmission buffer, is not suitable for the required performance property. So, we assign the full VAN controller MHS 29C461 to all nodes transmitting messages on the VAN bus (the ISU computer, X, Y, and Z). We modify the SES Workbench model and relaunch the simulation. This time, the probes and filters proposed for configuration 1 provide an empty list. So we can conclude that messages are correctly transmitted and that property A is verified. Furthermore, SES Workbench gives additional results such as the network load; for this configuration, the load of the CAN bus is less than 21.5% and that of the VAN bus less than 41%.

On the same configuration, we study property B. For this purpose, probes are introduced for observing the occurrences of the first task activation (T_Engine1 or T_AGB2) and the occurrences of the last task completion (T_Y3 or T_Y2). A filter is developed for the evaluation of the minimum, mean, and maximum logical chain response times of lc1 and lc2 as well as their standard deviation. The obtained results are given in Figure 41.11. Under this configuration, none of the chains meets the required property.

41.3.2.2.3 Configuration 3
In the two previous configurations, preemption is not allowed for any task. We change this characteristic and allow preemption; as T_Engine1 and T_AGB2 have the highest local priority and considering that they are basic tasks, they will never wait for the processor. Once more, we model the operational architecture by modifying the scheduling policy on the nodes Engine_Ctrl and AGB without changing the other parameters. The same probes and filters are used; the results obtained by simulation of configuration 3 are shown in Figure 41.11. We can conclude that property B is verified only for the logical chain lc1. So, this configuration does not correspond to a correct operational architecture.

41.3.2.2.4 Configuration 4
Further log file analysis points out the problem: the priority of T_ISU2 is probably too low. After modifying the priority of this task (2 in place of 5), using the same probes and filters and simulating the new model, we obtain the results presented in Figure 41.11. Property B is verified for both lc1 and lc2.
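The kind of post-processing filter used throughout this campaign can be sketched in a few lines of Python. This is a minimal illustration only: the log format matched here (one line per observed chain instance, carrying the measured response time) is hypothetical, since the actual SES Workbench trace format is not described in this chapter.

import re
import statistics

def chain_statistics(log_path, chain):
    # Extract the response times logged for one chain (e.g., "lc2") and
    # return the minimum, mean, maximum, and standard deviation, i.e.,
    # the four simulation columns of Figure 41.11.
    pattern = re.compile(chain + r" response_time=([0-9.]+)")  # hypothetical log format
    with open(log_path) as log:
        times = [float(m.group(1)) for line in log for m in pattern.finditer(line)]
    return min(times), statistics.mean(times), max(times), statistics.stdev(times)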
41.3.2.3 Deterministic Timing Analysis
In order to validate these results, we apply the analytic formulas of [32] to the case study. The main purpose of this analysis is to obtain the lower (best-case) and upper (worst-case) bounds on the response times. It is worth noting that, in practice, neither the best case nor the worst case is necessarily reached, but they provide deterministic bounds. As a time-triggered design approach is adopted, both tasks and messages are expected to be "periodic," although in practice jitters exist on events whose occurrences are supposed to be periodic. In the following, a nonpreemptive task or a message τ_i whose priority is P_i can be indifferently characterized by (C_i, T_i) as defined previously. As introduced earlier, we are interested in evaluating:
• The response time R_i of such a task or message of priority P_i
• The logical chain response times of lc1 and lc2, obtained by summing these individual response times

41.3.2.3.1 Best-Case Evaluation
The best case corresponds to the situation where a task τ_i (respectively, a message τ_i) whose priority is P_i is executed (respectively, transmitted) without any waiting time. In this case,

R_i = C_i   (41.3)

The best case of the logical chain response time is the sum of the best-case response times of all entities (tasks and messages) involved in the chain. Applying this to the two logical chains, we obtain (see Figure 41.11):

R^{best}_{lcx} = \sum_{y \in lcx} C_y   (41.4)
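As a quick sanity check of equation (41.4), consider lc1. The task WCETs are the 2 ms assumed in Section 41.3.2.1, and the analytic minimum of 8.992 ms is the one reported in Figure 41.11; the combined duration of the two messages is deduced here from these figures rather than taken from formulae (41.1) and (41.2), which are not reproduced in this section:

R^{best}_{lc1} = C_{T\_Engine1} + C_{M1} + C_{T\_ISU3} + C_{M14} + C_{T\_Y3}
             = 3 \times 2\,\mathrm{ms} + (C_{M1} + C_{M14}) = 8.992\,\mathrm{ms}
\;\Rightarrow\; C_{M1} + C_{M14} = 2.992\,\mathrm{ms}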
41.3.2.3.2 Worst-Case Evaluation
We distinguish the evaluation of the worst case for task and for message response times.

Messages. For a message τ_i of priority P_i, the worst-case response time can be calculated as:

R_i = C_i + I_i   (41.5)

where I_i is the interference period during which the transmission medium is occupied by other higher-priority messages and by one lower-priority message (because of nonpreemption). Note that the message response time is defined here in a way different from that specified by Tindell and Burns in [36]; therefore, the jitter J_i is not included in formula (41.5). The interference period I_i can be calculated by the following recurrence relation:

I_i^{n+1} = \max_{i+1 \le j \le N}(C_j) + \sum_{j=1}^{i-1} \left( \left\lfloor \frac{I_i^n + J_j}{T_j} \right\rfloor + 1 \right) C_j   (41.6)

where N is the number of messages and \max_{i+1 \le j \le N}(C_j) is the blocking factor due to nonpreemption. A suitable initial value is I_i^0 = 0. Equation (41.6) converges to a value as long as the transmission medium's utilization is less than or equal to 100%. We also notice that the jitters have to be taken into account in the calculation of the worst-case interference period, as the higher-priority messages are considered periodic with jitters.

Tasks. For a task τ_i whose priority is P_i, the same arguments lead to formulae similar to those obtained for messages. However, we must distinguish two cases for the task response time evaluation. For nonpreemptive fixed priority scheduling, equations (41.5) and (41.6) are directly applicable while, if the basic tasks
are scheduled according to a preemptive fixed priority policy, the blocking factor \max_{i+1 \le j \le N}(C_j) does not have to be considered (preemption ensures that a task at a given priority level can no longer be blocked by a task at a lower level). Therefore, the following recurrence relation allows the response time of a basic preemptive task to be calculated:

R_i^{n+1} = C_i + \sum_{j=1}^{i-1} \left\lceil \frac{J_j + R_i^n}{T_j} \right\rceil C_j   (41.7)

Again, equation (41.7) converges to a value as long as the processor utilization is less than or equal to 100%, and a suitable initial value is R_i^0 = 0.

Logical chains. Finally, we can apply equation (41.5) for the nonpreemptive case (respectively, equation [41.7] for the preemptive case) to calculate the worst-case response time of the two logical chains:

R^{worst}_{lcx} = \sum_{y \in lcx} R_y   (41.8)
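The recurrences (41.6) and (41.7) are classical fixed-point iterations and are straightforward to implement. The following sketch is an illustration of the computation, not the tool used by the authors; the values of C, T, and J at the end are placeholders, since the actual transmission durations come from formulae (41.1) and (41.2), which are not reproduced here.

import math

def interference(i, C, T, J):
    # Worst-case interference I_i of equation (41.6); entities are indexed
    # from 0 in decreasing priority order (index 0 = highest priority).
    blocking = max((C[j] for j in range(i + 1, len(C))), default=0.0)
    I = 0.0
    while True:  # fixed-point iteration; diverges if utilization exceeds 100%
        nxt = blocking + sum((math.floor((I + J[j]) / T[j]) + 1) * C[j]
                             for j in range(i))
        if nxt == I:
            return I
        I = nxt

def preemptive_response_time(i, C, T, J):
    # Worst-case response time R_i of a basic preemptive task, equation (41.7):
    # no blocking factor, and the ceiling counts the higher-priority releases.
    R = 0.0
    while True:
        nxt = C[i] + sum(math.ceil((J[j] + R) / T[j]) * C[j] for j in range(i))
        if nxt == R:
            return R
        R = nxt

# Hypothetical example: three messages in decreasing priority order
# (durations and periods in ms are placeholders).
C, T, J = [0.5, 0.6, 0.5], [10.0, 14.0, 20.0], [0.0, 0.0, 0.0]
R2 = C[2] + interference(2, C, T, J)  # equation (41.5); here 1.6 ms
# A chain bound, equation (41.8), is then the sum of the R_y over the chain.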
Figure 41.11 presents the bounds (minimum and maximum response times) obtained through this mathematical timing analysis for both logical chains, according to equations (41.4) and (41.8). Note that the maximum response time in Figure 41.11 corresponds to the nonpreemptive case for configuration 2, while the other two configurations are based on the preemptive assumption.

41.3.2.4 Comments on Results
First, we notice that the simulation results remain within the bounds given by the analytical method of Section 41.3.2.3. However, it can be seen that the analytic bounds for the worst case are never reached during simulation. Maximum values obtained by simulation vary from 40 to 70% of the analytically calculated worst cases, while mean values vary from 30 to 60%. The importance of simulation for obtaining more realistic results becomes obvious when evaluating the performances of an embedded system. From these tables we can also see that, compared with nonpreemptive scheduling, preemptive scheduling logically results in shorter response times for high-priority tasks and longer response times for low-priority tasks. Note, however, that this fact seems to be in contrast with the analytic method results, where the worst-case bound gets better for preemptive policies than for nonpreemptive ones, irrespective of task priority! This is perfectly normal, since results from the two methods must not be interpreted in the same way: analytic results can be used as bounds to validate simulation results, but they have different meanings and are rather complementary.

41.3.2.5 Automatic Generation of Models for Simulation Purpose
Usually, the direct use of a general-purpose simulation platform is not judged suitable by in-vehicle embedded system designers, since too much effort must be spent building the simulation model. Thanks to a nonambiguous description of embedded systems, as seen in Section 41.2, it is possible to automatically generate a model that can be run on a specific discrete-event simulation tool. For example, in [34], a modeling methodology based on a component approach is proposed, developed in collaboration with the French carmaker PSA Peugeot-Citroën. This methodology has been implemented through the development of a simulation tool called Carosse-Perf, based on the SES Workbench simulation platform [37]. It is composed, on the one hand, of a library of prebuilt components modeled in the SES Workbench formalism and, on the other hand, of a constructor that uses these models and the description of the embedded distributed architecture in order to obtain the whole model that will be simulated. The constructor extracts the pertinent information from the static description of the system at the logical architecture level (tasks, data exchanged between tasks, behavior), from the technical and hardware architectures (policies for access to resources, that is, scheduling policies and network protocols, as well as the performances of the hardware components) and, finally, from the description of how the logical architecture is mapped onto the technical one. Technical and hardware architecture components are modeled once and for all in the SES Workbench formalism. The principle of the model building is presented in Figure 41.12(a).
FIGURE 41.12 Simulator generation and simulation process. (a) Model construction: the hardware architecture, logical architecture, and constraints descriptions are combined, through an extraction step, with a library of predefined hardware component models (expressed in the SES Workbench language) and compiled into a runnable simulation program. (b) Simulation: the runnable program is executed against an environment scenario description (through the ESI and LAI modules); the resulting trace is analyzed to produce the results.
Since, at the simulation step, the behavior of the logical architecture entities (tasks and messages) and the environment signal occurrences animate the simulation, the constructor has to include two generic modules in the model that will be executed by the simulator: a logical architecture interpreter and an environment scenario interpreter (named, respectively, LAI and ESI in Figure 41.12), whose role is to extract, during the simulation, the current event, coming from the logical architecture entities or from the environment signals, that is to be managed by the discrete event simulator. This kind of tool allows designers to easily build a simulation model of their new in-vehicle embedded system (operational architecture) and then to simulate it. More details about the underlying principles can be found in [34]. Carosse-Perf was used to automatically construct the models corresponding to the four configurations of the case study and to simulate them.
41.4 Conclusions and Future Trends
Embedded electronics, and especially embedded software, account for an ever-increasing share of a car, in terms of both functionality and cost. Due to the cost, real-time, and dependability constraints of the automotive industry, many automotive-specific networks (e.g., CAN, LIN, FlexRay) and operating systems (e.g., OSEK/VDX) have been or are still being developed, most of them within the SAE standardization process. Today's in-vehicle embedded system is a complex distributed system, mainly composed of four different domains: power train, chassis, body, and telematics. Functions of the different domains are subject to quite different constraints. SAE has classified automotive applications into classes A, B, and C, in increasing order of criticality with respect to real-time and dependability constraints. For the design and validation of such a complex system, an integrated design methodology as well as validation tools are therefore necessary.

After introducing the specific automotive application requirements in terms of time-to-market, design cost, variant handling, real-time behavior and dependability, and multipartner involvement (carmakers and suppliers) during the development phases, in this chapter we have described the approach proposed by
EAST-ADL, which is a promising design and development framework tailored to fit the specific needs of embedded automotive applications. Concerning the validation that an implementation of the designed embedded system meets the application constraints, we have reviewed the possible approaches and illustrated the use of simulation for validating real-time performance. This illustration was done through a case study drawn from a PSA Peugeot-Citroën application. The obtained results have shown that the simulation approach, combined with the timing analysis method (especially the holistic scheduling method), permits efficient validation of the designed embedded architecture.

While the power train and body domains are beginning to reach maturity, the chassis domain, and especially X-by-Wire systems, is still in its early development phase. The finalization of the new FlexRay protocol as well as the development of the 42 V power supply will certainly push forward X-by-Wire system development. The main challenge for X-by-Wire systems is to prove that their dependability is at least as high as that of traditional mechanical/hydraulic systems.

Portability of embedded software is another main preoccupation of automotive embedded application developers, and constitutes a further challenge. For this purpose, carmakers and suppliers established the AUTOSAR consortium (http://www.autosar.org/) to propose an open standard for automotive embedded electronic architecture. It will serve as a basic infrastructure for the management of functions within both future applications and standard software modules. The goals include the standardization of basic system functions and functional interfaces, the ability to integrate and transfer functions, and to substantially improve software updates and upgrades over the vehicle lifetime.
41.5 Appendix: In-Vehicle Electronic System Development Projects
System Engineering of Time-Triggered Architectures (SETTA). This project (January 2000 to December 2001) was partly funded by the European Commission under the Information Society Technologies program. The overall goal of the SETTA project was to push time-triggered architecture, an innovative European-funded technology for safety-critical, distributed, real-time applications such as fly-by-wire or drive-by-wire, into future vehicles, aircraft, and train systems. The consortium was led by DaimlerChrysler AG. DaimlerChrysler and the partners Alcatel (A), EADS (D), Renault (F), and Siemens VDO Automotive (D) acted as the application providers and technology validators. The technology providers were Decomsys (A) and TTTech (A). The academic research component was provided by the University of York (GB) and the Vienna University of Technology (A). http://www.setta.org/.

Embedded Electronic Architecture (EAST-EEA), ITEA Project No. 00009. The major goal of EAST-EEA (July 2001 to June 2004) was to enable proper electronic integration through the definition of an open architecture, making it possible to achieve hardware and software interoperability and reuse for mostly distributed hardware. The partners were AUDI AG (D), BMW AG (D), DaimlerChrysler AG (D), Centro Ricerche Fiat (I), Opel Powertrain GmbH (D), PSA Peugeot Citroën (F), Renault (F), Volvo Technology AB (S), Finmek Magneti Marelli Sistemi Elettronici (I), Robert Bosch GmbH (D), Siemens VDO Automotive AG (D), Siemens VDO Automotive SAS (F), Valeo (F), ZF Friedrichshafen AG (D), ETAS GmbH (D), Siemens SBS - C-LAB (D), VECTOR Informatik (D), CEA-LIST (F), IRCCyN (F), INRIA (F), Linköping University of Technology (S), LORIA (F), Mälardalen University (S), Paderborn University - C-LAB (D), Royal Institute of Technology (S), Technical University of Darmstadt (D). www.east-eea.net/docs.

AEE Project (Embedded Electronic Architecture). This project (November 1999 to December 2001) was granted by the French Ministry for Industry. It involved French carmakers (PSA, RENAULT), OEM suppliers (SAGEM, SIEMENS, VALEO), the EADS LV company, and research centers (INRIA, IRCCyN, LORIA). It aimed to specify new solutions for in-vehicle embedded system development. The Architecture Implementation Language (AIL_Transport) was defined to specify and describe precisely any vehicle electronic architecture. http://aee.inria.fr/en/index.html.
Electronic Architecture and System Engineering for Integrated Safety Systems (EASIS). The goal of the EASIS project (January 2004 to December 2006) is to define and develop: a platform for software-based functionality of vehicle electronic systems, providing common services upon which future applications can be built; a vehicle on-board electronic hardware infrastructure which supports the requirements of integrated safety systems in a cost-effective manner; a set of methods and techniques for handling the critical dependability-related parts of the development lifecycle; and an engineering process enabling the application of integrated safety systems. This project is funded by the European Community (6th FWP). Partners are Kuratorium Offis E.V. (G), DAF Trucks N.V. (N), Centro Ricerche Fiat, Societa Consortile per Azioni (I), Universitaet Duisburg-Essen, Standort Essen (G), Dspace GMBH (G), Valéo Electronique et Systèmes de Liaison (F), Motorola GMBH (G), Peugeot-Citroën Automobiles SA (F), Mira Limited (UK), Philips GMBH Forschungslaboratorien (G), ZF Friedrichshafen AG (G), Adam Opel Aktiengesellschaft (G), ETAS (G), Volvo Technology AB (S), Lear Automotive (S), S.L. (S), Vector Informatik GMBH (G), Continental Teves AG & CO. OHG (G), Decomsys GMBH (A), Regienov (F), Robert Bosch GMBH (G).

Automotive Open System Architecture (AUTOSAR). The objective of the partnership involved in AUTOSAR (May 2003 to August 2006) is the establishment of an open standard for automotive E/E architecture. It will serve as a basic infrastructure for the management of functions within both future applications and standard software modules. The goals include the standardization of basic system functions and functional interfaces, the ability to integrate and transfer functions, and to substantially improve software updates and upgrades over the vehicle lifetime. The AUTOSAR scope includes all vehicle domains. A three-tier structure, proven in similar initiatives, is implemented for the development partnership, with appropriate rights and duties allocated to the various tiers: Premium Members, Associate Members, Development Members, and Attendees. http://www.autosar.org/.
References
[1] Society of Automotive Engineers, www.sae.org.
[2] G. Leen and D. Heffernan, Expanding automotive electronic systems, Computer, 35, 88–93, 2002.
[3] A. Sangiovanni-Vincentelli, Automotive electronics: trends and challenges, Convergence 2000, Detroit, MI, October 2000.
[4] F. Simonot-Lion, In-car embedded electronic architectures: how to ensure their safety, in Proceedings of the 4th IFAC Conference on Fieldbus Systems and their Applications, FET03, Aveiro, Portugal, July 2003, pp. 1–8.
[5] ISO, Road Vehicles — Interchange of Digital Information — Controller Area Network for High-Speed Communication, ISO 11898, International Organization for Standardization (ISO), 1994.
[6] ISO, Road Vehicles — Low-Speed Serial Data Communication — Part 2: Low-Speed Controller Area Network, ISO 11519-2, International Organization for Standardization (ISO), 1994.
[7] ISO, Road Vehicles — Low-Speed Serial Data Communication — Part 3: Vehicle Area Network, ISO 11519-3, International Organization for Standardization (ISO), 1994.
[8] B. Abou and J. Malville, Le bus VAN Vehicle Area Network: Fondements du protocole, Dunod, Paris, 1997.
[9] SAE, Class B Data Communications Network Interface, J1850, Society of Automotive Engineers (SAE), May 2001.
[10] TTTech, Specification of the TTP/C Protocol, Version 0.5, TTTech Computertechnik GmbH, July 1999.
[11] OSEK, OSEK/VDX Operating System, Version 2.2, 2001. http://www.osek-vdx.org.
[12] J.B. Goodenough and L. Sha, The priority ceiling protocol: a method for minimizing the blocking of high priority tasks, in Proceedings of the 2nd International Workshop on Real-Time Ada Issues, Ada Letters 8, 1988, pp. 20–31.
[13] Modistarc Project, http://www.osek-vdx.org/whats_modistarc.htm.
[14] http://www.arcticus.se/.
[15] U. Freund, M. von der Beeck, P. Braun, and M. Rappl, Architecture centric modeling of automotive control software, SAE Technical Paper Series 2003-01-0856.
[16] DECOS Project, http://www.decos.at/.
[17] A. Rajnak, K. Tindell, and L. Casparsson, Volcano Communications Concept, Volcano Communications Technologies AB, Gothenburg, Sweden, 1998. Available at http://www.vct.se.
[18] P. Giusto, J.-Y. Brunel, A. Ferrari, E. Fourgeau, L. Lavagno, and A. Sangiovanni-Vincentelli, Automotive virtual integration platforms: why's, what's, and how's, in Proceedings of the 2002 IEEE International Conference on Computer Design: VLSI in Computers and Processors (ICCD'02), Freiburg, Germany, 16–18 September, 2002, pp. 370–378.
[19] N. Medvidovic and R.N. Taylor, A framework for classifying and comparing architecture description languages, Technical report, Department of Information and Computer Science, University of California, Irvine, 1997.
[20] S. Vestal, MetaH User's Manual, Honeywell Technology Center, Carnegie-Mellon, 1995. http://www.htc.honeywell.com/metah/uguide.pdf.
[21] AEE, Architecture Electronique Embarquée, 1999, http://aee.inria.fr.
[22] J.-P. Elloy and F. Simonot-Lion, An architecture description language for in-vehicle embedded system development, in Proceedings of the 15th IFAC World Congress, IFAC B'02, Barcelona, Spain, 21–26 July, 2002.
[23] J. Migge and J.-P. Elloy, Embedded electronic architecture, in Proceedings of the 3rd International Workshop on Open Systems in Automotive Networks, Bad Homburg, Germany, 2–3 February, 2000.
[24] ITEA, EAST EEA Project, www.east-eea.net/docs.
[25] U. Freund, O. Gurrieri, J. Küster, H. Lonn, J. Migge, M.O. Reiser, T. Wierczoch, and M. Weber, An architecture description language for developing automotive ECU-software, in INCOSE'2004, Toulouse, France, 20–24 June, 2004, pp. 101–112.
[26] www.mathworks.com/.
[27] Ascet SupplyChain, www.ascet.com/.
[28] Ilogix, Statemate, www.ilogix.com/.
[29] Esterel Technologies, SCADE Suite for Safety-Critical Software, www.estereltechnologies.com.
[30] C. Jard, Automatic Test Generation Methods for Reactive Systems, CIRM Summer School, Marseille, 1998.
[31] K. Tindell and J. Clark, Holistic schedulability analysis for distributed hard real-time systems, Microprocessing and Microprogramming, 40, 117–134, 1994.
[32] Y.Q. Song, F. Simonot-Lion, and N. Navet, De l'évaluation de performances du système de communication à la validation de l'architecture opérationnelle — cas du système embarqué dans l'automobile, Ecole d'été temps réel 1999, C.N.R.S., Ed. LISI-ENSMA, Poitiers, France, 1999.
[33] Y.Q. Song, F. Simonot-Lion, and B. Pierre, VACANS — A tool for the validation of CAN-based applications, in Proceedings of WFCS'97, Barcelona, Spain, October 1997.
[34] P. Castelpietra, Y.Q. Song, F. Simonot-Lion, and M. Attia, Analysis and simulation methods for performance evaluation of a multiple networked embedded architecture, IEEE Transactions on Industrial Electronics, 49, 1251–1264, 2002.
[35] C. Alain, The electrical electronic architecture of PSA Peugeot Citroen vehicles: current situation and future trends, presentation at Networking and Communication in the Automobile, Munich, Germany, March 2000.
[36] K. Tindell and A. Burns, Guaranteed message latencies on Controller Area Network (CAN), in Proceedings of the 1st International CAN Conference, ICC'94, 1994.
[37] SES Workbench, HyPerformix Inc., http://www.hyperformix.com.
42
Fault-Tolerant Services for Safe In-Car Embedded Systems

Nicolas Navet and Françoise Simonot-Lion
Institut National Polytechnique de Lorraine

42.1 Introduction
    The Issue of Safety-Critical Systems in the Automotive Industry • Generic Concepts of Dependability
42.2 Safety-Relevant Communication Services
    Reliable Communication • Higher-Level Services
42.3 Fault-Tolerant Communication Systems
    Dependability from Scratch: TTP/C • Scalable Dependability: FlexRay • Adding Missing Features to an Existing Protocol: CAN
42.4 Conclusion
Acknowledgment
References
42.1 Introduction
In the next decade, most features of a car will be supported by electronic embedded systems. This strategy is already used for functions such as light management, window management, door management, etc., as well as for the control of traditional functions such as braking, steering, etc. Moreover, the planned deployment of X-by-Wire technologies is leading the automotive industry into the world of safety-critical applications. Therefore, such systems must, obviously, respect their functional requirements and meet performance and cost objectives and, furthermore, guarantee their dependability despite the possible faults (physical or design) that may occur. More precisely, the design of such systems must take into account two kinds of dependability requirements. On the one hand, safety, the absence of catastrophic consequences for the driver, the passengers, and the environment, has to be ensured; on the other hand, the system has to provide a reliable service and be available for the requests of its users.

This section introduces the emerging standards that are likely to influence the certification process for in-vehicle embedded systems and describes the general concepts of dependability and the means by which dependability can be attained. The communication system is a key point for an application: it is
in charge of the transmission of critical information or events between functions that are deployed on distant stations (Electronic Control Units, ECUs), and it is a means for the OEMs (carmakers) to integrate functions provided by different suppliers. So, in this chapter, we pay special attention to in-vehicle embedded networks and to the services that enhance the dependability of the exchanges and the dependability of the embedded applications. Note that a classical means, sometimes imposed by regulatory policies in domains close to the automotive one, consists of introducing mechanisms that enable a system to tolerate faults. The purpose of Section 42.2 is to present the main services, provided by a protocol, that allow an application to tolerate certain faults. These services generally provide fault detection and, for some of them, are able to mask fault occurrences from the upper layers and to prevent the propagation of faults. In Section 42.3, we compare some classes of protocols with respect to their ability to provide services for increasing the dependability of an application. For each class, we discuss the effort needed at the middleware or application level to reach the same quality of system.
42.1.1 The Issue of Safety-Critical Systems in the Automotive Industry
In some domains recognized as critical (e.g., nuclear plants, railways, avionics), the safety requirements on computer-based embedded systems are very rigorous, and the manner of specifying and managing dependability/safety requirements is an important issue. These systems have to obey regulatory policies that require these industries to follow a precise certification process. At the moment, nothing similar exists in the automotive industry for certifying electronic embedded systems. Nevertheless, the problem is crucial for carmakers as well as for suppliers, and several proposals are presently under study. Among the existing certification standards [1], RTCA/DO-178B [2], used in avionics, and EN 50128 [3], applied in the railway industry, provide stringent guidelines for the development of safety-critical embedded systems. However, these standards are hardly transposable to in-vehicle software-based systems: partitioning of software (critical/noncritical), multiple versions, dissimilar software components, use of active redundancy, and hardware redundancy. In the automotive sector, the Motor Industry Software Reliability Association (MISRA), a consortium of the major actors in automotive products in the UK, proposes a loose model for the safety-directed development of vehicles with software on board [4]. Finally, the generic standard IEC 61508 [5], applied to Electrical/Electronic/Programmable Electronic systems, is a good candidate for supporting a certification process in the automotive industry.

In Europe, in particular in the transport domain, the trend is to move from "rule-based" to "risk-based" regulation [6]. So, the certification process will certainly be based on the definition of safety performance levels that characterize a safety function with regard to the consequences of its failures, classified as catastrophic, severe, major, minor, or insignificant. The IEC 61508 standard proposes, in addition to other requirements on the design, validation, and testing processes, four integrity levels, termed "Safety Integrity Levels" (SILs), with a quantitative safety requirement for each (see Table 42.1). The challenge is therefore to prove that each function realized by a computer-based system reaches the requirements imposed by its SIL. "Dependability," "safety," "failure," etc. are terms used in standards documents; we therefore recall, in the next section, the definitions commonly accepted in the context of dependability.

TABLE 42.1 Relationship between Integrity Levels and Quantitative Requirements for a System in Continuous Operation (IEC 61508)

Integrity level   Probability of dangerous failure occurrence/h
SIL 4             P ≤ 10^−8
SIL 3             10^−8 < P ≤ 10^−7
SIL 2             10^−7 < P ≤ 10^−6
SIL 1             10^−6 < P ≤ 10^−5
42.1.2 Generic Concepts of Dependability
Dependability is defined in Reference 7 as "the ability of a system to deliver service that can justifiably be trusted." The service delivered by a system is its behavior as perceived by another system (human or physical) interacting with it. A service can deviate from its desired functionality; the occurrence of such an event is termed a failure. An error is defined as the part of the system state that may cause a failure. A fault is the determined or hypothesized cause of an error. It is active when it produces an error, and dormant otherwise. A system fails according to several failure modes. A failure mode characterizes a service that does not fit its desired functionality according to three parameters: the failure domain (value domain or time domain, see Section 42.2.2.2.1), the perception of the failure by several users of the system (consistent or inconsistent), and the consequences of the failure (from insignificant to catastrophic). As we will see in Section 42.2, services are available at the communication level to cope with the occurrence of failures in the value or time domain and to preserve the consistency of the perception of a failure by several stations. The consequence of a failure at the communication level is the responsibility of the designer of the embedded system, and its assessment is a difficult issue.

Dependability is a concept that in fact covers several attributes. From a quality point of view, reliability, the continuity of a correct service, and availability, expressing the readiness for a correct service, are important for automotive embedded systems. Note that the online detection of a low level of reliability or availability of a service supported by an embedded system can lead to the "nonavailability" of the vehicle and consequently affect the quality of the vehicle as perceived by the customer. Safety is the reliability of the system with regard to critical failure modes, that is, failure modes leading to catastrophic, severe, or major consequences [8]. This attribute characterizes the ability of a system to avoid the occurrence of catastrophic events that may be very costly in terms of monetary loss and human suffering. One way to reach the safety objective is, first, to apply a safe development process in order to prevent and remove design faults. As presented in Reference 9, this method has to be completed, at the design step, by an evaluation of the embedded system's behavior (fault forecasting). This can be achieved through a qualitative analysis (identification of failure modes, component failures, and environmental conditions leading to a system failure) and a quantitative analysis (probability evaluation applied to some parameters for the verification of dependability properties). The last means for reaching dependability is to apply a fault-tolerance approach. This technique is mandatory for in-car embedded systems because the environment of the system is only partially known and the reliability of the hardware components cannot be fully guaranteed.

Note that the problem, in the automotive industry, is not only to be compliant with standards whose purpose mainly concerns the safety of the driver, the passengers, the vehicle, and its environment, but also to ensure a level of performance, comfort, and, more generally, quality of the vehicle.
The specification, in a quantitative way, of the properties required of an electronic embedded system, and the proof that the system meets these requirements, are the principal challenges facing the automotive industry.
42.2 Safety-Relevant Communication Services
In this section, we discuss the main services and functionalities that the communication system should offer to ease the design of fault-tolerant automotive applications. In order to reduce development time and increase quality through the reuse of validated components, these services should, as much as possible, be implemented in layers below the application-level software. More precisely, some services, such as the global time, are usually provided by the communication controller, while others, such as redundancy management, are implemented in the middleware software layer (e.g., the OSEK fault-tolerant layer [10] or the middleware described in Reference 11). As suggested in Reference 12, solutions where the middleware runs on a dedicated CPU will enhance the predictability of the system by reducing the interactions between the middleware layer and the application-level software. In particular, it will prevent conflicts in accessing the CPU, which may induce temporal faults such as missed deadlines.
42.2.1 Reliable Communication
The purpose of this section is to discuss the main services and features, related to data exchange, that one can expect for safety-critical automotive applications. On the one hand, these services serve to hide the occurrence of faults from higher levels. For example, a shielded transmission support will mask some EMIs (electromagnetic interferences), considered here as faults. On the other hand, other services are intended to detect the occurrence of errors and to avoid their propagation in the system (e.g., a Cyclic Redundancy Check [CRC] will prevent corrupted data from being used by an applicative process).

42.2.1.1 Robustness against EMIs
Embedded automotive systems suffer from environmental perturbations such as α particles, temperature peaks, or EMIs. Perturbations of the EMI type have long been identified [13,14] as a serious threat to the correct behavior of an automotive system. EMIs can either be radiated by in-vehicle electrical devices (switches, relays, etc.) or come from a source outside the vehicle (radio, radar, flashes of lightning, etc.). EMIs can affect the correct functioning of all electronic devices, but the transmission support is a particularly "weak link." The whole problem is to ensure that the system will behave according to its specification whatever the environment. In general, the same Medium Access Control (MAC) protocol can be implemented on different types of physical layers (e.g., unshielded pair, shielded twisted pair, or plastic optical fiber), which exhibit significantly different behavior with regard to EMIs (see Reference 15 for more details on the electromagnetic sensitivity of different types of transmission support). Unfortunately, the use of an all-optical network, which offers very high immunity to EMIs, is generally not feasible because of the low-cost requirement imposed by the automotive industry. Besides using a resilient physical layer, another means to alleviate the EMI problem is to replicate the transmission channels, each channel transporting its own copy of the same frame. Although an EMI is likely to affect both channels in quite a similar manner, the redundancy provides some resilience to transmission errors.

The two previous approaches are classical means for hiding, as far as possible, faults due to EMIs occurring at the physical layer. Nevertheless, when a frame is corrupted during transmission (i.e., at least one bit has been inverted), it is crucial that the receiver be able to detect this in order to discard the frame. This is the role of the CRC, whose so-called Hamming distance indicates the number of inverted bits below which the CRC will detect the corruption. It is worth noting that if the Hamming distance of the MAC protocol CRC is too small with regard to the dependability objectives, a middleware layer can transparently insert an additional CRC in the data field of the MAC-level frame, as sketched below. This will reinforce the ability of the system to detect errors happening during the transmission.
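A minimal sketch of such a middleware-level CRC is given below. It uses the common CRC-16/CCITT polynomial 0x1021 purely for illustration; the polynomial choice, frame layout, and function names are assumptions of this example, not part of any protocol discussed in this chapter.

from typing import Optional

def crc16_ccitt(data: bytes, crc: int = 0xFFFF) -> int:
    # Bitwise CRC-16/CCITT-FALSE: polynomial 0x1021, initial value 0xFFFF.
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) if crc & 0x8000 else (crc << 1)
            crc &= 0xFFFF
    return crc

def wrap(payload: bytes) -> bytes:
    # The middleware appends its own CRC; the result becomes the data
    # field of the MAC-level frame.
    return payload + crc16_ccitt(payload).to_bytes(2, "big")

def unwrap(field: bytes) -> Optional[bytes]:
    # Returns the payload if the middleware CRC matches, None otherwise;
    # a corruption that slipped past the MAC-level CRC is thus still caught.
    payload, received = field[:-2], int.from_bytes(field[-2:], "big")
    return payload if crc16_ccitt(payload) == received else None

assert unwrap(wrap(b"\x12\x34")) == b"\x12\x34"

What matters here is the principle of stacking a second, independent check on top of the MAC-level CRC; an actual design would select the polynomial so that the combined Hamming distance matches the dependability objectives.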
42.2.1.2 Time-Triggered Transmissions
One major design issue is to ensure that at run-time no errors will jeopardize the requirements imposed on the temporal behavior of the system; for data exchanges, these temporal requirements can be imposed on the response times of frames or on the jitter upon reception. Among communication networks, one distinguishes time-triggered (TT) protocols, where transmissions are driven by the progress of time (i.e., frames are transmitted at predefined points in time), and event-triggered (ET) protocols, where transmissions are driven by the occurrence of events. Major representatives of ET and TT protocols considered for use in safety-critical in-vehicle communications will be discussed in Section 42.3. Both types of communication have advantages and drawbacks, but it is now widely admitted that dependability is much easier to ensure using a TT bus (see, for instance, [9,16–18]), the main reasons being that:
• Access to the medium is deterministic (i.e., the order of the transmissions is defined statically at design time and organized in "rounds" that repeat in cycles); thus the frame response times are bounded and there is no jitter at reception.
• It simplifies composability, which is the ability to add new nodes without affecting existing ones,1 as well as partitioning, which is the property that assures that a failure occurring in one subsystem cannot propagate to others.
• The behavior of a TT communication system is predictable, which makes it easier to understand its behavior and to verify that the temporal constraints are respected.
• Message transmissions can be used as "heartbeats," which allows a very prompt detection of station failures.
• Finally, the medium access scheme does not limit the network bandwidth, as is the case with the arbitration on message priority used by the Controller Area Network (CAN), and thus large amounts of data can be transferred between nodes.
These reasons explain why, currently, only TT communication systems are being considered for use in safety-critical applications such as steer-by-wire [19,20] or brake-by-wire.

42.2.1.3 Global Time
Some control functions need to know the order of occurrence among a set of events that happened in the system; some functions, such as diagnosis, even need to be able to date them precisely. This can be achieved by forming a global synchronized time base. The second reason why a global time is needed comes from the TT communication scheme: in TT communications, as time drives the transmissions, all nodes of the network must have a coherent notion of time, and a clock synchronization algorithm is required. This clock synchronization algorithm is, in fact, a service that tolerates the faults that can affect local clocks. Indeed, the local clocks tend to drift apart since oscillators are not perfect; this imposes periodic resynchronization. For instance, in TTP/C, each node periodically adjusts its clock according to the difference between its own clock and the average of those of the other nodes (the clocks with the highest and lowest values are discarded); a sketch of this correction rule is given below. A crucial performance metric for a clock synchronization algorithm is the maximum difference that can be observed among all local clocks. This value directly impacts the network's throughput in TT buses, since the length of a transmission window, in addition to the actual transmission time of the frame, has to include some extra time to compensate for the skew between local clocks (i.e., a frame transmitted at the right point in time must not be rejected because the clock of a receiver diverges from the clock of the sender). Other criteria of major interest are the number and the types of faults (e.g., wrong clock value or no value received) that can be tolerated by the algorithm. For example, the TTP/C algorithm can tolerate a single fault on a network composed of at least four nodes (see Reference 21 for a detailed analysis).
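A minimal sketch of such a fault-tolerant average correction, patterned after the description above (discard the extreme readings, average the rest); the function name and the representation of the clock differences as plain floats are choices of this illustration, not the actual TTP/C implementation:

def clock_correction(deltas):
    # deltas: measured differences between the local clock and the clocks
    # observed in the other nodes' transmissions (one reading per node).
    # The highest and the lowest readings are discarded before averaging,
    # so a single arbitrarily wrong clock value cannot corrupt the result.
    if len(deltas) < 4:
        raise ValueError("tolerating one faulty clock requires at least 4 readings")
    trimmed = sorted(deltas)[1:-1]
    return sum(trimmed) / len(trimmed)

# One faulty node reports a difference of +40.0; it is discarded together
# with the lowest reading, and the correction is (0.5 + 0.8) / 2 = 0.65.
correction = clock_correction([0.8, -1.2, 0.5, 40.0])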
42.2.1.4 Atomic Broadcast and Acknowledgment
At some point in time, it is mandatory that some functions distributed over the network have the same understanding of the state of the system in order to interoperate in a satisfactory manner. This implies that the information on the state of the system must be consistent throughout the whole network (this property is termed "spatial consistency" or "exact agreement"). The requirement of spatial consistency is particularly important for active redundancy,2 which is the basic strategy for ensuring fault tolerance, that is, the capacity of a system to deliver its service even in the presence of faults. To be able to compare the output results, it is crucial that the set of all replicated components process the same input data, which, in particular, implies that the values obtained from local sensors are exchanged over the network. All nonfaulty nodes must thus receive the messages in the same order and with the same content. This property, which is called "atomic broadcast" or "interactive consistent broadcast" (see References 22 and 16), enables distributed processes to reach common decisions or "consensus" despite faults, for instance, using majority voting.
new nodes requires that some bandwidth has been reserved for their transmission at design time. For instance, in TTP/C, some “slots” can be left free for future use. 2 Active redundancy means that a set of components realizing the same functions in parallel enables the system to continue to operate despite the loss of one or more units. In passive redundancy, additional components are only activated when the primary component fails.
© 2006 by Taylor & Francis Group, LLC
42-6
Embedded Systems Handbook
distributed processes to reach common decisions or “consensus” despite faults, for instance, using majority voting. In practice, it may happen that all or a subset of nodes do not receive a message because of an incorrect signal shape due to EMIs or because nodes are temporarily faulty. The communication system usually provides, through the use of a CRC for detecting corrupted frames, a weak form of atomic broadcast that ensures that all stations that successfully receive a frame get the same value. This alone is however not sufficient for constructing fault-tolerant applications and, in addition, at least the acknowledgment of the reception of a message is needed because the sender, and possibly other nodes, may have to adapt their behavior according to this information (e.g., reschedule the transmission of the information in a subsequent frame). This latter requirement is important, in the automotive context, for distributed functions, such as steering, braking, or active suspension. 42.2.1.5 Avoiding “Babbling-Idiots” As said before, it is crucial that the system does not deviate from the temporal behavior defined at design time. If a node does not behave in the specified manner, it has to be detected and masked at the communication system level in order to prevent the failure from propagating. It may happen that a faulty ECU transmits outside its specification, for example, it may send at a wrong point in time or send a frame larger than planned at design time. When communications are multiplexed, this will perturb the correct functioning of the whole network, especially the temporal behavior of the data exchanges. One well-known manifestation is the so-called babbling idiots [23,24] nodes that transmit continuously (e.g., due to a defective oscillator). To avoid this situation, a component called the “bus guardian,” restricts the controller’s ability to transmit by allowing transmission only when the node exhibits a specified behavior. Ideally, the bus guardian should have its own copy of the communication schedule, should be physically separated from the controller, should possess its own power supply and should be able to construct the global time itself. Due to the strong pressure from the automotive industry concerning costs, these assumptions are not fulfilled in general, which reduces the efficiency of the bus guardian strategy. If the network has a star topology, with a central interface — called the “star” — for interconnection, instead of the classical bus topology, then the star can act as a central bus guardian and protect against errors that cannot be avoided by a local bus guardian. For instance, a star topology is more resilient to spatial proximity faults (e.g., temperature peaks) and to faults due to the desynchronization of an ECU (i.e., the star can disconnect a desynchronized station). To avoid a single point of failure, a dual star topology should be used with the drawback that the length of the wires is significantly increased.
42.2.2 Higher-Level Services

In this section, we identify services that provide fault-tolerant mechanisms belonging conceptually to layers above the MAC in the OSI reference model.

42.2.2.1 Group Membership Service

As discussed in Section 42.2.1.4, atomic broadcast ensures that all nonfaulty stations possess the same variables describing the state of the system at a particular point in time. Another property required for implementing fault tolerance at a high level is that all nonfaulty stations know the set of stations that are operational (or "nonfaulty"). This service, which is basically a consensus on the set of operational nodes, is provided by group membership, and it is generally highly recommended for X-by-Wire applications. A classical example, detailed in Reference 12, is a brake-by-wire system where four ECUs, interconnected by a network, control the brakes located at the four wheels of the car. As soon as a wheel ECU is no longer functioning, the brake force applied to its wheel has to be redistributed among the remaining three wheels in such a way that the car can be safely parked. As pointed out in Reference 12, for a brake-by-wire application, the time interval between the failure of the wheel ECU and the knowledge of this
event by all other stations has an impact on the safety of the application; it therefore has to be bounded and taken into account at design time.

A membership service implemented at the communication system level assumes that all nodes that are correctly participating in the communication protocol are nonfaulty. In TT systems, as transmissions are perfectly foreseeable, the decisions regarding membership can be taken at the points in time where frames should have been received. In a very simplified way, a missing or "faulty" frame indicates to the receivers that the sending node is not functioning properly. In addition, a node that is unable to transmit must consider itself faulty and stop operating. Since it takes some time to detect faulty nodes, there can be faulty stations in the membership list of a node during some time intervals. The maximum number of such undetected faulty nodes, the maximum duration it takes to discover that a node is faulty, the maximum number of faulty stations, and the types of faults that can be detected are major performance criteria of a membership algorithm. Other criteria include the time needed for a "repaired" node to rejoin the membership list; how well the different nodes agree on the membership list at any point in time (are "cliques" — i.e., sets of stations that disagree on the state of the system — possible, and how long can such cliques coexist?); and the implementation overheads, mainly in terms of CPU load and network bandwidth. Group membership algorithms are complex distributed algorithms, and formal methods are of great help in analyzing and validating them; the reader can refer to [21,22,25,26] as good starting points on this topic.

42.2.2.2 Management of Nodes' Redundancy

A classical way of ensuring fault tolerance is to replicate critical components. We saw, in Section 42.2.1.1, that the redundancy of the bus can hide faults due to EMI. To achieve fault tolerance, certain nodes are also replicated and clustered into so-called Fault-Tolerant Units (FTUs). An FTU is a set of several stations that perform the same function, and each node of an FTU possesses its own slot in the round, so that the failure of one or more stations in the same FTU can be tolerated. The role of FTUs is actually twofold. First, they make the system resilient in the presence of transmission errors (some frames sent by nodes of the FTU may be correct while others are corrupted). Second, they provide a means to fight against measurement and computation errors occurring before transmission (some nodes may send the correct values while others may make errors).

42.2.2.2.1 Fail-Silence Property

In the fault tolerance terminology, a node is said to be fail-silent if (1) it sends frames at the correct points in time (correctness in the time domain) and (2) the correct value is transmitted (correctness in the value domain), or (3) it sends detectably incorrect frames (e.g., wrong CRC) in its own slot, or no frame at all. A communication system such as TTP/C provides very good support for requirements (1) and (3) (whose fulfillment provides the so-called fail-silence in the temporal domain), especially through the bus guardian concept (see Section 42.2.1.5), while the value domain is the responsibility of higher-level layers. The use of fail-silent nodes greatly decreases the complexity of designing a critical application, since data produced by fail-silent nodes is always correct and thus can be safely consumed by the receivers.
If nodes are fail-silent, tolerating one arbitrary failure can be achieved with FTUs made of two nodes, whereas three nodes are necessary if the nodes are not fail-silent. However, in practice, it is difficult to ensure the fail-silence assumption, especially in the value domain. Basically, a fail-silent node has to implement redundancy plus error-detection mechanisms, and stop functioning once a failure is detected. Self-checking mechanisms can be implemented in hardware or, more usually, in software on commercial off-the-shelf hardware [27]. An example of such a mechanism is the "double execution" strategy, which consists of running each task twice and comparing the outputs. However, both executions can be affected in the same way by a single error; a solution that provides some protection against such so-called common-mode faults is to perform a third execution with a set of reference input data and to compare the output of this execution with precomputed results that are known to be correct. This strategy is known as "double execution with reference check." The reader is referred to References 11, 27, and 28 as good starting points on the problem of implementing fail-silent nodes.
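The following sketch illustrates the "double execution with reference check" strategy just described. The task, the reference values, and the shutdown hook are invented for the example, and the task is assumed to be deterministic and side-effect-free:

#include <stdbool.h>
#include <stdint.h>

extern int32_t control_task(int32_t input);  /* task under protection (assumed deterministic) */
extern void fail_silent_shutdown(void);      /* stop all transmissions: the node falls silent */

static const int32_t ref_input  = 42;        /* reference input data */
static const int32_t ref_output = 1234;      /* precomputed, known-correct result */

/* Run the task twice on the live input and compare the outputs; a third
   run on the reference input guards against common-mode faults. On any
   mismatch the node goes silent instead of transmitting a wrong value. */
bool run_protected(int32_t input, int32_t *out)
{
    int32_t r1 = control_task(input);   /* first execution  */
    int32_t r2 = control_task(input);   /* second execution */

    if (r1 != r2 || control_task(ref_input) != ref_output) {
        fail_silent_shutdown();
        return false;
    }
    *out = r1;
    return true;
}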
42.2.2.2.2 Message Agreement

From an implementation point of view, it is usually preferable to present only one copy of the data to the application, in order to simplify the application code (considering possible divergences between replicated message instances is then not needed) and to keep it independent of the degree of redundancy (i.e., the number of nodes composing an FTU). The algorithm responsible for choosing the value that will be passed to the application is termed the "agreement algorithm." Many agreement strategies are possible: pick-any (when the replicated messages come from an FTU made of fail-silent nodes), average-value, pick-a-particular-one (the selected value has been produced by the best sensor), majority vote, etc. (a sketch of such an agreement function is given below, after the discussion of functioning modes). The OSEK/VDX consortium [10] has proposed a software layer responsible for implementing the agreement strategy. Two other important services of the OSEK FTCom (Fault-Tolerant Communication layer) are (1) to manage the packing of signals (elementary pieces of information such as the speed of the vehicle) into frames according to a precomputed configuration, which is needed if the use of network bandwidth has to be optimized (see, for instance, References 29 and 30 for frame-packing algorithms), and (2) to provide message-filtering mechanisms for passing only "significant" data to the application. Another fault-tolerant layer that offers the agreement service is described, together with its set of associated tools, in Reference 11.

42.2.2.3 Support for Functioning Modes

A functioning mode is a specific operational phase of an application. Typically, several mutually exclusive functioning modes are defined in a safety-critical application. For a vehicle, possible modes include factory mode (e.g., download of calibration parameters), prerun mode (after the doors are unlocked and before the engine is started — preheating is possible for some components), postrun mode (the engine was shut off but, e.g., cooling can still be necessary), park mode (most ECUs are powered off), and even show-room mode. Besides these "normal" functioning modes, the occurrence of a failure can trigger switching to a particular mode that aims to bring the system back to a safe state. Particular functions correspond to each functioning mode, which means a different set of tasks and messages as well as different schedules. While mode changes provide flexibility, great care must be taken that changes happen at the right points in time and that all nodes agree on the current mode. The communication system can provide some support in this area by ensuring that mode changes take place only at predefined points in time, are triggered only by authorized nodes, and that the message schedule is changed simultaneously for all nodes. For example, TTP/C [31,32] offers services for immediate mode changes (i.e., the change is performed at the end of the transmission window where it was requested) as well as deferred mode changes (i.e., the change is performed at the end of the current message schedule, or "cluster cycle" in the TTP/C terminology).
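As promised above, here is a sketch of an agreement function in the style of the strategies listed in Section 42.2.2.2.2. The interface and types are invented for illustration; a real FTCom-like layer would be configuration-driven rather than strategy-switched at run time:

#include <stdint.h>

typedef enum { AGREE_PICK_ANY, AGREE_AVERAGE, AGREE_MAJORITY } agree_t;

/* Reduce n replicated copies of one signal (n >= 1), received from the
   nodes of an FTU, to the single value presented to the application. */
uint16_t agree(const uint16_t *copies, int n, agree_t strategy)
{
    switch (strategy) {
    case AGREE_PICK_ANY:              /* fail-silent FTU: any received copy is correct */
        return copies[0];
    case AGREE_AVERAGE: {             /* smooth out sensor noise */
        uint32_t sum = 0;
        for (int i = 0; i < n; i++) sum += copies[i];
        return (uint16_t)(sum / (uint32_t)n);
    }
    case AGREE_MAJORITY:              /* Boyer-Moore vote; assumes a majority value exists */
    default: {
        uint16_t cand = copies[0];
        int count = 1;
        for (int i = 1; i < n; i++) {
            if (count == 0) { cand = copies[i]; count = 1; }
            else count += (copies[i] == cand) ? 1 : -1;
        }
        return cand;
    }
    }
}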
42.3 Fault-Tolerant Communication Systems

Among the communication protocols considered for use in safety-critical automotive systems, one can distinguish three main types:

• Protocols that have been designed from scratch to provide all the main fault-tolerant services. The prominent representative of this class is the TTP/C protocol [47].
• Protocols that offer the basic functionalities for fault-tolerant systems, among which are a global time and bus guardians. The idea is to allow a scalable dependability on a per-network or even per-node basis. Missing features are to be implemented in software layers above the communication controllers. The representative of this class in the automotive context is FlexRay [33].
• Protocols not initially conceived with the objective of fault tolerance, to which missing features are added. This is the case with CAN [34], currently the de facto standard in production cars, which is being considered for use in safety-critical applications (see, for instance, Reference 17) on the condition that additional features are provided.
42.3.1 Dependability from Scratch: TTP/C

The TTP/C protocol, which is specified in Reference 32, was designed and extensively studied at the Vienna University of Technology. TTP/C is a central part of the Time-Triggered Architecture (TTA — see Reference 35), which is a complete framework for building fault-tolerant distributed applications according to the TT paradigm. Hardware implementations of the TTP/C protocol, as well as software tools for the design of the application, are commercialized by the TTTech company and are available today.

On a TTP/C network, the transmission support is replicated and each channel transports its own copy of the same message. TTP/C can be implemented with a bus topology or a more resilient single-star or dual-star topology. At the MAC level, the TTP/C protocol implements a synchronous TDMA scheme: the stations (or nodes) access the bus in a strict deterministic sequential order, and each station possesses the bus for a constant period of time called a "slot," during which it has to transmit one frame. The sequence of slots in which all stations have accessed the bus exactly once is called a "TDMA round." The size of the slot is not necessarily identical for all stations in the TDMA round, but a slot belonging to one station has the same size in each round. Consecutive TDMA rounds may differ in the data transmitted during the slots, and the sequence of all distinct TDMA rounds, called the "cluster cycle," repeats itself cyclically (a small sketch of this slot timing is given at the end of this section).

TTP/C possesses numerous features and services related to dependability along with TT communication. In particular, TTP/C implements a clique avoidance algorithm (the stations that belong to a "minority" in their understanding of the state of the system will eventually be excluded) and a membership algorithm that also provides data acknowledgment (one knows after a bounded time whether a station has received a message or not). A bus guardian, a global clock, and support for mode changes are also part of the specification. The algorithms used in TTP/C are by themselves intricate and interact in a very complex manner, but most of them have been formally verified (see [21,25,36]). The fault hypothesis used for the design of TTP/C is well specified, but also quite restrictive (two successive faults, such as transmission errors, must occur at least two rounds apart). Situations outside the fault hypothesis are treated using "never give up" (NGU) strategies that aim to continue operating in a degraded mode.

From the point of view of the set of available services, TTP/C is a mature solution. In our opinion, future research should investigate whether the fault hypothesis considered in the TTP/C design is pertinent in the context of automotive embedded systems, where the environment can be very harsh (e.g., bursts of transmission errors may happen). This can be done starting from measurements taken on board prototypes, which would help to estimate the relevance of the fault hypothesis. Other research could study the behavior of the communication system outside the fault hypothesis and the impact on the application; this could be undertaken using fault injection.
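The sketch below, promised above, makes the slot and round timing rules of the TDMA scheme concrete. The slot lengths are invented example values:

#include <stdint.h>

#define N_NODES 4
/* Slot sizes may differ between stations, but each station's slot has
   the same size in every round. */
static const uint32_t slot_len_us[N_NODES] = { 400u, 200u, 200u, 300u };

/* Start offset of a station's slot within a TDMA round. */
uint32_t slot_offset_us(int node)
{
    uint32_t t = 0;
    for (int i = 0; i < node; i++)
        t += slot_len_us[i];
    return t;
}

/* A round ends when every station has had its slot exactly once; the
   cluster cycle is a fixed sequence of such rounds that then repeats. */
uint32_t round_len_us(void)
{
    return slot_offset_us(N_NODES);
}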
42.3.2 Scalable Dependability: FlexRay

A consortium of major companies from the automotive field is currently developing the FlexRay protocol. The core members are BMW, Bosch, DaimlerChrysler, General Motors, Motorola, Philips, and Volkswagen. The first publicly available specification of the FlexRay protocol has already been released [33].

The FlexRay network is very flexible with regard to topology and transmission support redundancy. It can be configured as a bus, a star, or a multistar, and it is not mandatory that each station possess replicated channels or a bus guardian, even though this should be the case for critical functions. At the MAC level, FlexRay defines a communication cycle as the concatenation of a TT (or static) window and an ET (or dynamic) window. In each communication window, whose size is set statically at design time, a different protocol is applied. The communication cycles are executed periodically. The TT window uses a TDMA MAC protocol; the main difference with TTP/C is that a station may possess several slots in the TT window, but the size of all the slots is identical. In the ET part of the communication cycle, the protocol is FTDMA (Flexible Time Division Multiple Access): the time is divided into so-called minislots, and each station possesses a given number of minislots (not necessarily consecutive) in each of which it can start the transmission of a frame.
FIGURE 42.1 Example of message scheduling in the dynamic segment of the FlexRay communication cycle (channels A and B, each with its own slot counter; frames with IDs n to n + 4 are sent in minislots, some of which remain idle).
The bus guardian is not used in the dynamic window to control whether transmissions take place as specified. A minislot remains idle if the station has nothing to transmit. An example of a dynamic window is shown in Figure 42.1: on channel B, frame n has begun transmitting in minislot n, while minislots n + 1 and n + 2 have not been used. It is noteworthy that frame n + 4 is not received simultaneously on channels A and B since, in the dynamic window, transmissions are independent on the two channels (a pseudocode sketch of this arbitration scheme is given at the end of this section). The FlexRay MAC protocol is more flexible than the TTP/C MAC, since in the static window nodes are assigned as many slots as necessary (up to 4095 for each node), and since frames are only transmitted when necessary in the dynamic part of the communication cycle. As with TTP/C, the structure of the communication cycle is statically stored in the nodes; unlike TTP/C, however, mode changes with a different communication schedule for each mode are not possible.

From the dependability point of view, FlexRay specifies solely TT communication with a bus guardian and a clock synchronization algorithm on dual wires (shielded or unshielded — see Reference 37 for the specification of the physical layer). If we consider the brake-by-wire example of Section 42.2.2.1, the protocol offers no way for a node to know that one of the wheel ECUs is no longer operational, which would be needed to take the appropriate decision (e.g., redistribution of the brake force). Features that can be necessary for implementing fault-tolerant applications, such as membership and acknowledgment services or mode management facilities, will have to be implemented in software or hardware layers on top of FlexRay, with the drawback that efficient implementations might be more difficult to achieve above the data-link layer. There are indeed individual solutions in the literature for each of the missing services, but these protocols might have very complex interactions when used jointly, which requires that the whole communication profile be carefully validated by tests, simulation, fault injection, and formal proof under a well-defined fault hypothesis.

In automotive systems, critical and noncritical functions will increasingly coexist and interoperate. In the FlexRay specification ([33], p. 8), it is argued that the protocol provides scalable dependability, that is, the "ability to operate in configurations that provide various degrees of fault-tolerance." Indeed, the protocol allows for mixing, on the same network, single and dual transmission supports (interconnected through a star), subnetworks of nodes without bus guardians or with different fault-tolerance capabilities with regard to clock synchronization, nodes that do not send or receive TT messages, etc. This flexibility can prove efficient in the automotive context in terms of cost and reuse of existing components, provided the missing fault-tolerance features are supplied in a middleware layer such as OSEK FTCom (see the introduction of Section 42.2 and Reference 10) or the one currently under development within the automotive industry project AUTOSAR (see http://www.autosar.org).
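As promised above, the following sketch is a simplified model of the FTDMA arbitration on one channel of the dynamic segment (the helper functions are invented, and the rule that a frame must fit in the remaining minislots is omitted): the slot counter advances once per idle minislot, whereas a transmission freezes it for the duration of the frame, so lower-priority frame IDs may be pushed out of the current cycle.

#include <stdbool.h>
#include <stdint.h>

extern bool     frame_pending(uint16_t frame_id);        /* does a node want to send this ID? */
extern uint16_t frame_len_minislots(uint16_t frame_id);  /* frame length in minislots */

/* One dynamic segment on one channel: frame IDs are served in priority
   order; an ID not reached before the minislots run out must wait for
   the next communication cycle. */
void dynamic_segment(uint16_t first_id, uint16_t n_minislots)
{
    uint16_t slot_counter = first_id;
    uint16_t minislot = 0;

    while (minislot < n_minislots) {
        if (frame_pending(slot_counter))
            minislot += frame_len_minislots(slot_counter); /* frame consumes minislots */
        else
            minislot += 1;                                 /* idle minislot */
        slot_counter += 1;                                 /* next frame ID may contend */
    }
}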
42.3.3 Adding Missing Features to an Existing Protocol: CAN

Controller Area Network has proved to be a very cost- and performance-effective solution for data exchange in automotive systems during the last 15 years. However, as specified by the ISO standards [34,38],
CAN lacks almost all the features and services identified in Section 42.2 as important for the implementation of fault-tolerant systems: no redundant medium, no TT communication, no global time, no atomic broadcast (even in the "weak form" described in Section 42.2.1.4, due to the well-known inconsistent message omission [39]), no reliable acknowledgment, no bus guardian, no group membership, no functioning-mode management services, etc. Some authors advocate that "CAN can be used as a base and missing facilities can be added as needed" [17], and, over the last years, there have in fact been a number of studies and proposals aimed at adding fault-tolerant features to CAN (see, for instance, [9,40–48]). In the rest of this section, we discuss some such proposals of possible interest for automotive systems.

42.3.3.1 TTCAN: TT Communications on Top of CAN

Two main protocols have been proposed to enable TT transmissions over CAN: TTCAN (Time-Triggered Controller Area Network — see References 40 and 49) and FTT-CAN (Flexible Time-Triggered CAN — see Reference 9). In the following, we consider TTCAN, which has received much attention in the automotive field since it was proposed by Robert Bosch GmbH, a major actor in the automotive industry. Time-Triggered CAN was developed on the basis of the CAN physical and data-link layers. The bus topology of the network, the characteristics of the transmission support, the frame format, as well as the maximum data rate — 1 Mbit/sec — are imposed by the CAN protocol [49]. In addition to the standard CAN features, TTCAN controllers must have the possibility to disable automatic retransmission and to provide the application with the time at which the first bit of a frame was sent or received [49]. Channel redundancy is possible, but not standardized, and no bus guardian is implemented in the node.

The key idea is to propose, as with FlexRay, a flexible TT/ET protocol. TTCAN defines a basic cycle (the equivalent of the FlexRay communication cycle) as the concatenation of one or several TT (or "exclusive") windows and one ET (or "arbitrating") window. Exclusive windows are devoted to TT transmissions (i.e., periodic messages), while the arbitrating window is ruled by the standard CAN protocol: transmissions are dynamic and bus access is granted according to the priority of the frames. Several basic cycles, which differ in their organization (exclusive and arbitrating windows) and in the messages sent inside exclusive windows, can be defined. The list of successive basic cycles is called the system matrix, and the matrix is executed in a loop. Interestingly, the protocol enables the master node — the node that initiates the basic cycle through the transmission of the "reference message" — to stop functioning in TTCAN mode and to resume in standard CAN. Later, the master node can switch back to TTCAN mode by sending a reference message.

42.3.3.2 Improving Error Confinement

The Controller Area Network protocol possesses fault-confinement mechanisms aimed at differentiating between short disturbances caused by EMI and permanent failures due to hardware malfunction. The scheme is based on error counters that are increased and decreased according to particular events (e.g., successful reception of a frame, reception of a corrupted frame, etc.).
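A simplified rendering of this counter scheme could look as follows; only a few of the increment and decrement rules of ISO 11898 are shown, although the thresholds are the standard ones:

#include <stdint.h>

typedef enum { ERROR_ACTIVE, ERROR_PASSIVE, BUS_OFF } can_state_t;

typedef struct {
    uint16_t tec;  /* transmit error counter */
    uint16_t rec;  /* receive error counter  */
} can_counters_t;

/* A short EMI burst raises the counters only briefly; a permanently
   faulty node keeps failing, climbs to error-passive, and finally goes
   bus-off and stops transmitting. */
can_state_t can_state(const can_counters_t *c)
{
    if (c->tec >= 256)                  return BUS_OFF;
    if (c->tec >= 128 || c->rec >= 128) return ERROR_PASSIVE;
    return ERROR_ACTIVE;
}

void on_tx_error(can_counters_t *c)   { c->tec += 8; }
void on_rx_error(can_counters_t *c)   { c->rec += 1; }
void on_tx_success(can_counters_t *c) { if (c->tec > 0) c->tec -= 1; }
void on_rx_success(can_counters_t *c) { if (c->rec > 0) c->rec -= 1; }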
The relevance of the algorithms involved is questionable (see Reference 50), but the main drawback is that a node has to diagnose itself, which can lead to the nondetection of some critical errors, such as a node continuously transmitting a dominant bit (one manifestation of the "babbling idiot" fault, known as "stuck-at-dominant"; see Section 42.2.1.5 and Reference 46). Furthermore, other faults, such as the partitioning of the network into several subnetworks, may prevent all nodes from communicating, due to bad signal reflections at the extremities. To address these problems, several solutions have been proposed, among which are the variant of RedCAN discussed in Reference 47 and CANcentrate, discussed in Reference 46. The latter proposal is an active star that integrates fault-diagnosis and fault-confinement mechanisms that can, in particular, prevent a stuck-at-dominant behavior. The former proposal relies on a ring architecture where each node is connected to the bus through a switch that possesses the ability to exclude a faulty node or a faulty segment from the communication. These two proposals are promising, but developments are still needed (e.g., test implementations, fault injection, formal proofs) before they can actually be used in safety-critical
applications. Furthermore, some faults, such as a node transmitting correct frames more often than specified at design time, are not covered by these proposals. Many other mechanisms have been proposed for increasing the dependability of CAN-based networks [41–45,48] but, as pointed out in Reference 43, while each proposal solves a particular problem, they were not designed to be combined. Furthermore, the fault hypotheses used in their design are not necessarily the same, and the interactions between the protocols remain to be studied in a formal way.
42.4 Conclusion

In the current state of practice, automotive embedded systems make wide use of fault-prevention (e.g., shielded ECUs or transmission supports), fault-detection (e.g., a watchdog ECU that monitors the functioning state of the engine controller, or checks of whether a data item is obsolete or out of range), and fault-confinement techniques (e.g., missing critical data are reconstituted on the basis of other data and, more generally, several degraded functioning modes are specified and implemented). Redundancy is used at the sensor level (e.g., for the wheel angle) but seldom at the ECU level, because of cost pressure and because the criticality of the functions does not absolutely impose it. Some future functions, such as brake- and steer-by-wire, are likely to require active redundancy in order to comply with the acceptable risk levels and the design guidelines that could be issued by certification bodies.

For critical functions that are distributed and replicated throughout the network, the communication system will play a central role by providing the services that will simplify the implementation of fault-tolerant applications. The candidate networks are TTP/C, FlexRay, and CAN-based TT solutions. TTP/C is a mature technology that provides the most important services for supporting fault-tolerant applications. Moreover, TTP/C was designed under a well-specified fault hypothesis, and the correctness of most of its algorithms has been formally proven. In our opinion, future research should investigate the relevance of the TTP/C fault hypothesis in the context of automotive embedded systems, and the behavior of the protocol outside the fault hypothesis.

At the time of writing, FlexRay, which is developed by the major actors of the European automotive industry, seems in a strong position to become a standard in the industry. The main advantage of FlexRay is its flexibility; in particular, it provides both TT and ET communications, and nodes with different fault-tolerance capabilities can coexist on the same network. The services provided by FlexRay do not fulfill all the needs of fault tolerance, and higher-level protocols will have to be developed and validated before FlexRay can be used in very demanding applications. The major issue is that higher-level implementations tend to be less efficient (e.g., bandwidth overhead for acknowledgments, longer maximum time needed for detecting faulty nodes). Finally, solutions based on the TTCAN protocol will require additional low-level mechanisms for fault confinement, as well as higher-level services such as atomic broadcast and membership. Many proposals exist for more dependability on CAN-based networks, but much work remains to be done to come up with a coherent and validated communication stack that includes all the necessary services.
Acknowledgment

We would like to thank Mr. Christophe Marchand, project leader in the field of diagnosis at PSA Peugeot Citroën, for helpful comments on an earlier version of this chapter.
References

[1] Y. Papadopoulos and J.A. McDermid. The potential for a generic approach to certification of safety-critical systems in the transportation sector. Journal of Reliability Engineering and System Safety, 63: 47–66, 1999.
[2] Radio Technical Commission for Aeronautics. RTCA DO-178B — software considerations in airborne systems and equipment certification, 1994.
[3] CENELEC. Railway applications — software for railway control and protection systems, EN50128, 2001.
[4] P.H. Jesty, K.M. Hobley, R. Evans, and I. Kendall. Safety analysis of vehicle-based systems. In Proceedings of the 8th Safety-Critical Systems Symposium, Southampton, UK, 2000.
[5] IEC. IEC61508-1, Functional Safety of Electrical/Electronic/Programmable Safety-Related Systems — Part 1: General Requirements, IEC/SC65A, 1998.
[6] J.A. McDermid. Trends in system safety: a European view? In Proceedings of the 7th Australian Workshop on Safety Critical Systems and Software, North Adelaide, Australia, 2002.
[7] A. Avizienis, J. Laprie, and B. Randell. Fundamental concepts of dependability. In Proceedings of the 3rd Information Survivability Workshop, Boston, USA, 2000, pp. 7–12.
[8] ARTIST, Project IST-2001-34820. Selected topics in embedded systems design: roadmaps for research, May 2004. Available at http://www.artist-embedded.org/Roadmaps/ARTIST_Roadmaps_Y2.pdf.
[9] J. Ferreira, P. Pedreiras, L. Almeida, and J.A. Fonseca. The FTT-CAN protocol for flexibility in safety-critical systems. IEEE Micro, Special Issue on Critical Embedded Automotive Networks, 22: 46–55, 2002.
[10] OSEK Consortium. OSEK/VDX Fault-Tolerant Communication, Version 1.0, July 2001. Available at http://www.osek-vdx.org/.
[11] C. Tanzer, S. Poledna, E. Dilger, and T. Fuhrer. A fault-tolerance layer for distributed fault-tolerant hard real-time systems. In Proceedings of the Annual IEEE Workshop on Fault-Tolerant Parallel and Distributed Systems, San Juan, Puerto Rico, USA, 1999.
[12] H. Kopetz and G. Bauer. The time-triggered architecture. Proceedings of the IEEE, 91: 112–126, 2003.
[13] I.E. Noble. EMC and the automotive industry. Electronics and Communication Engineering Journal, 4(5): 263–271, 1992.
[14] E. Zanoni and P. Pavan. Improving the reliability and safety of automotive electronics. IEEE Micro, 13: 30–48, 1993.
[15] J. Barrenscheen and G. Otte. Analysis of the physical CAN bus layer. In Proceedings of the 4th International CAN Conference, ICC'97, Berlin, Germany, October 1997, pp. 06.02–06.08.
[16] J. Rushby. A comparison of bus architectures for safety-critical embedded systems. Technical report, NASA/CR, March 2003.
[17] L.-B. Fredriksson. CAN for critical embedded automotive networks. IEEE Micro, Special Issue on Critical Embedded Automotive Networks, 22: 28–35, 2002.
[18] A. Albert. Comparison of event-triggered and time-triggered concepts with regards to distributed control systems. In Proceedings of Embedded World 2004, Nürnberg, February 2004.
[19] X-by-Wire Project, Brite-EuRam III Program. X-By-Wire — safety related fault tolerant systems in vehicles, final report, 1998.
[20] C. Wilwert, Y.Q. Song, F. Simonot-Lion, and T. Clément. Evaluating quality of service and behavioral reliability of steer-by-wire systems. In Proceedings of the 9th IEEE International Conference on Emerging Technologies and Factory Automation (ETFA), Lisbon, Portugal, 2003.
[21] J. Rushby. An overview of formal verification for the time-triggered architecture. In Proceedings of Formal Techniques in Real-Time and Fault-Tolerant Systems, Oldenburg, Germany, 2002, pp. 83–105.
[22] T.D. Chandra and S. Toueg. Unreliable failure detectors for reliable distributed systems. Journal of the ACM, 43: 225–267, 1996.
[23] K. Tindell and H. Hansson. Babbling idiots, the dual-priority protocol, and smart CAN controllers. In Proceedings of the 2nd International CAN Conference, London, UK, 1995, pp. 7.22–7.28.
[24] C. Temple.
Avoiding the babbling-idiot failure in a time-triggered communication system. In Proceedings of the 28th International Symposium on Fault-Tolerant Computing, Munich, Germany, June 1998.
[25] H. Pfeifer. Formal verification of the TTP group membership algorithm. In Proceedings of FORTE/PSTV 2000, Pisa, Italy, 2000.
[26] H. Pfeifer and F.W. von Henke. Formal analysis for dependability properties: the time-triggered architecture example. In Proceedings of the 8th IEEE International Conference on Emerging Technologies and Factory Automation (ETFA 2001), Antibes, France, October 2001, pp. 343–352.
[27] F. Brasileiro, P. Ezhilchelvan, S. Shrivastava, N. Speirs, and S. Tao. Implementing fail-silent nodes for distributed systems. IEEE Transactions on Computers, 45: 1226–1238, 1996.
[28] M. Hiller. Software fault-tolerance techniques from a real-time systems point of view — an overview. Technical report, Chalmers University of Technology, Göteborg, Sweden, November 1998.
[29] R. Santos Marques, N. Navet, and F. Simonot-Lion. Frame packing under real-time constraints. In Proceedings of the 5th IFAC International Conference on Fieldbus Systems and their Applications — FeT'2003, Aveiro, Portugal, July 2003, pp. 185–192.
[30] R. Saket and N. Navet. Frame packing algorithms for automotive applications. Technical report RR-4998, INRIA, 2003. Available at http://www.inria.fr/rrrt/rr-4998.html.
[31] H. Kopetz, R. Nossal, R. Hexel, A. Krüger, D. Millinger, R. Pallierer, C. Temple, and M. Krug. Mode handling in the time-triggered architecture. Control Engineering Practice, 6: 61–66, 1998.
[32] TTTech Computertechnik GmbH. Time-Triggered Protocol TTP/C, High-Level Specification Document, Protocol Version 1.1, November 2003. Available at http://www.tttech.com.
[33] FlexRay Consortium. FlexRay Communication System, Protocol Specification, Version 2.0, June 2004. Available at http://www.flexray.com.
[34] International Standard Organization. ISO 11519-2, Road Vehicles — Low Speed Serial Data Communication — Part 2: Low Speed Controller Area Network, ISO, 1994.
[35] H. Kopetz. Real-Time Systems: Design Principles for Distributed Embedded Applications. Kluwer Academic Publishers, Dordrecht, 1997.
[36] G. Bauer and M. Paulitsch. An investigation of membership and clique avoidance in TTP/C. In Proceedings of the 19th IEEE Symposium on Reliable Distributed Systems, Nürnberg, Germany, 2000.
[37] FlexRay Consortium. FlexRay Communication System, Electrical Physical Layer, Version 2.0, June 2004. Available at http://www.flexray.com.
[38] International Standard Organization. ISO 11898, Road Vehicles — Interchange of Digital Information — Controller Area Network for High-Speed Communication, ISO, 1994.
[39] J. Rufino, P. Veríssimo, G. Arroz, C. Almeida, and L. Rodrigues. Fault-tolerant broadcasts in CAN. In Proceedings of the 28th International Symposium on Fault-Tolerant Computing Systems, IEEE, Munich, Germany, June 1998, pp. 150–159.
[40] International Standard Organization. ISO 11898-4, Road Vehicles — Controller Area Network (CAN) — Part 4: Time-Triggered Communication, ISO, 2000.
[41] G. Lima and A. Burns. Timing-independent safety on top of CAN. In Proceedings of the 1st International Workshop on Real-Time LANs in the Internet Age, Vienna, Austria, 2002.
[42] G. Lima and A. Burns. A consensus protocol for CAN-based systems. In Proceedings of the 24th Real-Time Systems Symposium, Cancun, Mexico, 2003, pp. 420–429.
[43] G. Rodriguez-Navas, M. Barranco, and J. Proenza. Harmonizing dependability and real time in CAN networks. In Proceedings of the 15th Euromicro Conference on Real-Time Systems, Porto, Portugal, 2003.
[44] J. Ferreira, L. Almeida, J. Fonseca, G. Rodriguez-Navas, and J. Proenza. Enforcing consistency of communication requirements updates in FTT-CAN.
In Proceedings of the 22nd Symposium on Reliable Distributed Systems, Florence, Italy, 2003.
[45] G. Rodriguez-Navas and J. Proenza. Clock synchronization in CAN distributed embedded systems. In Proceedings of the 3rd International Workshop on Real-Time Networks, Catania, Italy, 2004.
[46] M. Barranco, G. Rodriguez-Navas, J. Proenza, and L. Almeida. CANcentrate: an active star topology for CAN networks. In Proceedings of the 5th International Workshop on Factory Communication Systems, Vienna, Austria, 2004.
[47] H. Sivencrona, T. Olsson, R. Johansson, and J. Torin. RedCAN: simulations of two fault recovery algorithms for CAN. In Proceedings of the 10th IEEE Pacific Rim International Symposium on Dependable Computing, Papeete, French Polynesia, 2004, pp. 302–311.
[48] L.M. Pinho and F. Vasques. Reliable real-time communication in CAN networks. IEEE Transactions on Computers, 52: 1594–1607, 2003.
[49] Robert Bosch GmbH. Time Triggered Communication on CAN. Available at http://www.can.bosch.com/content/TT_CAN.html, 2004.
[50] B. Gaujal and N. Navet. Fault confinement mechanisms on CAN: analysis and improvements. IEEE Transactions on Vehicular Technology, 54(5), 2004. Accepted for publication. Preliminary version available as INRIA Research Report at http://www.inria.fr/rrrt/rr-4603.html.
43
Volcano — Enabling Correctness by Design

Antal Rajnák
Volcano Communications Technologies AG

43.1 Introduction
43.2 Volcano Concepts
    Volcano Signals and the Publish/Subscribe Model • Frames • Network Interfaces • The Volcano API • Timing Model • Capture of Timing Constraints
43.3 Volcano Network Architect
    The Car OEM Tool Chain — One Example • VNA — Tool Overview
43.4 Volcano Software in an ECU
    Volcano Configuration • Workflow
Acknowledgments
References
More Information
43.1 Introduction

Volcano is a holistic concept defining a protocol-independent design methodology for distributed real-time networks in vehicles. The concept deals with both technical and nontechnical entities (i.e., the partitioning of responsibilities into well-defined roles in the development process). The vision of Volcano is "Enabling Correctness by Design." By taking a strict systems engineering approach and focusing resources on design, a majority of system-related issues can be identified and solved early in a project. Quality is designed into the vehicle, not tested out. Minimized cost, increased quality, and a high degree of configuration/reconfiguration flexibility are the trademarks of the Volcano concept. The Volcano approach is particularly beneficial as the complexity of vehicles is increasing very rapidly and as projects will have to cope with new functions and requirements throughout their lifetime.

A unique feature of the Volcano concept is the solution called "post-compile-time reconfiguration flexibility," where the network configuration containing the signal-to-frame mapping, ID assignment, and frame periods is located in a configurable flash area of the Electronic Control Unit (ECU) and can be changed without touching the application software — thus eliminating the need for re-validation and saving cost and lead time.

The origin of the concepts can be traced back to a project at Volvo Car Corporation during 1994 to 1998, when the development of Volvo's new "large platform" [3] took place. Volcano reuses solid industrial experience and takes into account recent findings from real-time research (Figure 43.1) [2].
FIGURE 43.1 The main networks of the Volvo S80 [4]: a high-speed CAN (250 kbit) and a low-speed CAN (125 kbit) interconnecting the vehicle's ECUs.
The concept is characterized by three important features:

• The ability to guarantee the real-time performance of the network already at the design stage, thus significantly reducing the need for testing.
• Built-in flexibility, enabling the vehicle manufacturer to upgrade the network in the preproduction phase of a project as well as in the aftermarket.
• Efficient use of the available resources.

The actual implementation of the concept consists of two major parts:

• The offline tool-set for requirement capturing and automated network design (covering multiple protocols and gateway configuration). It provides strong administrative functions for variant and version handling, which are needed during the complete life cycle of a car project.
• The target part, represented by a highly efficient and portable embedded software package, which offers a signal-based API, handles multiple protocols, integrated gateway functionality, and post-compile-time reconfiguration capability, together with a PC-based generation tool.

Even though the implementation originally supported the Controller Area Network (CAN) and Volcano lite (a low-speed, SCI-based proprietary master-slave protocol used by Volvo) protocols, it has successfully been extended to fit other emerging network protocols as well. LIN was added first, followed by the FlexRay and MOST protocols. The philosophy behind this is that communication has to be managed in one single development environment, covering all the protocols used, in order to ensure end-to-end timing predictability, while still providing the necessary architectural freedom to choose the most economical solution for the task.

Over the last 40 years, the computing industry has discovered that certain techniques are needed in order to manage complex software systems. Two of these techniques are abstraction (where unnecessary information is hidden) and composability (if software components proven to be correct are combined, then the resulting system will be correct as well). Volcano makes heavy use of both these techniques.

The automotive industry is implementing an increasing number of software functions. The introduction of protocols such as MOST for multimedia and FlexRay for active chassis systems results in highly complex electrical architectures. Finally, all these complex subnetworks are linked through gateways. The behavior of the entire car network has a crucial influence on the car's performance and reliability. Managing software development that involves many suppliers, hundreds of thousands of lines of code, and thousands of signals requires a structured systems engineering approach. Inherent in the concept of systems engineering is a clear partitioning of the architecture, requirements, and responsibilities.
A modern vehicle includes a number of microprocessor-based components, called Electronic Control Units (ECUs), provided by a variety of suppliers. The Controller Area Network (CAN) provides an industry-standard solution for connecting ECUs together using a single broadcast bus. A shared broadcast bus makes it much easier to add desired functionality: ECUs can be added easily, and they can communicate data easily and cheaply (adding a function may be "just software"). But increased functionality leads to more software and greater complexity.

Testing a module for conformance to timing requirements is the most difficult of the problems. With a shared broadcast bus, the timing performance of the bus might not be known until all the modules are delivered and the bus usage of each is known. Only then can testing for timing conformance begin (which is often too far into the development of a vehicle to find and correct major timing errors). The supplier of a module can only do limited testing for timing conformance, since they do not have a complete picture of the final load placed on the bus. This is particularly important when dealing with the CAN bus: arrivals of frames from the bus may cause interrupts on a module wishing to receive the frames, and so the load on the microprocessor in the ECU is partially dependent on the bus load.

It is often thought that CAN is somehow unpredictable and that the latencies of lower-priority frames in the network are unbounded. This is untrue; in fact, CAN is a highly predictable communications protocol. Furthermore, CAN is well suited to handling large amounts of traffic with differing time constraints. However, with CAN there are a few particular problems:

• The distribution of identifiers. CAN uses identifiers for two purposes: distinguishing different messages on the bus, and assigning relative priorities to those messages — the latter being often neglected.
• Limited bandwidth. This is due to the low maximum signaling speed of 1 Mbit/sec, further reduced by a significant protocol overhead.

Volcano was designed to provide abstraction, composability, and an identifier distribution reflecting true urgencies, while at the same time providing the most efficient utilization of the protocol.
43.2 Volcano Concepts

The Volcano concept is founded on the ability to guarantee the worst-case latencies of all frames sent in a multiprotocol network system. This is a key step because it gives the following:

• A way of guaranteeing that there are no communications-related timing problems.
• A way of maximizing the amount of information carried on the bus. The latter is important for reduced production costs.
• The possibility of developing highly automated tools for the design of optimal network configurations.

The timing guarantee for CAN is provided by a mathematical analysis developed from academic research [1] (a sketch of the computation is given at the end of this discussion). Other protocols, such as FlexRay, are predictable by design. For this reason, some of the subjects discussed below are CAN-specific, while others are independent of the protocol used.

The analysis is able to calculate the worst-case latency for each frame sent on the bus. This latency is the longest time from placing a frame in a CAN controller at the sending side to the time the frame is correctly received at all receivers. The analysis needs to make several assumptions about how the bus is used. One of these assumptions is that there is a limited set of frames that can access the bus, and that the time-related attributes of these frames are known (e.g., frame size, frame periodicity, queuing jitter, and so on). Another important assumption is that the CAN hardware can be driven correctly:

• The internal message queue within any CAN controller in the system is organized (or can be used) such that the highest-priority message will be sent out first if more than one message is ready
to be sent. (Hardware arbitration based on slot position is acceptable as long as the number of frames sent is less than the number of transmit slots available in the CAN controller.)
• The CAN controller should be able to send out a stream of scheduled messages without releasing the bus in the interframe space between two messages. Such devices will arbitrate for the bus right after sending the previous message and will only release the bus in case of lost arbitration.

A third important assumption is the error model: the analysis can account for retransmissions due to errors on the bus, but requires a model of the number of errors in a given time interval.

The Volcano software running in each ECU controls the CAN hardware and accesses the bus so that all these assumptions are met, allowing the application software to rely on all communications taking place on time. This means that integration testing at the automotive manufacturer can concentrate on functional testing of the application software. Another important benefit is that a large amount of communications protocol overhead can be avoided. Examples of how protocol overheads are reduced by obtaining timing guarantees are:

• There is no need to provide frame acknowledgment within the communications layer, dramatically reducing bus traffic. The only case where an ECU can fail to receive a frame via CAN is if the ECU is "off the bus," a serious fault that is detected and handled by network management and on-board diagnostics.
• Retransmissions are unnecessary. The system-level timing analysis guarantees that a frame will arrive on time. Timeouts only happen after a fault, which can be detected and handled by network management and/or the on-board diagnostics.

A Volcano system never suffers from intermittent overruns during correct operation because of the timing guarantees, and therefore achieves these efficiency gains.
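As promised above, the following sketch shows the kind of worst-case latency computation such an analysis performs. It is a simplified form of the classical CAN response-time equations from the research cited as Reference 1; the error-model terms are omitted, and all times are expressed in bit-times:

#include <stdint.h>

typedef struct {
    uint32_t C;  /* worst-case transmission time of the frame */
    uint32_t T;  /* period */
    uint32_t J;  /* queuing jitter */
} frame_t;

static uint32_t ceil_div(uint32_t a, uint32_t b)
{
    return (a + b - 1u) / b;
}

/* Worst-case response time of frame i, with frames sorted by priority
   (index 0 = highest) and B the transmission time of the longest
   lower-priority frame (blocking). The fixed-point iteration converges
   provided the bus utilization is below 100%. */
uint32_t response_time(const frame_t *f, int i, uint32_t B)
{
    uint32_t w = B, prev;
    do {
        prev = w;
        w = B;
        for (int j = 0; j < i; j++)  /* interference from higher-priority frames */
            w += ceil_div(prev + f[j].J + 1u, f[j].T) * f[j].C;
    } while (w != prev);
    return f[i].J + w + f[i].C;      /* latency = jitter + queuing delay + transmission */
}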
43.2.1 Volcano Signals and the Publish/Subscribe Model

The Volcano system provides "signals" as the basic communication object. Signals are small data items that are sent between ECUs. The "publish/subscribe" model is used for defining signaling needs. For a given ECU there is a set of signals that are "published" (i.e., made available to the system integrator), and a number of "subscribed" signals (i.e., signals that are required as inputs to the ECU). The signal model is provided directly to the programmer of the ECU application software, and the Volcano software running in each ECU is responsible for the translation between signals and CAN frames. An important design requirement for the Volcano software was that the application-programmer is unaware of the bus behavior: all the details of the network are hidden, and the programmer only deals with signals through a simple API. This is crucial, because a major problem with alternative techniques is that the application software makes assumptions about the CAN behavior and, therefore, changing the bus behavior becomes difficult.

In Volcano there are three types of signals:

• Integer signals. These represent unsigned numbers and are of a static size between 1 and 16 bits. So, for example, a 16-bit signal can store integers in the range 0 to 65,535.
• Boolean signals. These represent truth conditions (true/false). Note that this is not the same as a 1-bit integer signal (which stores the integer values 0 or 1).
• Byte signals. These represent data with no Volcano-defined structure. A byte signal consists of a fixed number of between 1 and 8 bytes.

The advantage of Boolean and integer signals is that the values of a signal are independent of the processor architecture (i.e., the values of the signals are consistent regardless of the "endian-ness" of the microprocessors in each ECU).
For published signals, Volcano internally stores the value of each signal and, in the case of periodic signals, sends them to the network according to a pattern defined offline by the system integrator. The system integrator also defines the initial value of a signal. The value of a signal persists until updated by the application program via a "write" call, or until Volcano is reinitialized.

For subscribed signals, Volcano internally stores the current value of each signal. The system integrator also defines the initial value of a signal. The value of a subscribed signal persists until:

• It is updated by receiving a new value from the network
• Volcano is reinitialized
• A signal refresh timeout occurs and the value is replaced by a substitute value defined by the application-programmer

In the case where new signal values are received from the network, these values will not be reflected in the values of subscribed signals until a Volcano "input" call is made. A published signal value is updated via a "write" call. The latest value of a subscribed signal is obtained via a "read" call. A "write" call for a subscribed signal is not permitted. The last-written value of a published signal may be obtained via a "read" call.

43.2.1.1 Update Bits

The Volcano concept permits the placement of several signals with different update rates into the same frame. It provides a special mechanism, named the "update bit," to indicate which signals within the frame have actually been updated: that is, for which signals the producing ECU wrote a fresh value since the last time the frame was transmitted. The Volcano software on an ECU transmitting a signal automatically clears the update bit when the signal has been sent. This ensures that a Volcano-based ECU on the receiving side will know each time the signal has been updated (the application can see this update bit by using flags tied to an update bit; see below). Using update bits to their full extent requires that the underlying protocol be "secure" (frames cannot be lost without being detected). The CAN protocol is regarded as such, but not the LIN protocol. Therefore, the update bit mechanism is limited to CAN within Volcano.

43.2.1.2 Flags

A flag is a Volcano object purely local to an ECU. It is bound to one of two things:

• The update bit of a received Volcano signal; the flag is set when the update bit is set.
• The containing frame of a signal; the flag is set when the frame containing the signal is received (regardless of whether an update bit for the signal is set).

Many flags can be bound to each update bit, or to the reception of a containing frame. Volcano sets all the flags bound to an object when the occurrence is seen. The flags are cleared explicitly by the application software.

43.2.1.3 Timeouts

A timeout is, like the flags, a Volcano object purely local to an ECU. A timeout is declared by the application-programmer and is bound to a subscribed signal. A timeout condition occurs when the particular signal was not received within the given time limit. In this case, the signal (and/or a number of other signals) is set to a substitute value specified as part of the declaration of the timeout. As with the flags, the timeout reset mechanism can be bound to either:

• The update bit of a received Volcano signal.
• The frame carrying a specific signal.
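To illustrate how an application might use these objects together, here is a hedged sketch. The accessor names below (v_read_..., v_write_..., and the flag functions) are invented for the example and do not reproduce the real Volcano API:

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical accessors for one subscribed and one published signal,
   plus a flag bound to the subscribed signal's update bit. */
extern uint16_t v_read_vehicle_speed(void);         /* subscribed, 16-bit integer signal */
extern void     v_write_wiper_cmd(uint8_t value);   /* published signal */
extern bool     v_flag_speed_updated(void);         /* flag bound to the update bit */
extern void     v_clear_flag_speed_updated(void);   /* flags are cleared by the application */

/* Runs between the Volcano input and output calls: reacts only when a
   fresh value of the signal has actually been received. */
void wiper_logic(void)
{
    if (v_flag_speed_updated()) {
        uint16_t speed = v_read_vehicle_speed();
        v_write_wiper_cmd(speed > 300u ? 2u : 1u);  /* faster wiping at high speed */
        v_clear_flag_speed_updated();
    }
}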
43.2.2 Frames

A frame is a container capable of carrying a certain amount of data (0 to 8 bytes for CAN and LIN). Several signals can be packed into the available data space and transmitted together in one frame on the
network. The total size of a frame is determined by the protocol. A frame can be transmitted periodically or sporadically. Each frame is assigned a unique "identifier." In the CAN case, the identifier serves two purposes:

• Identifying and filtering a frame on reception at an ECU.
• Assigning a priority to the frame.

43.2.2.1 Immediate Frames

Volcano normally hides the existence of network frames from the application designer. However, in certain cases there is a need to send and receive frames with very short processing latencies. In these cases direct application support is required. Such frames are designated immediate frames. There are two Volcano calls to handle immediate frames:

• A "transmit" call, which immediately sends the designated frame to the network.
• A "receive" call, which immediately processes the designated incoming frame if that frame is pending.

There is also a "read update bit" call to test the update bit of a subscribed signal within an immediate frame. The signals packed into an immediate frame can be accessed with the normal "read" and "write" function calls, in the same way as all other signals. The application-programmer is responsible for ensuring that the "transmit" call is made only when the signal values of the published signals are consistent.

43.2.2.2 Frame Modes

Volcano allows different frame modes to be specified for an ECU. A frame mode is a description of an ECU working mode in which a given set of frames (signals) is active (for input and output). A frame can be active in one or many frame modes. The timing properties of a frame do not have to be the same in the different frame modes that support it.
43.2.3 Network Interfaces

A network interface is the device used to send and receive frames to and from networks. A network interface connects a given ECU to a network. In the CAN case, more than one network interface (CAN controller) on the same ECU may be connected to the same network. Likewise, an ECU may be connected to more than one network. Network interfaces in Volcano are protocol-specific. The protocols currently supported are CAN and LIN — FlexRay and MOST are under implementation. A network interface is managed by a standard set of Volcano calls. These allow the interface to be initialized or reinitialized, connected to the network (i.e., begin operating the defined protocol), or disconnected from the network (i.e., take no further part in the defined protocol). There is also a Volcano call to return the status of the interface.
43.2.4 The Volcano API

The Volcano API provides a set of simple calls to manipulate signals and to control the CAN/LIN controllers. There are also calls to control Volcano's sending to, and receiving from, networks. To manipulate signals there are "read" and "write" calls. A "read" call returns to the caller the latest value of a signal; a "write" call sets the value of a signal. The "read" and "write" calls are the same regardless of the underlying network type.

43.2.4.1 Volcano Thread-of-Control

There are two Volcano calls that must be called at the same fixed rate: v_input() and v_output() (a sketch of a typical calling sequence is given at the end of this section). If the v_gateway() function is used, the same calling rate shall be used as for the v_input() and v_output()
functions. The v_output() call places the frames into the appropriate controllers. The v_input() call takes received frames and makes the signal values available to "read" calls. The v_gateway() call copies values of signals in frames received from the network to values of signals in frames sent to the network. The v_sb_tick() call handles transmitting and receiving frames for sub-buses.
Volcano also provides a very low latency communication mechanism in the form of the immediate frame API. This is a "view" of frames on the network, which allows transmission and reception from/to the Volcano domain without the normal Volcano input/output latencies, or mutual exclusion requirements with the v_input() and v_output() calls. There are two communication calls in the immediate signal API: v_imf_rx() and v_imf_tx(). The v_imf_tx() call copies values of immediate signals into a frame and places the frame in the appropriate CAN controller for transmission. The v_imf_rx() call takes a received frame containing immediate signals and makes the signal values available to "read" calls. A third call, v_imf_queued(), allows the user to see if an immediate frame has really been sent on the network.
The controller calls allow the application to initialize controllers, connect to and disconnect from networks, and place the controllers into "sleep" mode, among other operations.

43.2.4.2 Volcano Resource Information
The ambition of the Volcano concept is to provide a fully predictable communications solution. In order to achieve this, the resource usage of the Volcano embedded part has to be determined. Resources of special interest are memory and execution time.

43.2.4.2.1 Execution Time of Volcano Processing Calls
In order to bound processing time, a "budget" for the v_input() call — that is, the maximum number of frames that will be processed by a single call to v_input() — has to be established. A corresponding budget applies to transmitted frames as well.
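A periodic Volcano processing task can then be sketched as follows, assuming the hosting OS invokes it at the fixed rate. Only v_input(), v_output(), and v_gateway() are documented names; their parameterless prototypes, the signal access calls, and compute_lamp() are invented stand-ins for the unnamed "read" and "write" calls and for application code.

/* Sketch of a Volcano processing task called at a fixed rate (e.g., every
   5 msec). The signal access names below are hypothetical. */
extern void v_input(void);
extern void v_output(void);
extern void v_gateway(void);
extern unsigned int v_read_vehicle_speed(void);        /* "read" call  */
extern void v_write_lamp_request(unsigned int value);  /* "write" call */
extern unsigned int compute_lamp(unsigned int speed);  /* application  */

void volcano_periodic_task(void)
{
    unsigned int speed;

    v_input();                  /* unpack received frames; new signal values
                                   become available to "read" calls          */
    speed = v_read_vehicle_speed();
    v_write_lamp_request(compute_lamp(speed));

    v_gateway();                /* only if gatewaying is used; must run at
                                   the same rate as v_input() and v_output() */
    v_output();                 /* pack signals and place the frames into
                                   the appropriate controllers               */
}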
43.2.5 Timing Model
The Volcano timing model covers end-to-end timing (i.e., from button press to activation). The timing model sets in context the signal timing information needed to analyze a network configuration of signals and frames. This section defines the information that must be provided by an application-programmer in order to guarantee the end-to-end timing requirements. A Volcano signal is transported over a network within a frame. Figure 43.2 identifies six time points between the generation and consumption of a signal value.
FIGURE 43.2 The Volcano timing model. [Figure: a timeline with six numbered time points — (1) notional generation, (2) first v_output at which the new value is available, (3) frame enters arbitration, (4) transmission completed, (5) first v_input at which the signal is available, (6) notional consumption — separated by the intervals TPL, TBT, TT, TAT, and TSL, with max_age spanning the whole timeline.]
The six time points are:
1. Notional generation (signal generated) — either by hardware (e.g., switch pressed) or software (e.g., timeout signaled). The user can define this point to best reflect their system.
2. First v_output() (or v_imf_tx() for an immediate frame) at which a new value is available. This is the first such call after the signal value is written by a "write" call.
3. The frame containing the signal is first entered for transmission (arbitration on a CAN bus).
4. Transmission of the frame completes successfully (i.e., the subscriber's communication controller receives the frame from the network).
5. v_input() (or v_imf_rx() for an immediate frame) makes the signal available to the application.
6. Notional consumption — the user application consumes the data. The user can define this point to best reflect their system.
The max_age of the signal is the maximum age, measured from notional generation, at which it is acceptable for notional consumption. The max_age is the overall timing requirement on a signal.
TPL (publish latency) is the time from notional generation to the first v_output() call at which the signal value is available to Volcano (a "write" call has been made). It depends on the properties of the publishing application. Typical values might be the frame_processing_period (if the signal is written fresh at every period but this is not synchronized with v_output()), the offset between the write call and v_output() (if the two are synchronized), or the sum of the frame_processing_period and the period of some lower rate activity that generates the value. This value must be given by the application-programmer.
TSL (subscribe latency) is the time from the first v_input() that makes the new value available to the application to the time when the value is "consumed." The consumption of a signal is a user-defined event that depends on the properties of the subscribing function; for example, it can be a lamp being lit, or an actuator starting to move. This value must be given by the application-programmer.
The intervals TBT, TT, and TAT are controlled by the Volcano 5 configuration and are dependent upon the nature of the frame in which the signal is transported. The value TBT is the time before transmission (the time from the v_output() call until the frame enters arbitration on the bus). TBT is a per-frame value that depends on the type of frame carrying the signal (see later sections). This time is shared by all signals in the frame, and is common to all subscribers to those signals. The value TAT is the time after transmission (the time from when the frame has been successfully transmitted on the network until the next v_input() call). TAT is a per-frame value that may be different for each subscribing ECU. The value TT is the time required to transmit the frame (including the arbitration time) on the network.

43.2.5.1 Jitter
The application-programmer at the supplier must also provide information on the "jitter" to the systems integrator. This information is as follows: the input_jitter and output_jitter refer to the variability in the time taken to complete the v_input() and v_output() calls, measured relative to the occurrence of the periodic event causing Volcano processing to be done (i.e., calls to v_input(), v_gateway(), and v_output() to be made). Figure 43.3 shows how the output_jitter is measured. In the figure, E marks the earliest completion time of the v_output() call, and L marks the latest completion time, relative to the start of the cycle.
The output_jitter is therefore L − E. The input_jitter is measured according to the same principles. If a single-thread system is used, without interrupts, the calculation of the input_jitter and output_jitter is straightforward: the earliest time is the best-case execution time of all the calls in the cycle (including the v_output() call), and the latest time is the worst-case execution time of all the calls. The situation is more complex if interrupts can occur or the system consists of multiple tasks, since the latest time must take into account preemption from interrupts and other tasks.
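Putting the intervals of the timing model together yields a compact acceptance condition. It is implied by, though not explicitly stated in, the discussion above: the worst-case sum of the five intervals, measured from notional generation to notional consumption, must not exceed the signal's maximum acceptable age:

    TPL + TBT + TT + TAT + TSL ≤ max_age

The frame compiler described later in this chapter can be viewed as searching for a frame packing and identifier assignment under which this inequality holds for every subscribed signal.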
FIGURE 43.3 Measurement of output jitter. [Figure: within one frame processing period, the occurrence of the periodic event that initiates the Volcano processing calls is followed by execution of other computation and by the v_output() call; E and L mark the best-case and worst-case completion times of the v_output() call relative to the start of the cycle.]
43.2.6 Capture of Timing Constraints
The declaration of a signal in a Volcano fixed configuration file provides syntax to capture the following timing-related information:
• Whether a signal is state or state change (info_type)
• Whether a signal is sporadic or periodic (generation_type)
• The latency
• The min_interval
• The max_interval
• The max_age
The first two (together with whether the signal is published or subscribed to) provide signal properties that determine the "kind" of the signal. A state signal carries a value that completely describes the signaled property (e.g., the current position of a switch). A subscriber to such a signal need only observe the signal value when the information is required for the subscriber's purposes (i.e., signal values can be "missed" without affecting the usefulness of later values). A state change signal carries a value that must always be observed in order to be meaningful (e.g., distance traveled since the last signal value). A subscriber must observe every signal value. A sporadic signal is one that is written by the application in response to some event (e.g., a button press). A periodic signal is one that is written by the application at regular intervals.
The latency of a signal is the time from notional generation to being available to Volcano (for a published signal), or from being made available to the application by Volcano to notional consumption (for a subscribed signal). Note that immediate signals (those in immediate frames) include the time taken to move frames to/from the network in these latencies.
The min_interval has a different interpretation for published and for subscribed signals. For a published signal, it is the minimum time between any pair of write calls to the signal (this allows, e.g., the calculation of the maximum rate at which the signal could cause a sporadic frame carrying it to be transmitted). For a subscribed signal, it is the minimum acceptable time between arrivals of the signal. This is optional: it is intended to be used if the processing associated with the signal is triggered by the arrival of a new value, rather than periodic. In such a case, it provides a constraint that the signal should not be connected to a published signal with a faster rate.
The max_interval has a different interpretation for published and subscribed signals. For a published signal, the interesting timing information is already captured by min_interval and publish latency. For a subscribed signal it is the maximum interval between "notional consumptions" of the signal (i.e., it can be used to determine that signal values are sampled quickly enough so that none will be missed). The max_age of a signal is the maximum acceptable age of a signal at notional consumption, measured from notional generation. This value is meaningful for subscribed signals.
In addition to the signal timing properties described above, the Volcano fixed configuration file provides syntax to capture the following additional timing-related information:
The Volcano processing period. The Volcano processing period defines the nominal interval between successive v_input() calls on the ECU, and also between successive v_output() calls (i.e., the rates of the calls are the same, but v_input() and v_output() are not assumed to "become due" at the same instant). For example, if the Volcano processing period is 5 msec then each v_output() call "becomes due" 5 msec after the previous one became due.
The Volcano jitter time. The Volcano jitter defines the time by which the actual call may lag behind the time at which it became due. Note that "becomes due" refers to the start of the call, and jitter refers to completion of the call.
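The grammar of the fixed configuration file is not reproduced in this chapter. Purely as an illustration of how the properties above fit together, a signal declaration might look like the following hypothetical fragment (the syntax and keywords are invented for the example):

    signal vehicle_speed {
        info_type        state;      /* value completely describes the property */
        generation_type  periodic;
        latency          5 ms;       /* publish latency, TPL                    */
        min_interval     20 ms;      /* minimum time between "write" calls      */
        max_age          100 ms;     /* overall end-to-end requirement          */
    }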
43.3 Volcano Network Architect
To manage increasing complexity in electrical architectures, a structured development approach is believed essential to assure correctness by design. Volcano Automotive Group has developed a network design tool, Volcano Network Architect (VNA), to support a development process based on strict systems engineering principles. Gatewaying of signals between different networks is automatically handled by the VNA tool and the accompanying embedded software. The tool supports partitioning of responsibilities into different roles, such as system integrator and function owner. Third party tools may be used for functional modeling; these models can be imported into VNA.
Volcano Network Architect is the top-level tool in the Volcano Automotive Group's tool chain for designing vehicle network systems. The tool chain supports important aspects of systems engineering such as:
• Use of functional modeling tools.
• Partitioning of responsibilities.
• Abstracting away from hardware and protocol-specific details, providing a signals-based API for the application developer.
• Abstracting away from the network topology through automatic gatewaying between different networks.
• Automatic frame compilation to ensure that all declared requirements are fulfilled (if possible), that is, delivering correctness by design.
• Reconfiguration flexibility by supporting post-compile-time reconfiguration capability.
The VNA tool supports network design and makes management and maintenance of distributed network solutions more efficient. The tool supports capturing of requirements and then guides a user through all stages of network definition.
43.3.1 The Car OEM Tool Chain — One Example
Increasing competition and complex electrical architectures demand enhanced processes. Function modeling has proved to be a suitable tool to capture the functional needs in a vehicle. Tools such as Rational Rose provide a good foundation for capturing all the different functions, while other tools, such as Statemate and Simulink, model them in order to allocate objects and functionality in the vehicle. Networking is essential
since the functionality is distributed among a number of ECUs in the vehicle. Substantial parts of the outcome from the function modeling are highly suitable for use as input to a network design tool such as VNA.
The amount of information required to properly define the networks is vast. To support input of data, VNA provides an automated import from third party tools through an XML-based format. It is the job of the signal database administrator/system integrator to ensure that all data entered into the system are valid and internally consistent. VNA supports this task through a built-in multilevel consistency checker that verifies all data (Figure 43.4).

FIGURE 43.4 VNA screen.

In this particular approach the network is designed by the system integrator in close contact with the different function owners in order to capture all necessary signaling requirements — functional and nonfunctional (including timing). When the requirements are agreed and documented in VNA, the system integrator uses VNA to pack all signals into frames; this can be done manually or automatically. The algorithm used by VNA handles gatewaying by partitioning end-to-end timing requirements into requirements per network segment. All requirements are captured in the form of a Microsoft Word document called the Software Requirement Specification (SWRS) that is generated by VNA and sent to the different node owners as a draft copy to be signed off. When all SWRSs have been signed off, VNA automatically creates all necessary configuration files used in the vehicle along with a variety of files for third party analysis and measurement tools. The network level (global) configuration files are used as input to the Volcano Configuration Tool and Volcano Back-End Tool in order to generate a set of downloadable binary configuration files for each node.
The use of reconfigurable nodes makes the system very flexible, since the Volcano concept separates application dependent information and network dependent information. A change in the network by the system integrator can easily be applied to a vehicle without having to recompile the application software in the nodes. The connection between function modeling and VNA provides good support for iterative design. It verifies network consistency and timing up front, to ensure a predictable and deterministic network.
43.3.2 VNA — Tool Overview 43.3.2.1 Global Objects The workflow in VNA ensures that all relevant information about the network is captured. Global objects shall be created first, and then (re-)used in several projects. The VNA user works with objects of types such
as signals/nodes/interfaces, etc. These objects are used to build up the networks used in a car. Signals are defined by name and type, and can have logical or physical encoding information attached. Interfaces detailing hardware requirements are defined, leading to descriptions of actual nodes on a network. For each node, receive and transmit signals are defined, and timing requirements are provided for the signals. This information is intended for "global use," that is, across car variants, platforms, etc.

43.3.2.2 Project or Configuration Related Data (Projects, Configurations, Releases)
When all global data have been collected, the network is designed by connecting the interfaces in a desired configuration. VNA has strong project and variant handling. Different configurations can selectively use or adapt the global objects, for example, by removing a high-end feature from a low-end car model. This means that VNA can manage multiple configurations, designs, and releases, with version and variant handling. The release handling ensures that all components in a configuration are locked. It is, however, still possible to reuse the components in unchanged form. This makes it possible to go back to any released configuration at any point in time (Figure 43.5).

FIGURE 43.5 The database is a central part of the VNA system. In order to ensure highest possible performance, each instance of VNA accesses a local mirror of the database that is continuously synchronized with its parent.

43.3.2.3 Database
All data objects, both global and configuration-specific, are stored in a common database. The VNA tool was designed to have one common multiuser database per car OEM. In order to secure the highest possible performance, all complex and time-consuming VNA operations are performed against a local RAM mirror of the database. A specially designed database interface ensures consistency in the local mirror. Operations that are not time critical, such as database management, operate directly against the database. The built-in multiuser functionality allows multiple users to access all data stored in the database simultaneously. To ensure that a data object is not modified by more than one user, the object must be locked before any modification; read access is, of course, allowed for all users even while an object is locked for modification.
43.3.2.4 Version and Variant Handling
The VNA database implements functionality for variant and version handling. Most of the global data objects, for example, signals, functions, and nodes, may exist in different versions, but only one version of an object can be used in a specific project/configuration. The node objects can be seen as the main global objects, since hierarchically they include all other types of global objects. The node objects can exist in different variants, but only one object can be used from a variant folder in a particular project/configuration.

43.3.2.5 Consistency Checking
Extensive functionality for consistency checking is built into the VNA tool. The consistency check can be manually activated when needed, but also runs continuously to check user input and give immediate feedback on any suspected inconsistency. The consistency check ensures that the network design follows predefined rules and generates errors when appropriate.

43.3.2.6 Timing Analysis/Frame Compilation
The Volcano concept is based on a foundation of guaranteed message latency and a signals-based publish and subscribe model. This provides abstraction by hiding the network and protocol details, allowing the developer to work in the application domain with signals, functions, and related timing information. Much effort has been spent on developing and refining the timing analysis in VNA. The timing analysis is built upon a scheduling model called DMA (Deadline Monotonic Analysis), and calculates the worst-case latency for each frame among a defined set of frames sent on the bus. Parts of this functionality have been built into the consistency check routine as described above, but the real power of the VNA tool is found in the frame packer/frame compiler functionality. The frame packer/compiler attempts to create an optimal packing of signals into frames, then calculates the proper ID for every frame, ensuring that all the timing requirements captured earlier in the process are fulfilled (if possible). This automatic packing of multiple signals into each frame makes more efficient use of the data bus by amortizing some of the protocol overheads involved, thus lowering bus load. The combined effect of multiple signals per frame and perfect filtering results in a lower interrupt and CPU load, which means that the same performance can be obtained at lower cost. The frame packer can create the most optimal solution if all nodes are reconfigurable. To handle carry-over nodes that are not reconfigurable (ROM-based), these nodes and their associated frames can be classed as "fixed." Frame packing can also be performed manually if desired. Should changes to the design be required at a later time, the process allows rapid turnaround: the Frame Compiler is simply rerun and the configuration files regenerated.
The VNA tool can be used to design network solutions that are later realized by embedded software from any provider. However, the VNA tool is designed with the Volcano embedded software (VTP) in mind, which implements the expected behavior in the different nodes. To get the full benefits of the tool chain, VNA and VTP should be used together.

43.3.2.7 Volcano Filtering Algorithm
A crucial aspect of network configuration is how to choose identifiers so as to minimize the CPU load caused by handling interrupts for frames of no interest to the particular node: most CAN controllers have only limited filtering capabilities.
The Volcano filtering algorithm is designed to achieve this. An identifier is split into two parts: the priority bits and the filter bits. All frames on a network must have unique priority bits; for real-time performance the priority setting of a frame should reflect the relative urgency of the frame. The filter bits are used to determine if a CAN controller should accept or reject a frame. Each ECU that needs to receive frames by interrupts is assigned a single filter bit; the hardware filtering in the CAN controller is set to “must match 1” for the filter bit, and “don’t care” for all other bits.
FIGURE 43.6 A CAN identifier on an extended CAN network. The network clause has defined the CAN identifiers to have 7 priority bits and 13 filter bits. The least significant bit of the value corresponds with the bit of the identifier transmitted last. Only legal CAN identifiers can be specified: identifiers with the 7 most significant bits equal to 1 are illegal according to the CAN standard. [Figure: the 29 identifier bits numbered 28 down to 0, with the 7 most significant bits carrying the priority, the 13 least significant bits carrying the filter (target) mask, and the 9 bits in between unused (0).]
The filter bits of a frame are set for each ECU by which the frame needs to be seen. So a frame that is broadcast to all ECUs on the network is assigned filter bits all set to "1." For a frame sent to a single ECU on the network, just one filter bit is set. Figure 43.6 illustrates this; the frame shown is sent to four ECUs. If an ECU takes an interrupt for just the frames that it needs, then the filtering is said to be perfect. In some systems there may be more ECUs needing to receive frames by interrupt than there are filter bits in the network; in this case, some ECUs will need to share a bit. If this happens, then Volcano will filter the frames in software, using the priority bits to uniquely identify the frame and discarding unwanted frames.
The priority bits are the most significant bits. They indicate priority and uniquely identify a frame. The number of priority bits must be large enough to uniquely identify a frame in a given network configuration. The priority bits for a given frame are set by the relative urgency (or "deadline") of the frame. This is derived from how urgently each subscriber of a signal in the frame needs the signal (as described earlier). In most systems 5 to 10 priority bits are sufficient.
The filter bits are the remaining least significant bits and are used to indicate the destination ECUs for a given frame. This is done by treating them as a "target mask": each ECU (or group of ECUs) is assigned a single filter bit. The filtering for a CAN controller in the ECU is set up to accept only frames where the corresponding filter bit in the identifier is set. This can give "perfect filtering": an interrupt is raised if and only if the frame is needed by the ECU. Perfect filtering can dramatically reduce the CPU load compared with filtering in software. Indeed, perfect filtering is essential if the system integrator needs to connect ECUs with slow 8-bit CPUs to high-speed CAN networks (if filtering were implemented in software, the CPU would spend most of its available processing time handling interrupts and discarding unwanted frames). The filtering scheme also allows for broadcast of a frame to an arbitrary set of ECUs. This can reduce the traffic on the bus, since frames do not need to be transmitted several times to different destinations.
Because the system integrator is able to define the configuration data, and because that data defines the complete network behavior of an ECU, the in-vehicle networks are under the control of the system integrator.
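The identifier layout of Figure 43.6 is easy to express in code. The C sketch below assumes the 7/13 split of the figure (7 priority bits at the top of the 29-bit extended identifier, 13 filter bits at the bottom, the bits in between unused and zero); the function names are illustrative and not part of the Volcano API.

#include <stdint.h>

/* Build a 29-bit extended identifier from a priority value and a target
   mask holding one filter bit per destination ECU (illustrative only). */
static uint32_t make_identifier(uint32_t priority, uint32_t target_mask)
{
    return ((priority & 0x7Fu) << 22) | (target_mask & 0x1FFFu);
}

/* Perfect filtering: an ECU owning filter bit n accepts a frame if and
   only if bit n of the received identifier is set ("must match 1"). */
static int ecu_accepts(uint32_t identifier, unsigned int ecu_filter_bit)
{
    return (int)((identifier >> ecu_filter_bit) & 1u);
}

For example, a frame broadcast to the ECUs owning filter bits 1, 2, 5, and 11 would carry the target mask (1u << 1) | (1u << 2) | (1u << 5) | (1u << 11).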
43.3.2.8 Multiprotocol Support
The existing version of VNA supports the complementary, contemporary network protocols of CAN and LIN. The next version will also have support for the FlexRay protocol. A prototype version of VNA with partial MOST support is currently under construction. As network technology continues to advance into other protocols, VNA will also move to support these advances.

43.3.2.9 Gatewaying
A network normally consists of multiple network segments using different protocols. Signals may be transferred from one segment to another through a gateway node. As implemented throughout the whole tool chain of Volcano Automotive Group, gatewaying of data, even across multiple protocols, is automatically configured in VNA. In this way VNA allows any node to subscribe to any signal generated on any network without needing to know how this signal is gatewayed from the publishing node. Timing requirements over one or more gateways are also handled by VNA. The Volcano solution requires no special gatewaying hardware and therefore provides the most cost-efficient solution to signal gatewaying.

43.3.2.10 Data Export and Import
The VNA tool enables the OEMs to achieve close integration between VNA and functional modeling tools and to share data among different OEMs and subcontractors, for example, node developers. Support of emerging standards, such as FIBEX and XML, will further simplify information sharing and become a basis for configuration of third party communication layers.
43.4 Volcano Software in an ECU
The Volcano tool chain includes networking software running in each ECU in the system. This software uses the configuration data to control the transmission and reception of frames on one or more buses and presents signals to the application-programmer. One view of the Volcano network software is as a "communications engine" under the control of the system integrator. The view of the application-programmer is different: the software is a black box into which published signals are placed, and out of which subscribed signals can be summoned. The main implementation goals for the Volcano target software are as follows:
• Predictable real-time behavior: no data loss under any circumstances.
• Efficiency: low RAM usage, fast execution time, small code size.
• Portability: low cost of moving to a new platform.
43.4.1 Volcano Configuration Building a configuration is a key part of the Volcano concept. A configuration is, as already mentioned, based on details, such as how signals are mapped into frames, allocation of identifiers, and processing intervals. For each ECU, there are two authorities acting in the configuration process: the system integrator and the ECU supplier. The system integrator provides the Volcano configuration for the ECU regarding the network behavior at the system level, and the supplier provides the Volcano configuration data for the ECU in terms of the internal behavior. 43.4.1.1 The Configuration Files The Volcano configuration data is captured in four different types of files. These are: • Fixed information, which is agreed between the supplier and system integrator. • Private information, which is provided by the ECU supplier. The ECU supplier does not necessarily have to provide this information to the system integrator. • Network configuration information, which is supplied by the system integrator. • Target information, which is the supplier description of the ECU published to the system integrator.
43.4.1.1.1 Fixed Information
The fixed information is the most important in achieving a working system. It consists of a complete description of the dependencies between the ECU and the network. This includes a description of the signals the ECU needs from the network, how often Volcano calls will be executed, and so on. The information also includes a description of the CAN controller(s), and possible limitations regarding reception and transmission boundaries and supported frame modes. The fixed information forms a "contract" between the supplier and the system integrator: the information should not be changed without both parties being aware of the changes. The fixed information file is referred to as the "FIX" file.

43.4.1.1.2 Private Information
The private file contains additional information for Volcano which does not affect the network: timeout values associated with signals and the flags used by the application. The private information file is referred to as the "PRI" file.

43.4.1.1.3 Network Information
The network information specifies the network configuration of the ECU. The system integrator must define the number of frames sent from, and received by, the ECU, the frame identifier and length, and details of how the signals in the agreed information are mapped into these frames. Here, the vehicle manufacturer also defines the different frame modes used in the network. The network information file is referred to as the "NET" file.

43.4.1.1.4 Target Information
The target information contains information about the resources that the supplier has allocated to Volcano in the ECU. It describes the ECU's hardware (e.g., the CAN controllers used and where those are mapped in memory). The target information file is referred to as the "TGT" file.
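The concrete syntax of these files is not shown in this chapter. Purely for illustration, a fragment of a network ("NET") file defining one transmitted frame might look like the following hypothetical example (all keywords, names, and values are invented):

    frame engine_status {
        identifier  0x2C4;             /* CAN identifier chosen by the frame compiler */
        length      4;                 /* frame length in bytes                       */
        map vehicle_speed at byte 0;   /* packing of agreed signals into the frame    */
        map engine_temp   at byte 2;
    }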
43.4.2 Workflow
The Volcano system identifies two major roles in the development of a network of ECUs: the application designer (which may include the designer of the ECU system or the application-programmer) and the system integrator. The application designer is typically located at the organization developing the ECU hardware and application software. The system integrator is typically located at the vehicle manufacturer. The interface between the application designer and the system integrator is carefully controlled, and the information owned by each side is strictly defined. The Volcano tool chain implementation clearly reflects this partitioning of roles.
The Volcano system includes a number of tools to help the system integrator in defining a network configuration. The "Network Architect" is a high-level design tool, with a database containing all the publish/subscribe information available for each ECU, as described in the previous sections. After mapping the signaling needs onto a particular network architecture, thus defining the connections between the published and subscribed signals, an automatic Frame Compiler will be run. The "Frame Compiler" tool uses the requirements captured earlier to build a configuration that meets those requirements. There are many possibilities to optimize the bus behavior. The frame compiler includes the CAN bus timing analysis and LIN schedule table generation, and will not generate a configuration that violates the timing requirements placed on the system. The frame compiler also uses the analysis to answer "what if?" questions and guide the user in building a valid and optimized network configuration.
The output of the frame compiler is used to build configuration data specific to each ECU. This is used by the Volcano target software in the ECU to properly configure and use the hardware resources. The Volcano configuration data generator tool set (V5CFG/V5BND) is used to translate this ASCII text information to executable binary code in the following way:
• When the supplier executes the tool, it reads the FIX, PRI, and TGT files to generate compile-time data files. These data files are compiled and linked with the application program together with the Volcano library supplied for the specific ECU system.
• When the vehicle manufacturer executes the tool, it reads the FIX, NET, and TGT files to generate the binary data that is to be located in the ECU's Volcano configuration memory (known as the "Volcano NVRAM"). An ECU is then configured (or reconfigured) by downloading the binary data to the ECU's memory.
Note: it is vital to realize that changes to either the FIX or the TGT file cannot be made without coordination between the system integrator and the ECU supplier. The vehicle manufacturer can, however, change the NET file without informing the ECU supplier. In the same way, the ECU supplier can change the PRI file without informing the system integrator. Figure 43.7 shows how the Volcano Target Code for an ECU is configured by the supplier and the system integrator.
FIGURE 43.7 Volcano Target Code configuration process. [Figure: from the agreed information, the ECU supplier's "private," "target," and "fixed" files pass through the V5CFG configuration tool and V5BND target code generation to produce compile-time data, compiled with the application program and the Volcano 5 target library into ECU program code (ROM/FLASH EEPROM); the vehicle manufacturer's "fixed" and "network" files pass through V5CFG and V5BND target tailoring to produce the binary data downloaded into the ECU's Volcano 5 NVRAM pool.]
The Volcano concept and related products have been successfully used in production since 1996. Present car OEMs using the entire tool chain are Aston Martin, Jaguar, Land Rover, MG Rover, Volvo Cars, and Volvo Bus Corporation.
Acknowledgments I wish to acknowledge the contributions of my colleagues at Volcano Automotive Group — in particular, István Horváth, Niklas Amberntsson, and Mats Ramnefors for their contributions to this chapter.
More Information
http://www.VolcanoAutomotive.com
Industrial Automation

44 Embedded Web Servers in Distributed Control Systems
Jacek Szymanski

45 HTTP Digest Authentication for Embedded Web Servers
Mario Crevatin and Thomas P. von Hoff
44
Embedded Web Servers in Distributed Control Systems

Jacek Szymanski
Alstom Transport

44.1 Objective and Contents .... 44-1
44.2 Application Context .... 44-2
44.3 FDWS Technology .... 44-4
  Embedded Server Functions • Embedded Site Structure • Embedded Server Operation • Site Implementation
44.4 Guided Tour to Embedded Server Implementation .... 44-10
  Steps of Embedded Site Implementation Process • Implementation of VFS • Implementation of Look-and-Feel Objects • Implementation of Page Composition Routines • Implementation of Script Activation Routines • Implementation of Application Wrappers • Putting Pieces Together
44.5 Example of Site Implementation in a HART Protocol Gateway .... 44-20
  Structure of the Site Embedded in the Protocol Gateway • Detailed Implementation of Principal Functions • Access to Site — Home Page • Access to Parameters of the Gateway • Access to Active Channel List • Access to Channel Parameters • Monitoring of Principal Channel Measure • Access Control and Authentication • Application Wrapper
44.6 Architecture Summary and Test Case Description for the Embedded Server .... 44-31
  Embedded Site Architecture • Test Description • Test Scenarios
44.7 Summing Up .... 44-38
References .... 44-38
44.A1 Appendix: Configuration of VFS .... 44-39
  Programming of VFS Component • BNF of Specification Language • Specification Example
44.1 Objective and Contents In today’s landscape of information technology the World Wide Web (WWW) is omnipresent. Its usage is unavoidable in everyday life and in all domains. The WWW technology is around for approximately
15 years. From its early days the WWW has passed through many phases, from initial research status, through the euphoria of "e-anything" in the late 1990s, to today's mature status in the domains of information broadcast, advertising, and e-commerce.
The main objective of this account is to present the application of web-related technologies in the domain of industrial control, and more specifically in the area of distributed control equipment operating on the shop floor around the fieldbus interconnections. The rationale for using the technology is to provide access to (control) system elements by communication means based on the industrial standards defined around the WWW.
Embedded web servers are now omnipresent within packages proposed by different software vendors. The proposed products differ in size, performance, price, architecture, and application area. The objective of this account is not to provide a review of the existing products or a comparison of them. Rather than reviewing the features of ready-made solutions, the account goes through the technological bases on which the construction of these applications relies, using the example of an existing software application called in the sequel the Field Device Web Server (FDWS). The FDWS was implemented with the objective of enhancing the operational functionality of a large class of distributed control system architectures, and especially the fieldbus-based parts of these architectures, by providing them with the power and the flexibility of internet technology.
The account outlines the design of embedded web servers in two steps. First, it presents the context in which embedded web servers are usually implemented. Second, it sketches the structure of an FDWS application with a presentation of its component packages and the mutual relationship between the content of the packages and the architecture of a typical embedded site. The main motivation of the account is, however, to show the user an exemplary approach to embedded site design. For this reason, an illustrated real-life example presents the details of design, implementation, and test trials of an embedded website implemented in an existing field device.
44.2 Application Context
To sketch the impact of the technology on control applications, it is important to identify the location of a field device in a typical architecture of the control system, as shown in Figure 44.1. The field device is part of an automation cell — a collection of cooperating instruments that realize a well-defined automation function. These devices are of different levels of complexity — from simple sensors with extremely limited functions to process computers equipped with powerful processors and large memory banks containing several embedded software programs. Automation cells of a control system cooperate in order to implement a coherent control application. The cooperation is possible thanks to information exchange via a higher level network that forms the system backbone. The backbone links the automation cells with the control room supervisory computers that provide the interface between the system and human operators. So the global control applications are structured into two collections of functions:
• Automation functions implemented in automation cells.
• Supervisory functions implemented in control room computers.
Information exchange between the two parts is based almost exclusively on the client–server paradigm.
The idea that is at the base of the development of the FDWS has its origins in the analysis of the structure of the WWW, and in the statement that it makes a perfect example of successful application of the interoperability principle applied to diverse software products. The interoperability of internet products (clients and servers) is based on universally accepted standards, namely:
• the TCP/IP protocol for reliable data transport,
• the HTTP protocol for application information exchange,
• the XML/HTML format for information presentation and structuring.
FIGURE 44.1 Place of field devices in automation system. [Figure: control room console and process computer on the interconnecting network; field devices on the fieldbus form an automation cell.] (From J. Szymanski, Proceedings of WFCS 2000, September 6–8, ISEP, Porto, Portugal, 2000, pp. 301–308. With permission. Copyright 2000 IEEE.)
FIGURE 44.2 Three-tier architecture of internet client–server application. [Figure: client tier, server tier, and application tier.] (From J. Szymanski, Proceedings of WFCS 2000, September 6–8, ISEP, Porto, Portugal, 2000, pp. 301–308. With permission. Copyright 2000 IEEE.)
Another successful application of universally accepted standards concerns the architectural pattern. Internet-based distributed applications follow the principle of multitier architecture (Figure 44.2), which makes use of universal client and server frameworks independent of the nature of the data processed, on the condition that the data exchanged over the network are transported via the HTTP protocol and are structured according to the XML/HTML format. The multitier architecture standardizes the basic services of the client and server parts of the architecture. In the majority of configurations, the client part is totally independent of the application (the so-called thin client). It is important to state that the properties and advantages of the multitier architecture are independent of the technology used for implementation of its components and are valid for both embedded and nonembedded implementations.
The universal nature of the client places the burden of application personalization on the server side, which is interfaced directly with the embedded application software. This configuration is at the origin of the internal architecture of the server described in the further sections of the account. This architecture simplifies the development process of newly created applications. As one can see, the client tier of the architecture is totally standard and as such does not have to be developed. A big part of the server tier is also based on generic modules, such as the TCP/IP and socket modules, which are included in the standard deliveries for the majority of implementation platforms. The FDWS is designed to interface with these modules and provides a large degree of independence from the implementation details. Not all application-independent modules exist on implementation platforms, and FDWS provides a collection of software modules that can be included into an application. It is important to state here that the FDWS is only an exemplary presentation of an embedded web technology. For other implementations consider Reference 1.
Figure 44.2 shows the three-tier version of the architecture, where the generic parts of it, described above, are completed by the application-dependent part, which is considered monolithic. In the quest for further factorization of the design, this monolithic software block could be split into thinner layers.
The application-dependent part of the server tier is to be developed for each application. Despite the evident advantages of a standard architectural pattern, the design and implementation of this part of the embedded site is not an easy task. The reason for this is that it requires technical fluency in four disciplines:
• Comprehension of the basic application implemented by the equipment hosting the site. This is because the operation of the hosting equipment should be enhanced rather than modified — the embedded site is then an extension of the existent application.
• Skill in the creation of HTTP-based websites, implying knowledge of technologies such as CGI scripting, HTML, Java, JavaScript, etc. This is because these techniques are the basis of operation of components executed in the generic client.
• Good knowledge of the constraints of the platform on which the site is to be installed. This requirement is imposed by the principle of minimal interference with the basic application and should influence the complexity of the site structure as well as the size of site components.
• Good comprehension of the FDWS technology, at least of its part in charge of data transfer through the server tier toward the application tier.
The guided tour through the development of an application embedded in a field device is described in the sections that follow.
44.3 FDWS Technology

44.3.1 Embedded Server Functions
The place of the server tier in the structure presented in Figure 44.2 shows its role in the architecture of the application. This role has nothing to do with the organic mission of the hosting equipment; that is, the server software will not directly intervene in the execution of control algorithms while installed on a Programmable Logic Controller (PLC), nor will it do any protocol conversion while installed in a protocol gateway such as the HART/FIP converter described below. Instead, the embedded server provides the following functions:
1. The embedded server can store and serve the complete interface to the application within the equipment.
2. The server can activate routines that are able to extract and interpret orders sent from the client part and modify the application status via an accessible interface.
3. Server-resident routines can extract the information coming from the application, format it, and wrap it into HTML pages in order to provide them to the client side.
4. Dynamically generated user-interface components can easily manage the evolution of the visual aspect of the user interface as a function of the application status; this can ease operations such as anomaly signaling.
5. Internal server mechanisms provide the possibility of easy implementation of password-based security locks.
In this context, the server role consists in providing a greatly flexible and relatively easily implementable interface. This interface provides remote clients with controlled and configurable access mechanisms to the data, structure, status, and processing modes of the (organic) applications embedded in control system devices.
The FDWS software is designed to implement all the required functions of an embedded server. These functions express the requirements from the point of view of the final user and have to be reformulated in terms of communication architecture. From this viewpoint the embedded site takes the shape of an HTTP protocol server operating above the TCP/IP transport. The basic functions of such an entity are:
1. Management of connections coming from distant clients.
2. Analysis of clients' requests, in terms of syntax and semantics.
3. Maintenance of local server objects in view of their access by distant clients.
4. Decision of granting or refusing access to server objects; composition and transmission of responses corresponding to clients' requests.
5. Execution of processing expressed in clients' requests.
The FDWS software is structured in five interrelated software packages. Each of the basic server functions is supported by one or more software modules. The roles of the packages are explained in Table 44.1. The software is organized in five packages for better design, easier deployment, and maintainability. Figure 44.3 shows the mutual interdependence among the packages. In a typical implementation, modules from all five packages have to be used in the embedded site construction.
44.3.2 Embedded Site Structure The architecture of an embedded server does not differ in principle from the architecture of a regular (nonembedded) web-enabled application.
TABLE 44.1 Package Functions of FDWS Software

Package: Main server engine
Role: In charge of the connection management process: this package groups the modules that realize the functions of server engine operation, network adaptation, and support of persistence of request data.

Package: HTTP request parser
Role: In charge of request analysis: this package implements the parsing of the PDU and CGI environment building.

Package: Embedded file system
Role: In charge of controlling access to server objects: this package provides the elements that support the implementation of an embedded equivalent of a disk file system.

Package: Dynamic page generator
Role: In charge of generating server responses on-the-fly (servlets): this package provides the elements that support the implementation of dynamically generated HTML pages.

Package: Embedded response composer
Role: In charge of response composition.
FIGURE 44.3 General architecture of FieldWebServer software. [Figure: the HTTP request parser, embedded file system, dynamic page generator, and embedded response composer arranged around the main server engine.] (From J. Szymanski, Proceedings of WFCS 2000, September 6–8, ISEP, Porto, Portugal, 2000, pp. 301–308. With permission. Copyright 2000 IEEE.)
In the most general terms every server tier is composed of three basic elements:
1. Generic server body — the principal active component, which loops listening for incoming service requests and processes them; request processing consists in:
(a) Parsing the Protocol Data Unit (PDU) syntax.
(b) Recovering environment variables in order to support server operations.
(c) Identifying requested resources together with the operations to be applied to them.
The generic server body is in principle independent of the applications in which it is incorporated. Its basic elements are the server engine, request parser, response composer, and persistence module.
2. Virtual File System (VFS), an embedded object repository organized as the file system of a typical computer. It is an active component implementing the logistics of server page management. This component helps manage the collection of objects being in direct contact with the application (a possible node layout is sketched below).
3. Collection of application-specific components — elements that implement both the look-and-feel part of the application (HTML pages, compressed images, Java applets, ActiveX controls) and its dynamics (embedded scripts and servlets). These components, which are managed by the VFS, are designed to convey data between the client part and the essential application. Most naturally, these elements are totally dependent on the application.
The analysis of the structure of the embedded server puts in evidence yet another building block of the architecture — the application wrapper. This block is very often introduced into the device structure for convenience reasons. Its role consists in adapting the functional interface of the basic application to the needs of the page composition module. The structure of this block is totally dependent on the basic application. The construction of this block is not supported by the modules of the FieldWebServer, and for this reason it is left outside of the server-tier structure.
Taking these considerations into account, the whole internet-based server architecture, in the context of a control device, can be represented by the schematic of Figure 44.4. The left part of the schematic shows the software architecture at run time, putting in evidence the mutual relationships among all building block instances. The right part of the schematic shows the organization of the FieldWebServer module library.

FIGURE 44.4 Architecture of an internet-based application in a fieldbus equipment. [Figure: at run time, HTML pages, images and applets, and CGI "scripts" held in the VFS; page composition, application wrapper, and persistence modules; request analyser and response composer around the server engine; below them the socket presentation layer, TCP/IP, and the fieldbus stack (layers 1 and 2), alongside the basic application.] (From J. Szymanski, Proceedings of WFCS 2000, September 6–8, ISEP, Porto, Portugal, 2000, pp. 301–308. With permission. Copyright 2000 IEEE.)
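The internal layout of the VFS described in point 2 above is not detailed in this section. One plausible realization of a node of such an embedded file system is sketched below; the type and field names are hypothetical and serve only to show the kind of structure involved.

#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout of a VFS node: a directory tree whose leaves hold
   ready-to-serve objects (HTML pages, images, applets). */
typedef struct vfs_node {
    const char      *name;        /* entry name, e.g., "index.html"    */
    const uint8_t   *data;        /* object bytes, possibly compressed */
    size_t           length;      /* size of the object in bytes       */
    const char      *mime_type;   /* e.g., "text/html"                 */
    struct vfs_node *child;       /* first entry of a subdirectory     */
    struct vfs_node *sibling;     /* next entry in the same directory  */
} vfs_node_t;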
44.3.3 Embedded Server Operation
Application of the FDWS technology in a device is possible if and only if three conditions are fulfilled:
• The execution model of the software embedded in the device is based on the multiprocess paradigm; it is necessary that all server operations be encapsulated in a separate thread of execution.
• The device processing power is sufficient to support the additional computation burden caused by the server operation.
• The basic (organic) application of the device exposes a well-defined programming interface that provides server routines with the means of accessing the application data.
Putting the server into execution needs two operations, which should be executed in order:
• The server data that implement internal server objects have to be configured.
• The server engine routine has to be activated by the device monitoring program in an independent execution thread.

44.3.3.1 Minimal Server Interface
Efficient development of embedded servers using the FDWS technology is possible only when the designer understands the basic elements of the server interface and server operation. The structure of the FDWS software allows the user to have access to the interface of all the modules from which its packages are constructed. This means that an advanced user could make calls to more than a hundred functions and has direct access to many tens of global variables. Mastering all of this is a very complex task and in the majority of cases not necessary. All that an average FDWS user should be aware of is limited to some five modules from three packages. When the user has adequate tools for software configuration, the required knowledge of the large FDWS interface can be limited to three data types, three global variables, and fewer than 10 routine calls.
Normal server operation is composed of two phases:

• Initialization and configuration of the internal data structures; this phase is executed only at server thread startup.
• Activation of the main server loop (the server engine routine), lasting until the server task is destroyed.

Both phases are described below.

44.3.3.2 Init Phase

The server engine routine operates on four global variables, all exposed by the modules from the request parser package and by the module from the server engine package. The variables have the following meaning:

Pointer to VFS root. Pointer to the data structure which is the root of the VFS. The VFS represents the store of server-owned objects and is structured as a disk file system. It should contain all passive objects that the server is supposed to provide: HTML pages, images of all formats, and Java applets.

Pointer to CGI root. Pointer to the data structure which represents the structured store of active server objects, commonly referred to as CGI scripts.

Script exec routine. Pointer to a routine which is in charge of the activation of CGI script routines.

Page compose routine. Pointer to a routine which is in charge of the composition of passive objects stored in the VFS.

Comprehension of the role of these variables is the key to understanding server operation, since the software user has to provide adequate values for them. All four variables should be duly initialized prior to activation of the server engine routine, or the server will not be able to provide any object or execute any CGI routine. The initialization code should be written by the application programmer according to the principles described in Section 44.3.3.3.

44.3.3.3 Operation Phase

The initialization phase configures vital server resources. All passive and active components of the embedded site are placed within reach of the server engine via the VFS root and CGI root pointers. The engine is also provided with the methods of reaction to client requests via the two user routines pointed to by the script exec and page compose routine pointers. The operation phase can now be effectively activated.

The FDWS software contains a ready-made server engine that implements the operation phase according to a predefined scheme that suits the majority of cases. This standard server engine is implemented by a routine from the server engine package. The signature of the routine is as follows:

int server_boot (unsigned short service)

where the formal parameter service is the TCP port number on which the server listens to connections from distant clients. The standard server engine provided within the FDWS software implements the policy of an iterative server, that is, a server which realizes the requests received from distant clients sequentially. This means that two transactions are never actively executed by the server at the same time. If a series of requests arrives at the site, their data are queued in the buffers of the communication software modules.

Server engine operation entails execution of an initialization step followed by a loop over a sequence of five steps. The most important operation of the initialization step is an attempt to create an access point to the network via a passive socket of SOCK_STREAM type, bound to the IP address of the hosting station and to the TCP port number passed as the procedure parameter. If this operation succeeds, control activates the loop; otherwise, the routine exits with an error code and an error message that is passed over to the system of the embedding device.
The success in creation of this socket, often referred to as the main socket, enables the program to enter the processing loop. Server operation then proceeds in the following steps:

Step 1. The server waits passively for incoming requests from distant clients. On arrival of a client's request, the server attempts to accept the connection request and open a secondary (stream-type) socket. If this operation fails, the secondary socket is not created and the execution thread returns to listening on the main socket. In the case of success, step 2 is executed.

Step 2. The server reads and parses the request data unit received via the secondary socket. In this second step of the processing loop, the HTTP request coming from a distant client is received and analyzed. This analysis is done by a set of routines grouped in the HTTP parser package. If the request structure is recognized as conformant with the version of the protocol implemented by the server, the parsing routine extracts all the important data, which allows the server to elaborate the response. These data concern the following parameters contained in the request data unit:

• Protocol version
• Requested HTTP service (GET or POST)
• Full identification of the service object (path within the server internal structure, object name, extension)
• Object class (HTML page, CGI routine, applet, image, . . .)
• Browser options (type, accepted formats, accepted languages, OS type, etc.)
• Request parameters (optionally, if included in the request)
• CGI variables (optionally, if included in the request)

In the case of nonconformance of the request data, the analysis is abandoned and the parsing routine returns an error status. If the error occurs, loop control is transferred to step 4. Otherwise, processing executes step 3.

Step 3. The server searches for the object identified by the request analysis, then prepares and sends the response data unit. Successful termination of the request analysis provides the server with all the data necessary to elaborate a response matching the received request. The step of response preparation is decomposed into three sequentially executed actions:

• Identification of the object class (passive object or CGI script).
• Object search within one of the server object repositories. To execute this action, the object management routines exploit the data provided by the user in the initialization phase (the server page root and CGI script root pointers to the object repositories). If the requested object is found, the next action is executed. Otherwise, a standard "not found" page is sent back to the client.
• Object composition. To execute this action, the generic server routines call the user-provided routines plugged into the loop thread in the initialization phase via the user-configured pointers, the script exec routine and the page compose routine.

The generic part of these actions is implemented by the routines from the server engine and VFS packages. Execution of this step always transfers execution control to step 5.

Step 4. Error report. This step is the alternative to step 3 and is executed only when the analysis of the received request declares its structure to be nonconformant with the conventions recognized by the server. In such a case, a standard error-notifying page is sent back to the client. This situation should be considered as an implementation of a graceful failure mode.

Step 5. Connection closure.
According to the HTTP protocol requirements, the complete transaction between the client and the server should be terminated by a connection closure initiated by the server. The routine operation is limited to closing the secondary socket.

Application of the ready-made solution freezes the main features of the embedded site. For example, the request processing policy is fixed as iterative. The advantage of such a solution for an embedded application is evident: there are no problems with concurrent accesses to server-owned objects. The disadvantage is a loss of efficiency, since some client requests can be rejected by the underlying network software if the processing loop fails to fetch arriving requests at a sufficiently rapid pace.
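The five steps above can be condensed into the following sketch of an iterative engine written against the BSD socket API. This is a minimal illustration, not the FDWS implementation: the request type and the parse_request, compose_response, and report_error helpers are hypothetical stand-ins for the FDWS parser and composer routines.

/* Minimal sketch of an iterative server engine (BSD sockets).
   struct request, parse_request, compose_response, and report_error
   are hypothetical stand-ins for the FDWS data types and routines. */
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>

struct request { char path[128]; int service; };      /* placeholder */

extern int  parse_request(int sock, struct request* req);    /* step 2 */
extern void compose_response(int sock, struct request* req); /* step 3 */
extern void report_error(int sock);                          /* step 4 */

int server_boot_sketch(unsigned short service)
{
    struct sockaddr_in addr = {0};
    int main_sock = socket(AF_INET, SOCK_STREAM, 0);   /* init step */
    if (main_sock < 0)
        return -1;
    addr.sin_family      = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);  /* IP of hosting station */
    addr.sin_port        = htons(service);     /* TCP port parameter    */
    if (bind(main_sock, (struct sockaddr*)&addr, sizeof(addr)) < 0 ||
        listen(main_sock, 5) < 0)
        return -1;            /* init failed: exit with an error code */

    for (;;) {
        struct request req;
        int sec_sock = accept(main_sock, NULL, NULL);      /* step 1 */
        if (sec_sock < 0)
            continue;                    /* back to listening        */
        if (parse_request(sec_sock, &req) == 0)            /* step 2 */
            compose_response(sec_sock, &req);              /* step 3 */
        else
            report_error(sec_sock);                        /* step 4 */
        close(sec_sock);                                   /* step 5 */
    }
}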
/* crude server loop routine */
void server_loop(
    unsigned short service_port_nr,
    tcallback parse_request_routine,
    tcallback generate_response_routine,
    tcallback error_report_routine,
    tcallback closing_routine);

FIGURE 44.5 Code of the server routine body. (From J. Szymanski, Proceedings of WFCS 2000, September 6–8, ISEP, Porto, Portugal, 2000, pp. 301–308. With permission. Copyright 2000 IEEE.)
The advanced user is not obliged to follow the standard engine. The server engine package contains skeletal support for user-configurable engines, implemented by the routine server_loop, which has five parameters, as shown in Figure 44.5. The routine's first parameter is the TCP port number; the four others are pointers to user-provided routines that implement steps 2, 3, 4, and 5 of the loop described above. If any of these parameters is set to NULL, the corresponding step of the loop is implemented by the default routine provided by the FDWS packages. The application programmer can thus take over control of any execution phase by providing a pointer to his or her own routine.
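As an illustration, a user wishing to replace only the error reporting step might proceed as in the following sketch. The exact signature of tcallback is not reproduced in the text, so the one assumed here (a routine receiving the socket identifier) is a guess, and my_error_report is a hypothetical user routine; the NULL arguments select the default FDWS implementations of the remaining steps.

/* Hypothetical customization of the skeletal engine: only step 4
   (error reporting) is overridden; NULL selects the FDWS defaults.
   The tcallback signature is an assumption. */
typedef void (*tcallback)(unsigned short socket_id);

extern int  sockprintf(unsigned short socket_id, char* page_template, ...);
extern void server_loop(unsigned short service_port_nr,
                        tcallback parse_request_routine,
                        tcallback generate_response_routine,
                        tcallback error_report_routine,
                        tcallback closing_routine);

static void my_error_report(unsigned short sock)
{
    sockprintf(sock, "<HTML><BODY>Malformed request.</BODY></HTML>");
}

void start_custom_server(void)
{
    server_loop(80,              /* TCP service port              */
                NULL,            /* step 2: default parser        */
                NULL,            /* step 3: default composer      */
                my_error_report, /* step 4: custom error report   */
                NULL);           /* step 5: default closure       */
}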
44.3.4 Site Implementation

The programmer who wishes to implement an embedded site should undertake the following steps:

• Create the routines that fill the appropriate memory regions with the embedded HTML objects (pages, page templates, images, applets).
• Program the routines that implement the CGI scripts referenced within the embedded objects.
• Provide the routines that generate the data structures enabling the management of all repositories included in the VFS. These structures should contain references to the memory regions in which the server-embedded objects are stored, and should also hold the addresses of all routines implementing CGI scripts. These routines should assign the references of the data structures to the server page root and script exec routine pointers.
• Provide the routines that determine the actions to be undertaken on invocation of each server object and on activation of each CGI script; the references of these routines should be assigned to the send_page_routine and send_script_routine pointers (the implementations of the page compose routine and the script exec routine, respectively).
• Provide an initialization routine that activates all the necessary configuration actions described above.
• Create the server process that calls the initialization routine and activates the server engine routine.

All the steps need to be realized according to certain rules described in the next section.
44.4 Guided Tour to Embedded Server Implementation

44.4.1 Steps of Embedded Site Implementation Process

The application programmer should understand the main features of FDWS software operation. If he or she accepts the predefined operation mode, nothing needs to be modified or extended in the routine implementing the main server loop, described in Section 44.3.3. If he or she decides to customize one or more phases of the main server loop, the development effort increases. In any case, the entire application-dependent part of the embedded site has to be developed.
The embedded site is placed within the target platform as an executable object. It is of no importance whether it is statically linked and loaded with the main application or dynamically linked when the other software processes are already running; this detail depends on the platform. The scenario described in this section is based on the following assumptions:

• The server is activated in a separate process.
• The server process is implemented as relocatable object code statically linked with the main application.
• The modules implementing the reusable server mechanisms are placed in a static library and linked with the code of the server process.

The presentation below concentrates on the development of the application-dependent software. This software implements the following elements of the embedded site:

• The VFS tree, which determines the skeleton of the site structure. The VFS tree holds the references of all objects forming part of the site and provides the mechanism of search for server objects: pages, applets, images, and CGI scripts. The application developer constructs this part of the embedded site using the routines from the VFS package. This work can be tedious and complicated when done manually, but it can be easily mechanized by the configuration tool presented below.
• Embedded look-and-feel objects, which are data structures representing the embedded passive objects (page frames, applets, images). These data are usually implemented as octet arrays residing in memory regions referenced by the VFS tree nodes. In the usual development process, these objects are designed and implemented with tools adapted to the object nature (HTML editors, image editors, Java development environments). The necessary transformation from their standard representation formats (ASCII files, gif/jpeg files, byte code) to byte arrays loadable into device memories has to be supported by appropriate tools.
• Page composition routines, which merge application-dependent data with static page frames in order to form complete HTML pages that incorporate the application status; these routines are either programmed manually or generated from a user-friendly notation.
• Routines representing active server objects (CGI scripts, dynamic pages), executed on requests received from the client. These routines serve to integrate application data into server pages. The routines usually reuse generic functions provided by the FDWS packages; their design is highly dependent on the application, and only a manual development process is possible.
• The script launching routine.
• The application wrapper, which extracts useful information from the basic application of the hosting device.
• The initialization routine, which assigns appropriate values to the four pointer variables of the server interface.
• The server process code, which calls the initialization routine and bootstraps the main server loop.

The mutual relationships among these elements, as well as the relationship with the library of server modules, are presented in Figure 44.6. Their construction is described in detail, point by point, in the following sections.
44.4.2 Implementation of VFS

The basis of the VFS is the data structure manipulated by the routines that look for the HTML objects referenced in client requests. The standard implementation of the site architecture assumes that these data are composed of two separate lookup trees, called repositories:

• The passive object repository, holding the references of all HTML pages, images, and applets; the root of this tree is referenced by the repository root pointer.
• The active object repository, holding the references to the routines that implement the CGI scripts; the root of this tree is referenced by the script routine pointer.
[Figure 44.6 shows the custom component (INIT, page composer, VFS, and script launcher, together with pages, images, and applets, and with CGI scripts, servlets, and dynamic pages) built on top of the FDWS modules and coupled to the application through the application wrapper.]
FIGURE 44.6 Relationships of elements of embedded site. (From J. Szymanski, Proceedings of WFCS 2000, September 6–8, ISEP, Porto, Portugal, 2000, pp. 301–308. With permission. Copyright 2000 IEEE.)
The tree is built of three types of nodes:

Repository root. The unique entry point to each data structure. This object holds the list of references to the other elements of the tree: embedded directories and embedded files. One of the embedded files, called the by-default server page, plays a special role in the process of response composition. The tree root can also hold a reference to the list of authentication records, whose role and structure are described below.

Embedded directory node. This type of node plays the role of the root of a subtree within the server structure. It holds the list of references to other embedded directories and/or embedded files. It can also hold a reference to a list of authentication records.

Embedded file node. This node is a tree leaf that directly holds the reference to the data necessary to compose and send the requested object.

An example of the structure of the repository is presented in Figure 44.7. The data types of the objects that form the structure of the VFS repositories are defined in the VFS package, together with the routines that create the repositories and grow the trees by successive calls. One of the possible sequences of calls that implements the creation of a page repository like the one in Figure 44.7 is as follows:

1. Creation of the tree root
2. Creation of the by-default page tree node
3. Appending of the default page to the tree root
4. Creation of an embedded directory node named public
5. Creation of a series of embedded file nodes and insertion of the nodes into the directory
6. Appending of the directory node to the repository
7. Repetition of steps 4, 5, and 6 for the directories images and javadir

[Figure 44.7 shows the resulting tree: the repository root holds the by-default page and the embedded directories public, images, and javadir.]

FIGURE 44.7 Example of VFS repository.
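Expressed in code, the sequence above might look like the following sketch. All type and routine names here (tvfs_node, vfs_create_root, vfs_create_dir, vfs_create_file, vfs_append) are hypothetical: the real identifiers belong to the FDWS VFS package and are not reproduced in the text.

/* Hypothetical sketch of the repository creation sequence; the real
   routine and type names live in the FDWS VFS package. */
typedef struct tvfs_node tvfs_node;

extern tvfs_node* vfs_create_root(void);
extern tvfs_node* vfs_create_dir(const char* name);
extern tvfs_node* vfs_create_file(const char* name,
                                  const unsigned char* data, int length);
extern void vfs_append(tvfs_node* parent, tvfs_node* child);

extern const unsigned char index_page[];   /* embedded by-default page */
extern int index_page_length;
extern const unsigned char status_page[];  /* an embedded page         */
extern int status_page_length;

tvfs_node* build_page_repository(void)
{
    tvfs_node* root = vfs_create_root();                     /* step 1 */
    vfs_append(root, vfs_create_file("index.htm",            /* 2, 3   */
               index_page, index_page_length));

    tvfs_node* pub = vfs_create_dir("public");               /* step 4 */
    vfs_append(pub, vfs_create_file("status.htm",            /* step 5 */
               status_page, status_page_length));
    vfs_append(root, pub);                                   /* step 6 */

    vfs_append(root, vfs_create_dir("images"));              /* step 7 */
    vfs_append(root, vfs_create_dir("javadir"));
    return root;
}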
It is important to state that the data structure constructed by the procedure detailed above generates only the containers; these must be filled with references to the data that actually implement the embedded objects. The references should point to the actually exploitable data (the byte arrays mentioned in the preceding sections), which are generated as the result of a separate operation described in the section that follows. The method of creation of the active object repository is nearly identical to the one described above.
44.4.3 Implementation of Look-and-Feel Objects

The data structures corresponding to the repositories of the VFS enable the usual operations of a file management system, such as file creation, deletion, search, access to the data stored in a file-type node, and activation of the routine referenced by a script-type node. They do not directly contain any data or code; these have to be created by separate operations. The data are implemented by the passive server objects, contributing to the look-and-feel aspects of the embedded site. The objects are placed inside memory regions accessible to the server routines. They take two different forms:

• Static HTML pages and HTML page templates are represented as character strings.
• Embedded images and embedded Java applets are stored as byte arrays.

The difference in storage form is explained by the method of object processing in the response composition phase of server operation. HTML pages are composed of printable characters only and never contain a null character, which is used uniquely as the marker of the page (string) terminator. The same assumption holds neither for images in .gif and .jpeg formats nor for applets. These objects can (and do) contain nonprintable bytes, including the null character, and cannot be stored as character strings. Their storage format should follow the pattern of a byte array of a known size.

From the external point of view, embedded pages and embedded images do not differ from regular (nonembedded) ones, and there is no reason for them not to be created with the tools which usually serve to edit them (Microsoft Front Page, Netscape Composer, Microsoft Image Composer, etc.). Standard tools create standard storage formats, compatible with the file system of the hosting platform. This is the main problem in the creation of embedded sites, since the standard storage formats are not directly useful in the construction of the server custom component. The output files produced by the tools have to be transformed into modules that can be linked (statically or dynamically) with the code of the other server modules. An example of such a module, embedding an image in .gif format, is shown in Figure 44.8.
extern const unsigned char aautobull2_img[];
extern int aautobull2_img_length;

const unsigned char aautobull2_img[] = {
0x47, 0x49, 0x46, 0x38, 0x39, 0x61, 0x0c, 0x00, 0x0c, 0x00, 0xb3, 0xff, 0x00, 0xff, 0xff, 0x66,
0xff, 0xff, 0x33, 0xff, 0xff, 0x00, 0xcc, 0xff, 0x00, 0xc0, 0xc0, 0xc0, 0x99, 0xff, 0x00, 0x99,
0xcc, 0x00, 0x99, 0x99, 0x00, 0x99, 0x66, 0x00, 0x66, 0x99, 0x00, 0x66, 0x66, 0x00, 0x33, 0x66,
0x00, 0x33, 0x33, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x21, 0xf9, 0x04,
0x01, 0x00, 0x00, 0x04, 0x00, 0x2c, 0x00, 0x00, 0x00, 0x00, 0x0c, 0x00, 0x0c, 0x00, 0x00, 0x04,
0x42, 0x90, 0xc8, 0x49, 0x6b, 0xbb, 0xb7, 0x92, 0x86, 0x42, 0x10, 0x47, 0x43, 0x35, 0x02, 0xe0,
0x09, 0x83, 0x21, 0x6e, 0x88, 0x79, 0x0e, 0x83, 0x22, 0x92, 0x81, 0x39, 0x08, 0xc6, 0x91, 0xcc,
0xde, 0x57, 0x14, 0xba, 0xdd, 0x46, 0x40, 0x84, 0x19, 0x14, 0x0b, 0xd9, 0xe6, 0x00, 0x1b, 0x1c,
0x16, 0x50, 0xc6, 0xaa, 0x31, 0x28, 0x18, 0x12, 0x48, 0xe9, 0x48, 0xa1, 0x53, 0x68, 0x2d, 0x98,
0x95, 0x24, 0x02, 0x00, 0x3b};

int aautobull2_img_length = 149;
FIGURE 44.8 Code snippet implementing an embedded .gif image. (From J. Szymanski, Proceedings of WFCS 2000, September 6–8, ISEP, Porto, Portugal, 2000, pp. 301–308. With permission. Copyright 2000 IEEE.)

extern char* transpassword_str;
extern int transpassword_str_length;

static const unsigned char transpassword_str_array[] = {
0x3c, 0x68, 0x74, 0x6d, 0x6c, 0x3e, 0x0a, 0x0a, 0x3c, 0x68, 0x65, 0x61, 0x64, 0x3e, 0x0a, 0x3c,
0x74, 0x69, 0x74, 0x6c, 0x65, 0x3e, 0x50, 0x61, 0x73, 0x73, 0x77, 0x6f, 0x72, 0x64, 0x20, 0x45,
/* ... bytes elided in the original figure ... */
0x2d, 0x2d, 0x3e, 0x3c, 0x2f, 0x66, 0x6f, 0x6e, 0x74, 0x3e, 0x3c, 0x2f, 0x62, 0x6f, 0x64, 0x79,
0x3e, 0x0a, 0x3c, 0x2f, 0x68, 0x74, 0x6d, 0x6c, 0x3e, 0x0a, 0x00};

char* transpassword_str = (char*)(&transpassword_str_array);
int transpassword_str_length = 1546;
FIGURE 44.9 Code snippet representing an embedded HTML page.
The module contains two variables:

• The byte array that holds the data normally placed within a disk file; the reference of this variable is passed as the parameter of the call to the file creation routine.
• An integer variable storing the length of the array; this variable is used by the routines that serve the object data over the network.

Nearly the same format can be used to store HTML pages, with the differences shown in Figure 44.9. In this second module, the byte array is encapsulated within the module and only its reference, cast to a type compatible with the character string type, is exported. It can also be seen that a final null character is placed at the end of the byte array. This enables the array to be processed in exactly the same way as a character string.

The transformation of the standard storage formats into the modules shown above is done by simple programs which read the disk files and generate the appropriate modules automatically. The principle of their operation is shown in Figure 44.10. The memory regions corresponding to the files are reserved and filled in at build time of the server code by the operation of the compiler and linker producing the object code of the appropriate modules. The same process leads to the resolution of the references of the regions held within the data structures of the VFS repositories.
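Such a converter can be very small. The following sketch follows the embedbin box of Figure 44.10; the program name, arguments, and symbol-naming conventions are assumptions based on that figure and on Figure 44.8.

/* Sketch of an embedbin-style converter: reads a binary disk file and
   emits a C module of the form shown in Figure 44.8. The naming
   conventions are assumptions based on Figures 44.8 and 44.10. */
#include <stdio.h>

int main(int argc, char* argv[])
{
    if (argc != 3) {
        fprintf(stderr, "usage: embedbin <file> <symbol>\n");
        return 1;
    }
    FILE* in = fopen(argv[1], "rb");
    if (in == NULL)
        return 1;

    printf("extern const unsigned char %s[];\n", argv[2]);
    printf("extern int %s_length;\n\n", argv[2]);
    printf("const unsigned char %s[] = {\n", argv[2]);

    int c, n = 0;
    while ((c = fgetc(in)) != EOF) {
        printf("%s0x%02x", n ? ", " : "", c);
        if (++n % 16 == 0)
            printf("\n");          /* 16 bytes per generated line */
    }
    printf("};\n\nint %s_length = %d;\n", argv[2], n);
    fclose(in);
    return 0;
}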
44.4.4 Implementation of Page Composition Routines

One of the specific features of web servers operating on diskless platforms concerns the method of serving HTML objects. This problem is not so crucial in the case of servers placed on platforms equipped with disks, where the basic service consists of copying page contents from a disk file to the communication link (socket). A similar procedure can be used for the embedded platform, where an octet string or memory region takes the role of a mass storage file.
[Figure 44.10 shows two command-line converters: embedpage transforms .html/.htm files into .c modules, while embedbin transforms .gif, .jpeg, and .class files into .c modules.]
FIGURE 44.10 Principle of transformation of passive objects into embedding modules.
However, this procedure is of lesser interest for the majority of embedded applications. Static HTML pages have no appeal when used as front ends for a control application. Genuinely useful pages should incorporate information produced by the back-end application. Two methods are possible to implement such a requirement:

• The HTML page is split into two objects: a static frame (page template) and application-dependent data; in the process of page serving, the frame is merged with the data recovered dynamically from the application.
• The HTML page is generated on the fly by routines that compute the page components one by one; the attributes of the page components are determined by the parameters of the routines, which are application-dependent data.

The first method is simpler to implement but, in the case of complicated interfaces, requires voluminous page templates stored in large byte arrays. The second method reduces storage consumption but needs a supplementary software basis composed of routines implementing the page computations. In the case of FDWS, these routines are provided by the package for online generation of HTML pages. Both methods are briefly described in the next sections.

44.4.4.1 Template-Based Dynamic Pages

Template-based generation of dynamic HTML involves the separation of every page into two components: a static page template and dynamic data. Page generation involves merging both components before the page is transmitted to the requesting client. Page templates closely resemble regular pages and can be constructed using the regular tools for HTML page creation. In the embedded site structure, templates are stored in the same way as the embedded static pages, that is, as byte arrays or character strings.

A page template contains all the constant page elements, such as nonvarying text, images, applets, constant hyperlinks, constant attributes of HTML tags, etc. Anything varying within the page is replaced by a placeholder, a dynamic data representative. Placeholders can replace virtually any element, be it text, numeric data, or a tag attribute. Their implementation strongly depends on the method of serving the page. In the case of this software, the placeholders' implementation is based on the C language conversion specifications, as used in the format strings of the C functions from the printf family (printf, sprintf, fprintf, etc.).
This means that any integer numeric data are replaced by the specifications %d, %i, %o, %u, %x, or %X. Floating point numeric data are replaced by the %f, %e, and %E specifications. Strings are replaced by the %s specification. As an example, consider an HTML page of the form shown below.
<TITLE> Count of visitors </TITLE>
Page of ALSTOM TECHNOLOGY was seen by 123456 visitors.
If the two highlighted elements are variable data, the page template should have the form shown below.
<TITLE> Count of visitors </TITLE>
Page of %s was seen by %d visitors.
The origin of this representation lies in the implementation of the routine that merges the template with the variable data. In FDWS this function is implemented by a routine from the server engine package. The routine signature is of the same type as that of fprintf or sprintf, namely:

int sockprintf(unsigned short socket_id, char* page_template, ...);

The piece of code that composes the page above is shown in Figure 44.11. The page template is referenced by the vis_page_str pointer, and ssock is the unique identifier of the server socket.
char comp_name_str[32];
int vis_nr;
...
strcpy(comp_name_str, "ALSTOM TECHNOLOGY");
vis_nr = 123456;
send_result = sockprintf(ssock, vis_page_str, comp_name_str, vis_nr);
...

FIGURE 44.11 Code fragment which implements the example of embedded page generation.
The method employed requires that any dynamic data to be merged with a page template be transformed into a signed decimal integer, a signed floating point number, or a string.

The proposed server interface provides a unique entry point to the page composition code of all server pages via the pointer send_page_routine. It is important to observe that the page composition code sequences for every page within the server should be accessible via this entry point. This constraint can be fulfilled only if the customized component of the embedded server contains a routine which intercepts all requests for HTML object services and dispatches them to specialized pieces of code. The recommended solution is presented in the next section.

44.4.4.2 Dynamic Pages Generated on the Fly

In dynamic page generation, no static page frame is employed; the page is produced as the result of a series of routine calls that generate the code strings representing the successive components (HTML tags) of the page. The routines can be called conditionally and can have parameters dependent on application data. The data strings produced by the routines are either directly written to the socket or stored in a buffer that is finally sent to the socket.
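A minimal sketch of on-the-fly generation, reusing the sockprintf routine introduced above, could look as follows; the routine name and the page content are invented for illustration.

/* Sketch of on-the-fly page generation: successive sockprintf calls
   emit the HTML fragments one by one; no stored template is used.
   compose_visitors_page and its parameters are hypothetical. */
extern int sockprintf(unsigned short socket_id, char* page_template, ...);

void compose_visitors_page(unsigned short ssock,
                           const char* company, int visitors)
{
    sockprintf(ssock, "<HTML><HEAD><TITLE>Count of visitors"
                      "</TITLE></HEAD><BODY>");
    sockprintf(ssock, "Page of %s was seen by %d visitors.",
               company, visitors);
    sockprintf(ssock, "</BODY></HTML>");
}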
44.4.5 Implementation of Script Activation Routines

The structure of the VFS imposed by the standard architecture of the embedded server separates the objects placed at the user's disposal into two distinct collections: passive objects (pages, applets, images) and active objects (scripts). The standard application interface separates the configuration of passive object composition and transmission from active object servicing. Active object servicing is done by a procedure which should be user-provided and inserted into the server's structure via its reference. As with the page composition routine, this entry point should be unique and should handle the activation of every script solicited by client requests.
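Such a unique entry point is naturally written as a dispatcher. The following sketch is hypothetical: the script names, the lookup table, and the signature assumed for the entry point are all invented, with only the name send_script taken from Figure 44.13.

/* Hypothetical script dispatcher assigned to send_script_routine:
   a single entry point selecting the routine registered under the
   requested script name. The signature is an assumption. */
#include <string.h>

typedef int (*tscript_fn)(unsigned short sock);

extern int set_params_script(unsigned short sock);
extern int get_measure_script(unsigned short sock);

static const struct { const char* name; tscript_fn fn; } scripts[] = {
    { "set_params.cgi",  set_params_script  },
    { "get_measure.cgi", get_measure_script },
};

int send_script(unsigned short sock, const char* script_name)
{
    for (unsigned i = 0; i < sizeof scripts / sizeof scripts[0]; i++)
        if (strcmp(scripts[i].name, script_name) == 0)
            return scripts[i].fn(sock);
    return -1;   /* unknown script: the engine serves an error page */
}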
44.4.6 Implementation of Application Wrappers

There is no general method proposed for this part of the customized component. Only vague recommendations, deduced from the mission of the basic application, can be provided. The wrapper modules serve as an interface adaptor for the data transmitted between the server objects and the application objects.

The requirements imposed on these modules from the server side follow the method of data insertion into dynamic pages. Any useful information extracted from the application should be transformed into scalar data having one of the basic types usable with the page templates, that is, integer numbers, floating point numbers, and character strings. The complete requirements imposed on the interface software from the application side are impossible to determine, due to the diversity of application types. Some basic principles can, however, be identified. Data sent by the client are transported by the elements of the CGI interface included in the body of the POST service. These elements are normally constructed as a series of (name, value) pairs. These data are automatically recovered from the client request PDU and stored within a special memory region accessible via the modules of the server engine package. The modules provide a set of functions that allow the programmer to recover and handle the requested data. The basis of the module's interface is built on one data type, represented by the code in Figure 44.12, and two functions which provide the application program with read and write access to the memory region. As can be seen, data in the region are identified by their alphanumeric identifiers recovered from the POST service request PDU.

The implementation of the five components of the embedded site described above provides all the operations necessary to put the front-end tier of the application into action. The operations should now be activated in the proper sequence: initializations first, followed by the activation of the main server loop. The initialization operation should set the four variables exported by the server to the values that reference the passive object repository root, the active object repository root, the page composition routine, and the script launching routine. An example of such an initialization routine is presented in Figure 44.13.
typedef struct tdbtag_res {
    int result;
    union {
        char* string;
        int   integer;
        float real;
    } value;
} tdb_result;

FIGURE 44.12 Data type supporting interface with client provided PDUs.
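For illustration, a CGI script might recover a posted form field through this type roughly as follows. The accessor name db_get_value is an assumption, since the two real read/write functions of the server engine package are not named in the text, and set_hart_retries is a hypothetical application routine.

/* Hypothetical use of tdb_result: recover the "retries" field posted
   by a parameter form. db_get_value is an assumed accessor name. */
extern tdb_result db_get_value(const char* name);
extern void set_hart_retries(int retries);

void apply_posted_retries(void)
{
    tdb_result r = db_get_value("retries");
    if (r.result == 0)                 /* assumed: 0 means found */
        set_hart_retries(r.value.integer);
}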
void init_VFS(void)
{
    server_root         = db_page_root_gen();
    send_page_routine   = send_page;
    cgi_bin             = db_cgi_bin_gen();
    send_script_routine = send_script;
}

FIGURE 44.13 The interface of initialization routine.
int embedded_server_launcher(unsigned short service)
{
    init_VFS();
    init_application_wrapper();
    return server_boot(service);
}

FIGURE 44.14 Server launching routine.
This piece of code, a series of four simple assignments, assumes that the application programmer has provided four routines: db_page_root_gen, which generates the passive object repository; db_cgi_bin_gen, which generates the active object repository; send_page, which implements the page composition method; and send_script, which implements the script launching method. The two functions db_page_root_gen() and db_cgi_bin_gen(), which generate the VFS repositories, are called and executed by the initialization routine. The result of their execution is the immediate creation of the VFS tree structures and the assignment of their references to the pointers server_root and cgi_bin. This is not the case for the page composition (send_page) and script activation (send_script) routines, for which only references are assigned to the server API pointers. In the presented solution, it is assumed that these four functions are exported from the modules that implement them.

Now that the initialization routine is implemented, the code of the server process can be proposed. The routine in Figure 44.14 is an example of such code. It is a simple piece of code which performs two initialization actions, that of the server structures via the call to the init_VFS() routine and that of the application interface via the init_application_wrapper() routine, before launching the server loop via the call to the server_boot(service) routine. The only parameter of this routine is the number of the TCP/IP port on which the server expects to receive client requests.
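Since server_boot never returns during normal operation, the launcher is typically run in its own thread of execution, as required in Section 44.3.3. A minimal sketch, assuming a POSIX-style threading API is available on the hosting platform:

/* Sketch of bootstrapping the embedded site in its own thread,
   assuming POSIX threads on the hosting platform. */
#include <pthread.h>

extern int embedded_server_launcher(unsigned short service);

static void* server_task(void* arg)
{
    unsigned short port = *(unsigned short*)arg;
    embedded_server_launcher(port);   /* blocks in the server loop */
    return NULL;
}

void start_embedded_site(void)
{
    static unsigned short port = 80;  /* HTTP service port */
    pthread_t tid;
    pthread_create(&tid, NULL, server_task, &port);
}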
44.4.7 Putting Pieces Together

The preceding sections provide all the necessary details concerning the creation of the custom elements of the content of an embedded server. It is important to show how the steps of this development process are sequenced. The sequence of steps is presented by the graph shown in Figure 44.15.
[Figure 44.15 shows the development flow: .html/.htm look-and-feel files produced with a text editor pass through embedpage, and .gif/.jpeg files through embedbin, to become .c modules; Java .class files, the VFS configuration processed by the compilVFS tool, the init code, the scripts, the application wrapper, the server driver, and the reusable HTTP modules are compiled by the C/C++ compiler and linked into a loadable object file that links with the application.]

FIGURE 44.15 Development process for an embedded server application.
The process shows how to obtain the final result, a loadable object code file, from the primitive elements, which are grouped in three categories:

1. The collection of passive elements (pages, images, applets).
2. VFS creation and management: page composition and script activation.
3. The interface with the organic application, the initialization code, and the CGI script routines.

These categories of application elements are developed with the means appropriate to the nature of each element. This signifies that:

• HTML pages are created with an HTML editor.
• Gif and Jpeg objects are developed with image editing tools and devices.
• Java applets and Beans are developed with the standard tools included in the JDK.
• The C/C++ code implementing the application wrapper software and the routines playing the role of CGI scripts should be developed with the suite of tools (editor/compiler/debugger/loader) used for the hosting platform.
In order to create the integrated component loadable to the hosting platform, the site components should first be transformed from their initial storage formats to a common format, which is a collection of compilable C-coded modules. The preceding section proposed the method of transformation of the passive components to C-coded modules by means of specialized processors. There is no problem for the modules originally coded in C, that is, for the initialization code, the script routines, and the application wrapper: they should be designed and coded in accordance with the usual principles of efficient implementation.

The method of development of the VFS component poses the biggest challenge in the development of the embedded site. The VFS design is straightforward and relatively easy, and its manual implementation is a simple chain of repetitive actions that can be directly deduced from the graphical site representation (like the one in Figure 44.7). The implementation process is so regular that it can be easily automated; that is, the repository tree can be transformed to a sequence of procedure calls by a processor, a VFS compiler.
[Figure 44.16: an automation cell in which the HART/FIP protocol gateway links the WorldFIP network with the HART channels.]
FIGURE 44.16 Automation cell with a HART/FIP gateway and its configuration console.
This compiler transforms the textual description of the VFS repository structures into the appropriate C-coded modules that implement all four functions necessary to initialize the server API. A more detailed description of the compiler's operation is given in the Appendix.
44.5 Example of Site Implementation in a HART Protocol Gateway

In order to illustrate a real-life application, a site embedded in an industrial device is presented. The chosen device is an instrumentation gateway for process control cells. Its role consists in linking the network of sensors and actuators with the process computers and PLCs. To do so, the gateway collects the information from the instruments connected to it via the HART instrumentation protocol and transfers it to the automation cells via the WorldFIP protocol. Each gateway can provide connections for up to eight HART channels. In this example, one of two compositions of channels is possible: eight input channels, or six input channels and two outputs.

Operation of the gateway is controlled by a collection of parameters that tune its performance to the needs of a given installation. Through the set of parameters one can set up the characteristics of the HART and WorldFIP protocols and modify certain translation parameters, gateway operation modes, etc. Each HART channel can also be tuned to the type of HART transmitter connected to it. All these tuning operations are usually done by a special purpose device, the configuration console (see the schematic in Figure 44.16). The console communicates with the gateway via a proprietary protocol based on UDP/IP transport.

The idea of the application described below is to replace the special purpose tuning console by a standard web browser and to implement all tuning functions on the basis of the three-tier architecture described in the introductory section of this account. All the functions related to the man–machine interface and to the tuning operations are implemented by the front-end tier based on the embedded HTTP server. The architecture of such an application would then be transformed to the one shown in Figure 44.17.

The server embedded in the gateway, together with a standard web browser, should replace the operation of the tuning console. For this reason, it should give the user access to all the necessary functions implemented by the special purpose configuration console. This set of functions is described below.

The console gives access to the parameters of the gateway via an appropriate screen (Table 44.2). The front-end server placed within the gateway should provide access to the same set of parameters, respecting their read only or read/write mode. The console also provides the means of monitoring the status of all HART channels in real time by displaying the nature of the transmitter connected (HART type/non-HART type), its signaling (type, manufacturer name), and its status (active/inactive).
[Figure 44.17: a WWW browser on Ethernet reaches the HART/FIP gateway through a TCP/IP router; the gateway connects the WorldFIP network and the HART channels.]
FIGURE 44.17 Gateway parameter tuning functions realized by the three-tier architecture.

TABLE 44.2 Access to Gateway Parameters

Type of parameters       Parameter name             Access
Identification           Tag                        Read/write
                         Product name               Read only
                         Manufacturer name          Read only
                         Software version           Read only
Hardware properties      Power supply type          Read only
                         I/O mode                   Read only
                         WorldFIP medium type       Read only
                         WorldFIP medium mode       Read only
                         WorldFIP bit rate          Read only
WorldFIP protocol        Refreshment time           Read/write
                         Promptness                 Read/write
HART protocol            Timeout                    Read/write
                         No. of retries             Read/write
Processing parameters    Measure format             Read/write
                         Operation mode             Read/write
                         Antialiasing filter        Read/write
                         Configuration version      Read/write
                         Configuration revision     Read/write
The same channel-monitoring function is to be placed in the front-end server of the protocol gateway. The gateway provides the possibility of parameter tuning of every active HART channel. The appropriate screen gives access to the set of channel parameters shown in Table 44.3. Access to all the devices should be provided by the front-end server of the gateway. The operation of the ALSPA P80H console is in principle oriented toward parameter-tuning functions. In some cases, however, it gives the possibility of direct monitoring of process variables by giving access to the transmitter primary variable. The same function is required from the server embedded in the protocol gateway.
44.5.1 Structure of the Site Embedded in the Protocol Gateway

The architecture of the server embedded in the gateway is strongly influenced by the functional requirements presented above. It is composed of a collection of HTML pages, corresponding to the console screens, which are organized in five directories. Three of the five directories group the pages according to a functional criterion: one directory (di80_parameters) holds the pages that provide access to the gateway parameters.
TABLE 44.3 Access to Channel Parameters

Type of parameters          Parameter name              Access
Identification              Manufacturer name           Read only
                            Transmitter model           Read only
                            Transmitter tag             Read/write
                            Descriptor                  Read/write
                            HART unique identifier      Read only
Cell limits                 Upper cell limit            Read only
                            Lower cell limit            Read only
                            Minimum span                Read only
Transmitter configuration   Damping factor              Read/write
                            Transfer function           Read/write
                            Primary variable units      Read/write
                            Lower measurement range     Read/write
                            Upper measurement range     Read/write
[Figure 44.18 shows the passive object repository: the repository root holds the Home Page and five directories, images (16 virtual .gif and .jpeg files), _fpclass (3 virtual .class files), di80_parameters (8 virtual .htm files), measures (6 virtual .htm files), and transmitters (36 virtual .htm files).]

FIGURE 44.18 Passive object repository of the embedded server.
A second directory (transmitters) groups the pages that access the active channel list together with the channel-parameter-tuning pages, and a third (measures) groups the pages which monitor channel measures. The two remaining directories group the pages according to a structural criterion: one (images) stores all the embedded images included within the pages, and the other (_fpclass) contains all the embedded Java applets. This architecture is presented by the graph of the passive object repository shown in Figure 44.18. The VFS of the server also contains another repository, which holds three scripting routines. These script files are placed directly under the repository root (see Figure 44.19). In total, the server contains 70 embedded files collected within 5 directories.
[Figure 44.19: the CGI script repository root holds 3 virtual .cgi files.]
FIGURE 44.19 CGI (script) repository of the embedded server.
FIGURE 44.20 Home Page of the embedded server.
44.5.2 Detailed Implementation of Principal Functions

The sections below present all the important pages that give the user access to the functions implemented by the front-end server. All the pages were developed using the Microsoft Front Page HTML editor and incorporate graphical elements provided by this tool (page background, fonts, banners, buttons, etc.).
44.5.3 Access to Site — Home Page

Access to the site is obtained via the default page presented in Figure 44.20. This page has a rather informative character; it displays a photo of the gateway and the list of the principal functions implemented by the embedded server. Direct access to the functions can be obtained via three buttons placed above the photo.
FIGURE 44.21 Page giving access to the parameters of the gateway in read only mode.
44.5.4 Access to Parameters of the Gateway

The first and third buttons of the server welcome page give access to two server pages that allow the user to read parameters (Parameters button) or to modify parameters (Set Parameters button). Both buttons link the default page with two pages residing in the directory transmitters.

The page presented in Figure 44.21 gives access to the parameters of the gateway. It is implemented as a frame of three panes; that is, its implementation requires four embedded files. The upper pane identifies the screen via the large title banner, realized as an animated gif image. The lower left pane contains a menu composed of five hyperlinks that provide convenient access to the five groups of gateway parameters. The parameters, distributed among five tables, are displayed in the lower right pane. This pane is too small to display all five tables at the same time, which explains the need for the menu pane: it avoids using scrollbars for access to the parameter tables.

Parameter modification is implemented by the page presented in Figure 44.22. This page is separated from the previously described page for security reasons. It is to be used when one wants to change some parameter values. The user then enters into a transaction with the server-resident components, which should finish with the modification of the chosen parameter or parameter set. The set of parameters displayed in this page is narrower than the one displayed in the previously described page. This is normal, since only modifiable parameters are displayed on the screen.

The page editing environment provides the facility to add to the page form-element controls coded automatically in JavaScript. These concern the obligation to fill in certain fields (password, see Section 44.5.8), to keep an entered value within a given interval (e.g., the Retries field value should be kept between 1 and 6), and to respect the format of the information input to certain elements (e.g., only digits are accepted in the timeout, retries, and refreshment fields).
FIGURE 44.22 Gateway parameter setting form.
The contents of the form elements filled in by a user wishing to modify the corresponding gateway parameters are sent to the server using the POST service of the HTTP protocol when the SUBMIT button of the page is pressed. The modification of the parameters and the update of the browser's screen with the new values are implemented by a specialized routine invoked via one of the CGI scripts.
44.5.5 Access to Active Channel List

The middle button on the Home Page of the embedded site gives access to the page that displays the list of all active channels connected to the gateway at a given time. The status of the channels is described in tabular form, as shown in Figure 44.23. Each channel corresponds to a row in the table, and each row is composed of four elements that give a high-level description of the connected channel. The position of the channel descriptor in the table corresponds to the channel number. The third column of the table indicates the transmitter status. The remaining columns of each line are filled in only if the third one is set to ACTIVE. In such a case, the first column contains the name of the transmitter manufacturer (if recognized), the second contains the device type, and the fourth the unique HART identifier (normalized by the HART protocol description). This identifier serves as the link to the page describing the HART channel parameters. When the status of the channel is recognized as non-HART (analog 4/20 mA current loop) or NO CURRENT (current loop not connected), the three significant columns of the row are empty.
44.5.6 Access to Channel Parameters

This page provides the possibility of displaying the parameters of an active HART channel. The page that interfaces the user with this facility is presented in Figure 44.24. The page is organized in the form of a frame composed of three panes: a heading pane with the title banner, a menu pane giving access to the groups of parameters, and a parameter pane accessed either via the menus or via the scroll bar of the browser's window.
FIGURE 44.23 Table displaying channel status.

FIGURE 44.24 Page giving access to HART channel parameters.
The page externally resembles the one that gives access to the gateway parameters in read only mode (three panes, one of them accessed via menus placed in another). Functionally, there are two fundamental differences between them. First, the channel parameter page gives access to all channel parameters while respecting their mode: read only parameters are displayed as plain text sections, while read/write parameters correspond to active form entries. Second, the page design ensures that only potentially modifiable parameter values can be sent back to the server when the SUBMIT button is pushed. The process of parameter update is under the control of a CGI (script) routine.
44.5.7 Monitoring of Principal Channel Measure

All the above-described functions are fully oriented toward parameter tuning. The service offered either concerned global properties of the HART/FIP converter or acted on the characteristics of an individual channel, and its quality was comparable with that offered by the original configuration console. The function of recovery of the primary measure of a HART channel described in this section differs from the others in both the nature and the quality of the service offered. Functionally, it is no longer a parameter-tuning operation. The data handled in this operation do not concern the status of the channel itself but reflect the evolution of the phenomenon measured by the channel transmitter. For this reason, performing this operation from time to time, at irregular time points, and displaying numerical values on the screen does not provide much valuable information. Unfortunately, this is the only mode in which this function can be exploited via an ordinary tuning console.

The three-tier architecture enables a totally different implementation. The monitoring function is implemented by an HTML page accessible from the channel-tuning page via the link "primary measure" (see Figure 44.24). This page contains a Java applet that does more than display a numerical value. Its operation involves periodically fetching channel measures and displaying them in the form of a trend curve (Figure 44.25). Each curve point corresponds to a complete transaction between the applet and the server. The transaction is initiated by an HTTP request that activates an embedded script routine (servlet), which elaborates the primary measure of the channel by activating an appropriate HART command. The measure values obtained by the execution of this command are transported via an HTTP response (tunneled in an HTTP PDU). This solution ensures that the communication remains operational even when the browser machine executing the applet is connected outside of the system's security barrier.
44.5.8 Access Control and Authentication

HTTP-based systems are in principle open and accessible to any client that knows the URL of the server. This fact makes the system prone to unauthorized accesses and calls for the implementation of access control functions. In the case of this application, the protection is implemented in two ways:

• A natural authentication mechanism exploiting a standard HTTP feature; on the client side this mechanism is built into any standard web browser.
• Supplementary protection by password entry, which is built into the forms and handled by a specialized script.

The first mechanism is based on the access control procedure that is standard for the HTTP protocol. This procedure is based on the so-called authentication challenge transaction. According to the standards of the HTTP 1.0 protocol, any Uniform Resource Locator (URL) can point to a server resource that is accessible to a restricted set of users, each identified by a name and a password. When the browser accesses such a resource for the first time since its activation, the server produces a response initiating the challenge (a declaration of unauthorized access, see [2]). The browser reacts to this response by displaying a dialog box as in Figure 44.26.
FIGURE 44.25 Applet monitoring primary variable in a HART channel.
FIGURE 44.26 Authentication box for the French version of Internet Explorer.
The user is expected to fill in both text zones of the box and press the OK button. This operation repeats the previously executed request with a PDU option that presents the user's credentials to the server engine. The credentials contain the pair user name + user password, encoded according to the algorithm corresponding to the authentication mode. In the most popular authentication mode, called Basic Authentication, the credentials are coded according to the so-called base64 encoding. If, on the server side, the pair user name:password corresponds to the contents of one of the authentication records attached to the resource, the response PDU will contain the resource contents. In the opposite case, the authentication process fails, access to the resource is denied, and the authentication transaction is reiterated.
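For concreteness, such an exchange looks roughly as follows. The user name operator and password secret are purely illustrative, as are the realm and the URL path; b3BlcmF0b3I6c2VjcmV0 is the base64 encoding of the string operator:secret.

HTTP/1.0 401 Unauthorized
WWW-Authenticate: Basic realm="gateway"

GET /di80_parameters/params.htm HTTP/1.0
Authorization: Basic b3BlcmF0b3I6c2VjcmV0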
FIGURE 44.27 Message from the script that forces the user to fill in the password box.

FIGURE 44.28 Page signaling bad authentication password.
It is worth noting that the authentication transaction for a given subtree is done only once per client session. This means that for a protected subtree of a VFS repository, the challenge takes place only during the first request. Any request that follows automatically contains the authentication information. The Basic Authentication mode is not a reliable protection against unauthorized accesses, since the coding scheme of the credentials is simplistic and can be easily overcome. There are more powerful access control schemes, such as the one called Digest Authentication, in which the decoding of the credentials is more complex and which provides a higher level of security against undesirable intrusions.

In any case, for protected access, an authentication procedure based on standard HTTP features is not restrictive enough, since once identified, the client station can operate without further authentication while accessing a given URL. If the station in question passes under the control of an unauthorized user, the server will still answer positively to the client's requests, since the credentials are memorized and kept ready for each subsequent URL access until the end of the client's operation. To avoid any problem with this mode of authentication, another model, operating on an "authentication per request" basis, is to be used. This is implemented by inserting into some pages (forms) a supplementary text zone of "password" type, which must be filled in each time the form is submitted. The password is verified at each activation of the associated script. The form page should be edited in a manner that makes the submission of the form conditional on providing the password. This is frequently done by a page-embedded script that blocks the submission process when the password is not provided and prompts the user with a warning message (see Figure 44.27). Submission of the page with a wrong password causes the contents of the form to be rejected by the server, which sends back the refusal page, as in Figure 44.28.
44.5.9 Application Wrapper

All the components described above contribute to the implementation of the user interface to the organic application of the protocol gateway. They rely on the data provided through the basic application interface.
FIGURE 44.29 Part of the application wrapper providing access to the gateway parameters (server pages and scripts manipulate individual items such as the tag, descriptor, HART timeout, HART retries, filter constant, operation mode, version, and revision, while the basic application is reached through global routines such as get_manufacturer/set_manufacturer, get_io_mode/set_io_mode, di80_get_parameters, and di80_set_parameters).
However, they impose some requirements on the formats of these data:
• Access to the gateway parameters, some of them in read-only mode, on an individual basis.
• Access to the list of active HART channels, which should be updated before being served via an HTML page.
• Access to channel parameters, in reading and in writing, on an individual basis.
It is important that the values read from the application interface for all three types of data listed above have either the form of scalar integer values, scalar floating-point values, or zero-terminated character strings. The original interface to the gateway API does not fulfill these requirements. In its original version it provides the following functions:
• Global access to all parameters of the gateway, in reading as well as in writing (access routines return packets of coded values in read mode and accept only records of coded values in write mode).
• Global access to channel status data; it returns a packet of coded data in read mode.
• Low-level mode of access to HART channels via blocks of octets specifically coded according to HART protocol standards.
There is thus a definite need for a supplementary adaptor module that transforms the low-level data produced by the application tier into structured formats adapted to the mode of operation of the server. This module, the application wrapper, is split into two parts:
• A part providing access to the parameters, both in reading and in writing.
• A part managing access to the channels; this part groups the function of displaying the channel status table and the function of accessing channel-specific data via the HART protocol.
The part providing access to the gateway parameters is built around the data structure representing the gateway parameters as a persistent record. The record is updated either by the server scripts modifying individual fields or by calls to the API routines. The server-side routines manipulate data as individual scalar values and for this reason need access (both in reading and in writing) to the individual items of the record. The application side operates on the basis of global access to the data, that is, the gateway parameters are all set and read at once, by the activation of one of two interface routines (see Figure 44.29).
The second part of the application wrapper provides access to the eight HART channels connected to the protocol gateway. This part is structured around a table of eight records that represent the eight potentially active channels. Each record groups the parameters of an individual HART channel (see Figure 44.30).
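A possible shape of the persistent record at the heart of the first wrapper part is sketched below. The field list follows the parameters named in Figure 44.29, but the type declaration and accessor names are assumptions, not the actual FDWS or gateway API declarations.

    /* Sketch of the persistent record mirroring the gateway parameters
       (field list patterned on Figure 44.29; all names illustrative). */
    typedef struct {
        char  tag[32];
        char  descriptor[64];
        int   hart_timeout;
        int   hart_retries;
        float filter_constant;
        int   operation_mode;
        int   version;
        int   revision;
    } tgateway_params;

    static tgateway_params params;        /* the persistent record */

    /* Server-side scripts read and write individual scalar items... */
    int  get_hart_timeout(void)           { return params.hart_timeout; }
    void set_hart_timeout(int value)      { params.hart_timeout = value; }

    /* ...while the application side exchanges packets of coded values
       through the global API routines of Figure 44.29; the wrapper
       unpacks such a packet into the record above and packs the record
       back on writing. The coded packet layout is gateway-specific and
       omitted here. */
    extern void di80_get_parameters(void *coded_packet);
    extern void di80_set_parameters(const void *coded_packet);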
FIGURE 44.30 Access to HART channel parameters (server pages and scripts use routines such as get_transmitters, get_sensor_tag, set_sensor_tag_and_desc, and send_sensor_command, which reach the basic application through di80_get_instrumentlist and a HART channel buffer serving channels 1 to 8).
FIGURE 44.31 Summary of the configuration of the site embedded in the protocol gateway (the Home Page linking the directories di80_params, transmitters, measures, images, and _fpclass).
In general, the update of all parameters in this table is obtained via service requests sent to the HART transmitters over the eight channels of the gateway. From the user's point of view, not all the parameters are accessible by the same means. Parameters that define channel status and transmitter identity are obtained by one global command that updates one part of each record of the table simultaneously. Other parameters are grouped into collections that correspond to one aspect of transmitter operation (cell configuration, primary measure characteristics, measure units, etc.). Access to each parameter collection is ruled by one command in writing and one command in reading. Collections are not disjoint, and for this reason a given parameter may be accessible through different commands.
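Such a table of channel records might be declared along the following lines; the structure is inferred from the description above and from Figure 44.30, and every identifier is illustrative rather than taken from the actual sources.

    #define NB_CHANNELS 8

    /* Sketch of one channel record: the status and identity part is
       refreshed by the single global command, the remaining fields by
       per-collection HART read commands. */
    typedef struct {
        int   active;               /* channel status from the global scan */
        char  sensor_tag[32];       /* transmitter identity                */
        char  sensor_descriptor[64];
        float primary_value;        /* primary measure collection          */
        char  measure_unit[8];      /* measure unit collection             */
    } tchannel_record;

    static tchannel_record channels[NB_CHANNELS];

    /* Global update of status and identity for all eight records at once
       (corresponds to di80_get_instrumentlist in Figure 44.30). */
    extern void refresh_channel_list(tchannel_record table[], int count);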
44.6 Architecture Summary and Test Case Description for the Embedded Server
44.6.1 Embedded Site Architecture
An overview of the configuration of the server embedded in the protocol gateway is presented in Figure 44.31. The diagram in this figure shows the functional relationships among the different site components. The components are grouped in five directories, as shown in the figure.
The entry point of the embedded web of server pages is the HTML page named Home Page in the diagram. This page contains the hypertext links to the objects placed in the directories transmitters and di80_params, which represent the web domains responsible for browsing, respectively, the HART channels and the gateway parameters. Server objects placed in the directory measures are referenced by links embedded in the objects of the directory transmitters and are in charge of the graphical representation of channel measures. Objects placed in the directory images (embedded images) and in the directory _fpclass (embedded applets) have different relationships with respect to the other site objects. They are incorporated into the server pages rather than linked to them and, from the functional point of view, play an auxiliary role in the site operation. The contents of all five directories are presented below.
44.6.1.1 Directory Transmitters
This directory contains the set of embedded HTML pages that enable browsing of the parameters of active HART transmitters. The natural entry point to this realm of the embedded site is the page transmitter_list.htm, which represents the status of the eight HART channels that can potentially be connected to eight HART transmitters.
The interface to each potentially active channel is implemented by a collection of four pages. Potentially, there are eight groups of pages, one per channel, but only the pages corresponding to active channels can be displayed. The group of pages accessing the parameters of a channel is organized according to the following pattern (see Figure 44.32):
• Top-level channel front page of frame type incorporating three component HTML pages (HART_sensor0.htm–HART_sensor7.htm, one per channel): (a) banner page set.htm (shared by all channels), (b) menu page (menu0.htm–menu7.htm, one per channel), (c) parameter browsing form (sensor0.htm–sensor7.htm, one per channel)
• Password error signaling page (password0.htm–password7.htm, one per channel)
Pages contain links to other directories (see Figure 44.33) and incorporate elements from other directories. There are no direct links from one channel browsing page set to the other channels. All links should pass via the transmitter_list.htm page.
44.6.1.2 Directory di80_params
This directory contains the set of embedded HTML pages that enable browsing of the parameters of the gateway. The pages are organized according to the diagram shown in Figure 44.33. The pages provide access to two functions:
• Displaying all gateway parameters in read-only mode
• Modifying some of the parameters
The first function is accessible via a collection of four pages:
• Top-level, frame-type page get_di80_params.htm that wraps the three other pages located in the frame panes.
• Page upper_page.htm containing banners and links to other domains of the site.
• Menu page left_page.htm that supports direct selection of parameter groups; this page contains direct links to each of the five tables that group the gateway parameters.
• Page tables.htm that displays the actual values of the parameters.
FIGURE 44.32 Overview of pages placed in the directory transmitters.
FIGURE 44.33 Overview of pages placed in the directory di80_params.
FIGURE 44.34 Pages organized in the directory measures.
The second function is realized by two pages:
• Page set_di80.htm displays all modifiable parameters of the gateway. The page is organized as a parameter browsing form which, when submitted, activates a script routine in charge of updating the parameters.
• Page password.htm signals a password error in the submission of the page set_di80.htm.
Each of the two functions is attained by a separate entry point. Both can be left smoothly via the links to the Home Page of the server and to the transmitter list page.
44.6.1.3 Directory Measures
The directory measures groups the eight pages that correspond to the function of monitoring the values of the primary measure for every HART channel. The pages have no other function than wrapping the appropriate applets and providing exit links back to the transmitter parameters page. All eight pages are independent; there is no link among them, each is attained via a separate entry point, and each has its own exit link (see Figure 44.34).
44.6.1.4 Directory _fpclass
The directory _fpclass is the server domain whose name and existence are inherited from the server design pattern suggested by the site development tool. It contains three applets: two of them, proposed by the tool, implement active buttons that link pages. The third applet implements the display of signal trends and is used to monitor primary measures recovered from the active HART channels. The design of this applet is nontrivial, since it actively samples channel measures by activating a script routine on the server side. The routine is in charge of obtaining the channel measure via an appropriate HART command and of transferring it to the applet wrapped in an HTTP response PDU. Thus the data exchange between the applet and the server script passes through an HTTP tunnel and can easily traverse the security barriers of the site (firewalls).
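The server-side half of this tunnel can be pictured with the following sketch: a script routine obtains the current measure through a HART command and returns it as the body of an ordinary HTTP response, which the applet then parses. All names and signatures are illustrative; the real FDWS routines differ.

    #include <stdio.h>

    extern int   sock_write(int sock, const char *buf, int len); /* assumed */
    extern float read_primary_measure(int channel); /* issues the HART command */

    /* Script activated by the trend applet: wraps the current measure of
       'channel' into a plain-text HTTP response (the HTTP "tunnel"). */
    void measure_script(int sock, int channel)
    {
        char body[32], resp[160];
        float value = read_primary_measure(channel);
        int blen = snprintf(body, sizeof(body), "%.3f", value);
        int rlen = snprintf(resp, sizeof(resp),
            "HTTP/1.0 200 OK\r\n"
            "Content-Type: text/plain\r\n"
            "Content-Length: %d\r\n\r\n%s", blen, body);
        sock_write(sock, resp, rlen);
    }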
FIGURE 44.35 Schematic of the test platform.
44.6.1.5 Directory Images
The directory images is a flat collection of 16 images used by the other pages. There are no links between objects in this directory. Some of the images are used by many different pages placed in different server domains. The idea of such an organization of images is inherited from the site development tool.
44.6.2 Test Description
44.6.2.1 Test Platform
The summary of architecture presented above serves as a reference for the test scenarios described in this section. The system in which the tests were done is represented by the schematic in Figure 44.35. The embedded server is placed within the HART/FIP gateway connected to a segment of a 1 Mbit/sec twisted-pair, dual-medium WorldFIP fieldbus. Data transfers over the segment are organized by the bus arbitrator operating with a basic cycle of 20 msec. The HART interface of the gateway is configured in the mode of six inputs and two outputs. The HART channels are connected as follows:
• Channel 0: active, connected to a Rosemount 3051C pressure transmitter
• Channels 1 to 4: empty
• Channel 5: active, connected to a Rosemount 3144 temperature transmitter
• Channel 6: active, connected to a Fisher DVC 5000 valve
• Channel 7: simulated active by a resistance enabling a closed-current loop
The WorldFIP segment is connected to the CCD Ethernet-based Intranet via a router node implemented on a Compaq Deskpro computer under Windows NT4. Routing of TCP/IP traffic is done by the native TCP/IP protocol stack of Windows NT4, which works with a standard Ethernet PC board on the Intranet side and with the WorldFIP CC121 board controlled by the WorldFIP NDIS driver. Test scenarios are executed using a standard Internet browser connected to the Intranet. The two most popular Internet browsers were used for the tests: MS Internet Explorer V5 and Netscape Navigator V4.5.
44.6.3 Test Scenarios
This section describes nine test scenarios that comprise a necessary and sufficient set of operations to prove the correctness of the server's operation. The tests activate all phases of the server's life cycle and put under trial all designed functions embedded within the protocol gateway.
Some functions of the server are exercised by almost all test scenarios, except the first one. This concerns the generic functions of the server loop, namely: request parsing, requested object search and retrieval, and response generation. Another generic test concerns the process of merging static page templates with dynamically retrieved process data. Almost all HTML pages in the server structure are obtained by this operation, except the Home Page. These generic test objectives are not repeated in the descriptions of the test scenarios below.
44.6.3.1 Execution of Server Initialization Phase
The objective of this scenario is to test the smooth initialization of the server's data structures and the creation of the access point to the network. To execute the scenario, launch the server process and observe the server's console. The absence of error messages means that the software executed up to the beginning of the server loop: that is, the server's execution thread was created, the VFS tree is instantiated, the application wrapper is ready to communicate with the application, the server socket is created, and the server thread is waiting for client connections.
Possible erroneous reactions are: a system exception when no heap space is available, or a server thread error message when the server's passive socket cannot be created. Only these two fatal errors can be put into evidence by this scenario. The absence of any error message does not prove correct operation; to be sure that all the above operations are correctly executed, it is necessary to go through the remaining eight scenarios.
44.6.3.2 Server Access via Home Page
The objective of this test is to prove definitively the correct execution of some initial operations and to show that the page composer works on plain HTML pages with no data taken from the process interface. In this scenario the server machine should be called by the test browser, and the server should answer by sending the Home Page. The received page elements should be examined visually in order to detect any visible defect (incoherent text, jammed background, distorted images, applets that fail to operate). Move the mouse over the buttons of the main menu: the image should change, taking the form of a selected button. Click on a button; this should activate a hyperlink to one of the three pages of the site. It is necessary to test all three buttons of the page.
44.6.3.3 Authentication
The objective of this test is to prove the correct activation of the challenge transaction on protected realms. In this server, access to the realms di80_params and transmitters requires the client to present the authentication credentials (registered user name and correct password). Access to the Home Page of the server is not protected, but any of the three possible links leading from this page to the three destinations should trigger the authentication request.
To start the test sequence, first activate any of the three links, for example, the one to the transmitter list. This should force the server to return the response that activates the dialog box on the client's screen and forces the user to provide credentials. Submission of the credentials username = "hartfip" and password = "alstom" should open access to the page with the list.
From now on, any path within the server's structure should be open, and no authentication should be requested any longer until the execution of the client browser is stopped. If the protection on all three paths from the Home Page is to be tested, the browser should be restarted before the test of each path, in order to make it lose the credentials entered during the first authentication. Otherwise, the first authentication will open the access for all subsequent links from the Home Page to the realms of restricted access.
44.6.3.4 Page of di80_params in Read Only Mode
The objective of this test is to verify the part of the application wrapper module that retrieves the parameters. Activation of a link to this part of the server should cause the reception of a three-pane, frame-type page with the five parameter tables in the lower right pane and five links in the lower left pane.
Examine all page panes visually. The displayed image should be regular, with no apparent defects, and all tables in the parameter pane should be filled in with coherent values. All links in the menu pane should move the appropriate table to the top of the parameter pane. The single upper pane should contain the page banner and three links: to the Home Page, to the transmitter list page, and to the page that enables parameter modification. All links in this pane should be active and should lead to the expected functions. This page should be refreshed every 60 sec.
44.6.3.5 Page of Gateway Parameters in Modification Mode
The objective of this test is threefold: primarily it tests the script that analyzes the set of data provided by the HTTP POST service, secondarily it tests the application wrapper function that modifies the parameters of the gateway, and finally it tests the technique of dynamic generation of some sophisticated parts of HTML pages such as pop-up menus, checkboxes, and groups of radio buttons.
The page that corresponds to the function contains a form composed of five groups of items, each form item corresponding to a modifiable gateway parameter. In the part of the scenario concerning the script activated by the POST request, the following test cases are incorporated:
• Test of the functions that control the coherence of the formats of parameters entered into the form items.
• Test of the effectiveness of parameter modifications.
• Test of the control of access protection by password.
The external links to other server functions should also be tested.
44.6.3.6 Retrieval of List of Active Channels
The objective of this test is to verify the part of the application wrapper module that is responsible for providing the global status of the eight HART channels connected to the gateway, and to test the technique of dynamic generation of large context-dependent sections of an HTML page. As a result of the request of the page with the channel list, all three HART transmitters should be identified and described. The channel with the resistance simulating a closed-current loop should be declared a non-HART device. All four empty channels should be labeled as inaccessible. Links giving access to the individual transmitter pages should be displayed as active. Disconnection of a channel should be visible in the table after the page update.
The test scenario of this page includes the test of the effectiveness of all the links in this page: those to channel descriptions and those to other server functions.
44.6.3.7 Access to Channel Parameters
The primary objective of this test is to verify the part of the application wrapper module that is in charge of controlling the HART channel parameters. It also tests the correct operation of the script that coordinates the process of channel parameter handling, as well as the dynamic generation of pop-up menus, checkboxes, and radio buttons.
The tests done in this scenario resemble those described for the gateway parameter modification, since one part of this page incorporates a form. Test cases for this page also concern the coherence of the data entered into the form and the function of password protection.
The test for the page takes into account the verification of the external links to the Home Page, to the transmitter list, and to the parameter setting function.
44.6.3.8 Trend Applet
The objective of this test is conceptually more sophisticated than that of the other scenarios. It puts under verification the data exchange between a Java applet and a server script based on the principle of data tunneling via HTTP protocol PDUs. The test scenario includes activation of this page and verification of the following cases:
• Applet initialization and start-up
• Trend refreshment
• Applet stop phase
• Applet restart
• Effectiveness of the link back to the channel description
This test should be done for every active channel.
44.6.3.9 Call of a Nonexistent Server Object
The objective of this test is to verify the correct reaction of the server to a request concerning a nonexistent object. To initiate this test, the browser should request a nonexistent server object. This can be done by manually entering the object's URL into the browser's address box. The correct reaction of the server is the transmission of the page signaling the absence of the requested object.
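The expected server-side behavior can be sketched in a few lines; as in the earlier sketches, sock_write stands for the server's output routine, and the page text is illustrative.

    #include <stdio.h>

    extern int sock_write(int sock, const char *buf, int len); /* assumed */

    /* Correct reaction to a request for a nonexistent object: a 404
       status line plus a small HTML page signaling the absence. */
    void send_not_found(int sock, const char *uri)
    {
        char resp[512];
        int n = snprintf(resp, sizeof(resp),
            "HTTP/1.0 404 Not Found\r\n"
            "Content-Type: text/html\r\n\r\n"
            "<html><body>The object %s does not exist on this server."
            "</body></html>", uri);
        sock_write(sock, resp, n);
    }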
44.7 Summing Up
This account was conceived as a complement to the reference manual describing the FDWS software modules. It assists developers using the FDWS software library in designing clean and efficient implementations of embedded servers.
The document is organized in a manner that allows it to be used as a self-standing guide to the technique deployed by the FDWS software. It presents the principles of operation of embedded servers, sketches the basis of the technology, and leads the designer through a real-life example toward the solution of a concrete design case. The major mission of the document is to facilitate the use of the FDWS module library, which contains many routines and is not easily accessible without a guide.
The presentation of the technology is deliberately conceived as platform independent (no links to a development tool or to a target platform). The reason for this is that the basic idea of the software is its platform independence. With the same facility, embedded servers based on FDWS technology can be incorporated into a PLC, an I/O nest, or an industrial PC.
The developed software library constitutes the first step on the way to a complete and universally applicable technology. A supplementary effort is necessary in order to increase the software's utility. The biggest progress is needed in the domain of a configuration tool (or tool suite) that would significantly increase user comfort in the process of implementing embedded HTTP servers.
References
[1] J. Bentham, TCP/IP Lean: Web Servers for Embedded Systems, CMP Books, May 2002.
[2] T. Berners-Lee, R. Fielding, and H. Frystyk, Hypertext Transfer Protocol — HTTP/1.0, Network Working Group, RFC 1945.
44.A1 Appendix: Configuration of VFS
44.A1.1 Programming of VFS Component
One of the most important components of an embedded server structure is the VFS. This component hosts the central data structure that manages the embedded objects targeted by remote client requests. Building such a component consists in the dynamic generation of adequate data structures organized in the form of repository trees. It should be recalled that the VFS is composed of three basic elements, realized in general by disjoint modules:
1. A tree-like data structure whose role is equivalent to that of the file system management tables. Through this data structure, called the repository skeleton, the user can find, read, and modify the files embedded in the host environment.
2. A collection of routines which process this data structure.
3. A collection of memory regions storing the embedded files.
The generation process of repository skeletons is programmed as a sequence of calls to specialized routines which create and link together the nodes of the repository (repository root, directory nodes, file nodes, and script nodes) and attach the memory regions storing embedded data to the file nodes (a hand-coded sketch of such a call sequence is given after Figure 44.A1). It is important to state that the routines that process embedded file nodes in order to satisfy remote client requests should be generated coherently with the structure of the repository.
The method of programming the VFS generation is straightforward, but for big and somewhat complicated repositories, manual maintenance of the generation code (especially keeping its three above-mentioned elements coherent) can become awkward and time-consuming. The regularity of the operations required to obtain the complete VFS component suggests the possibility of automating its production from a higher-level specification. The idea consists in specifying the repository structure and the operations necessary to generate the requested server entities within the same description, expressed in a specialized high-level language. The transformation of such a specification can be realized by a tool that compiles the specification file into executable modules of C-language code (Figure 44.A1). The specification file is expressed in the language whose syntax is described below.
FIGURE 44.A1 Compilation of a VFS description into executable modules (the VFS compiler transforms the specification into a VFS generation module and a page generation module).
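For illustration, a hand-coded generation sequence, using the routine names that appear in the generated code of Figure 44.A4, might look as follows. The meaning of the numeric arguments of BuildFileNode is not documented here, so the values shown are patterned on that figure and should be treated as assumptions.

    #include "data_base_processing.h"   /* as included by Figure 44.A4 */

    extern char *indexnew_str;   /* memory region holding the root page */
    extern char *hello_page_str; /* memory region holding one HTML page */

    /* Hand-coded sketch of a small repository skeleton. */
    tdata_base_struct build_small_repository(void)
    {
        tdata_base_struct rep, dir, node;

        /* Root of the file repository with its by-default page. */
        rep  = InitRepository(NULL, "ROOT", "page_root");
        node = BuildFileNode("ROOT", indexnew_str, 0, 0, 1);
        AppendNode(rep, node);

        /* One embedded directory holding a single HTML page. */
        dir  = BuildDirNode("public");
        node = BuildFileNode("hello.htm", hello_page_str, 0, 1, 1);
        InsertNode(dir, node);
        AppendNode(rep, dir);

        return rep;
    }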
FIGURE 44.A2 VFS specification structure (a global section of C code followed by file and/or script repository specifications).
44.A1.1.1 Specification Structure
A VFS specification is composed of two sections (Figure 44.A2):
• An optional global declaration section
• An obligatory repository specification section
The global declaration section, which can be omitted in some simple specifications, contains data structure definitions and routines programmed directly in the C language. These routines can be invoked in some parts of the second section of the specification. The second section contains the description of one or two repositories (a file repository and/or a script repository). The specification of at least one repository is obligatory (Figure 44.A2).
A repository specification is composed of three obligatory elements and one optional one:
• Repository type: this element allows the compiler to distinguish between a file repository and a script repository.
• Repository name: a character string necessary to identify the repository tree.
• By-default node description: a file node specification, which is necessary for the description of a file repository.
• Repository body: the list of nodes composing the repository; this list is composed of a sequence of file, script, and directory node specifications, and it can be empty.
Each element of the list is described according to a specific syntax. All the elements have a name and a qualifier that identifies the node type (file, script, or directory). Script specifications contain the pointer to the script routine. File descriptions may contain pointers to the memory region containing the file data and a file qualifier that determines the nature of the file contents (HTML page, embedded image, embedded applet). Embedded directory descriptors contain a directory body, of exactly the same nature as the repository body, and optionally a list of records holding access credential descriptors (username-password pairs).
The repository root, together with the list of nodes from the repository body, spans the first level of the repository tree. File (and script) nodes are the leaves of the tree, while the directories span the subtrees of the repository. To each leaf node qualified as an HTML page one can attach the following two optional sections:
• A list of variables that convey data from outside of the HTML page structure and that serve to inject the data into the page structure.
• Sections of C code that process data before merging them with the page template.
This brief textual description of the VFS specification language is followed by a more rigorous syntax specification.
44.A1.2 BNF of Specification Language
The syntax of the VFS specification presented below uses a widely accepted version of the specification meta-language EBNF (Extended Backus-Naur Form). This version uses the following conventions:
• Keywords are in bold uppercase and are enclosed in double quotes.
• Nonterminal symbols are in lower case.
• Single-character terminal symbols are enclosed in single quotes.
• The symbols ::= (derivation) and [ ] (optional section) are part of the meta-language.
• IDENT, STRING, NUMBER, and SPECIAL_STRING are meta-keywords (keywords transporting a value).
specification ::= ["GLOBAL" '{' target_code '}'] file_repository_spec [script_repository_spec]
file_repository_spec ::= "" "MAIN" rep_name def_file_spec cplx_node_body ""
script_repository_spec ::= "" "CGI" rep_name cplx_node_body ""
def_file_spec ::= file_spec
rep_name ::= IDENT
cplx_node_body ::= [access_list] node_list
node_list ::= node_spec | node_list node_spec
node_spec ::= file_spec | script_spec | directory_spec
directory_spec ::= "" dir_name cplx_node_body ""
dir_name ::= IDENT
file_spec ::= "" file_name ["CONTENTS" '=' region_name file_proc]
script_spec ::= "<SCRIPT>" script_name "ROUTINE" '=' routine_name
file_proc ::= qualifier ['(' param_section ')'] ['{' target_code '}']
qualifier ::= "TEXT" | "SIZE" '=' NUMBER nature
nature ::= "GIF" | "JPEG" | "JAVA" | "TEXT" | "PLUGIN"
param_section ::= param_spec | param_section param_spec
param_spec ::= par_qualif par_name ':' type ['=' init_value]
par_qualif ::= "DATA" | "FREE"
par_name ::= IDENT
init_value ::= NUMBER | STRING
target_code ::= SPECIAL_STRING
type ::= "INTEGER" | "FLOAT" | "STRING"
44.A1.3 Specification Example
The specification text below gives the complete description of the VFS component presented in Figure 44.A3. The figure shows the repository that contains the following nodes:
• One by-default page named ROOT.
• A directory public that contains five HTML pages: gauge1, gauge2, di80_param_form, dvc5000_1, and dvc5000_2.
• A directory images that contains six images in GIF and JPEG format: alstom, DI80Mimic, ccd, HartMimic1, sensor, and valve.
• A directory javadir that contains one file: Trend.class.
global{ #include "env_var.h" }
main page_root
<default> ROOT CONTENTS = indexnew_str TEXT
public.one
    gauge1.htm CONTENTS = Hello_page_str TEXT(
        FREE Ititle:STRING = "Furnace Temperature"
        FREE legend:STRING = "Furnace Temperature"
        FREE IButton:STRING = "gauge 1"
        FREE Imval:INTEGER = 550
        FREE Iinit:INTEGER = 80
        FREE Iinterval:INTEGER = 5
        FREE Iscriptname:STRING = "gen.cgi?genPar=22"
        FREE yaxlegend:STRING = "I/sec")
    gauge2.htm CONTENTS = Hello_page_str TEXT(
        FREE Ititle:STRING = "Cooling Fluid Temperature"
        FREE legend:STRING = "Cooling Fluid Temperature"
        FREE IButton:STRING = "gauge 2"
        FREE Imval:INTEGER = 150
        FREE Iinit:INTEGER = 55
        FREE Iinterval:INTEGER = 2
        FREE Iscriptname:STRING = "gen.cgi?genPar=69"
        FREE yaxlegend:STRING = "m")
    di80_param_form.htm CONTENTS = di80_param_form_str TEXT(
        DATA tag_name:STRING = "cooler"
        DATA time_out:INTEGER = 300
        DATA retries:INTEGER = 3
        DATA refreshment:INTEGER = 75
        DATA promptness:INTEGER = 100
        DATA version:INTEGER = 1
        DATA revision:INTEGER = 1
        DATA filter:FLOAT = 0.44
        FREE Icheckstr:STRING = "checked"
        FREE Icheckstr_bis:STRING = " "
        FREE Iselstr:STRING = "selected"
        FREE Iselstr_bis:STRING = ""
        {char* interm; tdb_result Ires;
         get_db_data("OperatingMode",&Ires);
         if(Ires.result==c_string && Ires.value.string!=NULL)
             if(strcmp(Ires.value.string,"OPERATIONAL")==0 && strcmp(Icheckstr,"checked")!=0 ||
                strcmp(Ires.value.string,"INITIALISATION")==0 && strcmp(Icheckstr,"checked")==0){
                 interm = Icheckstr; Icheckstr = Icheckstr_bis; Icheckstr_bis = interm;
             };
         get_db_data("MesureFormat",&Ires);
         if(Ires.result==c_string && Ires.value.string!=NULL)
             if(strcmp(Ires.value.string,"ANALOG")==0 && strcmp(Iselstr,"selected")!=0 ||
                strcmp(Ires.value.string,"DIGITAL")==0 && strcmp(Iselstr,"selected")==0){
                 interm = Iselstr; Iselstr = Iselstr_bis; Iselstr_bis = interm;
             };
        })
    dvc5000_1.htm CONTENTS = dvc5000_str TEXT(
        FREE Iact:STRING = "dvc5000_1"
        FREE Ititle:STRING = "Input Valve"
        FREE Imamps:FLOAT = 0.0
        FREE Itrav:FLOAT = 0.0
        DATA IDriveSign:FLOAT = 0.0
        {Imamps = 4.0+16.0*IDriveSign.value.real/100.0; Itrav = IDriveSign.value.real;})
    dvc5000_2.htm CONTENTS = dvc5000i_str TEXT(
        FREE Iact:STRING = "dvc5000_2"
        FREE Ititle:STRING = "Output Valve - inversed drive"
        FREE Imamps:FLOAT = 0.0
        FREE Itrav:FLOAT = 0.0
        DATA IDriveSignl:FLOAT = 0.0
        {Imamps = 4.0+16.0*(1-IDriveSignl.value.real/100.0); Itrav = IDriveSignl.value.real;})
images
    alstom.gif CONTENTS = alstom_img SIZE = alstom_img_length GIF
    DI80Mimic.gif CONTENTS = DI80Mimic_img SIZE = DI80Mimic_img_length GIF
    ccd.gif CONTENTS = ccd_img SIZE = ccd_img_length GIF
    HartMimic1.jpeg CONTENTS = HartMimic1_img SIZE = HartMimic1_img_length JPEG
    sensor.gif CONTENTS = sensor_img SIZE = sensor_img_length GIF
    valve.gif CONTENTS = valve_img SIZE = valve_img_length GIF
javadir
    Trend.class CONTENTS = Trend_bcode SIZE = Trend_bcode_length JAVA
cgi script_rep
    <script> set.cgi ROUTINE = first_script
    <script> xy_coordinates.cgi ROUTINE = coordinates
    <script> gen.cgi ROUTINE = generator
FIGURE 44.A3 Configuration script of the embedded server contents.
The configuration file also contains the CGI repository, which holds three scripts. The compilation of this specification file produces two files in ANSI C: one generating the VFS skeleton and the other generating the page composition routine. Below we present the contents of the file generating the VFS skeleton (Figure 44.A4).
#include "DI80_VFS_0.h"
#include "data_base_processing.h"

extern tdata_base_struct server_root;  extern tdata_base_struct cgi_bin;
extern char* indexnew_str;             extern char* starter_str;
extern char* Hello_page_str;           extern char* di80_param_form_str;
extern char* dvc5000_str;              extern char* dvc5000i_str;
extern int alstom_img_length;
extern int DI80Mimic_img_length;
extern int ccd_img_length;
extern int HartMimic1_img_length;
extern int sensor_img_length;
extern int valve_img_length;
extern int Trend_bcode_length;
extern void first_script(int,...);
extern void coordinates(int,...);
extern void generator(int,...);

static tdata_base_struct db_page_root_gen_0(void)
{
    tdata_base_struct Irepository;
    tdata_base_struct Iptrstack[10];
    Irepository = InitRepository(NULL,"ROOT","page_root");
    Iptrstack[0] = BuildFileNode("ROOT",indexnew_str,0,0,1);
    AppendNode(Irepository,Iptrstack[0]);
    Iptrstack[0] = BuildFileNode("starter.htm",starter_str,0,1,1);
    AppendNode(Irepository,Iptrstack[0]);
    Iptrstack[0] = BuildDirNode("public.one");
    Iptrstack[1] = BuildFileNode("gauge1.htm",Hello_page_str,0,2,1);
    InsertNode(Iptrstack[0],Iptrstack[1]);
    Iptrstack[1] = BuildFileNode("gauge2.htm",Hello_page_str,0,3,1);
    InsertNode(Iptrstack[0],Iptrstack[1]);
    Iptrstack[1] = BuildFileNode("di80_param_form.htm",di80_param_form_str,0,4,1);
    InsertNode(Iptrstack[0],Iptrstack[1]);
    Iptrstack[1] = BuildFileNode("dvc5000_1.htm",dvc5000_str,0,5,1);
    InsertNode(Iptrstack[0],Iptrstack[1]);
    Iptrstack[1] = BuildFileNode("dvc5000_2.htm",dvc5000i_str,0,6,1);
    InsertNode(Iptrstack[0],Iptrstack[1]);
    AppendNode(Irepository,Iptrstack[0]);
    Iptrstack[0] = BuildDirNode("images");
    Iptrstack[1] = BuildFileNode("alstom.unused.gif",alstom_img,alstom_img_length,7,2);
    InsertNode(Iptrstack[0],Iptrstack[1]);
    Iptrstack[1] = BuildFileNode("DI80Mimic.old.gif",DI80Mimic_img,DI80Mimic_img_length,8,2);
    InsertNode(Iptrstack[0],Iptrstack[1]);
    Iptrstack[1] = BuildFileNode("ccd.unused.gif",ccd_img,ccd_img_length,9,2);
    InsertNode(Iptrstack[0],Iptrstack[1]);
    Iptrstack[1] = BuildFileNode("HartMimic1.jpeg",HartMimic1_img,HartMimic1_img_length,10,3);
    InsertNode(Iptrstack[0],Iptrstack[1]);
    Iptrstack[1] = BuildFileNode("sensor.gif",sensor_img,sensor_img_length,11,2);
    InsertNode(Iptrstack[0],Iptrstack[1]);
    Iptrstack[1] = BuildFileNode("valve.gif",valve_img,valve_img_length,12,2);
    InsertNode(Iptrstack[0],Iptrstack[1]);
    AppendNode(Irepository,Iptrstack[0]);
    Iptrstack[0] = BuildDirNode("javadir");
    Iptrstack[1] = BuildFileNode("Trend.class",Trend_bcode,Trend_bcode_length,13,4);
    InsertNode(Iptrstack[0],Iptrstack[1]);
    AppendNode(Irepository,Iptrstack[0]);
    return Irepository;
}

static tdata_base_struct db_script_rep_gen_1(void)
{
    tdata_base_struct Irepository;
    tdata_base_struct Iptrstack[10];
    Irepository = InitRepository(NULL,NULL,"script_rep");
    Iptrstack[0] = BuildScriptNode("set.cgi",16,first_script);
    AppendNode(Irepository,Iptrstack[0]);
    Iptrstack[0] = BuildScriptNode("xy_coordinates.cgi",17,coordinates);
    AppendNode(Irepository,Iptrstack[0]);
    Iptrstack[0] = BuildScriptNode("gen.cgi",18,generator);
    AppendNode(Irepository,Iptrstack[0]);
    return Irepository;
}
FIGURE 44.A4 Code generated by the configuration tool set from the script in Figure 44.A3.
45 HTTP Digest Authentication for Embedded Web Servers

Mario Crevatin and Thomas P. von Hoff, ABB Switzerland Ltd.

45.1 Introduction
  Motivation • Security Objectives • Outline
45.2 Security Extensions in the TCP/IP Stack
  Link Layer Security • IPSec • Secure Sockets Layer/Transport Layer Security • Application Layer Security
45.3 Basic Access Authentication Scheme
45.4 DAA Scheme
  Cryptographical Prerequisites • Digest Authentication • Digest Authentication with Integrity Protection • Digest Authentication with Mutual Authentication • Summary
45.5 Weaknesses and Attacks
  Basic Authentication • Replay Attacks • Man-in-the-Middle Attack • Dictionary Attack/Brute Force Attack • Buffer Overflow • URI Check
45.6 Implementations
  Servers • Browsers • DAA Compatibility
45.7 Conclusions
Appendix: A Brief Review of the HTTP
Acknowledgment
References
45.1 Introduction
45.1.1 Motivation
The application area of the Hypertext Transfer Protocol (HTTP) is becoming larger and larger. While it was originally intended as the protocol to transfer HTML files, it is increasingly used by other applications. One reason is that the port assigned to HTTP is almost never blocked by a firewall. Thus, running an application on top of HTTP makes it possible to communicate through network security elements such as packet filters. Examples of such applications are web mail and Web-based Distributed Authoring and Versioning (WebDAV) [1,2]. Since these web services contain no security features in their specification, they depend
on security provided by HTTP or lower protocol layers. Most implementations of protocols below HTTP do not provide user authentication; hence this service is offered by extensions to HTTP, namely basic and Digest Access Authentication (DAA) [3].
In today's industrial communication, the trend is to replace proprietary communication protocols by the standardized TCP/IP protocol stack [4]. This is also owing to the increased connectivity of automation networks, which opens new opportunities to improve the efficiency of operations and maintenance of automation systems. In the course of this development the number of embedded web servers has increased rapidly. These web servers allow web-based configuration, control, and monitoring of devices and industrial processes. Owing to the connectivity of the communication networks of the various hierarchy levels (control network, Local Area Network [LAN], Wide Area Network [WAN]), establishing access to any device from any place in the plant, or even globally, becomes technically feasible. However, in addition to many opportunities, this technology also leads to many security challenges [13].
Usually embedded web servers run on processors with limited resources, both in terms of memory and of processor power. These restrictions favor the deployment of lightweight security mechanisms. Vendors offer tailored versions of the comprehensive security protocol suites such as Secure Sockets Layer (SSL) and the IP Security Protocol (IPSec). However, these versions may still not be suitable for all types of processors and applications, owing to their requirements on memory and computational power. In the case where applications are restricted to HTTP, DAA is an alternative solution [5]. This protocol extension to HTTP is economical in terms of memory and processor power requirements. Although designed primarily for user authentication, many more services have been included in its original definition. In this chapter, we focus on the mechanisms and services as well as on the potential applications of HTTP digest authentication.
45.1.2 Security Objectives
We distinguish the following security objectives for communication systems:
• Confidentiality: Guarantees that information is shared only among authorized persons or organizations. Encryption of the transmitted data using cryptography prevents unauthorized disclosure.
• Integrity: A system protects the integrity of data if it makes any modification detectable. This can be achieved by adding a cryptographic checksum.
• Authenticity: Guarantees that the receiver of a message can ascertain its origin and that an intruder cannot masquerade as an authorized person. Authenticity is a prerequisite for access control.
• Access control: Guarantees that only authorized people or devices have access to specific information.
• Availability: Guarantees that a resource is always available.
In business and commercial environments, auditability, nonrepudiability, and third-party protection also belong to the set of security objectives. Note that the relevance of the individual security objectives varies from case to case and depends much on the specific application. A business web application where monetary transactions may be involved has different security requirements than an industrial application. While for the former confidentiality of the data transfer is a major issue, this is less sensitive in the latter case. In turn, other security objectives such as user authentication and integrity protection are much more critical in industrial communication. These considerations become an issue in particular when the embedded web server is not within a well-protected network, but is installed at a remote location. Such situations may occur in distributed applications.
45.1.3 Outline
First, an overview of the services of the security extensions in the TCP/IP protocol suite is given, with a focus on SSL and IPSec. Starting with a brief review of the HTTP message exchange, the mechanisms of HTTP basic and digest authentication are detailed and all their additional useful options (integrity protection and mutual authentication) are discussed. Furthermore, the current implementation status of some (embedded) web servers (Apache 2.0.42, Allegro RomPager 4.05, GoAhead 2.1.2) and browsers (Mozilla 1.01, Internet Explorer 6.0.26, Opera 6.05) is investigated. The results of functionality and interoperability tests are presented.
45.2 Security Extensions in the TCP/IP Stack
Security services are provided at different layers in the TCP/IP communication protocol suite by appropriate protocol extensions [6]. An overview of these extensions is shown in Figure 45.1. The communication protocol stack concept makes the security services of a given layer transparent to the upper layers. The security extensions on the Internet layer and the transport layer, IPSec and SSL, respectively, provide a wide range of security services and have therefore been widely implemented.
45.2.1 Link Layer Security
As extensions to the Point-to-Point Protocol (PPP), the (cryptographically weak) Password Authentication Protocol (PAP) and the stronger Challenge Handshake Authentication Protocol (CHAP) provide authentication. To establish secure tunnels over a PPP connection into a LAN or a WAN, the Point-to-Point Tunneling Protocol (PPTP) or the Layer 2 Tunneling Protocol (L2TP) can be used.
45.2.2 IPSec
This network layer security protocol is particularly useful if several network applications need to be secured. As protection is applied at the IP layer, IPSec provides a single means of protection for most data exchanges (UDP and TCP applications). It is transparent to all upper layers. The security services provided by IPSec are:
• Access control (IP filtering)
• Data integrity
• Encryption (optional)
• Data origin authentication (optional)
These services are based on cryptographic mechanisms guaranteeing a high security level when used with strong algorithms. However, a drawback of IPSec is that a specific configuration is required for each host-to-host link. While IPSec provides machine-to-machine security, it cannot perform authentication of the user. Therefore, IPSec is mainly deployed to establish Virtual Private Networks (VPNs).
An IPSec implementation on a Coldfire MCF5307 65 MHz processor showed a program memory requirement of 64 KB, without support of the Internet Key Exchange (IKE) protocol.
FIGURE 45.1 Network layers and associated security protocols: application layer (basic/digest authentication, PGP, SSH), transport layer (SSL, TLS), Internet layer (IPSec), link layer (PAP, CHAP, PPTP, L2TP).
Experiments consisting of ping requests between two Coldfire processors were performed. The delay between a ping request and the reception of its reply was observed to become two or even three times longer compared with the unprotected case when IPSec was activated using the Authentication Header (AH) or the Encapsulation Security Payload (ESP) configuration, respectively.
45.2.3 Secure Sockets Layer/Transport Layer Security
SSL is a protocol created by Netscape Communications Corporation. The standardized version is also known as Transport Layer Security (TLS). SSL is transparent to the end user and to the upper protocol layers. It protects all applications running on top of TCP, but does not protect UDP applications. The "https" prefix of the URI (Uniform Resource Identifier) and the lock icon on the browser GUI indicate that the SSL protocol is in use. If the server's certificate is not signed by a certificate authority trusted by the client (browser), the user is prompted to accept or to refuse the certificate. The security services provided by SSL/TLS are:
• Session key management and negotiation of cryptographic algorithms
• Confidentiality using encryption
• Server authentication using certificates
• Data integrity protection
Secure Sockets Layer includes optional client authentication, which is rarely performed in practice. Under the encryption protection provided by SSL, user authentication is often implemented at the application level. In summary, SSL provides a high level of security, but has high memory and computation requirements, particularly when considering the constraints of embedded web servers.
45.2.4 Application Layer Security
The procedures described in the previous sections operate on the lower layers and focus on the authentication of machines. On the application layer, individual applications may provide their own security enhancements. Typical security tools are PGP/GnuPG to secure mail transfer and SSH (secure shell). For HTTP, there exist the protocol extensions HTTP basic and digest authentication, which authenticate users in order to control their access to protected documents. This authentication of the user contrasts with the machine authentication provided by the protocols described in the sections above. Since the remainder of this chapter focuses on the protocol extensions of HTTP, a brief review of HTTP is given in the Appendix.
45.3 Basic Access Authentication Scheme
The HTTP basic authentication scheme [3] is the simplest authentication scheme and provides only weak protection. This is owing to the fact that username and password can be discovered by eavesdropping on the message exchange. The HTTP message exchange for basic authentication is depicted in Figure 45.2. On reception of a "401 unauthorized" message, the client (browser) prompts the user for his or her username and password. These are transmitted in clear over the link in the "authorization" request-header-field, for each accessed document within the same protection space (Figure 45.3).
1. The browser issues an HTTP GET command to the server, with the requested URI.
2. The server answers with a "401 unauthorized" HTTP error code and requests the browser to send a valid username and password (credentials)1 using basic authentication. The realm2 (string) is also included in the challenge sent to the client. These parameters are part of the "WWW-authenticate" response-header-field.
1 Credentials: Information that can be used to establish the identity of an entity. Credentials include things such as private keys, tickets, or simply a username and password pair. This is also known as the shared secret.
2 Realm: Name identifying a protection space (zone) on a server. Usually shown to the user at the password prompt.
FIGURE 45.2 HTTP basic authentication negotiation:
(1) Client → Server: GET URI HTTP/1.1
(2) Server → Client: HTTP/1.1 401 unauthorized, WWW-Authenticate: Basic realm="Basic Test Zone"
(3) Client → Server: GET URI HTTP/1.1, Authorization: Basic dGVzdDp0ZXN0
(4) Server → Client: HTTP/1.1 200 OK
(5) Subsequent requests carry the same Authorization header
FIGURE 45.3 Internet Explorer's basic authentication prompt.
3. The browser prompts the user for username and password. The realm (here "Basic Test Zone") is usually displayed to the user. The credentials are sent with a new GET request, encoded in Base64 format. Decoding is trivial because Base64 is a simple, invertible encoding scheme. The credentials are sent in the "authorization" request-header-field.
4. After the server has checked and accepted the password, the requested document is sent to the client with an HTTP 200 response.
5. The client displays the document, and automatically sends the same credentials for any subsequent request made under the same protection space. Hence the password is sent unencrypted with each request.
Note that, unless confidentiality is provided by some other security protocol on a lower layer (see Section 45.2), username and password are transmitted in an unprotected way.
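The Base64 step can be reproduced in a few lines of C. The following sketch is a minimal encoder (no line wrapping); applied to the string "test:test" it yields exactly the token dGVzdDp0ZXN0 visible in Figure 45.2. It is illustrative code, not taken from any particular browser or server.

    #include <stdio.h>

    /* Minimal Base64 encoder, enough to reproduce the credential
       encoding of HTTP basic authentication. */
    static void base64_encode(const unsigned char *in, size_t len, char *out)
    {
        static const char tab[] =
            "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
        size_t i, o = 0;
        for (i = 0; i + 2 < len; i += 3) {        /* full 3-byte groups */
            out[o++] = tab[in[i] >> 2];
            out[o++] = tab[((in[i] & 0x03) << 4) | (in[i + 1] >> 4)];
            out[o++] = tab[((in[i + 1] & 0x0f) << 2) | (in[i + 2] >> 6)];
            out[o++] = tab[in[i + 2] & 0x3f];
        }
        if (i < len) {                            /* 1 or 2 trailing bytes */
            out[o++] = tab[in[i] >> 2];
            if (i + 1 < len) {
                out[o++] = tab[((in[i] & 0x03) << 4) | (in[i + 1] >> 4)];
                out[o++] = tab[(in[i + 1] & 0x0f) << 2];
            } else {
                out[o++] = tab[(in[i] & 0x03) << 4];
                out[o++] = '=';
            }
            out[o++] = '=';
        }
        out[o] = '\0';
    }

    int main(void)
    {
        char buf[64];
        base64_encode((const unsigned char *)"test:test", 9, buf);
        printf("Authorization: Basic %s\n", buf); /* ...Basic dGVzdDp0ZXN0 */
        return 0;
    }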
45.4 DAA Scheme
45.4.1 Cryptographical Prerequisites
Unless public keys are used, authentication is based on a shared secret (credentials) between the authenticating and the authenticated entity. Usually, these credentials consist of the relation between a username and its password (Figure 45.4).
FIGURE 45.4 IE's digest authentication prompt.
One possibility to authenticate a peer over the network is the submission of username and password, as executed in the basic authentication scheme (see Section 45.3). However, since they are transmitted in clear over the network, an attacker having access to the network traffic can eavesdrop on the credentials. The challenge/response concept solves this problem by avoiding sending the password in clear. Instead, the authenticating entity A sends a challenge x to the entity B to be authenticated. B calculates the response zB = f(x, y), where y is the shared secret between A and B. A also calculates zA = f(x, y) and checks whether zA coincides with zB. If so, the identity of B is proven to A. To make the challenge-and-response procedure secure, there are two requirements. First, x needs to be random, so that zB is of no value to any attacker intercepting it. Second, f should be a one-way hash function [7]. The properties of hash functions are:
• A finite-length output message (hash) is calculated from an arbitrary-length input message.
• It is easy to determine the output message.
• Given an output message, it is hard to find a corresponding input message.
• It is hard to find another input message with the same output message (hash).
Functions having these properties meet the requirements for cryptographic checksums. It is their quasi-uniqueness (it is hard to find two input messages producing the same output) that allows the owner B of a message z to show only the message's hash to another user A, while A can still trust that z is known by B. The same property is used to protect the integrity of data storage or message transmission: a comparison of the hash of the received message or the stored data with the original hash will detect any change in the message or the data. The most frequently used hash functions are MD5 and SHA-1 [7].
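In code, the verification step on the authenticating side reduces to recomputing the expected response and comparing it with the received one, as sketched below. The md5_hex() helper (producing the 32-character hexadecimal MD5 digest) is an assumed utility, easily built on top of any MD5 implementation; it is reused in the digest authentication sketch further below.

    #include <stdio.h>
    #include <string.h>

    /* Assumed helper: 32-character hexadecimal MD5 digest of 'msg'. */
    extern void md5_hex(const char *msg, char hex[33]);

    /* A verifies B: recompute z_A = f(x, y) over the challenge x and
       the shared secret y, and compare it with the received z_B.     */
    int verify_response(const char *challenge_x, const char *secret_y,
                        const char *received_zb)
    {
        char msg[256], za[33];
        snprintf(msg, sizeof(msg), "%s:%s", challenge_x, secret_y);
        md5_hex(msg, za);
        return strcmp(za, received_zb) == 0;   /* 1 = identity proven */
    }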
45.4.2 Digest Authentication
Although very similar to the basic authentication scheme, DAA [3] is much more secure. Instead of sending the username and the password in an unprotected way, a unique code is calculated from username, password, and a unique number received from the server. Figure 45.5 shows the HTTP DAA transactions between a web server and a browser:
1. The browser requests a document in the usual way, with an HTTP request message.
2. The server sends back a "401 unauthorized" challenge response message. The server generates a nonce (number used once) and sends it to the client. Note that the nonce must be different for every 401 message.
FIGURE 45.5 HTTP digest authentication negotiation:
(1) Client → Server: GET /protected/test.html HTTP/1.1
(2) Server → Client: HTTP/1.1 401 unauthorized, WWW-Authenticate: Digest realm="DigestZone", nonce="3gw6...", algorithm=MD5, domain="/protected", qop="auth"
(3) Client → Server: GET /protected/test.html HTTP/1.1, Authorization: Digest username="test", realm="DigestZone", nonce="3gw6...", algorithm=MD5, uri="/protected/test.html", qop="auth", response="65bia...", nc="0001", cnonce="82c..."
(4) Server → Client: HTTP/1.1 200 OK
(5) with Authentication-Info: rspauth="d9260...", qop="auth", nc="0001", cnonce="82c..."
3. The browser prompts the user for a username and a password, and computes a 128-bit response using the MD5 algorithm [7] as a one-way hash function:
response = MD5[MD5(username:realm:password):nonce:nc:cnonce:qop:MD5(method:uri)]
This response is sent to the server along with the nonce received, the "uri" requested, its own generated "cnonce" (client nonce), and the username. "qop" stands for Quality of Protection and indicates whether additional integrity protection is provided. In this example, this is not the case, hence qop="auth".
4. The server calculates its own response following the same scheme as given in Step 3, using the information sent to the client before and its own version of the username and password (or optionally a hashed form of them). It compares the received version with the computed one and grants access to the resource (HTTP 200 OK response) if the results match. If the authorization fails, a new 401 error message is sent. All 401 error messages include an HTML error page to be displayed by the browser. Browsers usually reprompt the user for a new username and password three times before giving up and displaying the error page.
5. For any subsequent request, the client usually generates a different cnonce. A counter, nc, is incremented. This new cnonce and counter, along with the new uri, are used to recompute a new valid response value.
Usually the browser stores the username and the password temporarily in memory in order to allow a user to reaccess a given protection space without retyping the password.
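Expressed in code, the response computation is two hash invocations over colon-separated strings, as the following sketch shows for qop="auth". It again assumes the md5_hex() helper introduced above (built, for example, on the RFC 1321 reference code) and is not taken from any server implementation.

    #include <stdio.h>

    /* Assumed helper: MD5 of 'msg', written to 'hex' as 32 lowercase
       hexadecimal characters plus a terminating zero. */
    extern void md5_hex(const char *msg, char hex[33]);

    /* Digest response for qop="auth":
       response = MD5(MD5(A1):nonce:nc:cnonce:qop:MD5(A2)) */
    void digest_response(const char *user, const char *realm, const char *pw,
                         const char *method, const char *uri,
                         const char *nonce, const char *nc, const char *cnonce,
                         char response[33])
    {
        char a1[256], a2[256], ha1[33], ha2[33], msg[512];

        snprintf(a1, sizeof(a1), "%s:%s:%s", user, realm, pw);
        md5_hex(a1, ha1);                   /* MD5(username:realm:password) */

        snprintf(a2, sizeof(a2), "%s:%s", method, uri);
        md5_hex(a2, ha2);                   /* MD5(method:uri) */

        snprintf(msg, sizeof(msg), "%s:%s:%s:%s:auth:%s",
                 ha1, nonce, nc, cnonce, ha2);
        md5_hex(msg, response);

        /* The server's rspauth value is obtained with the same scheme,
           except that the method part of A2 is left empty (":uri").    */
    }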
45.4.3 Digest Authentication with Integrity Protection
The RFC for digest authentication [3] provides the capability to include the hash of the entity (the payload, usually HTML code) in the computed MD5-hash response:
response = MD5[MD5(username:realm:password):nonce:nc:cnonce:qop:MD5(method:uri:MD5{entity})]
In this way, any modification of the transmitted information will result in a different MD5-hash response, which is easily detectable. While the integrity of the document from the server is ensured in a response message, the integrity of POST data is protected in a request message. To indicate that the server supports integrity protection, the argument qop is set to "auth-int".
FIGURE 45.6 Mutual authentication mechanism: the client requests a document and receives the nonce challenge; it answers with its own cnonce challenge and its response credential; the server replies with rspauth.
Note that, for GET requests with arguments, the integrity of the payload (the arguments) is already protected without the option qop="auth-int", because the URI with its arguments is included in the MD5-hash. For integrity protection of the response from the server, the "rspauth" field must be present; see Section 45.4.4 on mutual authentication for details on rspauth.
45.4.4 Digest Authentication with Mutual Authentication

We have already seen that digest authentication identifies the client. In addition, DAA provides for authentication of the server by the client, resulting in mutual authentication. The server already knows that the client is trustworthy, because the browser has sent proof that the user knows their shared secret. This occurred when the response was sent to the server (see Step 3 in Figure 45.5). Exactly the same mechanism is used to authenticate the server. After receiving correct client credentials along with the GET request, the server sends back a proof that it also knows the shared secret. This is done via the "rspauth" field sent in the "Authentication-Info" header of the HTTP 200 OK message, along with the document previously requested. The challenge initiating the server response is the cnonce from the client.
   rspauth=MD5[MD5(username:realm:password):nonce:nc:cnonce:qop:MD5(:uri)]
The only difference from the client response is the missing "method" field. The browser uses this information to authenticate the server. If integrity protection is activated, the hash of the entity is included in rspauth:
   rspauth=MD5[MD5(username:realm:password):nonce:nc:cnonce:qop:MD5(:uri:MD5{entity})]
The nonce from the server is used to challenge the client with an unpredictable number. In the same way, when server authentication is used for mutual authentication with DAA, the cnonce from the client is used as a challenge that the server cannot predict. Therefore, it is not possible to precompute responses to those challenges. This is known as a mutual challenge/response mechanism. Its concept is depicted in Figure 45.6.
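On the client side, checking rspauth mirrors the response computation. A minimal sketch, again assuming the hypothetical md5_hex helper of Section 45.4.2 and that ha1 = MD5(username:realm:password) has already been computed:

    #include <stdio.h>
    #include <string.h>

    /* Hypothetical helper (see the response sketch in Section 45.4.2). */
    void md5_hex(const void *buf, size_t len, char out[33]);

    /* Client-side check of the server's rspauth: same computation as the
     * client response, but HA2 = MD5(":uri") -- the method is omitted.
     * Returns 1 if the server proved knowledge of the shared secret. */
    int verify_rspauth(const char *ha1, const char *nonce, const char *nc,
                       const char *cnonce, const char *qop, const char *uri,
                       const char *rspauth_received)
    {
        char buf[512], ha2[33], expected[33];

        snprintf(buf, sizeof buf, ":%s", uri);
        md5_hex(buf, strlen(buf), ha2);

        snprintf(buf, sizeof buf, "%s:%s:%s:%s:%s:%s",
                 ha1, nonce, nc, cnonce, qop, ha2);
        md5_hex(buf, strlen(buf), expected);

        return strcmp(expected, rspauth_received) == 0;
    }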
45.4.5 Summary

DAA offers a secure authentication scheme with low implementation complexity, well adapted to embedded systems. In this environment, authentication and integrity protection are important, whereas confidentiality is often not required. Unfortunately, integrity protection and mutual authentication are not yet supported by today's typical DAA implementations on clients and servers.
TABLE 45.1 Comparison Between DAA and SSL on Functionality and Footprint Size

                                DAA              SSL
Mandatory features
  Client authentication         yes              no
  Server authentication         no               yes
  Data integrity                no               yes
  Data confidentiality          no               yes
Optional features
  Client authentication         yes              yes
  Server authentication         yes              yes
  Data integrity                yes              yes
  Data confidentiality          no               yes
Memory requirements
  RAM                           1648 Byte (a)    100 KB-250 KB (b)
  ROM                           6312 Byte (a)    200 KB (c)

(a) According to own measurements on a Coldfire MCF5307.
(b) RAM required depends on the number of simultaneous secure connections (1 to 24), according to Reference 8.
(c) According to information from Allegro.
Table 45.1 compares the mandatory and optional features of DAA and SSL, as well as their memory requirements. Note that, at the time our tests were conducted, the optional features were supported by only some implementations.
45.5 Weaknesses and Attacks

45.5.1 Basic Authentication

If an attacker has access to the network, he can very easily eavesdrop on the HTTP transaction and obtain the username and the password if no encryption is provided, for example, by SSL or IPSec. In such cases, HTTP basic authentication should be replaced by the digest scheme (Table 45.2).
45.5.2 Replay Attacks

Vulnerabilities appear when data, for example, sensitive commands, are sent to the server. With POST and some other HTTP methods, such data is transmitted in the entity part of the message. An attacker could replay credentials from an intercepted POST request with tampered form data (commands), thus taking control of the remote server (automation system). In contrast, the arguments sent in a GET request are part of the arguments used to compute the digest. Thus, GET requests are safer and should preferably be used to send data to the server. The fact that the credentials are stored in the browser's cache when using the GET method has no effect on security, because of the uniqueness of the nonce. Proper nonce generation, together with a reliable check for uniqueness, provides good protection against the replay of previously used valid credentials. Although the definition of a nonce requires its uniqueness, implementers might be tempted to reuse a nonce; this must be avoided. Within a session, a replay attack is prevented by incrementing the counter "nc," which makes the previously calculated hash value invalid. Note that the change of one bit in the argument of a one-way hash function changes, on average, half of the bits of the hash value [7]. In addition, integrity protection would prevent information from being tampered with when POST is used. On an embedded device, where usually only one legitimate user at a time is likely to access the system, locking the protection space (realm) to only one client at a time also contributes to protection.
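A server-side replay check can be sketched as follows; the structure and the sizes are illustrative and not taken from any particular server implementation:

    #include <stdlib.h>
    #include <string.h>

    /* One outstanding nonce, together with the last nonce count accepted
     * for it. A real server would also attach an expiry time and, as in
     * RomPager's StrictDigestAndIp option, the client's IP address. */
    struct nonce_state {
        char nonce[64];
        unsigned long last_nc;   /* last accepted value of the "nc" field */
    };

    /* Accept a request only if it carries the currently valid nonce and a
     * strictly increasing nonce count, so that replayed credentials with
     * an old nc are rejected. */
    int replay_check(struct nonce_state *s, const char *nonce,
                     const char *nc_hex)
    {
        unsigned long nc = strtoul(nc_hex, NULL, 16);

        if (strcmp(s->nonce, nonce) != 0)
            return 0;            /* unknown or expired nonce */
        if (nc <= s->last_nc)
            return 0;            /* nc not incremented: a replayed request */
        s->last_nc = nc;
        return 1;
    }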
TABLE 45.2 Comparison Between DAA Server Implementations

Server features                 Apache    RomPager    GoAhead
Basic access authentication     yes       yes         yes
DAA                             yes       yes         yes
Mutual authentication           yes       no          no
Integrity protection            no        no          no
Nonce check                     yes       yes         no
Uri check                       yes       yes         no
45.5.3 Man-in-the-Middle Attack

An attacker might be able to insert himself into the path between a client and a server. By capturing the packet containing the server's response with the challenge for digest authentication, he can replace the WWW-authenticate field with a field requesting basic authentication from the client. The username and password can thus be obtained, as the client returns them in an unprotected form owing to the faked WWW-authenticate field. Disabling basic authentication in the browser and requiring mutual authentication could prevent such an attack.
45.5.4 Dictionary Attack/Brute Force Attack

This attack assumes that the user chooses a simple password. In the client's request message (Step 3), all the information needed to calculate the response field, apart from the password, is available. With such a message in his hands, an attacker can thus compute thousands of responses generated with a list of possible passwords from a dictionary and check whether any of them coincides with the response sent by the browser. The success probability of such a dictionary attack can be decreased by proper selection of the password, for example, avoiding common words and including lower- and upper-case letters as well as special characters. For a brute force attack, the list of possible passwords is replaced by all combinations of characters. However, the fact that a hash must be calculated for each password guess makes a brute force attack very expensive.
45.5.5 Buffer Overflow

In a buffer overflow attack, the attacker sends very long commands to the server [9]. In embedded web servers, web pages often serve as an entry to CGI programs. Input data for such a program are transmitted in the HTTP message. If such data are too large, they can overwrite memory beyond the allocated buffer, including the original executable code of the browser/server, and may then be executed by the processor unless a careful error check is performed. Mostly, this will crash the server, resulting in a denial of service. However, it can also allow a skilled attacker to gain access to everything the server has access to, for example, confidential information or the control of the devices in the automation system. Buffer overflow can be avoided when all applications perform careful data range checks. For integrated third-party code, it is crucial to update the code to the latest available version and to apply any security patches or service packs as soon as they become available.
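The classic defence is to validate the input length before copying; a minimal sketch (the function name and the bound are illustrative):

    #include <string.h>

    #define CMD_MAX 64

    /* Unsafe pattern: strcpy(cmd, input) writes past cmd[] whenever the
     * attacker sends a string of CMD_MAX characters or more, corrupting
     * adjacent memory. The safe variant below checks the length first. */
    int store_command(const char *input, char cmd[CMD_MAX])
    {
        if (strlen(input) >= CMD_MAX)
            return -1;                   /* reject instead of overflowing */
        memcpy(cmd, input, strlen(input) + 1);
        return 0;
    }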
45.5.6 URI Check

Some web servers do not check whether the "uri" field located in the "authorization" header corresponds to the requested URI in the GET request. An attacker may replay a stored valid request from a client for
a given "uri", but modify the GET line to obtain other protected documents he wants or, possibly even more dangerously, to make the server accept modified parameters in a GET request. This might allow an attacker to send arbitrary commands to an embedded automation device. See the discussion of the uri check in Section 45.6.
45.6 Implementations

Some available embedded web server implementations and browsers have been tested for their support of DAA. The investigations have shown that the implementations do not match the specification in every aspect. This section briefly outlines the results.
45.6.1 Servers

1. Apache 2.0.42. Among the DAA implementations tested, Apache has the best one. While mutual authentication is in place and working, integrity protection is not yet implemented. The nonce lifetime is adjustable and the uri is checked. Apache is also the most robust server in terms of resistance to exploits, since a large user community uses and tests it continuously. Note that DAA is compatible with the Opera browser only if the AuthDigestDomain directive is configured in the file .htaccess. The Internet Explorer (IE) removes the arguments when copying the GET query into the uri field. Therefore, DAA is not compatible between IE and Apache for GET requests with parameters. The nonce is composed of a timestamp (in the clear), followed by the "=" sign, and the SHA-1 hash [7] of the timestamp, the realm, and a server secret (private key); a sketch of such a generator is given after this list:
   nonce=timestamp:"=":SHA1(timestamp+secret)
2. Allegro RomPager 4.05. In the RomPager web server [10], DAA is implemented without mutual authentication and integrity protection. There exists the option "StrictDigestAndIp," where the validity of the unique nonce is time limited and never more than one IP address is granted access at a given instant. This feature is appropriate for embedded systems because it prevents replay attacks. RomPager makes a full uri check. In addition, it is able to cope with the less secure uri processing of the IE. Furthermore, RomPager assumes that requests received via HTTP 1.0 originate from a browser not supporting digest. As a consequence, DAA does not work with IE and Opera when connected via a proxy. Therefore, Opera needs to be configured without proxies. On IE, the "Use HTTP 1.1 through proxy connections" option can be set. The nonce is generated using the time, the server IP address, the previous nonce (if there was one), and the server name:
   nonce=MD5(Time:Server-IP-Address:[previous nonce:]Server-Name)
3. GoAhead 2.1.2. GoAhead is a free, open source server, developed for embedded devices on a variety of platforms. Neither mutual authentication nor integrity protection is supported. A given nonce never expires and is never checked. Hence, there is no protection from replay attacks. The server removes the parameters of a GET query request in the digest "uri" field. This prevents Mozilla from being compatible with those types of requests.
   nonce=MD5(RANDOMKEY:timestamp:myrealm)
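A generator in the spirit of the Apache scheme above can be sketched as follows; sha1_hex is a hypothetical helper standing in for the platform's SHA-1 routine, and the handling of the secret is simplified:

    #include <stdio.h>
    #include <string.h>
    #include <time.h>

    /* Hypothetical helper: writes the 40-character lowercase hex SHA-1
     * digest of buf into out (41 bytes including the terminating NUL). */
    void sha1_hex(const void *buf, size_t len, char out[41]);

    /* Apache-style nonce: a clear-text timestamp, the "=" sign, then a
     * keyed hash, so that the server can later verify both the age and
     * the authenticity of a returned nonce without storing it. The
     * secret is a private server key, e.g., a random value drawn at
     * boot time. */
    void make_nonce(const char *secret, char nonce[64])
    {
        char buf[128], hash[41];
        unsigned long now = (unsigned long)time(NULL);

        snprintf(buf, sizeof buf, "%lu%s", now, secret);
        sha1_hex(buf, strlen(buf), hash);
        snprintf(nonce, 64, "%lu=%s", now, hash);
    }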
45.6.2 Browsers

All three browsers tested here have their strengths and weaknesses. IE is the only one using a different prompt for basic and digest authentication (see Figure 45.3 and Figure 45.4). Being made aware of this difference, an attentive user can recognize a man-in-the-middle attack (see Section 45.5.3). However, IE removes the GET arguments in the uri. Mozilla is an open-source browser, and thus very easy to modify. On the other hand, the current implementation is not very user friendly (slow and continually asking for the username and password). Opera is the strongest one in terms of security and the only one supporting mutual authentication.
TABLE 45.3 Compatibility of Client and Server Implementations of DAA

Servers                  Mozilla 1.01 / Netscape 7    IE 6.0.26    Opera 6.05
Apache 2.0.42 (win32)    yes (a)                      no           yes (b)
RomPager 4.05            yes (b)                      yes          yes
GoAhead 2.1.2            yes                          yes          yes

(a) Not working for GET with parameters.
(b) Requires valid domain.
Concerning DAA, a combination of the three products mentioned above would probably meet the features expected from a perfect DAA implementation. These are:
• An option to disable basic authentication.
• The user shall be notified by some visual indication that DAA is used, when he is prompted for the username/password, and also during browsing.
• Support of DAA with mutual authentication, with some displayed indication that the server has been authenticated, and the possibility to refuse pages if the server has not been authenticated.
• Support of DAA with integrity protection, with a visual indication that it is used.
• Verification that the URI requested with DAA is in the protection space.
45.6.3 DAA Compatibility

Table 45.3 summarizes the compatibility of different clients versus different servers. Note that the compatibility tests did not include server authentication and data integrity.
45.7 Conclusions

Digest Access Authentication is a light-weight, yet efficient way of providing user authentication. Applications running on top of HTTP can benefit from the services of DAA. Typically, these applications are web services like WebDAV and HMI applications in automation systems, migrating from proprietary communication protocols towards TCP/IP technology. Wherever basic authentication is still in use and not protected by a security protocol of a lower layer, it should be replaced by DAA. For embedded web server applications, it is urgent that browser and web server vendors implement mutual authentication and integrity protection, as these services are required to achieve a high security level. Where confidentiality is not required, an implementation of DAA including all the features defined in the RFC [3], namely mutual authentication and integrity protection, would provide sufficient, yet light-weight, security for embedded systems.
Appendix: A Brief Review of the HTTP

The HTTP is widely used to exchange text data across different platforms over a TCP/IP network. The definition of HTTP 1.1 is given in [11]. HTTP is based on standard request/response messages transmitted between a client (browser) and a web server. An example of a typical HTTP handshake is depicted in Figure 45.A1. The procedure is straightforward:

1. The browser sends a GET request to the server, indicating the requested resource.
2. The server responds with a 200 OK message along with the document requested.
Message 1 (Client -> Server), header:
    GET /simple.html HTTP/1.1
    Host: 192.168.0.3
    User-Agent: Mozilla/5.0 (...) Gecko/20020530
    Accept: text/html (...)
    Accept-Language: en-us, en;q=0.50
    Accept-Encoding: gzip, deflate, compress;q=0.9
    Accept-Charset: ISO-8859-1, utf-8;q=0.66, *;q=0.66
    Keep-Alive: 300
    Connection: keep-alive

Message 2 (Server -> Client), header:
    HTTP/1.1 200 OK
    Date: Sun, 29 Dec 2002 15:21:13 GMT
    Server: Apache/2.0.39 (Win32)
    Last-Modified: Sun, 29 Dec 2002 15:05:13 GMT
    ETag: "524d-fa-4921304a"
    Accept-Ranges: bytes
    Content-Length: 250
    Keep-Alive: timeout=15, max=100
    Connection: Keep-Alive
    Content-Type: text/html; charset=ISO-8859-1

Message 2, entity: a simple HTML page titled "Test page", containing a meta tag declaring content-type text/html with charset ISO-8859-1 and the text "This is a simple HTML page".

FIGURE 45.A1 Example of an HTTP message exchange.
Using Figure 45.A1, the key aspects of HTTP are briefly explained. An HTTP message consists of a header and, in most cases, an entity. Note that, while all data in the header are transferred as ASCII text, the entity might contain non-ASCII data, for example, .jpg files.

Header. A header is composed of a start-line and header-fields.

Start-line. There are two types of start-lines. A request start-line is of the form Method URI HTTP-Version, for example:
    GET /simple.html HTTP/1.1
A status start-line has the form HTTP-Version Status-Code Phrase, for example:
    HTTP/1.1 200 OK

Header-fields. Header-fields give various information, including date, language, and security information. In this document, the security-relevant header fields were discussed in detail. Examples:
    Host: 192.168.0.3
    Date: Sun, 29 Dec 2002 15:21:29 GMT
    WWW-authenticate: Digest realm="abc",...
Method. The most relevant methods are GET and POST. GET is the most common method to request a document; the GET request can also include arguments in the URI, which is widely used to transmit commands from the client to the server. POST is the method used to send information to the server, usually from a web page form.

URI. Uniform Resource Identifier [12]. URIs in HTTP can be represented in absolute form or relative to some known base URI, for example, http://www.test.ch/simple.html or /simple.html.

Entity. The rest of the message, for example, an HTML document.
Acknowledgment The authors would like to thank Emanuel Corthay from the Swiss Federal Institute of Technology Lausanne for his valuable contribution in the examination of the individual implementations.
References

[1] J. Slein, F. Vitali, E. Whitehead, and D. Durand, Requirements for a Distributed Authoring and Versioning Protocol for the World Wide Web, RFC 2291, 1998.
[2] E. Whitehead, A. Faizi, S. Carter, and D. Jensen, HTTP Extensions for Distributed Authoring — WebDAV, RFC 2518, 1999.
[3] J. Franks, P. Hallam-Baker, J. Hostetler, S. Lawrence, P. Leach, A. Luotonen, and L. Stewart, HTTP Authentication: Basic and Digest Access Authentication, RFC 2617, June 1999.
[4] M. Naedele, IT Security for Automation Systems: Motivations and Mechanisms, atp, Vol. 45, pp. 84–91, 2003.
[5] T. von Hoff and M. Crevatin, HTTP digest authentication in embedded automation systems. In Proceedings of the IEEE International Conference on Emerging Technologies for Factory Automation (ETFA03), Vol. 1, pp. 390–397, 2003.
[6] W. Stallings, Network Security Essentials: Applications and Standards, Prentice-Hall, New York, 2000.
[7] B. Schneier, Applied Cryptography: Protocols, Algorithms, and Source Code in C, John Wiley & Sons, New York, 1996.
[8] RomPager Secure Programming Reference, Version 4.20, Allegro Software Development Corporation, Boxborough, MA, 2002.
[9] E. Cole, Hackers Beware, New Riders, 2002.
[10] RomPager Web Server Engine Porting & Configuration, Allegro Software Development Corporation, Boxborough, MA, 2000.
[11] R. Fielding, J. Gettys, J. Mogul, H. Frystyk, L. Masinter, P. Leach, and T. Berners-Lee, Hypertext Transfer Protocol — HTTP/1.1, RFC 2616, June 1999.
[12] T. Berners-Lee, R. Fielding, and L. Masinter, Uniform Resource Identifier (URI), RFC 2396, August 1998.
[13] R.J. Anderson, Security Engineering: A Guide to Building Dependable Distributed Systems, John Wiley & Sons, New York, 2001.
Intelligent Sensors

46 Intelligent Sensors: Analysis and Design
   Eric Dekneuvel
46 Intelligent Sensors: Analysis and Design

Eric Dekneuvel
University of Nice at Sophia Antipolis

46.1 Introduction
46.2 Designing an Intelligent Sensor
     Analysis • The External Model • Functional Decomposition of a Service • Sensor Architectural Design
46.3 The CAP Language
     Description • Illustration • Implementation
46.4 Conclusion
Acknowledgment
References
46.1 Introduction

Today, thanks to the advances in numerical processing and communications, more and more functionalities are embedded into distributed components charged with providing the right access to these services. Complex systems are then seen as a collection of interacting subsystems embedding control and estimation algorithms. The inherent "modularity" concept behind this approach is the key answer to the increasing complexity of the systems, and this has led to the definition of new models and languages for the formal specification of the components [1]. In this chapter, we are more particularly interested in "intelligent sensors," components associating computing and communication devices with sensing functions [2]. In order to reduce the complexity, the design of an intelligent sensor requires a model of the sensor at a high level of abstraction from the implementation. The disparity of the knowledge encapsulated inside the instrument renders the modeling process very sensitive to the modeling strategy adopted and to the models used. A real-life component like the intelligent instrument usually involves the cooperation of three kinds of programs [3]:

• A level of data management to perform transformational tasks.
• One or more reactive kernels to compute the outputs from the logical inputs, selecting the suitable reaction (computations and output emissions) to incoming inputs.
• Some interfaces with the environment to acquire the inputs and process the outputs. This level includes interrupt management, input reading from sensors, and conversion between logical and physical inputs/outputs. Communication with the other components of the system is also managed at this level.
Data management covers research fields such as probability theory, possibility theory, measurement theory, and uncertainty management. Unlike a numeric sensor, which provides an objective quantitative description of objects, a symbolic sensor provides a subjective qualitative description of objects [4]. This qualitative description, adapted to the sensor measurement, can be used in Knowledge Based Systems (KBS), checking the validity of a measurement or improving the relevance of a result [5]. The reactive part is probably the most difficult part of the design of the intelligent sensor. Like all reactive systems, the intelligent sensor must continuously react to its environment at a speed determined by this environment. This often involves the ability to exhibit a deterministic behavior, to allow concurrency, and to satisfy strict real-time requirements. A generic intelligent sensor model has been developed to help during the specification step of the sensor functionalities [6]. The purpose of the intelligent sensor generic model is to provide a high level of abstraction from the implementation of the sensor, focusing on the fundamental characteristics that the sensor must exhibit. For this, the generic model uses the point of view of the user to describe the services and the operating modes in which the services are available [2]. Then, by using a language to compute the formal description, we are in a position to evaluate the component from a static and/or dynamic point of view. Once the component is validated, a prototyping step can be launched in order to obtain an operational system prototype. This prototyping step being usually expensive in time and resources, the final implementation should be made as much as possible using automatic synthesis from the high-level description, to ensure implementations that are "correct by construction" [7]. In this chapter, after reviewing the main characteristics of the generic intelligent sensor formal model, we discuss an implementation of the model provided by the CAP language, a language specifically developed for the design of intelligent sensors.
46.2 Designing an Intelligent Sensor

46.2.1 Analysis

As stated earlier, the diversity of the embedded functions, the flexibility, and the reuse argue for a distribution of the functionalities inside a complex system into areas of responsibility [8]. From an external viewpoint, an intelligent sensor will be considered as a modular unit behaving as a server. As such, it will be designed to offer its customers (the operator, other instruments, or other modules) access to the various functionalities encapsulated inside the sensor. Let us consider a simple example. A navigation system has to be designed using a closed loop on a surface like a wall to control the locomotion. As can be seen in Figure 46.1, the environment of the system to be designed exhibits various entities or actors, such as the axes, the operator, and the obstacles. On every occurrence of a start_moving request, a closed loop is activated until a new request like stop_moving is emitted by the operator. Once the links between the system and the environment are defined, the functional specifications can be established. For this, a dataflow diagram (see Figure 46.2) can easily be defined by identifying the data necessary for the navigation goals: a position measurement value used by the closed loop to compute the speed values that are to be applied to the various axes. Suppose we now decide to include another activity that will enable the system to follow a predefined trajectory. In this way, the operator (or a high-level decisional system) is provided with the possibility to choose between two methods according to the current context of execution. This trajectory execution activity can be interested in knowing whether unexpected obstacles are met along the trajectory in order to stop the system before striking one of them. If both functionalities, the obstacle detection and the computing of the position, use the same physical resource (a set of ultrasonic sensors, for example), it is better, for homogeneity, to encapsulate them inside a subsystem in charge of the physical resource. The intelligent sensor module interacts with the environment through several messages. This is the interface of the module. The structure of a message (its signature) is generally limited at this level to a list of parameters such as the sender identity, the communication medium used, and the contents. We
FIGURE 46.1 Context diagram of the application: the navigation system exchanges start_moving and stop_moving requests and speed values with the operator, receives echo data from the obstacles, and exchanges excitation signals and speed measurements with the axes.
FIGURE 46.2 A global dataflow diagram: position counting provides the position used by both the trajectory execution activity (which also consumes unexpected obstacle notifications from the obstacle detection activity) and the wall following activity, which produce the speed instruction consumed by the control activity.
usually distinguish between messages for data communication and messages for control. Control messages enable the customers to communicate with the sensor in a bidirectional way, using a client–server protocol (see Figure 46.3). The customer requests the launching of an activity through the request link. The customer receives an identification number for its demand and can be informed about the status of the request (activity launched, terminated, etc.) owing to another control message (reply). To be effective, the intelligent sensor interface description must be complemented by the behavioral description of the module. While the structural viewpoint describes the internal organization of a complex system, the behavioral viewpoint expresses all the information that characterizes the module to be designed from an external viewpoint [9]. A generic model of an intelligent instrument has been developed for this purpose, using the concept of external services to qualify the set of operations offered to the outer entities. Reference 10 gives the following definition:

Definition 46.1 From an external point of view, a service is the result of the execution of a treatment, or a set of treatments, for which one can provide a functional interpretation.

In other terms, the execution of a service typically results in the production of output values according to input values consumed by the execution of a processing. The services are not limited to measurement aspects. The set of services covers a large spectrum of functionalities that one can expect from intelligent sensors. Intelligent sensors must be configured, calibrated, and enabled, so that they can provide their measurements to the rest of the system. Selecting a particular sensor, a power supply, a reference voltage, and a sampling frequency are common examples of configuration services that can be used to
FIGURE 46.3 The intelligent sensor interface: the pilot and wall following components act as customers, exchanging request and reply control messages with the intelligent sensor component (server), which exposes its position computing and obstacle detection services through input and output data links.
set the value of these parameters. The processing embedded inside services can be as simple as the acquisition of a value, but usually involves more complex treatments such as signal processing (to improve, for example, the resolution of a given value), data processing, validation of the measurement, and so on. In the generic model of an intelligent sensor, a service will consequently be modeled by two sets of parameters:

External: how the service communicates with other services. The services are gathered into User Operating Modes.
Internal: how the external service is decomposed into internal basic processing units.

Let us examine both aspects in detail, successively.
46.2.2 The External Model

Figure 46.4 depicts the external model of a service in use. A service is mainly described by its input/output data and is triggered by an external event. The data and events used can be organized into classes, with the description of a set of characteristics such as the format, the accuracy, the refresh period, etc., for each data class. The description of the input and output behaviors exhibits the possible interconnections between the service and those that precede or follow it. This is a data-driven representation of the service relationships, equivalent to an explicit representation but with the advantage of being more efficient and general (new services can be added without having to physically interconnect the entire system). Control parameters are received from the "parent" activity that requests the service. The control parameters affect the modalities of processing and the modalities of the underlying sensor the service might encapsulate. The control parameters are usually passed in conjunction with the service activation request. For example, one can easily imagine the obstacle detection service running in a cyclic shooting modality or in a single shooting modality, depending on the nature of the activity that requests the service. The launching of a service can be conditioned by the verification of its activation conditions. The distinction between a request and a condition is that the request for a service is emitted by the user, while the condition is processed by the system. Those conditions are often related to access rights or to security aspects, and induce the verification of the origin of the request, the mode of transmission used, etc.
FIGURE 46.4 Graphic model for representing a service: a service receives a request, input data, conditions, and control parameters, uses resources, and produces output data.
Resources include hardware (sensor, CPU, memory, etc.) as well as software (extended Kalman filter, etc.). Input and output data can also be considered as resources, raising the problem of data obsolescence. Other properties can be added, such as time and complexity measures that can help to select between different methods. The set of services can easily be likened to the instruction set of a regular computer. To cope with the various states of the sensor that can occur during its life (out of order, in configuration, in manual exploitation, in automatic exploitation, etc.), the different external services are organized into coherent subsets called User Operating Modes (USOMs). In the model, a sensor service can be requested, and thus accepted, only if the current active USOM includes this service. This prevents the request of services when they cannot be available. According to Reference 10:

Definition 46.2 An external mode is a subset of the set of external services included in the intelligent instrument. An external mode includes at least one external service, and each service is included in at least one external mode.

The operating modes can easily be described with respect to a labeled transition system, where the label is the external event matching a request to commute the current mode (see Figure 46.5). Moreover, in each USOM, a notion of context may exist, where the context is the subset of the services that are implicitly requested as long as the system remains in the given USOM. The external services included inside a USOM are supposed to behave independently. We say that they belong to orthogonal regions of a state, sometimes termed constellations. As an example, the external services inside the intelligent sensor of the navigation system could be structured into a wall following or an execution trajectory mode. The position computation and the obstacle detection services would be implicitly executed when entering the corresponding mode. If the sensor reveals a complex state space, it can be decomposed into nonoverlapping substates to reduce the complexity. For example, the wall following and execution trajectory states can belong to a more general measuring state, often called a macro-state or a super-state, itself belonging to an active macro-state, etc. Some properties that the design must satisfy, and that can be checked against the functional specifications, have been elaborated. They complement the formal model. For example, properties may express axioms such as:

1. An external mode is a nonempty set of external services.
2. Each external service belongs to at least one external mode.
3. In an intelligent instrument, the set of disconnected vertexes in the state-transition diagram is empty, that is, there is no disconnected external mode.
4. A transition between two modes is unique in the graph.
5. Each external mode must be reachable and each external mode can be left.
6. etc.
FIGURE 46.5 The concept of USOMs: two modes, USOM1 (grouping Servicex and Servicey) and USOM2 (grouping Servicex), connected by the transitions T12 and T21.
These properties must be verified to guarantee a safe production of the intelligent instrument. For example, the verification of property 5 can easily be done by checking that the fan-in and fan-out degrees of every vertex of the graph are not equal to zero; an incidence matrix can help in doing this. The external viewpoint of the intelligent instrument will usually be complemented by a second level of description, to capture the algorithmic flow. This level must exhibit the treatments that contribute to the global functionality, usually called internal services.
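Such a degree check over the transition graph can be sketched as follows; the adjacency-matrix encoding and the bound on the number of modes are illustrative:

    #define NMODES 8   /* illustrative bound on the number of external modes */

    /* trans[i][j] != 0 iff there is a transition from mode i to mode j.
     * Nonzero fan-in and fan-out for every vertex means that each
     * external mode can be reached and can be left. */
    int modes_reachable_and_leavable(const int trans[NMODES][NMODES], int n)
    {
        for (int i = 0; i < n; i++) {
            int fan_in = 0, fan_out = 0;
            for (int j = 0; j < n; j++) {
                fan_out += trans[i][j] != 0;
                fan_in  += trans[j][i] != 0;
            }
            if (fan_in == 0 || fan_out == 0)
                return 0;            /* the property is violated for mode i */
        }
        return 1;
    }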
46.2.3 Functional Decomposition of a Service

Complex operations often need to be decomposed into multiple primitive operations in order to produce the overall behavior. For example, an external measurement service can induce a very complex treatment, probably following a step of initialization and, for self-terminated services, followed by a step of termination. So, Definition 46.1 is usually complemented with the following definition [11]:

Definition 46.3 An external service is the result of the execution of internal services.

From the viewpoint of the designer, an internal service is an elementary operation, possibly extracted from a library of components, for which no further decomposition is needed. Its I/O behavior can easily be described through an algorithm. Depending on the area of application, such a conceptual unit can be known under various appellations, such as a module, a codel [12], etc. The functional decomposition of a complex external service into internal services clearly has the following advantages:

• Structured programming helps the designer to describe the different steps of the treatment without using internal state variables.
• Transitions from one step to the other explicitly define the possible interruption points of the service. Between these points, the operation is considered to be an atomic transaction. This preserves the functionality from coherency problems (loss of data, and so on).
• Reuse of common units of programming: they are common to different services or are the result of previous developments. For example, an obstacle detection service and a position computing service can share a lot of common portions of code: the signal emission, the signal acquisition, and so on. These units can be part of a library of Intellectual Property (IP) modules.

As the reader can see, there is no mention at this point of the nature of the realization. Design units can be implemented on customized or on software processors. The hardware units can also be freely implemented in the discrete domain, using FPGA (Field Programmable Gate Array) or DSP (Digital Signal Processor) components, or by using analog components in the continuous domain.
FIGURE 46.6 Example of a functional decomposition: an external event triggers the input data acquisition, followed by the processing and the output data emission, the steps being chained by internal events.
In order to reduce the complexity of the design, the definition of the executive architecture (a problem known as the partitioning problem) must be postponed to a detailed design step, taking into account various design constraints such as economic and real-time constraints. The internal description of a complex service can be expressed using an activity diagram. Activity diagrams are well suited to show algorithmic details or procedural flow inside the service. They are often compared with flowcharts but are more expressive. Such a diagram describes the internal operations to be achieved on the incoming flow and their temporal dependencies. Depending on the complexity of the service, the detailed refinement of an activity can be performed using several successive hierarchical levels. The processing step in Figure 46.6 can, for example, be refined into a feature extraction step followed by a data classification step and a final decision step, in sequential order. Elementary operations, those for which no refinement is needed, will be described by their internal behavior, usually through an algorithm. The activation of the internal services is controlled by internal events.

Definition 46.4 An internal event in an intelligent instrument is an event that is produced and consumed by the instrument itself.

The producer gives birth to the event. Consumers can react to this event in order to start their processing. The activation of an operation will often depend on the completion of the previous operation, but more complex temporal dependencies will also frequently happen, like the activation of a signal processing operation conditioned on the end of an external conversion operation. In this way, an external service can itself be in the position of a client of another component, dynamically starting an external activity to request data necessary to the achievement of its mission. While an external event is associated with a unique external service, an internal event can be associated with several internal services, leading to an n-producer/m-consumer relationship. In this respect, activity diagrams differ from conventional flowcharts in their capacity to represent concurrency. In Figure 46.7, the execution of the internal service V is followed by the simultaneous execution of the internal services W and Y. Expressing sequential and parallel compositions of treatments is not always sufficient. The execution of a service can be affected by the state of the resources. The concept of version has been created with the aim of providing alternative versions of treatments that enable the service to operate under nonnominal conditions. This is a means to take the fault tolerance problem into account. All the versions of a given service share the same request and produce the same output, but the inputs, procedures, and resources differ from one version to another. For example, a measurement service uses two transducers in the nominal mode of the service, to compute a data value using a sophisticated data analysis method. If a defect is detected on one of the transducers, the measurement service can continue to operate using a subset of the features extracted from the input data. Of course, the quality of the result will decrease. The versions are typically ranked and classified into internal modes, such as the nominal mode and the degraded mode.
FIGURE 46.7 Version and internal mode concept: a nominal version chaining internal service V to the parallel internal services W and Y, and a degraded version using internal services U and Z.
The management of the versions of a service can be straightforward: at time t, when the request for service is emitted, the version to be carried out will be the one with the lowest rank whose resources are all nonfaulty [13]. As for the USOMs, the description of the internal modes can be done using a state diagram. Having reviewed the generic formal model of the intelligent sensor, let us now turn to some validation aspects of the sensor.
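This selection rule is simple enough to sketch directly; the structures below are illustrative and do not correspond to a particular implementation:

    /* A version of a service, ranked from nominal (rank 0) upward
     * through increasingly degraded variants. */
    struct version {
        int rank;                     /* 0 = nominal, higher = more degraded */
        int (*resources_ok)(void);    /* 1 if all needed resources are non-faulty */
        void (*run)(void);
    };

    /* Version management rule: at request time, execute the version with
     * the lowest rank whose resources are all non-faulty. v[] is assumed
     * to be sorted by increasing rank. */
    int run_service(const struct version *v, int nversions)
    {
        for (int i = 0; i < nversions; i++)
            if (v[i].resources_ok()) {
                v[i].run();
                return v[i].rank;     /* rank of the version actually run */
            }
        return -1;                    /* no runnable version: service fails */
    }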
46.2.4 Sensor Architectural Design

We have seen the mathematical properties underlying the intelligent sensor generic model of computation. These properties can be used efficiently to answer questions about system behavior without carrying out expensive verification tasks. The formal validation generally uses an automata-theoretic approach, modeling the formal description by a Finite-State Machine (FSM) and the language of the automaton [14]. As stated earlier, the final implementation of the intelligent sensor should be made as much as possible using automatic generation from the generic model, to ensure implementations that are "correct by construction." For example, the protocol defined at the high level of abstraction uses the concept of message passing, where a message is an abstraction of data and/or control information passed from one component to another. Various mechanisms (the message signature) can be envisioned at a lower level of definition, including a function call, an interruption, an event using a Real-Time Operating System (RTOS), an ADA rendezvous, or a Remote Procedure Call (RPC) in a distributed implementation. Consequently, a prototype of the sensor is also a useful means to validate the specification in the presence of the real-time inputs, with physical characteristics similar to those of the final implementation, which will be produced by the synthesis stage. Rapid prototyping aims at analyzing the performance of an implementation, validating its capability of satisfying hard real-time constraints, etc. To do so, the key technologies are the use of software synthesis, hardware synthesis, and the synthesis of interfaces between software and hardware using programmable components. The prototype to be generated will be highly dependent on the physical architecture selected and on the physical communication links. The targeted architecture is strongly dependent on the cost of components and production. Consequently, as shown in Figure 46.8, there are some nontrivial aspects to be analyzed before being in a position to produce a prototype:

• The definition of the hardware/software architecture (partitioning, mapping).
• The sequencing of the software on each software processor (scheduling).
FIGURE 46.8 Design flow of an intelligent sensor: the formal model of the intelligent sensor undergoes formal validation, then partitioning, mapping, and scheduling, and finally hardware/software synthesis, yielding the prototype of the intelligent sensor.
FIGURE 46.9 Typical model of an architecture for an intelligent sensor system: transducers T1 to T3 feed the input interface (MUX, E/B, amplifier, ADC with voltage reference), an output interface (DAC, amplifier, MUX) drives the outputs, and a software processor (microprocessor, EPROM, RAM, and communication interface) is attached to the system bus.
Figure 46.9 shows a typical hardware/software architecture for an intelligent sensor. We can observe the hybrid character of intelligent sensors, mixing analog and digital components. Each component description can be refined to exhibit the detailed architecture of the component. In the figure, we can see an architecture organized around a microprocessor that executes the functionalities it is in charge of. This processor can be a DSP, a processor whose CPU is customized for data-intensive operations such as digital filtering. Bidirectional communication is ensured through various means: a serial link, a CAN (Controller Area Network), an Ethernet link, etc. [15]. Finally, memories (ROM, RAM, etc.) provide storage for the information located inside the sensor. In the future, hardware architectures will tend to combine more and more customized hardware with embedded software. The definition of a hardware/software architecture involves checking whether the sensor is schedulable, that is, whether all the performance requirements can be guaranteed. A deadline (a point in time, or a delta-interval by which a system action must occur) is an example of a requirement that,
when missed, constitutes an erroneous computation. Consequently, the definition of hardware/software architectures generally requires more complex modeling, defining the external timing requirements of the messages. The Quality-of-Service (QoS) requirement of a message can be expressed using different means; for example, the response timing can be defined in terms of timeliness requirements (typically, deadlines) [16]. Assigning an execution order to concurrent modules and finding a sequence of instructions implementing a functional module are the primary challenges in software organization. These can be nontrivial issues to deal with, particularly when one must consider the performance as well as the functional requirements of the system. The software implementation can be facilitated by the use of real-time languages and their underlying executive kernel. Such languages provide a style of programming that enables the manipulation of events and/or state changes, with constructions expressing the behavior through parallelism, synchronization, etc. Selecting a language to compute the model is not straightforward; there are basically several possibilities: developing a dedicated language or using an existing one, using a graphical or a textual input form, etc. The domain of applicability of a language must also be carefully studied, with basically two approaches. The synchronous approach states that time is a sequence of instants between which nothing interesting occurs [17]. In each instant, some events occur in the environment and a reaction is computed instantly by the modeled design. This means that computation and internal communication take no time. This hypothesis is very convenient, allowing the complete system to be modeled as a single FSM with a completely predictable behavior. ESTEREL [18,19] and its graphical expression form, the SYNCCHARTS [20,21], are representatives of synchronous imperative programming languages. As in all languages specialized for control-dominated systems programming, data manipulation cannot be done very naturally. In the asynchronous approach, as in ELECTRE [22], events are observed and processed immediately. This approach enhances the expressive power of the language. Moreover, the design can be implemented more efficiently on heterogeneous hardware/software architectures. On the other hand, timing constraints are difficult to check.
46.3 The CAP Language

46.3.1 Description

The generic intelligent sensor model gave birth to a new language belonging to the category of asynchronous languages: the CAP language [23]. Its ability to provide a rapid prototype of the intelligent sensor model on common microcontrollers available on the market has been one of the main reasons for developing this language. The developers have consequently limited the implementation to monoprocessors and to a sequential operation. The CAP language is an incomplete language, in the sense that it specifies only the interaction between computational modules (internal services), and not the computation performed by the modules. An interface with a host language specifies the behavioral contents of such units through C instructions. Like every conventional language, the grammar can be described using the Backus–Naur Form (BNF) metalanguage. Figure 46.10 shows the formal grammar defined by Reference 10. Reference 11 complemented the language with the possibility of declaring internal services. As can be seen, the two parts of the model are normally expressed successively. The interface of the instrument follows the metavariable instrument with:

• A number referring to the instrument as a node in the network.
• Variables that can be exported or updated on the network.
• Communication links imported or exported to implement the command protocol.

The expression of the graph of the external modes is achieved through the set of vertices and the set of transitions. Each transition is then described by declaring the input vertex, the output vertex, and the communication link that is the source of the transition event. Variables of the imported or exported lists are defined according to a C type.
FIGURE 46.10 An overview of the formal grammar. (From J. Ttussor, PhD thesis. With permission.)
Finally, a list of external modes can be given in the definition of a service; these modes are the only ones in which the service can be launched. The description of the internal model is close to the external one and can easily be understood. As shown in Figure 46.11, the principle of operation of the compiler is relatively straightforward: after the lexical and syntactical analyses, a set of lists reflecting the formal model is produced by the code generator. After being generated, the code is compiled and linked with the user-defined libraries and the processor-dependent startup code. The generated data structures are split into two sets of lists: a first set contains the names of the objects and is subdivided into eight lists: external modes, internal modes, external services, internal services, variables, export links, import links, and internal events. The second set contains the detailed description of each transition. The conformity of the description to the formal model will be analyzed using these lists. They also help provide a level of abstraction between the software synthesizer and the real machine.
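The shape of such generated structures can be sketched as follows, taking the two-USOM instrument of Section 46.3.2 as an example; this is an illustration, not the actual CAP output format:

    /* First set: name lists (here only the external modes are shown). */
    static const char *external_modes[] = { "configuration", "measure" };

    /* Second set: detailed description of each transition of the
     * USOM graph. */
    struct transition {
        int from;            /* index into external_modes[] */
        int to;
        const char *link;    /* communication link carrying the event */
    };

    static const struct transition transitions[] = {
        { 0, 1, "cantcp_in" },    /* configuration -> measure */
        { 1, 0, "cantcp_in" },    /* measure -> configuration */
    };

A verification tool can, for instance, walk the transitions[] array to check the graph properties of Section 46.2.2 (no disconnected mode, unique transitions, and so on).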
FIGURE 46.11 The CAP design flow: the source of the IS is translated by the CAP compiler into intermediate data structures, which a C compiler links with the kernel and the input/output libraries to produce the executable. (From J. Ttussor, PhD thesis. With permission.)
Let us illustrate these principles with an intelligent instrument designed to process ultrasonic sounds in order to produce a distance measurement value.
46.3.2 Illustration

To illustrate the approach, let us take the following example [11]: a measurement system is composed of one ultrasonic transmitter able to emit a signal toward a target in order to compute the distance between this target and the sensor with the help of two receivers. Figure 46.12 shows the basic configuration of the measurement system. The intelligent instrument delivers the two distance measurements d1 and d2, each of them with a validation degree dv1 and dv2, respectively. As shown in Figure 46.13, the basic principle of the measurement is the following: the ultrasonic emitter sends a sinusoidal waveform linearly modulated in frequency over the interval ΔF = fmax − fmin, with a rate of variation of the frequency α = ΔF/Tr. The instantaneous frequency Freceived then undergoes an offset relative to the frequency Femitted of Δf = |Femitted − Freceived|. Reference 24 has shown that the distance d can be determined by measuring the offset only inside the interval [t0, Tr], where its value is fa. For that, the measurement is inhibited during the dead time tdead, where the frequency offset has the value fb. The state space of the sensor is decomposed into two USOMs: the configuration mode (which is the default USOM) and the measurement mode (see Figure 46.14). The transition between the modes is triggered on the cantcp_in event. A number of internal services have been specifically developed for this application:

• Initialization and configuration services enable the modification of the parameter values used for the computation of d: duration of the slope Tr, slope p, time tdead, and voltage umin. They take account of particular conditions of measurement (nature of the obstacle, environment, etc.).
• Internal services such as those depicted in Figure 46.15 contribute to the measurement elaboration; for example, the internal service for the slope generation has the responsibility of sending the signal in the direction of the obstacle; the suppression of aberrant measurements filters and keeps the impulsions close to famoy (the average of the measured frequencies over Tr), while the others are discarded. The Dynamic Packet Transport (DPT) service computes the pseudo-triangular distribution of the possibilities.
FIGURE 46.12 Configuration of the measurement system: one transmitter and two receivers on the sensor, with the distances d1 and d2 to the target and d0 between the receivers. (From J. Ttussor, PhD thesis. With permission.)
FIGURE 46.13 Principle for the measurement of fa: instantaneous frequency versus time of the emitted and received signals, showing the modulation slope α between fmin and fmax over Tr, the frequency span ΔF, the dead time tdead during which the offset Δf equals fb, and the interval starting at t0 during which the offset Δf equals fa. (From J. Ttussor, PhD thesis. With permission.)
Internal calculations are not very complex. For example, the distance computation service computes the distance on every fa input. By doing this, there are Nexp measurements of d carried out during the interval Tr, using the following formula:

    D = (fa / ΔF) × ((Tr − tdead) / 2) × V    (46.1)

The internal service named validity estimation computes a value in the [0, 1] interval according to the following rule:

    Dv = Nexp / Nth    (46.2)

where Nth is the theoretical number of periods of fa that are supposed to be observed by the instrument.
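Coded directly, with illustrative parameter names, Equations (46.1) and (46.2) become:

    /* Eq. (46.1): fa and dF in hertz, Tr and tdead in seconds,
     * V the wave propagation speed in m/s. */
    double distance(double fa, double dF, double Tr, double tdead, double V)
    {
        return (fa / dF) * ((Tr - tdead) / 2.0) * V;
    }

    /* Eq. (46.2): ratio of the Nexp periods of fa actually measured to
     * the Nth periods theoretically observable; lies in [0, 1]. */
    double validity(int n_exp, int n_th)
    {
        return (double)n_exp / (double)n_th;
    }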
FIGURE 46.14 Graph of the USOMs and the corresponding CAP declarations: the configuration USOM (services import_slope_duration, import_umin, import_slope, law_selection, import_wave_speed, and reset) and the measure USOM (services distance_measurement and standard_deviation), with a transition in each direction triggered on the cantcp_in event. (From J. Ttussor, PhD thesis. With permission.)

    Mode configuration, measure.
    Transition configuration to measure on cantcp_in.
    Transition measure to configuration on cantcp_in.
    Default mode configuration.
As shown in Figure 46.15, the declaration of the distance_measurement external service includes the request (on cantcp_in) and the USOM (in measure) where it is available. The measurement functionality can be carried out according to four internal modes: nominal, degraded1, degraded2, and critical. For example, when the validity of the measurement d1 (see Equation [46.2]) falls below a given threshold in the nominal mode, the instrument can automatically switch to the degraded1 mode. Figure 46.16 presents an excerpt of the synthesized code. As stated earlier, these symbolic data structures will be used by the verification tool to check the properties of the formal model. The data structures (array of services, etc.) will typically be stored inside a volatile memory (RAM or SRAM) of the hardware architecture, while the automaton, not accessible to the user, will be set in a long-term memory (EPROM, FLASH, etc.). Depending on the operating mode, development or exploitation, the user program can be stored inside the volatile memory or set in long-term memory.
46.3.3 Implementation

The execution of the various services is handled by an automaton whose responsibility is to interpret the formal model. As shown in Figure 46.17, the execution machine runs a cyclic program that processes the inputs and updates a FIFO (First In, First Out) queue storing the pending events. Then, depending on the nature of each event, a transition is fired or a procedure call is performed. The permanent functions are then executed, and the loop ends with the emission of the output messages. For each step, the detailed behavior is the following (a minimal C sketch of the whole loop is given after this enumeration):

1. Reading of a message on each communication link: if the message concerns the intelligent sensor, it is processed according to its type:
• In the case of a data message, the corresponding variable is updated and the associated event is triggered.
[Figure 46.15 shows the functional decomposition of the measurement functionality: a slope generation service feeds two acquisition channels; on each channel, a suppression of aberrant measurements service, a unit translation service, and a distance computation service are chained, followed by standard deviation, DPT, and validity estimation services, each with its corresponding access service (Get_Standard_deviation, Get_DPT, Get_validity, Get_measurement); an average computation service combines the two channels and is accessed through Get_average.]
Service distance_measurement on cantcp_in in measure
  {uses slope_generation, acquisition 1, distance_computation 1, ... }
Iservice acquisition 1 on end (slope_generation)
  {/* C code of the internal service */}
FIGURE 46.15 Functional decomposition of the measurement functionality. (From J. Tailland, PhD thesis. With permission.)
• In the case of an event message, the corresponding event is inserted into the event FIFO for later processing.
• In the case of a query message, the current mode is exported on the output links.
• In the case of a variable message, the variable is bound so as to import an external variable.
2. Event queue processing: several situations can arise; they are analyzed in the following order:
• The event is a request to change the external mode. Provided this change is authorized, the internal variable is updated and the new mode is exported on the output links.
• The same applies to an event requesting a change of the internal mode.
• The event is an execution request for an external service. Provided the service is enabled in the current mode, it is executed through its procedural entry point.
• The same applies to an event requesting the execution of an internal service. If defined, an event signaling the end of the execution is inserted into the event FIFO.
3. Execution of the EVER services: these are services that run permanently.
4. Emission of the messages on the communication links (exported variables, changes of mode, etc.).
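To make this cyclic scheme concrete, here is a minimal C sketch of one iteration of the execution machine. Every name (read_message, service_enabled, the FIFO layout, and so on) is a hypothetical placeholder for the synthesized data structures and primitives; the sketch only mirrors the loop structure described above:

/* Message and event types handled by the execution machine (assumed). */
typedef enum { DATA_MSG, EVENT_MSG, QUERY_MSG, VARIABLE_MSG } msg_type;
typedef enum { EXT_MODE_CHANGE, INT_MODE_CHANGE, EXT_SERVICE, INT_SERVICE } event_type;

typedef struct { event_type type; int id; } event_t;

#define FIFO_SIZE 32
typedef struct { event_t buf[FIFO_SIZE]; int head, tail; } fifo_t;

/* Hypothetical primitives over the synthesized model. */
extern int  read_message(int link, msg_type *type, event_t *ev); /* 0: not for us */
extern void update_variable(const event_t *ev);
extern void bind_imported_variable(const event_t *ev);
extern int  mode_change_authorized(event_type kind, int mode);
extern void export_current_mode(void);
extern int  service_enabled(event_type kind, int id, int mode);
extern void call_service_entry_point(event_type kind, int id);
extern void run_ever_services(void);
extern void emit_output_messages(void);

static int current_mode;
static fifo_t event_fifo;

static void fifo_push(fifo_t *f, event_t e)
{
    f->buf[f->tail] = e;
    f->tail = (f->tail + 1) % FIFO_SIZE;
}

static int fifo_pop(fifo_t *f, event_t *e)
{
    if (f->head == f->tail) return 0;   /* empty queue */
    *e = f->buf[f->head];
    f->head = (f->head + 1) % FIFO_SIZE;
    return 1;
}

/* One cycle of the execution machine (Figure 46.17). */
void automaton_step(int n_links)
{
    msg_type type;
    event_t ev;

    /* 1. Read one message on each communication link. */
    for (int link = 0; link < n_links; link++) {
        if (!read_message(link, &type, &ev))
            continue;                          /* message does not concern us */
        switch (type) {
        case DATA_MSG:                         /* update variable, then ...   */
            update_variable(&ev);              /* ... trigger associated event */
            fifo_push(&event_fifo, ev);
            break;
        case EVENT_MSG:                        /* store for later processing  */
            fifo_push(&event_fifo, ev);
            break;
        case QUERY_MSG:                        /* export the current mode     */
            export_current_mode();
            break;
        case VARIABLE_MSG:                     /* import an external variable */
            bind_imported_variable(&ev);
            break;
        }
    }

    /* 2. Process the pending events: mode changes, then service requests. */
    while (fifo_pop(&event_fifo, &ev)) {
        if (ev.type == EXT_MODE_CHANGE || ev.type == INT_MODE_CHANGE) {
            if (mode_change_authorized(ev.type, ev.id)) {
                current_mode = ev.id;
                export_current_mode();
            }
        } else if (service_enabled(ev.type, ev.id, current_mode)) {
            call_service_entry_point(ev.type, ev.id);
        }
    }

    /* 3. Execute the EVER (permanently running) services. */
    run_ever_services();

    /* 4. Emit the output messages (exported variables, mode changes, ...). */
    emit_output_messages();
}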
FIGURE 46.16 Part of the synthesized code. (From J. Tailland, PhD thesis. With permission.)
FIGURE 46.17 Principle of operation of the automaton.
Note the reduced portability of the application, owing to the tight coupling between the application and the hardware. Open software platforms (EmbeddedJava, JavaCard, etc. [25]) can be regarded as attractive solutions to this problem, since they interpret the applications via a virtual machine. The designer must, however, be aware of the substantial performance penalty that may be paid by adopting this solution.
46.4 Conclusion

In this chapter, a generic model for the design of intelligent instruments has been discussed. In the formal model of the sensor, the specification of the external services is given according to the user's point of view of the functionalities available inside the sensor. The external model uses the concept of USOMs to prevent the activation of services that are not available in the current external mode. Internal services define the basic units that are assembled to describe the complex behavior of an external service, taking into account various temporal dependencies. Like the external services, the internal services can be gathered into internal modes, which define the internal states of a service according to its ability to operate under various contextual situations. Some of the basic principles underlying the implementation of the model on hardware architectures have also been presented. The advantages of digital instrumentation over conventional instrumentation are well established today. Finally, we have discussed an implementation that can be generated automatically through the use of a language like CAP. The automatic generation of the implementation is conditioned on the formal verification of the properties underlying the generic model.

As stated earlier, the intelligent sensors to come will be composed of more complex and heterogeneous components. This trend will change the industrial landscape, making the trade and assembly of IPs embodied in layouts, RTL (Register Transfer Level) designs, and software programs indispensable [17]. This aspect, which is not specific to the design of intelligent sensors, is a great challenge. For example, hardware/software co-simulation is often performed with separate simulation models [26]. This makes trade-off evaluation difficult because the models must be recompiled whenever a change in the architecture mapping is made. The CoFluent system-level design tool [27] introduces an intermediate level of abstraction, the functional level, between the specification and the architectural model of the sensor [28]. As shown in Figure 46.18, this level defines the logical architecture of the system in terms of functional components
[Figure 46.18 depicts the MCSE flow: customer needs feed a requirements definition step that produces the requirements and specifications documents; the system specifications drive the functional design, which yields the functional model; the architecture design step, supported by reuse and IP capitalization, yields the architectural model; prototyping finally leads to the prototypes and the product.]
FIGURE 46.18 The MCSE design methodology for complex sensor hardware architectures.
(simply called functions) and the relations between them (ports, shared variables, or events, depending on the kind of relationship). As in the Vulcan system [29], the use of a control/data flow graph for the behavioral model facilitates the partitioning at the operation level. This environment has been used successfully for the design of an intelligent sensor for pattern recognition [30]. The functional model provides an environment for behavioral and performance analysis in a technology- and language-independent manner that allows the implementation of the same functionality on diverse physical architectures [31]. Automatic synthesis can target hardware (VHDL descriptions) or RTOS primitives (VxWorks). A SystemC simulation engine and code generator is also available for system-on-chip (SoC) targets.
Acknowledgment

All the figures related to the CAP language are reprinted with permission from Dr. L. Foulloy, University of Savoy, France.
References

[1] N. Medvidovic and R.N. Taylor. A classification and comparison framework for software architecture description languages. IEEE Transactions on Software Engineering, 26: 70–93, 2000.
[2] M. Staroswiecki and M. Bayart. Models and languages for the interoperability of smart instruments. Automatica, 32: 859–873, 1996.
[3] N. Halbwachs. Synchronous Programming of Reactive Systems. Kluwer Academic Publishers, Dordrecht, 1993.
[4] E. Benoit, R. Dapoigny, and L. Foulloy. Fuzzy-based intelligent sensors: modelling, design, applications. In Proceedings of the 8th IEEE International Conference on Emerging Technologies and Factory Automation (ETFA 2001), Antibes, France, October 2001.
[5] E. Dekneuvel, M. Ghallab, and J.P. Thibault. Hypotheses management in a multi-sensory perception machine. In Proceedings of the 10th European Conference on Artificial Intelligence (ECAI), Vienna, Austria, August 1992.
[6] J.M. Riviere, M. Bayart, J.M. Thiriet, A. Bouras, and M. Robert. Intelligent instruments: some modeling approaches. Measurement and Control, 29: 179–186, 1996.
[7] S. Edwards, L. Lavagno, E.A. Lee, and A. Sangiovanni-Vincentelli. Design of embedded systems: formal models, validation and synthesis. Proceedings of the IEEE, 85: 366–390, 1997.
[8] D. Harel, H. Lachover, A. Naamad, A. Pnueli et al. STATEMATE: a working environment for the development of complex reactive systems. IEEE Transactions on Software Engineering, 16(4): 403–414, 1990.
[9] J.P. Calvez. Embedded Real-Time Systems: A Specification and Design Methodology. John Wiley & Sons, New York, 1993.
[10] A. Bouras and M. Staroswiecki. Building distributed architectures by the interconnection of intelligent instruments. In IFAC INCOM '98, Nancy, France, June 1998.
[11] J. Tailland, L. Foulloy, and E. Benoit. Automatic generation of intelligent instruments from internal model. In Proceedings of the International Conference SICICA 2000, Argentina, September 2000.
[12] S. Fleury, M. Herrb, and R. Chatila. Design of a modular architecture for autonomous robot. In Proceedings of the IEEE International Conference on Robotics and Automation, San Diego, CA, 1994.
[13] M. Staroswiecki, G. Hoblos, and A. Aitouche. Fault tolerance analysis of sensor systems. In Proceedings of the 38th IEEE Conference on Decision and Control, Phoenix, AZ, 1999.
[14] R.P. Kurshan. Automata-Theoretic Verification of Coordinating Processes. Princeton University Press, Princeton, NJ, 1994.
[15] J. Warrior. Smart sensor networks of the future. Sensors Magazine, March 1997, pp. 40–45.
[16] B.P. Douglass. Doing Hard Time: Developing Real-Time Systems with UML, Objects, Frameworks, and Patterns. Addison-Wesley, Reading, MA, 1999.
[17] F. Balarin, M. Chiodo, P. Giusto, H. Hsieh, A. Jurecska, L. Lavagno, C. Passerone, A. Sangiovanni-Vincentelli, E. Sentovich, K. Suzuki, and B. Tabbara. Hardware–Software Co-Design of Embedded Systems. Kluwer Academic Publishers, Dordrecht, 1997.
[18] G. Berry and G. Gonthier. The ESTEREL synchronous programming language: design, semantics, implementation. Science of Computer Programming, 19: 87–152, 1992.
[19] Esterel Studio. http://www.esterel-technologies.com
[20] C. André. Representation and analysis of reactive behaviors: a synchronous approach. In Proceedings of CESA '96, Lille, France, 1996, pp. 19–29.
[21] C. André, F. Boulanger, and A. Girault. A software implementation of synchronous programs. In Proceedings of the 2nd International Conference on Application of Concurrency to System Design (ICACSD 2001), Newcastle upon Tyne, UK, June 25–29, 2001, pp. 133–142.
[22] F. Cassez and O. Roux. Compilation of the ELECTRE reactive language into finite transition systems. Theoretical Computer Science, 146: 109–143, 1995.
[23] E. Benoit, J. Tailland, L. Foulloy, and G. Mauris. A software tool for designing intelligent sensors. In Proceedings of the IEEE Instrumentation and Measurement Technology Conference (IMTC/2000), Baltimore, MD, May 2000.
[24] G. Mauris, E. Benoit, and L. Foulloy. Ultrasonic smart sensors: the importance of the measurement principle. In Proceedings of the IEEE/SMC International Conference on Systems Engineering in the Service of Humans, Le Touquet, France, October 1993.
[25] EmbeddedJava and JavaCard. http://java.sun.com
[26] J. Rowson. Hardware/software co-simulation. In Proceedings of the Design Automation Conference, 1994, pp. 439–440.
[27] CoFluent Studio. http://www.cofluentdesign.com
[28] J.P. Calvez. A co-design case study with the MCSE methodology. Design Automation of Embedded Systems, Special Issue on Embedded Systems Case Studies, 1: 183–211, 1996.
[29] R.K. Gupta, C.N. Coelho, and G. De Micheli. Program implementation schemes for hardware/software systems. IEEE Computer, 27: 48–55, 1994.
[30] E. Dekneuvel, F. Muller, and T. Pitarque. Ultrasonic smart sensor design for a distributed perception system. In Proceedings of the 8th IEEE International Conference on Emerging Technologies and Factory Automation (ETFA), Antibes Juan-les-Pins, France, October 15–18, 2001.
[31] J.P. Calvez and O. Pasquier. Performance assessment of embedded HW/SW systems. In Proceedings of the International Conference on Computer Design, Austin, TX, October 1995.