Lecture Notes in Computer Science Edited by G. Goos, J. Hartmanis, and J. van Leeuwen
2981
Berlin Heidelberg New York Hong Kong London Milan Paris Tokyo
Christian Müller-Schloer Theo Ungerer Bernhard Bauer (Eds.)
Organic and Pervasive Computing – ARCS 2004 International Conference on Architecture of Computing Systems Augsburg, Germany, March 23-26, 2004 Proceedings
Series Editors: Gerhard Goos, Karlsruhe University, Germany; Juris Hartmanis, Cornell University, NY, USA; Jan van Leeuwen, Utrecht University, The Netherlands
Volume Editors: Christian Müller-Schloer, University of Hannover, Institute of Systems Engineering, System and Computer Architecture - SRA, Appelstr. 4, 30167 Hannover, Germany, E-mail:
[email protected] Theo Ungerer University of Augsburg Institute of Informatics, 86159 Augsburg, Germany E-mail:
[email protected] Bernhard Bauer University of Augsburg Department of Software Engineering and Programming Languages 86159 Augsburg, Germany E-mail:
[email protected]
Cataloging-in-Publication Data applied for. A catalog record for this book is available from the Library of Congress. Bibliographic information published by Die Deutsche Bibliothek: Die Deutsche Bibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data is available on the Internet.
CR Subject Classification (1998): C.2, C.5.3, D.4, D.2.11, H.3.5, H.4, H.5.2 ISSN 0302-9743 ISBN 3-540-21238-8 Springer-Verlag Berlin Heidelberg New York This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are liable for prosecution under the German Copyright Law. Springer-Verlag is a part of Springer Science+Business Media springeronline.com c Springer-Verlag Berlin Heidelberg 2004 Printed in Germany Typesetting: Camera-ready by author, data conversion by PTP-Berlin, Protago-TeX-Production GmbH Printed on acid-free paper SPIN: 10990674 06/3142 543210
Preface
Where is system architecture heading? The special interest group on Computer and Systems Architecture (Fachausschuss Rechner- und Systemarchitektur) of the German computer and information technology associations GI and ITG asked this question and discussed it during two Future Workshops in 2002. The result in a nutshell: Everything will change but everything else will remain. Future systems technologies will build on a mature basis of silicon and IC technology, on well-understood programming languages and software engineering techniques, and on well-established operating systems and middleware concepts. Newer and still exotic but exciting technologies like quantum computing and DNA processing are to be watched closely but they will not be mainstream in the next decade. Although there will be considerable progress in these basic technologies, is there any major trend which unifies these diverse developments? There is a common denominator – according to the result of the two Future Workshops – which marks a new quality. The challenge for future systems technologies lies in the mastering of complexity. Rigid and inflexible systems, built under a strict top-down regime, have reached the limits of manageable complexity, as has become obvious by the recent failure of several large-scale projects. Nature is the most complex system we know, and she has solved the problem somehow. We just haven’t understood exactly how nature does it. But it is clear that systems designed by nature, like an anthill or a beehive or a swarm of birds or a city, are different from today’s technical systems that have been designed by engineers and computer scientists. Natural systems are flexible, adaptive, and robust. They are in permanent exchange with their environment, respond to changes adequately, and are very successful in staying alive. It seems that also the traditional basic technologies have realized this trend. Hardware is becoming reconfigurable, software now updates itself to fulfill new requirements or replace buggy components, and small portable systems form ad hoc communities. Technical systems of this kind are called Organic Computer systems. The key challenge here will be to understand and harness self-organization and emergence. Organic Computing investigates the design and implementation of self-managing systems that are self-configuring, self-optimizing, self-healing, selfprotecting, context aware, and anticipatory. ARCS 2004 continued the biennial series of German Conferences on Architecture of Computing Systems. This seventeenth conference in the series served as a forum to present current work on all aspects of computer and systems architecture. The program committee of ARCS 2004 decided to devote this year’s conference to the trends in organic and pervasive computing. ARCS 2004 emphasized the design, realization, and analysis of the emerging organic and pervasive systems and their scientific, engineering, and commercial applications. The conference focused on system aspects of organic and pervasive computing in software and hardware. In particular, the system integration and
self-management of hardware, software, and networking aspects of up-to-now unconnected devices is a challenging research topic. Besides its main focus, the conference was open to more general and interdisciplinary themes in operating systems, networking, and computer architecture. The program reflected the main topics of the conference. The invited talk of Andreas Maier (IBM) presented the Autonomic Computing Initiative sparked by IBM, which has objectives similar to, but not identical with, Organic Computing. Erik Norden's (Infineon) presentation discussed multithreading techniques in modern microprocessors. The program committee selected 22 out of 50 submitted papers. We were especially pleased by the wide range of countries represented at the conference. The submitted paper sessions covered the areas Organic Computing, peer-to-peer computing, reconfigurable hardware, hardware, wireless architectures and networking, and applications. The conference would not have been possible without the support of a large number of people involved in the local conference organization in Augsburg, and the program preparation in Hannover. We want to extend our special thanks to the local organization at the University of Augsburg, Faruk Bagci, Jan Petzold, Matthias Pfeffer, Wolfgang Trumler, Sascha Uhrig, Brigitte Waimer-Eichenauer, and Petra Zettl, and in particular to Fabian Rochner of the University of Hannover, who managed and coordinated the work of the program committee with admirable endurance and great patience.
February 2004
Christian Müller-Schloer   Theo Ungerer   Bernhard Bauer
Organization
Executive Committee
General Chair: Theo Ungerer, University of Augsburg
General Co-chair: Bernhard Bauer, University of Augsburg
Program Chair: Christian Müller-Schloer, University of Hannover
Workshop and Tutorial Chair: Uwe Brinkschulte, University of Karlsruhe (TH)
Program Committee
Dimiter Avresky – Northeastern University, Boston, USA
Nader Bagherzadeh – University of California Irvine, USA
Bernhard Bauer – University of Augsburg, Germany
Jürgen Becker – University of Karlsruhe, Germany
Michael Beigl – TecO, Karlsruhe, Germany
Frank Bellosa – University of Erlangen, Germany
Arndt Bode – Technical University of München, Germany
Gaetano Borriello – University of Washington, USA
Uwe Brinkschulte – University of Karlsruhe, Germany
Francois Dolivo – IBM, Switzerland
Kemal Ebcioglu – IBM T.J. Watson, Yorktown Heights, USA
Reinhold Eberhart – DaimlerChrysler Research, Ulm, Germany
Werner Erhard – Friedrich Schiller University of Jena, Germany
Hans Eveking – TU Darmstadt, Germany
Hans-W. Gellersen – University of Lancaster, UK
Werner Grass – University of Passau, Germany
Wolfgang Karl – University of Karlsruhe, Germany
Jürgen Kleinöder – University of Erlangen-Nürnberg, Germany
Rudolf Kober – Siemens AG, München, Germany
Erik Maehle – University of Lübeck, Germany
Christian Müller-Schloer – University of Hannover, Germany
Jörg Nolte – TU Cottbus, Germany
Wolfgang Rosenstiel – University of Tübingen, Germany
Burghardt Schallenberger – Siemens AG, München, Germany
Alexander Schill – Technical University of Dresden, Germany
Hartmut Schmeck – University of Karlsruhe, Germany
Albrecht Schmidt – LMU, München, Germany
Karsten Schwan – Georgia Tech, USA
Rainer G. Spallek – TU Dresden, Germany
Peter Steenkiste – Carnegie-Mellon University, USA
Djamshid Tavangarian – University of Rostock, Germany
Rich Uhlig – Intel Microprocessor Research Lab, USA
Theo Ungerer – University of Augsburg, Germany
Klaus Waldschmidt – University of Frankfurt, Germany
Lars Wolf – University of Braunschweig, Germany
Hans Christoph Zeidler – University Fed. Armed Forces, Germany
Martina Zitterbart – University of Karlsruhe, Germany
Additional Reviewers
Klaus Robert Müller – University of Potsdam
Christian Grimm – University of Hannover
Local Organization
Bernhard Bauer, Faruk Bagci, Jan Petzold, Matthias Pfeffer, Wolfgang Trumler, Sascha Uhrig, Theo Ungerer, Brigitte Waimer-Eichenauer, Petra Zettl – all University of Augsburg

Program Organization
Fabian Rochner – University of Hannover
Supporting/Sponsoring Societies
The conference was organized by the special interest group on Computer and Systems Architecture of the GI (Gesellschaft für Informatik – German Informatics Society) and the ITG (Informationstechnische Gesellschaft – Information Technology Society), supported by CEPIS and EUREL, and held in cooperation with IFIP, ACM, and IEEE (German section).
Sponsoring Company
Table of Contents
Invited Program Keynote: Autonomic Computing Initiative . . . . . . . . . . . . . . . . . . . . . . . . . . . Andreas Maier
3
Keynote: Multithreading for Low-Cost, Low-Power Applications . . . . . . . . Erik Norden
4
I Organic Computing The SDVM: A Self Distributing Virtual Machine for Computer Clusters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jan Haase, Frank Eschmann, Bernd Klauer, Klaus Waldschmidt Heterogenous Data Fusion via a Probabilistic Latent-Variable Model . . . . Kai Yu, Volker Tresp
9
20
Self-Stabilizing Microprocessor (Analyzing and Overcoming Soft-Errors) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Shlomi Dolev, Yinnon A. Haviv
31
Enforcement of Architectural Safety Guards to Deter Malicious Code Attacks through Buffer Overflow Vulnerabilities . . . . . . . . . . . . . . . . . Lynn Choi, Yong Shin
47
II Peer-to-Peer Latent Semantic Indexing in Peer-to-Peer Networks . . . . . . . . . . . . . . . . . . . Xuezheng Liu, Ming Chen, Guangwen Yang
63
A Taxonomy for Resource Discovery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Koen Vanthournout, Geert Deconinck, Ronnie Belmans
78
Oasis: An Architecture for Simplified Data Management and Disconnected Operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Anthony LaMarca, Maya Rodrig
92
Towards a General Approach to Mobile Profile Based Distributed Grouping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107 Christian Seitz, Michael Berger
III Reconfigurable Hardware A Dynamic Scheduling and Placement Algorithm for Reconfigurable Hardware . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125 Ali Ahmadinia, Christophe Bobda, J¨ urgen Teich Definition of a Configurable Architecture for Implementation of Global Cellular Automaton . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140 Christian Wiegand, Christian Siemers, Harald Richter RECAST: An Evaluation Framework for Coarse-Grain Reconfigurable Architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156 Jens Braunes, Steffen K¨ ohler, Rainer G. Spallek
IV Hardware Component-Based Hardware-Software Co-design . . . . . . . . . . . . . . . . . . . . . . 169 ´ am Mann, Andr´ P´eter Arat´ o, Zolt´ an Ad´ as Orb´ an Cryptonite – A Programmable Crypto Processor Architecture for High-Bandwidth Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184 Rainer Buchty, Nevin Heintze, Dino Oliva STAFF: State Transition Applied Fast Flash Translation Layer . . . . . . . . . 199 Tae-Sun Chung, Stein Park, Myung-Jin Jung, Bumsoo Kim Simultaneously Exploiting Dynamic Voltage Scaling, Execution Time Variations, and Multiple Methods in Energy-Aware Hard Real-Time Scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213 Markus Ramsauer V Wireless Architectures and Networking Application Characterization for Wireless Network Power Management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231 Andreas Weissel, Matthias Faerber, Frank Bellosa Frame of Interest Approach on Quality of Prediction for Agent-Based Network Monitoring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 246 Stefan Schulz, Michael Schulz, Andreas Tanner Bluetooth Scatternet Formation – State of the Art and a New Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 260 Markus Augel, Rudi Knorr
A Note on Certificate Path Verification in Next Generation Mobile Communications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273 Matthias Enzmann, Elli Giessler, Michael Haisch, Brian Hunter, Mohammad Ilyas, Markus Schneider
VI Applications The Value of Handhelds in Smart Environments . . . . . . . . . . . . . . . . . . . . . . 291 Frank Siegemund, Christian Floerkemeier, Harald Vogt Extending the MVC Design Pattern towards a Task-Oriented Development Approach for Pervasive Computing Applications . . . . . . . . . . 309 Patrick Sauter, Gabriel V¨ ogler, G¨ unther Specht, Thomas Flor Adaptive Workload Balancing for Storage Management Applications in Multi Node Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 322 Jens-Peter Akelbein, Ute Schr¨ ofel
Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 339
Keynote: Autonomic Computing Initiative
Andreas Maier
IBM Lab Böblingen
[email protected]
Abstract. Autonomic computing systems have the ability to manage themselves and dynamically adapt to change in accordance with business policies and objectives. Self-managing environments can perform such activities based on situations they observe or sense in the IT environment, rather than requiring IT professionals to initiate the tasks. Autonomic computing is important today because the cost of technology continues to decrease yet overall IT costs do not. With the expense challenges that many companies face, IT managers are looking for ways to improve the return on investment of IT by reducing total cost of ownership, improving quality of service, accelerating time to value and managing IT complexity. The presentation will outline where IBM comes from with its autonomic computing initiative and what has been achieved to date.
Keynote: Multithreading for Low-Cost, Low-Power Applications
Erik Norden
Senior Architect, Infineon Technologies, Balanstrasse 73, 81541 Munich, Germany
[email protected]
Abstract. Innovative architectural design may be the best route to create an economical and efficient balance between memory and logic elements in cost- and power-sensitive embedded solutions. When system prices are measured in a few euros instead of a few hundred, the large, power-intensive and costly memory hierarchy solutions typically used in computer and communications applications are impractical. A multithreading extension to the microprocessor core is, especially for deeply embedded systems, a more effective approach. Infineon has developed a new processor solution: TriCore 2. It is the second generation of the TriCore Unified Processor architecture. TriCore 2 contains, among others, a block multithreading solution, which responds to the blocking code memory latency in one thread by executing the instructions of a second thread. In this way, the execution pipelines of the processor can be almost fully utilized. From the user programming model, each thread can be seen as one virtual processor. A typical scenario is a cell phone. Here, generally external 16-bit flash memories with a speed of 40 MHz are used, while today's performance requirements expect processors with a clock speed of 300-400 MHz. Because of this discrepancy, up to 80% of the performance can be lost, despite caches. Larger cache sizes and multi-level memory solutions are not applicable for cost reasons. Block multithreading allows system designers to use comparatively smaller instruction caches and slow external memory while still getting the same overall performance. The performance degradation in the cell phone example can be almost eliminated. Even the clock frequency can be reduced. Block multithreading is very efficient for a general CPU-based application residing in cache memory and an algorithmic application in the local on-chip memory. This is a characteristic which many deeply embedded processor applications have. Effectively, a separate DSP and CPU can be replaced by a multithreaded hybrid to reduce chip area, tool costs etc. The block multithreading solution also supports a fast interrupt response, required for most deeply embedded applications. The additional costs for this multithreading solution are small. The implementation in TriCore 2 requires a chip area of only 0.3 mm² in 0.13 micron technology. The most obvious costs are caused by the duplicated register files to eliminate the penalty for task switching. Instruction cache
and fetch unit need to support multithreading, but the overhead is low. The same applies to the other affected areas: traps and interrupt handling, virtual memory, and debug/trace. Apart from multithreading, TriCore 2 also has other highlights. Advanced pipeline technology allows high instruction-per-cycle (IPC) performance while reaching higher frequencies (400-600 MHz typical) and complying with demanding automotive requirements. The center of the processor's hierarchical memory subsystem is an open, scalable crossbar architecture, which provides a method for efficient parallel communication to code and data memory including multiprocessor capability. This presentation will describe the root problems in low-cost, low-power embedded systems that require a multithreaded processor solution. Since the core architecture is well-deployed in the demanding automotive market, the first implementation is specified for these requirements, such as quality and determinism. Working silicon with this implementation is expected for the first quarter of 2004 and will be used as a demonstrator.
The SDVM: A Self Distributing Virtual Machine for Computer Clusters Jan Haase, Frank Eschmann, Bernd Klauer, and Klaus Waldschmidt J.W.Goethe-University, Technical Computer Sc. Dep., Box 11 19 32, D-60054 Frankfurt, Germany {haase|eschmann|klauer|waldsch}@ti.informatik.uni-frankfurt.de
Abstract. Computer systems of the future will consist more and more of autonomous and cooperative system parts and will behave in a self-organizing way. Self-organization is mainly characterized by adaptive and context-oriented behaviour. The Self Distributing Virtual Machine (SDVM) is an adaptive, self-configuring and self-distributing virtual machine for clusters of heterogeneous, dynamic computer resources. In this paper the concept and features of the implemented prototype are presented.
1 Introduction
State-of-the-art computer systems are mostly used to run sequential programs, though client-server systems and parallelizing compilers are increasingly popular. So far, many parallel programs run on dedicated parallel computing machines. Unfortunately, these multi-processor systems cannot be adapted to all needs of problems or algorithms to turn the inherent or explicit parallelism into efficiency. Therefore large clusters of PCs become important for parallel applications. These clusters show a huge dynamic and heterogeneity, with a structure that is usually unknown at the compile time of the application. Furthermore, the CPUs of the cluster have strongly changing loads, nodes are more or less specific, and the network is spontaneous with vanishing and appearing resources. Besides, the reuse of existing hardware would be more economic and cost-efficient. Thus, mechanisms have to be developed to perform the (efficient) distribution of code and data. They should be executable on arbitrary machines, regardless of different processors and/or operating systems. Computers that spend part of their machine time on the parallel calculations need applications running silently in the background without user interactions – like a UNIX daemon. These computers should not be burdened too much by the background process, to keep the foreground processes running smoothly. Full machine time must be available on demand for foreground processes, so the background process must support a shutdown at any time – or at least a nice feature to calm down its machine time consumption. Similarly the background
Parts of this work have been supported by the Deutsche Forschungsgemeinschaft (DFG).
process should be started by the user when idle times are expected for a while. So it has to reintegrate into the cluster without the need for a restart of the parallel application. The cluster would then be self-configuring regarding its composition. Some batch processing systems, e.g. Condor [1], feature cycle harvesting – a method to identify idle times of the participating sites. These systems are based on a client-server structure, in which the server decides centrally about the distribution and execution of the scheduled jobs. As all sites have to report to the server, the communication channel tends to be the bottleneck of the system – especially in a loosely coupled cluster. The SDVM works without a client-server structure on a peer-to-peer basis, so neighborhood relations are automatically strengthened. Due to the size of large computer clusters the probability of hardware failures needs to be considered. Large clusters are hardly usable unless a concept for self-healing of the cluster is available, making the parallel execution immune to hardware failures. To address these problems, the Self Distributing Virtual Machine (SDVM) has been developed and prototypically implemented in C++ by the authors.
1.1 Goals
The SDVM has been developed to introduce the following features into standard computer clusters.
– self configuration: The system should adapt itself to different environments. Signing in and signing off at runtime should be available for all machines in the cluster.
– adaptivity: The cluster should cope with heterogeneous computer systems, so computers with different speeds and even different operating systems can participate in an SDVM-controlled cluster.
– self distribution: Intelligent distributed scheduling and migration schemata should ensure that data and instruction code are distributed among the cluster engines automatically at a minimum cost of machine time.
– self healing: Crashes should be detected, handled and corrected by the SDVM automatically.
– automatic parallelization: Hidden parallelism in programs should be detected and exploited.
– self protection and protection of others: For a cluster working within an open network it is important to consider safety aspects. The authenticity of users and data has to be ensured, and sabotage, spying and corruption have to be prevented.
– accounting: Mechanisms and cost functions for accounting of provided and used computing time should be offered.
The first three of these goals are addressed in this first paper about the SDVM.
1.2 Possible Fields of Application
The SDVM can in the first instance be seen as a cheap opportunity to shorten the runtime of a program by parallelization. Due to self configuration (here: configuration of the computing cluster) it can be decided at run time to enlarge the cluster to get more performance or to downsize the cluster if participants are needed elsewhere. For example the use of a company's workstation cluster at lunch time or at night is imaginable. Applied to the internet, the SDVM can be used to solve complex problems by collaboration of many computers, such as Seti@Home [2]. In this way computers which are currently on the night hemisphere of the earth can join an SDVM cluster and sign off in the morning. In contrast to the internet solution, a monolithic version can be implemented. In this case the SDVM can be seen as a concept for self-distributing scheduling of calculations on a given multiprocessor system.
1.3 VMs
Generally, a virtual machine is a software layer which emulates the functionality of a certain machine or processor on another machine. Usually the user does not notice the emulation. Virtual machines are used, for example, to encapsulate program execution for security and supervision reasons, so that the application only gets access to virtual resources, and an access is not forwarded to the real resources unless it has been checked and granted. The meanwhile widely used Parallel Virtual Machine (PVM) [3] is essentially a collection of procedures and functions to simplify the usage of heterogeneous computer clusters. On each participating computer a daemon (pvmd) has to be started, which performs communication with other pvmd's on different cluster computers via special functions. To run a PVM, a "host pool" has to be configured before run time, with a constant total of machines and a fixed communication infrastructure. The SDVM, in contrast, allows signing in and out of computers at run time, without the need to know before run time which computers will participate.
2 Concepts
2.1 Microframes and Microthreads
The SDVM is based on the concept of the Cache Only Memory Architectures (COMAs) [4]. In a COMA architecture a processor which needs specific data, looks for it in its local memory. The local memory is active and checks whether the data is locally available. In case of a cache hit it returns the data to the processor. In case of a cache miss it autonomously connects to another computer in the COMA cluster, and asks for the data. The answer of this query will be written into the local memory and propagated to the processor. Thus the data access on COMA clusters is done transparently for the application.
Fig. 1. Microframe and Microthread (the figure shows a microframe holding input parameters, IDs and target addresses, linked to a microthread containing C code for a routine double romberg(double a, double b, int N))
The SDVM extends this concept such that, besides the transparent data migration, program instructions can also migrate transparently. The origin for the development of the SDVM is the SDAARC architecture [5,6,7]. Program instructions are represented by microthreads. A microthread contains a code fragment (compiled for each computer architecture), but it lacks its start arguments for execution. The SDVM generally distinguishes two types of data: global data and microframes. Global data in terms of COMA are spread over the sites participating in the SDVM cluster. Microframes, as a special case of global data, are containers for the arguments which a microthread needs to execute. They contain the code fragment, a pointer to the owning microthread, and addresses of microframes which need the results of the microthread for their own execution. As soon as a microframe has all its arguments, it becomes executable and the corresponding microthread can be executed with the arguments of the microframe as parameters – thus the execution is triggered using dataflow synchronisation. Several microframes can point to the same microthread (n-to-1 relation). Both microframe and microthread each have an unambiguous identifier. Figure 1 shows a microframe and its corresponding microthread. As we do not yet have a compiler to automatically split up an application to be run on the SDVM into microthreads, currently the user has to do this by hand. However, this is not like typical parallel programming, since the user does not have to worry about the binding of the microthreads to specific sites – this is done automatically by the SDVM at runtime.
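To make this description concrete, the following sketch shows one way the two data structures could be represented. It is our own illustration, not the SDVM's actual C++ classes: all names are invented, and the arguments are simplified to doubles.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Illustrative sketch only -- the real SDVM classes are not given in the paper.
using ObjectId = std::uint64_t;                 // unambiguous identifier

struct MicroThread {
    ObjectId id;
    std::vector<std::byte> code;                // code fragment, compiled per architecture
};

struct MicroFrame {
    ObjectId id;
    ObjectId owningThread;                      // pointer to the owning microthread
    std::vector<std::optional<double>> args;    // input parameters, filled in at run time
    std::vector<ObjectId> targetFrames;         // microframes that need this thread's results

    // Dataflow synchronisation: the frame becomes executable once all arguments arrived.
    bool executable() const {
        for (const auto& a : args)
            if (!a.has_value()) return false;
        return true;
    }
};
```

Several microframes may reference the same owningThread id, which reflects the n-to-1 relation mentioned above.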
2.2 Execution Layer
The SDVM is structured in two layers: an execution and a communication layer. Both layers consist of several modules ("managers"), which communicate via method calls. The main part of the execution layer (Fig. 2) is the scheduling manager. It manages the executable microframes and tells the code manager to provide the corresponding microthreads. The code manager loads the microthread instructions from hard disk or obtains them from the code managers of other sites. Then it calculates the correct start address of the microthread code and returns it to the scheduling manager. If the microthread does not exist in a compatible binary format, it is created by retrieving the C source code of the microthread and compiling it with the locally installed C compiler. The processing manager supervises the actual execution by getting the microframe and corresponding microthread from the scheduling manager if it has free resources. It reads the start parameters for the microthread out of the microframe and starts the execution of the microthread at the start address provided by the scheduling manager. During the execution of the microthread the processing manager can read and write the global data memory. The results of the execution generate the arguments for further, not yet executable microframes. The processing manager passes the results to the attraction memory (which behaves like the attraction memory in COMA [4]) together with the addresses of the microframes which are waiting for these results. The attraction memory writes the results to the destination addresses and checks whether further microframes have now received all their arguments and thus have become executable. In this case the executable microframes are sent to the scheduling manager.
Fig. 2. The execution layer of the SDVM
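The firing rule just described - a result arrives, the attraction memory fills the waiting argument slots, and frames that have become complete are handed to the scheduling manager - might look roughly like this sketch (invented names, not SDVM code):

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

using ObjectId = std::uint64_t;

struct Frame {
    std::vector<bool>   argPresent;   // which input parameters have already arrived
    std::vector<double> args;         // the parameters themselves (simplified to doubles)
    bool complete() const {
        for (bool p : argPresent) if (!p) return false;
        return true;
    }
};

struct SchedulingManager {
    std::vector<ObjectId> executable; // microframes that are ready to run
    void submit(ObjectId frame) { executable.push_back(frame); }
};

struct AttractionMemory {
    std::map<ObjectId, Frame> frames;
    SchedulingManager* scheduler = nullptr;

    // Called by the processing manager: a result is delivered to a waiting microframe.
    void deliverResult(ObjectId target, std::size_t slot, double value) {
        Frame& f = frames[target];
        f.args[slot]       = value;
        f.argPresent[slot] = true;
        if (f.complete())                      // dataflow synchronisation
            scheduler->submit(target);
    }
};
```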
2.3 Communication Layer
The message manager is the main part of the communication layer (see Fig. 3). It is the contact point for all managers which need to communicate with other SDVM sites. A manager creates a message in a standardized format (SDMessage) and hands it over to the message manager. An SDMessage includes, besides the destination site address and the destination manager id, also the site address and the manager id of the originator. It also contains an unambiguous identifier and the referenced data itself. The SDVM provides two communication modes: synchronous and asynchronous. If a message is sent synchronously, the originator waits and freezes until it receives an answer. The message manager sends the message and scans all incoming messages for the answer message of the synchronous communication. This answer is returned to the waiting manager directly as return value. Synchronous messages are especially intended for fast requests, so they get preferred delivery, to freeze the originator as briefly as possible. An asynchronous message does not need an answer message. Asynchronous messages are just delivered to the message-in interface of a manager without freezing the originator. The message manager provides message bundling for messages to the same target site to save communication overhead. For this feature the message manager collects messages within a certain time window and then communicates the compressed message bundle. Synchronous messages are considered to be urgent, so they will be sent immediately. To send a message, the message manager needs to know the physical (IP) address of the site. As all sites have a uniform logical address (site id) obtained from the sign-in procedure, lookups are inevitable. This is done by the cluster manager, which provides a list with logical and physical addresses of known sites. The message dispatching is performed by the network manager, which sends the message over the network to the destination site. After reception the message is processed by the remote network manager and forwarded to its local message manager for further distribution to the target manager. The structure of a site is shown in Figure 3.
Fig. 3. Within a site all managers communicate via the message manager with other managers on other sites.
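A possible shape of the standardized message format described above is sketched below; the field names and types are assumptions on our part, not the SDVM's actual SDMessage layout:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative sketch of an SDMessage as described in the text.
struct SDMessage {
    std::uint32_t destSite;          // logical address of the destination site
    std::uint32_t destManager;       // id of the destination manager
    std::uint32_t srcSite;           // logical address of the originator
    std::uint32_t srcManager;        // id of the originating manager
    std::uint64_t messageId;         // unambiguous identifier, used to match answer messages
    bool          synchronous;       // synchronous messages bypass bundling and are sent at once
    std::vector<std::byte> payload;  // the referenced data itself
};
```

Asynchronous messages to the same target site could then simply be collected in a per-site buffer during the bundling time window, while a synchronous send would dispatch immediately and block until a message with the matching messageId comes back.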
Fig. 4. Communication between sites
Fig. 5. Homesite directories contain pointers to locally created data objects (microframes or global data) and all copies, if there are some. If an object migrates, its homesite updates the corresponding entry.
2.4 Communication between Two Sites
Components of distributed systems have to decide whether data is stored locally or on a remote site. The SDVM is provided with homesite directories to solve this problem. A homesite directory (see Fig. 5) contains information regarding all locally created objects, i.e. microframes and global data. The global address of any of these objects includes the logical site address of the site it was created on. This site is called the homesite of the data object. When one of these objects is copied or migrated, the new location of the object or of its copies is written to the homesite directory. Therefore the object can be located at any time by querying its homesite. The sequence of an access to any object o is as follows, regarding site s (a code sketch follows at the end of this subsection):
1. On site s an access to object o is necessary.
2. The attraction memory on site s starts a search in the local memory. If o is found, the access can be completed locally.
3. If o is not found on site s, the attraction memory reads the homesite address from o (which is coded into the global address of o), say h.
4. Site s sends a message to site h containing a request for object o.
5. Now site h searches in its local memory. If o is found, it is sent to site s and the address of s is written to the homesite directory.
6. If o was not found on site h either, the attraction memory scans its homesite directory to determine on which site o currently resides, say r.
7. Site h informs site s that the object o resides on site r.
8. Site s sends a message to site r containing a request for object o.
9. Site r scans its local memory, locates o and sends it to site s.
10. Finally site s informs the homesite h that object o now resides on site s, and h updates the corresponding entry in its homesite directory.
It may happen that object o is migrated from site r to yet another site while site s gets the information that o resides on site r. In this case site r would also return an error, and site s would have to query the homesite h again, which in the meantime has updated its homesite directory. If object o has copies, the homesite has to be informed whenever there is a write access to o. The homesite then forwards the new value to all other sites storing a copy of o to keep the copies consistent. To save communication cost, the SDVM offers the option not to pull the object o to the writing site but to send the data to be written to the site where o resides. In this case o simply stays there. This may make sense in cases when one site writes many times to object o and another site writes only a few times.
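The ten-step lookup can be condensed, from the point of view of the requesting site s, into the following sketch. It is an illustration only: the message round-trips are reduced to placeholder functions, and coding the homesite into the upper bits of the global address is an assumption of ours.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <optional>
#include <vector>

using ObjectId = std::uint64_t;
using SiteId   = std::uint32_t;

struct Object { std::vector<std::byte> data; };

// The homesite is coded into the global address of every object (bit layout assumed).
SiteId homesiteOf(ObjectId o) { return static_cast<SiteId>(o >> 48); }

struct Site {
    SiteId id;
    std::map<ObjectId, Object> localMemory;        // contents of the attraction memory
    std::map<ObjectId, SiteId> homesiteDirectory;  // locations of locally created objects

    std::optional<Object> access(ObjectId o) {
        auto hit = localMemory.find(o);            // steps 1-2: local search
        if (hit != localMemory.end()) return hit->second;

        SiteId h = homesiteOf(o);                  // step 3: homesite from the global address
        SiteId r = askHomesiteForLocation(h, o);   // steps 4-7: query the homesite
        Object obj = fetchFrom(r, o);              // steps 8-9: fetch from the current location
        informHomesite(h, o, id);                  // step 10: homesite updates its directory
        localMemory[o] = obj;
        return obj;
    }

    // Placeholders standing in for the real inter-site messages:
    SiteId askHomesiteForLocation(SiteId h, ObjectId) { return h; }   // pretend o is still at h
    Object fetchFrom(SiteId, ObjectId)                { return {}; }
    void   informHomesite(SiteId, ObjectId, SiteId)   {}
};
```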
2.5 Add Sites to or Remove Sites from the Cluster at Run Time
To join a running SDVM cluster, a SDVM daemon has to be started on a machine and the physical (IP-) address of an active member of the cluster has to be given by the user (or a config file etc.). The SDVM daemon builds a new site which signs on to the given cluster-member. From there it gets its own logical site address. Then the new site is ready and asks for work. When the first executable microframe with the corresponding microthread is transferred to the new site, the new site is a member of the cluster. In this way the SDVM-cluster can be expanded arbitrarily at run time. If a machine is needed for another task, the local site s can vanish from the SDVM-cluster at any time. To do that, all locally stored microframes and data must be transferred to another member of the cluster. Moreover, another site t has to be declared the new homesite of this data. Site t receives the homesite directory of site s, includes it into its own homesite directory and from now on site t has more than one logical site address - its own and the newly received. Finally, all other sites are informed that the physical site address of site s has changed: The cluster managers receive the new physical site address of the new
homesite, so all queries and communications to site s are redirected to site t automatically. If a machine suddenly fails, more precisely if it appears unreachable without having performed a regular sign-off, a crash situation has occurred. For this case, the SDVM prototype contains checkpointing, which saves all microframes and global data to a site in regular intervals. If a crash occurs, first of all the program has to be stopped by informing all involved sites. These delete all corresponding microframes and global data. Then the site with the most recent checkpoint is determined, and this site copies all saved data concerning this checkpoint to its global memory. Then it restarts execution of the program normally, and all other sites which are idle ask for work automatically, so that after a short period of time the program is distributed again.
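The address redirection performed on a regular sign-off - the leaving site s hands its data and homesite directory to a site t, and t additionally adopts s's logical address - can be pictured with this small sketch of a cluster manager's address table (invented names, not SDVM code):

```cpp
#include <cstdint>
#include <map>
#include <string>

using SiteId = std::uint32_t;

struct ClusterManager {
    std::map<SiteId, std::string> physicalAddress;  // logical site id -> physical (IP) address

    // After site t has taken over the objects of the leaving site s, every cluster
    // manager maps s's logical address to t's physical address, so all further
    // queries and communications to s are redirected to t automatically.
    void siteSignedOff(SiteId s, SiteId t) {
        physicalAddress[s] = physicalAddress[t];
    }
};
```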
3 First Results
The SDVM needs a lot of calculations and communications to distribute code and data. Therefore a question is whether the additional overhead is small enough to justify the concept. To answer this question, an example SDVM program was developed that is easily parallelizable. The example program computes an integral using the Romberg numerical algorithm [8]. This algorithm partitions the area to be measured into several portions of equal width. Those can be measured independently and the results added up eventually. First of all, it shall be demonstrated how much overhead is generated by using the SDVM. To show this, run times on a stand-alone SDVM site are compared with the run times of a corresponding sequential program. This overhead appears to be about 3%.

width  accuracy  SDVM (1 site)  sequential program
100    20        129s           125s
150    20        200s           196s

In the next step, it has to be shown that the speedup is in the expected range. On a cluster of identical machines (Pentium IV, 1700 MHz), the speedup is:

width  accuracy  site A  site A+B (speedup)  site A+B+C (speedup)
100    20        119s    63s (1.89)          43s (2.77)
150    20        179s    96s (1.86)          69s (2.59)

Finally, the program was executed on three different machines respectively, and the execution times listed.

width  accuracy  site D stand-alone  site E stand-alone  site F stand-alone
100    20        107s                129s                281s
150    20        163s                200s                426s
The execution times of different numbers and combinations of machines used for the calculation are compared with the previous results. The machine on which the program was started first is listed first; the others were added to the cluster at run time and so helped the first machine.

width  accuracy  site D+E  site D+E+F  site F+D+E
100    20        56s                   62s
150    20        101s                  80s

Obviously, the execution of the program started on the slowest machine (F) can be accelerated very much, because the helpers are faster than F.
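For reference, the work splitting behind these measurements can be sketched as follows. The romberg routine is a plain textbook version written by us (the paper's Fig. 1 shows only a fragment of the SDVM variant), and the integrand f1 is an arbitrary stand-in; in the SDVM each sub-interval would be computed by its own microframe/microthread and the partial results added at the end.

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// The integrand is an arbitrary stand-in; the paper does not define f1.
double f1(double x) { return std::exp(-x * x); }

// Minimal textbook Romberg integration [8] of f1 over [a, b] with N extrapolation steps.
double romberg(double a, double b, int N) {
    std::vector<std::vector<double>> T(N + 1, std::vector<double>(N + 1, 0.0));
    double h = b - a;
    T[0][0] = h * (f1(a) + f1(b)) / 2.0;
    for (int i = 1; i <= N; ++i) {
        double sum = 0.0;
        long points = 1L << (i - 1);                      // new midpoints at this level
        for (long k = 0; k < points; ++k)
            sum += f1(a + (k + 0.5) * h);
        T[i][0] = 0.5 * (T[i - 1][0] + h * sum);
        h /= 2.0;
        for (int j = 1; j <= i; ++j)                      // Richardson extrapolation
            T[i][j] = T[i][j - 1] + (T[i][j - 1] - T[i - 1][j - 1]) / (std::pow(4.0, j) - 1.0);
    }
    return T[N][N];
}

int main() {
    // Split the total interval into portions of equal width; in the SDVM each portion
    // would become one independent piece of work and the partial sums would be added.
    const double a = 0.0, b = 4.0;
    const int portions = 8, N = 10;
    double total = 0.0;
    for (int p = 0; p < portions; ++p) {
        double left  = a + (b - a) * p / portions;
        double right = a + (b - a) * (p + 1) / portions;
        total += romberg(left, right, N);
    }
    std::printf("integral approx. %.10f\n", total);
    return 0;
}
```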
4 Conclusion and Future Prospects
The SDVM is a platform to pool any computers in a dynamic cluster and run different microthreads, i.e. fragmented programs. The distribution of the microthreads is done automatically and adapts to given constraints. For working in the background, the SDVM is developed as a daemon. In a first implementation, the goals of self-distribution (microframes migrate to idle computers on their own), self-configuration (computers can sign on and sign off at the cluster at run time) and adaptivity (different hardware for the participating machines is supported) are reached. Furthermore, the concurrent execution of several SDVM programs is supported. Presently, the integration of a security concept (self-protection) and crash management (self-healing) is in preparation. Future versions could provide self-distributing in-/output, e.g. for users who access the same cluster from different terminals, or for a chat program. Also, it is possible to integrate other transfer protocols - or a protocol specifically designed for the SDVM - beside or instead of the currently used TCP/IP, to address routing characteristics of the cluster or even optimize the routing dynamically. Furthermore, a compiler may be created which partitions any sequential program automatically and thus completes the path from a sequential program to a self-distributing program executed in parallel.
References
1. Litzkow, M., Livny, M., Mutka, M.: Condor - a hunter of idle workstations. In: Proceedings of the 8th International Conference on Distributed Computing Systems (1988)
2. Sullivan III, W.T., Werthimer, D., Bowyer, S., Cobb, J., Gedye, D., Anderson, D.: A new major SETI project based on Project Serendip data and 100,000 personal computers. In: Cosmovici, C., Bowyer, S., Werthimer, D. (eds.): Astronomical and Biochemical Origins and the Search for Life in the Universe, Proceedings of the Fifth International Conference on Bioastronomy. Editrice Compositori, Bologna, Italy (1997), http://setiathome.ssl.berkeley.edu/woody_paper.html
3. Sunderam, V.S.: PVM: A framework for parallel distributed computing. Concurrency: Practice and Experience (1990) 315-339
4. Hagersten, E., Landin, A., Haridi, S.: DDM - A Cache-Only Memory Architecture. IEEE Computer 25 (1992) 44-54
5. Moore, R., Klauer, B., Waldschmidt, K.: The SDAARC architecture. In: Proceedings of the Ninth Euromicro Workshop on Parallel and Distributed Processing (PDP 2001), Mantova, Italy, IEEE Computer Society Press (2001) 429-435, http://www.ti.informatik.uni-frankfurt.de/Papers/Adarc/mantova.pdf
6. Moore, R., Klauer, B., Waldschmidt, K.: Tailoring a self-distributing architecture to a cluster computer environment. In: 8th Euromicro Workshop on Parallel and Distributed Processing (EURO-PDP 2000), Rhodes, Greece, IEEE Computer Society Press (2000), http://www.ti.informatik.uni-frankfurt.de/Papers/Adarc/rhodos.pdf
7. Eschmann, F., Klauer, B., Moore, R., Waldschmidt, K.: SDAARC: An Extended Cache-Only Memory Architecture. IEEE Micro 22 (2002) 62-70
8. Dahlquist, G., Björck, A.: Numerical Methods. Prentice Hall, Englewood Cliffs, NJ (1974)
Heterogenous Data Fusion via a Probabilistic Latent-Variable Model
Kai Yu and Volker Tresp
Information and Communications, Siemens Corporate Technology, Otto-Hahn-Ring 6, 81730 Munich, Germany
Abstract. In a pervasive computing environment, one is facing the problem of handling heterogeneous data from different sources, transmitted over heterogeneous channels and presented on heterogeneous user interfaces. This calls for adaptive data representations that keep as much relevant information as possible while keeping the representation as small as possible. Typically, the gathered data can be high-dimensional vectors with different types of attributes, e.g. continuous, binary and categorical data. In this paper we present - as a first step - a probabilistic latent-variable model, which is capable of fusing high-dimensional heterogeneous data into a unified low-dimensional continuous space, and thus brings great benefits for multivariate data analysis, visualization and dimensionality reduction. We adopt a variational approximation to the likelihood of the observed data and describe an EM algorithm to fit the model. The advantages of the proposed model are illustrated on toy data, and the model is applied to real-world painting image data for both visualization and recommendation.
1 Introduction
Among others, pervasive computing will be characterized by the processing of heterogeneous and high-dimensional data. For example, results provided by Internet search engines may contain text, pictures, hyperlinks, categorical and binary data. The demand for clearly structured information presented to end users, but also the limitations of telecommunication networks as well as user interfaces, calls for a lower-dimensional representation providing the most relevant information. Promising candidates for this task of dimensionality reduction are latent variable models. Latent variable analysis is a family of data modelling approaches that factorizes high-dimensional observations with a reduced set of latent variables. The latent variables offer explanations of the dependencies between observed variables. An example is the probabilistic variant of the widely used principal component analysis, PPCA, where observations are explained by a linear projection of a set of Gaussian hidden variables, plus additive observation noise [10]. Standard PCA is widely used for data reduction, pattern recognition and exploratory
data analysis. Recent studies on PCA reveal its connections to statistical factor analysis (FA) [7]. While existing PCA or FA approaches rely on continuous-valued observations, data analysis on mixed types of data (discrete and continuous observations) is often desirable:
– In solving typical data mining problems, one is always faced with mixed data. For example, a hospital patient's record typically includes fields like age (discrete real-valued), gender (binary), various examination results (real-valued or categorical), binary indicator variables for the presence of symptoms, or even textual descriptions. A unified means to explore the dependencies of these data is needed.
– If applied to dimensionality reduction for pattern recognition, PCA is purely unsupervised. Thus, the resulting projection may not be indicative of the targeted pattern distribution. A generalized PCA which allows class memberships as additional attributes (binary or categorical) may obviously provide a better solution.
– For heterogeneous data, it is often difficult to derive a small set of common features describing the total data. For example, in a web-based image retrieval system, each image can be characterized by its visual features, accompanying words, categories, and user visit records.
For these reasons, we will present a probabilistic latent variable model to fit observations with both continuous and binary attributes in this paper. Since categorical attributes can always be encoded by sets of binary attributes¹ (e.g. a 1-of-c coding scheme), this model can be applied to a wide range of situations. We call this model generalized probabilistic PCA, GPPCA. In the next section we describe the latent variable model and derive an efficient variational expectation-maximization (EM) formalism to learn the model from data. In Sec. 3 we discuss properties of the model and connections to previous work. In Sec. 5 we present empirical results based on toy data and image data, with a focus on both data visualization and information filtering.
2 A Generalized Probabilistic PCA Model
The goal of a latent variable model is to find a representation for the distribution p(t) of observed data in an M-dimensional space t = (t_1, ..., t_M) in terms of a number of L latent variables x = (x_1, ..., x_L). In our setting of interest, we consider a total of M continuous and binary attributes. We use m ∈ R to indicate that the variable t_m is continuous-valued, and m ∈ B for binary variables (i.e. {0, 1}). The generative model is:
¹ To be precise, an additional constraint is required here, which we drop for simplicity.
\[ \mathbf{x} \sim \mathcal{N}(\mathbf{0},\mathbf{I}) \tag{1} \]
\[ \mathbf{y}|\mathbf{x} = \mathbf{W}^T\mathbf{x} + \mathbf{b} \tag{2} \]
\[ t_m|y_m \sim \mathcal{N}(y_m,\sigma^2), \quad m \in R \tag{3} \]
\[ t_m|y_m \sim \mathrm{Be}\big(g(y_m)\big), \quad m \in B \tag{4} \]
By Be(p) we denote a Bernoulli distribution with parameter p (the probability of giving a 1). W is an L × M matrix with column vectors (w_1, ..., w_M), b an M-dimensional column vector, and g(a) the sigmoid function g(a) = 1/(1 + exp(−a)). We assume that observed vectors t are generated from a prior Gaussian distribution with zero mean² and unit covariance. Note that we assume a common noise variance σ² for all continuous variables. To match this assumption, we sometimes need to use scaling or whitening as a pre-processing step for the continuous data in our experiments. The likelihood³ of an observation vector t given the latent variables x and model parameters θ is
\[ p(\mathbf{t}|\mathbf{x},\theta) = p(\mathbf{t}_R|\mathbf{x},\theta)\, p(\mathbf{t}_B|\mathbf{x},\theta) = \prod_{m \in R} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{(y_m - t_m)^2}{2\sigma^2}\right\} \prod_{m \in B} g\big((2t_m - 1)\,y_m\big) \tag{5} \]
where y_m = w_m^T x + b_m. The distribution in t-space, for a given value of θ, is then obtained by integration over the latent variables x:
\[ p(\mathbf{t}|\theta) = \int p(\mathbf{t}|\mathbf{x},\theta)\, p(\mathbf{x})\, d\mathbf{x} \tag{6} \]
For a given set of N observation vectors, the log likelihood of the data D is
\[ \mathcal{L}(\theta) = \log p(D|\theta) = \sum_{n=1}^{N} \log p(\mathbf{t}_n|\theta) \tag{7} \]
We estimate the model parameters θ = {W, b, σ²} using a maximum likelihood approach, which can be achieved by the expectation-maximization (EM) algorithm. However, given parameters θ estimated from the previous M-step, the integral Eq. (6) in the E-step cannot be solved analytically. We thus have to resort to an approximate solution. Previous work on mixed latent variable models has concentrated, for example, on approximating the (equivalent of the) integral Eq. (6) by Monte Carlo sampling [6] or by Gauss-Hermite numerical integration [8]. These approaches demonstrate good performance in many cases, but introduce a rather high computational cost. In the next section, we will present a variational approximation to solve this problem.
² A non-zero mean and non-identity covariance matrix can be moved to parameters W and b without loss of generality.
³ A full Bayesian treatment would require prior distributions for the parameters θ. We do not go for a full Bayesian solution here, thus implicitly assuming a non-informative prior.
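As a concrete illustration of Eqs. (1)-(4), the following sketch draws a single observation t from the generative model. It is our own code, not the authors'; it assumes the Eigen linear-algebra library, and the dimensions, attribute split and all names are illustrative.

```cpp
#include <Eigen/Dense>
#include <cmath>
#include <random>
#include <vector>

// Sketch: sample one observation t from the GPPCA generative model, Eqs. (1)-(4).
// W (L x M), b (M), sigma and the continuous/binary split are illustrative inputs.
Eigen::VectorXd sampleObservation(const Eigen::MatrixXd& W, const Eigen::VectorXd& b,
                                  double sigma, const std::vector<bool>& isBinary,
                                  std::mt19937& rng) {
    const int L = static_cast<int>(W.rows());
    const int M = static_cast<int>(W.cols());
    std::normal_distribution<double> stdNormal(0.0, 1.0);
    std::uniform_real_distribution<double> unif(0.0, 1.0);

    Eigen::VectorXd x(L);                       // x ~ N(0, I), Eq. (1)
    for (int l = 0; l < L; ++l) x(l) = stdNormal(rng);

    Eigen::VectorXd y = W.transpose() * x + b;  // y = W^T x + b, Eq. (2)

    Eigen::VectorXd t(M);
    for (int m = 0; m < M; ++m) {
        if (isBinary[m]) {                      // t_m ~ Be(g(y_m)), Eq. (4)
            double p = 1.0 / (1.0 + std::exp(-y(m)));
            t(m) = (unif(rng) < p) ? 1.0 : 0.0;
        } else {                                // t_m ~ N(y_m, sigma^2), Eq. (3)
            t(m) = y(m) + sigma * stdNormal(rng);
        }
    }
    return t;
}
```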
2.1 A Variational EM Algorithm for Model Fitting
In order to select the parameters θ that maximize Eq. (7), we employ a variational EM algorithm. A variational EM algorithm constructs a lower bound (the variational approximation) for the likelihood of observations, Eq. (7), by first introducing additional variational parameters ψ. Then, it iteratively maximizes the lower bound with respect to the variational parameters (at the E-step) and the parameters θ of interest (at the M-step). This idea has been applied by Tipping [9] to a hidden-variable model for binary data only. A variational approximation for the likelihood contributions of binary variables t_m, m ∈ B, in Eq. (4) is given by
\[ p(t_m|\mathbf{x},\theta) \;\geq\; \tilde{p}(t_m|\mathbf{x},\theta,\psi_m) = g(\psi_m)\, \exp\left\{ (A_m - \psi_m)/2 + \lambda(\psi_m)\,(A_m^2 - \psi_m^2) \right\} \tag{8} \]
where A_m = (2t_m − 1)(w_m^T x + b_m) and λ(ψ_m) = [0.5 − g(ψ_m)]/(2ψ_m). For a fixed value of x, we get the perfect approximation, where the lower bound is maximized to be p(t_m|x, θ), by setting ψ_m = A_m.⁴ The variational approximation for the log likelihood Eq. (7) of the data D becomes
\[ \mathcal{L}(\theta) \;\geq\; \mathcal{F}(\theta,\Psi) = \log \prod_{n=1}^{N} \int p(\mathbf{t}_n|\mathbf{x},\theta,\psi_n)\, p(\mathbf{x})\, d\mathbf{x} \tag{9} \]
where
\[ p(\mathbf{t}_n|\mathbf{x},\theta,\psi_n) = \prod_{m \in R} p(t_{mn}|\mathbf{x},\mathbf{w}_m,\sigma) \prod_{m \in B} \tilde{p}(t_{mn}|\mathbf{x},\mathbf{w}_m,\psi_{mn}) \tag{10} \]
We denote the total set of N × |B| variational parameters by Ψ. Since the variational approximation depends on x only quadratically in the exponent and the prior p(x) is Gaussian, the integrals to obtain the approximation F(θ, Ψ) can be solved in closed form. The variational EM algorithm starts with an initial guess of θ and then iteratively maximizes F(θ, Ψ) with respect to Ψ (E-step) and θ (M-step), respectively, holding the other fixed. Each iteration increases the lower bound, but will not necessarily maximize the true log likelihood L(θ). However, since the E-step results in a very close approximation of L(θ), we expect that, at the M-step, the true log likelihood is increased.
⁴ However, in the case of x distributed over a Gaussian prior N(0, I), maximization of the corresponding lower bound with respect to ψ_m is not straightforward.
⁵ Based on Bayes' rule, the posterior approximation is derived by normalizing p(t_n | x_n, θ^k, ψ_n^old) p(x_n) and thus is a proper density, no longer a lower bound.
Details are given in the following:
(i) E-step: Ψ^{k+1} ← arg max_Ψ F(θ^k, Ψ). The optimization can be achieved by a normal EM approach. Given ψ_n^old updated from the previous step, the algorithm iteratively estimates the sufficient statistics for the posterior approximation⁵ p̃(x_n | t_n, θ^k, ψ_n^old), which is again a Gaussian with covariance and mean given by
\[ \mathbf{C}_n = \left( \frac{1}{\sigma^2}\sum_{m \in R} \mathbf{w}_m\mathbf{w}_m^T + \mathbf{I} - 2\sum_{m \in B} \lambda(\psi_{mn}^{old})\, \mathbf{w}_m\mathbf{w}_m^T \right)^{-1} \tag{11} \]
\[ \boldsymbol{\mu}_n = \mathbf{C}_n \left( \sum_{m \in R} \frac{1}{\sigma^2}(t_{mn} - b_m)\, \mathbf{w}_m + \sum_{m \in B} \left[ \frac{2t_{mn} - 1}{2} + 2\, b_m\, \lambda(\psi_{mn}^{old}) \right] \mathbf{w}_m \right) \tag{12} \]
and then updates ψ_n by maximizing E_n[log p̃(t_n, x_n | θ^k, ψ_n)], where the expectation is with respect to p̃(x_n | t_n, θ^k, ψ_n^old). Taking the derivative of E_n[log p̃(t_n, x_n | θ^k, ψ_n)] with respect to ψ_n and setting it to zero leads to the updates
\[ \psi_{mn}^2 = E_n\!\left[(\mathbf{w}_m^T\mathbf{x}_n + b_m)^2\right] = \mathbf{w}_m^T E_n(\mathbf{x}_n\mathbf{x}_n^T)\,\mathbf{w}_m + 2 b_m\,\mathbf{w}_m^T E_n(\mathbf{x}_n) + b_m^2 \tag{13} \]
where E_n(x_n x_n^T) = C_n + μ_n μ_n^T and E_n(x_n) = μ_n. This two-stage optimization updates ψ and monotonically increases F(θ^k, Ψ). The experiments showed that this procedure converges rapidly, most often in only two steps.
(ii) M-step: θ^{k+1} ← arg max_θ F(θ, Ψ^{k+1}). Similar to the E-step, this can also be achieved by iteratively first estimating the sufficient statistics of p̃(x_n | t_n, θ^old, ψ_n^{k+1}) via Eq. (11) and Eq. (12), and then maximizing Σ_{n=1}^N E_n[log p̃(t_n, x_n | θ, ψ_n^{k+1})] with respect to θ, where E_n(·) denotes the expectation over p̃(x_n | t_n, θ^old, ψ_n^{k+1}). For m ∈ R, we derive the following updates
\[ \mathbf{w}_m^T = \left[\sum_{n=1}^{N}(t_{mn} - b_m)\, E_n(\mathbf{x}_n)^T\right]\left[\sum_{n=1}^{N} E_n(\mathbf{x}_n\mathbf{x}_n^T)\right]^{-1} \tag{14} \]
\[ \sigma^2 = \frac{1}{N|R|}\sum_{n=1}^{N}\sum_{m \in R}\left[\mathbf{w}_m^T E_n(\mathbf{x}_n\mathbf{x}_n^T)\,\mathbf{w}_m + 2(b_m - t_{mn})\,\mathbf{w}_m^T E_n(\mathbf{x}_n) + (b_m - t_{mn})^2\right] \tag{15} \]
where b_m, m ∈ R, is directly estimated by the mean of t_{mn}. For m ∈ B, we have the following updates
\[ (\mathbf{w}_m^T, b_m)^T = \left[-\sum_{n=1}^{N} 2\lambda(\psi_{mn})\, E_n(\tilde{\mathbf{x}}_n\tilde{\mathbf{x}}_n^T)\right]^{-1} \sum_{n=1}^{N} (t_{mn} - 0.5)\, E_n(\tilde{\mathbf{x}}_n) \tag{16} \]
where x̃_n = (x_n^T, 1)^T.
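To make the E-step concrete, the sketch below computes the per-observation statistics of Eqs. (11) and (12) together with the λ(ψ) function defined after Eq. (8). It is our own illustration (again assuming Eigen), not the authors' implementation; all names are invented. The ψ update of Eq. (13) and the M-step updates of Eqs. (14)-(16) then only need the moments E_n(x_n) = μ_n and E_n(x_n x_n^T) = C_n + μ_n μ_n^T produced here.

```cpp
#include <Eigen/Dense>
#include <cmath>
#include <vector>

// lambda(psi) = [0.5 - g(psi)] / (2 psi), as defined after Eq. (8).
double lambdaPsi(double psi) {
    double g = 1.0 / (1.0 + std::exp(-psi));
    return (0.5 - g) / (2.0 * psi);
}

// Sketch of the E-step statistics of Eqs. (11)-(12) for one observation t_n.
// W is L x M, b is M, psiOld holds the current variational parameters for the
// binary attributes, and isBinary marks which of the M attributes are binary.
void eStepStatistics(const Eigen::MatrixXd& W, const Eigen::VectorXd& b, double sigma2,
                     const Eigen::VectorXd& t_n, const std::vector<bool>& isBinary,
                     const Eigen::VectorXd& psiOld,
                     Eigen::MatrixXd& C_n, Eigen::VectorXd& mu_n) {
    const int L = static_cast<int>(W.rows());
    const int M = static_cast<int>(W.cols());
    Eigen::MatrixXd prec = Eigen::MatrixXd::Identity(L, L);   // the "+ I" term of Eq. (11)
    Eigen::VectorXd rhs  = Eigen::VectorXd::Zero(L);

    for (int m = 0; m < M; ++m) {
        Eigen::VectorXd w_m = W.col(m);
        if (!isBinary[m]) {                                    // continuous terms, m in R
            prec += (w_m * w_m.transpose()) / sigma2;
            rhs  += (t_n(m) - b(m)) / sigma2 * w_m;
        } else {                                               // binary terms, m in B
            double lam = lambdaPsi(psiOld(m));
            prec -= 2.0 * lam * (w_m * w_m.transpose());
            rhs  += ((2.0 * t_n(m) - 1.0) / 2.0 + 2.0 * b(m) * lam) * w_m;
        }
    }
    C_n  = prec.inverse();                                     // Eq. (11)
    mu_n = C_n * rhs;                                          // Eq. (12)
}
```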
2.2 Inference
Finally, given the trained generative model, we can infer the a posteriori distribution of the hidden variables for a complete observation vector t by using Bayes' rule:
\[ p(\mathbf{x}|\mathbf{t},\theta) = \frac{p(\mathbf{t}|\mathbf{x},\theta)\, p(\mathbf{x})}{\int p(\mathbf{t}|\mathbf{x},\theta)\, p(\mathbf{x})\, d\mathbf{x}} \tag{17} \]
However, since the integral is again infeasible, we need to derive a variational approximation by normalizing p̃(t|x, θ, ψ) p(x), where ψ is obtained by maximizing the lower bound p̃(t|θ, ψ). For a vector t̂ of partial observations, we can still infer the posterior distribution in a similar way. If only continuous variables are observed, a normal posterior calculation can be employed, without the need for a variational approximation. This solution is the same as calculating the posterior based on the standard probabilistic PCA model [10].
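For the continuous-only case just mentioned, the posterior is available in closed form. One way to state it explicitly - our rendering, obtained from Eqs. (11) and (12) with B empty, where W_R and b_R collect the columns and entries belonging to m ∈ R - is:

\[ p(\mathbf{x}\,|\,\mathbf{t}_R,\theta) = \mathcal{N}(\boldsymbol{\mu},\mathbf{C}), \qquad \mathbf{C} = \Big(\mathbf{I} + \tfrac{1}{\sigma^2}\,\mathbf{W}_R\mathbf{W}_R^T\Big)^{-1}, \qquad \boldsymbol{\mu} = \tfrac{1}{\sigma^2}\,\mathbf{C}\,\mathbf{W}_R\,(\mathbf{t}_R - \mathbf{b}_R) \]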
3
Properties of Generalized Probabilistic PCA
The rows of the ML estimator W that relates latent to observed variables span the principal subspace of the data. The GPPCA model allows a unified probabilistic modelling of continuous, binary, and categorical observations, which can bring great benefits in real-world data analysis. It can also serve as a visualization tool for high-dimensional mixed data in a two-dimensional latent variable space; existing models currently only visualize either continuous [1] or binary data [9]. Moreover, since GPPCA, like PPCA [10], specifies a full generative model, it can handle missing observations in a principled way. For pattern recognition tasks, GPPCA can provide a principled data transformation for general learning algorithms (which most often rely on continuous inputs) to handle data with mixed types of attributes. One such example, in the context of a painting image recommender system incorporating visual features, artists, and user ratings, will be shown in Sec. 5. GPPCA can also provide a principled approach to supervised dimensionality reduction, by treating the target values as additional observed variables. GPPCA explores the dependence between inputs and targets via the hidden variables and maximizes the joint likelihood of both. It effectively discovers a subspace of the joint space in which the projections of the inputs have small projection loss and also show clear class distributions. A large number of methods have been developed to handle supervised dimensionality reduction (see [4]), such as partial least squares and discriminant analysis; however, most of them cannot handle missing data.
4
Relation to Previous Work
Jaakkola & Jordan [5] proposed a variational likelihood approximation for Bayesian logistic regression, and briefly pointed out that the same approximation can be applied to learn the “dual problem”, i.e. a hidden-variable model for binary observations. Tipping [9] derived the detailed variational EM formalism to learn the model and used it to visualize high-dimensional binary data. Collins et al. [3] generalized PCA to various loss functions from the exponential family, in which the case of Bernoulli variables is similar to Tipping’s model. Latent variable models for mixed observation variables were also studied by [6] and [8]. In contrast to our variational approach, [6] and [8] used numerical integration
methods to handle the otherwise intractable integral in the EM algorithm. Latent variable models for mixed data were already mentioned by Bishop [1] and Tipping [9], yet never explicitly implemented. Recently, Cohn [2] proposed informed projections, a version of supervised PCA that minimizes both projection loss and inner-class dissimilarities. However, this requires tuning a parameter β to weight the two parts of the loss function.
Fig. 1. A toy problem: PCA, GPPCA and GPPCA-W solutions
5 Empirical Study

5.1 A Toy Problem
We first illustrate GPPCA on a simple problem, where 100 two-dimensional samples are generated from two Gaussian distributions with means [−1, 1] and [1, −1], respectively, and equal covariance matrices. A third, binary variable is added that indicates which Gaussian each sample belongs to. We perform GPPCA, as described in Sec. 2, and standard PCA on the data to identify the principal subspace. The results are illustrated in Fig. 1. As expected, the PCA solution is along the direction of largest variance. The GPPCA solution, on the other hand, also takes the class labels into account and finds a solution that conveys more information about the observations. In an additional experiment, we pre-process the continuous variables with whitening and then perform GPPCA; we will refer to this as GPPCA-W in the following. With GPPCA-W, the solution indicates the class distribution even more clearly. Clearly, a change of the subspace in W corresponding to the whitened continuous variables no longer changes the likelihood contribution. Thus, the GPPCA EM algorithm focuses on the likelihood of the binary observations only and thus leads to a result with a clear class distribution.
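A minimal sketch of this toy setup is given below; the cluster covariance and the random seed are illustrative choices of ours (the paper only states that the two covariance matrices are equal), and fitting GPPCA itself is not shown.

import numpy as np

rng = np.random.default_rng(0)

# Two Gaussian clusters with means [-1, 1] and [1, -1] and a shared covariance.
n_per = 50
cov = np.array([[1.0, 0.6], [0.6, 1.0]])   # illustrative covariance, not from the paper
x0 = rng.multivariate_normal([-1.0, 1.0], cov, n_per)
x1 = rng.multivariate_normal([1.0, -1.0], cov, n_per)
X = np.vstack([x0, x1])                                   # continuous block (100 x 2)
y = np.concatenate([np.zeros(n_per), np.ones(n_per)])     # binary class indicator

# Whitening the continuous block gives the GPPCA-W variant.
L = np.linalg.cholesky(np.cov(X.T))
Xw = (X - X.mean(0)) @ np.linalg.inv(L).T

T = np.hstack([Xw, y[:, None]])             # mixed data matrix fed to GPPCA / PCA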
Fig. 2. Visualization of painting images: (a) PCA solution, (b) GPPCA solution, (c) GPPCA-W solution
5.2
Visualization of Painting Images
Next, we show an application of GPPCA to visualizing image data. We consider a data set of 642 painting images from 47 artists. An often encountered problem in research on image retrieval is that low-level visual features (like color, texture, and edges) can hardly capture high-level information about images, like concept, style, etc. GPPCA allows us to characterize images by more information than just those low-level features. In the experiment, we examine whether it is possible to visualize different styles of painting images in a 2-dimensional space by incorporating the information about artists. As the continuous data describing the images, we extract 275 low-level features (correlogram, wavelet texture, and color moments) for each image. We encode the artists in 47 binary attributes via a 1-of-c scheme, and obtain a 322-dimensional vector with mixed data for each image. The result of projecting this data to a 2-dimensional latent space is shown in Fig. 2, where we limit the shown data to the images of 3 particular artists. The solution given by normal PCA does not allow a clear separation of artists. In contrast, the GPPCA solution, in particular when an additional whitening pre-processing is applied to the continuous features, shows a very clear separation of artists. Note furthermore that the distinction between Van Gogh and Monet is a bit fuzzy here; these artists do indeed share similarities in their style, in particular brush stroke, which is reflected by the texture features.
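The following sketch illustrates how such a mixed-type data matrix can be assembled; the array names and the feature extraction step are placeholders of ours, and only the 1-of-c encoding and the 275 + 47 = 322 concatenation follow the description above.

import numpy as np

def one_of_c(labels, n_classes):
    """1-of-c (one-hot) encoding of integer labels into binary attributes."""
    enc = np.zeros((len(labels), n_classes), dtype=int)
    enc[np.arange(len(labels)), labels] = 1
    return enc

def build_mixed_data(visual, artist_id, n_artists=47):
    """visual:    (642, 275) low-level features (correlogram, texture, color moments)
       artist_id: (642,) integer artist index in [0, 47)
       returns a (642, 322) mixed-type (continuous + binary) data matrix."""
    artists = one_of_c(artist_id, n_artists)
    return np.hstack([visual, artists])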
5.3 Recommendation of Painting Images
Due to the deficiency of low-level visual features, building recommender systems for painting images is a challenging task. Here we demonstrate that GPPCA offers a principled way of deriving compact and highly informative features, so that the accuracy of recommender systems based on the new image features can be significantly improved.
Fig. 3. Precision on painting image recommendation, based on different features (panels (a) and (b))
We use the same set of 642 painting images as in the previous section. 190 users' ratings (like, dislike, or not rated) were collected through an online survey (http://honolulu.dbs.informatik.uni-muenchen.de:8080/paintings/index.jsp). For each image, we combine the visual features (275-dim.), the artist (47-dim.), and a set of M advisory users' ratings on it (M-dim.) to form a (322+M)-dimensional feature vector. This feature vector contains continuous, binary, and missing data (because on average each user only rated 89 images). We apply GPPCA to map the features to a reduced 50-dimensional feature space. The remaining 190−M users are then treated as test users. For each test user, we hide some of his/her ratings and assume that only 5, 10, 20, or 50 ratings are observed. We skip a particular case if a user has not given that many ratings. Then we use the rated examples, in the form of input (image features)–output (ratings) pairs, to train an RBF-SVM model to predict the user's ratings on unseen images and produce a ranking. The performance of recommendation is evaluated by the top-20 precision, which is the fraction of actually liked images among the top-20 recommendations. We equally divide the 190 users into 5 groups, pick one group as the group of test users, and treat the other 152 users as advisory users. For each tested case, we randomize 10 times and calculate the mean and error bars. The results are shown in Fig. 3. Fig. 3(a) shows that GPPCA improves the precision in all cases by effectively incorporating richer information. This is not surprising, since the information about artists is a good indicator of painting styles. Advisory users' opinions on a painting actually reflect some high-level properties of the painting from a different individual's perspective. GPPCA here provides a principled way to represent the different information sources in a unified form of continuous data, and allows accurate recommendations based on the reduced data. Interestingly, as shown in Fig. 3(a), a recommender system working with a direct combination of
the three aspects of information shows a much lower precision than one based on the compact form of the features. This indicates that GPPCA effectively detects the 'signal subspace' of the high-dimensional mixed data, while eliminating irrelevant information. Note that over 80 percent of the user ratings are missing; GPPCA also provides an effective means to handle this problem. Fig. 3(b) shows that GPPCA incorporating visual features and artist information significantly outperforms a recommender system that only works on artist information. This indicates that GPPCA working on the pre-whitened continuous data does not remove the influence of the visual features.
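As a small illustration of the evaluation measure, the snippet below computes the top-20 precision for one test user; the ranking model itself (the RBF-SVM) and the data handling are omitted, and the names are ours.

import numpy as np

def top_k_precision(scores, liked, k=20):
    """Fraction of actually liked images among the top-k ranked recommendations.

    scores: predicted preference score per candidate image,
    liked:  boolean array, True where the test user actually liked the image."""
    top = np.argsort(scores)[::-1][:k]
    return float(np.mean(liked[top]))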
6
Conclusion
This paper describes generalized probabilistic PCA (GPPCA), a latent-variable model for mixed types of data, with continuous and binary observations. By adopting a variational approximation, an EM algorithm can be formulated that allows an efficient learning of the model parameters from data. The model generalizes probabilistic PCA and opens new perspectives for multivariate data analysis and machine learning tasks. We demonstrated the advantages of the proposed GPPCA model on toy data and on data from painting images. GPPCA allows an effective visualization of data in a two-dimensional hidden space that takes into account both information from low-level image features and artist information. Our experiments on an image retrieval task show that the model provides a principled solution to incorporating different information sources, thus significantly improving the achievable precision. Currently the described model reveals the linear principal subspace of mixed high-dimensional data. It might be interesting to pursue non-linear hidden-variable models to handle mixed types of data. This approach and its possible extensions may provide the basis for compactifying, even adaptively, the data representations in future pervasive computing environments, thus increasing their performance and acceptance. Acknowledgments. We would like to thank Dr. Rudolf Sollacher and Anton Schwaighofer for their very constructive comments on this work.
References
[1] Bishop, C. M., Svensen, M., and Williams, C. K. GTM: The generative topographic mapping. Neural Computation, 10(1):215–234, 1998.
[2] Cohn, D. Informed projections. In S. Becker, S. Thrun, and K. Obermayer, eds., Advances in Neural Information Processing Systems, 15. MIT Press, 2003.
[3] Collins, M., Dasgupta, S., and Schapire, R. A generalization of principal component analysis to the exponential family. In T. K. Leen, T. G. Dietterich, and V. Tresp, eds., Advances in Neural Information Processing Systems, 13. MIT Press, 2001.
[4] Hastie, T., Tibshirani, R., and Friedman, J. The Elements of Statistical Learning. Springer Verlag, 2001.
[5] Jaakkola, T. and Jordan, M. Bayesian parameter estimation via variational methods. Statistics and Computing, pp. 25–37, 2000.
[6] Moustaki, I. A latent trait and a latent class model for mixed observed variables. British Journal of Mathematical and Statistical Psychology, 49:313–334, 1996.
[7] Roweis, S. and Ghahramani, Z. A unifying review of linear Gaussian models. Neural Computation, 11:305–345, 1999.
[8] Sammel, M. D., Ryan, L. M., and Legler, J. M. Latent variable models for mixed discrete and continuous outcomes. Journal of the Royal Statistical Society, Series B 59:667–678, 1997.
[9] Tipping, M. E. Probabilistic visualization of high-dimensional binary data. In M. S. Kearns, S. A. Solla, and D. A. Cohn, eds., Advances in Neural Information Processing Systems, 11, pp. 592–598. MIT Press, 1999.
[10] Tipping, M. E. and Bishop, C. M. Probabilistic principal component analysis. Journal of the Royal Statistical Society, B(61):611–622, 1999.
Self-Stabilizing Microprocessor: Analyzing and Overcoming Soft-Errors (Extended Abstract) Shlomi Dolev and Yinnon A. Haviv Department of Computer Science, Ben-Gurion University of the Negev, Israel. {dolev, haviv}@cs.bgu.ac.il
Abstract. Soft-errors are changes in memory values caused by cosmic rays. Decreasing feature sizes, decreasing power usage, and shortening micro-cycle periods all enhance the influence of soft-errors. Self-stabilizing systems are designed to be started in an arbitrary, possibly corrupted state, due to, say, soft errors, and to converge to a desired behavior. Self-stabilization is defined over the state space of the components and is essentially a well-founded, clearly defined form of the terms self-healing, automatic-recovery, automatic-repair, and autonomic-computing. To implement a self-stabilizing system one needs to ensure that the microprocessor that executes the program is self-stabilizing. A self-stabilizing microprocessor copes with any combination of soft errors, converging to perform fetch-decode-execute in fault-free periods. Still, it is important that the microprocessor avoid convergence periods as much as possible, by masking the effect of soft errors immediately. In this work we present design schemes for a self-stabilizing microprocessor, and a new technique for analyzing the effect of soft errors. Previous schemes for analyzing the effect of soft errors were based on simulations. In contrast, our scheme computes a lower bound on the microprocessor's reliability and enables the microprocessor designer to evaluate the reliability of the design and to identify reliability bottlenecks.
1
Introduction
The interest in robust systems has increased dramatically during the last years. New terms and research directions such as automatic-recovery, self-healing, self-repairing, and self-stabilization [1,13,3] are extensively explored by academia and industry. This is no coincidence: the design of critical systems, such as aircraft control computers, should be verified to be self-stabilizing. Otherwise, once the assumptions concerning the type and amount of failures are violated, the system may enter a state from which it will never recover. Time and space redundancy are usually used to cope with fault-prone environments. Error detection and correction (codes) are used to mask the faults in a
Partially supported by NSF Award CCR-0098305, IBM faculty award, STRIMM consortium, and Israel ministry of defense.
way that ensures the input-output relation requirements with high probability. Still, soft-errors and other unpredictable faults may transfer the system to undesired states in which faults are not masked. Self-stabilization ensures that the system will eventually recover from such faults. Following a finite stabilization period the system will exhibit the desired relation between the input and the corresponding output streams. Fault tolerance using space and/or time redundancy has been extensively studied. These approaches differ from the one we propose since they do not cope with recovery from an arbitrary state. Our approach complements the previous approaches: design your system to be started in an arbitrary state, so that even if the assumptions concerning its operation do not hold for a while, the system will recover. The basic assumption that the self-stabilizing algorithm designer uses is that the program that implements the algorithm is executed by the processor(s). This assumption should be examined. In fact, it is clear that current designs of microprocessors do not have the automatic recovery property, and hence the self-stabilizing program that the microprocessor should execute will not be executed, and obviously the system will not stabilize. Next we elaborate on the different aspects that have to be addressed when designing a microprocessor that copes with soft errors or transient faults: (1) using a self-stabilizing design that ensures recovery following the occurrence of any combination of soft errors, namely recovery from an arbitrary state; (2) reducing the probability that soft-errors influence the computation, analyzing the fault masking capabilities in order to achieve fault masking with high probability; and, by doing so, (3) eliminating as much as possible the (fault non-masking) convergence periods of the self-stabilizing microprocessor, which may cause higher-level (self-stabilizing) algorithms to start (fault non-masking) convergence periods of their own.
Soft Errors: Soft errors, also called single event upsets (seus), are voltage changes caused by cosmic rays (or other disturbances); they can change the output value of a logical gate in the digital circuit. For our discussion we can model the behavior of a soft error as a single-bit value change, from zero to one or vice versa; the assumption of any value change is different from the one-way change assumed in [8]. The probability of a soft error increases when the feature size decreases, the voltage decreases, and the micro-cycle time is shortened. Current technology considers soft errors in memory circuits by adding redundancy in the form of error correcting codes. The effect of soft errors in the logic circuit (alu, cpu, etc.) is not addressed, claiming that currently the probability of such a scenario is low (size-wise, memory devices contain many more targets for soft errors than logic circuits). Recent studies have shown that within less than a decade the probability of soft errors influencing the logic circuit will increase to the current probability of influencing memory. In addition, we note that the current typical number of internal registers of a cpu is large enough to require soft-error considerations. Some tasks cannot allow even a small error probability. There are extensive efforts for achieving robust designs that can cope with soft errors. In particular,
in ongoing systems that never stop operating, such as satellite systems, the eventual occurrence of such a fault is almost certain. Two bold examples of a robust system design are the IBM S-390 and the Compaq NonStop Himalaya server. Both architectures use a combination of space and information redundancy to try to mask the effect of soft-errors [17]. One approach is to use a simultaneous multithreading processor that runs two copies of the same program in order to increase the resiliency to soft-errors. This technique is called simultaneous and redundant threading [17,16]. The space and time redundancy solutions, using error correction or recomputation, reduce the probability that the resulting computation will be corrupted but do not, and in fact may not, give a guarantee for masking the effect of soft-errors.
Self-Stabilization: Self-stabilization is an elegant approach for designing fault tolerant computing devices [1]. The idea is to explore the state space of the device, simply by considering any possible value for the bits in the memory, and to prove that from every such state the system eventually converges to the desired behavior. In other words, given a specification, we say that an algorithm is a self-stabilizing algorithm for the given specification if: there exists a set of configurations, called safe configurations, such that being in one of them ensures executions according to the specification, and this set of configurations is closed, i.e., being in a safe configuration ensures a transition to a safe configuration. Starting from any configuration, a self-stabilizing algorithm must reach a safe configuration in finite time. Self-stabilizing algorithms can be combined in a way that the output of the first stabilizing algorithm serves the second stabilizing algorithm as an input. For example, a self-stabilizing algorithm may assume correct behavior of the microprocessor and will stabilize after the (self-stabilizing) microprocessor has converged to its desired behavior. Originally, self-stabilizing algorithms were designed to cope with transient faults, such as soft errors. The assumptions made by the designers of self-stabilizing algorithms are that the algorithm itself is not corrupted (one can assume that the algorithm is written in read only memory) and that it is executed. We examine the assumption that the processor continues to execute the algorithm at any given time. When considering a micro-code controlled processor, we can imagine the case of this processor getting into an infinite loop in a subset of its micro-code, due to soft errors, and never reaching the micro-code part that involves fetching a new command and executing it.
Analyzing Soft-Errors Influence: Von Neumann [11], and later Pippenger [14], suggested a model for noisy computation of circuits. In this model a gate may fail with a specific probability. This model assumes that a gate computes its output only once during the function computation, even if there are two paths of different length from the gate to one of the outputs of the circuit. This model does not fit the properties of single event upsets, since seus are temporary faults which cause the gate to produce an incorrect output for a short period of time. The probability that a seu hitting a gate affects the computation is closely related to its computation crucial time, as we will define and compute.
We propose an algorithm for analyzing the masking probability of a circuit that implements a boolean function. The algorithm results in a lower bound on the probability that the circuit computes correctly in the presence of soft errors. To the best of our knowledge, up to now only simulations were used for analyzing the soft error resiliency of circuits [6,12,10,9]. Our approach, as opposed to simulations, gives the designer knowledge of which gates in the circuit are the problematic ones. The rest of the paper is organized as follows. Two methodologies for designing self-stabilizing microprocessors appear in Section 2; the first requires a detailed examination of the microprocessor circuit, while the second proposes an additional device to enforce stabilization. In Section 3 we present a technique for analyzing the masking probability of a given circuit, as a function of the soft-error probability. Concluding remarks appear in Section 4. Most of the proofs are omitted from this extended abstract.
2
Methodologies for Designing Self-Stabilizing Microprocessors
The configuration (or state) space of a microprocessor includes the value of every register of the microprocessor, including the micro program counter, and any other internal control variable of the microprocessor (e.g., pipeline control). An input-output configuration of a microprocessor at a particular (clock) pulse is the binary value of the microprocessor's pins when the pulse takes place. The input and output stream of a microprocessor is defined by a sequence of input-output configurations, such that every two successive configurations are related to two successive pulses, in the obvious way. We define a legal behavior of a microprocessor as every input-output stream that corresponds to an input-output stream that starts in the (manufacturer) predefined initial state and handles a sequence of machine-code commands (in the set of commands that are allowed by the manufacturer). We also include any suffix of such an input-output stream in the set of legal behaviors. We next observe that the contents of the non-control microprocessor registers (including the program counter of the machine code) are in fact part of the state of the (machine-code) program and therefore should be handled by designing the (machine-code) program to cope with every possible state, namely to be self-stabilizing. Thus, we only concern ourselves with the control variables of the microprocessor and require that they lead to the eventual execution of the commands of the stabilizing (machine-code) program. The eventual execution of the machine-code commands is in fact an execution of the fetch-decode-execute cycle. Thus, our legal behavior set includes executions that start with every possible value of the non-control variables of the microprocessor, and proceed according to the (manufacturer) definition of the machine-code command execution. A microprocessor is self-stabilizing iff every behavior that starts in any configuration reaches, in a finite number of pulses, a safe configuration after which its behavior is in the set of legal behaviors.
In other words, we require that a safe configuration is reached, and that the microprocessor then repeatedly executes fetch-decode-execute, where each machine-code command is executed according to the manufacturer specifications. The manufacturer manual defines the specification of each machine-code command. It is possible, as we will show in the sequel, that the microprocessor will be proven to stabilize for certain specifications of a machine code and not stabilize for other specifications of the machine code, for example, depending on whether the specification exposes the user to the internal registers used to implement the stack operations. In the rest of this section we consider the simplified case in which the criterion for a microprocessor to be self-stabilizing is a repeated execution of the fetch-decode-execute sequence, where the execution is according to the microprocessor's arbitrary internal state. A similar technique can be used for specific designs, monitoring and verifying that the execution of the commands is according to the manufacturer specifications.
2.1
Involved, State Diagram Method
Our first suggestion is to examine the designed microprocessor and verify whether it is self-stabilizing or not. To demonstrate the importance of the (internal) microprocessor control variables, we examine a micro-code controlled processor. We note that the same technique can be applied to any other microprocessor control method, considering the specific (internal) control variables used by the microprocessor. Our goal is to prove that there is no cycle of microinstruction executions that does not include a fetch-decode and proper execution of the machine command pointed to by the program counter. Usually this problem requires explicitly generating the transition graph of the microprocessor (which is too large to compute). Our scheme avoids this by using an abstraction of the transition graph. Given the micro-code of a processor, we convert it to a finite state representation, where a node in the representation is defined by the (control variables, e.g., the) micro-code program counter value. In fact, every node in our finite state machine represents the set of all possible refined microprocessor states that have the same micro-code program counter value, where a refined state includes specific values for all the registers, in particular the non-control registers. The edges between the nodes of the finite state representation represent transitions at the granularity of microinstruction execution. There is an edge between two nodes, i and j, if there is a transition (due to the occurrence of a clock pulse) from a refined microprocessor state represented by the node i to a refined microprocessor state represented by j.
Then we can check whether all cycles in the finite state representation include a fetch-decode-execute sequence. In such a case, we can conclude that the microprocessor is self-stabilizing. This test can be performed by executing a depth first search over the finite state representation. Otherwise, we consider every cycle that does not include the fetch-decode-execute sequence in detail; namely, we examine whether the edges of the cycle are due to a possible transition of the microprocessor, considering refined states. We use the above method to verify that the micro-code controlled processor Mic-1 presented in [18] (chapter 4) is a self-stabilizing microprocessor for certain machine code specifications and is not for others. We have succeeded in proving stabilization of the Mic-1 microprocessor in the case in which the definitions of the machine code commands that refer to the top-of-stack register TOS are allowed to return any value as long as it is not a value pushed to the stack following stabilization. On the other hand, we proved that Mic-1 is not stabilizing (it is only pseudo-stabilizing) when we require that the machine code commands be executed as if TOS is the value of the top-of-stack address in the memory.
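A sketch of this check is given below, under the assumption that the finite state representation is available as an adjacency structure over micro-program-counter values; it uses the observation that every cycle passes through fetch-decode iff the graph obtained by deleting the fetch-decode nodes is acyclic. The data-structure choices and names are ours, not taken from the paper.

def all_cycles_contain_fetch(succ, fetch_nodes):
    """succ[v]: set of possible successor micro-PC values of v in the abstract
    transition graph; fetch_nodes: micro-PC values of the fetch-decode part.
    Returns True iff every cycle of the abstraction contains a fetch-decode node."""
    nodes = set(succ) | {u for vs in succ.values() for u in vs}
    nodes -= set(fetch_nodes)
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {v: WHITE for v in nodes}

    def acyclic_from(v):
        color[v] = GRAY
        for u in succ.get(v, ()):
            if u not in color:            # successor belongs to fetch-decode: ignore
                continue
            if color[u] == GRAY:          # back edge: a cycle that avoids fetch-decode
                return False
            if color[u] == WHITE and not acyclic_from(u):
                return False
        color[v] = BLACK
        return True

    return all(color[v] != WHITE or acyclic_from(v) for v in list(color))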
2.2 Blackbox, Watchdog Method
One can ensure that a fetch-decode-execute sequence is eventually executed by using an upper bound on the number of clock pulses that may pass between every two successive executions of the fetch-decode-execute sequence. We assume that every processor repeatedly executes the fetch-decode-execute sequence when it is started in a predefined state (e.g., the initial state defined by the manufacturer). Thus, one can use a watchdog circuit that detects the situation in which the processor has not executed the fetch-decode-execute sequence for a period longer than the given upper bound. In such a case, the watchdog resets the microprocessor to the predefined (initial) state. Note that the (re)activation of the reset will occur only due to (additional) soft errors. The watchdog circuit itself may experience soft-errors. Fortunately, it is possible to ensure that the watchdog circuit is self-stabilizing. One can implement the watchdog as a counter that is decremented on every clock pulse, using the exact number of bits needed to count up to the upper bound on the number of pulses between two successive fetches. We assume that the watchdog counter can be initialized in any possible state (due to a soft error), causing in the worst case a premature reset of the microprocessor. A self-stabilizing microprocessor recovers following the occurrence of faults that drive its state to an arbitrary state. During its automatic recovery the processor converges to a legal behavior (for example, a behavior that can be achieved from its predefined initial configuration). Naturally, the influence of soft-errors is not masked immediately after an arbitrary state is reached and during the convergence period. Thus, (self-stabilizing) programs that the microprocessor executes may lose their consistency and will have to start a convergence period themselves. In [2] it was suggested to use Markov chains to compute the probability of being in a safe state when exits from safe states are possible. We enhance this approach by suggesting to measure the expected length of an execution period in which a system that is in a
safe configuration does not leave the set of safe configurations (in the presence of faults such as soft-errors). We use the term legal execution period for such an execution period. In particular, we would like a self-stabilizing algorithm that is executed by our self-stabilizing microprocessor to have the longest possible expected legal execution period, which in turn implies high fault masking capabilities. Next we analyze the masking probability of a circuit in order to enable the circuit designer to evaluate its soft-error masking probability, identify resiliency bottlenecks, and modify the design if needed.
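As an illustration of this measure, the sketch below computes the expected legal execution period from a Markov-chain model of configuration transitions, assuming the chain eventually leaves the safe set with positive probability; the matrix-based formulation and names are ours, not taken from [2].

import numpy as np

def expected_legal_period(P, safe):
    """Expected number of steps a system started in a safe configuration stays
    inside the safe set, under a Markov-chain model of configuration transitions.

    P:    (n, n) transition matrix, P[i, j] = probability of moving from
          configuration i to configuration j in one step (faults included).
    safe: boolean mask of length n marking the safe configurations.

    Solves E = 1 + P_SS E, i.e. (I - P_SS) E = 1, where P_SS is P restricted
    to transitions that stay inside the safe set."""
    P_SS = P[np.ix_(safe, safe)]
    n_s = P_SS.shape[0]
    E = np.linalg.solve(np.eye(n_s) - P_SS, np.ones(n_s))
    return E   # one expected period length per safe configuration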
3
Analyzing Soft-Error Masking Probability
The self-stabilizing property of the microprocessor is essential as a fall-back mechanism that ensures automatic recovery; still, it could (and should) be combined with techniques that mask faults with high probability. The combination of self-stabilization with fault masking capabilities ensures a high probability of tolerating faults with no output influence (masking faults) and automatic recovery following a (bursty) occurrence of faults that the processor cannot mask. Note that it is impossible to mask all combinations of transient errors, in particular the combination that changes the outputs of all the gates. Let f be a function with nin input bits and nout output bits. We assume that the circuit implementing f has nin input (one bit) latches and nout output (one bit) latches. We present a method for estimating the probability that the circuit that implements f causes the output latches to store (the vector) f(x) when (the vector) x is the current (fixed) value of the input latches. Given a circuit that implements f, we consider the computation dag of the circuit, Gf = (V, E), to be the directed acyclic graph defined as follows: V = Vinput ∪ Vlogic ∪ Voutput, where v ∈ Vinput represents an input latch, v ∈ Vlogic represents a gate in the circuit, and v ∈ Voutput represents an output latch. The edges of the computation dag are defined by E = {(u, v) | the output of u is wired into the input of v}. The input degree of every node v, InDegree(v), is bounded by the maximal fan-in of a gate, MaxDegree. Note that ∀v ∈ Vinput, InDegree(v) = 0, because v represents an input latch. Similarly, ∀v ∈ Voutput, OutDegree(v) = 0, since v represents an output latch. In addition, the number of gates in the circuit is n = |Vlogic|. We use delay to denote the propagation time of signals in gates, that is, the time it takes for a particular gate to compute and output the result that reflects the current inputs of the gate. Our time axis origin (time 0) will be the first time the output latches are opened for writing, and we mark by ε the duration during which the output latches should receive a stable (and correct) input in order to store a correct result. In such a case, we say that the output latches are enabled for writing during the time interval [0, ε]. Definition 1. We define the event the circuit computes correctly to be the event in which the output latches are disabled for writing when their contents are the values defined by applying f to the contents of the input latches. We use pf to denote the probability of such an event.
Our analysis is based on understanding the locations and times at which the circuit is vulnerable to seus. Given a gate or a latch u in the implementation of f, the time point t is considered an input crucial time for u iff at least one of the following holds: (1) u is an output latch and t ∈ [0, ε]; (2) there is a gate or a latch v such that t + delay is an input crucial time of v and one of its inputs is wired to the output of u. Intuitively, the time at which a gate or a latch is input crucial is the time at which it is important that this gate receives a correct input.
Fig. 1. The input crucial time of circuit elements is depicted on the right. A gap between two dotted lines represents delay time (e.g., ε = 3 · delay)
Definition 2. Given a gate g in the implementation of f, the point of time t is considered a computation crucial time for g iff the interval [t − delay, t] overlaps an input crucial time of g. Intuitively, if it is crucial that a gate g receives a correct input at time t, then it is crucial that the computation of the gate be correct not only at time t but also up to delay time later, when the ("last") signal propagates through the gate. A seu hitting the circuit influences the correct evaluation of a certain gate for a certain period. If this period does not overlap with the computation crucial time of the gate, then the circuit will compute correctly. We will use this observation to compute a (lower) bound on the probability that the circuit computes correctly. Next we present an algorithm to compute ICTi, the input crucial time, for each gate and latch represented by a node vi. Later we will use the output of this algorithm to compute the computation crucial time of vi for vi ∈ Vlogic and then to compute a lower bound for pf.
3.1 Calculating Input Crucial Time
An algorithm for computing ICTi for all vi ∈ V is presented in Figure 2. Step (3) of the algorithm scans all the nodes (nodes represent gates) according to a topological sort of the computation dag, computing at each node the input crucial time of the gate or latch that it represents. Figure 3 and Figure 4 describe an efficient implementation of step (3) of the algorithm. Step (3) of the algorithm computes ICTi also for vi ∈ Vinput, that is, nodes that represent input latches. In the case where the input to the circuit does not come from a latch, ICTi of vi ∈ Vinput represents the time at which it is crucial that the input to the circuit be correct (offset by delay).
Input: Gf = (Vinput ∪ Vlogic ∪ Voutput, E), delay, ε
Output: {ICTi | vi ∈ Vlogic ∪ Vinput} /* the input crucial times of each logic gate and input latch */
(1) Compute a topological sort of Gf where Vinput (Voutput) are the first (last, respectively).
(2) For each vi ∈ Voutput do:
(2.a) ICTi ← {[0, ε]}
(3) For each vi ∈ (Vlogic ∪ Vinput) in decreasing order of the topological sort do:
(3.a) ICTi ← ∪_{(i,j)∈E} {t − delay | t ∈ ICTj}
Fig. 2. Abstract presentation of the algorithm for computing input crucial time
Lemma 1. The set ICTi computed in Figure 2 is the set of input crucial times of the gate or input latch that is represented by the node vi in the computation-dag.
We use the fan-out limitations of a circuit to conclude that the number of edges of a computation-dag, |E|, is in O(n). The depth of the circuit, depth, is max{|P(vi, vj)| | vi ∈ Vinput, vj ∈ Voutput}, where P(vi, vj) is a path in Gf connecting vi to vj, and |P(vi, vj)| is the number of edges in that path. Note that the depth of every circuit cannot exceed the number of gates (every path is a simple path), and hence depth ∈ O(n), but it is usually much less. It turns out that the implementation of step (3) of the algorithm in Figure 2 influences the complexity of the crucial times computation. We propose representing the time periods as time segments, as detailed in Figures 3 and 4.
Data structure: ∀vi ∈ V, we hold ICTi as a list of disjoint closed intervals ordered by the starting time of each interval, that is, intervals ⟨start_i^j, end_i^j⟩ where ∀i, j : start_i^j ≤ end_i^j and ∀i, j1, j2, j1 < j2 : end_i^{j1} < start_i^{j2}.
Comment: ICTj − delay in line (3.a) should be applied to both the start and end times, namely replacing every ⟨start, end⟩ ∈ ICTj by ⟨start − delay, end − delay⟩.
(3) For each vi ∈ (Vlogic ∪ Vinput) in decreasing order of the topological sort do:
(3.a) ICTs ← {ICTj − delay | (vi, vj) ∈ E}
(3.b) ICTi ← ICTs[1]
(3.c) For 2 ≤ k ≤ |ICTs| do
(3.c.1) ICTi ← MergeCrucialTimes(ICTi, ICTs[k])
Fig. 3. Detailed description of step (3) of Figure 2
Lemma 2. The overall duration of the input crucial times that can exist in a set ICTi of node vi cannot exceed (depth · delay) + ε.
Input: Two lists, ICT1, ICT2, of disjoint closed intervals ordered by the starting time of each interval.
Output: ICT, a list of disjoint closed intervals ordered by the starting time of each interval, holding the union of the times in ICT1 and ICT2.
(1) ICT ← ∅
(2) While ((ICT1 ≠ ∅) ∨ (ICT2 ≠ ∅)) do
(2.a) Pull ⟨start, end⟩, the interval with the earliest starting time, from ICT1 or ICT2.
(2.b) Let ⟨start′, end′⟩ be the last interval in ICT. If start ≤ end′ then
(2.b.1) end′ ← max(end′, end)
(2.c) else
(2.c.1) Add ⟨start, end⟩ to ICT
Fig. 4. MergeCrucialTimes routine
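For reference, the following is a sketch that follows Figures 2–4 in Python; the graph representation (dicts and a topologically ordered node list) and the parameter names are our choices, and eps denotes the latch write-enable window ε used above.

def merge_crucial_times(ict1, ict2):
    # Union of two sorted lists of disjoint closed intervals (Fig. 4).
    merged, i, j = [], 0, 0
    while i < len(ict1) or j < len(ict2):
        # pull the interval with the earliest start time
        if j >= len(ict2) or (i < len(ict1) and ict1[i][0] <= ict2[j][0]):
            start, end = ict1[i]; i += 1
        else:
            start, end = ict2[j]; j += 1
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))  # extend last interval
        else:
            merged.append((start, end))
    return merged


def input_crucial_times(nodes_topo, succ, outputs, delay, eps):
    """Input crucial times ICT_i for every node, following Figs. 2-3.

    nodes_topo: all nodes in topological order (input latches first, output
    latches last); succ[v]: successors of v in the computation dag; outputs:
    the output-latch nodes; delay: gate propagation time; eps: the window
    during which the output latches must see a stable, correct input."""
    outputs = set(outputs)
    ict = {v: [(0.0, eps)] for v in outputs}
    for v in reversed(nodes_topo):            # decreasing topological order
        if v in outputs:
            continue
        shifted = [[(s - delay, e - delay) for (s, e) in ict.get(u, [])]
                   for u in succ.get(v, ())]
        shifted = [lst for lst in shifted if lst]
        cur = shifted[0] if shifted else []
        for lst in shifted[1:]:
            cur = merge_crucial_times(cur, lst)
        ict[v] = cur
    return ict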
Lemma 3. If ICTi holds the input crucial times as a set of maximal continuous time intervals, then for every interval ⟨start, end⟩ ∈ ICTi it holds that: (a) end − start ≥ ε, (b) start ∈ {−(i · delay) | i ∈ IN}.
Lemma 4. If the crucial time units ICTi of a node vi ∈ V are represented as a set of maximal continuous time intervals, then the number of intervals in ICTi, denoted |ICTi|, is in O(depth).
Lemma 5. The time complexity of the MergeCrucialTimes routine described in Figure 4 is O(|ICT1| + |ICT2|).
Lemma 6. The time complexity of an iteration in step (3) for node vi is O(OutDegree(vi) · depth).
Lemma 7. The time complexity of step (3) of the algorithm is O(n · depth).
Theorem 1. The time complexity of the algorithm is O(n · depth).
3.2 Computing a Lower Bound for Correct Computation
pf denotes the probability that a circuit computes correctly as defined in Definition 1. In this section we use the output of the algorithm presented in Section 3.1 to compute (a lower bound on) pf . Definition 3. A computation of a function is single event upset free (seu free), if there exist no gate g and time t for which: (a) g is computation crucial at time t, and (b) a seu causes g to compute incorrectly during time t.
Let ef be the event in which a computation is seu free. For every event e, we denote the probability of e occurring by Pr(e). It holds that pf ≥ Pr(ef), because in the case in which all the gates compute correctly during their computation crucial times, the output of the circuit is correct. This is an inequality due to the possibility that logical masking will occur (that is, an incorrect result of a gate may not necessarily imply an incorrect result of the output, e.g., [14]).
Definition 4. We say that a computation of a function is single event upset free (seu free) with respect to a gate g, if there exists no time t for which g is computation crucial at time t and a seu caused g to compute an incorrect result at time t.
Given a function implementation with n gates g1, . . . , gn, we define ei to be the event in which the computation of the function is seu free with respect to gi. We assume that any two events ei and ej, i ≠ j, are independent. Therefore the following holds: Pr(ef) = Π_{i=1}^n Pr(ei). Replacing Pr(ei) by pi we have pf ≥ Π_{i=1}^n pi. We show how to compute pi for every gate gi using ICTi, the set of input crucial time units of gi. In doing so we consider the physical characteristics of the influence of seus on the particular circuit technology. Recall that transient faults (in the form of single event upsets) are electrical pulses caused by an ionized alpha particle hitting a gate. The result of such a hit is a pulse of electrical current with a typical shape (see Figure 5, which depicts the data given in [7]). The shape of the pulse depends not only on the technology used (e.g., packaging material) but also on the energy of the particle hitting the gate. Physical experiments relate a particle of a certain energy hitting a gate to specific pulse characteristics [12,10,9,6]. Physical experiments also supply us with the probability that a particle with an energy level Q (causing a pulse with a specific shape, for a given technology) hits a gate during a specified time period. The electrical pulse created by the seu damages the computation made by the gate it hits only when it exceeds a certain current threshold. This threshold can be determined given the technology of the circuit (e.g., process technology, voltage levels). Therefore, it is common to model the electrical pulse of a seu as a logical pulse, that is, a rectangular (or step function) pulse with only two possible values, 0 and 1. The logical pulse is 0 whenever the electrical pulse does not exceed the gate's threshold and 1 when it exceeds it. The simulations of the effect of soft errors proposed by iRoC technologies [6] use a similar model when injecting single event upsets into the simulated circuit. Given ICTi, the input crucial time of a gate gi, it is easy to compute CCTi, its computation crucial time, according to Definition 2. First we present a method to determine pi from CCTi, the crucial time units of the gate, assuming that all particles carry the same energy (thus causing an identical electrical pulse). Next, we improve the method by considering particles of different energies.
Fixed duration of pulses: In this method we consider all particles to carry the same energy. One way of doing so is by averaging the energy of particles whose energy causes a pulse that exceeds the gate's threshold (according to their occurrence probability). Thus the pulses can be modeled as logical pulses with a constant duration (defined per a given technology).
Fig. 5. The current over time of a particle hit (particles with different energies); current in mA versus time in ns
Denote by dseu the duration of the logical pulse. If t is a computation crucial time of a gate gi, then any particle with the average energy hitting gi and causing a logical pulse which starts anywhere in the range [t − dseu, t] will cause a disruption in the computation made by gi at time t. For each 1 ≤ i ≤ n, we define ĈCTi to be ∪_{t∈CCTi} [t − dseu, t]. ĈCTi is the set of times such that, if a particle pulse begins within it, then the gate gi may have an incorrect output that reaches the output latches of the circuit at the time the latches store the circuit result. Figure 6 depicts an example of computing ĈCTi given the set of computation crucial times of gi, CCTi. The set ĈCTi can easily be computed by extending each maximal continuous duration in CCTi to the left by dseu, and merging when needed. Denote by qseu(d) the probability that a particle (with an energy level sufficient to create a pulse that exceeds the threshold of a gate) hits a certain gate during d time. Our bound for pi in this case will be pi ≥ Π_{⟨start,end⟩∈ĈCTi} (1 − qseu(end − start)). Note that when no particle hits the gate during a duration equal in size to ĈCTi, none of the logical pulses overlap with the computation crucial times of the gate (implying that gi does not violate the computation seu-free property).
Fig. 6. Example of calculating the times at which logical pulses may influence the computation result: (a) the computation crucial time periods, CCTi; (b) the periods in which logical pulses must not start, ĈCTi.
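The fixed-duration bound can be sketched as follows; q_seu is supplied by the technology model discussed above, the independence of hits in disjoint windows is assumed as in the text, and the function names are ours.

def extend_left(cct, d_seu):
    """Compute the extended set CCT^_i: shift the left end of every computation
    crucial interval by d_seu and merge overlapping intervals."""
    ext = sorted((s - d_seu, e) for (s, e) in cct)
    merged = []
    for s, e in ext:
        if merged and s <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))
    return merged


def gate_masking_bound(cct, d_seu, q_seu):
    """Lower bound on p_i for one gate: the probability that no particle pulse
    starts inside CCT^_i.  q_seu(d) is the probability that a particle strong
    enough to exceed the gate's threshold hits the gate during a window of
    length d."""
    p = 1.0
    for s, e in extend_left(cct, d_seu):
        p *= 1.0 - q_seu(e - s)
    return p


def circuit_masking_bound(ccts, d_seu, q_seu):
    """p_f >= product of p_i over all gates, as in the analysis above."""
    p = 1.0
    for cct in ccts:
        p *= gate_masking_bound(cct, d_seu, q_seu)
    return p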
The above may not be accurate when particles contain different energy levels. To gain further accuracy, we propose considering the probabilistic distribution on the energy of the particles.
Different pulse durations: Given a gate threshold, we divide the particles hitting the gate into classes according to the duration of the logical pulse they cause. Particles with an energy causing a logical pulse with a duration in ((D − 1)δ, Dδ] are in the D'th class. The smaller the value we choose for δ, the more accurate our analysis will be. Since the duration of the logical pulse is a monotonic function of the energy of the particle creating it, the classes are simply a division into continuous energy levels. We denote by qseu^D(d) the probability that during a time period of length d, our gate will be hit by a particle of the D'th class, which will cause a logical pulse with a duration in the range ((D − 1)δ, Dδ]. Let ∆ be the maximal duration of the logical pulse caused by a particle. Then we have ∆/δ functions, one for each class of particles. Our analysis is by a simple reduction to the case of fixed-duration logical pulses. For each 1 ≤ k ≤ ∆/δ we compute the probability that our gate's crucial computation does not overlap with a logical pulse caused by particles of the k'th class; in doing so we assign qseu := qseu^k and dseu := kδ. We bound pf by the product of the results.
3.3 Logical Masking Analysis
First we prove that the analysis of logical masking is an NP-complete problem; then we present techniques for analyzing limited (yet important) cases of majority circuits.
Complexity of Logical Masking Problems
Definition 5. Given a circuit C : {0, 1}^n → {0, 1} containing gates g1, . . . , gk, a function q : {g1, . . . , gk} → (0.5, 1], and an input x ∈ {0, 1}^n, we define FT_{C,q}(x) to be the probability that C computes correctly on input x when each gate gi (1 ≤ i ≤ k) of C computes correctly with probability q(gi).
Definition 6. Given a circuit C : {0, 1}^n → {0, 1} containing gates g1, . . . , gk, and a function q : {g1, . . . , gk} → (0.5, 1], we define FT_{C,q} = min{FT_{C,q}(x) | x ∈ {0, 1}^n}, that is, the minimal resiliency of C with respect to q.
We show that unless P = NP, there is no polynomial algorithm for computing FT_{C,q}, even for the case in which C is a formula [19]. Specifically, we show that if such an algorithm exists, we can determine in polynomial time whether a formula ψ is satisfiable. Given a formula ψ, we create a formula Φ as depicted in Figure 7. The formula Φ consists of three copies of the formula ψ and two and gates wired as depicted. We choose q(a) = q(b) = 3/4 (where a and b are the two and gates depicted in Figure 7), and q(g) = 1 for all other gates g. We determine whether ψ is satisfiable using the observation that the probability that Φ computes correctly on input x (FT_{Φ,q}(x)) is closely related to whether or not x satisfies ψ. When Φ computes on an input x ∈ {0, 1}^n such that ψ(x) = 0, one of the inputs to the and gate b is always 0 (with probability 1). Thus, the probability that the output of b will be correct is 3/4, and we have FT_{Φ,q}(x) = 3/4.
Fig. 7. The formula Φ as a function of ψ. ψi (i = 1, 2, 3) are three independent copies of ψ.
When Φ computes on an input x ∈ {0, 1}^n such that ψ(x) = 1, the input to the and gate a is always ⟨1, 1⟩ (with probability 1) and the probability that the output of gate a is 1 is 3/4. The probability that the and gate b produces 1 (the correct result) is composed of two situations: the first is when the input to gate b is ⟨1, 1⟩ and the gate b computes correctly, and the second is when the input of b is not ⟨1, 1⟩ and b does not compute correctly (error cancellation). Thus, in the case of an x that satisfies ψ, we get that FT_{Φ,q}(x) = (3/4)^2 + (1 − 3/4)^2 = 5/8. In the case where ψ is satisfiable, there exists x ∈ {0, 1}^n such that ψ(x) = 1, and FT_{Φ,q}, the minimal resiliency of Φ with respect to q, will be min(3/4, 5/8) = 5/8. In the case where ψ is not satisfiable, ψ(x) = 0 for all x ∈ {0, 1}^n and FT_{Φ,q} = 3/4.
Definition 7. Given a formula ψ : {0, 1}^n → {0, 1} containing gates g1, . . . , gk, a function q : {g1, . . . , gk} → (0.5, 1], and a constant Q ∈ (0, 1), we define the Formula-Resiliency problem to be the problem of determining whether there exists x ∈ {0, 1}^n such that FT_{ψ,q}(x) < Q.
Theorem 2. The Formula-Resiliency problem is NP-Complete.
Proof. The Formula-Resiliency problem is NP-Hard: the proof is given by the reduction from SAT given above. The Formula-Resiliency problem is in NP: the nondeterministic algorithm guesses x ∈ {0, 1}^n, computes FT_{ψ,q}(x) (described below), and, if the result is less than Q, returns true; otherwise it returns false. Computing FT_{ψ,q}(x) is done by recursively computing the probability that the output of a gate is 1. Given an input x ∈ {0, 1}^n, we mark by P_{ψ,q,x} : {x1, . . . , xn, g1, . . . , gk} → [0, 1] a function mapping each gate and input of the formula to the probability that this gate/input will produce the value 1. Therefore, for inputs xi s.t. xi = 1 we get P_{ψ,q,x}(xi) = 1, and for inputs xi s.t. xi = 0 we get P_{ψ,q,x}(xi) = 0. For gates gi, the function P_{ψ,q,x} is computed recursively as follows: if gi is an and gate and its inputs are A and B, with P_{ψ,q,x}(A) = qA and P_{ψ,q,x}(B) = qB, then P_{ψ,q,x}(gi) = [(qA · qB) · q(gi)] + [(1 − (qA · qB)) · (1 − q(gi))]. The above is due to the fact that an and gate produces 1 in two cases: the first is when it receives the input ⟨1, 1⟩ and computes correctly, and the second is when it receives another input and fails to compute correctly. Similarly, in the case where the gate gi is an or gate, we get P_{ψ,q,x}(gi) = ((1 − qA) · (1 − qB) · (1 − q(gi))) + ((1 − ((1 − qA) · (1 − qB))) · q(gi)).
Denote by go the output gate of the circuit; then FT_{ψ,q}(x) will be either P_{ψ,q,x}(go), when the correct output is 1, or 1 − P_{ψ,q,x}(go), when the correct output is 0.
Majority circuits: Most error resilient circuitry employs some sort of error correction code (information redundancy) for masking errors in memory [4]. The common scheme used to mask errors in logic circuitry is to use three or more parallel copies of the circuit and a voter (majority circuit) to determine the correct result [4]. The voter in this scheme can be considered a decoder from an encoding in which the redundant word is a simple copy of the non-redundant word. The correctness of the computation made by the decoder is vital for the correctness of the scheme. We introduce a model for the definition of logical masking of functions that may receive incorrect input due to soft errors. Specifically, we demonstrate the model on the majority function on three inputs I1, I2, I3. We use a discrete model for time, where each time unit represents a duration of delay, the propagation time in a gate. That is, time unit t represents the duration [t · delay, (t + 1) · delay]. It turns out that adding timing information to the description allows a better analysis and lower bounds for the probability that the majority circuit will be able to mask transient faults; in particular, it is possible that there is no (fixed) majority during the input crucial times of the majority circuit (details are omitted from this extended abstract).
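A sketch of the recursive computation used in the NP-membership argument is given below; it is restricted to formulas (each gate output feeds a single gate, so the inputs of every gate are independent), and the dictionary-based representation of the formula is our choice.

def ft_formula(gates, inputs, q, x, out_gate):
    """Compute FT_{psi,q}(x) for a formula by propagating, for every wire, the
    probability that it carries the value 1.

    gates:  dict gate -> (kind, left, right), kind in {"and", "or"}, operands
            are input names or other gate names; inputs: list of input names;
    q:      dict gate -> correctness probability in (0.5, 1];
    x:      dict input name -> 0/1; out_gate: name of the output gate."""
    memo = {name: float(x[name]) for name in inputs}

    def p_one(name):
        if name in memo:
            return memo[name]
        kind, a, b = gates[name]
        pa, pb = p_one(a), p_one(b)
        if kind == "and":
            p_in = pa * pb                       # probability both inputs are 1
        else:                                    # "or"
            p_in = 1.0 - (1.0 - pa) * (1.0 - pb)
        # output is 1 if the pattern forcing a 1 arrives and the gate computes
        # correctly, or another pattern arrives and the gate errs
        memo[name] = p_in * q[name] + (1.0 - p_in) * (1.0 - q[name])
        return memo[name]

    def correct_value(name):
        # noiseless evaluation, used to decide which output value is "correct"
        if name in x:
            return int(x[name])
        kind, a, b = gates[name]
        return (correct_value(a) & correct_value(b)) if kind == "and" \
            else (correct_value(a) | correct_value(b))

    p_out = p_one(out_gate)
    return p_out if correct_value(out_gate) == 1 else 1.0 - p_out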
4
Concluding Remarks
In this paper we lay solid ground for executing self-stabilizing programs (in fact, any program) by presenting a design for a self-stabilizing microprocessor. We have presented an accurate and efficient method for computing the probability of correct computation in the presence of soft errors. The analysis is algorithmic and is not based on simulations. Our method can serve the circuit designer to modify the design in order to avoid (weak) points of failure. In addition, we suggest a way to analyze logical masking using the (natural) example of a majority circuit. The goal is a processor that masks soft-errors with high probability and has a fall-back mechanism, in the form of its self-stabilization property, to automatically recover in severe cases in which soft-errors are not masked.
Acknowledgment. It is a pleasure to thank Amos Beimel and Enav Weinreb for helpful remarks and pointers to relevant literature.
References
1. S. Dolev, Self-Stabilization, MIT Press, 2000.
2. S. Dolev and T. Herman, "Dijkstra's Self-Stabilizing Algorithms in Unsupportive Environments," WSS01, LNCS 2194, pp. 67–81, 2001.
3. A. Fox and D. Patterson, "Self-Repairing Computers," Scientific American, June 2003.
4. C. N. Hadjicostis, Coding Approaches to Fault Tolerance in Combinational and Dynamic Systems, Kluwer, 2002.
5. J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach, Morgan Kaufmann, 2002.
6. iRoC Technologies, Roban(TM), white papers. http://www.iroctech.com.
7. M. Kistler, P. Shivakumar, L. Alvisi, D. Burger, and S. Keckler, "Modeling the effect of technology trends on the soft error rate of combinational logic," in ICDSN, volume 72 of LNCS, pages 216–226, 2002.
8. P. K. Lala, Self-Checking and Fault-Tolerant Digital Design, Morgan Kaufmann, 2001.
9. F. Lima, S. Rezgui, L. Carro, R. Velazco, and R. Reis, "On the use of VHDL Simulation and Emulation to Derive Error Rates," Radiation Effects on Components and Systems Conference (RADECS), Grenoble, France, 2001.
10. P. C. Murley and G. R. Srinivasan, "Soft-error Monte Carlo modeling program, SEMM," IBM Journal of Research and Development, vol. 40, no. 1, pp. 109–118, 1996.
11. J. von Neumann, "Probabilistic logics and the synthesis of reliable organisms from unreliable components," Automata Studies, C. E. Shannon and J. McCarthy, Eds., Princeton University Press, pp. 329–378, 1956.
12. E. Normand, "Single Event Upset at Ground Level," IEEE Transactions on Nuclear Science, vol. 43, pp. 2742–2751, 1996.
13. D. Patterson, Recovery Oriented Computing. http://www.cs.berkeley.edu/.
14. N. Pippenger, "On networks of noisy gates," Proc. of the 26th IEEE Symposium on Foundations of Computer Science, pp. 30–36, 1985.
15. N. Pippenger, "Analysis of error correction by majority voting," Advances in Computing Research, Volume 5, JAI Press, pages 171–198, 1989.
16. S. K. Reinhardt and S. S. Mukherjee, "Transient fault detection via simultaneous multithreading," ISCA, pages 25–36, 2000.
17. E. Rotenberg, "AR-SMT: A microarchitectural approach to fault tolerance in microprocessors," Symposium on Fault-Tolerant Computing, pp. 84–91, 1999.
18. A. Tanenbaum, Structured Computer Organization, Prentice-Hall, 1984.
19. H. Vollmer, Introduction to Circuit Complexity, Springer-Verlag, 1999.
Enforcement of Architectural Safety Guards to Deter Malicious Code Attacks through Buffer Overflow Vulnerabilities
Lynn Choi and Yong Shin
Department of Electronics and Computer Engineering, Korea University, Anam-Dong, Sungbuk-Ku, Seoul, Korea
{lchoi, sy3620}@korea.ac.kr, http://it.korea.ac.kr
Abstract. The buffer overflow attack is the single most dominant and lethal form of security exploit, as evidenced by recent worm outbreaks such as Code Red and SQL Slammer. In this paper, we propose a new architectural solution to detect and deter buffer overflow exploits. The idea is that buffer overflow attacks usually exhibit abnormal symptoms in the system. This kind of unusual behavior can be detected simply by checking the integrity of instruction and data references at runtime, avoiding the potential data or control corruption made by such attacks. Both the hardware cost and the performance penalty of enforcing the integrity rules are negligible. By performing detailed execution-driven simulations on programs selected from the SPEC CPU2000 benchmark, we evaluate the effectiveness of the proposed safety guards. By randomly corrupting the stack and other data sections of a process's address space during simulation, we create various buffer overflow scenarios, including both stack and heap smashing. Experimental data show that enforcing two safety guards not only reduces the number of system failures substantially but also thwarts virtually all forms of malicious code execution made by stack smashing or function pointer corruption.
1 Introduction
Since the advent of the Morris Worm in 1988, the explosive growth of the Internet has increased both the volume and the types of security exploits on the Internet. Figure 1 shows the number of security incidents reported each year by CERT since 1988 [6]. As shown in the figure, the y-axis is drawn on a log scale, which implies that the number has been increasing exponentially for most of the 1988-2003 period. These security exploits are made through flaws in programs called vulnerabilities, through which remote attackers gain unauthorized access, launch arbitrary code, or cause unstable behavior such as denial-of-service conditions. As Figure 1 demonstrates, the number of these vulnerabilities has also been increasing rapidly.
Among these vulnerabilities and attacks, the buffer overflow vulnerability and the attacks exploiting it are considered the most serious security problem. This can be attributed to several factors. First, overflowing a buffer not only corrupts the data near the buffer but can also usurp control of the program and execute arbitrary code of malicious intent. Second, since the malicious code can replicate and propagate itself without any manual activation, it has the fastest propagation speed among all forms of malicious code attacks. For example, the SQL Slammer worm infected more than a hundred thousand unpatched SQL servers around the world in about 8 minutes, according to CAIDA [5]. Third, the buffer overflow attack is the most prevalent form of security exploit. It accounts for approximately a quarter of all security vulnerability types during the last three years [8] and for fourteen of the top twenty most critical Internet security vulnerabilities selected by the SANS Institute and the FBI [15]. Finally, it is the most persistent form of security exploit. Although the buffer overflow problem has been known for a long time, since the Internet worm written by Robert Morris in 1988, it continues to present a serious security threat, as evidenced by recent worm outbreaks such as Code Red I/II, SQL Slammer, and W32/Blaster. Furthermore, it is predicted that buffer overflow attacks will still be a problem in the next twenty years [3].
Fig. 1. CERT statistics on security incidents and vulnerabilities. The numbers for 2003 reflect only reports received through the second quarter of the year
Stack smashing, which modifies the return address and injects malicious code into the stack, is the most common form of buffer overflow attack. Most existing work has focused on this type of attack. Although various software solutions [2], [7], [9], [12], [13], [14], [17] have been proposed in the form of operating system fixes, compiler patches, debugging tools, and runtime libraries, these techniques have not been widely adopted. In most operating systems, the stack region is marked as executable; making the stack non-executable can effectively defeat stack smashing [13]. However, nested function calls and trampoline functions, which are used by functional-language implementations such as LISP interpreters and by C compilers such as gcc, also need an executable stack to work properly. In addition, the most common implementations of signal handler returns
on Linux rely on an executable stack. Still, the most prevalent form of countermeasure is the manual download of individual patches and fixes that contain the binaries of modified and recompiled source code. However, this only addresses the particular vulnerability of a particular product after the vulnerability is publicly known. In this paper, we propose a novel hardware solution to the buffer overflow problem. It requires neither recompilation nor the execution of the extra code needed for the bounds checking involved in software solutions. Furthermore, hardware-level protection can secure all system and application programs running on the processor rather than a particular product with a particular software patch. The idea is that a buffer overflow attack usually exhibits unusual symptoms or traces, which can be detected simply by the hardware. For example, the most common form of buffer overflow attack, called stack smashing, modifies a return address in the process's stack frame, which is not possible in normal program execution. Likewise, the execution of malicious code often requires instruction fetches from the stack or data region of a program's virtual address space, which is not common in normal program execution. To detect such abnormal symptoms, the processor verifies the safety of instruction and data references by checking the address range of those references. Both the hardware cost and the performance penalty of the safety checking are negligible. To evaluate the effectiveness of our proposed safety guards and to investigate all the possible scenarios and consequences of buffer overflow attacks, we simulated five applications from the SPEC CPU2000 benchmark using the SimpleScalar simulation tool set. By randomly injecting malicious code into the stack, heap, or data regions of a program's memory space, we were able to analyze different scenarios of buffer overflow attacks. From the experimental data, we observe that a simple injection of arbitrary data into the local stack frame can lead to serious damage to the attacked process, such as denial-of-service conditions. In contrast, attacking global data or heap regions is much harder. In addition, most system failures are due to control corruption rather than pure data corruption. Lastly, we find that enforcing two simple architectural safety guards can substantially decrease the number of system failures for both stack and heap smashing attacks. In addition, the safety guards can virtually eliminate all forms of malicious code execution by avoiding the control corruption caused by return address or function pointer corruption. Section 2 presents an in-depth discussion of various buffer overflow attack scenarios. Section 3 discusses how an input buffer overflow can change the data and eventually corrupt the control flow of a program execution. We then introduce data and instruction reference safety guards and show how these guards can be employed at runtime to detect and deter various scenarios of data and control corruption made by an attack. Section 4 describes our experimentation environment and evaluates the effectiveness of those architectural safety guards in reducing the control or data corruptions made by such attacks. Finally, Section 5 concludes the paper and discusses future work.
2 Buffer Overflow Attack Scenarios
Except for a few safety-oriented languages such as Ada and Java¹, most languages are vulnerable to buffer overflow attacks because they do not require bounds checking of arrays. Programs written in C or C++ are particularly vulnerable since many implementations of C standard library routines, such as strcpy and gets, do not check the bounds of their arguments. Therefore, it is up to programmers to check explicitly that the use of these functions does not overflow buffers. However, programmers often omit these checks. In addition, it is not always possible to perform bounds checking since arrays are often passed without any hint of their sizes. Thus, writing past the end of an array is possible during the execution of these programs. This is called a buffer overflow vulnerability and is exploited by a vast number of security exploits to attack vulnerable systems. Various types of buffer overflow attacks have been discovered. Among them, the simplest and most popular form of attack is called stack smashing [1], which overwrites a buffer on the stack to replace the return address. When the function returns, instead of jumping to the return address, control jumps to the code that was placed on the stack by the attacker. This gives the attacker the ability to execute arbitrary code. To exploit such a vulnerability, an attacker merely has to enter an input larger than the size of the buffer and encode an attack program binary in that input. The Morris Worm of 1988 exploited this type of buffer overflow vulnerability in fingerd on UNIX systems.
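The mechanism can be pictured with a toy model of ours (the buffer size and cell contents below are invented for illustration; a real frame holds machine words, not labeled strings): the saved return address sits just past a fixed-size local buffer, so an unchecked copy that is longer than the buffer spills into the return-address slot.

# Toy model of a stack frame: the local buffer occupies the first BUF_SIZE
# cells, followed by the saved frame pointer and the return address.
BUF_SIZE = 8

def vulnerable_copy(frame, payload):
    # Mimics strcpy: copies without checking the buffer bound.
    for i, byte in enumerate(payload):
        frame[i] = byte          # i >= BUF_SIZE silently overwrites saved state
    return frame

frame = ["-"] * BUF_SIZE + ["saved_EBP", "return_addr"]
attack = list("A" * BUF_SIZE) + ["fake_EBP", "addr_of_shellcode"]

smashed = vulnerable_copy(frame, attack)
print(smashed[-1])   # -> 'addr_of_shellcode': the function will "return" into it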
Fig. 2. Return address corruption in x86 processors
Figure 2 shows how the return address can be modified by a buffer overflow attack. When a vulnerable function like strcpy reads an input into a local variable, the external input may overflow nearby locations in the function's stack frame and can eventually corrupt the return address. There are three scenarios in which the modified return address can change the control flow of the attacked process. First, the return address can point to the beginning of the malicious code that is brought in by the input data. Often the code sends duplicate copies of itself to other vulnerable hosts, as in the SQL Slammer incident. This scenario is the most common. The second scenario is to modify the return address to point to another function or instruction in the process's text area. This simply modifies the control flow of the program to execute an internal function without permission. The third scenario is to modify the return address to point to an arbitrary memory location, such as the data area or an invalid memory location. Essentially, writing any arbitrary value to the return address can lead to denial-of-service conditions. For example, the modified return address can point to an address in the global data region; on return from the function, the process may incur an undefined-instruction fault. All three cases assume that the return address is modified directly by the data brought into the overflowed variable. This is called a linear attack. Figure 3 shows two examples where a return address can be modified indirectly, either by changing another local pointer variable or by changing the previous frame pointer saved on the stack. These two cases are called non-linear attacks; they are much less common than linear attacks but have been reported in a few buffer overflow attacks [4], [16].
¹ Unfortunately, applications written in safety-oriented languages such as Java can also be attacked [11] due to errors in the Java type checking implementation.
Fig. 3. Linear versus non-linear attack scenarios
1. Figure 3(a): linear attack – The return address is modified directly by the data brought into the overflowed variable.
2. Figure 3(b): non-linear attack – The data copied into the variable overflows and modifies a pointer variable that points to the return address. Thus, when the function later copies data into the location pointed to by the pointer variable, the return address can be modified [4].
3. Figure 3(c): non-linear attack – The data copied into the variable only modifies the previous frame pointer stored on the stack. When the function returns, the caller can have a wrong frame pointer; thus, the location of the return address, rather than the return address itself, is changed. This can also be used to change the flow of control and execute malicious code [16].
All of the above cases are called stack smashing and target the stack region of a program's address space. More sophisticated buffer overflow attacks may exploit unsafe buffer usage in the heap or the global data regions of a program's memory space. In this case, any instruction pointer, such as a function pointer used by a call instruction or a branch target address used by an indirect branch, may be used to change the control flow of the program instead of a return address. However, this is harder, since it is difficult to locate function pointers in these regions compared to the return address in the local stack frame.
3 Corruptions and Defenses against Buffer Overflow Attacks
3.1 Data and Control Corruptions Made by Buffer Overflow Attacks
Buffer overflow attacks start by corrupting data near the overflowed variable. Since the program text region of a process's memory space is write-protected, only data regions such as the stack, the heap, or global static data can be corrupted. Figure 4 classifies the different types of control or data corruption made by buffer overflow attacks. Depending on when and where the data is corrupted, the data corruption may or may not lead to an erroneous program execution. If the corrupted data item is never referenced in the future, the data corruption does not affect the program execution. We call this dead data corruption. If the same attack occurs at the same location earlier in the execution, the data may still be referenced and can lead to a serious problem. Thus, the impact of a data corruption depends not only on its location but also on the timing of the attack. When the corrupted data item is referenced in the future, we call this live data corruption.
Fig. 4. Classifications of data and control corruptions made by buffer overflow attacks
Live data corruption can be further classified into three categories. First, if the data item is used as the target of a control transfer instruction such as a return or an indirect branch, the data corruption is called control data corruption. This can immediately change the control flow of the process's execution: on a fetch from the modified branch target, the process's execution path is altered. Most buffer overflow attacks try to use this kind of control data corruption to launch malicious code, targeting the return address in the stack frame or function pointers in the stack or in other data regions such as the heap. The second category of data corruption occurs when the attack modifies the branch condition of a control transfer instruction. This may also lead to the modification of the program control flow although it alters neither the return address nor a function pointer. We call this control dependent data corruption. For example, the sample code below demonstrates how a corrupted data item can alter the control flow without changing an instruction pointer: depending on the value of the variable a, different functions are invoked. This type of attack has not been reported previously [6].
if (a > 0) func_a(a); else func_b(a);
We call all other forms of live data corruption data dependent data corruption. This may not change the flow of control directly but modifies live data, which will be referenced by the program in the future. The program may appear to execute normally but with an erroneous result or state. In addition, the data corruption may propagate to other data locations, which can eventually lead to control dependent data corruption or control data corruption. Note that all three cases of live data corruption may lead to an abnormal termination. The reference of the corrupted data item is susceptible to data reference exceptions or execution-related errors caused by invalid data operands. For example, a load from a corrupted data address may result in a TLB miss, a page miss, or an access violation. Even if the corrupted data item is successfully referenced, it may cause an execution-related exception such as an overflow or floating-point exception when it is used by a later arithmetic operation. When the corrupted data item is used by a conditional branch as a branch condition or by an indirect branch as a branch target, it modifies the execution control flow. Specifically, an instruction is fetched from a wrong target address. We call this control corruption, which means that the illegal control flow is caused by the buffer overflow attack. This may lead to malicious code execution if the branch target is modified to point to the worm code brought in by the external input. We call this kind of control corruption external code control corruption. This is the most serious consequence of buffer overflow attacks, since the malicious code can replicate and propagate to other vulnerable hosts. It is also the most common form of buffer overflow attack. The other form of control corruption occurs when the modified branch target points to legitimate code in the text region. We call this internal code control corruption, which has been reported in a few buffer overflow incidents [6].
3.2 Safety Guards: Detection of Buffer Overflow Exploits
Fortunately, a program under attack exhibits abnormal symptoms during its execution, specifically during its data and instruction references. For example, a stack smashing attack modifies the return address and often copies its accompanying malicious code into the stack area outside the current stack frame, neither of which is possible during normal program execution. Furthermore, when it launches the malicious code, the attacked program fetches instructions from the stack region. Except for a few rare cases, such as the implementation of Linux "signal" or gcc "trampoline" functions that require fetching instructions from the current stack frame, it is unusual to fetch instructions from the stack area. Moreover, it is impossible to fetch instructions from the non-local stack area during normal program execution. Both the abnormal instruction reference and the abnormal data reference can easily be detected by the hardware at runtime with a simple safety check on the referenced address. Figure 5 shows how the processor can protect the system against possible data or control corruption made by buffer overflow attacks in the processor pipeline. First, during the instruction fetch stage the value of the program counter can be inspected, as shown in Figure 5(a). If the program counter points to a location either in the program text region or in the current stack frame, the reference is assumed to be safe. For other instruction references, we can enforce the following integrity check to block unsafe instruction references in x86 processors, as shown in (1). We generally call this kind of safety checking during the execution of an instruction a safety guard for the instruction, and we call this specific case of instruction reference safety checking the instruction reference safety guard.
If NOT ((PC ∈ text region) OR (EBP <= PC <= ESP)), then the instruction fetch is blocked for "possible control corruption". (1)
This is a simple range check and can be incorporated into the hardware with virtually no hardware or performance cost. This instruction reference safety guard can effectively eliminate most control corruptions made by a buffer overflow attack, since the external malicious code can only reside in the data regions of the program's address space. However, it may not protect against internal code control corruption, which fetches instructions from the text region. In addition, it cannot protect against the control corruption caused by control dependent data, since it only checks the branch target address rather than the branch condition. Note that it is always better to detect and deter an attack at an earlier stage. Thus, protecting the system at the data corruption stage reduces the damage to the system compared with protecting it only at the control corruption stage. Similar to the instruction reference safety guard, the same level of protection can be provided during the execution stage of a branch instruction: instead of checking the PC, the validity of the branch target address can be inspected when the branch executes. Figure 5(b) shows other safety guards that can be employed at the execution stage. By checking the addresses of load and store instructions we can protect the system against possible data corruption made by stack smashing attacks. Since
Fig. 5. Architectural safety guards enforced at instruction fetch and execution stages and their corresponding protections against data and control corruptions
both the previous frame pointer and the return address are stored outside the local stack frame in x86 processors (i.e., above the location pointed to by the frame pointer), blocking data accesses to the non-local stack frame ensures that neither the return address nor the frame pointer can be modified. Note that this single data reference safety guard can effectively protect against all sorts of stack smashing, including both linear and non-linear attacks [16]. The guard for x86 processors is listed below:
If (data address ∈ stack region) AND NOT (EBP <= data address <= ESP), then the data reference (both loads and stores) is blocked for "possible data corruption". (2)
Alternatively, instead of blocking all data references to the stack outside the local stack frame, only the locations storing the return address or the previous frame pointer can be write-protected to deter a possible control data corruption, depending on the implementation. On detection of an insecure memory reference with the proposed safety guards, the processor has several options. First, it can terminate the process under attack. However, this leads to a denial-of-service condition, since the attacker at least succeeds in terminating the process. The second option is to terminate the currently attacked function invocation and change its control flow to return to the caller by force. We call this a compulsory return. This is possible since the hardware can keep track of the location of the return address and can change the PC by force. The third option is to recover the original state before the attack and return to the caller. However, this would require a hardware buffer to hold the previous or the new architectural state. We evaluate the effectiveness of the compulsory return in Section 4 and plan to investigate the last option further to come up with a practical hardware recovery scheme in our future work.
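For illustration, the two range checks in (1) and (2) can be modeled in a few lines of simulator-style code. This is a sketch of ours, not the authors' implementation; the region bounds are placeholders, and the current frame is taken to lie between EBP and ESP exactly as written in the rules above.

# Simplified model of the two safety guards; all bounds are illustrative.
def instruction_guard(pc, text_lo, text_hi, ebp, esp):
    # Rule (1): fetches outside the text region and the current frame are blocked.
    in_text = text_lo <= pc <= text_hi
    in_frame = ebp <= pc <= esp
    return in_text or in_frame          # False => block: possible control corruption

def data_guard(addr, stack_lo, stack_hi, ebp, esp):
    # Rule (2): loads/stores to the stack but outside the current frame are blocked.
    in_stack = stack_lo <= addr <= stack_hi
    in_frame = ebp <= addr <= esp
    return (not in_stack) or in_frame   # False => block: possible data corruption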
4 Experimentation and Preliminary Results
In our experimentation we set the following two goals. First, we want to investigate different scenarios of buffer overflow attacks. Specifically, we want to analyze how an attack can lead to different data and/or control corruptions, and
eventually to erroneous program behavior. The second goal is to evaluate how effectively our proposed safety guards can deter or reduce the control or data corruptions made by such attacks. To achieve these goals, we decided to use the SPEC CPU2000 benchmarks commonly used for processor performance evaluation. Instead of writing sophisticated worm code for an attack, we decided to randomly inject garbage data into the stack or other data regions of a running process and observe how such a random attack can cause different types of data or control corruption. We chose five applications from the CPU2000 benchmarks. These applications were selected simply because their simulation times are tolerable given our extensive simulations. We plan to extend our experimentation to all other applications in the CPU2000 benchmarks. For the detailed architectural simulations, we use the SimpleScalar 2.0 tool set. A four-issue out-of-order pipeline with the default cache/TLB configuration is assumed as the processor model. All the simulations are run on multiple Dell servers running the Red Hat Linux 8.0 operating system. Each Dell server is equipped with dual 2.2 GHz Intel Xeon processors and 1 gigabyte of RAM. For a clear presentation of our results, we define the following terminology:
1. Failure: If an attack is successful, we (ironically) call this a failure, since a successful attack can cause the system under attack to fail. All data or control corruptions except dead data corruption are considered failures, since the attacked process must have made an illegal control move or must have used a corrupted data item.
2. Failure rate: The percentage of failures out of the total number of attacks.
3. Failure type: The type of control or data corruption made by an attack.
4. Stack smashing: An attack that targets the stack region of a process's address space.
5. Data smashing: An attack that targets other data regions of a process's address space, such as the heap or the static data region.
6. Input: The garbage data used for an attack. The input is carefully organized so that 90% of stack smashing attacks modify the return address to point to a location in the stack, simulating external code control corruption, while the remaining 10% change the return address to point to the text region, simulating internal code control corruption. Data smashing attacks use random garbage input data, which may point to any location in the valid address space.
4.1 System Failure Analysis: Classification of Data/Control Corruptions
Figure 6 shows both the failure rate and the distribution of failure types made by buffer overflow attacks for each of the applications, for both stack and data smashing attacks. We ran 50 independent simulations for each application. Each simulation run comprises a single attack that injects garbage input data of size 256 bytes into the current stack frame or into other data sections of a running process in the middle of its execution. As shown in the figure, the average failure rate of stack smashing, 76.0%, is much higher than the average failure rate of data smashing, 4.8%, which confirms that stack smashing is more effective in attacking a remote system. In extreme cases, all the stack smashing attacks on vortex and equake result in 100% failures, while all the data smashing attacks on gzip, mcf, and bzip2 were fruitless, resulting in dead data
corruptions. Among the failures, 86.6% can be attributed to control corruptions. Except for vortex, most of these control corruptions in stack smashing attacks result from the modification of a return address, i.e., control data corruptions. However, for data smashing attacks all the failures are control corruptions caused by control dependent data, which suggests that it is much more difficult to corrupt a function pointer in data smashing attacks than in stack smashing attacks. We further divide each type of control or data corruption into normal and abnormal termination. Abnormal termination means that the attacked program is terminated abnormally due to an illegal data/instruction reference or an invalid operand. In a few simulations, the attacked program runs forever, i.e., in an infinite loop, which is counted as an abnormal termination in our data. In contrast, normal termination means the program completed without reporting an abort but with an erroneous result. 87.6% of failures are abnormal terminations, which implies that most failing runs are terminated abruptly by the attack and only one out of eight completes, albeit with erroneous results.
Fig. 6. Failure type distribution for stack and data smashing attacks
4.2 Impact of Enforcing Safety Guards
To evaluate the effectiveness of our proposed microarchitectural safety features, we modified the processor model to simulate the enforcement of the safety guards. Enforcing the instruction reference guard can avoid the control corruptions due to external malicious code, but the system is still vulnerable to all sorts of data corruption. In addition, the instruction reference guard cannot avoid internal code control corruption, which is expected to happen rarely both in our experimentation environment and in real situations. Enforcing the data reference safety guard can avoid data corruptions made to the stack region outside the local stack frame. This can effectively deter stack smashing attacks by protecting the return address, which is the most vulnerable control data we want to protect. However, it is useless for data smashing attacks.
Fig. 7. Failure type distribution for stack smashing attacks with 1) no security mechanisms, 2) instruction reference safety guard, 3) data reference safety guard, and 4) instruction and data reference safety guards
Figure 7 shows both the failure rate and the distribution of failure types for stack smashing attacks with the instruction reference safety guard alone, the data reference safety guard alone, and finally both safety guards enabled. As expected, the instruction reference safety guard eliminates all control corruptions due to branch target corruption, since instruction fetches from the non-local stack frame cannot proceed. However, it does not increase the normal execution rate, since all those control corruptions turn into data corruptions instead. Furthermore, it cannot avoid control corruptions caused by control dependent data. In contrast, the data reference safety guard with compulsory return is very effective in eliminating failures caused by stack smashing attacks. With the data reference safety guard, failure rates decrease from 76% to 14% on average. In gzip, the data reference safety guard alone eliminates all system failures caused by stack smashing attacks. Enforcing both safety guards together eliminates all control corruptions, which means that a remote attacker can neither execute malicious code nor change the control flow of the attacked process, even with a control dependent data corruption. However, the system is still vulnerable to denial-of-service attacks caused by data corruptions. Figure 8 shows similar data for data smashing attacks. Although the failure rates are lower, the safety guards are of no use in defending against data smashing attacks. Even though the instruction reference safety guard can eliminate all control corruptions due to control data corruption, this kind of control corruption does not occur in our simulations. By using a larger input, we could generate control data corruptions, i.e., function pointer corruptions, in data smashing attacks, and we confirmed that the instruction reference safety guard successfully eliminates the control corruptions they cause. Due to space limitations, we do not present these data in this paper. However, as expected, the data reference safety guard is useless for data smashing attacks.
Due to the difficulty of locating a function pointer in the heap or static region of an address space, only a few cases of data smashing attacks have been reported. Likewise, it is also more difficult to defend against data smashing attacks. In our future work, we plan to investigate defenses against these more intricate buffer overflow scenarios, including those stack smashing cases that cannot be resolved by our proposed safety guards.
Fig. 8. Failure type distribution for data smashing attacks with 1) no security mechanisms, 2) instruction reference safety guard, 3) data reference safety guard, and 4) instruction and data reference safety guards
5 Conclusions
The issue of security in computing systems no longer remains outside the processor hardware. In addition to existing research and development in the areas of computer and Internet security, as illustrated by firewalls, filtering routers, intrusion detection systems, anti-virus software, VPNs, etc., the issue of security brings forth new directions in all computer science and engineering fields, including secure operating systems, compiler technology to generate safe code, safe application development methodologies, and safety-oriented languages. In this paper, we propose a novel architectural solution to build a more secure processor. The contributions of this paper are the following. First, we analyze the data or control corruptions made by buffer overflow attacks and classify them into different types of data and control corruption. Second, we propose how a processor can detect and deter control or data corruption by checking the safety of data and instruction references. These safety guards can be incorporated into any modern processor with virtually no hardware cost and no performance penalty. Third, we evaluate our proposed scheme and investigate the different attack scenarios of buffer overflow attacks by performing detailed architectural simulations. Our experimentation environment is quite powerful in that we were able to create all kinds of attack scenarios by randomly
injecting garbage data into the data regions of a program's address space, which had not been tried in previous research. Fourth, our experimental data show that enforcing only two safety guards can virtually eliminate all the failures made by data smashing attacks and can eliminate all control corruptions caused by control data corruption. With these promising but preliminary results, we will investigate more effective ways of detecting and avoiding buffer overflow attacks in the future. Moreover, we plan to explore different ways of recovering from a severe data corruption that cannot be avoided under the protection of the safety guards.
References
1. Aleph One, "Smashing the Stack for Fun and Profit", BugTraq Archives. http://immunix.org/StackGuard/profit.html.
2. Arash Baratloo, Navjot Singh, and Timothy Tsai, "Transparent Run-Time Defense Against Stack Smashing Attacks", In Proceedings of the USENIX Annual Technical Conference, June 2000.
3. Brian Snow, "Future of Security", Panel Presentation at IEEE Security and Privacy, May 1999.
4. Bulba and Kil3r, "Bypassing Stackguard and Stackshield", Phrack, 10(56), May 2000.
5. Cooperative Association for Internet Data Analysis (CAIDA), "Analysis of the Sapphire Worm", http://www.caida.org/analysis/security/sapphier/, Jan. 2003.
6. CERT/CC Statistics 1988-2003, http://www.cert.org/stats/cert_stats.html.
7. Changwoo Pyo and Gyungho Lee, "Encoding Function Pointers and Memory Arrangement Checking against Buffer Overflow Attack", In Proceedings of the Fifth International Conference on Information and Communications Security, October 2003.
8. Common Vulnerabilities and Exposures (CVE), "[TECH] Vulnerability Types Seen in CVE", http://cve.mitre.org/board/archives/2002-10/msg00005.html.
9. Crispin Cowan, Steve Beattie, Ryan Finnin Day, Calton Pu, Perry Wagle, and Erik Walthinsen, "Protecting Systems from Stack Smashing Attacks with StackGuard", In the Linux Expo, 1999.
10. Crispin Cowan, Perry Wagle, Calton Pu, Steve Beattie, and Jonathan Walpole, "Buffer Overflows: Attacks and Defenses for the Vulnerability of the Decade", In Proceedings of the DARPA Information Survivability Conference and Exposition, January 2000.
11. Drew Dean, Edward W. Felten, and Dan S. Wallach, "Java Security: From HotJava to Netscape and Beyond", In Proceedings of the IEEE Symposium on Security and Privacy, Oakland, CA, 1996.
12. Gyungho Lee and Akhilesh Tyagi, "Encoded Program Counter: Self-Protection from Buffer Overflow Attack", In Proceedings of the International Conference on Internet Computing, June 2000.
13. Openwall Project, "Linux kernel patch from the Openwall project", http://openwall.com/linux
14. Richard Jones and Paul Kelly, "Bounds Checking for C", http://www-ala.doc.ic.ac.uk/~phjk/BoundsChecking.html, July 1995.
15. SANS Institute and FBI, "The Twenty Most Critical Internet Security Vulnerabilities – The Experts' Consensus", http://www.sans.org/top20, 2003.
16. Symantec, "Blended Attack Exploits, Vulnerabilities and Buffer-Overflow Techniques in Computer Viruses", In Proceedings of the Virus Bulletin Conference, Sept. 2002.
17. Tzi-Cker Chiueh and Fu-Hau Hsu, "RAD: A Compile-Time Solution to Buffer Overflow Attacks", In Proceedings of the 21st International Conference on Distributed Computing Systems, 2001.
Latent Semantic Indexing in Peer-to-Peer Networks
Xuezheng Liu, Ming Chen, and Guangwen Yang
Department of Computer Science and Technology, Tsinghua University, Beijing, P.R. China
{liuxuezheng00,cm01}@mails.tsinghua.edu.cn, [email protected]
Abstract. Searching in decentralized peer-to-peer networks is a challenging problem. In common applications such as Gnutella, searching is performed by randomly forwarding queries to all peers, which is very inefficient. Recent research utilizes metadata or correlations of data and peers to steer the search process, in order to make searching more purposeful and efficient. These efforts can be regarded as primitively taking advantage of the latent semantics inherent in the associations of peers and data. In this paper, we introduce latent semantics analysis to peer-to-peer networks and demonstrate how it can improve searching efficiency. We characterize peers and data with latent semantic indexing (LSI), defined as K-dimensional vectors, which indicates the similarities and latent correlations among peers and data. We propose an efficient decentralized algorithm derived from likelihood maximization to automatically learn LSI from existing associations of peers and data (i.e., from (peer, data) pairs). In our simulations, searching efficiency is greatly improved based on LSI, even with the simplest greedy search preference. Our approach is a framework for exploiting the inherent associations and semantics in peer-to-peer networks; it can be combined with existing searching strategies and utilized in most peer-to-peer applications.
1 Introduction
Peer-to-peer (P2P) networks have become one of the fastest growing and most popular Internet applications. Owing to their ability to exchange data among a large number of users, P2P networks are widely used for file sharing (Kazaa [5], Napster [6], Gnutella [7], Freenet [10], FastTrack [11]). People use P2P networks to publish and share files with their own computers (the participating peers), search for files, and download their preferences from other peers. Currently, the effectiveness of P2P networks depends on the efficiency of searching, which is a crucial factor in exploiting the shared resources of all peers. Many different searching strategies have been proposed for P2P networks. Centralized indexing strategies use special servers to hold centralized content indices and answer queries (Napster [6]). Despite their performance, centralized approaches have many inherent defects: hot spots, server failure, vulnerability, censorship, etc. Therefore, Internet users and research communities have turned to decentralized searching strategies, where content indices are dispersed among all peers and searching is performed in a cooperative manner. It is still a difficult problem to find items efficiently without global knowledge. Based on a local index (which is only
a tiny fraction of the whole), users' queries must be forwarded to a large number of peers, which generates severe message flooding (Gnutella [7]). This paper focuses on the problem of searching efficiency in decentralized P2P networks. We employ the latent semantics of peers and items (files or content shared by peers) to guide the search process, in order to make searching more purposeful and efficient by leading queries to their potential targets. The main concept of our design is that there are strong underlying correlations and semantics among peers and their items, and utilizing these underlying semantics can greatly and essentially improve decentralized searching. In practical P2P networks, peers (i.e., their users) have their respective interests or preferences for items, and items have their own distinct meanings and belong to different subjects. Because of peers' preferences and items' semantics, their correlations are not random. This phenomenon of underlying correlations is usually referred to as latent semantics and is widely used in data mining and information retrieval [1]. We believe that latent semantics also has great significance for searching in P2P networks, as it has had in the common information retrieval realm. In this paper, we primarily try to answer the following two questions:
1. How can we define and obtain latent semantics in P2P networks?
2. How can latent semantics help improve searching efficiency in P2P networks?
To answer the first question, we introduce latent semantics analysis to P2P networks and define latent semantic indexing (LSI) based on our dual probability distribution model. Common latent semantics analysis uses singular value decomposition (SVD) [1, 3, 16] on the correlation matrix, which can hardly be adapted to P2P environments because of the heavy communication required by the SVD computation. Based on the conclusions of SVD-based analysis, we propose our statistical model (the dual probability distribution model) and derive our LSI from it. We also propose an efficient decentralized algorithm to estimate the values of LSI vectors in P2P networks, following the maximum a posteriori (MAP) method and the Bayesian paradigm. To answer the second question, we utilize LSI-based peer ranking to guide the search process. In our simulations, we find that even the simplest ranking approach can significantly improve searching efficiency. We perform two sets of simulations, one for evaluating the LSI estimation algorithm and the other for evaluating the searching efficiency improvement based on LSI. The simulation results strongly support our conclusions. For the rest of the paper, Section 2 surveys related work and gives our motivations. Section 3 proposes our model for LSI in P2P networks. Section 4 presents our LSI estimation algorithm. Section 5 presents our simulation methodology and gives simulation results for the decentralized algorithm. Section 6 discusses how LSI can be used to improve P2P searching. Section 7 shows the simulation results for searching efficiency. Section 8 gives discussions and conclusions.
2 Related Works and Motivations
Recent research has paid much attention to searching in decentralized P2P networks. Gnutella [7] uses a random (or blind) searching strategy, in which user queries are forwarded from each peer to its neighbor peers and so on, as a breadth-first traversal of the overlay. This simple searching strategy can be seen as a prototype for most unstructured P2P networks (Freenet [10], [12]). The main problem of random searching is its poor efficiency, i.e., it can hardly find less-popular items unless it traverses
numerous peers with a flood of messages. To improve random searching, result caching and index replication are used. Freenet [10] caches the locations of recent search results. FastTrack [11] uses high-bandwidth peers as search hubs to replicate indices. Cohen et al. [12] suggest replicating contents according to their query rates so that the expected search size is optimized. The common property of these approaches is the blind manner of probing peers, which has an inherently very low probability of finding results per step unless extremely large caches or replication are used. A group of different P2P research systems (Chord [8], CAN [9]) employ distributed hash tables (DHTs) for locating content. In order to locate every item within a few network hops, they mandate a sophisticated network structure and strict item placement. Despite their novel designs, DHTs confine searching to exact queries, which is a limitation for file sharing applications. So, in this paper we only discuss searching on top of common network structures and natural item placement. To essentially improve searching efficiency, a key point is to replace random probing with more intelligent probing, which focuses on potential targets and can avoid most wasted search steps. We need to exploit the latent semantics in the data.
2.1 Motivation for Latent Semantic Analysis
Latent semantics is a common phenomenon observed in most areas of human activity. When an application involves a lot of items and their consumers, the items and consumers are usually not independent, by reason of the different needs or "interests" inherent in the consumers and the intrinsic meanings, uses, or semantics of the items. The resulting underlying semantic structures in such pairs (and the inner correlations between items and between consumers) are called "latent semantics" [1, 2, 3, 15]. Much research on latent semantics analysis has been done by the data mining and information retrieval communities [1, 2, 3, 16]. These works prove that exploiting latent semantics can greatly improve the efficiency and performance of applications, especially for information indexing and searching. These results and the P2P usage pattern motivate us to introduce latent semantics analysis to P2P networks for improving P2P searching. Another motivation arises from recent P2P research, in which people have begun to differentiate peers and items and use their characteristics to guide the search. Crespo et al. [13] use explicit item semantics to build routing indices. They assign documents "terms" indicating related realms, and maintain in each peer a statistics table serving as a term-based routing index, which indicates how many documents would be found if a query for that term were forwarded to a neighbor peer. Despite the searching improvement, the approach has some drawbacks. First, as most latent semantics analyses have shown, term-based statistics alone cannot fully capture item characteristics, for terms also have underlying correlations and semantics [1, 3, 16]. Second, considering the large number of terms in common P2P systems, it is infeasible to build and maintain term-based routing index tables in practice. To solve these problems, it is natural to use latent-semantics-based indices instead of term-based indices. Other approaches utilize interest-based [14] or possession-based guidance [15] for purposeful searching, which can be regarded as primitively exploiting the latent semantics among peers; by doing so, searching has been greatly improved.
All the above designs motivate us to build a framework that can extract and fully exploit latent semantics in P2P networks, covering both users' preferences and items' semantics. We propose this framework in the following section.
3 Latent Semantic Indexing in P2P Networks
In this section, we present our dual probability distribution model and the framework of latent semantic indexing. We start by discussing the state vectors of peers/items and the conventional SVD-based latent semantics expression; then we describe our model.
3.1 State Vector and SVD-Based Semantics Expression
Let N be the number of peers in the P2P network and M the number of different items. We represent the data in the network by the peer-item matrix D ∈ {0,1}^{N×M}, where D_ij = 1 if and only if peer i contains item j. Matrix D records all (peer, item) pairs in the network, which is all of our knowledge about the correlations between peers and items. Thus, we have
D = (n_1', n_2', ..., n_N')' = (t_1, t_2, ..., t_M),  n_i ∈ {0,1}^M,  t_j ∈ {0,1}^N,   (*)
where D's rows and columns are the peers' and items' state vectors, respectively: n_i represents the correlations between the i-th peer and all items, and t_j represents the correlations between the j-th item and all peers. Our goal is to characterize peers and items, telling whether two peers have the same interests, whether two items are essentially similar, or whether a peer might "need" an item. A simple way is to use the state vectors in D, in which the row vectors {n_i} characterize each peer and the column vectors {t_j} characterize each item. Peers containing more of the same items have more common interests, and items belonging to more of the same peers are more similar in meaning. Thus, we can use the distance between state vectors (Euclidean or cosine distance) to define the distance between peers or items. The possession-based searching guidance in [15] is an example of utilizing a peer's state vector as a latent semantics characterization. However, this characterization is very limited, since vectors in the sparse matrix D do not properly characterize "latent" structures and actual correlations. For a peer, there are many items that suit its interests (having latent correlations) but are not actually contained by the peer. The sparsity of D and the incompleteness of the existing pairs limit the performance of state vectors. In addition, based on state vectors we cannot directly measure the potential association between a peer and an item, which is quite important in searching. Conventional latent semantics analysis performs singular value decomposition (SVD) [1, 3, 16] on the matrix D in order to obtain the real correlations. SVD, like eigenvector decomposition, finds the K "singular vectors" corresponding to the K largest singular values, which dominate most of the power of the original matrix. Peers and items can then be characterized by linear combinations of singular vectors, i.e., by a K-dimensional point in the feature space spanned by the K singular vectors. Previous SVD-based research shows that small dimensions are enough to express latent semantics (i.e., K << M, N).
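For concreteness, the conventional centralized step can be sketched as follows (this is our own illustration; the toy matrix and K are invented, and, as argued next, the paper replaces SVD with a probabilistic model precisely because this computation does not decentralize well):

import numpy as np

# Toy peer-item matrix D (4 peers x 5 items); 1 means the peer holds the item.
D = np.array([[1, 1, 0, 0, 0],
              [1, 1, 1, 0, 0],
              [0, 0, 0, 1, 1],
              [0, 0, 1, 1, 1]], dtype=float)

K = 2                                   # number of latent dimensions, K << M, N
U, s, Vt = np.linalg.svd(D, full_matrices=False)
peer_coords = U[:, :K] * s[:K]          # one K-dimensional point per peer
item_coords = Vt[:K, :].T * s[:K]       # one K-dimensional point per item

# Peers 0 and 1 share items, so their coordinates end up close:
print(np.linalg.norm(peer_coords[0] - peer_coords[1]))
print(np.linalg.norm(peer_coords[0] - peer_coords[2]))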
Unfortunately, performing SVD involves very high computational costs. In a large-scale P2P network, centralized SVD needs very strong computation power, which a single peer cannot provide, while parallel SVD needs much communication; so it is still impracticable to use SVD-based latent semantics analysis directly in P2P networks.
3.2 Dual Probability Distribution Model and LSI
Our latent semantics definition in P2P networks derives from common latent semantics analysis. We suppose that there are inherently K internal orthogonal classes or properties of data. Each peer enjoys one or more classes of data as its "interests" [14]. Also, each item has one or more classes or properties, meaning that it can satisfy some kinds of peers' interests. Thus, we use K-dimensional vectors in a "semantics space" to characterize both peers and items. The K axes of the semantics space correspond to the above K classes, respectively, indicating the K most important essential genres or characteristics. These K-dimensional vectors are used as the peers' and items' latent semantic indexing (LSI). To define our LSI, we first introduce a probabilistic model. Suppose we have K essential classes or interests denoted by C_1, C_2, ..., C_K, and each query belongs to exactly one class. Each peer P is associated with a K-dimensional LSI vector n^P = (n_1^P, n_2^P, ..., n_K^P), indicating the probability distribution of P's interests and preferences. The i-th value of P's LSI vector is the probability that P's queries belong to the i-th class C_i (querying data of C_i). So, we have the following equation for each P:
n_1^P + n_2^P + ... + n_K^P = 1.   (1)
The definition of items' LSI is a bit more complicated. First, we define an item T's querying frequency vector F^T = (F_1^T, F_2^T, ..., F_K^T), indicating the frequency with which T is queried in each essential class: of all queries in the system that belong to C_i, F_i^T of them are satisfied by item T. Thus, frequency vectors indicate both an item's importance and its semantics. The more important (or popular) item T is, the bigger the corresponding frequency vector should be. Also, the classes that essentially represent item T's semantics should have bigger corresponding frequency values. Now we define item T's LSI vector f^T = (f_1^T, f_2^T, ..., f_K^T) as the normalization of the frequency vector, i.e.,
(f_1^T, f_2^T, ..., f_K^T) = (F_1^T, F_2^T, ..., F_K^T) / g^T,  where g^T = Σ_{i=1}^{K} F_i^T and Σ_{i=1}^{K} f_i^T = 1,   (2)
where g^T is the sum of the item's frequency values over all classes, indicating the item's importance. Here we suppose our orthogonal classes are balanced and have uniform importance, so that queries are uniformly distributed over the K classes. Thus, all classes have the same average number of queries in a given duration, and in Eq. (2) the value g^T is the frequency with which the item is queried (i.e., satisfies a query) over all classes. So, item T's LSI vector denotes the probability distribution of its correlated queries over all classes, where f_i^T represents the probability that a query for T belongs to class C_i. What we want is to find out and predict how much a peer "needs" an item (in other words, how much an item satisfies a peer's queries). Based on our LSI, these correlations can be easily calculated. Consider a peer P with LSI vector n^P and an item T with querying frequency vector F^T, LSI vector f^T, and importance value g^T. Suppose P sends out m queries in some duration. From our definitions, we know that on average m·n_i^P of these queries belong to class C_i. Among these queries of class C_i, F_i^T of them will be satisfied by item T, i.e., m·n_i^P·F_i^T queries on average. So, by summing over all classes, the total number of times that P queries for T is
Σ_{i=1}^{K} m·n_i^P·F_i^T = m·(n^P, F^T) = m·g^T·(n^P, f^T),   (**)
and so the probability that T satisfies one of P's queries is g^T·(n^P, f^T), where the parentheses denote the inner product of vectors. We can also calculate the similarity between peers and between items. The interests of peers are characterized by their LSI vectors, so we define the difference of interests between two peers as the distance between their LSI vectors in the feature space spanned by the K orthogonal classes. Also, the difference of semantics between two items is defined as the distance between the items' LSI vectors, because each component of an item's normalized LSI vector is the degree of association between the item and the corresponding class. Because in our hypothesis all K classes are orthogonal (i.e., they have no semantic correlations between them), we can rationally use the Euclidean distance as the distance between vectors in the K-dimensional feature space. Thus, the similarity of peers' interests and of items' semantics is also well defined, as a function inversely proportional to the corresponding difference.
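As a concrete check of these definitions (a small sketch of ours; the vectors below are made up), the expected peer-item affinity and the peer-peer difference reduce to an inner product and a Euclidean norm:

import numpy as np

K = 4
n_P = np.array([0.5, 0.3, 0.2, 0.0])       # peer LSI vector, sums to 1
F_T = np.array([0.02, 0.10, 0.01, 0.01])   # item querying frequency vector
g_T = F_T.sum()                            # item importance, Eq. (2)
f_T = F_T / g_T                            # normalized item LSI vector

# Probability that item T satisfies one of P's queries: g^T * (n^P, f^T)
p_match = g_T * np.dot(n_P, f_T)

# Difference of interests between two peers: Euclidean distance of LSI vectors
n_Q = np.array([0.1, 0.1, 0.4, 0.4])
interest_diff = np.linalg.norm(n_P - n_Q)

print(p_match, interest_diff)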
4 Estimating LSI Vectors
In this section we propose the algorithm for estimating all these LSI vectors. Since the abstract classes in the LSI definition are implicit, we do not know exactly which class each query belongs to and cannot directly calculate the LSI vectors from the definition. All we have is knowledge of the existing associations between peers and items, i.e., all the existing (peer, item) pairs in which the peer contains the item. So, the only way is to learn from the knowledge of existing associations.
4.1 Models for Parametric Estimation
Consider a peer P with unknown LSI vector n^P that contains l items {T_1, T_2, ..., T_l}. If we knew the LSI vectors of all of these items, then P's interests could be deduced. Suppose P's item T_i has LSI vector f^i (here we simplify our notation for clarity), meaning that T_i is found by queries of the K classes that follow a distribution equal to f^i. So the queries that P sent out for T_i can be regarded as a random sample from all the queries that T_i satisfies, and they also follow the distribution f^i over all classes. For a given peer with relatively fixed neighbors, network topology, and user, it is natural to suppose that P puts the same effort into, and sends nearly the same number of queries for, searching for each T_i. We thus conclude that all queries P sent out for T_1, T_2, ..., T_l have the following distribution with respect to the class of the query:
Pr{P's query ∈ C_i} = (f_i^1 + f_i^2 + ... + f_i^l) / (|f^1| + |f^2| + ... + |f^l|) = (1/l)·Σ_{j=1}^{l} f_i^j,   (3)
where |·| represents the norm of a vector, defined as the sum of all its components. We leave P's queries for unfound items out of account, by supposing that P's items are sufficient to characterize P's interests in data. Thus, from our definition we have the following LSI vector of P:
n^P = (n_i^P)_{i=1}^{K} = \left( \frac{1}{l}\sum_{j=1}^{l} f_i^j \right)_{i=1}^{K} = \overline{(f^1, f^2, \ldots, f^l)}    (4)
where the last term of Eq. (4) denotes the mean of the l vectors. Equally, we can estimate an item's LSI vector from the known LSI vectors of the corresponding peers. For item T, suppose T is contained by peers {P1, P2, ..., Pm} and we know each Pi's LSI vector n^i. We estimate T's LSI vector f^T by the following maximum a posteriori (MAP) algorithm, using the Bayesian paradigm. According to MAP, the value of f^T should be the one that maximizes the likelihood of the observed event that {P1, P2, ..., Pm} all contain item T. That is,

f^T = \arg\max_{f^T} \Pr(T \in P_1 \cap P_2 \cap \cdots \cap P_m \mid n^1, n^2, \ldots, n^m)    (5)
From Section 3, the probability that Pi contains T under known f^T and g^T is g^T \cdot (n^i, f^T). So, from the Bayesian paradigm we have:

\Pr(T \in P_1 \cap \cdots \cap P_m \mid n^1, \ldots, n^m) = \Pr(f^T) \cdot \Pr(T \in P_1 \cap \cdots \cap P_m \mid n^1, \ldots, n^m, f^T)
= \Pr(f^T) \cdot \prod_{j=1}^{m} \Pr(T \in P_j \mid n^1, \ldots, n^m, f^T) = \Pr(f^T) \cdot \prod_{j=1}^{m} g^T \cdot (f^T, n^j)    (6)

where \Pr(f^T) is the a priori probability of T's LSI vector value. From Eq. (6) and (5), we have:

f^T = \arg\max_{f^T} \Pr(f^T) \cdot \prod_{j=1}^{m} g^T \cdot (f^T, n^j) = \arg\max_{f^T} \left[ \sum_{j=1}^{m} \log(f^T, n^j) + \log\Pr(f^T) \right] + \text{const}    (7)
where in the last equation of (7) we drop the term g^T because it is independent of the desired variable f^T. \Pr(f^T) can be chosen according to the practical case, expressing our prior view of items' semantics and their plausibility. For example, we can constrain the number of an item's correlated semantics by choosing low a priori probability values for f^T vectors that cover many classes. In our experiments we set all \Pr(f^T) to the same value for simplicity, meaning that we have no bias towards any special distribution of items' semantics. Now we have Eq. (4) and (7), from which a peer's LSI vector can be estimated from the known LSI values of its correlated items, and vice versa. This leads us to an iterative algorithm that gradually approaches the peers' and items' LSI values by alternately estimating peers' LSI and items' LSI from their counterparts in (peer, item) pairs. We develop such an algorithm and adapt it to decentralized P2P networks.
4.2 Iterative Algorithm
Since both the peers' and the items' values are unknown, we use an iterative algorithm that alternately estimates the peers' values and the items' values while fixing the other, as follows:
• Initialization: set all LSI vectors to random values, while holding the normalization conditions in (1) and (2).
• nth iteration: estimate peers' LSI vectors under fixed items' LSI values:
  n^P = \overline{(f^1, f^2, \ldots, f^l)}    (a.1) for each peer P with correlated items
• (n+1)th iteration: estimate items' LSI vectors under fixed peers' LSI values:
  f^T = \arg\max_{f^T} \left[ \sum_{j=1}^{m} \log(f^T, n^j) + \log\Pr(f^T) \right] + \text{const}    (a.2) for each item T with correlated peers
• if convergence is reached, then stop the iteration
By using this iterative algorithm, we expect that the rationality of the estimated values accumulates over the iterations, while the estimates gradually approach appropriate values of the LSI vectors. Our iterative algorithm can also be regarded as an application of the Expectation-Maximization (EM) method, which is very efficient and widely used for many parametric estimation problems. (A concrete sketch of the two local update steps (a.1) and (a.2) is given at the end of Sect. 4.3.1.)
4.3 Adapting the Iterative Algorithm to P2P Networks
The remaining problem is to perform the above algorithm in decentralized P2P networks. Note that we need not update the peers' and items' LSI estimates in synchronized steps, but can perform the algorithm in an asynchronous manner. The LSI of peers and items can be re-estimated and updated in an arbitrary sequence, and in most cases an arbitrary updating sequence will not harm the convergence of the algorithm. So, we need not carry out any special synchronization for controlling the iteration steps. In P2P networks, each peer stores and maintains its own LSI vector and the LSI vectors of all its items. Since we use vectors of small dimension, storing these vectors in peers takes only a negligible amount of storage. Thus step (a.1) of the above algorithm can simply be carried out in each peer. Note that step (a.2) needs to know the LSI vectors of all peers that contain a certain item, and that we store a copy of an item's LSI vector in each of its correlated peers. Therefore, we need to communicate between these peers to perform step (a.2). For this purpose we employ additional links, namely "co-sharing links", in peers.
4.3.1 Co-sharing Links
Co-sharing links connect peers that share the same item. In our approach, each peer maintains one or several co-sharing links for each of its items, which point to other peers that also contain the related item. Co-sharing links can easily be built at the time the related items are downloaded to the peer. Maintaining co-sharing links has many advantages for P2P networks. For example, with co-sharing links we can easily implement parallel downloading. Besides, it is efficient to search for items via a peer's co-sharing links due to data correlations [14, 15]. Here we use co-sharing links mainly for performing step (a.2). In Section 6 we also utilize these links for searching items, in order to make a comparison with other approaches without LSI (e.g. the approach in [15]).
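The two local update steps can be made concrete with a minimal sketch (Python/NumPy). This is our own illustrative code, not the authors' implementation: step (a.1) is the mean of Eq. (4), and for step (a.2) we use a simple EM-style multiplicative update that maximizes Eq. (7) under the uniform prior Pr(f^T) used in the paper's experiments; any other maximizer of Eq. (7) could be substituted.

    import numpy as np

    def update_peer_lsi(item_vectors):
        # Step (a.1): the peer's LSI vector is the mean of its items' LSI vectors.
        return np.mean(np.asarray(item_vectors, dtype=float), axis=0)

    def update_item_lsi(f_init, peer_vectors, iters=50):
        # Step (a.2): maximize sum_j log((f, n^j)) over the probability simplex,
        # assuming a uniform prior Pr(f^T).  Each multiplicative step cannot
        # decrease the objective (standard EM update for mixture weights).
        N = np.asarray(peer_vectors, dtype=float)   # shape (m, K), one row per peer
        f = np.asarray(f_init, dtype=float)
        for _ in range(iters):
            w = N.dot(f) + 1e-12                    # inner products (f, n^j)
            f = f * (N / w[:, None]).mean(axis=0)   # multiplicative update
            f = f / f.sum()                         # guard against numerical drift
        return f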
4.3.2 Estimating and Updating an Item's LSI Vector via Co-sharing Links
In each peer, each item periodically estimates and updates its LSI vector. We employ a "collect-and-spread" method for this estimation and updating. When it is item T's turn to perform step (a.2) in peer P, P sends a specific "re-estimation" message to other peers that also contain T, via T's co-sharing links in P. The re-estimation messages are forwarded by the receiving peers to new peers within the overlay formed by the co-sharing links of item T (i.e. within all peers that contain T), until some fixed number of hops is reached. When forwarding this message, peers add their current LSI values to it, so that the peers' LSI values are collected. When a message reaches its hop boundary and gets to the last peer of its trip, it is returned to the original peer P with the LSI values collected from the peers encountered along the way. Thus, by carrying out a small message spreading, P obtains the current LSI values of some of T's containers around P. After that, a new value for T's LSI vector is calculated in P by performing (a.2) on the collected vectors. Then P updates T's LSI value locally with the new value and spreads it to its nearby peers within T's overlay, via exactly the same message-forwarding procedure it used to spread the re-estimation messages a moment earlier. Thus, T's new LSI value is updated in P and in its nearby overlay peers. "Single" items with no co-sharing links simply skip the estimation step until someone downloads them and their first co-sharing link is created. Before the first estimation, the "raw" value of an item's LSI vector does not take part in the estimation of the peer's LSI vector.
4.3.3 Our Decentralized Iterative Algorithm
In a peer, whenever any of the contained items has changed its LSI value, the peer's LSI value is immediately updated locally following (a.1) of the algorithm, for better convergence. By periodically estimating each item's LSI vector in each peer and updating the item's and the peer's LSI values, we obtain our decentralized iterative algorithm for LSI calculation, which is adapted to the P2P environment. In practice, we limit the message spreading of Section 4.3.2 to a very small scope of no more than 3 hops or no more than 10 peers in total. Thus a single estimation step for an item generates only very slight communication costs. Also, the total computation is finely distributed over all peers, so that each peer has a very low computational cost.
4.3.4 Insensitivity to Failure of Co-sharing Links
Co-sharing links can be created when items are downloaded, similar to [14]. Although co-sharing links are important, our approach remains highly insensitive to neighbors going offline. When a peer has no online neighbors for an item, it can simply leave the item alone with the last estimated value of its LSI vector. Due to our algorithm, an item without living co-sharing links may still change its LSI value, because other peers might have co-sharing links pointing to it. And if we use more than one co-sharing link (if possible) for each replica of an item, the abundance of interconnections in the item's overlay and our small-scope flooding-based communication will guarantee the convergence of the items' LSI values. We can also employ additional low-cost mechanisms to maintain co-sharing links, e.g. occasionally exchanging co-sharing links between two copies of an item in two different peers while they are communicating via re-estimation messages.
Using this "co-sharing link shuffling" mechanism, we can both replace unreachable links with living ones and change the combinations and orders used for LSI estimation, so that we may achieve better convergence.
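The collect-and-spread procedure of Sect. 4.3.2 can be summarized as follows. Everything below is a simplified stand-in: the peer object with lsi_vector, item_lsi and cosharing_links attributes is a hypothetical data model, the hop-limited graph walk replaces the asynchronous message forwarding of a real P2P client, and update_peer_lsi / update_item_lsi are the sketch functions given after Sect. 4.3.1.

    MAX_HOPS = 3   # spreading scope used by the paper (at most 3 hops / about 10 peers)

    def overlay_peers(peer, item, hops=MAX_HOPS, seen=None):
        # Collect phase: walk the item's co-sharing overlay up to the hop limit.
        if seen is None:
            seen = set()
        seen.add(peer)
        peers = [peer]
        if hops > 0:
            for neighbour in peer.cosharing_links.get(item, []):
                if neighbour not in seen:
                    peers += overlay_peers(neighbour, item, hops - 1, seen)
        return peers

    def reestimate_item(peer, item):
        # "Collect-and-spread" re-estimation of item's LSI vector at peer P.
        nearby = overlay_peers(peer, item)
        collected = [p.lsi_vector for p in nearby]
        new_f = update_item_lsi(peer.item_lsi[item], collected)   # local step (a.2)
        for p in nearby:                                          # spread phase
            p.item_lsi[item] = new_f
        peer.lsi_vector = update_peer_lsi(list(peer.item_lsi.values()))  # step (a.1)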
5
Simulation Results for LSI Estimation
For the above LSI estimation algorithm, there are two aspects to be evaluated. First, we need to evaluate the efficiency of the algorithm: how long does it take to estimate all LSI vectors, and how much does it cost to perform the algorithm? Second, we must evaluate the estimated LSI values themselves: we need to know how reasonable these vectors are. In this section we describe our experiments and present simulation results aimed at the first aspect, i.e. the algorithm's efficiency. The second aspect, the quality of the estimated values, is evaluated in Section 7.
5.1 Simulation Methodology
We use simulations to evaluate the approach proposed in this paper. Before simulation, we must first prepare appropriate simulated data, including the item indices shared by peers and the structure of the connectivity graph on top of the peers.
5.1.1 Data of Peers and Items
Experimental evaluation requires the indices of a large number of peers, but unfortunately such data are not publicly available. Current decentralized P2P systems can hardly collect the indices of their peers due to their limitations in locating content, while the data of centralized systems (such as Napster) are not publicly accessible. Therefore, as most other research does [14, 15], we use trace-based data, which have a similar structure to peer-item indices, for our simulation. We choose four proxy traces obtained from Boeing [4] to generate our simulation data. These proxy traces capture requests for web sites from end users, which we consider to be similar to requests in Internet-based file-sharing systems, e.g. P2P systems. From each trace, we extract the users and the requested hostnames as the peers and contained items of the simulation data, respectively. The Boeing data [4] consists of one-day traces from five of Boeing's firewall proxies in March 1999. We use four one-day traces captured from March 1 to 4, and extract data of nearly 50K peers and 45K items. These traces are also used by [15], so we can compare with some of their results.
5.1.2 Connectivity Networks
We also need to construct connectivity networks on top of these peers to simulate the underlying overlay and our co-sharing links. To simulate a connected peer network (such as Gnutella), we use randomly chosen peers as a peer's neighbors. Each peer has 5 neighbors on average, and the number of neighbors per peer follows a normal distribution. This overlay connectivity is used for searching items and for evaluating the performance of our approach in Section 7. Furthermore, we need to construct co-sharing links for each item in each peer. By the design of our approach, peers build co-sharing links for items when the items are being downloaded. So, we construct simulated co-sharing links based on the downloading of items. Considering the Web traces, we treat the requests in a Web trace as queries for items in a P2P system. For every item T, the first peer P0 in the trace that queries T is regarded as T's originator, and so P0 shares T. Any subsequent peer Pi that also queries T in the trace must "download" T from the peers that already have T. So Pi randomly chooses one of these peers to download from, and then builds a co-sharing link for T pointing to the chosen peer. After that, Pi also shares T for other peers' downloading. In this way we construct one co-sharing link for each item in each peer. For more co-sharing links, we simply exchange co-sharing links between the linked peers. This construction of co-sharing links closely mirrors what happens in the real world.
5.2 Simulation Results for LSI Estimation
We now simulate our decentralized algorithm presented in Section 4. Peers employ two or three co-sharing links for each contained item. We use 15-dimensional LSI vectors for all web traces, and all vectors are initialized to random values. When the simulation begins, each peer periodically estimates the LSI value of its items, one at a time. The spreading of a "re-estimation" message is limited to 3 hops and thus involves about 10 other peers on average. In order to simulate the asynchronous P2P environment, the interval at which peers estimate their items is not fixed but follows an exponential distribution. To evaluate the efficiency of the algorithm, we consider the number of iterations needed to reach convergence. Since the algorithm is performed asynchronously, we use the average iteration number over all peers as our evaluation metric. If a peer has finished one cycle of estimating all of its items' LSI values, we say that the peer has finished one of its iterations. Thus we obtain the number of iterations in each peer, and the average iteration number is defined as the mean over all peers' iteration numbers. The simulation results show that our algorithm for calculating LSI vectors is very efficient and reaches convergence very quickly. In all traces, it takes no more than 30 iterations on average to reach complete convergence and obtain all LSI values. Indeed, after 15 iterations most LSI vectors have reached their final values. Fig. 1(a) to Fig. 1(d) show the estimated LSI values of peers after 1, 5, 15 and 30 iterations on the Boeing trace of March 1st, where each value is represented by a 2D point whose coordinates are the value's first two dimensions. It is obvious that the algorithm successfully makes the LSI vectors cluster in the feature space. Due to the quick convergence, performing the algorithm incurs only a small system cost. Considering that each peer has about 10 items on average (as in the Boeing traces), each peer performs 10 LSI estimations in one average iteration. If each peer estimates one of its items every 6 minutes, it takes 1 hour to complete one iteration. So, after 15 hours most LSI values are obtained. During this estimation time, each peer originates a re-estimation message every 6 minutes. Assuming a re-estimation spreads to about 10 peers on average, each peer receives about 3.3 messages per minute during the estimation time (there are two rounds of message spreading in each LSI estimation). This is a very small communication cost for most decentralized P2P applications.
Fig. 1. The convergence of LSI estimation
6
Using LSI to Improve Searching Efficiency
Our approach is a framework for utilizing latent semantics in P2P networks and can in principle be combined with most searching strategies. We use LSI vectors to guide searching and perform selective forwarding, in which neighbor peers are ranked and search outcomes are prejudged before a query is actually forwarded to neighbors. One simple way to rank neighbor peers is to use peer similarity. The similarity of two peers is defined as the cosine of the angle between their LSI vectors, and this degree of similarity is used to rank peers. When forwarding a query, a peer can always choose one or several of its neighbor peers with the largest similarity to the original peer (the peer originating the query). Different searching strategies may use a variety of traversal approaches, and our LSI approach can in principle provide guidance for all of them. We can also use more complicated and accurate LSI-based searching guidance, e.g. LSI-based routing indices as in [13]. Due to space limitations, we omit these discussions.
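A minimal sketch of this selective forwarding (Python/NumPy; the peer objects, their lsi_vector attribute and the fanout parameter are our own illustrative assumptions): neighbours are ranked by the cosine similarity between their LSI vectors and the LSI vector of the query's originating peer, and the query is forwarded to the top-ranked ones.

    import numpy as np

    def cosine(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

    def choose_next_hops(neighbours, originator_lsi, fanout=1):
        # Rank neighbour peers by similarity to the query originator, pick the best.
        ranked = sorted(neighbours,
                        key=lambda p: cosine(p.lsi_vector, originator_lsi),
                        reverse=True)
        return ranked[:fanout]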
7
Evaluation of Searching Improvement
In this section we answer the second question raised in Section 5: whether our LSI vectors are valuable. We evaluate the reasonableness of the LSI vectors by investigating their performance in guiding search, and by measuring the improvement in searching efficiency. To evaluate searching efficiency, we use the Expected Search Size (ESS) as our metric [15], which indicates the average number of peers a query traverses to find the desired item. We compare the ESS of plain searching strategies with that of LSI-guided searching strategies. We employ two kinds of plain searching algorithms, as follows:
• Random Searching: each peer randomly chooses one neighbor peer to which it forwards the query. This strategy represents most random searching strategies.
• Co-sharing Searching: each peer randomly chooses one of its co-sharing links to forward the query along. This strategy represents the base strategies in [14] and [15], which primitively utilize the semantic similarity of peers linked by co-sharing links.
For comparison, we add LSI-based neighbor peer selection to both of the above strategies as searching guidance. We use only the simplest neighbor ranking from Section 6 (i.e. the LSI-based peer similarity) and a greedy neighbor choice:
• LSI Searching: each peer forwards the query to its "best" neighbor peer, i.e. the one with the largest similarity to the originator of the query.
• LSI Co-sharing Searching: each peer forwards the query along its "best" co-sharing link, i.e. the one that links to the peer with the largest similarity to the originator.
For the queries, we first take a random sample of 5K items in each Boeing trace. For each picked item we randomly choose 100 peers from the request list of that item in the trace. If there are not enough peers in the request list of the item, we randomly sample other peers to complement them. We set the maximum search size to 10^4; if a query has no result after traversing 10^4 peers, we set its search size to 10^4. Fig. 2 to Fig. 4 show the performance evaluation of Random Searching and LSI Searching. Fig. 5 to Fig. 7 show the performance evaluation of Co-sharing Searching and LSI Co-sharing Searching. These results illustrate the efficiency of using our LSI vectors to guide searching. The figures show the cumulative fraction of queries whose simulated ESS lies below a certain threshold. They also show the searching performance for items of different levels of popularity. From the results of both random searching and searching on top of co-sharing links, we see that employing LSI-based ranking and guidance greatly improves performance. In all simulations (random and co-sharing-based), introducing LSI improves searching efficiency by a factor of 2~3. Among these figures, the left three results (random) are simulated on Boeing-1 while the right three are on Boeing-3; other results are omitted because they are similar. In addition, comparing Co-sharing Searching with Random Searching, we see that co-sharing links improve searching efficiency, as reported in [14] and [15]. Co-sharing Searching can be seen as random searching on top of co-sharing links. Our LSI-based searching still outperforms Co-sharing Searching, indicating that our LSI vectors capture real correlations between peers and items, and that the latent semantics can greatly improve searching efficiency.
8
Discussion and Conclusions
This similarity-based searching guidance can be seen as an advancement of the recent approaches in [14, 15]. In their solutions, co-sharing links are used to indicate peers' interests and guide the search. Our LSI-based similarity gives a quantified measure of peers' interests, which can be applied to peers linked by co-sharing links as well as by arbitrary links. So, similarity-based searching is a universal approach for any network
Fig. 2. Search for item in less than 0.1% peers
Fig. 3. Search for item in less than 0.01% peers
Fig. 4. Search for item in less than 0.001% peers
Fig. 5. Search for item in less than 0.1% peers
Fig. 6. Search for item in less than 0.01% peers
Fig. 7. Search for item in less than 0.001% peers
topology. Furthermore, since queries are characterized by the LSI of their originating peers, they can use this information to keep their forwarding within the right group of peers, even after many hops. Without LSI, neither shortcuts nor co-sharing links can guarantee this property. In conclusion, in this paper we introduced latent semantic analysis to P2P networks and proposed our latent semantic indexing (LSI) method. We proposed a dual probability distribution model to define LSI vectors, and designed an efficient decentralized algorithm to estimate LSI values based on the maximum a posteriori (MAP) method and the Bayesian paradigm. We also demonstrated how LSI-based peer ranking can improve searching efficiency in decentralized P2P networks. The main contribution of this paper is our framework: based on it, one can easily express and utilize latent semantics in P2P networks and combine most searching strategies with the framework, so that searching efficiency can be greatly improved with LSI-based searching guidance.
References
[1] S. Deerwester, S. T. Dumais, G. W. Furnas. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 1990.
[2] S. Wermter. Neural Network Agent for Learning Semantic Text Classification. Journal of Information Retrieval, Vol. 3, No. 2, 2000.
[3] C. H. Q. Ding. A Similarity-based Probability Model for Latent Semantic Indexing. In Proceedings of ACM SIGIR, 1999.
[4] Web traces and logs, http://www.web-caching.com/traces-logs.html
[5] Kazaa, http://www.kazaa.com
[6] Napster, http://www.napster.com
[7] Gnutella, http://www.gnutella.com
[8] I. Stoica, R. Morris, D. Karger, M. F. Kaashoek, and H. Balakrishnan. Chord: A scalable peer-to-peer lookup service for Internet applications. In ACM SIGCOMM, Aug. 2001.
[9] S. Ratnasamy, P. Francis, M. Handley, R. Karp, and S. Shenker. A scalable content-addressable network. In ACM SIGCOMM, 2001.
[10] Freenet, http://freenet.sourceforge.com
[11] FastTrack, http://www.fasttrack.nu
[12] E. Cohen and S. Shenker. Replication strategies in unstructured peer-to-peer networks. In Proceedings of ACM SIGCOMM, 2002.
[13] A. Crespo et al. Routing Indices for Peer-to-Peer Systems. In Proceedings of ICDCS, 2002.
[14] K. Sripanidkulchai, B. Maggs, and H. Zhang. Efficient Content Location Using Interest-Based Locality in Peer-to-Peer Systems. In Proceedings of IEEE INFOCOM, 2003.
[15] E. Cohen, A. Fiat, and H. Kaplan. Associative Search in Peer-to-Peer Networks: Harnessing Latent Semantics. In Proceedings of IEEE INFOCOM, 2003.
[16] S. T. Dumais. Improving the retrieval of information from external sources. Behavior Research Methods, Instruments and Computers, 1991.
A Taxonomy for Resource Discovery
Koen Vanthournout, Geert Deconinck, and Ronnie Belmans
Katholieke Universiteit Leuven, Belgium, Department of Electrical Engineering (ESAT), [email protected], http://www.esat.kuleuven.ac.be
Abstract. Resource discovery systems become more and more important as distributed systems grow and as their pool of resources becomes more variable. As such, an increasing amount of networked systems provide a discovery service. This paper provides a taxonomy for resource discovery systems by defining their design aspects. This allows comparison of the designs of the deployed discovery services and is intended as an aid to system designers when selecting an appropriate mechanism. The surveyed systems are divided into four classes that are separately described. Finally, we identify a hiatus in the design space and point out genuinely distributed resource discovery systems that support dynamic and mobile resources and use attribute-based naming as a main direction for future research in this area.
1
Introduction
A large and growing number of modern computing systems are composed of a diverse multitude of networked components that cooperate to accomplish the application's targets. Examples are: peer-to-peer file sharing networks, GRID computing, the ambient environment [12], LAN plug-and-play, etc., or: Gnutella [13], CAN [24], Globus [8], CORBA [14], UPnP [7], Jini [3], etc.¹ The enabling technology for all these systems is resource sharing. But before resources can be shared, potential users must have the ability to locate them. This can be accomplished by either manual configuration or discovery of the resources at runtime (from here on called resource discovery (RD)). The larger the system grows, the more cumbersome manual configuration becomes. And if the resource pool is variable and not static, manual configuration is ruled out completely. Since the current trend is towards such larger, more complex and more variable systems, resource discovery becomes an ever more important service, worth special attention. Because of its indispensable nature, all current networked systems that contain some level of self-configuration already provide an RD service. For instance, a GRID client's computational task is automatically distributed to a varying pool
¹ A complete list of all systems that were surveyed for this paper can be found in Table 2 (legend in Table 3). Mind that this list is not exhaustive and only includes IP-based resource discovery systems.
C. Müller-Schloer et al. (Eds.): ARCS 2004, LNCS 2981, pp. 78–91, 2004. © Springer-Verlag Berlin Heidelberg 2004
of available processing units, in a plug-and-play environment an arriving laptop automatically finds the local printer, a file is located in a peer-to-peer network where nodes continuously enter and leave, etc. Although the resource discovery systems used in these examples pursue comparable tasks, the diversity of the applied strategies is as large as that of the different types of applications that require them. The aim of this paper is to provide a taxonomy for the different strategies that are used to tackle IP-based resource discovery. This is accomplished by defining the design space of RD systems. After clarifying the terminology used in this paper, the relevant design aspects and the values they take are discussed (Sect. 3). This is followed in Sect. 4 by an overview of the large clusters of existing RD systems and an indication of the hiatuses in the design space.
2
Terminology
To understand resource discovery, one must first clearly understand what is meant by 'resources'. The dictionary gives us the following definition: a source of supply, support, or aid, esp. one that can be readily drawn upon when needed. We discuss resources in the context of resource discovery as an enabling step for networked resource sharing. Thus, we can transform the above definition into:
Definition 1. A resource is any source of supply, support, or aid a component in a networked environment can readily draw upon when needed.
Examples are: files, measurements, CPU cycles, memory, printing, control devices, forums, online shops, etc. This definition requires further specification, since different systems support different types of resources. A classification of resources is presented in Sect. 3.7.
Definition 2. Resource discovery is the ability to locate resources that comply with a set of requirements given in a query.
Sect. 3 elaborates on the available mechanisms for resource discovery. The process of resource discovery involves three main actors: resource providers, resource users and the RD service itself.
– Definition 3. A resource provider is any networked entity that allows sharing of its resources.
– Definition 4. A resource user is any networked entity that uses shared resources.
– Definition 5. A resource discovery service is the service that returns the location of matching resources in response to a query with requirements.
Multiple types of actors can be embodied by one single entity in a system; e.g., a resource provider can at the same time be a resource user, or a discovery service can be implemented by the resource users and providers themselves, without third-party interference (see Sect. 3.1). It always remains possible, however, to logically divide a single entity into the actors it is composed of. The term node is used throughout this paper to indicate a network-enabled device. Such a device can contain any of the above actors.
3
Relevant Design Aspects
Resource discovery systems can be categorized by different design aspects. Each of these aspects represents a design choice for which several solutions are possible. None of these solutions can be labelled 'the best' solution. Depending on the envisioned target application and environment, different choices may prove optimal. Therefore, a classification into a design space will not allow an absolute ranking, but rather permits the identification and comparison of clusters of RD systems with equal targets. It is also an aid to the designers of new systems and helps in identifying hiatuses and potential new approaches. A list of the design aspects can be found in Table 1.

Table 1. Resource discovery design aspects
  service provider
  construction
  foreknowledge
  architecture
  registration
  discovery
  supported resources
  naming and queries
3.1
The Service Provider
A designer is presented with two options when selecting how to provide a RD service. First choice is to implement it as a third party service, i.e., as an entity, distinctly separate from the resource providers and users. This usually takes the form of a server or collection of servers that gather information on the available resources and use this information to respond to queries of users. The third party type of RD system is the most common one. It is the traditional well-established way to provide services and allows deterministic and centrally controllable systems. Examples of systems like these are: DNS [18], CORBA [14], Napster [21], etc. (see Table 2 for a full list and Table 3 for the legend of Table 2). It does occur
however, that there is no organization willing or allowed to take the responsibility of providing a centralized RD service. An infamous example is mp3 file sharing, but GRIDs with resources owned by various administrative organizations [17] and ambient intelligence applications [12] also fall into this category. The alternative to a third party service is a genuinely distributed system, i.e., the RD service is distributed across all involved resource providers and users, without any central or coordinating components. Examples of genuinely distributed systems are: Gnutella [13], Freenet [5], Tapestry [30], CAN [24], etc. Note, though, that third party systems can also be distributed, although these systems act as a single entity to the outside world; DNS, LDAP [16] and CORBA are only a few examples. A final note must be made on systems like Salutation [6] and Jini [3], since these are systems that support both models: in the presence of an RD server, this server provides a third party service, but in the absence of such a server, the system automatically switches to genuinely distributed operation (more on this subject in Sect. 3.2).
3.2
Construction
A networked distributed system constructs an overlay network on top of the actual communication layer. This overlay network is a directed graph: vertices represent network nodes and an edge represents the knowledge of a node about another node. Two alternatives exist for the construction of such an overlay network: they can be manually configured or they can grow by means of self-organization.
– Manually Configured Networks: Third party systems that extend beyond the LAN level are usually manually configured, i.e., a human administrator is responsible for providing each server with the location and function of the other components, and users and providers must know the address of a server. Or, put differently, a human administrator must design the overlay network graph: the different components must be informed of the edges that leave them. While manual configuration allows more control and deterministic behavior, it is cumbersome to scale. The large organizational and human resources required to maintain the DNS system can serve as an example.
– Self-organizing Networks: Self-organizing systems become interesting when the size of the network is large, when human configuration is too expensive or when no central authority is available to coordinate configuration. Drawbacks are the increased network traffic required for the self-organization and maintenance of the overlay net, and the complexity of the algorithms. Examples are Gnutella, Freenet, CAN, etc.
– Hybrids: Hybrid solutions are also possible. This can be a means to add redundancy to a system and to increase its robustness. Examples are Salutation and Jini, where the absence of a (manually configured) RD server does not imply system failure, since the resource providers and users will automatically self-organize into a complete graph (see Sect. 3.4), though this
increases the network load significantly. Another class of hybrid systems consists of those that use multicast or broadcast to self-organize at LAN level and manual configuration to interconnect these self-organized subgraphs at WAN level (e.g., Ninja SSDS [9] and UniFrame [26]).
3.3
Foreknowledge
Most systems, even self-organizing ones, require some configuration per node prior to its entrance into the system. The information contained in this node-specific configuration is referred to as foreknowledge and is closely related to the method of overlay network construction:
– Nodes in manually configured systems require a list of all components they need to interact with. The foreknowledge consists of a list of well-known addresses.
– Self-organizing systems that operate at WAN level usually require at least the address of one random active node. Using the knowledge of the overlay network edges leaving this node, the new node can find its location. Many additionally require a unique identifier, though this can often be avoided by using a hash function on the node's network address.
– Self-organizing systems that employ multicast or broadcast usually need no foreknowledge, but are limited to operation at LAN level.
3.4
Architecture
The nodes of a distributed RD system organize into an overlay network, as mentioned before. The architecture of these overlay structures can be visualized by graphs and thus can be categorized by using graph theory [11,1] (see Fig. 1 for graphical representations):
Fig. 1. Examples of the different overlay network architectures (from left to right): A trivial graph, a tree graph, a ring-shaped regular graph, a 2-dimensional Cartesian regular graph, a random graph and a complete graph.
– Trivial Graph: A graph of order 0 or 1 (0 or 1 vertices) is a trivial graph. Order 0 graphs are logically of no interest to us. Trivial graphs of order 1, on the contrary, are the purely centralized systems. The RD system consists of a single server, registering all resources and answering all queries. These systems may have scaling difficulties and are prone to single points of failure. Mind that trivial graph servers may still be internally duplicated; trivial means here that the server has a single point of access.
– Tree Graph: Systems organized as trees scale well and allow for backtracking as a search algorithm, but high-level node failures affect large portions of the system. They are especially suited for directory systems, e.g., DNS, LDAP, etc.
– Regular Graph: If nodes are structured in an ordered lattice and if adjacent nodes are interconnected, the graph is regular. (See [1] for how to measure regularity by using clustering coefficients.) Self-organizing networks that grow into a regular structure allow for optimized search algorithms² (e.g., Plaxton routing [22], used by Tapestry).
– Random Graph: Systems exhibiting random graph behavior include the manually configured systems, where administrators can interconnect nodes as they please, e.g., CORBA, and self-organizing networks where the number and nature of links is not predefined, e.g., Gnutella and Freenet. It should be noted, though, that many networks that seem completely random at first sight do have an emergent structure: they are scale-free networks [1]. The distribution of the number of links per node of scale-free networks follows a power law. Both Gnutella and Freenet exhibit this behavior. The main consequences are an increased resilience to random failures and shorter path lengths, but also a larger susceptibility to directed attacks.
– Complete Graph: In this case all nodes know each other. Multicast and broadcast systems without a central server (e.g., Jini) and replicated-server RD systems (e.g., NetSolve [4]) are examples.
3.5
Resource Registration
Before requests for resources can be made, a reference to them must be stored in a place where these references can be predictably accessed. The different techniques used are:
– Local registration only: Only the resource providers are aware of the resources they share. This implies that a resource can only be located by immediate interrogation of the provider. Clearly, this is an inefficient method. If only replicable resources are offered (see Sect. 3.7), improvements can be made by replication and caching (e.g., Freenet).
– References: References to the location of a resource are stored in a predictable place in a regular graph structure. Typically this is realized by using hash identifiers for both nodes and resources. References are then stored at the node whose identifier matches the resource's identifier closest.
² Notable also is "small world" behavior [29,1]: if nodes in a highly regular structure maintain few links to far parts of the lattice, the average path lengths drop strongly, while the regularity is still preserved. Chord, for instance, uses this technique to optimize search.
– Registration at the local server: Resource providers announce their resources to the local server (in the case of trivial graphs, to the only server). This information may then be stored only locally, or locally and at replicated servers.
– Manual registration: Resources are registered at RD servers by means of configuration files. This is only useful for long-term stable resources, e.g., URL names stored at DNS servers.
3.6
Query Routing
If a resource user wants to access a resource and thus first needs to locate it, a query with the user's requirements is composed and sent to that user's local contact point to the RD system. The latter is the local server or, in the case of a genuinely distributed system, the RD module on the node itself. This query must then be routed through the graph structure towards the node that knows the location of a matching resource. The strategies are:
– Central server or local replicated server: In the case of a single local server or a replicated server infrastructure, routing is limited to the single query sent from resource users to the local server, where the information is readily available.
– Query forwarding: if a node is located in a graph and it fails to match a query, it must forward the query to one or more nodes, based on some metric or rule. For instance, Freenet uses hash identifiers to select the next node a query will be forwarded to (caches enlarge the probability of this being a successful strategy). The most popular query techniques are:
  • Flooding: Used by some systems that deploy a random graph structure, flooding burdens the communication network heavily and requires limited hop-distance settings and special precautions for loops. Yet flooding guarantees that the node with the requested knowledge will be reached by the shortest path. The best-known system with flooding is probably Gnutella, but the CORBA traders use it as well. Multicast or broadcast LAN-discovery systems that operate without a server can also be included in this category.
  • Back-tracking: The single biggest advantage of tree graph structures is that they allow back-tracking search algorithms.
  • Regular structure routing: The effort of implementing the complex algorithms that allow nodes to self-organize into a regular lattice has but one goal: predictable locations of resource references and a method to route towards those locations. A multitude of techniques are used for this purpose (see the footnotes to Table 2).
Almost all query forwarding strategies in use fall into one of the above-mentioned sub-categories. An overview of alternative query forwarding algorithms for RD systems can be found in [17].
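As a toy illustration of the reference-placement style of Sect. 3.5 and the regular-structure routing of Sect. 3.6 (a generic sketch under our own simplifying assumptions, not the actual algorithm of Chord, Pastry, Tapestry or Freenet): both nodes and resources are hashed into one identifier space, a reference is stored at the node whose identifier is closest to the resource's hash, and a query is forwarded greedily to the neighbour whose identifier is closest to that hash.

    import hashlib

    ID_SPACE = 2 ** 32   # illustrative identifier space

    def ident(name: str) -> int:
        # Hash a node address or resource name into the identifier space.
        return int(hashlib.sha1(name.encode()).hexdigest(), 16) % ID_SPACE

    def distance(a: int, b: int) -> int:
        # Circular distance between two identifiers.
        return min((a - b) % ID_SPACE, (b - a) % ID_SPACE)

    def home_node(resource_name: str, node_ids):
        # Reference placement: node whose identifier is closest to the resource hash.
        key = ident(resource_name)
        return min(node_ids, key=lambda n: distance(n, key))

    def next_hop(resource_name: str, neighbour_ids):
        # Greedy query forwarding: neighbour closest to the resource hash.
        key = ident(resource_name)
        return min(neighbour_ids, key=lambda n: distance(n, key))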
Fig. 2. The classes of resources
3.7
Supported Resources
As defined in Sect. 2, resources include everything that could be useful for an entity within a networked system that requires resource sharing and thus resource discovery. This includes both services and information. But services and information are not clearly delineated categories of resources, since they often overlap; e.g., a measurement device provides a service that delivers information. It is better to sort them into classes by rate of change (see Figure 2):
– Fixed resources: These are resources that have both a fixed location and fixed properties. Examples are network printers with fixed properties (color, resolution, etc.), weather stations with a static set of measurement devices, etc.
– Replicable resources: Basically, replicable resources are fixed resources combined with replicated files. In the case of replicated files, each file will have a single set of identifiers and the resource discovery service will return the location of one of the copies.
– Mobile resources: The class of mobile resources contains resources with a fixed location, replicable resources and resources with a variable location. Examples of the latter are laptops, wireless devices, etc.
– Dynamic resources: Dynamic resources are resources that have a fixed location, but whose identifiers can vary. Servers or PCs that offer their (idle) computational resources are an example: the current load and the available amount of CPU time, memory and disk space vary.
– Mobile & dynamic resources: This class embraces all resources delineated above.
Some of the surveyed systems target a specific set of resources, but could include other types with few or even no adaptations. Nevertheless, the values in Table 2 refer to the resources the system was specifically designed for and not to what could be included.
3.8
Resource Naming and Queries
References to resources are composed of two parts: the location and the name of the resource. Mind that a 'name' could be as large as an XML file. Queries for resources are matched to these names. Consequently, the naming mechanism used largely defines the types of resources that can be served by the RD system and the ease with which those resources can be found. Compare the cumbersome task of locating resources (e.g., web pages) by means of clear-text boolean queries to the deterministic alternative of assigning values to a selection of predefined attributes.
– Unique identifiers and hashes: If a resource has a unique name, it is easier to locate. And, as shown before, such an identifier can be used to define and retrieve a fixed node where a reference to the resource's location can be stored. Identifiers like these are usually obtained by using a hash function on the clear-text name of the resource. The main problem is that this system does not allow changing properties to be reflected in the naming system and thus excludes dynamic resources. Or at least, dynamic properties of resources are not reflected in the RD system.
– String naming: While hashes allow no search on specific terms in the resource name/description, clear-text naming does allow more complex queries: boolean expressions, 'sounds like' queries, etc.
– Directories: Directory naming systems build on scalable hierarchical name spaces, with DNS and LDAP as the most famous examples. Most RD systems with directory naming employ the DNS or LDAP syntax or a variant of one of these two.
– Attributes: The most powerful naming system is the use of attributes: resources are described by means of a number of predefined attributes that take a value. Attributes allow for extensive queries with predictable results, as opposed to the ad-hoc queries used with clear-text naming. Attributes can be used for the full range of resources, even for descriptions of dynamic resources. Note that the most commonly used attribute system is XML (used by UPnP, Ninja SSDS, UniFrame, UDDI, etc.).
4
The Main Classes of RD Systems
Table 2 gives an overview of the RD systems that were surveyed for this paper and lists the values they take for the several design aspects. The legend to Table 2 can be found in Table 3. Table 3 also serves as an overview of the design aspects and their values. This list of RD systems in Table 2 is not exhaustive, but contains the four classes presented here.
Table 2. Resource discovery systems and their properties (legend in Table 3), sorted by class (from top to bottom): P2P RD systems, multicast discovery RD systems, distributed third party RD systems and centralized RD systems

Name             | Prov | Constr | Forekn    | Arch       | Res Reg   | Routing    | Sup Res | Naming
Chord [10]       | GD   | SO     | ID+RN     | Reg (3)    | Ref@N     | Route      | Repl    | Hash
Tapestry [30]    | GD   | SO     | ID+RN     | Reg (4)    | Ref@N     | Route      | Repl    | Hash
Pastry [25]      | GD   | SO     | ID+RN     | Reg (5)    | Ref@N     | Route      | Repl    | Hash
CAN [24]         | GD   | SO     | RN        | Reg (6)    | Ref@N     | Route      | Repl    | Hash
Freenet [5]      | GD   | SO     | RN        | Rand       | Res@N     | QF         | Repl    | Hash
Gnutella [13]    | GD   | SO     | RN        | Rand       | None      | Flooding   | Repl    | String
Salutation [6]   | Both | SO     | None      | Compl      | Node/Serv | Flood/Serv | Mob     | Attr
Jini [3]         | Both | SO     | None      | Compl/Triv | Node/Serv | Flood/Serv | Mob     | Attr
UPnP [7]         | 3P   | SO     | None      | Triv       | Serv      | Serv       | M/D     | Attr
SLP [15]         | 3P   | SO     | None      | Triv       | Serv      | Serv       | Mob     | Dir
Ninja SSDS [9]   | 3P   | SO/Man | No/WKAddr | Tree       | Loc Serv  | BT         | Mob     | Attr
UniFrame [26]    | 3P   | SO/Man | No/WKAddr | Rand       | Loc Serv  | Flooding   | Fix     | Attr
NetSolve [4]     | 3P   | Man    | WKAddr    | Compl      | Loc Serv  | Repl Serv  | Dyn     | Attr
UDDI [27]        | 3P   | Man    | WKAddr    | Compl      | Loc Serv  | Repl Serv  | Fix     | Attr
RCDS [19]        | 3P   | Man    | WKAddr    | Compl      | Loc Serv  | Repl Serv  | Repl    | Dir
CORBA (7) [14]   | 3P   | Man    | WKAddr    | Rand       | Loc Serv  | Flooding   | Dyn     | Attr
Globe [28]       | 3P   | Man    | WKAddr    | Tree       | Loc Serv  | BT         | Mob     | Hash
DNS [18]         | 3P   | Man    | WKAddr    | Tree       | Man       | BT         | Fix     | Dir
LDAP [16]        | 3P   | Man    | WKAddr    | Tree       | Man       | BT         | Fix     | Dir
Matchmaking [23] | 3P   | Man    | WKAddr    | Triv       | Serv      | Serv       | Dyn     | Attr
Napster [21]     | 3P   | Man    | WKAddr    | Triv       | Serv      | Serv       | Repl    | String
SuperWeb [2]     | 3P   | Man    | WKAddr    | Triv       | Serv      | Serv       | Dyn     | Attr
Ninf [20]        | 3P   | Man    | WKAddr    | Triv       | Man       | Serv       | Dyn     | Attr
Globus (8) [8]   | 3P   | Man    | WKAddr    | Triv       | Man       | Serv       | Fix     | Attr
4.1
Centralized and Distributed Third Party RD Systems
Centralized RD systems are the systems that have a unique contact point for their users. They are manually configured systems with a trivial graph. The distinct advantage of centralized RD systems is that resource registrations, updates and queries require no (time-consuming) routing, which is reflected in the resources they support: this class represents the largest portion of the attribute-using systems that support dynamic resources. Immediate drawbacks are scalability and the single point of failure created by the trivial graph. The single point of failure can be overcome by replication and multiple contact points to the system. Scalability can be improved by spreading the resource references over several servers. Together, these systems make up the class of dis-
(3) The regular graph for Chord is built from a circular identifier space.
(4) The regular graph for Tapestry is built from a Plaxton Mesh [22].
(5) The regular graph for Pastry is built from a circular identifier space.
(6) The regular graph for CAN is built from a d-dimensional Cartesian identifier space.
(7) The CORBA object implementing its resource discovery is the Trading Object Service.
(8) The component of Globus that provides the RD service is the MDS (Metacomputing Directory Service). The MDS builds on LDAP and enhances it with attribute-based search.
Table 3. Legend to Table 2

Prov (Service Provider):
  GD = Genuinely Distributed
  3P = Third Party
  Both = 3P, if absent GD
Constr (Construction):
  SO = Self-Organizing
  Man = Manual configuration
  SO/Man = @LAN level: SO, @WAN level: Man
Forekn (Foreknowledge):
  RN = Random Node
  ID+RN = Unique Identifier and RN
  WKAddr = Well-known Addresses
  No/WKAddr = @LAN level: None, @WAN level: WKAddr
Arch (Architecture):
  Triv = Trivial graph
  Rand = Random graph
  Tree = Tree graph
  Reg = Regular graph
  Compl = Complete graph
Res Reg (Resource Registration):
  None = Local registration only
  Ref@N = Reference at node with closest ID
  Res@N = Resource cached at nodes with a closer ID
  Serv = Registration at unique server
  Loc Serv = Registration at local server
  Man = Manual configuration
Routing (Query Routing):
  Flooding = Flood query
  BT = Back-Tracking
  Route = Route to node with closest ID
  QF = Query forwarding
  Serv = Query central server
  Repl Serv = Query local replicated server
  Flood/Serv = if GD: Flooding, if 3P: Serv
Sup Res (Supported Resources):
  Fix = Fixed resources
  Repl = Replicated resources
  Mob = Mobile resources
  Dyn = Dynamic resources
  M/D = Mobile & Dynamic resources
Naming (Resource Naming and Queries):
  Hash = Hash ID
  String = String naming and queries
  Dir = Directories
  Attr = Attributes
tributed third party systems (identified by requiring manual configuration and by a graph of larger complexity than trivial). Long established and offering the most straightforward billing, these systems are well developed. This is reflected in their diversity. Indeed, many combinations of techniques are available and together they cover all types of resources.
4.2
Multicast RD Systems
The use of multicast (or broadcast) offers a clean and straightforward method to implement resource discovery with minimal administration effort (no foreknowledge is needed) and is especially useful if mobile resources are to be handled. Its use, however, is limited by its very enabling technology: the system's size is restricted to multicast-supporting LANs, as Internet-wide multicast is not an option yet. Ninja and UniFrame overcome this problem by combining it with a
distributed third party approach. However, the drawback of this solution is the manual configuration associated with third party RD systems. Combining a genuinely distributed approach (see Sect. 4.3) with multicast would be an interesting way to overcome this, but, as far as the authors know, no such system exists.
4.3
P2P Systems
Genuinely distributed systems are commonly referred to as peer-to-peer (P2P) systems⁹, though this might be a confusing term. Nevertheless, P2P is widely accepted and recognized and will thus be used in the remainder of this paper. Two classes of P2P systems are available:
– Random P2P RD Systems: The first class of P2P systems contains systems like Gnutella and Freenet. Using no fixed rules and varying metrics, nodes build a list of other nodes, starting from the location of a random active node. The result is a random overlay network. Efficient and/or deterministic search strategies are not feasible on this architecture, and these systems are limited to replicable resources only.
– Regular P2P RD Systems: By using a regular structure to store (hash-based) references to resources at predictable spots in this structure, regular P2P RD systems allow for more efficient and deterministic search algorithms, thereby overcoming the main problem of random P2P RD systems. They too are limited to replicable resources, though.
Hiatuses. The immediate advantages of P2P systems are the absence of a need for a coordinating authority, a minimal required configuration effort, improved scalability and no dependency on third parties (and their servers). But unfortunately, P2P systems are limited to replicable resources; they focus on file sharing. The lack of genuinely distributed systems that support mobile and dynamic resources and use attribute naming is a true hiatus in the design space. The algorithms for self-organization and query forwarding required for their construction should become the subject of intensive research. Notable is the initial work of Iamnitchi et al. [17] in this area.
5
Conclusions
The presented taxonomy for resource discovery systems is based on their design space, which consists of eight design aspects and allows categorization and comparison of RD systems. It is intended as an aid to the designers of new RD systems and as a means to facilitate the selection of a suitable RD system for
⁹ Although Napster is commonly accepted as a P2P system, it is not included here. Napster does use peer-to-peer file transfer, but the RD mechanism is a centralized one.
new distributed applications: all too often, RD systems with almost identical properties have been re-designed and -implemented. Four main classes of RD systems have been identified: centralized, distributed third party, multicast discovery and P2P RD systems. While centralized and distributed third party systems are well established and cover all types of resources and naming, multicast discovery systems require additional development to operate at WAN level without loss of their strongest advantage: the lack of any required foreknowledge. Potentially, the combination with P2P technology could offer a solution. The largest gap in the design space, however, is the limited types of resources that are supported by P2P systems. Genuinely distributed RD systems that support dynamic and mobile resources and use attribute-based naming require increased attention from the research community. Acknowledgements. This work is partially supported by the K.U.Leuven Research Council (GOA/2001/04) and the Fund for Scientific Research - Flanders through FWO Krediet aan Navorsers 1.5.148.02.
References
[1] Réka Albert and Albert-László Barabási. Statistical mechanics of complex networks. Reviews of Modern Physics, 74:47–97, Jan 2002.
[2] A. Alexandrov, M. Ibel, K. Schauser, and C. Scheiman. SuperWeb: Towards a global web-based parallel computing infrastructure. In Proceedings of the 11th International Parallel Processing Symposium (IPPS'97), pages 100–106, Apr 1997.
[3] K. Arnold, B. O'Sullivan, et al. The Jini specification, 1999. See also www.sun.com/jini.
[4] Henri Casanova and Jack Dongarra. NetSolve: A network-enabled server for solving computational science problems. The International Journal of Supercomputer Applications and High Performance Computing, 11(3):212–223, Fall 1997.
[5] Ian Clarke, Oskar Sandberg, Brandon Wiley, and Theodore W. Hong. Freenet: A distributed anonymous information storage and retrieval system. Lecture Notes in Computer Science, 2009:46, 2001.
[6] Salutation Consortium. Salutation architecture specification. Technical report, salutation.org, 1999.
[7] Microsoft Corporation. Universal plug and play device architecture. http://www.upnp.org/download/UPnPDA10 20000613.htm, 2000.
[8] K. Czajkowski, I. Foster, N. Karonis, et al. A resource management architecture for metacomputing systems. In Proc. of the IPPS/SPDP '98 Workshop on Job Scheduling Strategies for Parallel Processing, pages 62–82, 1998.
[9] Steven E. Czerwinski, Ben Y. Zhao, Todd D. Hodes, et al. An architecture for a secure service discovery service. In Mobile Computing and Networking, pages 24–35, 1999.
[10] Frank Dabek, Emma Brunskill, M. Frans Kaashoek, David Karger, Robert Morris, Ion Stoica, and Hari Balakrishnan. Building peer-to-peer systems with Chord, a distributed lookup service. In Proceedings of the 8th Workshop on Hot Topics in Operating Systems (HotOS-VIII), pages 81–86, May 2001.
[11] Reinhard Diestel. Graph Theory. Graduate Texts in Mathematics. Springer-Verlag New York, second edition, 2000.
[12] K. Ducatel, M. Bogdanowicz, et al. Scenarios for ambient intelligence in 2010. ftp://ftp.cordis.lu/pub/ist/docs/istagscenarios2010.pdf, Feb 2001.
[13] Gnutella. The gnutella protocol specification. http://rfc-gnutella.sourceforge.net.
[14] Object Management Group. Corbaservices: Common object services specification. ftp://ftp.omg.org/pub/.docs/formal/98-07-05.pdf, 1999.
[15] Erik Guttman. Service location protocol: Automatic discovery of IP network services. IEEE Internet Computing, 3(4):71–80, 1999.
[16] T. Howes and M. Smith. RFC 1823: The LDAP application program interface. http://www.faqs.org/rfcs/rfc1823.html, Aug 1995.
[17] A. Iamnitchi, I. Foster, and D. Nurmi. A peer-to-peer approach to resource location in grid environments. In 11th Symposium on High Performance Distributed Computing, Edinburgh, UK, Aug 2002.
[18] Paul V. Mockapetris and Kevin J. Dunlap. Development of the domain name system. In SIGCOMM, pages 123–133, 1988.
[19] Keith Moore, Shirley Browne, Jason Cox, and Jonathan Gettler. Resource cataloging and distribution system. Technical Report UT-CS-97-346, University of Tennessee, Jan 1997.
[20] Hidemoto Nakada, Mitsuhisa Sato, and Satoshi Sekiguchi. Design and implementations of Ninf: towards a global computing infrastructure. Future Generation Computing Systems, 15:649–658, 1999.
[21] Napster. www.napster.com.
[22] C. Greg Plaxton, Rajmohan Rajaraman, and Andrea W. Richa. Accessing nearby copies of replicated objects in a distributed environment. In ACM Symposium on Parallel Algorithms and Architectures, pages 311–320, 1997.
[23] Rajesh Raman, Miron Livny, and Marvin Solomon. Matchmaking: Distributed resource management for high throughput computing. In Seventh IEEE International Symposium on High-Performance Distributed Computing, 1998.
[24] Sylvia Ratnasamy, Paul Francis, Mark Handley, et al. A scalable content addressable network. In Proceedings of ACM SIGCOMM 2001, pages 161–172, 2001.
[25] Antony Rowstron and Peter Druschel. Pastry: Scalable, decentralized object location, and routing for large-scale peer-to-peer systems. In IFIP/ACM International Conference on Distributed Systems Platforms (Middleware), pages 329–350, 2001.
[26] N. N. Siram, R. R. Raje, A. M. Olson, et al. An architecture for the UniFrame resource discovery service. In Proceedings of the 3rd International Workshop on Software Engineering and Middleware, 2002.
[27] uddi.org. UDDI technical white paper. Technical report, uddi.org, Sep 2000.
[28] Maarten van Steen, Franz J. Hauck, Philip Homburg, and Andrew S. Tanenbaum. Locating objects in wide-area systems. IEEE Communications Magazine, 36(1):104–109, January 1998.
[29] Duncan J. Watts and Steven H. Strogatz. Collective dynamics of 'small-world' networks. Nature, 393:440–442, 1998.
[30] B. Zhao, J. Kubiatowicz, and A. Joseph. Tapestry: An infrastructure for fault-tolerant wide-area location and routing. Technical Report UCB/CSD-01-1141, University of California, Berkeley, Apr 2001.
Oasis: An Architecture for Simplified Data Management and Disconnected Operation

Anthony LaMarca¹ and Maya Rodrig²

¹ Intel Research Seattle
[email protected]
² Department of Computer Science & Engineering, University of Washington
[email protected]
Abstract. Oasis is an asymmetric peer-to-peer data management system tailored to the requirements of pervasive computing. Drawing upon applications from the literature, we motivate three high-level requirements: availability, manageability and programmability. Oasis addresses these requirements by employing a peer-to-peer network of weighted replicas and performing background self-tuning. In this paper we describe our architecture and an initial implementation. Our performance evaluation and implementation of three applications suggest that Oasis offers good availability and performance while providing a simple API and a familiar consistency model.
1 Introduction

The vision of pervasive computing is an environment in which users, computing and the physical environment are artfully blended to provide in situ interactions that increase productivity and quality of life. While many of the hardware components required to realize this vision are available today, there is a dearth of robust applications to run on these new platforms. We contend that there are so few pervasive computing applications because they are too hard to develop, deploy and manage. A number of factors that are particular to pervasive computing scenarios make application development challenging: devices are resource-challenged and faulty, and devices may be continually arriving and departing. While prototypes of compelling applications can be deployed in the lab, it is very difficult to build an implementation that is robust and responsive in a realistic pervasive environment. We argue that the best way to foster pervasive computing development is to provide developers with a comprehensive set of software services, in effect an "OS for pervasive computing". While there has been work in the area of system software for pervasive computing, a number of significant challenges remain [17]. In this paper, we address the challenge of providing pervasive computing support for one of the more traditional services, namely the storage and management of persistent data. We examined fifteen pervasive computing applications described in the literature and distilled a common set of requirements that fall in the areas of availability, manageability, and programmability. Based upon these requirements, we designed and implemented a data management system called Oasis. We tested the performance
of Oasis and used it to implement three pervasive computing applications in order to understand how well it satisfies these requirements. The contributions of this work are twofold. First, we offer an investigation of the data management requirements of pervasive computing applications. Second, we propose a new architecture to address these requirements that combines existing systems and database techniques and algorithms in a new way. The rest of the paper is organized as follows. In section 2 we identify the data management requirements of pervasive computing applications and draw specific examples from the literature. Section 3 presents the Oasis architecture and our initial implementation. In section 4 we describe our experience constructing three applications on top of Oasis. We discuss the performance of the system as measured with the workload of one of our applications in section 5. In sections 6, 7 and 8 we compare Oasis to related work, describe future work, and conclude.
2 Data Management Requirements of Pervasive Computing

Through a survey of fifteen pervasive computing applications published in the literature [1,3,5,6,13,14,16,20,21,22,26,28,31,32], we have identified what we believe are the important data management requirements of these applications. The breadth of applications covered in the survey includes smart home applications, applications for enhancing productivity in the workplace, and Personal Area Network (PAN) applications. The specific requirements can be grouped into three areas: availability, programmability, and manageability.

2.1 Availability

Pervasive computing applications are being developed for environments in which people expect devices to function 24 hours a day, 7 days a week. The AwareHome [16] and EasyLiving [6], for example, augment household appliances such as refrigerators, microwaves, and televisions that typically operate with extremely high reliability. For many of these augmented devices to function, uninterrupted access to a data repository is needed; thus a storage solution for pervasive computing must ensure data is available under the following conditions:

Data access must be uninterrupted, even in the face of device disconnections and failures. Proposed pervasive computing scenarios utilize existing devices in the home as part of the computing infrastructure [6,20]. A data management solution should be robust to the failure of some number of these devices; turning off a PC in a smart home should not cause the entire suite of pervasive computing applications to cease functioning. The data management system must handle both graceful disconnections and unexpected failures, and data must remain available as devices come and go.

Data may need to be accessed simultaneously in multiple locations, even in the presence of network partitions. The majority of applications we examined include scenarios that require support for multiple devices accessing the same data in multiple locations. Commonly, these applications call for users to carry small, mobile devices that replicate a portion of the user's home or work data [22,31]. Labscape [3] cites
disconnection as uncommon, but would like to support biologists who choose to carry a laptop out of the lab. Finally, some applications involve non-mobile devices sharing data over unreliable channels. The picture frame in the Digital Family Portrait [26], for example, communicates with sensors in the home of a geographically remote family member. In all of these cases, the application scenarios assume the existence of a coherent data management system that supports disconnected operation.

Data can be accessed from and stored on impoverished devices. Pervasive computing applications commonly involve inexpensive, resource-constrained devices used for both accessing and storing data. In PAN applications, for example, impoverished mobile devices frequently play a central role in caching data and moving data between I/O and other computational devices [21,22,31]. Ideally, a data management system for pervasive computing would accommodate the limitations of resource-challenged devices; challenged devices would be able to act as clients, while data could be stored on fairly modest devices.

2.2 Manageability

Perhaps the single largest factor keeping pervasive computing from becoming a mainstream reality is the complexity of managing the system. We have identified a number of features that are essential to making a data management system for pervasive computing practical for deployment with real users.

Technical expertise should be required only in extreme cases. By many accounts the "living room of the future" will have the computational complexity of today's server room; however, there will rarely be an expert to manage it. In many cases only non-technical users are present [26], while in extreme applications like PlantCare [20] there are no users at all. In the spirit of IBM's autonomic computing initiative [34], data management for pervasive computing environments should be self-managing to the largest extent possible.

Adjustments to storage capacity should be easy and incremental. Many of the pervasive computing systems we examined could most appropriately be labeled as platforms on which many small applications and behaviors are installed [5,6,13]. In such an environment, the data management system should be able to grow incrementally to support changing workloads and capacity needs.

The system should adapt to changes within and across applications. The wide variety of devices and applications suggests that the data management system should monitor and adapt to changes in configuration and usage. Consider the location tracking system that is common to many pervasive computing scenarios [3,6,32]. Its job is to track people and objects and produce a mapping for other applications to use. In some scenarios this location data is used infrequently, while other scenarios may require hundreds of queries against this data per second. A static configuration runs the risk of either providing poor performance or over-allocating resources. An adaptive solution, on the other hand, could detect the activation of a demanding application and adjust priorities accordingly, ensuring good performance and overall system efficiency.
2.3 Programmability

A distributed, dynamic and fault-ridden pervasive computing environment is a far cry from the computing platforms on which most software engineers are trained. With this in mind, we have identified a number of requirements intended to lower the barrier to entry and make reliable, responsive pervasive computing applications easier to develop.

The system should offer rich query facilities. Pervasive computing applications often involve large amounts of structured sensor data and frequent searches through this data. A common pattern is that an application takes an action if a threshold value is crossed (e.g., going outdoors triggers a display change [26], low humidity triggers plant watering [20], proximity to a table triggers the migration of a user interface [3]). Data management systems that provide indexing and query facilities vastly reduce the overhead in creating efficient implementations of such behaviors.

The system should offer a familiar consistency model. Some distributed storage systems provide "update-anywhere" semantics [30] in which clients are permitted to read and write any replica at any time, even when disconnected. These systems provide weak consistency guarantees, and applications may see writes to the data occur in a different order than they were written in. These weak guarantees can result in a wide range of unpredictable behaviors that make it difficult to write reliable applications. Our experience suggests that a familiar, conservative consistency model is more appropriate for most pervasive computing applications, even if it results in a decrease in availability.
Fig. 1. A sample configuration of devices, databases and Oasis components
The system should provide a single global view of the data. Some application scenarios dictate a specific replica configuration in order to achieve particular semantics. However, many applications merely want to reliably store and retrieve
data. Accordingly, a data management system for pervasive computing should include a facility for automatic data placement and present the view of a single global storage space to application developers. While these decisions can be guided by hints given by the application, the developer should not be directly exposed to a disparate collection of storage devices.
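To make the last two requirements concrete, the sketch below shows how an application might express the kind of threshold check mentioned above against a single logical database. The JDBC-style connection URL, the table and column names, and the triggerWatering action are all invented for the example, since the client API has not been specified at this point in the paper.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class HumidityMonitor {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection to an Oasis mediator; the application never
        // sees where the replicas of the 'sensors' database actually live.
        Connection conn = DriverManager.getConnection("jdbc:oasis://mediator/sensors");

        PreparedStatement stmt = conn.prepareStatement(
                "SELECT plant_id, humidity FROM readings WHERE humidity < ? AND ts > ?");
        stmt.setDouble(1, 30.0);                                   // threshold: 30% relative humidity
        stmt.setLong(2, System.currentTimeMillis() - 3_600_000L);  // readings from the last hour

        try (ResultSet rs = stmt.executeQuery()) {
            while (rs.next()) {
                triggerWatering(rs.getString("plant_id"));          // the application-level action
            }
        }
        conn.close();
    }

    private static void triggerWatering(String plantId) {
        System.out.println("watering " + plantId);
    }
}
```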
3 The Oasis Architecture

Oasis is a data management system tailored to the requirements of pervasive computing. In Oasis, clients access data via a mediator service that in turn communicates with a collection of Oasis servers. The mediator service stores no persistent data; its only purpose is to run the Oasis consistency protocol. (The mediator's function has been separated to allow the participation of impoverished clients like sensor beacons.) Figure 1 shows an example of an Oasis configuration in an instrumented home. Data is replicated across Oasis servers to provide high availability in the event of a device disconnection or failure. Oasis does not depend on any single server; data remains accessible to clients provided a single replica is available. In the remainder of this section we describe the Oasis architecture and explain how its components enable Oasis to meet the requirements described in Section 2.

3.1 Data Model

From the client's perspective, Oasis is a database that supports the execution of SQL queries on relational data. We chose SQL because it is a widespread standard for accessing structured data. An Oasis installation stores a number of databases. Each database holds a number of tables that in turn hold a set of records. We envision that different databases would be created for different types of data such as sensor readings, configuration data, sound samples, etc. In Section 4, we describe three applications we have built using Oasis and their data representation. It should be noted that nothing in the rest of the architecture is specific to the relational data model, and Oasis could manage file- or tuple-oriented data with a few small changes.

3.2 P2P Architecture with Replication

As devices may arrive and depart in pervasive computing scenarios, an architecture that supports dynamic membership is needed. A pure peer-to-peer (P2P) architecture provides the desired decentralization, adaptability, and fault-tolerance by assigning all peers equal roles and responsibilities. However, the emphasis on equal resource contribution by all peers ignores differences in device capabilities. To support a wide variety of devices, Oasis is an asymmetric-P2P system, or "super-peer" system [34], in which devices' responsibilities are based on their capabilities. Devices with greater capabilities contribute more resources and can perform computation on behalf of others, while impoverished devices may have no resources to contribute and participate only as clients.
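One way to picture this asymmetric role assignment is a simple capability check of the kind sketched below. The role names, thresholds and fields are invented for illustration; the paper does not describe how devices are classified.

```java
/** Illustrative only: Oasis does not prescribe this policy. */
enum OasisRole { CLIENT, MEDIATOR, SERVER }

class DeviceProfile {
    final long freeStorageMb;    // persistent storage the device could contribute
    final boolean mainsPowered;  // battery-only devices are poor candidates for server duty

    DeviceProfile(long freeStorageMb, boolean mainsPowered) {
        this.freeStorageMb = freeStorageMb;
        this.mainsPowered = mainsPowered;
    }

    /** Capable devices store replicas, mid-range devices run the consistency
     *  protocol as mediators, and impoverished devices remain pure clients. */
    OasisRole chooseRole() {
        if (mainsPowered && freeStorageMb >= 512) return OasisRole.SERVER;
        if (freeStorageMb >= 32) return OasisRole.MEDIATOR;
        return OasisRole.CLIENT;   // e.g. a sensor beacon
    }
}
```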
Data is replicated across multiple Oasis servers to provide high availability. In our initial implementation, replication is done at the database level. (Replicating entire databases simplifies implementation but potentially overburdens small devices. In Section 7 we discuss the potential for partial replication.) An initial replica placement is determined at creation time and is then tuned as devices come and go and data usage changes. The self-tuning process is described in Section 3.4.

3.3 Weighted-Voting and Disconnected Operation

All distributed data stores employ an access-coordination algorithm that offers a consistency guarantee to clients accessing the data. To provide developers with a familiar consistency model, we chose an algorithm for Oasis that offers clients sequential consistency [23] when local replicas are available. Sequential consistency guarantees that the operations of all clients execute in some sequential order, and the operations of each client appear in this total ordering in the same order specified by its program. Basically, sequential consistency provides a set of distributed clients with the illusion that they are all running on a single device.

The traditional way to provide sequential consistency and allow disconnected operation is with a quorum-based scheme in which a majority of the replicas must be present to update the data. We have adapted Gifford's "weighted voting" [10] variant of the basic quorum scheme. As in a quorum-based scheme, data replicas are versioned to allow clients to determine which replica is the most recent. In addition, weighted voting assigns every replica of a data object a number of votes. The total number of votes assigned to all replicas of the object is N. A write request must lock a set of replicas whose votes sum to at least W, while read operations must contact a set of replicas whose votes sum to at least R votes. On a read, the client fetches the value from the replica with the highest version number. On a write, the client must update all of the replicas it has locked and then apply the new update. Weighted voting ensures sequential consistency by requiring that R+W > N. This constraint guarantees that no read can complete without seeing at least one replica updated by the last write (since R > N-W). Weighted voting is more flexible than traditional quorum-based approaches because the vote allocation as well as R and W can be tailored to the expected workload. Making R small, for example, boosts performance by increasing read parallelism. Making R and W close to N/2 allows up to half the servers to fail, increasing fault tolerance.

Gifford's original weighted-voting algorithm was written for a single, centralized client accessing data on a collection of servers. With a single client, the metadata for a replicated data object (R, W, N and the list of replica locations and votes) can be maintained in a centralized fashion at the client's discretion. To allow weighted voting to operate in a P2P system with multiple clients and servers, we developed a decentralized version of Gifford's algorithm. In our scheme, versioned copies of the metadata are distributed along with the data in each replica. Updating the metadata requires acquiring a quorum of W votes, akin to a data update. This allows data and metadata operations to be safely interleaved, enabling the system to perform self-tuning. To guarantee sequential consistency, we add the additional constraint that W ≥ R.
This ensures that both reads and writes of the data see the latest version of the metadata (since W ≥ R > N-W). For more detail about our decentralized weighted-voting algorithm, see [29].
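The vote accounting described above can be sketched as follows. The code checks the two constraints (R + W > N and W ≥ R), assembles a read quorum, and answers from the highest-versioned replica in that quorum; locking, the mediator, two-phase commit and the replicated metadata are deliberately left out, so this is an illustration of the quorum arithmetic rather than the actual Oasis protocol.

```java
import java.util.List;

class Replica {
    final int votes;
    long version;
    String value;

    Replica(int votes, long version, String value) {
        this.votes = votes;
        this.version = version;
        this.value = value;
    }
}

class WeightedVotingSketch {
    /** R + W > N forces every read quorum to overlap the last write quorum;
     *  W >= R additionally keeps quorum-less stale reads consistent (Section 3.3). */
    static boolean quorumsValid(int n, int r, int w) {
        return r + w > n && w >= r;
    }

    /** Read: contact replicas until their votes sum to at least R, then
     *  return the value of the highest version seen in that quorum. */
    static String read(List<Replica> reachable, int r) {
        int votes = 0;
        Replica newest = null;
        for (Replica rep : reachable) {
            votes += rep.votes;
            if (newest == null || rep.version > newest.version) newest = rep;
            if (votes >= r) return newest.value;
        }
        throw new IllegalStateException("read quorum of " + r + " votes not reachable");
    }

    /** Write: requires locked replicas worth at least W votes, then installs
     *  the new value with a version higher than any the quorum has seen. */
    static void write(List<Replica> locked, int w, String newValue) {
        int votes = locked.stream().mapToInt(rep -> rep.votes).sum();
        if (votes < w) throw new IllegalStateException("write quorum of " + w + " votes not acquired");
        long next = locked.stream().mapToLong(rep -> rep.version).max().orElse(0L) + 1;
        for (Replica rep : locked) {
            rep.version = next;
            rep.value = newValue;
        }
    }
}
```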
In order to ensure consistency, writes cannot proceed when the required number of votes is not available. (In Oasis, a database appears to be "read only" when insufficient votes are available.) Read queries, on the other hand, are permitted to proceed even if a quorum cannot be attained, as long as a local replica is available. This is the equivalent of allowing a disconnected client to read from a stale cache of the data. While this may seem to violate sequential consistency, it does not. Since the client cannot acquire a read quorum, it also cannot write the data (W ≥ R), ensuring it sees a consistent, if out-of-date, snapshot of the data. When Oasis performs a query on a potentially stale replica, the query results are marked as stale to alert the client.

Update requests can potentially fail if either the mediator or one of the replicas crashes or departs during the operation. To ensure the consistency of the database, mediators use a two-phase commit protocol when acquiring votes and executing updates on a replica. If a request fails to complete on a replica, the replica will be marked as invalid. Invalid replicas cannot participate in client operations until a distributed recovery algorithm [11] has been successfully executed.

3.4 Online Self-Tuning and Adaptability

Oasis was designed to support self-tuning. The SQL data model provides the opportunity to add and delete indices. Our weighted-voting scheme permits flexibility regarding the number and placement of replicas, the vote allocation, and the values of R and W. Finally, our consistency scheme allows these parameters to be adjusted during a stream of client requests. This allows Oasis to be tuned in an online fashion without denying applications access to the data or requiring user intervention.

Applications have the choice of managing their own replica placement and vote assignment, or allowing Oasis to manage the data on their behalf. For applications that do not want to manage their own replica placement, Oasis includes a self-tuning service that automatically handles replica configuration. When databases are created in Oasis, performance and availability expectations can be provided by applications that want auto-tuning. Oasis servers advertise their performance and availability characteristics, and the self-tuner uses these along with the application expectations to make its configuration decisions. The self-tuner periodically examines each database's expectations and checks whether they are best served by the current replica placement and vote assignment, making adjustments if appropriate. As we discuss in Section 7, we see the development of more sophisticated self-tuning behaviors based on machine learning techniques as a promising direction for future research.

3.5 Implementation Details

Our initial implementation of Oasis was written in Java, and our servers and mediators communicate using XML over HTTP. Oasis was implemented as a meta-database that delegates storage and indexing to an underlying database. The Oasis server has been written to run on top of any JDBC-compliant database that supports transactions. Our initial deployments have used a variety of products: PostgreSQL and MySQL have been used on PC-class devices, and PointBase, an embedded, small-footprint database, has been used with iPAQs and other ARM-based devices.
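The read path described in Section 3.3 (serve from a quorum when possible, otherwise fall back to a stale local replica and say so) might be summarized as in the sketch below; the interfaces and the StaleResult wrapper are invented, and error handling is reduced to a single exception type.

```java
class StaleResult<T> {
    final T value;
    final boolean stale;   // true when served from a local replica without a read quorum

    StaleResult(T value, boolean stale) {
        this.value = value;
        this.stale = stale;
    }
}

class ReadPath {
    interface QuorumReader<T> { T read(); }    // throws IllegalStateException if no quorum
    interface LocalReplica<T> { T query(); }   // local, possibly out-of-date copy

    static <T> StaleResult<T> read(QuorumReader<T> quorum, LocalReplica<T> local) {
        try {
            return new StaleResult<>(quorum.read(), false);   // normal, sequentially consistent read
        } catch (IllegalStateException noQuorum) {
            if (local == null) throw noQuorum;                // no replica at all: the read fails
            return new StaleResult<>(local.query(), true);    // consistent but possibly stale snapshot
        }
    }
}
```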
4 Applications

To investigate usability, we implemented three applications on top of Oasis. Two of these are variants of existing applications from the literature, while Guide is a new application that has been developed in our laboratory. While we did not undertake a rigorous evaluation of our implementations, our experience suggests that Oasis is well suited for the pervasive computing domain. More interestingly, for all three applications, we encountered ways in which the capabilities of Oasis transparently augmented or extended some basic function of the application.

4.1 Portrait Display

The Portrait Display is an ongoing project in our laboratory motivated by Mynatt et al.'s Digital Family Portrait [26]. The Digital Family Portrait tries to increase the awareness of extended family members about an elderly relative living alone. Information about an elderly person (health, activity level, social interaction) is collected by sensors in his instrumented home, and unobtrusively displayed at the remote home of an extended family member on a digital picture frame surrounding his portrait. Researchers in our laboratory have been using the digital family portrait scenario to explore various approaches for displaying ambient data about elders that require home care. In conjunction with their investigation, we have implemented a digital portrait inspired by the original that runs on top of Oasis. The four categories of information displayed in our digital portrait are medication intake, meals eaten, outings, and visits. The data used to generate the display comes from a variety of sources. In our prototype, medication and meal information are gathered using Mote-based sensors [12] and cameras, while information about visits and outings is currently entered by hand using a web interface.

The relational data model provided by Oasis is well suited for describing the regular, structured data used by the portrait display application. Similarly, the types of queries needed to extract results to display are easily expressed in SQL. Oasis effectively supports the availability requirements of the portrait display. The portrait display uses a separate Oasis database for each category of information collected (meals, visits, etc.). Each database is explicitly configured with a replica with 4 votes that resides on the device where the data is gathered and a 1-vote replica on the portrait display device (N=5, R=3, W=3). This configuration allows the data to be read and updated at its source, and, when a connection exists, allows the portrait display to obtain recent changes. Note that this remains true even if the data source itself is disconnected. For example, after visiting with the elder, a care provider can enter notes about the visit while sitting in his car or back at his office, a practice mentioned in our fellow researchers' interviews with care providers. While the ability to record information when disconnected was not part of the original scenario, the capability is provided by Oasis transparently by placing the 4-vote replica on the care provider's laptop or PDA. This configuration also supports unplanned disconnections by the portrait display itself.

The original digital family portrait used a simple client-server model in which the display was rendered as a web page fetched from a server running in the elder's home. While suitable for a prototype, it would not work well in a real deployment in
which DSL lines and modem connections do in fact go down at times. Implementations that rely on a web client-server model must either display an error page or leave the display unchanged in the case of a disconnection. With Oasis, disconnections are exposed in the form of stale query results, giving the application the opportunity to display the uncertainty in an appropriate way.

4.2 Dynamo: Smart Room File System

Stanford's iRoom [13] and MIT's Intelligent Room [5] are examples of "productivity enhanced workspaces" in which pervasive computing helps a group of people work more efficiently. In their scenarios, people gather and exchange ideas while sharing and authoring files using a variety of viewing and authoring tools. Generally, in these scenarios either:

1. the files are stored on a machine in the workspace and users lose access when they leave the space, or
2. files reside on a user's personal device (like a laptop) and everyone else in the workspace loses access when the user departs.

For our second application we developed a system called Dynamo that allows file-oriented data to be consistently replicated across personal devices. In Dynamo, each user or group owns a hierarchical tree of directories and files, much like a home directory. Users can choose contexts in which to share various portions of their file system with other users (example contexts are 'code review' or 'hiring meeting'). The collective sum of files shared by the users that are present makes up the files available in the workspace. In this manner, everyone present at a hiring meeting can share their proxies and interview notes with the other participants without exposing other parts of their file space.

Dynamo was written as an extension to Apache's WebDAV server that stores a user's files in an Oasis database. Microsoft's Web Folders are used to mount the WebDAV server as a file system, allowing Dynamo's file hierarchy to be accessed using standard desktop applications. Implementing Dynamo on top of Oasis required a small number of changes to the original WebDAV server (less than 400 lines). Despite this, the relational data model was not a good fit for the file-oriented data stored in Dynamo. Mapping the hierarchy of the file system into relations required a translation step not needed in our other two applications.

The flexibility of Oasis enabled a variety of semantically interesting configurations. If desired, Dynamo can create a 1-vote replica of a user's files on a device that resides in the workspace. This permits the user to disconnect and leave while enabling the remaining participants to view (but not write) the files that have been shared. These stale read-only files remain in that context until the user returns, at which time a more up-to-date, writeable version would be seen. For files owned by a group, interesting ownership policies can be arranged by assigning votes based on the users' roles and responsibilities. This can be used to enforce policies ranging from basic majority schemes in which all replicas are equal, to more complex specifications such as: 'the budget cannot be edited unless the boss plus any other two employees are present'. While this flexibility raises a number of privacy and interface challenges, it shows how Oasis can add rich semantics to a simple application.
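As a purely illustrative instance of the last policy, suppose the budget database has one replica on the boss's device holding 3 votes and one replica on each of four employees' devices holding 1 vote each, so N = 7, and the group chooses W = 5 and R = 3. The constraints R + W > N and W ≥ R hold, the four employees together control only 4 votes and therefore cannot write, while the boss plus any two employees reach exactly 5. None of these numbers come from the paper; the check below merely confirms the arithmetic.

```java
public class BudgetPolicyCheck {
    public static void main(String[] args) {
        int bossVotes = 3, employeeVotes = 1, employees = 4;
        int n = bossVotes + employees * employeeVotes;   // N = 7
        int w = 5, r = 3;

        System.out.println("quorum constraints hold:      " + (r + w > n && w >= r));                // true
        System.out.println("employees alone can write:    " + (employees * employeeVotes >= w));     // false
        System.out.println("boss + 2 employees can write: " + (bossVotes + 2 * employeeVotes >= w)); // true
        System.out.println("boss + 1 employee can write:  " + (bossVotes + 1 * employeeVotes >= w)); // false
    }
}
```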
4.3 Guide

The Guide project [9] aims to use passive Radio Frequency Identification (RFID) tags for the purpose of context inference. The project involves tagging thousands of objects in a work space with RFID tags and tracking their position using RF antennas mounted on a robot. Tagged objects include books, personal electronics and office/lab equipment. As the robot moves around the environment, the antennas pick up the ID of nearby tags. For each tag ID i discovered at time t and location l, the platform writes the tuple (i, t, l) to a database. The database thus accumulates the location of objects over time. The goal of Guide is to determine high-level relationships between objects based on the accumulated data. To help in our evaluation, the Guide team in our lab implemented their system on top of Oasis.

The relational data model was an ideal match for Guide's structured RFID readings, and the Guide workload could be easily expressed as SQL statements. The indexing provided by the underlying database was essential in reducing the time to process Guide queries.

Guide has demanding performance, reliability and availability requirements. First, it is expected to generate large quantities of data; the database is expected to grow to contain millions of readings in three months. Given that the Guide database is intended to be used as a common utility, it is quite possible that tens or hundreds of clients will query the Guide database. The database must therefore scale to support large numbers of potentially complex queries in parallel. Second, this large quantity of data must be stored reliably. Since the data may represent months of activity (and it is impossible to regenerate the data), and the entire period may well be relevant, losing the data would be detrimental. Third, since the queries may be part of standing tasks (such as context-triggered notification), it is important that the database be highly available. Based on Guide's goals of high availability and performance, the Oasis self-tuner configured the Guide database with three 1-vote replicas (N=3, R=2, W=2). This configuration provides high reliability, good performance and continuous access to the data provided that any two of the three servers are available.
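The kind of query Guide runs over this data, "where was object X last seen?", can be pictured as a join across the reading, object and place tables described in Section 5. The schema and column names below are guesses consistent with that description (readings carry a tag ID, a timestamp and a position, places carry geometric bounds); they are not the schema actually used by Guide.

```java
/** Hypothetical Guide schema:
 *  reading(tag_id, ts, x, y), object(tag_id, name),
 *  place(name, x_min, x_max, y_min, y_max). */
public final class GuideQueries {
    /** "Where was object X last seen?": newest reading for the object,
     *  mapped to the room whose bounds contain the reading's position. */
    public static final String LAST_SEEN =
            "SELECT p.name AS room, r.ts " +
            "FROM reading r " +
            "JOIN object o ON o.tag_id = r.tag_id " +
            "JOIN place p ON r.x BETWEEN p.x_min AND p.x_max " +
            "            AND r.y BETWEEN p.y_min AND p.y_max " +
            "WHERE o.name = ? " +
            "ORDER BY r.ts DESC LIMIT 1";

    private GuideQueries() { }
}
```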
Fig. 2. This graph compares the throughput of two Oasis configurations and a single PostgreSQL server. In the experiment ten clients are running the Guide workload concurrently
5 Performance

To measure the performance of Oasis using a realistic workload, we constructed an experiment based on the Guide application described in Section 4.3. The Guide database comprises 3 tables: a reading table tracking when and where an RFID-tag was seen, an object table that relates RFID-tags to object names (like 'stapler'), and a place table that records the geometric bounds of rooms. Our experimental data set was seeded with 1 million records in the reading table, 1000 records in the object table and 25 records in the place table. This approximates the number of tagged objects in our laboratory and the number of readings we expect to record in a month.

In our benchmark a set of clients alternate between performing queries and updates on the database. The queries are all of the form "Where was object X last seen?". These are fairly compute-intensive queries that join across the reading and place tables. The updates are inserts of a new record into the reading table. The ratio of queries to updates performed by the clients is 50:1, again approximating the expected workload in a Guide deployment.

To show the tradeoffs offered by Oasis, we measure two Oasis configurations: one which offers the highest query performance (R=1, W=N) and another which offers the highest tolerance to server failure (R=N/2, W=N/2+1). To show the overhead that Oasis introduces, we compare its performance to direct accesses to the underlying PostgreSQL database. In our experiments, the number of clients is fixed at 10 and the number of replicas is varied from 1 to 6. Each client and server in the test ran on its own Pentium 4 PC running Windows 2000, connected via 100 Mbit/s Ethernet. The Oasis servers and clients ran on Sun's 1.3.1 JVM and the underlying data was stored in PostgreSQL 7.3.

Figure 2 shows the total throughput achieved by the set of clients. The graph shows that for a singly-replicated database, Oasis achieves lower throughput than PostgreSQL. This is expected, since Oasis incurs additional overhead running our locking protocol. The graph shows that as replicas are added, read queries are able to take advantage of the increased parallelism each new server offers. This parallelism is greater in the high-performance configuration, in which a read query can be fully serviced by any one replica. For all multiple-replica configurations, Oasis achieves higher throughput than direct access to a single PostgreSQL server.

Figure 3 shows a latency breakdown for the Guide queries executed against a 2-way replicated Oasis database. The breakdown shows that read queries spend more time executing in the database than the writes. It also shows that the Oasis overhead is higher for the writes than the reads. (With two replicas, read operations can piggyback the query on the lock request, requiring fewer messages.) This figure also suggests that optimizing our XML/HTTP messaging layer could offer substantial performance gains.
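The two configurations trade availability for performance exactly as the quorum rules of Section 3.3 predict. With four equally weighted 1-vote replicas, for instance, the high-performance configuration (R=1, W=4) lets any single live replica answer a read but blocks all updates as soon as one server is down, whereas the fault-tolerant configuration (R=2, W=3) keeps reads available with two failed servers and writes available with one; both satisfy R+W > N and W ≥ R. (These numbers are an illustration derived from the formulas, not measurements from the experiment.)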
6 Related Work

There are many existing storage management systems available to pervasive computing developers, including distributed file systems, databases, and tuple-stores. These distributed systems exhibit a variety of behaviors when clients disconnect from
Fig. 3. The latency breakdown of read and write queries in the Guide workload for Oasis configured with two replicas.
the network. In most systems, disconnected clients are unable to read or write data; others offer limited disconnected operation [27], while some systems give clients full read and write capabilities while disconnected [14,18,30]. We now review the storage management systems that are most relevant to Oasis and discuss how they compare.

A number of data management systems permit clients to perform updates to a local replica at any time, even when disconnected from all other replicas. These so-called "update anywhere" systems are attractive because they never deny the client application the ability to write data and guarantee that the update will eventually be propagated to the other replicas. There are update-anywhere file systems such as Coda [18] as well as update-anywhere databases such as Bayou [30] and Deno [14]. As data can be updated in multiple locations at the same time, these systems offer weaker consistency guarantees than Oasis. To achieve eventual consistency, update-anywhere systems employ varying mechanisms to resolve conflicts that arise between divergent replicas. Coda [18] relies on the user to merge write conflicts that cannot be trivially resolved by the system. This technique is a poor fit for pervasive computing environments, where the user may not be near a computer to provide input or may not have the necessary level of expertise. In Bayou [30], writes are accompanied by fragments of code that travel with the write request and are consulted to resolve conflicts. While these migrating, application-specific conflict resolvers are a potentially powerful model, we believe that writing them is beyond the technical abilities of an average software engineer. Finally, Deno [14] uses rollback to resolve conflicts between replicas. Rollback is difficult to cope with in a pervasive computing environment in which physical actuations take place that cannot be undone.

While peer-to-peer file sharing systems like Gnutella satisfy a number of our requirements, they do not provide a single consistent view of the data as servers connect and disconnect. Systems like Farsite [4], OceanStore [19], and CFS [8] improve on the basic P2P architecture by incorporating replication to probabilistically ensure a single consistent view. While these systems share our goals of availability and manageability, there are significant differences that make them less than ideal for pervasive computing environments. Farsite was designed for a network of PCs running desktop applications. OceanStore is geared for global-scale deployment and depends on a set of trusted servers. Finally, CFS provides read-only access to clients and is not intended as a general-purpose file system.

A few storage systems have been designed specifically for pervasive computing environments. The TinyDB system [25] allows queries to be routed and distributed
within a network of impoverished sensor nodes. Systems like PalmOS allow PDA users to manually synchronize their data with a desktop computer. TSpaces [24] is one of a number of centralized tuple-based systems written for environments with a changing set of heterogeneous devices.

Self-tuning has been incorporated into several storage systems outside the domain of pervasive computing. HP AutoRAID [32] automatically manages the migration of data between two different levels of RAID arrays as access patterns change. Similarly, Hippodrome [2] employs an iterative approach to automating storage system configuration.
7 Future Work

The largest limitation in Oasis is the need to replicate databases in their entirety; currently an application that wants to replicate a small part of a large database needs to create an alternate database and keep it consistent. One solution is to change the level at which replication occurs to the table or possibly the record level. Another alternative would be to support partial database replicas similar to 'views' in SQL. We plan to investigate which alternative best fits pervasive computing as well as understand how our protocols would have to change to remain correct.

We also plan to explore how machine learning techniques can be used to guide the placement of replicas, the creation of indexes and the adjustment of the weighted-voting parameters. We believe that large gains in both availability and query performance can be attained by taking greater advantage of device characteristics and data access patterns. While work has been done in off-line self-tuning, little on-line tuning research has been done for dynamic storage systems such as Oasis.
8 Conclusions

It is difficult to write responsive, robust pervasive computing applications using traditional data management systems. To help address this issue, we have built Oasis, a data management system tailored to the requirements of pervasive computing. Oasis presents an SQL interface and a relational data model, both of which are well suited to the data usage of typical pervasive computing applications. A peer-to-peer architecture coupled with a weighted-voting scheme provides sequentially consistent access to data while tolerating device disconnections. We have validated our initial implementation by showing that it exhibits good performance as well as by using Oasis to implement three typical pervasive computing applications.
References

1. Abowd, G. D., Atkeson, C. G., Feinstein, A., Hmelo, C., Kooper, R., Long, S., Sawhney, N., and Tani, M. Teaching and learning as multimedia authoring: the Classroom 2000 project. Proceedings of ACM Multimedia '96, 187-198, 1996.
2. Anderson, E., Hobbs, M., Keeton, K., Spence, S., Uysal, M., and Veitch, A. Hippodrome: running circles around storage administration. In Conference on File and Storage Technology. USENIX, 2002.
3. Arnstein, L., Sigurdsson, S. and Franza, R. Ubiquitous computing in the biology laboratory. Journal of Laboratory Automation, March 2001.
4. Bolosky, W., Douceur, J., Ely, D. and Theimer, M. Feasibility of a Serverless Distributed File System Deployed on an Existing Set of Desktop PCs. In Proceedings of ACM SIGMETRICS, 2000.
5. Brooks, R. The Intelligent Room Project. Proceedings of the Second International Cognitive Technology Conference, 1997.
6. Brumitt, B., Meyers, B., Krumm, J., Kern, A., and Shafer, S. EasyLiving: Technologies for intelligent environments. In Proc. of 2nd International Symposium on Handheld and Ubiquitous Computing (2000), 12-29.
7. Card, S. K., Robertson, G. G., and Mackinlay, J. D. The information visualizer: An information workspace. Proc. ACM CHI'91 Conf. (1991), 181-188.
8. Dabek, F., Kaashoek, M. F., Karger, D., Morris, R., and Stoica, I. Wide-area cooperative storage with CFS. In Proceedings of the 18th ACM Symposium on Operating Systems Principles, 2001.
9. Fishkin, K.P., Fox, D., Kautz, H., Patterson, D., Perkowitz, M., Philipose, M. Guide: Towards Understanding Daily Life via Auto-Identification and Statistical Analysis. UbiHealth 2003, Sept 2003.
10. Gifford, D. K. Weighted Voting for Replicated Data. Proceedings of the Seventh Symposium on Operating Systems Principles, 1979, pp. 150-162.
11. Goodman, N., Skeen, D., Chan, A., Dayal, U., Fox, S., and Ries, D. A recovery algorithm for a distributed database system. In Proceedings 2nd ACM Symposium on Principles of Database Systems, March 1983.
12. Hill, J., Szewczyk, R., Woo, A., Culler, D., Hollar, S. and Pister, K. System Architecture Directions for Networked Sensors. ASPLOS 2000.
13. Johanson, B., Fox, A. and Winograd, T. The Interactive Workspaces Project: Experiences with Ubiquitous Computing Rooms. IEEE Pervasive Computing Magazine 1(2), April-June 2002.
14. Johanson, B. and Fox, A. The Event Heap: A Coordination Infrastructure for Interactive Workspaces. Proc. WMCSA 2002.
15. Keleher, P. Decentralized Replicated-Object Protocols. In Proc. 18th ACM Symp. on Principles of Distributed Computing, (1999), 143-151.
16. Kidd, C., Orr, R., Abowd, G.D., Atkeson, C.G., Essa, I.A., MacIntyre, B., Mynatt, E., Starner, T.E., and Newstetter, W. The Aware Home: A Living Laboratory for Ubiquitous Computing Research. Proceedings of the Second International Workshop on Cooperative Buildings, 1999.
17. Kindberg, T. and Fox, A. System Software for Ubiquitous Computing. IEEE Pervasive Computing, 1(1), Jan 2002, pp. 70-81.
18. Kistler, J., Satyanarayanan, M. Disconnected Operation in the Coda File System. ACM Transactions on Computer Systems, Feb. 1992.
19. Kubiatowicz, J., Bindel, D., Chen, Y., Czerwinski, S., Eaton, P., Geels, D., Gummadi, R., Rhea, S., Weatherspoon, H., Weimer, W., Wells, C., and Zhao, B. OceanStore: An Architecture for Global-Scale Persistent Storage. ASPLOS, 2000.
20. LaMarca, A., Brunette, W., Koizumi, D., Lease, M., Sigurdsson, S., Sikorski, K., Fox, D., Borriello, G. PlantCare: An Investigation in Practical Ubiquitous Systems. Ubicomp 2002: 316-332.
21. Lamming, M. and Flynn, M. Forget-me-not: Intimate Computing in Support of Human Memory. In Proceedings of International Symposium on Next Generation Human Interface, (1994).
22. Lamming, M., Eldridge, M., Flynn, M., Jones, C., and Pendlebury, D. Satchel: providing access to any document, any time, anywhere. ACM Transactions on Computer-Human Interaction, (7)3:322-352, 2000.
23. Lamport, L. How to make a multiprocessor computer that correctly executes multiprocess programs. IEEE Trans. on Computers, 28(9):690-691, Sept. 1979.
24. Lehman, T. J., McLaughry, S. W., Wyckoff, P. TSpaces: The next wave. Hawaii Intl. Conf. on System Sciences (HICSS-32), January 1999.
25. Madden, S., Franklin, M., Hellerstein, J., and Hong, W. The Design of an Acquisitional Query Processor for Sensor Networks. SIGMOD, June 2003.
26. Mynatt, E., Rowan, J., Craighill, S. and Jacobs, A. Digital family portraits: Providing peace of mind for extended family members. Proc. of the ACM Conference on Human Factors in Computing Systems, 2001, 333-340.
27. Oracle9i Lite Developers Guide for Windows CE, Release 5.0.1, Jan 2002.
28. Sumi, Y. and Mase, K. Digital System for Supporting Conference Participants: An Attempt to Combine Mobile, Ubiquitous and Web Computing. Ubicomp 2001.
29. Rodrig, M., LaMarca, A. Decentralized Weighted Voting for P2P Data Management. Third International Workshop on Data Engineering for Wireless and Mobile Access (MobiDE 2003), Sept 2003.
30. Terry, D., Theimer, M., Petersen, K., Demers, A., Spreitzer, M. and Hauser, C. Managing Update Conflicts in Bayou, a Weakly Connected Replicated Storage System. Proc. 15th ACM Symp. on Operating Systems Principles, (1995), 172-183.
31. Want, R., Pering, T., Danneels, G., Kumar, M., Sundar, M., Light, J. The Personal Server: Changing the Way We Think about Ubiquitous Computing. Ubicomp 2002.
32. Weiser, M. The computer for the twenty-first century. Scientific American, pages 94-100, September 1991.
33. Wilkes, J., Golding, R., Staelin, C., and Sullivan, T. The HP AutoRAID Hierarchical Storage System. ACM Transactions on Computer Systems, 14(1), Feb 1996.
34. Yang, B., and Garcia-Molina, H. Designing a super-peer network. Technical Report, Stanford University, February 2002.
35. Autonomic Computing Manifesto, http://www.research.ibm.com/autonomic/manifesto/autonomic_computing.pdf, visited Mar '03.
Towards a General Approach to Mobile Profile Based Distributed Grouping

Christian Seitz and Michael Berger

Siemens AG, Corporate Technology, Information and Communications, 81730 Munich, Germany
[email protected]
[email protected]
Abstract. We present a new kind of mobile ad hoc application, which we call Mobile Profile based Distributed Grouping (MPDG), a combination of mobile clustering and data clustering. In MPDG each mobile host is endowed with a user profile, and while the users move around, hosts with similar profiles are to be found and a robust mobile group is formed. The members of a group are able to cooperate or attain a goal together. In this paper MPDG is defined and compared with related approaches. Furthermore, a modular architecture and algorithms are presented to build arbitrary MPDG applications.
1 Introduction
Tomorrow's world will be intrinsically ubiquitous and mobile. Ubiquitous computing is a new trend in computation and communication. It is an intersection of several technologies, including embedded devices, service discovery, wireless networking and personal computing technologies. In an ad hoc network, mobile devices can detach completely from the fixed infrastructure and establish transient and opportunistic connections with other devices that are in communication range. The structure of an ad hoc mobile network can be highly dynamic. The absence of a fixed network infrastructure, frequent and unpredictable disconnections, and power considerations render the development of ad hoc mobile applications a very challenging undertaking.

We present a new mobile ad hoc network application area, which we call Mobile Profile based Distributed Grouping (MPDG). In MPDG each mobile host is endowed with a user profile. A user profile (short: profile) is a comprehensive data collection belonging to a specific object (e.g. a person). A profile consists of a set of parameters defining the configuration of a user-specific application. While the users move around, mobile hosts with similar profiles are to be found and a mobile group is formed. The participants of a group are able to cooperate or attain a goal together.

The rest of the paper is organized as follows. Section 2 gives an overview of related work on other clustering or grouping problems. Section 3 formally defines the MPDG problem. The next section presents the architecture of a MPDG
application. Section 5 describes the algorithms used in our approach to MPDG and presents simulation results. Finally, Section 6 concludes the paper with a summary.
2 Problem Classification and Related Work
In this section we classify the problems with which a Mobile Profile based Distributed Grouping application is confronted. Furthermore, we show which other research areas are related and how this new problem has already been discussed in the literature.

2.1 Problem Classification
In Mobile Profile based Distributed Grouping, mobile hosts are equipped with wireless transmitters, receivers, and a user profile. They are moving in a geographical area and form an ad hoc network. In this environment hosts with similar profiles have to be found. Mobile Profile based Distributed Grouping in ad hoc environments comprises three main problems which have to be solved to accomplish a MPDG application. The first problem is the dynamic behavior of an ad hoc network, where the number of mobile hosts and communication links permanently changes. Secondly, a data structure for the user profile has to be defined and a mechanism must be created to compare profile instances. Finally, similar profiles have to be found in the ad hoc network and the corresponding hosts must form a group, in spite of the dynamic behavior of the ad hoc network.

2.2 Related Research Areas
Grouping algorithms and their applications appear very often in the literature. There are mainly two different research areas associated with them: mobile networks on the one hand, and databases and data mining on the other. Grouping in mobile networks describes the partitioning of a mobile network into several, mostly disjoint, clusters [2,3]. The clustering process comprises the determination of a cluster head in a set of hosts. A cluster is a group of hosts able to communicate with the cluster head. This clustering takes place at the network layer and is used for routing purposes. In the following we will have a closer look at these two research areas.

Clustering is also known in the database or data mining area. A huge amount of data is scanned with the goal of finding similar data sets. This research is also known as unsupervised learning. The surveys of Fasulo [6] or Fraley and Raftery [7] give an overview of many algorithms for that domain. Maitra [11] and Kolatch [10] examine data clusters in distributed databases.

MPDG combines the two aforementioned clustering approaches, and in order to accomplish its objective the problems mentioned above have to be solved. The problems arising from the motion of the hosts can be solved by methods used in the mobile network area. Searching for similar profiles is based on algorithms from data clustering. Both methods must be adapted to MPDG: e.g., while in the
database area millions of data sets must be scanned, in the MPDG application at most one hundred other hosts are present. In contrast to data sets in databases, ad hoc hosts move around and are active, i.e. they can publish their profile on their own.

2.3 Related Work
There is other work that analyzes ad hoc clustering or grouping algorithms. Roman et al. [13] deal with consistent group membership. They assume that the position of each host is known by the other hosts, and two hosts only communicate with each other if it is guaranteed that the transmission range will not be exceeded during the message exchange. In our environment obtaining position information is not possible, because such data is not always available, e.g. inside buildings. Hatzis et al. [9] describe algorithms for mobile ad hoc networks, but they assume that the number of hosts does not change during protocol execution. This assumption appears too strict, because vanishing and appearing hosts are in the nature of ad hoc networks. Another related field is agreement or consensus algorithms, where all hosts must agree on a binary value based on the votes of each host. They must all agree on the same value, and that value must be the vote of at least one process. This is almost what we want to achieve, but in MPDG we have to agree on complex profiles. Furthermore, not all hosts must agree in our case; it is enough when a subset agrees. Badache et al. [1] adapt agreement algorithms to a mobile environment, but they use fixed base stations and not a real ad hoc network without infrastructure.
3 MPDG Definitions and Assumptions
This section gives a formal introduction to Mobile Profile based Distributed Grouping and defines the ad hoc network model used. Finally, some assumptions about the application are made.

3.1 The Ad Hoc Network Model
An ad hoc network is generally modelled as an undirected graph G = (V, E).

Definition 1. Graph G(V, E): Let G = (V, E) be a graph with the vertex set V = {v1, v2, ..., vn} and the edge set E ⊆ V × V, where (vi, vj) ∈ E if the Euclidean distance d(vi, vj) ≤ rt¹.

A graph is depicted in Figure 1a. The vertices vi of G0 represent mobile hosts. Due to the motion of the vertices, a graph G0 as shown in Figure 1a is only a snapshot of an ad hoc network, because in consequence of the mobility of the hosts G will change. The assumptions on the mobile nodes and network are:
¹ rt = transmission radius
Fig. 1. Graph and a possible Spanning tree
– We do not rely on any central component.
– There is no location information available, e.g. GPS data or cell info.
– Each mobile device has a permanent, constant unique ID.
– The transmission range of all hosts is rt. This also guarantees a symmetrical communication behavior, i.e. if host A can send messages to host B, host B can send messages to host A as well, even though physically it is easy to envision circumstances in which some hosts may be able to reach much further than others.
– Each host knows all its neighbors² and their associated IDs. This service is accomplished by the ad hoc middleware.
– Each host has the same profile pattern, i.e. the profile consists of attributes (e.g. name) and values (e.g. Smith). The attributes must be the same for all hosts in order to compare these profiles.
3.2 Formal Definition of MPDG
In a MPDG application each host has a user profile. A profile is a point in the profile space Π.

Definition 2. Profile space: Let Π = Π1 × ... × Πm be a profile space with finite dimension m. A point P ∈ Π with P = (p1, ..., pm) corresponds to a profile. The pi are referred to as profile entries. Furthermore, let φ : V → Π be a function that maps each node vi ∈ V to a profile Pj, i.e. φ(vi) = Pj.

The algorithms in Section 5 distinguish between a local group and a decentralized group, which are defined in the following.

Definition 3. Communication relation a Cn b: The relation a Cn b indicates that a node a can communicate with a node b over a maximum of n edges, i.e. there exists a path {e1, ..., em} (m ≤ n) from node a to node b with |{e1, ..., em}| ≤ n.
² The neighbors of a host are all other hosts which reside within transmission range rt.
Definition 4. Neighbor set N_v^k: Let N_v^k = {x | v Ck x}. Thus, N_v^k contains the set of nodes that can communicate over at most k edges with the node v.

Definition 5. Similarity operator σ: Let K be an r-tuple of nodes of V of the graph G and let v be an arbitrary node of G. Then σ is defined as σ : V^r × V → {true, false}.

Definition 6. Local group G: Let N_vi^1 be the set of communication partners of an arbitrary node vi over one edge, let P(N_vi^1) be the power set of N_vi^1, and let K = {M ∈ P(N_vi^1) | σ(vi, M) = true}. Then the local group G ⊆ N_vi^1 is the element of K with maximum cardinality, i.e. G = arg max_{X ∈ K} |X|. The set of all n local groups G is denoted by 𝒢.

The local group G of a vertex v consists of all direct neighbor hosts whose profiles are similar to the profile of v. In order to achieve a decentralized group, local groups must be combined to get a new group. This is done by a combination operator.

Definition 7. Combination operator γ: Let Gi and Gj be two local groups and v an arbitrary node in G. For the combination γ(Gi, Gj) = (Gi ∪ Gj) \ M with M ⊂ Gi ∪ Gj, the following conditions must hold:
|γ(Gi, Gj)| > |Gi| ∧ |γ(Gi, Gj)| > |Gj|
σ(v, γ(Gi, Gj)) = true

Definition 8. Decentralized group: Let N_v^k be all the neighbors of a node v. A decentralized group is obtained if the local group of the node v is combined with the local group of each element in N_v^k.
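To make the definitions tangible, the following toy sketch realizes one possible σ (profiles judged similar when they match on a minimum fraction of entries) and computes the local group by exhaustively searching the power set of the 1-hop neighbors, which is feasible only because that set is small in MPDG scenarios. The profile representation, the similarity criterion and the threshold are choices made for this example, not part of the definitions above.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class LocalGroupSketch {
    /** sigma: true iff every profile in 'others' agrees with 'self' on at
     *  least 'threshold' of the profile entries (an invented criterion). */
    static boolean similar(Map<String, String> self,
                           Collection<Map<String, String>> others, double threshold) {
        for (Map<String, String> other : others) {
            long matches = self.entrySet().stream()
                    .filter(e -> e.getValue().equals(other.get(e.getKey())))
                    .count();
            if ((double) matches / self.size() < threshold) return false;
        }
        return true;
    }

    /** Local group: the largest subset of the 1-hop neighbors that sigma accepts. */
    static Set<String> localGroup(Map<String, String> self,
                                  Map<String, Map<String, String>> neighbors, double threshold) {
        List<String> ids = new ArrayList<>(neighbors.keySet());
        Set<String> best = new HashSet<>();
        for (int mask = 1; mask < (1 << ids.size()); mask++) {   // enumerate all non-empty subsets
            Set<String> subset = new HashSet<>();
            for (int i = 0; i < ids.size(); i++) {
                if ((mask & (1 << i)) != 0) subset.add(ids.get(i));
            }
            List<Map<String, String>> profiles = new ArrayList<>();
            for (String id : subset) profiles.add(neighbors.get(id));
            if (subset.size() > best.size() && similar(self, profiles, threshold)) best = subset;
        }
        return best;
    }
}
```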
4 MPDG Application Architecture
In this section the architecture of a MPDG application is presented; it is depicted in Figure 2. A MPDG application consists of three essential parts: the middleware, the MPDG unit and the Domain Description unit. The middleware establishes the basis for a MPDG application. It is in charge of detecting other hosts in the mobile environment and provides a mechanism for sending messages to and receiving messages from other hosts which are within transmission range. The central element of a MPDG application is the MPDG unit. It is made up of a MPDG Algorithm entity and a MPDG Description entity. The MPDG Algorithms entity distinguishes between Local Grouping and Decentralized Grouping. In order to obtain a decentralized group, a two-tiered process is started. At first, each host selects from its neighbor hosts the subset of hosts which ought to be in the group. This set of hosts is called a Local Group. After each host has determined its local group, these groups are exchanged in a second
Fig. 2. Components of an MPDG application: the Domain Description unit (Domain Definition, Group Definition, Profile Definition), the MPDG unit (MPDG Meta Descriptions, i.e. Meta Group Description and Meta Profile Description, and MPDG Algorithms, i.e. Local Grouping and Decentralized Grouping), and the middleware (agent platform, JXTA, LEAP, etc.)
As the MPDG unit is totally domain independent, the structure of a profile or a group definition must be defined elsewhere. This is done by the Meta Profile Description and the Meta Group Description, which are elements of the MPDG Description entity. The Meta Group and Meta Profile Descriptions define what a group or profile definition consists of and specify optional and mandatory elements of a group or profile definition. In both meta descriptions there is a mandatory general part defining the name of the profile or group. The meta profile description encompasses a set of tags for profile entry definitions and specifies the structure of rules that can be declared in the profile definition of the Domain Description unit. The meta group description comprises abstract tag definitions for the size of a group, how a group is defined, and the properties required to become a group member. On top of the MPDG unit, the Domain Description unit is located. This unit adjusts the MPDG unit to a specific application domain. In the Domain Definition entity, domain-dependent knowledge is described, and in the profile and group definition entities the domain-specific profiles and group properties are defined. The Profile Definition specifies the structure of the profile for the domain, in accordance with the Meta Profile Description. The notion of a group varies from application to application. A group can be a few people with similar properties (the same hobby, profession, age, etc.) but also a set of machines with totally different capabilities. For that reason the Group Definition comprises the characteristics of the group that ought to be found. In this paper we concentrate on describing the MPDG Algorithms entity and refer to the description entities only where inevitable.
5 MPDG Algorithms
In this section the architecture of the algorithm entity is presented, the network model used is defined, and the algorithms for each layer are shown. The algorithm entity has a layered architecture and encompasses algorithms for initiator determination, virtual topology creation, local grouping and decentralized grouping.
Finally, we show some simulation results, in order to indicate how stable the generated groups are.
Fig. 3. Architecture of the MPDG Algorithm Entity: a Grouping Layer (Local Grouping, Decentralized Grouping) on top of a Virtual Topology Layer and an Initiator Determination Layer, accompanied by a Communication Monitoring Agent, all resting on the ad hoc middleware
5.1 MPDG Algorithm Entity
The most important part of an MPDG application is the Algorithm Entity (AE). The design of this essential entity is shown in Figure 3. The basis for the algorithm entity is an ad hoc middleware. The middleware is needed by the AE in order to send and receive messages in the dynamic environment. Furthermore, the middleware has to provide a lookup service to find new communication partners. The lowest layer of the AE is the Initiator Determination Layer, which assigns the initiator role to some hosts. An initiator is needed in order to guarantee that the algorithm of the next layer is not started by every host of the network. This layer does not determine one single initiator for the whole ad hoc network; it is sufficient if the number of initiator nodes is merely reduced. The Virtual Topology Layer is responsible for covering the graph G with another topology, e.g. a tree or a logical ring. This virtual topology is necessary to reduce the number of messages that are sent by the mobile hosts. First experiences show that a tree is the most suitable virtual topology, and therefore we will only address the tree approach in this paper. The next layer is the most important one, the Grouping Layer, which accomplishes both the local grouping and the decentralized grouping. Local grouping comprises the selection of hosts which are taken into account for global grouping. Decentralized grouping encompasses the exchange of the local groups with the goal of achieving a well-defined global group (see the definitions in Section 3.2). Furthermore, a Communication Monitoring Agent (CMA) exists. For each message that is sent or received, this agent registers the sender, the receiver, and the associated time. With these communication lists it becomes possible to estimate whether a mobile host is still in transmission range or not. Note that the mobility of the nodes is not a problem as long as the speed and direction of the nodes are roughly the same. Thus, if host A is able to communicate for five
minutes with host B, these two hosts must have a very low relative velocity. The results provided by the CMA are not always correct: if a mobile device is shut down by its user, this cannot be recognized or predicted by the CMA. If we proceed on the assumption that users only rarely shut down their devices, the CMA's results are solid.
5.2 Initiator Determination
Before the spanning tree is created, the initiators that are allowed to send the first creation messages must be determined. Without initiators, all hosts start randomly sending messages, with the result that a tree will never be created. We are not in search of one single initiator; we only want to guarantee that not all hosts start the initiation. There are two ways to determine the initiator, an active and a passive one. The active approach starts an election algorithm (see Malpani et al. [12]). These algorithms are rather complex, i.e. a lot of messages are sent, which is very time consuming. They guarantee that only one initiator is elected and, in case of link failures, that another host takes the initiator role. Such a procedure is neither appropriate nor necessary for MPDG, because the initiator is only needed once and it matters little if more than one initiator is present. Therefore, we opted for the passive determination method, which is similar to Gafni and Bertsekas [8]. With the passive method no message is sent in the beginning to determine an initiator. Since each host has an ID and knows all neighbor IDs, we only allow a host to be an initiator if its ID is larger than all IDs of its neighbors. The initiator is in charge of starting the virtual topology algorithm described in the next section. Unfortunately, fairness of the initiator determination process is not guaranteed. Thus, it could happen that a mobile device never takes the initiator role. At this time it cannot be said whether being the initiator is an advantage or not.
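A minimal sketch of this passive rule (function and variable names are hypothetical) simply compares a host's own ID with the IDs reported by the middleware for its one-hop neighbors:

# Passive initiator determination: a host becomes an initiator
# only if its ID is larger than the IDs of all its neighbours.
def is_initiator(own_id, neighbour_ids):
    return all(own_id > nid for nid in neighbour_ids)

# Example: host 7 with neighbours 3 and 5 starts the tree creation,
# host 5 with neighbour 7 stays passive.
print(is_initiator(7, [3, 5]))   # True
print(is_initiator(5, [7, 3]))   # False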
5.3 Virtual Topology Creation
Having confined the number of initiators, the graph G0 can be covered with a virtual topology (VT). Simulations showed that a spanning tree is a promising approach for a VT, and therefore we will only describe the spanning tree VT in this paper. A spanning tree spT(G) is a connected, acyclic subgraph containing all the vertices of the graph G. Graph theory (e.g. Diestel [5]) guarantees that for every connected G a spT(G) exists. Figure 1b shows a graph with one possible spanning tree.
The Algorithm
Each host keeps a spanning tree sender list (STSL). The STSL contains the subset of a host's neighbors belonging to the spanning tree. The initiator, determined in the previous section, sends a create-message furnished with its ID to all its neighbors. If a neighbor receives a create-message for the first time, this message is forwarded to all neighbors except for the sender of the create-message. The host adds each receiver to the STSL. If a host receives a message
Initialization for a node:
  STSL = null; initiator = false;
  sent = false; root = null;
  visitedHops = 0;

Start (initiator node):
  initiator := true;
  send CREATEMESSAGE to NEIGHBORS;

Receipt of a CREATEMESSAGE from host p:
  if( not sent )
    root = p; STSL += p;
    if( ++visitedHops < HOPS )
      send CREATEMESSAGE to NEIGHBORS;
    fi;
    sent = true;
  else
    STSL -= p;
  fi;
Fig. 4. PseudoCode of the spanning tree creation algorithm
from a host which is already in the STSL, that host is removed from the list. The pseudocode notation of this algorithm is shown in Figure 4. To identify a tree, the ID of the initiator is always added to each message. It may occur that a host already belongs to another tree. Under these circumstances the message is not forwarded any more, and the corresponding host belongs to two (or more) trees. In order to limit the tree size, a hop counter ch is enclosed in each message and decremented each time the message is forwarded. If the counter is equal to zero, the forwarding process stops. Note that with an increasing ch the time for building a group also increases, because ch corresponds to half the diameter3 dG of the graph G. By using a hop counter it may occur that a single host does not belong to any spanning tree, because all trees around it are already large enough, i.e. ch has been reached. The affiliation of that host is not possible, because tree nodes do not send messages once the hop counter's value is zero. When time elapses and a node notices that it still does not belong to a tree, this host starts an initiator determination. Two cases must be distinguished: in the first one the host is surrounded only by tree nodes, in the other case a group of isolated hosts exists. In both cases, the isolated host contacts all its neighbors by sending an init-message, and if a neighbor node already belongs to a tree it answers with a join-message. If no non-tree node is around, the single node arbitrarily chooses one of the neighbors and joins its tree by sending a join-agree-message; a join-refuse-message is sent to the other hosts. If another isolated host gets the init-message, an init-agree-message is returned, and the host that sent the init-message becomes the initiator and starts creating a new tree.
Evaluation
The main reason for creating a virtual spanning tree upon the given topology is the reduction of messages needed to reach an agreement. Let n be the number of vertices and e be the number of edges in a graph G. Then, there are
3 The diameter dG is the maximum of the distances between all possible pairs of vertices of a graph G.
2e − n + 1 messages necessary to create the spanning tree. If no tree were built and a host received a message, this message would have to be forwarded to all its neighbors, which again results in 2e − n + 1 messages for distributing the message through the graph. By overlaying the graph with a virtual spanning tree, the number of forwarded messages is reduced to n − 1. Determining after how many distributed messages the tree becomes profitable leads us to the amortization factor A = (2e − n + 1) / (2(e − n + 1)). If on average e = 2n, the amortization factor results in A = (3n + 1)/(2n + 2), which converges to 1.5 with increasing n. In the equation above, the tree maintenance costs are not taken into account. If a new host comes into transmission range or another host leaves, this is recognized by the ad hoc agent platform and the host is added to or removed from the neighbor list. If the neighbor list has changed, the tree has to be updated. A vanishing host is worse than an appearing one and can have more negative effects on the tree. If a host is added to the tree, attention must be paid to cycles that could appear. Therefore, the host is added only once to the STSL, which guarantees that no cycle is formed. If a host leaves the ad hoc network, all its neighbors are affected and the vanished host must be deleted from each neighbor host's STSL.
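As a quick numeric check of the amortization factor (a sketch, not taken from the paper), A can be evaluated for a few graph sizes under the average density assumed in the text:

# Amortization factor A = (2e - n + 1) / (2(e - n + 1)):
# number of broadcasts after which building the spanning tree pays off.
def amortization(n, e):
    return (2 * e - n + 1) / (2 * (e - n + 1))

for n in (10, 100, 1000):
    e = 2 * n                                   # average density e = 2n assumed above
    print(n, round(amortization(n, e), 3))      # 1.409, 1.49, 1.499 -> approaches 1.5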
5.4 Local Grouping – Optimizing the Local View
In this section algorithms are presented that determine the subset of neighbor hosts which initially belong to a host's group, called a local group. In order to guarantee that groups are not formed arbitrarily but bring a benefit to their members, a Group Profit Function is defined.
Definition 9 (Group Profit Function fGP). Let P(V) be the power set of the nodes of a graph G(V, E) and G a local group. The group profit function fGP: P(V) → R assigns a value to a group G ∈ P(V). This value reflects the benefit that emerges from group formation. If a new node vi is added to the group G, fGP must increase, fGP(G ∪ {vi}) > fGP(G), in order to justify the addition of vi.
The algorithm adds exactly one new local group member in each step. Initially, a host scans all known profiles and selects the one with the smallest distance to its own. If the group with these two points brings a greater benefit, the point is added and the group now has two members. Only one point may be added per step, because otherwise the shape of the group gets beyond control: a host A could add another host B in exactly the opposite direction in which a host D is added by host C. If more than one point is to be added, coordination is needed. The points that already belong to the group form a chain, because at all times only one point is added. In order to sustain this kind of line, we only allow the two endpoints to add new points. To coordinate them, the endpoints of the line may add new points alternately. If the right end has added a new point in step n, in step (n+1) it is the left side's turn to add a
point. The alternating procedure stops when one side is not able to find a new point. In such a case, only the other side continues to add points until no new point is found. If a host is allowed to add a point and there is one to add, it is not added automatically: the new point must bring a benefit according to Definition 9.
Fig. 5. The local grouping algorithm
Figure 5 illustrates the process of finding a local group, and Figure 6 contains the pseudocode of the algorithm. The points in the coordinate system in Figure 5 represent the destinations of users, and the dark black point is the point for which the local view is to be obtained. In image a) the grouping starts. In b) and d) new points are added on the right side of the line, and in c) and d) on the left side. In e) the new point is also added at the left side, although it would have been the right side's turn; there are no points within range there, so the left side has to find one. The last picture f) represents the complete group.
5.5 Decentralized Grouping – Achieving the Global View
In the previous section each host has identified the neighbor hosts that belong to its local group gi. These local groups must now be exchanged in order to achieve a global group. The algorithm presupposes no special initiator role; each host may start the algorithm, and it can even be initiated by more than one host simultaneously. The core of the algorithm is an echo algorithm, see [4]. Initially, an arbitrary host sends an EXPLORER message, with its local group information enclosed, to those of its neighbors that are elements of the spanning tree (the STSL, see Section 5.3). If a message arrives, the enclosed local group is taken and merged with the current local view of the host to obtain a new local view.
firstReferenceNode := currentPoint;
secondReferenceNode := currentPoint;
nextPoint := null;
localGroup := currentPoint;
currentProfit = profit_function( localGroup );
while( (nextPoint := getNearestProfilePoint(firstReferenceNode)) != null )
  futureProfit := profit_function( localGroup + nextPoint );
  if( futureProfit > currentProfit ) then
    localGroup += nextPoint;
    neighbors -= nextPoint;
    currentProfit = futureProfit;
    firstReferenceNode := secondReferenceNode;
    secondReferenceNode := nextPoint;
  fi;
elihw;

Fig. 6. PseudoCode of the Local Grouping Algorithm
The merging function tries to maximize the group profit function, i.e. if two groups are merged, those members of each group become members of the new group which together yield more profit than each single group. The new local view is forwarded to all neighbors except for the sender of the received message. If a node has no other outgoing edges and the algorithm has not terminated, the message is sent back to the sender. If more than one host initiates the algorithm and a host receives several EXPLORER messages, then only the EXPLORER messages from the host with the higher ID are forwarded (message extinction). The pseudocode of the group distribution is shown in Figure 7. Even when the algorithm in Figure 7 has terminated, it is still not guaranteed that each node has the same global view; in the worst case only the initiator node has a global view. For that reason, the echo algorithm has to be executed once more. In order to save messages, the echo messages need not be sent in the second run, because no further information gain is achieved. A critical point is to determine the termination of the grouping process. The algorithm terminates in at most 2 · dG = 4 · ch steps, and because the echo messages in the second run are not sent, this is reduced to 3 · ch. If a host receives this number of messages, the grouping is finished. But, due to the mobility, nodes come and go. Currently, the algorithm stops if a node receives the same local group information from all its neighbors; this local group information is assumed to be the global group information. To make sure that all group members have the same global view, the corresponding hosts can check this with additional confirmation messages; currently this part is considered optional.
Towards a General Approach to Mobile Profile Based Distributed Grouping Start (only if not ENGAGED): initiator := true; ENGAGED := true; N := 0; localGroup = getLocalGroup(); EXPLORER.add( localGroup ); send EXPLORER to STSL;
Receipt of an ECHO message: N := N+1; localGroup := merge( localGroup, ECHO.getLocalGroup() ); if N = |STSL| then ENGAGED := false; if( initiator ) then finish; else ECHO.setLocalGroup( localGroup ); send ECHO to PRED; fi; fi;
119
An EXPLORER message from host p is received: if ( not ENGAGED ) then ENGAGED := true; N := 0; PRED := p; localGroup := getLocalGroup() localGroup := merge(localGroup, EXPLORER.getLocalGroup()); EXPLORER.setLocalGroup(localGroup); send EXPLORER to STSL-PRED; fi; N := N+1; if ( N = |STSL| ) then ENGAGED := false; if( initiator ) then finish; else localGroup := merge(localGroup, EXPLORER.getLocalGroup()); ECHO.setLocalGroup(localGroup); send ECHO to PRED; fi; fi;
Fig. 7. PseudoCode of the Echo Algorithm including the group merging process
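The merge function used in Figure 7 is only constrained by Definition 9 (the profit must not decrease). One possible greedy realisation is sketched below; the names and the toy profit function are assumptions, and the sketch does not enforce all conditions of Definition 7.

# Greedy merge of two local groups under a group profit function.
def merge(group_a, group_b, profit):
    merged = set(group_a)
    # add members of the other group one by one, keeping only profitable additions
    for member in group_b:
        if member in merged:
            continue
        if profit(merged | {member}) > profit(merged):
            merged.add(member)
    return merged

# Example with a toy profit function that rewards group size up to 5 members:
profit = lambda g: len(g) if len(g) <= 5 else 5 - 0.1 * (len(g) - 5)
print(merge({'A', 'B'}, {'B', 'C', 'D'}, profit))   # {'A', 'B', 'C', 'D'}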
5.6 Group Stability
In this subsection the stability of the groups is evaluated. By stability we mean the time during which a group does not change, i.e. no other host is added and no group member leaves the group. This stability time is very important for our algorithms, because the group formation must be finished within this time. We developed a simulation tool to test how long the groups stay stable. The velocity of the mobile hosts is uniformly distributed in the interval [0; 5.2] km/h; the average velocity of the mobile hosts is 2.6 km/h. This seems to be the prevailing speed in pedestrian areas: some people do not walk at all (they look into shop windows, etc.), other people hurry from one shop to the next and therefore walk faster. Moreover, we assume a radio transmission radius of 50 meters. The left plot in Figure 8 shows this dependency. It shows that the time a group is stable decreases rapidly: a group of 2 people exists on average for 30 seconds, whereas a group of 5 people is only stable for 9 seconds. Nevertheless, a group that is stable for 9 seconds is still sufficient for our algorithms. The stated times are worst-case times, i.e. no group member leaves the group and no other host joins it. For the algorithms it does not matter if a group member leaves during the execution of the algorithm; the only problem is that this person cannot be informed about its potential group membership. In case a person joins the group, the profile information of this point must reach every other point in the group, which of course must also happen within the time the group is stable. Unfortunately, we do not have simulation results for groups with 10 to 20 members.
Fig. 8. The left plot depicts the time a group is stable as a function of the group size. The right plot shows the dependency of the group stability on the speed of the mobile hosts.
The group stability is furthermore affected by the speed of the mobile hosts: the faster the mobile hosts are, the more rapidly they cross the communication range. In our simulation environment we analyzed the stability of the groups as a function of the speed, which is shown in the right plot of Figure 8. In this simulation we investigated how long a group of three people is stable at different velocities. For the simulation the chain algorithm is again used, and the transmission radius is 50 meters. The right plot of Figure 8 shows that the group is stable for more than 40 seconds if the members have a speed of 1 km/h. The situation changes when the speed increases: if all members walk fast (5 km/h), the group is only stable for approximately 10 seconds. Up to now we do not know why the stability at first decreases rapidly and why the decrease levels off at a speed of about 2.8 km/h.
6 Conclusion
In this paper we presented a class of ad hoc applications called Mobile Profile based Distributed Grouping (MPDG). Each mobile host is endowed with its user's profile, and while the user walks around, clusters are to be found which are composed of hosts with similar profiles. The architecture of an MPDG application is shown, which basically is made up of an MPDG description entity, which makes the MPDG unit domain independent, and an algorithm entity, which is responsible for local grouping and distributed grouping. At first, each host has to find its local group, which consists of all neighbor hosts with similar profiles. Finally, the local groups are exchanged and a global group is achieved. Simulation results show that the groups are stable long enough to run the algorithms. We simulated a first MPDG application, a taxi-sharing scenario where potential passengers with similar destinations form a group [14]. In the future we will apply the MPDG idea to further domains, e.g. the manufacturing or lifestyle area.
References
1. N. Badache, M. Hurfin, and R. Macedo. Solving the consensus problem in a mobile environment. Technical report, IRISA, Rennes, 1997.
2. S. Banerjee and S. Khuller. A clustering scheme for hierarchical control in multi-hop wireless networks. Technical report, University of Maryland, 2000.
3. S. Basagni. Distributed clustering for ad hoc networks. In Proceedings of the IEEE International Symposium on Parallel Architectures, Algorithms, and Networks (ISPAN), Perth, pages 310–315, 1999.
4. E. J. H. Chang. Echo algorithms: Depth parallel operations on general graphs. IEEE Transactions on Software Engineering, SE-8(4):391–401, July 1982.
5. R. Diestel. Graph Theory, volume 173 of Graduate Texts in Mathematics. Springer-Verlag, New York, 2nd edition, February 2000.
6. D. Fasulo. An analysis of recent work on clustering algorithms. Technical report, University of Washington, 1999.
7. C. Fraley and A. E. Raftery. How many clusters? Which clustering method? Answers via model-based cluster analysis. The Computer Journal, 41(8), 1998.
8. E. M. Gafni and D. P. Bertsekas. Distributed algorithms for generating loop-free routes in networks with frequently changing topology. IEEE Transactions on Communications, COM-29(1):11–18, January 1981.
9. K. P. Hatzis, G. P. Pentaris, P. G. Spirakis, V. T. Tampakas, and R. B. Tan. Fundamental control algorithms in mobile networks. In ACM Symposium on Parallel Algorithms and Architectures, pages 251–260, 1999.
10. E. Kolatch. Clustering algorithms for spatial databases: A survey. Technical report, Department of Computer Science, University of Maryland, College Park, 2001.
11. R. Maitra. Clustering massive datasets. In Statistical Computing at the 1998 Joint Statistical Meetings, 1998.
12. N. Malpani, J. Welch, and N. Vaidya. Leader election algorithms for mobile ad hoc networks. In Proceedings of the Fourth International Workshop on Discrete Algorithms and Methods for Mobile Computing and Communications, 2000.
13. G.-C. Roman, Q. Huang, and A. Hazemi. Consistent group membership in ad hoc networks. In International Conference on Software Engineering, 2001.
14. C. Seitz and M. Berger. Towards an approach for mobile profile based distributed clustering. In Proceedings of the International Conference on Parallel and Distributed Computing (Euro-Par), Klagenfurt, Austria, August 2003.
A Dynamic Scheduling and Placement Algorithm for Reconfigurable Hardware Ali Ahmadinia, Christophe Bobda, and Jürgen Teich Department of Computer Science 12, Hardware-Software-Co-Design, University of Erlangen-Nuremberg, Am Weichselgarten 3, 91058 Erlangen, Germany {ahmadinia, bobda, teich}@cs.fau.de http://www12.informatik.uni-erlangen.de
Abstract. Recent generations of FPGAs allow run-time partial reconfiguration. To increase the efficacy of reconfigurable computing, multitasking on FPGAs has been proposed. One of the challenging problems in such multitasking systems is online template placement. In this paper we describe how existing algorithms work and propose a new free space manager, which is one main part of the placement algorithm. The decision where to place a new module depends on its finishing-time mobility; the proposed algorithm is therefore a combination of scheduling and placement. Simulation results show a better performance compared to existing methods.
1 Introduction
A reconfigurable computing system is usually composed of a host processor and a reconfigurable device such as an SRAM-based Field-Programmable Gate Array (FPGA) [4]. The host processor can map a code as an executable circuit on the FPGA, which is denoted as a hardware task. With the ability of partial reconfiguration in the new generation of FPGAs, multiple tasks can be configured separately and executed simultaneously. This multitasking and partial reconfiguration of FPGAs increases the device utilization, but it also necessitates well-thought-out dynamic task placement and scheduling algorithms [5]. Such algorithms strive to use the device area as efficiently as possible as well as to reduce the total task configuration and running time, but the existing algorithms do not achieve high performance [1]. Placement methods have been developed and perfected such that hardware tasks are placed on the reconfigurable hardware quickly and are furthermore tightly packed to use the available area efficiently. However, most such algorithms are static in the sense that the same placement and scheduling rules apply to every single arriving task and that the entire reconfigurable area is available for the placement of every task. The scope of the present paper hence consists of developing a dynamic task scheduling and placement method on a device divided into slots. More precisely, the FPGA is divided into separate slots, then each
of these slots will accommodate only those tasks that end their execution at “nearly the same time”. This 1-D FPGA partitioning as well as the similarity of end times are two parameters that are dynamically varied during runtime. These parameters must then be controlled by an appropriate function in order to reduce the total execution time and the number of rejected tasks. Finally, relevant statistics are collected and the performance of this newly developed algorithm is then compared experimentally to that of existing ones. In the subsequent sections previously existing methods and algorithms will be briefly described, the motivation behind the proposed scheduling and 1-D partitioning approach will be explained and the developed algorithm will be described in detail. Finally, comparative results will be presented and analyzed.
2 Online Placement
The problem of packing modules on a chip is similar to the well-studied problem of two-dimensional bin-packing, which is an extension of classical one-dimensional bin-packing [7][8]. The one-dimensional bin-packing problem is similar to placing modules in rows of configurable logic, as done in the standard cell architecture. The two-dimensional bin-packing problem applies when the operations to be loaded as modules are rectangles which can be placed anywhere on the chip [1]. In the context of online task placement on a reconfigurable device, the nature of the operations and hence the flow of the program are not known in advance; the configuration of hardware tasks on the FPGA must be done on the fly. To describe the placement problem clearly, we first define our task model:
Definition 1 (Task Characteristics). Given a set of tasks T = {t1, t2, …, tr} such that ∀ tk ∈ T: tk = (ak, ek, dk, wk, hk), where
ak = arrival time of task tk,
ek = execution time of task tk,
dk = deadline time of task tk,
wk = width of task tk,
hk = height of task tk.
This set of tasks must be mapped onto an FPGA of fixed size, according to the time and area constraints of the tasks. In fact, each task will be mapped to a module, which is a partial bitstream. This partial bitstream occupies a determined amount of logic blocks on the device and has a rectangular shape. Placement algorithms are therefore developed that must determine the manner in which each arriving task is configured. These algorithms must, on the one hand, use the available free placement area efficiently and, on the other hand, execute quickly. However, there most often exists a trade-off between these two requirements, as fast placement algorithms are usually of low quality, whereas those that use the chip area very efficiently compute slowly.
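For illustration, the task model of Definition 1 can be captured directly in a small data structure; this is a sketch with assumed field names, not the authors' implementation.

# Task model of Definition 1: t_k = (a_k, e_k, d_k, w_k, h_k).
from dataclasses import dataclass

@dataclass
class Task:
    arrival: int     # a_k, arrival time
    execution: int   # e_k, execution time
    deadline: int    # d_k, latest allowed end time
    width: int       # w_k, columns occupied on the FPGA
    height: int      # h_k, rows occupied on the FPGA

    def asap_end(self):
        return self.arrival + self.execution   # earliest possible end time

t = Task(arrival=4, execution=10, deadline=30, width=12, height=8)
print(t.asap_end())   # 14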
In an online scenario, hardware tasks arrive, are placed on the hardware and end their execution at any possible time. This situation leads to a complex space allocation on the FPGA. In order to determine where new tasks can be placed, the state of the FPGA, i.e. the free area, must be managed. This free space management aims to reduce the number of possible locations for newly arriving tasks and to increase placement efficiency as well. Two such free space management algorithms have been developed in [1] and will be compared to our approach here. Free space management is the first main part of online placement algorithms. The second part involves fitting the new tasks inside the empty rectangles: once the free area is managed and the possible locations for the placement of the new task are determined, a choice has to be made at which one of these locations the task will be configured. Multiple such fitting heuristics have been developed in [1]: First Fit, Best Fit and Bottom Left.
2.1 Free Space Management
The KAMER (Keeping All Maximum Empty Rectangles) method has the highest quality of placement compared to other methods [1]. It is therefore used as the baseline for comparison against other algorithms in terms of the quality of placement that is lost to the benefit of the amount of speed-up that is gained. The KAMER algorithm should hence be described in order to understand why it has such a high placement quality and also why it requires high computation times. In order to decide where a newly arriving task should be placed, the KAMER algorithm partitions the empty area on the reconfigurable hardware by keeping a list of empty rectangles. Moreover, these are Maximal Empty Rectangles (MERs), meaning that they are not contained within any other empty rectangle. The arriving task is then placed at the bottom left corner of one of the existing MERs; the choice of the MER depends on the fitting heuristic that is being used. Figure 1 illustrates the case where the empty free space is partitioned into four MERs; their bottom left corners are denoted by an X. An alternative to the KAMER free space manager is a method that keeps non-overlapping free rectangles. These empty rectangles are not necessarily maximal, and hence some placement quality is lost. The advantage is, though, that this algorithm executes faster and is more easily implemented. An example of non-overlapping partitioning of the empty region is shown in Figure 2. It should be self-evident that in this case of free space management, the empty area can be partitioned in more than one way. Different heuristics can be used to choose between the different possible non-overlapping rectangles.
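A minimal sketch of the fitting step (hypothetical, with MERs represented as (x, y, width, height) tuples) shows how, for example, a first-fit choice places the task at the bottom-left corner of the first maximal empty rectangle it fits into:

# First-fit placement into a list of maximal empty rectangles (MERs).
# Each MER is (x, y, w, h); the task is placed at the MER's bottom-left corner.
def first_fit(mers, task_w, task_h):
    for (x, y, w, h) in mers:
        if task_w <= w and task_h <= h:
            return (x, y)        # bottom-left corner of the chosen MER
    return None                  # task must be queued or rejected

mers = [(0, 0, 10, 5), (20, 0, 30, 40)]
print(first_fit(mers, 15, 20))   # (20, 0)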
Fig. 1. A free space partition into maximal empty rectangles.
Fig. 2. A free space partition into non-overlapping empty rectangles.
2.2 Quality
The KAMER placement algorithm is indeed the highest quality method to partition the free space since the rectangles kept in the list and checked for placing the arriving task are maximal and therefore, offer the largest possible area where the new tasks can be accommodated [2]. Keeping all maximum empty rectangles clearly avoids a
high fragmentation of the empty space that can lead to the situation where a new task cannot be placed even though there is sufficient free area available. The reason for the quality loss in the keeping-non-overlapping-rectangles method is that each empty rectangle is contained within one MER. Accordingly, if a task can be placed inside one of these empty rectangles, it can also be placed inside the MER that contains it. The reverse is obviously not true. Therefore, this second method of free space management results in a higher fragmentation of the free space, and some placement quality is lost.
2.3 Complexity
The KAMER algorithm has to be executed every time a new task is placed on the FPGA as well as every time a task ends its execution and is removed. More precisely, at the moment of the new task's placement, all those MERs that overlap with it must be divided into smaller MERs, and at the moment of a task's removal, the overlapping MERs must be merged into larger ones. As an example, Figure 3 illustrates the partitioning of the free space into five distinct MERs whose bottom left corners are identified by A, B, C, D and E. As the newly arriving task, shown in shaded color, is placed inside MER D, it overlaps with 4 of the 5 existing MERs: B, C, D and E. Each of the latter must then be split into smaller ones. Figure 4 illustrates how MER B is divided into 4 smaller maximal empty rectangles. In the same manner, MER D is split into 2, and MERs C and E are both split into 3 smaller maximal empty rectangles. In this case, the total number of MERs after insertion of the new task increases from 5 to 13. This indicates that, in the KAMER algorithm, many MERs must be checked for overlap with the new task, and furthermore many of them must be divided into smaller MERs. In a similar fashion, after the deletion of a task, a considerable number of MERs must be merged into a few larger ones. Thus, in addition to the increased running time, there is a quadratic space requirement for keeping the empty rectangles in a list; this method has to manage O(n²) rectangles for n placed tasks. It is obvious that the KAMER algorithm, although offering high quality placement, necessitates an important amount of computation and memory, and hence slows down the overall program operation. Consequently, one of the aims of our integrated scheduling and placement algorithm is to execute faster than KAMER while maintaining a certain quality of placement as well. In the second free space management method, since the empty rectangles are non-overlapping, only the rectangle where the new task is placed has to be split into two smaller ones. Therefore, we have O(n) complexity; the number of empty rectangles considered for placing each hardware task is linear in the number of running tasks on the FPGA.
Fig. 3. Placement of an arriving task at the bottom left corner of one of the MERs.
Fig. 4. Changes which are needed in MER B after placing the new module in the bottom left corner of MER D.
3 An Integrated Scheduling and Placement Algorithm
The aim of this work is to develop an integrated task scheduling and placement algorithm including a 1-D partitioning of the reconfigurable array. In fact, a new data structure for the management of free space for online placement is developed. Accordingly, the FPGA is divided into slots and the arriving tasks are placed inside one of the slots depending on their execution end time. Moreover, the width of the slots is varied during runtime in order to improve the overall quality of placement. There are two main parameters in this algorithm: the first one determines the closeness of end times for tasks put into one slot, and the other one defines the width of the area partitioning. A proper function has to be implemented to govern each of these parameters in order to maximize the quality of task placement. The implemented algorithm will be described in detail and shown to require less memory and computation time than its KAMER counterpart.
3.1 A New Free Space Manager
Unlike the KAMER algorithm, which has a quadratic memory requirement, our placement algorithm requires linear memory. Instead of maintaining a list of empty rectangles where the arriving task can be placed, we maintain exactly two horizontal lines, one above and one below the placed running tasks, as depicted in Figure 5. For storing the information of each horizontal line, we use a separate linked list. In online placement, all of the fitting strategies proposed so far [1][2] place a newly arriving task adjacent to the already placed modules, so as to minimize fragmentation; therefore these two horizontal lines can be determined. As we place new tasks above horizontal line_1 or below horizontal line_2, there should not be any considerable free space between these two lines, in order to use the area as efficiently as possible. For example, as shown in Figure 5, if module 6 is removed earlier than modules 2, 3, 10 and 11, the area occupied by module 6 will be wasted. To avoid such cases, we suggest placing those tasks beside each other that will finish their execution nearly simultaneously. This task clustering and scheduling will be detailed in the next section. The placement algorithm is implemented in such a way that arriving tasks are placed above the currently running tasks as long as there is free space. Once no empty space is found above the running tasks, the new ones start to be placed below them, and so on. As already mentioned, this implementation requires linear memory. Furthermore, the addition and deletion of tasks involves updating and searching through lists, which is a much faster operation than looking for, merging and dividing maximal empty rectangles. Also, placing the arriving tasks alternately above and then below the running tasks ensures an efficient use of the available area.
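A simplified sketch of such a free space manager follows; the data layout (each horizontal line kept as a list of (x_start, x_end, y) segments) and the single-segment search are assumptions for illustration, not the authors' exact linked-list implementation.

# Sketch of the two-horizontal-line free space manager.
def find_slot_above(line_segments, task_w, task_h, device_h):
    # scan the upper contour for a single segment wide enough for the task
    for (x0, x1, y) in line_segments:
        if x1 - x0 >= task_w and y + task_h <= device_h:
            return (x0, y)      # lower-left corner for the new task
    return None                 # no room above: try below horizontal line_2

line_1 = [(0, 15, 40), (15, 60, 25), (60, 80, 35)]
print(find_slot_above(line_1, 30, 20, 120))     # (15, 25)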
Fig. 5. Using horizontal lines to manage free space.
3.2 Task Scheduling
As mentioned before, we need a task clustering to reduce fragmentation between the two horizontal lines. To explain this clustering, we first define the required specifications of real-time tasks. Each arriving task tk ∈ T is, amongst other parameters, defined by its arrival time ak and its execution time ek (Definition 1). Hence, if a particular task can be placed on the chip at the time of its arrival, it will end its execution at time ak + ek. Each task also has a deadline dk assigned to it, which is greater than ak + ek and sets a limit on how long the task can reside in the running process. Next we define a mobility interval for each task according to its end times. The mobility interval is defined as mobility = [ASAP_end; ALAP_end], where ASAP_end = ak + ek is the as-soon-as-possible task end time and ALAP_end = dk is the as-late-as-possible task end time. Therefore, each task, once placed on the array, will finish its execution at a time belonging to its mobility interval. For clustering tasks, we first define clusters:
Definition 2 (Clusters or Slots). An FPGA consists of a two-dimensional CLB (Configurable Logic Block) array with m rows and n columns. The columns are partitioned into contiguous regions, where each region is called a cluster or slot.
The number of slots can be chosen to be different, but in our case, according to the FPGA size and the size of tasks, we have divided the area into three slots. Each task’s mobility interval is then used to determine in which cluster the task should be placed. Accordingly, we compute successive end time intervals denoted end_time1,
end_time2, end_time3, and so on. The details of their computation are explained in the scheduling pseudocode. If a task's mobility interval overlaps with the end_time1 interval, as shown in Figure 6, the task will be placed inside the first slot; if it overlaps with end_time2, it will be placed inside the second slot, with end_time3 inside the third slot, with end_time4 inside the first slot again, and so on. This situation is illustrated in Figure 7. The motivation behind this clustering method, as can be observed in Figure 7, is to have all tasks with similar end times placed next to each other (the number in each module indicates to which end_time interval the module belongs). In this way, as the tasks belonging to the same end time interval end their execution, a large empty space is created at a precise location. This newly created empty space will then be able to accommodate future, perhaps larger tasks.
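The assignment rule can be sketched as follows; the helper names and the representation of end-time intervals as (lo, hi) pairs are assumptions for illustration only.

# Assign a task to a slot: a task goes to the slot of the first end_time interval
# that overlaps its mobility interval [ASAP_end, ALAP_end].
def assign_slot(asap_end, alap_end, end_time_intervals, n_slots=3):
    # end_time_intervals: list of (lo, hi) pairs for end_time1, end_time2, ...
    for j, (lo, hi) in enumerate(end_time_intervals, start=1):
        if asap_end < hi and lo < alap_end:     # mobility interval overlaps end_time_j
            return ((j - 1) % n_slots) + 1      # slot 1..N, wrapping as described above
    return None

print(assign_slot(14, 30, [(0, 20), (20, 40), (40, 60)]))   # 1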
Fig. 6. The tasks with mobility intervals overlapping with the same end time interval.
3.3 Optimizing Scheduling
In order to optimize the quality of task placement, or in other words to reduce the number of rejected tasks, benchmarking had to be performed to determine how large the successive end time intervals should be. As shown in the pseudocode for computing the end_time intervals, we divide Total_Interval into three equal ranges. Moreover, a function had to be implemented to vary the ratio of the end_time intervals to Total_Interval during runtime. The idea is that, when an excessive number of tasks are being placed inside a single cluster, the length of the corresponding end time intervals should be reduced so that tasks can continue being placed inside the remaining two clusters. Consequently, we define an input rate for each of the clusters as follows:
Fig. 7. Placement of tasks inside clusters according to their mobility intervals.
Input rate = (# of Tasks) / T    (1)
where # of Tasks is the number of tasks placed inside the corresponding cluster during the period T. Now, as the input rate of one of the clusters becomes higher than some predetermined threshold value, the length of the corresponding end time intervals is reduced and kept at that value for some time t. This process was simulated and repeated for a vast range of values for the period T, the threshold and the hold time t. The steps of this scheduling and its optimization are presented below, where N is the number of slots on the device:
i=0;   // number of arrived tasks
k=1; Maxk=0;
Task_Arriving:
  i=i+1;
  Min= ASAP_endi; Max= ALAP_endi;
  for j=1 to i {
    Min= min( Min , ASAP_endj ); Max= max( Max , ALAP_endj );
  }
  Min= max( Min , Maxk );
  Total_Interval= Max - Min;
  for s=0 to N-1
    end_time(k+s) = ( Min + (Total_Interval/N)*s , Min + (Total_Interval/N)*(s+1) );
  if( t mod T = 0 )   // T: period of time, t: current time
  {
    nov=0;   // number of overloaded slots
    for s=1 to N {
      Tasks = { ti | ti ∈ end_time(M) and M mod N = s };
      Input_rate(s) = n(Tasks) / T;
      if( Input_rate(s) > Threshold ) { as=1; nov=nov+1; }
      else as=2;
    }
    for s=0 to N-1
      end_time(k+s) = ( Min + (Total_Interval/(2N-nov)) * Σ_{i=1..s} ai ,
                        Min + (Total_Interval/(2N-nov)) * Σ_{i=1..s+1} ai );
  }
  if ( Min > t )   // t: current time
  {
    k=k+3; Maxk=Max;
  }
  Go to Task_Arriving
3.4 Optimizing Partitioning
As a new task arrives, its mobility interval is computed, the overlapping end time interval is determined, and the task is assigned to the corresponding cluster. The situation might and will arise where that cluster is full and the task has to be queued until some tasks within that same cluster end their execution so that the queued task can be placed. However, there might be enough free space in the remaining two clusters
to accommodate that queued task. Hence, to improve the quality of the overall algorithm, the cluster widths are made dynamic and can increase and decrease during runtime when needed. This situation is illustrated in Figure 8. A proper function had to be found to govern this 1-D partitioning of the reconfigurable hardware. For this purpose, it was counted how many of the queued tasks waiting to be configured on the device are assigned to each cluster. Hence, three variables (queue1, queue2, queue3) keep track of the number of tasks in the queue for each of the three clusters. The width of the clusters is then set proportionally to the values of these three variables. For example, if during runtime we have the situation queue1=4, queue2=2, queue3=2, the width of the first cluster should be set to half and the widths of the second and third clusters should both be set to one quarter of the entire array width. The widths are, however, not changed instantly but rather gradually, by some predetermined value at each time unit. This method for the 1-D space partitioning indeed proved to be the best one in reducing the overall number of rejected tasks.
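The proportional width rule can be sketched as follows (hypothetical names; the gradual per-time-unit adjustment is reduced to a single target computation here):

# Set cluster widths proportionally to the number of queued tasks per cluster.
def target_widths(queue_lengths, total_width):
    total = sum(queue_lengths)
    if total == 0:
        return [total_width // len(queue_lengths)] * len(queue_lengths)
    return [total_width * q // total for q in queue_lengths]

print(target_widths([4, 2, 2], 80))   # [40, 20, 20] -> half / quarter / quarter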
Fig. 8. Dynamic 1-D array space partitioning.
4 Experimental Results
Our cluster-based algorithm was compared to the KAMER algorithm in terms of how fast the two execute on the one hand, and how many tasks get rejected on the other.
Since the main idea was to compare these two performance parameters, the generation of tasks was kept as simple as possible. Late arriving tasks were not taken into account, only one new task arrives at each clock cycle, and, once placed, a task's execution cannot be aborted. Also, at every time unit or clock cycle, the algorithm tries to place the newly arriving task on the device, checks for tasks that ended their execution so they can be removed, and finally all queued tasks are checked for placement. All simulations were performed for a chip size of 80x120, a 2-dimensional CLB array corresponding to the Xilinx XCV2000E device. In order to evaluate the improvement in the overall computation time of our algorithm compared to KAMER, we simulated the placement of 1000 tasks and measured the time in milliseconds both programs took to execute. This was done for different task sizes and shapes, precisely for tasks with width and height uniformly distributed in the intervals [10, 25], [15, 20] and [5, 35]. For each task size range, the simulation was repeated 50 times and the overall average of the execution times was computed. The obtained results are summarized by the graph in Figure 9. For the different task sizes we observe an improvement of 15 to 20 percent compared to the execution time of the KAMER algorithm. This can be observed in Figure 10, where our algorithm's execution time is presented as a fraction of the time the KAMER algorithm takes to execute.
Fig. 9. Algorithm execution times as an average of 50 measurements.
For the optimized scheduling, the conclusion was that, by varying the width of the end time intervals, slightly fewer tasks were rejected than in the case where the width was held constant at a single value. In fact, the best performance was observed when the lengths of the end time intervals were set to be distributed uniformly over the mobility interval range. The percentage of rejected tasks was 15.5% for KAMER and 16.2% for our cluster-based method. This is because in the KAMER algorithm, where the entire chip area is available for all tasks to be placed, tasks with similar end times are most often separated from each other. Once these tasks end their execution, small empty spaces that are distant from each other are created, and although there might be enough total free space to accommodate a new task, it might get rejected since it cannot be placed in any of those free locations.
Fig. 10. Fraction of execution time of the KAMER algorithm.
5 Conclusion
In this paper we have discussed existing online placement techniques for reconfigurable FPGAs. We suggested a new dynamic task scheduling and placement method and conducted experiments to evaluate our algorithm and a previous one. We reported on simulations that show an improvement of up to 20% in placement performance compared to [1]. Also, the quality of placement of this method is comparable to the KAMER method, with nearly the same percentage of rejected tasks. Concerning further work, we plan to develop an online scheduling algorithm to minimize task rejections and to take into consideration the dependencies between tasks. Also
we intend to investigate the online placement scenario further and make a competitive analysis against the optimal offline version of placement [6].
References
1. Kiarash Bazargan, Ryan Kastner, and Majid Sarrafzadeh. Fast Template Placement for Reconfigurable Computing Systems. In IEEE Design and Test of Computers, volume 17, pages 68–83, 2000.
2. Ali Ahmadinia and Jürgen Teich. Speeding up Online Placement for XILINX FPGAs by Reducing Configuration Overhead. To appear in Proceedings of the 12th IFIP VLSI-SOC, December 2003.
3. Herbert Walder, Christoph Steiger, and Marco Platzner. Fast Online Task Placement on FPGAs: Free Space Partitioning and 2-D Hashing. In Proceedings of the 17th International Parallel and Distributed Processing Symposium (IPDPS) / Reconfigurable Architectures Workshop (RAW). IEEE-CS Press, April 2003.
4. Grant Wigley and David Kearney. Research Issues in Operating Systems for Reconfigurable Computing. In Proceedings of the 2nd International Conference on Engineering of Reconfigurable Systems and Architectures (ERSA). CSREA Press, Las Vegas, USA, June 2002.
5. Oliver Diessel and Hossam ElGindy. On scheduling dynamic FPGA reconfigurations. In Kenneth A. Hawick and Heath A. James, eds., Proceedings of the Fifth Australasian Conference on Parallel and Real-Time Systems (PART'98), pages 191–200, Singapore, 1998. Springer-Verlag.
6. Sándor Fekete, Ekkehard Köhler, and Jürgen Teich. Optimal FPGA Module Placement with Temporal Precedence Constraints. In Proc. of Design Automation and Test in Europe, IEEE-CS Press, Munich, Germany, 2001, pages 658–665.
7. E. G. Coffman, M. R. Garey, and D. S. Johnson. Approximation algorithms for bin packing: a survey. In D. Hochbaum, editor, Approximation Algorithms for NP-hard Problems, pages 46–93. PWS Publishing, Boston, 1996.
8. E. G. Coffman Jr. and P. W. Shor. Packings in Two Dimensions: Asymptotic Average-Case Analysis of Algorithms. Algorithmica, 9(3):253–277, March 1993.
Definition of a Configurable Architecture for Implementation of Global Cellular Automaton Christian Wiegand, Christian Siemers, and Harald Richter Technical University of Clausthal, Institute for Computer Science, Julius-Albert-Str. 4, 38678 Clausthal-Zellerfeld, Germany {wiegand|siemers|richter}@informatik.tu-clausthal.de
Abstract. The realisation of Global Cellular Automata (GCA) using a comparatively high number of communicating finite state machines (FSM) leads to high communication effort. Inside configurable architectures, fixed numbers of FSMs and fixed bus widths result in a granularity that makes mapping larger GCA to these architectures even more difficult. This paper presents a configurable architecture that supports mapping a GCA into a single Boolean network in order to avoid the increasing communication effort and to achieve scalability as well as high efficiency.
1 Introduction
A Cellular Automaton (CA) is defined as a finite set of cells with additional characteristics. The finite set is structured as an n-dimensional array with well-defined coordinates of each cell and with a neighbourhood relation. Each cell is capable of reading and utilising the state of its neighbouring cells. As the cells implement a (synchronised) finite state machine, all cells will change their states with each clock cycle, and all computations are performed in parallel. Data and/or states from non-neighbouring cells are transported stepwise from cell to cell when needed. Useful applications to be implemented within CA consist of problems with high degrees of data locality. Mapping a CA to real hardware – whether configurable or fixed – shows linear growth of communication lines with the number of cells. These links are fixed and of short length, resulting in limited communication effort to implement a CA. If the complexity of the functionality is defined by the RAM capacity needed to realise this function inside memory – this is normally the case inside configurable, look-up-table-based structures – the upper bound of the complexity will grow exponentially with the number of inputs and linearly with the number of outputs. The number of input variables is the sum of all bits coding the states of all neighbouring cells as well as the own state. Cell complexity is therefore dominated by the number of communication lines. The concept of the Global Cellular Automaton (GCA) [1] overcomes the limitations of CA by providing connections of a cell not only to neighbouring cells but to any cell in the array. The topology of a GCA is therefore no longer fixed; GCA enable application-specific communication topologies, even with runtime reconfiguration. The number of communication lines per cell might be fixed by an upper limit.
As the number of possible links between k cells grows with k², the number of realised communication lines per cell will also grow with order 2. The complexity of a single cell and the Boolean function inside will depend on the number of communication inputs, as discussed for cellular automata. If a GCA is mapped to a reconfigurable architecture like an FPGA, each cell must be capable of realising any Boolean function of maximum complexity. If the cells are mapped to a reconfigurable array of ALUs, each with local memory, each cell may integrate arbitrarily complex functionality. The communication effort grows with the square of the number of cells, and the granularity of the circuit is defined by the number of cells and the bit width of the communication links between them. This architecture is well suited to realise a GCA if the number of cells and the bit width fit well, because even complex computations might be performed inside one ALU. The disadvantage of this approach is that the cycle time is deterministic but not bounded, because any algorithm could be realised within one ALU but might take several cycles to perform. Even worse, mapping GCA with non-fitting characteristics will be difficult if not impossible. Mapping the GCA to another type of reconfigurable cell array, each cell with programmable Boolean functionality, results in cells capable of computing data from every other cell including their own state. This means that all binary coded states from all cells might form the input vector of this function, while the output vector must be capable of coding the state of the cell. Consequently, the complexity of the single cell will grow exponentially with the input vector size, while communication will grow in polynomial order. The approach in this paper presents an architecture capable of realising a GCA as a single Boolean network, where the output vector at time tn forms part of the input vector for the next state at time tn+1. This avoids the complexity caused by the communication lines, which is important for any reconfigurable architecture. Even more important, this architecture makes no assumption about the granularity; only the resulting complexity of the GCA is limited. The remainder of the work contains the definition of the architecture in chapter 2. The introduced structure is capable of containing Boolean functions with large numbers of input and output lines. Chapter 3 discusses the mapping of GCA to this architecture and presents an example for realising an algorithm in a GCA and mapping it to the introduced architecture. Chapter 4 finally gives an outlook on future work.
2
A Reconfigurable Boolean Network for Large Functions
To design a reconfigurable Boolean network, one of two basic architectures is normally used; both are discussed below. The function to be implemented may be defined completely by storing all values inside a RAM memory. The input vector forms the address bit vector and addresses a well-defined memory cell for any combination. The content of this memory cell defines the function at this point, and the data bus lines form the output vector. This is known as the look-up table (LUT) architecture. The most important advantages of this architecture are its simplicity, the simple reconfigurability, the high density of memory circuits and the fixed (and fast) timing.
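As a minimal illustration of the LUT principle (added here for clarity; the function and names are chosen freely and are not from the actual design), the stored table is simply indexed by the integer value of the input vector:

```python
# Illustrative sketch of the LUT principle: the input vector is used as a RAM
# address, and the word stored at that address is the output vector.

def make_lut(n_inputs, function):
    """Precompute the output word for every one of the 2**n_inputs addresses."""
    return [function(address) for address in range(2 ** n_inputs)]

def lut_eval(lut, input_vector):
    return lut[input_vector]      # a single memory read, fixed and fast timing

# Example: a 4-input function returning the population count of its inputs.
lut = make_lut(4, lambda a: bin(a).count("1"))
assert lut_eval(lut, 0b1011) == 3
```

The table has 2**n_inputs entries, which is exactly the exponential growth in memory cells discussed in the following paragraph.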
The number of memory cells, growing exponentially with the input vector size, is of course disadvantageous and limits the practical use of LUT structures to small functions. The second possibility to implement any Boolean function inside a reconfigurable architecture is to use a configurable network consisting of two or k stages. This mostly minimises the use of gates to implement the functionality, and theory has developed representations (e.g. Reed-Muller logic) as well as algorithms for minimising logic and partitioning it over several stages. The advantage of this approach is that a minimised number of gates is used to implement the function. Especially fixed implementations are well supported, but for reconfigurable architectures the effort again grows exponentially with the input vector size (although at different rates compared to the LUT-based architecture).
2.1 Introducing the New Architecture
To combine the advantages of the first architecture – high degree of integration, simplicity of the circuit – with the reduced number of gates of the second approach, the following approach is considered in this paper. The basic idea consists of a balanced combination of storing functionality inside RAM-based memory and introducing three stages inside the architecture to reduce memory size and complexity.
2.1.1 Three Stage Approach
First Stage
The input vector of the Boolean function is represented by the input lines to the complete network. The first stage consists of several memory arrays (or ICs) in parallel, addressed by the input lines. The input vector is partitioned and the parts are each mapped to one corresponding memory array. The so-called minterms of the application, derived from logic minimisation e.g. using Quine-McCluskey or Espresso [2], are stored inside these memory arrays of the first stage. Each part of the array stores a well-defined representation of the (partial) input vector with the actual values (true, false, don't care) and defines a representing code for this minterm, the so-called minterm-code. Each memory array of the first stage is addressed by a subset of the input lines and compares this address to each of its 3-valued partial minterms. If a match is found, the minterm-code is sent via the data bus lines to the second stage. If an address doesn't match any partial minterm of one memory array, no data is returned. After processing a complete input vector, the first stage returns a bit pattern that represents all minterms of the Boolean function which correspond to the input vector.
Second Stage
The minterm-code addresses the memory of the second stage. The memory cells hold the corresponding bit pattern of the minterm-codes and the output vectors. If any input vector of this stage matches one of the stored codes, the stored output information is read out via the data lines and given to stage three for further computation.
The addressing scheme again uses three-valued information, but this time the output consists only of two-valued information. The address information is compared to all stored information in parallel, and a matching hit results in presenting all stored data on the data bus of this second-stage memory array. If no matching hit occurs, the corresponding memory array returns '0'.
Third Stage
The third stage combines all output values from stage 2 via the OR-function.
2.1.2 Detailed Implementation
Figure 1 shows the complete implementation of a 3-stage network, using an example with 12 input lines, 10 minterms and 8 output lines.
Fig. 1. 3-stage reconfigurable Boolean network: a) Representing minterm table b) Circuit structure
The complete input vector is partitioned into 3 parts, each containing 4 3-valued digits. The minterm table in Fig. 1 a) shows similarities to the open-PLA representation [3]. The stored value for the corresponding combination is the minterm code, and the minterm code vector resulting from simple concatenation of the responses is the complete code for the actual input vector.
There might be several matching minterm codes for one input. The input vector contains binary information, but 3-valued information is stored, and the '-'-value for don't care matches both '0' and '1' by definition. Figure 1 shows as an example that the minterms 3, 4 and 8 match e.g. the input "0011 0001 0010", and for the minterms 4, 5 and 8 matching input vectors are possible too. This means that for correct computation, the system must be capable of computing more than one minterm code vector. The resulting minterm code vectors are used by the second stage of the circuit, where the corresponding output vectors are read. The responses of the second stage have to be combined using the OR-gate of stage three. This is the final result of the operation. One choice for storing the minterms could be an architecture similar to a fully associative cache memory, as shown in Figure 2. The minterms are stored as TAG information, and the data field holds the corresponding minterm code. A positive comparison, called a compare hit, is marked in the Hit-field.
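The interplay of the three stages can be summarised in a small behavioural model. The following Python sketch is a simplified illustration only (it abstracts from the concrete memory layout described here): stage 1 turns each slice of the input vector into a minterm code, stage 2 returns the output vectors of all stored code combinations that match (don't-care positions are modelled as None), and stage 3 ORs them together.

```python
# Simplified behavioural model of the three-stage network (illustration only).

def stage1(input_bits, slice_width, minterm_rams):
    """minterm_rams[k] is a plain RAM mapping a slice value to a minterm code."""
    slices = [tuple(input_bits[i:i + slice_width])
              for i in range(0, len(input_bits), slice_width)]
    return tuple(minterm_rams[k][s] for k, s in enumerate(slices))

def stage2(code_vector, combination_ram):
    """combination_ram is a list of (stored code vector, output vector) pairs;
    a stored position of None acts as a don't care, so several entries may hit."""
    return [out for stored, out in combination_ram
            if all(c is None or c == v for c, v in zip(stored, code_vector))]

def stage3(output_vectors, width):
    result = [0] * width
    for vec in output_vectors:                       # wired OR over all hits
        result = [a | b for a, b in zip(result, vec)]
    return result

# Tiny example: a 4-bit input split into two 2-bit slices.
rams = [{(0, 0): 0, (0, 1): 1, (1, 0): 2, (1, 1): 3}] * 2
comb = [((1, None), [1, 0]),      # any input whose first slice is '01'
        ((None, 3), [0, 1])]      # any input whose second slice is '11'
codes = stage1([0, 1, 1, 1], 2, rams)       # -> (1, 3), both entries hit
print(stage3(stage2(codes, comb), 2))       # -> [1, 1]
```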
Fig. 2. Structure of fully associative memory cells, used as stage-1-memory
A difficulty arises if the minterms contain unbounded variables, coded as 'don't care' (DC). The comparison must be performed for all stored minterms, and all compare hits must be marked. In summary, there might be several hits per partial minterm memory (equivalent to a row in Figure 1 a), and all partial hits must be compared again to extract all total hits. It will be advantageous to use normally structured RAM arrays as minterm memory. Each of these RAMs is addressed by a partial input vector, and for all DC parts of a minterm, the minterm code is stored at both addresses. If the data bus of the RAM array exceeds the bit width necessary for storing the code, additional bits coding the context or other information might be stored, e.g. to indicate invalid conditions. A normal RAM architecture is not capable of communicating a data miss; therefore a specialised minterm code must be used to inform stage 2 that the minterm is not stored inside. This is necessary to keep the minterm coding complete. Figure 3 shows partial minterms, first mapped to a Tag-RAM (b) with don't-care comparison, then mapped to a conventional RAM (c). Please note that there is no need to store partial minterms consisting only of don't cares (DC, '-') in the Tag-RAM, because these partial minterms match any bit pattern of the input vector. The partial minterms no. 3 and 10 as well as no. 5 and 6 lead to the same bit pattern
and need to be stored only once. Therefore code 2 in Fig. 3 b) indicates that the first part of the input vector matches minterms 3 and 10, and code 3 indicates minterms 5 and 6.
Fig. 3. Mapping partial minterms to RAM a) partial minterms b) Mapping to Tag-RAM c) Mapping to conventional RAM
When the tags are mapped to a conventional RAM, every address of this memory whose binary representation matches the bit pattern of a 3-valued tag stores the appropriate minterm code. The address "0011" of the RAM matches the tags of the minterms 1 and 3, so the new code 4 is stored here to indicate the occurrence of both partial minterms. The codes 5–8 do not represent any matching minterm and may be used for context information, e.g. an illegal input vector. The RAM structure in stage 2 has to use a DC coding equivalent to stage 1. This implies that the comparison has to cover this case as well. Again, using conventional RAM means that the DC codes are decoded to all addresses storing the corresponding data inside stage 2. The RAM of this stage is addressed by the minterm combinations of stage 1. If a Tag-RAM is used in stage 2, every 3-valued tag represents a combination of minterms which are present at the input lines and detected by stage 1. The OR-combined output vectors of all minterms which are represented by one combination must be stored as the appropriate value. Again, several addresses might match the same minterm combination. To provide the parallel capacity, the memory of stage 2 is pipelined (Figure 4): The resulting minterm code of stage 1 addresses the first pipeline stage of the stage-2 RAM, called combination-RAM. This RAM contains an index for every valid minterm code vector, and this index addresses the second pipeline stage RAM storing the output variables for the minterm combination. This RAM is called output-vector-RAM, and the scheme enables the mapping of different minterm combinations, resulting from DC digits, to the same output value.
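How a 3-valued partial minterm is unfolded onto the addresses of a conventional RAM (as in Fig. 3 c) can be sketched as follows. This is an illustration only; it deliberately omits the combined-code case (the code 4 assigned above to overlapping patterns) and simply keeps the code of the last pattern written.

```python
# Illustrative sketch: expanding 3-valued partial minterms ('0', '1', '-') onto
# the binary addresses of a conventional RAM, as in the mapping of Fig. 3.
from itertools import product

def matching_addresses(pattern):
    """All binary addresses matched by a pattern with don't-care positions."""
    choices = [("0", "1") if ch == "-" else (ch,) for ch in pattern]
    return ["".join(bits) for bits in product(*choices)]

def fill_minterm_ram(patterns, miss_code):
    """patterns: dict pattern -> minterm code.  Unmatched addresses receive a
    dedicated miss code, so the plain RAM can still signal 'no minterm stored'.
    Overlapping patterns simply overwrite each other here (no combined codes)."""
    width = len(next(iter(patterns)))
    ram = {format(a, "0{}b".format(width)): miss_code
           for a in range(2 ** width)}
    for pattern, code in patterns.items():
        for address in matching_addresses(pattern):
            ram[address] = code
    return ram

# The partial minterms of Fig. 3: '0-11' -> 0, '1101' -> 1, '-0-1' -> 2,
# '0100' -> 3, miss code 8.
ram = fill_minterm_ram({"0-11": 0, "1101": 1, "-0-1": 2, "0100": 3}, miss_code=8)
print(ram["0100"], ram["1111"])   # 3 (stored minterm), 8 (miss)
```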
Fig. 4. Pipelining of combination-RAM and output-vector-RAM in stage 2
As any single RAM stores only a single index, the output-vector-RAM must hold all output vector values, combined by a logical OR. This implies that the RAM must hold all possible values of the function. As this results in exponential growth of the RAM size, the combination-RAM and output-vector-RAM are segmented into parallel parts, and the results are combined using the OR-operation of stage 3. If different minterm combinations create the same output vector, these combinations are mapped to the same output value too. Note that in the shown example the minterm code vector "111111" belongs to no minterm combination; the appropriate index leads to the output vector "000000000000". The indices 5–7 are not used here. To use as much capacity of the segmented RAM as possible, a configurable crossbar switch is used to connect stage 1 and stage 2. This switch maps the minterm code vector to the address inputs of the combination-RAM. Non-used data lines might be used for additional information such as context, visualisation of invalid codes etc., as already mentioned. The third stage contains the OR-operation, as discussed before. To use the inverted version of the sum-of-products structure, a configurable exclusive-OR operation (XOR) is included in this stage. The last operation uses the contents of a stored bit vector to invert the corresponding output bits. This results in the sample architecture in Figure 5. The sample architecture is shown for 12 input and 12 output lines. To store a complete Boolean function with 12 input and output variables in RAM, 12 bits of data must be stored for all 4096 possible input combinations. Therefore an amount of 6144 bytes is necessary for complete storage in a LUT architecture. The amount of memory to configure the sample architecture sums up to:

3x minterm-RAM 16x4: 24 Bytes
3x combination-RAM 64x4: 96 Bytes
3x output-vector-RAM 16x12: 72 Bytes
Crossbar-switch configuration: 18 Bytes
Inverting register 12x1: 1.5 Bytes
Sum: 211.5 Bytes
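These totals follow directly from the RAM dimensions; the following snippet (added purely for illustration) re-computes them:

```python
# Re-computing the memory figures quoted above (all sizes in bytes).
lut_bytes = 2 ** 12 * 12 / 8                 # full LUT: 4096 words of 12 bits
config_bytes = (3 * 16 * 4 / 8               # 3x minterm-RAM 16x4
                + 3 * 64 * 4 / 8             # 3x combination-RAM 64x4
                + 3 * 16 * 12 / 8            # 3x output-vector-RAM 16x12
                + 12 * 12 / 8                # crossbar switch 12x12
                + 12 / 8)                    # inverting register 12x1
assert (lut_bytes, config_bytes) == (6144, 211.5)
```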
Fig. 5. Sample architecture for a (0,1)^12 → (0,1)^12 function
Of course this architecture is not capable of containing all functions. The impact of the minimising, partitioning and mapping algorithms on the results and on the usability will be great and is a subject of future work. At this point, the architecture is introduced as one possible architecture for implementing a GCA.
3 Mapping of a GCA on This Architecture
3.1 Sample Implementation of the Architecture
It is assumed for the discussion in this chapter that all binary coded states of a GCA at t_{n-1} are the input of the next cycle t_n when the GCA computes the next state. In this case, the GCA might be mapped to the introduced architecture. To store the
current state, an additional set of binary valued registers must be added to the architecture. These registers decouple the current state from the new computation. In most cases, edge-sensitive registers will be used. The unused data lines originating from stage 1 may be used as context signals. If the architecture provides more than one layer of RAM in the crossbar switch, the stage-2 RAM and the inverting register, a context switch can be performed within one cycle. This leads the way to multi-context implementations of a GCA. Figure 6 shows an implementation for realising GCA with 64 state bits.
Fig. 6. Sample architecture for implementing GCA
The computation will use one cycle per step in normal cases. As the capacity of the circuit is limited by the combination-RAM, the application might be too large for implementation in the architecture. If all methods of minimising and mapping fail, the computation can be split up into partial steps. This results in more than one cycle per step to compute the states of the GCA. The memory demand per memory layer for this implementation is given by the following table:

8x minterm-RAM 256x8: 2 KBytes
8x combination-RAM 64Kx8: 512 KBytes
8x output-vector-RAM 256x64: 16 KBytes
2x Crossbar-Switch configuration: 1 KByte
Inverting register 64x1: 8 Bytes
Sum: 531 KBytes
3.2 One Example: Mapping a 4-Bit-CPU on the Implementation
This chapter introduces a simple 4-Bit-CPU, implemented as a GCA and mapped on the new hardware architecture. The CPU has a simplified instruction set and consists of only a few internal registers.

Internal registers:
- Address, 8-Bit: The memory-address register. The content of this register is directly mapped to the address lines of the processor and vice versa.
- Data, 4-Bit: The memory-data register. The content of this register is mapped to the data lines of the processor and vice versa.
- Accu, 4-Bit: Internal register where all calculations occur.
- Code, 4-Bit: Instruction register. This register holds the opcode of the current instruction during execution time.
- PC, 4-Bit: Program counter. The content of this register represents the memory address of the current instruction.
Instruction set:
- bne Address: Branch not equal. Jump if the content of Accu is not 0.
- beq Address: Branch equal. Jump if the content of Accu is 0.
- lda Address: Load Accu with the data from Address.
- sta Address: Store the content of Accu at the given Address.
- and Data: Calculate Accu AND Data and store the result in Accu.
- or Data: Calculate Accu OR Data and store the result in Accu.
- add Address: Calculate Accu + data from Address and store the result in Accu.
- sub Address: Calculate Accu - data from Address and store the result in Accu.
An unconditional jump can be achieved by a bne- followed by a beq-instruction. There is neither stack processing nor subroutines nor a carry flag. The instruction set consists of 8 instructions and can be coded in a 3-bit instruction code. Two operand formats are used: 8-bit addresses and 4-bit data.
3.2.1 Mapping the CPU to a GCA
The GCA consists of 7 named cells. The cells 'Address', 'Data', 'Accu' and 'Code' correspond directly to the registers. The additional cells 'RW', 'Reg0' and 'Reg1' are needed to realise the program flow of the CPU. The cell 'Address' has 256 states, the cell 'RW' two, the cell 'Code' eight, and all other cells have 16 different states. The states of 'Address', 'RW' and 'Data' correspond to the I/O lines of the CPU, and if these lines change, the states of the corresponding cells change too. Thus the states of all cells of the GCA are coded in 28 bits. The instructions are processed in several phases. Each phase needs another configuration of the GCA. This includes different functionality of the single cells as well as different communication links between the cells. These reconfigurations are directed by the context lines, which select different levels of memory to change the behaviour of the GCA. The following phases are used:
Fig. 7. GCA to realise the 4-bit CPU
Instructions | OpCodes | Phases
bne, beq | 000, 001 | Fetch OpCode; Fetch Addr#1; Set Address regarding to Accu
lda | 010 | Fetch OpCode; Fetch Addr#1; Save&Set Address; Set Accu and restore Address
sta | 011 | Fetch OpCode; Fetch Addr#1; Save&Set Address and store Accu; Restore Address
and, or | 100, 101 | Fetch OpCode; Calculate Accu
add, sub | 110, 111 | Fetch OpCode; Fetch Addr#1; Save&Set Address; Calculate Accu and restore Address
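For readers who prefer a machine-readable form, the sequencing of the table can be written down directly as a mapping from mnemonic to phase list. This is an illustrative representation only; in the circuit the sequencing is realised through the context lines and the memory contents, not as a data structure.

```python
# The phase sequences of the table above as a plain mapping (illustration only).
PHASES = {
    "bne": ["Fetch OpCode", "Fetch Addr#1", "Set Address regarding to Accu"],
    "beq": ["Fetch OpCode", "Fetch Addr#1", "Set Address regarding to Accu"],
    "lda": ["Fetch OpCode", "Fetch Addr#1", "Save&Set Address",
            "Set Accu and restore Address"],
    "sta": ["Fetch OpCode", "Fetch Addr#1", "Save&Set Address and store Accu",
            "Restore Address"],
    "and": ["Fetch OpCode", "Calculate Accu"],
    "or":  ["Fetch OpCode", "Calculate Accu"],
    "add": ["Fetch OpCode", "Fetch Addr#1", "Save&Set Address",
            "Calculate Accu and restore Address"],
    "sub": ["Fetch OpCode", "Fetch Addr#1", "Save&Set Address",
            "Calculate Accu and restore Address"],
}
# E.g. a 'sta' instruction passes through len(PHASES["sta"]) = 4 phases.
```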
Every instruction starts with the 'Fetch OpCode' phase. After completing the instruction, the next phase is 'Fetch OpCode' again. The sequence of phases, determined by the OpCode, is coded in the cyclic change of the context lines, determined by the content of the memory arrays of the first stage of the architecture and the configuration of the crossbar switch between stage one and two. All context states are referred to by the name of the corresponding phase in the further text. The last phase of every sequence must leave the context lines in the state that selects the configuration of the 'Fetch OpCode' phase again. To allow the same configurations to be used by different instructions and in different sequences of configurations, the OpCode is stored in the cell 'Code'. This cell is not part of the registers of the CPU. Some of the phases used to process the instructions are described in detail below:
Fetch OpCode (Fig. 8a)
During this phase the state of the cell 'Address' represents the current Program Counter (PC), and the content of the cell 'Data' represents the current OpCode read from RAM. The cell 'RW' is in state 'read', and the state of cell 'Accu' represents the current content of the register 'Accu'. The Program Counter is increased by one, and the OpCode is stored in the cell 'Code'. The context lines change to context 'Calculate Accu' or to context 'Fetch Addr#1', according to the OpCode.
Fig. 8. Instruction execution phases a) Fetch OpCode b) Fetch Addr#1 c) Save&Set Address d) Save&Set Address and Store Accu
Fetch Addr#1 (Fig. 8b)
The cell 'Data', linked to the state of the data lines of the CPU, contains the first part of an address used by the current instruction. This partial address is stored in the state of cell 'Reg0'. The Program Counter is increased by one. According to the state of cell 'Code', the next context is 'Save&Set Address' or 'Save&Set Address and Store Accu'.
Save&Set Address (Fig. 8c)
This phase follows directly after the phase 'Fetch Addr#1'. The cell 'Address' represents the current PC, the cell 'Reg0' the first part of the address to be read, and the cell 'Data' represents the second part of this 8-bit address. In this phase, the state of the cell 'Address' is copied to 'Reg0' and 'Reg1', while the states of the cells 'Reg0' and 'Data' give the new state of 'Address'. According to the state of cell 'Code', the next context is 'Calculate Accu and restore Address' for each of the instructions 'lda', 'add' and 'sub'.
Save&Set Address and Store Accu (Fig. 8d)
As in the phase 'Save&Set Address', this phase saves the current PC in the cells 'Reg0' and 'Reg1' and sets the new address. Unlike the last phase, the state of the cell 'Accu' is copied to the cell 'Data' and the cell 'RW' changes its state from 'read' to 'write'. In this way, the content of the CPU register 'Accu' is stored to the given address. The next phase is 'Restore Address'.
3.2.2 Mapping the GCA on the Circuit
The states of all cells of the GCA require 28 bits for coding. This architecture is mapped on a circuit consisting of 8 16x4-RAMs as minterm memory in the first stage and 4 8x8-RAMs as combination memory as well as 256x32-RAMs as output memory in the second stage. The cells of the GCA are mapped to the following bit positions in the state vector:
Accu: bits 4..7
Code: bits 8..10
RW: bit 11
Reg0: bits 12..15
Reg1: bits 16..19
Data: bits 20..23
Address: bits 24..31
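The coding above can be captured by small helper functions; the sketch below is an illustration only, showing how the cell states could be placed in and extracted from a 32-bit state word:

```python
# Illustrative helpers for the state coding listed above: each cell occupies a
# fixed bit field of a 32-bit state word.
FIELDS = {              # name: (lowest bit, width)
    "Accu":    (4, 4),
    "Code":    (8, 3),
    "RW":      (11, 1),
    "Reg0":    (12, 4),
    "Reg1":    (16, 4),
    "Data":    (20, 4),
    "Address": (24, 8),
}

def get_field(state, name):
    low, width = FIELDS[name]
    return (state >> low) & ((1 << width) - 1)

def set_field(state, name, value):
    low, width = FIELDS[name]
    mask = ((1 << width) - 1) << low
    return (state & ~mask) | ((value << low) & mask)

# Example: place a program counter value into 'Address' and read it back.
s = set_field(0, "Address", 0x2A)
assert get_field(s, "Address") == 0x2A
```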
The phases of the instruction processing consist of 6 different Boolean functions, which have to be mapped on the architecture in different combinations. These are:
− Increment of an 8-bit value, used to increment the PC
− Copy one bit, used to copy registers. This function is used in parallel up to 20 times to copy the PC, the Accu and the new Address
− 4x Calculation of a 4-bit value from two 4-bit values, for the operations 'add', 'sub', 'and', 'or'
All these Boolean functions may be executed in parallel, as long as they set different output bits. The phases 'Fetch OpCode' and 'Save&Set Address and Store Accu' are now explained in detail.
Fetch OpCode, Context '0000'
This phase combines the increment function with copying 4 bits. The context lines are set to 'Calculate Accu' or 'Fetch Addr#1', according to the current OpCode. Figure 9 shows the mapping of the cell states to the minterm memories MIN0 to MIN7. Only the RAMs MIN0, MIN1, MIN2, MIN7, COMB0, OUT0, COMB1 and OUT1 are used. MIN0 and MIN1 are linked to COMB0, which is completely filled, while bits 0..2 of MIN2 are linked to COMB1. The data lines of MIN7 are linked directly to the context lines of the circuit. The data path from COMB0 leads to the output memory OUT0, where the next address bits are generated and where the read-write state is set to 'read'. The data value of bits 0..2 from MIN2, representing the OpCode, is stored via OUT1 into the GCA as the state of the cell 'Code'. The context lines are set to '0001', coding the context 'Fetch Addr#1', for any instruction except 'and' and 'or'. If the OpCode value is 'and', the context is set to '0010' ('Calculate And'); for 'or' it is set to '0011' ('Calculate Or'). After execution of 'Fetch OpCode', the cell 'Address' keeps the new PC, the cell 'Code' stores the OpCode, the cell 'RW' is in the state 'read', and the context is either 'Fetch Addr#1', 'Calculate And' or 'Calculate Or'.
Save Address and Store Accu, Context '0110'
This phase is only used when the instruction 'sta' is processed. The cell 'Address' stores the current PC, the cell 'Data' stores the second part of the storage address (Addr#2), the cell 'Accu' stores the current content of the accumulator register, and the cell 'RW' is in state 'read'. This phase uses the RAMs MEM0-MEM5, COMB0, COMB1, OUT0 and OUT1 completely and COMB2 and OUT2 partially. The cell 'RW' is set to state 'write', the cell 'Data' stores the content of the accumulator, and the cell 'Address' stores the address where the content of 'Data' must be written. The context lines are not set; the next context value is '0000', the 'Fetch OpCode' phase.
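Purely as an illustration of the 'Fetch OpCode' behaviour described above (in the circuit this is realised by the RAM contents, not by procedural code), the phase can be modelled as a state-update function; the dictionary representation and names are chosen freely.

```python
# Behavioural model of the 'Fetch OpCode' phase (illustration only): the PC is
# incremented, the opcode on the data lines is latched into 'Code', and the
# next context is chosen from the opcode.

NEXT_CONTEXT = {0b100: "0010",    # 'and' -> Calculate And
                0b101: "0011"}    # 'or'  -> Calculate Or

def fetch_opcode(cells):
    """cells: dict of the GCA cell states ('Address', 'Data', 'Code', 'RW', ...)."""
    new = dict(cells)
    new["Code"] = cells["Data"] & 0b111            # latch the 3-bit opcode
    new["Address"] = (cells["Address"] + 1) % 256  # increment the PC
    new["RW"] = "read"
    next_context = NEXT_CONTEXT.get(new["Code"], "0001")  # default: Fetch Addr#1
    return new, next_context
```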
Fig. 9. Fetch OpCode
4
Conclusion and Outlook
Starting with the problem of mapping a global cellular automaton onto a physical circuit, where the cells could possibly – but not likely – have the full complexity and are linked to every other cell, this paper has introduced a new concept of realising Boolean functions with many input and output variables. A GCA mapped to one Boolean function avoids the costs of the communication between the separated cells, which are mapped to the complexity of the single resulting Boolean function. This approach shifts the problem of communication to the theory of minimising Boolean functions and to the design of algorithms. If a Boolean function is still too complex to fit on the architecture, the processing can be divided up into several independent steps with less complex functions. The example of the CPU mapped to the new architecture approach visualises the possibilities of this way of realising global cellular automata. At the same time it shows the limitations: the majority of the Boolean functions of the CPU, like copying or incrementing the address, are completely defined functions where every possible input combination and every possible output combination of variables may occur.
For this reason, no simple way could be found to map these functions on the architecture and save memory and complexity at the same time.
Fig. 10. 'Save Address and Store Accu'
The results of these considerations show that a lot of improvement is possible through special algorithms and software. This will be the topic of further research. Another topic should be the exploration and improvement of the architecture itself. Because only part of the first- and second-stage RAM was used, it could be advantageous to provide a circuit for an automaton with a wider status register and more status bits, compared to the input lines the first stage can handle. Another way to improve the circuit could be the introduction of a special context-RAM that handles context sequences to be processed. In summary it can be concluded that this kind of approach offers new possibilities and chances worth considering in the future.
References
[1] Rolf Hoffmann, Klaus-Peter Völkmann, Wolfgang Heenes: "Globaler Zellularautomat (GCA): Ein neues massivparalleles Berechnungsmodell". Mitteilungen – Gesellschaft für Informatik e.V., Parallel-Algorithmen und Rechnerstrukturen, ISSN 0177-0454, Nr. 18, S. 21–28 (2001) (in German).
[2] R.K. Brayton et al.: "Logic Minimization Algorithms for VLSI Synthesis". Kluwer Academic Publishers, 1984.
[3] Mike Trapp: "PLD-design methods migrate existing designs to high-capacity devices". EDN Access, Feb. 1994.
[4] Wolfgang Heenes, Rolf Hoffmann, Klaus-Peter Völkmann: "Architekturen für den globalen Zellularautomaten". 19th PARS Workshop, March 2003, Basel (in German). http://www.ra.informatik.tu-darmstadt.de/publikationen/pars03.pdf
RECAST: An Evaluation Framework for Coarse-Grain Reconfigurable Architectures
Jens Braunes, Steffen Köhler, and Rainer G. Spallek
Institute of Computer Engineering, Dresden University of Technology, D-01062 Dresden, Germany
{braunes,stk,rgs}@ite.inf.tu-dresden.de
Abstract. Coarse-grain reconfigurable processors are becoming more and more of an alternative to FPGA-based fine-grain reconfigurable devices due to their reduced configuration overhead. This provides a higher degree of flexibility for the design of dynamically reconfigurable systems. But to make them more interesting for industrial applications, suitable frameworks supporting design space exploration as well as the automatic generation of dedicated design tools are still missing. In our paper we present a runtime-reconfigurable VLIW processor which combines hardwired and reconfigurable functional units in one template. For design space exploration, we discuss a framework, called RECAST (Reconfiguration-Enabled Compiler And Simulation Toolset), based on an architecture description language which is extended by a model of coarse-grain runtime-reconfigurable units. The framework comprises a retargetable compiler based on the SUIF compiler kit, a profiler-driven hardware/software partitioner and a retargetable simulator. To evaluate the framework we performed some experiments on an instance of the architecture template. The results show an increase in performance but also a lot of potential for further improvements.
1
Introduction
Reconfigurable architectures have been a subject of academic research for some years and are now moving towards industrial applications as well. With respect to rising design and mask costs, they are a very promising alternative to Application-Specific Integrated Circuits (ASICs). As a result of the availability of highly flexible FPGAs, we recognize a migration from specialized fixed hardware to reconfigurable devices in many cases. On the other hand, this high flexibility causes additional costs. The interconnect network consumes a lot of area on the die, and in many cases a non-negligible number of logic blocks cannot be used because of a lack of routing resources needed by complex algorithms. Due to the bandwidth-limited interfaces which connect the configurable device and the memory from where the configuration bit-file is loaded, reconfiguration is a time-consuming process. Depending on the device size this can require thousands of cycles. For this reason reconfiguration at run-time must be able to cope
with high latencies and is not profitable in many cases. Partially reconfigurable devices or multi-context FPGAs try to overcome this penalty. However, a large number of applications from the digital signal processing domain or multimedia algorithms do not need the high flexibility of FPGAs. In general, the algorithms have similar characteristics. The instruction stream has a regular structure. Typically, the most time-critical and computation-intensive code blocks are executed repeatedly and have a high degree of parallelism inside. The operands are rather word-sized instead of bit-level. Coarse-grain reconfigurable devices with a data bit-width much greater than one bit meet the demands of these algorithms sufficiently and are more efficient in terms of area, routing and configuration time. Furthermore, the parallelism inside the algorithm can be mapped efficiently onto the reconfigurable device to improve the overall performance. Meanwhile a large number of coarse-grain architectures have been proposed and have shown their advantages compared to fine-grain architectures [1]. Despite all efforts in developing new, highly sophisticated architectures, there is a lack of tools which support the design process. The tools have to provide Design Space Exploration (DSE) to find the most suitable architecture for a particular application as well as the automatic generation of dedicated design tools such as a compiler and a simulator. In this paper we present a framework for design space exploration which supports a template-based design process of a run-time reconfigurable VLIW processor. The framework comprises a retargetable compiler, a profiler-driven hardware/software partitioner and a retargetable simulator. We discuss an Architecture Description Language (ADL) driven approach for reconfigurable hardware modeling. The paper is organized as follows. Section 2 outlines typical coarse-grain reconfigurable architectures and the associated programming and mapping tools. Section 3 covers related work on models for reconfigurable devices. Section 4 presents our framework for DSE and compilation. Section 5 discusses our experimental results. Finally, section 6 concludes the paper.
2
Coarse-Grain Reconfigurable Processors
Because of the lower efficiency in terms of routing area overhead and the poor routability of FPGA-based fine-grain reconfigurable devices, coarse-grain reconfigurable arrays are becoming more and more of an alternative for bridging the gap between ASICs and general-purpose processors. The more regular structures within the Processing Elements (PEs) with their wider data bit-width (typically complete functional units, e.g. ALUs and multipliers, are implemented) and the regularity of the interconnect network between the PEs involve a massive reduction of configuration data and configuration time. In the following we want to outline four typical examples of coarse-grain reconfigurable architectures. In particular we are interested in the architecture of the reconfigurable array and the availability of programming tools for these devices.
The PipeRench [2] device is composed of several pipeline stages, called stripes. Each stripe contains an interconnection network connecting a number of PEs, each of which contains an ALU and a register file. Through the interconnect the PEs have local access to operands inside the stripe as well as global access to operands from the previous stripe. Via four global buses the configuration data and the state of each stripe as well as the operands for the computations can be exchanged with the host. A stripe can be reconfigured within one cycle while the other stripes are in execution. A compiler maps the source of an application written in a Dataflow Intermediate Language (DIL) to the PipeRench device. The compilation consists of several phases: module inlining, loop unrolling and the generation of a straight-line Single Assignment Program (SIP). A place-and-route algorithm tries to fit the SIP into the stripes of the reconfigurable device.
REMARC [3] is a reconfigurable accelerator tightly coupled to a MIPS-II RISC processor. The reconfigurable device consists of an 8×8 array of PEs or nano processors. A nano processor has its own instruction RAM, a 16-bit ALU, a 16-entry data RAM, and 13 registers. It can communicate directly with its four neighbors and, via a horizontal and a vertical bus, with the nano processors in the same row and the same column. Different configurations of a single nano processor are held in the instruction memory in terms of instructions. A global program counter (the same for all nano processors) is used as an index to a particular instruction inside the instruction memory. For programming the MIPS-II / REMARC system, a C source is extended by REMARC assembler instructions and compiled with the GCC compiler into MIPS assembler. The embedded REMARC instructions are then translated into binary code for the nano processors using a special assembler.
The PACT XPP architecture [4] is a hierarchical array of coarse-grain Processing Array Elements (PAEs). These are grouped into a single or multiple Processing Array Clusters (PACs). Each PAC has its own configuration manager, which can reconfigure the belonging PAEs while neighboring PAEs are processing data. A structural language called Native Mapping Language (NML) is used to define the configurations as a structure of interconnections and PAE operations. The XPP C compiler translates the most time-consuming parts of a C source to NML. The remaining C parts are augmented with interface commands for the XPP device and compiled by a standard C compiler for running on the host processor.
MorphoSys [5] combines a RISC processor with a reconfigurable cell array consisting of 8×8 (subdivided into 4 quadrants) identical 16-bit PEs. Each PE, or Reconfigurable Cell (RC), has an ALU-multiplier and a register file and is configured through a 32-bit context word. The interconnection network comprises three hierarchical levels where nearest-neighbor, intra-quadrant, and inter-quadrant connections are possible.
The partitioning of C source code between the reconfigurable cell array and the RISC processor is done manually by adding prefixes to functions. The configuration contexts can be generated via a graphical user interface or manually from assembler.
3
Models for Reconfigurable Devices
For application designers it is not an easy task to exploit the capabilities of run-time reconfigurable systems. In many cases, the hardware/software partitioning and the application mapping have to be done manually. For this reason the designer has to have detailed knowledge about the particular system. To allow the designer to think of run-time reconfiguration at a higher, algorithmic level, dedicated tools, based on suitable models, must take over the partitioning and mapping. In the following we want to outline two models which have been proposed to abstract from the real hardware and provide a more general view.
GRECOM: In [6] the authors proposed a General Reconfiguration Model (GRECOM) to bridge the semantic gap between the algorithm and the actual hardware. It covers a wide range of reconfigurable devices by means of abstraction from their real hardware structure. Each reconfigurable device consists of a number of PEs linked together by an interconnection network. Both the functionality of each PE and the topology of the interconnection network can be configured. The model is specified by four basic parameters:
Granularity of the processing elements Topology of the interconnection network Method of reconfiguration Reconfiguration time
By varying these parameters a Reconfigurable Mesh model and an FPGA model was derived from the GRECOM. Both models fit into the class of finegrain reconfigurable devices – the PEs perform operations on one bit operands. DRAA: Related to our approach (cf. section 4), Jong-eun Lee et al. [7] proposed a generic template for a wide range of coarse-grain reconfigurable architectures called Dynamically Reconfigurable ALU Array (DRAA). The DRAA consists of identical PEs in a 2D array or plane with a regular interconnection network between them. Vertical and horizontal lines provide the interconnections between the PEs of the same line (diagonal connections are not possible) as well as the access to the main memory. The microarchitecture of the PEs are described using the EXPRESSION ADL [8]. A three-level mapping algorithm is used to generate loop pipelines fit into the DRAA. First, on the PE-level, microoperation trees (expression trees with microoperations as nodes) are mapped to single PEs without the need of reconfiguration. Then the PE-level mappings are grouped together on line-level in such a way, that the number of required memory accesses not exceed the capacity
of the memory interface belonging to the line. On the plane level, the line-level mappings are put into the 2D array. If there are unused rows remaining, the generated pipeline can be replicated in the free space.

Fig. 1. VAMPIRE architecture template
4
Design Space Exploration and Compilation
The availability of efficient DSE tools and automatic tool generators for reconfigurable processors is crucial for the success of these architectures in commercial applications. For the conventional process of hardware/software co-design of SoCs, tools like EXPRESSION are used to find the architecture meeting the requirements of an application best. We want to introduce our architecture template as a starting point for DSE, the framework which combines DSE and tool generation, and an approach to integrate run-time reconfiguration into an ADL-defined architecture model.
4.1 Reconfigurable VLIW Architecture Template
In the following, we want to introduce our template-based architecture concept called VAMPIRE (VLIW Architecture Maximizing Parallelism by Instruction REconfiguration). We extend a common VLIW processor with a single or multiple tightly coupled Reconfigurable Functional Units (RFUs). The architecture is parameterizable in such a way that we can alter the number and the types of functional units, the register architecture, as well as the connectivity between the units, the register files and the memory. Figure 1 shows the architecture template schematically.
Fig. 2. RFU microarchitecture
The RFUs are composed of coarse-grain PEs which process 8-bit operands. A switched matrix provides the configurable datapaths between the PEs (Fig. 2). The RFUs are fully embedded within the processor pipeline. Considering the programming model, every RFU is assigned to a slot within the very long instruction word, which can be filled with a reconfigurable instruction. From the compiler's point of view, code generation and scheduling are a much easier task than for loosely coupled devices. Like the processor architecture, the RFU microarchitecture is also parameterizable. During DSE the number of PEs in each row and column can be adapted to the demands of the particular application. Every PE consists of a configurable ALU. The results from the ALU can be routed either directly or across a shifter to the interconnection network of the RFU. For synchronization purposes a pipeline register can be used to hold the results for the next stage.
4.2 Framework for DSE and Compilation
Based on the SUIF compiler toolkit [9], we are developing a framework for evaluation of the VAMPIRE architecture. The framework, called RECAST
(Reconfiguration-Enabled Compiler And Simulation Toolset), consists of a profiler, a retargetable compiler, a mapping module, and a simulator (Fig. 3).

Fig. 3. DSE and compilation framework

Frontend: For the processing of the C source, we use the frontend that comes with the SUIF compiler kit. After standard analysis and some architecture-independent transformation and optimization stages, the algorithm is represented by the SUIF Intermediate Representation (IR) in terms of an abstract syntax tree.
Candidate Identifier: To find the most time-consuming parts of the application which might be accelerated by execution within the RFUs, a profiler stage is included to estimate the run-time behavior of the application. In contrast to other profiler-driven concepts, early profiling is performed on the intermediate representation instead of requiring fully compiled object code.
Synthesis of Reconfigurable Instructions: The hardware/software partitioning of the algorithm takes the profiling data into consideration. In the present implementation, the mapping module generates a VHDL description for subtrees of the IR. The mapping results can be influenced by a parameter set. This includes the maximal pipeline depth, the minimal clock frequency and the maximal area consumption. For synthesis, a set of predefined, parameterizable VHDL modules is used. These modules were evaluated previously in terms of
implementation costs for common FPGAs. Besides the VHDL code, a behavioral description and code generation rules for every mapped subtree are generated. In some cases, when the results meet the demands, not only one mapping will be generated. Such a set of mappings (or candidates) for a particular subtree is then forwarded to the code selection phase. Each candidate is annotated with synthesis data that is used to estimate the cost and performance gain.
Code Selection: Based on the estimated run-time behavior as provided by the early profiler, the candidates for reconfigurable execution can be evaluated and selected so that high speedups are achieved or the number of run-time reconfigurations is minimized. The design parameters annotated to the reconfigurable instruction candidates are used to ensure that resource constraints and design requirements like the clock rate are met by the generated code. The code generation is performed by a tree matching and dynamic programming algorithm. The rules to transform the IR into object code are specified by hand for the hardwired instructions and are generated automatically for the reconfigurable instructions, as mentioned above. Finally, the scheduling phase combines the selected hardwired instructions as well as the reconfigurable instructions into the VLIW object code as an input for the parameterizable simulation model.
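To give a feel for what such a cost-based selection involves, the following sketch is a greatly simplified illustration only (it is not the RECAST implementation, which uses tree matching with dynamic programming): each candidate carries the synthesis annotations and the profiled execution count, and only candidates with a positive net gain are kept, largest gain first. All names and numbers are invented for illustration.

```python
# Greatly simplified illustration of cost-based candidate selection (not the
# actual RECAST algorithm): keep only candidates whose profiled cycle savings
# outweigh their reconfiguration penalty, largest net gain first.
from dataclasses import dataclass

@dataclass
class Candidate:
    subtree_id: int
    saved_cycles: int       # per execution, from the synthesis annotation
    reconfig_cycles: int    # penalty whenever the RFU must be reconfigured
    exec_count: int         # from the early profiler

def select(candidates):
    scored = [(c.exec_count * c.saved_cycles - c.reconfig_cycles, c)
              for c in candidates]
    return [c for gain, c in sorted(scored, key=lambda x: -x[0]) if gain > 0]

# Hypothetical numbers, purely for illustration.
print(select([Candidate(1, 3, 40, 100), Candidate(2, 1, 40, 10)]))
```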
4.3 Architecture Description
In our approach an architecture description acts as an interface between the compiler's point of view and the real hardware of our architecture template (Fig. 3). At present, there exists only a simple subset of such an ADL in our framework, which provides a behavioral description of the instruction set as well as the rules for code generation, as mentioned before. As an essential improvement of our DSE framework we are now utilizing the concepts behind Architecture Description Languages like EXPRESSION. EXPRESSION [8] is designed to support a wide range of processor architectures (RISC, DSP, ASIP, and VLIW). From the combined structural and behavioral description of the processor, an ILP compiler and a cycle-accurate simulator can be generated automatically. The execution pipelines of the functional units are described explicitly as an ordering of units which comprise the pipeline stages, the timing of multi-cycled units, and the datapaths connecting the units. For ILP compilers, reservation tables are generated automatically to avoid resource conflicts. As another key feature, EXPRESSION supports the specification of novel memory organizations and hierarchies. Apart from the specification of the hardwired part of our processor template, containing the pipeline structure, the memory hierarchy, and the instruction set, we can also describe the microarchitecture of the RFUs including the PEs and the interconnect network. Every PE as such is functionally described using atomic operations identical to the SUIF intermediate instructions, which might be combined into less complex trees. In this way the granularity of the processing elements is also specified. Possible inner-PE configuration has to be represented by
a predefined set of these trees. From the mapping point of view, a hierarchical algorithm comparable to the DRAA three-level mapping generates the particular configurations, and thereby the candidates for reconfigurable execution as well as the code generation rules for them. For instruction and configuration scheduling in general, reconfiguring the RFU at run-time can be compared to a data access through a memory hierarchy. Multi-cycle reconfiguration causes latencies which have to be hidden by the compiler. To support this compiler task, different configuration management concepts have been proposed: configuration prefetch [10], configuration caching [11] or multiple contexts for fast context switching (cf. REMARC's instruction memory). According to the GRECOM, the method of reconfiguration has to be specified for the RFU. This is achieved by a set of predefined behavioral descriptions which also contain the reconfiguration time as an additional parameter. Encapsulated into a particular resource, comparable with EXPRESSION's memory components, it is possible to generate reservation tables to avoid scheduling conflicts.

Fig. 4. Comparison of benchmarks. 100% corresponds to a processor without reconfigurable units
5
Experiments and Analysis
To validate our framework, we have compiled and simulated benchmark algorithms which have their origin in the DSP domain. We have evaluated the utilization of the reconfigurable units for a 512-point two-dimensional in-place FFT, a Viterbi decoder for a 1/3-rate convolutional code and a bilinear filter.
Due to the development state of our DSE framework we had to make some modifications to the experimental environment. The benchmark algorithms therefore passed through the following phases:
1. Following the frontend pass of the SUIF compiler, we had to perform some transformations on the IR. We had to dismantle for-loops and special array instructions, which are provided by the SUIF IR, because our compiler backend does not support them yet.
2. The Early Profiler estimated the run-time behavior of the algorithms on a fixed set of input samples.
3. The synthesis module generated mappings for the RFUs based on a fixed set of 24 predefined VHDL modules which correspond to all SUIF intermediate instructions that can be mapped into hardware directly. Memory operations as well as control flow operations are excluded from mapping and have to be executed by the hardwired units.
4. Due to the development state of our architecture description, we could not define the microarchitecture of the RFUs more flexibly. Furthermore, the complexity of the RFUs is much lower than that of other reconfigurable devices (cf. section 2). As a consequence, the mapped IR subtrees were of order less than 4.
5. The code generation pass transformed the IR into VLIW object code using hardwired instructions as well as newly generated reconfigurable instructions.
6. A simple behavioral description was used to describe the functionality of the instruction set for the simulator.
We apply our measurements to the ratio of executed cycles with and without the availability of reconfigurable hardware on the same processor architecture. Figure 4 shows the results in detail. The increase in performance (decrease of cycles) is in the range of 10 to 38% when we focus mainly on a maximal clock frequency and a small pipeline depth. The results for the FFT and the Viterbi decoder fall short of our expectations. We identified the following reasons for the relatively low increase in performance. Firstly, the very small RFUs cause a suboptimal mapping of subtrees and possibly frequent reconfiguration. Because of the higher costs, a large portion of the candidates is not selected for RFU execution. Furthermore, there are also a lot of memory transfers and control flow operations inside the algorithms, which have to be executed in the hardwired part of the processor and do not contribute to the increase in performance. If we allow the RFU to access the memory directly, we can overcome this penalty. Furthermore, with the utilization of optimizing scheduling techniques like basic block expansion and software pipelining, we are convinced that the results can be improved considerably.
6
Conclusion
In our paper we have presented the RECAST framework for design space exploration for the parameterizable coarse-grain reconfigurable architecture template
VAMPIRE. We have outlined the particular components of this framework, including a profiler based on the SUIF intermediate representation, a module for the synthesis of reconfigurable instructions and the code selector. Furthermore, we have analyzed the results obtained from our experiments. We have seen a lot of potential for further improvements. At present we are extending our framework by a more powerful architecture description that includes a flexible model for coarse-grain run-time reconfigurable units.
References
1. Hartenstein, R.: A Decade of Reconfigurable Computing: a Visionary Retrospective. In: Proceedings of the Conference on Design Automation and Testing in Europe (DATE'01), ACM Press (2001)
2. Goldstein, S.C., Schmit, H., Moe, M., Budiu, M., Cadambi, S., Taylor, R.R., Laufer, R.: PipeRench: A Coprocessor for Streaming Multimedia Acceleration. In: ISCA. (1999) 28–39
3. Miyamori, T., Olukotun, K.: REMARC: Reconfigurable Multimedia Array Coprocessor. In: Proceedings of the ACM/SIGDA Sixth International Symposium on Field Programmable Gate Arrays (FPGA'98), ACM Press (1998)
4. Baumgarte, V., May, F., Nückel, A., Vorbach, M., Weinhardt, M.: PACT XPP – A Self-Reconfigurable Data Processing Architecture. In: Proceedings of the International Conference on Engineering of Reconfigurable Systems and Algorithms (ERSA'2001). (2001)
5. Singh, H., Lee, M.H., Lu, G., Kurdahi, F., Bagherzadeh, N., Filho, F.E.C.: MorphoSys: An Integrated Reconfigurable System for Data-Parallel and Computation-Intensive Applications. IEEE Transactions on Computers 49 (2000)
6. Sidhu, R.P., Bondalapati, K., Choi, S., Prasanna, V.K.: Computation Models for Reconfigurable Machines. In: International Symposium on Field-Programmable Gate Arrays. (1997)
7. Lee, J., Choi, K., Dutt, N.D.: Compilation Approach for Coarse-Grained Reconfigurable Architectures. IEEE Design and Test of Computers, Special Issue on Application Specific Processors 20 (2003)
8. Halambi, A., Grun, P., Ganesh, V., Khare, A., Dutt, N., Nicolau, A.: EXPRESSION: A Language for Architecture Exploration through Compiler/Simulator Retargetability. In: Proceedings of the European Conference on Design, Automation and Test (DATE 99). (1999) 485–490
9. Wilson, R.P., French, R.S., Wilson, C.S., Amarasinghe, S.P., Anderson, J.A.M., Tjiang, S.W.K., Liao, S.W., Tseng, C.W., Hall, M.W., Lam, M.S., Hennessy, J.L.: SUIF: An Infrastructure for Research on Parallelizing and Optimizing Compilers. SIGPLAN Notices 29 (1994) 31–37
10. Hauck, S.: Configuration Prefetch for Single Context Reconfigurable Coprocessors. In: Proceedings of the Sixth ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA '98), ACM Press (1998) 65–74
11. Sudhir, S., Nath, S., Goldstein, S.C.: Configuration Caching and Swapping. In Brebner, G., Woods, R., eds.: Proceedings of the 11th International Conference on Field Programmable Logic (FPL 2001). Volume 2147 of Lecture Notes in Computer Science, Springer Verlag (2001) 192–202
Component-Based Hardware-Software Co-design
Péter Arató, Zoltán Ádám Mann, and András Orbán
Budapest University of Technology and Economics
Department of Control Engineering and Information Technology
H-1117 Budapest, Magyar tudósok körútja 2, Hungary
Phone: +36 14632487, Fax: +36 14632204
[email protected], {Zoltan.Mann,Andras.Orban}@cs.bme.hu
Abstract. The unbelievable growth in the complexity of computer systems poses a difficult challenge for system design. To cope with this problem, new methodologies are needed that allow the reuse of existing designs in a hierarchical manner, and at the same time let the designer work on the highest possible abstraction level. Such reusable building blocks are called components in the software world and IP (intellectual property) blocks in the hardware world. Based on the similarity between these two notions, the authors propose a new system-level design methodology, called component-based hardware-software co-design, which allows rapid prototyping and functional simulation of complex hardware-software systems. Moreover, a tool supporting the new design methodology is presented, and a case study is shown to demonstrate the applicability of the concepts.
1
Introduction
The requirements towards today's computer systems are tougher than ever. Parallel to the growth in complexity of the systems to be designed, the time-to-market pressure is also increasing. In most applications, it is not enough for the product to be functionally correct, but it has to be cheap, fast, and reliable as well. With the widespread use of mobile systems and the advent of ubiquitous computing, size, heat dissipation and energy consumption [1] are also becoming crucial aspects for a wide range of computer systems, especially embedded systems. Taking all of these aspects into account in the design process is becoming next to impossible. According to the International Technology Roadmap for Semiconductors [2], the most crucially challenged branch of the computer industry is system design. The Roadmap clearly declares that Moore's law can hold on for the next decades only if innovative new ways of system design are proposed to handle the growing complexity.
This work has been supported by the European Union as part of the EASYCOMP project (IST-1999-14151) and by OTKA T 043329.
Embedded systems have become a part of our lives in the form of consumer electronics, cell phones, smart cards, car electronics etc. These computer systems consist of both hardware and software; they together determine the operation of the system. The differences between hardware and software and their interaction contribute significantly to the above-mentioned huge complexity of systems. On the other hand, the similarities between hardware and software design open many possibilities for their optimized, synergetic co-design. This is the motivation for hardware-software co-design (HSCD) [3]. To address the above problems, different, but in many ways similar solutions have been developed in the software and hardware world. 1.1
Solutions in the Software World
Traditionally, the focus of software engineering has been on flexibility, code readability and modifiability, maintainability etc. This has led to the notions of separation of concerns, information hiding, decoupling, and object-orientation. In recent years though, as a result of the growing needs, the reuse of existing pieces of design or even code has received substantial attention. Examples of such efforts include design and analysis patterns, aspect-oriented programming, software product lines, and component-based software engineering [4]. Unfortunately, the definition of a component is not perfectly clear. There are several different component models, such as for instance the CORBA component model or the COM component model. Each of these component models defines the notion of a component slightly differently. However, these definitions have much in common: a component is a piece of adaptable and reusable code that has a well-defined functionality and a well-defined interface, and can be composed with other components to form an application. Components are often sold by third-party vendors, in which case we talk about COTS (commercial off-the-shelf) components. Each component model defines a way for the components, which might be very different in programming language, platform or vendor, to interact with each other. The component models are also often supported by middleware, which provides commonly needed services to the components, such as support for distribution, naming and trading services, transactions, persistence etc. As a result, the middleware can provide transparency (location transparency, programming language transparency, platform transparency etc.), which facilitates the development of distributed component-based software systems enormously. 1.2
Solutions in the Hardware World
Since the construction of hardware is much more costly and time-consuming than that of software, the idea of reusing existing units and creating new applications out of existing building blocks is much more widely adopted in the hardware world. This process has led from simple transistors to gates, then to simple circuits like flip-flops and registers, and then to more and more complex
building blocks like microprocessors. Today's building blocks perform complex tasks and are largely adaptable. These building blocks are called IP (intellectual property) units [5,6,7,8]. They clearly resemble software components; however, IPs are even less standardized than software components. There are no widely accepted component models such as CORBA or EJB in the hardware world. Another consequence of the high cost of hardware production is that hardware must be carefully tested before it is actually synthesized. Therefore, testing solutions are more mature in the hardware world: e.g. design for testability (DFT) and built-in self test (BIST) are common features of hardware design. Moreover, it is common to use simulation of the real hardware for design and test purposes. 1.3
Convergence
The production costs of hardware units depend very much on the volume of the production. It is by orders of magnitude cheaper to use general-purpose, adaptable hardware elements which are produced in large volumes than special-purpose units. The general-purpose units (e.g. Field Programmable Gate Arrays or microprocessors) have to be programmed to perform the given task. Therefore, when using general-purpose hardware units to solve a given problem, one actually uses software. Conversely, when creating a software solution, one actually uses general-purpose hardware. Consequently, the boundary between adaptable hardware units and software is not very sharp. As already mentioned, hardware is usually simulated from the early phases of the design process. This means that its functionality is first implemented by software. Moreover, there are now tools, for instance the PICO (Program In, Chip Out [9]) system, that can transform software to hardware. Motivated by the above facts, this paper introduces a new system-level design methodology which handles both software and hardware units at a high abstraction level and promotes the concept of reuse by assembling the system from hardware and software building blocks. Note that it is not the intention of this paper to address each system-level synthesis problem emerging during HSCD; our goal is only to highlight the concept of a new system-design approach and to deal with problems special to the new methodology. The paper is organized as follows. Section 2 introduces the proposed new methodology and some related problems. The tool supporting the new concepts is demonstrated in Section 3 and a case study is presented in Section 4. Finally, Section 5 concludes the paper.
2
A New HSCD Methodology
Based on the growing needs towards system design, as well as both the software and hardware industry’s commitment to emphasize reuse as the remedy for
design complexity, we propose a novel HSCD methodology we call component-based hardware-software co-design (CBHSCD). CBHSCD is an important contribution to the Easycomp (Easy Composition in Future Generation Component Systems1 ) project of the European Union. The main goal of CBHSCD is to assemble the system from existing pre-verified building blocks, allowing the designer rapid prototyping [10,11] at a very high level of abstraction. At this abstraction level components do not know any implementation details of each other, not even whether the other is implemented as hardware or as software. The behavior of this prototype system can be simulated and verified at an early stage of the design process. CBHSCD supports hierarchical design: the generalized notion of components makes it possible to reuse complex hardware-software systems as components in later designs. (See also Section 2.6.) The main steps of CBHSCD are shown in Fig 1. In the following, each subtask is detailed except for the issues related to synthesis, which are beyond the scope of CBHSCD.
Fig. 1. The process of CBHSCD
2.1
Component Selection
The process starts by selecting the appropriate components2 from a component repository based on the problem specification (of course, the selection of an appropriate component is a challenge in itself [5,12], but it is beyond the scope of this paper to address this problem). From the perspective of CBHSCD it does not matter how the components are implemented: CBHSCD does not aim at replacing or reinventing specific hardware design and synthesis methods or software development methods. Instead, it relies on existing methodologies and best practices, and only complements them with co-design aspects. The used
1 www.easycomp.org
2 We use the term component to refer to a reusable building block, which might be hardware, software, or the combination of both in hierarchical HSCD.
components might include pure software and pure hardware components, but mixed components are also allowed, as well as components which exist in both hardware and software. In the latter case the designer does not have to decide in advance which version to use (only the functionality is considered), but this will be subject to optimization in the partitioning phase (see Section 2.4). 2.2
Composition
After the components are selected, they are composed to form a prototype system. Each component provides an interface for the outside world. The specification of this interface is either delivered with the component or, if the component model provides a sufficient level of reflection, generated automatically. One of the important contributions of CBHSCD is that the composition of components is based on remote method calls between components supported by the underlying middleware. To handle all components, including the hardware components, uniformly, a wrapper should be designed around the device driver communicating directly with the hardware. The task of this wrapper is to produce a software-like interface for the hardware component, to delegate the calls and the parameters to the device driver, and to trigger an event when a hardware interrupt occurs. The device driver and the wrapper together hide all hardware-specific details including port reads/writes, direct memory access (DMA) etc.: these are all done inside the wrapper and the device driver, transparently for other components. As a consequence, hardware components can also participate in remote method calls, both as initiator and as acceptor. Composition is supported by a visual tool that provides an intuitive graphical user interface (GUI) as well as an easy-to-use interconnection wizard. This ease of use helps to overcome problems related to the learning curve, since traditionally system designers have had to possess professional knowledge on hardware, software and architectural issues; thus, the lack of qualified system designers has been a critical problem. The simple composition also allows for easy rapid prototyping of complex hardware-software systems. 2.3
Simulation and Validation
Since the application has been composed of tested and verified components, only the correctness of the composition has to be validated by simulation. The individual units are handled as black-box components in this phase and only functional simulation is carried out. For instance, if a calculation is required from a hardware component, one would only monitor the final result passed back to the initiator component and not the individual steps taken inside the hardware. If problems are detected, the component selection and/or composition steps can be reviewed. It is even possible to simulate parts of the system, so that problems can be detected before the whole system is composed. It is important to note that components are living and fully operable at composition time (e.g. a button can be pressed and it generates events), hence the
application can be tried out by simply triggering an event or sending a start signal to a component. This helps validate the system enormously. Since the design is only in an early prototyping phase, it is possible that the (expensive) hardware components are not available at this stage3. If the hardware component is already available and the component has been assigned to hardware, it can be used already in the simulation phase. However, it is possible that we want to synthesize or buy the hardware component only if it is surely needed. In this case, we need software simulation. If a software equivalent of the hardware component is available (e.g. if the hardware is synthesized from a software description, which is often the case, or if the hardware performs a well-known algorithm that is also implemented in software), then this software equivalent can be used for simulation. Even if a complete software equivalent is not available, there might at least be interface-equivalent software, e.g. if the IP vendor provides C code to specify the interface of its product. Also, if the description of the hardware is available in a hardware description language such as VHDL, a commercial hardware simulator can be used. However, we can assume that sooner or later all IP vendors will provide some kind of formal description of their products which is suitable for functional simulation [5]. Related work includes the embedded code deployment and simulation possibilities of Matlab (http://www.mathworks.com) and the Ptolemy project (http://ptolemy.eecs.berkeley.edu/). 2.4
Partitioning
After the designer is convinced that the system is functionally correct, the system has to be partitioned, i.e. the components have to be mapped to software and hardware. (There can be components which only exist in hardware or in software, so that their mapping is trivial.) This is an important optimization problem, in which the optimal trade-off between cost and performance has to be found. Traditionally, this has been the task of the system designer, but manual partitioning is very time-consuming and often yields sub-optimal solutions. CBHSCD, on the other hand, makes it possible to design the system at a very high level, concentrating only on functionality. This frees the designer from dealing with low-level implementation issues. Partitioning is automated based on a declarative requirements specification. We defined a graph-theoretic model for the partitioning problem [13,14], and there are other partitioning algorithms in the literature, see e.g. [15,16,17] and references therein. The partitioning algorithm has to take into account the software running times, hardware costs (price, area, heat dissipation, energy consumption etc.), and communication costs between components, as well as possible constraints defined by the user (including soft and hard real-time constraints, area constraints etc.). This is very helpful for the design of embedded systems, especially real-time systems. When limiting the running time, partitioning aims at minimizing costs, which are largely the hardware costs.
3 Before partitioning it is not even known for each component whether it will be realized in software or hardware.
Similarly, when costs are limited, the running time is minimized, which is essentially the running time of the software plus the communication overhead. It is also possible to constrain both running time and costs, in which case it has to be decided whether there is a system that fulfills all these constraints, and in the case of a positive answer, such a partition has to be found. Generating all the input data for the partitioning algorithm is rather challenging. In the case of hardware costs, it is assumed that the characteristic values of the components are provided with the component itself by the vendor. Communication costs are estimated based on the amount of exchanged data and the communication protocol, for which there might be several possibilities with different cost and performance. Concerning the running times, a worst-case (if hard real-time constraints are specified) or average-case running time is either provided with the component or extracted by some profiling technique. An independent research field deals with the measurement or estimation of these values, see e.g. [18,19]. The time and cost constraints must be specified explicitly by the designer via use-cases (see Section 3 for more details). 2.5
Consistency
One of the main motivations of CBHSCD is to raise the abstraction level to a point where the boundary between hardware and software vanishes. Since components interacting with each other are not aware of the context of the other (only the interface is known), the change of implementation should be transparent to the others. This implies two consistency problems special to partitioning in CBHSCD.
Interface consistency. The components subject to partitioning are available both in software and in hardware. There is an interface associated with each such pair, which describes the necessary methods and attributes the implementations should provide in order to allow a transparent change between them. It must be checked whether both implementations realize this interface. (For related work see e.g. [20].)
State consistency. The prototype system can be repartitioned several times during the design process. Each time, to realize a transparent swap between implementations, the new implementation should be set to exactly the same state as the current one. (In the case of a long-lasting simulation it may not be feasible to restart the simulation after each swap.) This is not straightforward, because the components are handled as black boxes, and it is not possible to access all the state variables from outside. A number of component models explicitly forbid stateful components to avoid these problems. Our proposed solution to achieve the desired state is to repeat on the new implementation all the method calls that have affected the state of the current implementation since the last swap. (See Section 3 for more details.)
At the end, the system is synthesized, which involves generation of glue code and adapters for the real interconnection of the system, and also the generation
of a test environment and test vectors for real-world testing. However, our main objective was to improve the design process, which is in our opinion the real bottleneck, so that the last phase, which involves implementation steps, is beyond the scope of this paper. 2.6
Hierarchical Design
Hierarchical design [21,22] is an integral part of CBHSCD. It helps in coping with design complexity using a divide-and-conquer approach, and also in enhancing testability. Namely, the system can be composed of well-tested components, and only the composition itself has to be tested, which compresses the test space enormously. Hierarchical design in CBHSCD can be interpreted either in a bottom-up or a top-down fashion. Bottom-up hierarchical design means that a system that has been composed of hardware, software and mixed components4 using CBHSCD methodology can later be used as a component on its own for building even more complex systems. Top-down hierarchical design means that a complex problem is divided into sub-problems, and this decomposition is refined until we get manageable pieces. The identified components can then be realized either based on existing components using CBHSCD methodology or using a traditional methodology if the component has to be implemented from scratch. As a simple example of such a hierarchical design, consider a computation-intensive image-processing application, which consists of a set of algorithms. In order to guarantee some time constraints, one of the algorithms has to be performed by a very fast component. So the resulting system might consist of a general-purpose computer and an attached acceleration board. However, the acceleration board itself might include both non-programmable accelerator (NPA) logic and a very long instruction word (VLIW) processor [9], which performs the less performance-critical operations of the algorithm in software, as the result of a similar design step. 2.7
Communication
Communication between the components is facilitated through a middleware layer, which consists of the wrappers for the respective component types, as well as support for the naming of components, the conversion of data types and the delivery of events and method calls. This way we can achieve hardware-software transparency much in the same way as middleware systems for distributed software systems achieve location and implementation transparency. Consequently, the communication between hardware and software becomes very much like remote procedure calls (RPC) in distributed systems. The resulting architecture is shown in Fig 2.
4 Clearly, pure hardware and pure software components are just the two extremes of the general component notion. Generally, components can realize different cost/performance trade-offs ranging from the cheapest but slowest solution (pure software) to the most expensive but fastest solution (pure hardware).
Fig. 2. Communication between a COTS software component (COM component in this example) and a hardware unit. The dotted line indicates the virtual communication, the full line the real communication.
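As an illustration of the wrapper structure of Fig. 2, a hardware wrapper might be sketched in Java roughly as follows. This is only an assumed, simplified model: the class, interface and method names (DeviceDriver, ResultListener, writeCommand, onInterrupt) are hypothetical and not part of the actual CWB/CWB-X API.

import java.util.ArrayList;
import java.util.List;
import java.util.function.LongConsumer;

// Hypothetical sketch of a hardware wrapper: it offers a software-like interface,
// delegates calls and parameters to the device driver, and turns hardware
// interrupts into events for registered components.
public class HardwareComponentWrapper {

    /** Minimal stand-in for the device driver talking directly to the hardware. */
    public interface DeviceDriver {
        void writeCommand(int command);          // e.g. port write / DMA setup
        void onInterrupt(LongConsumer handler);  // install an interrupt handler
    }

    public interface ResultListener { void newResult(long value); }

    private final DeviceDriver driver;
    private final List<ResultListener> listeners = new ArrayList<>();

    public HardwareComponentWrapper(DeviceDriver driver) {
        this.driver = driver;
        // a hardware interrupt is delivered as an event to all registered listeners
        driver.onInterrupt(value -> listeners.forEach(l -> l.newResult(value)));
    }

    public void start() { driver.writeCommand(1); } // delegate the method call
    public void stop()  { driver.writeCommand(0); }

    public void addResultListener(ResultListener l) { listeners.add(l); }
}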
The drawback of this approach is the large communication overhead introduced by the wrappers and the middleware layer in general. However, this is only problematic if the communication between hardware and software involves many calls, which is not typical. Most often, a hardware unit is given an amount of data on which it performs computation-intensive calculations and then it returns the results. In such cases, if the amount of computation is sufficiently large, the communication overhead can be reduced. However, the flexible but complicated wrapper structure is only used in the design phase, and it is replaced by a simpler, faster, but less flexible communication infrastructure in the synthesis phase. There are standard methodologies for that task, see e.g. [23,7].
3
CWB-X: A Tool for CBHSCD
Our tool to support CBHSCD is an extension of a component-based software engineering tool called Component Workbench (CWB), which has been developed at the Vienna Technical University in the Easycomp project [24]. CWB is a graphical design tool implemented in Java for the easy composition of applications from COTS software components. The main contribution of CWB is the support for composition of components from different component models, like COM, CORBA, EJB etc. To achieve this, CWB uses a generic component model called Vienna Composition Framework (VCF) which handles all existing component models similarly. This generic model offers a very flexible way to represent components, hence all existing software component models can be transformed to this one by means of wrappers. In the philosophy of CWB, each component is associated with a set of features. A feature is anything a component can provide. A component can declare the features it supports and new features can also be added to the CWB. The most typical features are the following. Property. The properties (attributes) provided by the component. Method. The methods of the component. Eventset. The set of events the component can emit.
Lifecycle. If a component has this feature, then it can be created and destroyed, activated and deactivated.
GUI. The graphical interface of the component.
Each component model is implemented as a plug-in in the CWB (see Fig 3). The plug-in class only provides information about the features the component can provide; the real functionality is hidden in the classes implementing the features. As the name suggests, new plug-ins can be added to the CWB, that is, new component models can be implemented. To do that, a new plug-in class and a class representing the required features have to be implemented. These classes realize the wrapper between the general component model of VCF and the specific component model.
Fig. 3. The architecture of the CWB.
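To make the plug-in idea more concrete, the following Java sketch shows how a plug-in might only declare the features of a component while the actual functionality is delegated to feature objects. The interfaces and names below are hypothetical stand-ins, not the actual VCF/CWB classes.

import java.util.Arrays;
import java.util.List;

// Illustrative sketch: the plug-in class only declares which features a component
// provides; real behavior lives in the classes implementing the features.
interface Feature { String name(); }

interface ComponentModelPlugin {
    List<String> featureNames(Object component); // e.g. "Property", "Method", "Eventset"
    Feature getFeature(Object component, String featureName);
}

class ExamplePlugin implements ComponentModelPlugin {
    public List<String> featureNames(Object component) {
        return Arrays.asList("Property", "Method", "Eventset", "Lifecycle");
    }
    public Feature getFeature(Object component, String featureName) {
        return () -> featureName; // a real plug-in would return a wrapper object here
    }
}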
For the communication between the components, CWB offers multiple communication styles. One of the most important communication styles supported by CWB is event-to-method communication, i.e. a component triggers an event which induces a method call in all registered components. The registration mechanism and the remote method call are supported by Java. A wizard helps the user to set up a proper connection. New communication styles can also be added to the CWB. The used components are already operable at composition time. This is very advantageous because this way the simulation and evaluation of the system are possible already in the early phases of the design process. Also, the user can invoke methods of the components, thus use-cases or call sequences can be tested without any programming effort.
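A minimal Java sketch of the event-to-method style, assuming hypothetical names (EventToMethodConnection, fire, register) that merely illustrate the mechanism described above:

import java.util.ArrayList;
import java.util.List;

// A source component fires an event; the connection created by the wizard then
// invokes a method on every registered target component.
class EventToMethodConnection {
    interface EventListener { void onEvent(Object payload); }

    private final List<EventListener> targets = new ArrayList<>();

    void register(EventListener target) { targets.add(target); }

    // called by the source component when it triggers an event
    void fire(Object payload) {
        for (EventListener t : targets) t.onEvent(payload);
    }
}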
3.1 Extension of CWB to Support CBHSCD
CWB offers a good starting point for a hardware-software co-design tool because of its flexibility and extensibility. We extended CWB to support CBHSCD principles. In CWB-X (CWB eXtended), the designer of a hardware-software
application may select software, hardware, and so-called partitionable components from a repository. The latter are components with two implementations of the same behavior. These components can originate from different vendors and different component models, including hardware and software. The selected component is put on the working canvas. In the case of pure software components, the operable component itself (with possible GUI) can appear, but in the case of hardware components the component itself might not be available and some kind of simulation is used. The designer can choose between different simulation levels, as already discussed. To enable the integration of hardware components in CWB-X, new component models are added to the CWB as plug-ins. Similarly to the software side, there is a need for several hardware component models according to the different ways the actual hardware might be connected to the computer. This goal is complicated by the lack of widely accepted industry standards for IP interface and communication specification. Since the implementation details of a component should be transparent to the other components, the hardware components should provide similar features to the software ones. Therefore we define the Method, Property and Eventset features for hardware components as well, and map methods to operations of the underlying hardware, properties to status information and initial parameters, and events to hardware interrupts. To identify the features a hardware component can provide [5], reflection is necessary, i.e. information about the interface of the component. Today's IP vendors do not offer a standardized way to do that; often a simple text description is attached to the IP. In our model we require a hardware component to provide a description of its features (Properties, Methods, Events). The composition of components is supported by wizards; the wizard parses the component's features and allows the connection according to the selected communication style. Due to the wrappers, hardware components act the same way as software ones, so the wizards of the CWB can be used. When the architecture of the designed application is ready, partitioning is performed. We have integrated a partitioning algorithm [13] based on integer linear programming (ILP). This is not an approximation algorithm: it finds the exact optimum. This approach can handle systems with several hundreds of components in acceptable time. For the automatic partitioning process, the various cost parameters and the time constraints must be specified. Time constraints are defined on the basis of use-cases. Each use-case corresponds to a specific usage of the system, typically initiated by an entity outside the system. A use-case involves some components of the system in a given order. A component can also participate multiple times in a use-case. The designer defines a use-case by specifying the sequence of components affected in it and gives a time constraint for the sum of the execution times of the concerned components, including communication. The constraints for all use-cases are simultaneously taken into account during partitioning. The measurement of running time and
communication cost parameters is at an initial stage in our tool; currently we expect that this data is explicitly given by the designer. CWB-X is able to check both interface and state consistency. To each partitionable component a Java-like interface is attached which describes the required features of the implementations. The tool checks whether the associated implementations are appropriate. Furthermore, to each method in this interface description file an attribute is assigned, which describes the behavior of this method in the state consistency check. The values and meanings of the attribute are the following:
NO_SIDE_EFFECT: the corresponding method has no effect on the state of the component, thus it should not be repeated after repartition.
REPEAT_AT_REPARTITION: the corresponding method affects the state but has no side effect, thus it should be repeated after repartition.
REPEAT_AT_REPARTITION_ONCE: the same as the previous one, but in a sequence of calls of this method only the last one should be repeated. An example is setting a property to a value.
SIDE_EFFECT: the corresponding method affects the state and also has some side effect (e.g. sends 100 pages to the printer) or takes too long to repeat.
The system logs every method call and property change since the last implementation swap. If all of these belong to the first three categories, the correct state will be set automatically after the change of the implementations by repeating the appropriate function calls. If there is at least one call with SIDE_EFFECT, the system shows a warning and asks the designer to decide which method calls to repeat. The designer is supported by a detailed log in this decision.
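The replay of logged calls after a repartition can be illustrated with the following Java sketch. It is an assumed, simplified model of the mechanism described above, not the actual CWB-X code; the RepartitionLog, Call and replayOnNewImpl names are hypothetical.

import java.util.ArrayList;
import java.util.List;

// Calls on the current implementation are logged together with their attribute;
// after a repartition the appropriate calls are repeated on the new implementation.
class RepartitionLog {
    enum Attribute { NO_SIDE_EFFECT, REPEAT_AT_REPARTITION, REPEAT_AT_REPARTITION_ONCE, SIDE_EFFECT }

    record Call(String method, Attribute attribute, Runnable replayOnNewImpl) {}

    private final List<Call> log = new ArrayList<>();

    void record(Call call) { log.add(call); }

    /** Returns false if a SIDE_EFFECT call was logged, i.e. the designer must decide. */
    boolean replay() {
        if (log.stream().anyMatch(c -> c.attribute() == Attribute.SIDE_EFFECT)) return false;
        for (int i = 0; i < log.size(); i++) {
            Call c = log.get(i);
            if (c.attribute() == Attribute.NO_SIDE_EFFECT) continue;
            // ..._ONCE calls: repeat only the last call of a sequence of the same method
            if (c.attribute() == Attribute.REPEAT_AT_REPARTITION_ONCE && hasLaterCall(i, c.method())) continue;
            c.replayOnNewImpl().run();  // repeat the call on the new implementation
        }
        log.clear();
        return true;
    }

    private boolean hasLaterCall(int index, String method) {
        for (int j = index + 1; j < log.size(); j++)
            if (log.get(j).method().equals(method)) return true;
        return false;
    }
}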
4
Case Study
In this section the CBHSCD methodology is demonstrated step by step on a small example application. In this example the frequency of an unknown source signal has to be measured. This task might appear in several real-world applications, like mobile phone technology, hence this system can be used as a building block in later designs. The architecture of the example can be seen in Fig 4. The frequency measurer (FM) measures the signal of the generator and sends the measured value periodically to the PC through the serial port. The PC, on the one hand, displays the current frequency value and plots a graph of how the value changes and, on the other hand, controls the measurer through start and stop signals. There are two implementations available for the FM: the first is a programmable PIC 16F876 microcontroller, regarded as the software implementation, and the second is an FPGA on a XILINX VIRTEX II XC2V1000 card, as the hardware implementation. The two implementations behave exactly the same way, but their performance (and cost) is different. The microcontroller is able to precisely measure frequencies up to 25KHz (taking a sample lasts 40µs). The FPGA, on the other hand, can take a sample in 50ns, thus it can measure up to 20MHz without any problem.
Fig. 4. The architecture of the example application
There are five components in this example: two JavaBeans buttons (start and stop), a TextField and a chart component for display, and the FM, declared as a partitionable component with the two implementations detailed above5. Both implementations belong to the component model whose device driver is able to communicate with the devices through the serial port. For consistency purposes, the interface in Fig 5 is provided with the component. The device driver is wrapped by a CWB wrapper providing a software-like interface. The tool checks whether the interfaces of the wrappers match the requirements.
package frequency;
public interface FrequencyEstimatorInterface {
  SIDE_EFFECT public void start();
  SIDE_EFFECT public void stop();
  NO_SIDE_EFFECT public void takeOneSample();
  NO_SIDE_EFFECT public String getMeasuredFrequencyString();
  NO_SIDE_EFFECT public Integer getMeasuredFrequency();
  REPEAT_AT_REPARTITION_ONCE public void setCountEveryEdge(boolean b);
  NO_SIDE_EFFECT public boolean getCountEveryEdge();
}
Fig. 5. Part of the required interface with state consistency attributes of the partitionable frequency measurer (FM) component
In the composition phase, the start and the stop button should be mapped, with the aid of the mentioned wizard, to the start and stop methods of the FM, respectively. The FM sends an interrupt whenever a new measured value arrives. This interrupt appears as an event in CWB-X; this event triggers the setText function of the TextField and the addValue function of the chart. The system can be immediately simulated without any further effort: after pressing
5 The signal generator is regarded as an outside source, hence not part of the system.
the start button, the current implementation of the FM starts measuring the signal of the generator and the PC displays the measured values. The task of partitioning will be to decide which implementation to use according to the time requirements of the system. The designer defines a use-case which declares a time-limit for the takeOneSample function of the FM. In this simple case the optimal partition is trivial6: if the time-limit is under 40µs, the FPGA should be used, otherwise the microcontroller (here we assume that programming the microcontroller is cheaper than producing the FPGA). The partitioner finds this solution and changes the implementation if necessary. The new implementation will be transformed to the same state as the current one according to the steps detailed in Section 3.
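The trivial decision in this case study can be written out as a small worked example, using the sample times given above (40µs per sample for the microcontroller, 50ns for the FPGA) and the stated cost assumption. This brute-force check is only an illustration of the use-case constraint, not the ILP-based partitioner integrated in CWB-X.

// Worked sketch of the partitioning decision for the frequency-measurer use-case.
static String chooseImplementation(double timeLimitNanos) {
    final double MICROCONTROLLER_SAMPLE_NS = 40_000; // 40 us per sample
    final double FPGA_SAMPLE_NS = 50;                // 50 ns per sample
    if (timeLimitNanos >= MICROCONTROLLER_SAMPLE_NS) {
        return "microcontroller"; // meets the time-limit at lower (assumed) cost
    } else if (timeLimitNanos >= FPGA_SAMPLE_NS) {
        return "FPGA";
    }
    return "infeasible"; // no available implementation satisfies the constraint
}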
5
Conclusion
In this paper, we have described a new methodology for hardware-software co-design, which emphasizes reuse, a high abstraction level, design automation, and hierarchical design. The new methodology, called component-based hardware-software co-design (CBHSCD), unifies component-based software engineering and IP-based hardware engineering practices. It supports rapid prototyping of complex systems consisting of both hardware and software, and helps in the design of embedded and real-time systems. The concepts of CBHSCD, as well as partitioning, enable advanced tool support for the system-level design process. Our tool CWB-X is based on the Component Workbench (CWB), a visual tool for the composition of software components of different component models. CWB-X extends the CWB with new component models for hardware components as well as partitioning and consistency checking functionality. We presented a case study to demonstrate the applicability of our concepts and the usefulness of our tool. We believe that the notion of CBHSCD unifies the advantages of hardware and software design into a synergetic system-level design methodology, which can help in designing complex, reliable and cheap computer systems rapidly.
References
1. H. Lekatsas, W. Wolf, and J. Henkel. Arithmetic coding for low power embedded system design. In Data Compression Conference, pages 430–439, 2000.
2. A. Allan, D. Edenfeld, W. H. Joyner Jr., A. B. Kahng, M. Rodgers, and Y. Zorian. 2001 Technology Roadmap for Semiconductors. IEEE Computer, 35(1), 2002.
3. R. Niemann. Hardware/Software Co-Design for Data Flow Dominated Embedded Systems. Kluwer Academic Publishers, 1998.
4. George T. Heineman and William T. Councill. Component Based Software Engineering: Putting the Pieces Together. Addison-Wesley, 2001.
5. G. Martin, R. Seepold, T. Zhang, L. Benini, and G. De Micheli. Component selection and matching for IP-based design. In Proceedings of the DATE 2001 on Design, Automation and Test in Europe. IEEE Press, 2001.
6 Generally, the partitioning problem is NP-hard.
6. Ph. Coussy, A. Baganne, and E. Martin. A design methodology for integrating IP into SoC systems. In Conférence Internationale IEEE CICC, 2002.
7. P. Chou, R. Ortega, K. Hines, K. Partridge, and G. Borriello. IPChinook: an integrated IP-based design framework for distributed embedded systems. In Design Automation Conference, pages 44–49, 1999.
8. F. Pogodalla, R. Hersemeule, and P. Coulomb. Fast prototyping: a system design flow for fast design, prototyping and efficient IP reuse. In CODES, 1999.
9. V. Kathail, S. Aditya, R. Schreiber, B. R. Rau, D. C. Cronquist, and M. Sivaraman. PICO: automatically designing custom computers. IEEE Computer, 2002.
10. G. Spivey, S. S. Bhattacharyya, and Kazuo Nakajima. Logic Foundry: A rapid prototyping tool for FPGA-based DSP systems. Technical report, Department of Computer Science, University of Maryland, 2002.
11. Klaus Buchenrieder. Embedded system prototyping. In Tenth IEEE International Workshop on Rapid System Prototyping, 1999.
12. P. Roop and A. Sowmya. Automatic component matching using forced simulation. In 13th International Conference on VLSI Design. IEEE Press, 2000.
13. Z. Á. Mann and A. Orbán. Optimization problems in system-level synthesis. 3rd Hungarian-Japanese Symposium on Discrete Mathematics and Its Applications, 2003.
14. P. Arató, S. Juhász, Z. Á. Mann, A. Orbán, and D. Papp. Hardware/software partitioning in embedded system design. In Proceedings of the IEEE International Symposium on Intelligent Signal Processing, 2003.
15. N. N. Binh, M. Imai, A. Shiomi, and N. Hikichi. A hardware/software partitioning algorithm for designing pipelined ASIPs with least gate counts. In Proceedings of the 33rd Design Automation Conference, 1996.
16. B. Mei, P. Schaumont, and S. Vernalde. A hardware/software partitioning and scheduling algorithm for dynamically reconfigurable embedded systems. In Proceedings of ProRISC, 2000.
17. T. F. Abdelzaher and K. G. Shin. Period-based load partitioning and assignment for large real-time applications. IEEE Transactions on Computers, 49(1), 2000.
18. X. Hu, T. Zhou, and E. Sha. Estimating probabilistic timing performance for real-time embedded systems. IEEE Transactions on VLSI Systems, 9(6), 2001.
19. S. L. Graham, P. B. Kessler, and M. K. McKusick. An execution profiler for modular programs. Software Practice & Experience, 13:671–685, 1983.
20. A. Speck, E. Pulvermüller, M. Jerger, and B. Franczyk. Component composition validation. International Journal of Applied Mathematics and Computer Science, pages 581–589, 2002.
21. G. Quan, X. Hu, and G. Greenwood. Preference-driven hierarchical hardware/software partitioning. In Proceedings of the IEEE/ACM International Conference on Computer Design, 1999.
22. R. P. Dick and N. K. Jha. MOGAC: A multiobjective genetic algorithm for hardware-software co-synthesis of hierarchical heterogeneous distributed embedded systems. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 17(10):920–935, 1998.
23. A. Basu, R. Mitra, and P. Marwedel. Interface synthesis for embedded applications in a co-design environment. In 11th IEEE International Conference on VLSI Design, pages 85–90, 1998.
24. Johann Oberleitner and Thomas Gschwind. Composing distributed components with the component workbench. In Proceedings of the Software Engineering and Middleware Workshop (SEM2002). Springer Verlag, 2002.
Cryptonite – A Programmable Crypto Processor Architecture for High-Bandwidth Applications
Rainer Buchty, Nevin Heintze, and Dino Oliva
Agere Systems
101 Crawfords Corner Rd, Holmdel, NJ 07733, USA
{buchty|nch|oliva}@agere.com
Abstract. Cryptographic methods are widely used within networking and digital rights management. Numerous algorithms exist, e.g. spanning VPNs or distributing sensitive data over a shared network infrastructure. While these algorithms can be run with moderate performance on general purpose processors, such processors do not meet typical embedded systems requirements (e.g. area, cost and power consumption). Instead, specialized cores dedicated to one or a combination of algorithms are typically used. These cores provide very high bandwidth data transmission and meet the needs of embedded systems. However, with such cores changing the algorithm is not possible without replacing the hardware. This paper describes a fully programmable processor architecture which has been tailored for the needs of a spectrum of cryptographic algorithms and has been explicitly designed to run at high clock rates while maintaining a significantly better performance/area/power tradeoff than general purpose processors. Both the architecture and instruction set have been developed to achieve a bits-per-clock rate of greater than one, even with complex algorithms. This performance will be demonstrated with standard cryptographic algorithms (AES and DES) and a widely used hash algorithm (MD5).
1
Introduction and Motivation
Hardware ASIC blocks are still the only available commercial solution for high-bandwidth cryptography. They are able to meet functionality and performance requirements at comparably low costs and, importantly for embedded systems applications, low power consumption. Their chief limitation is their fixed functionality: they are limited to the algorithm(s) for which they have been designed. In contrast, a general purpose processor is a much more flexible approach and can be used to implement any algorithm. The current generation of these processors has sufficient computing power to provide moderate levels of cryptographic performance. For example, a high-end Pentium PC can provide encryption rates of hundreds of MBits/sec. However, general purpose processors are more than
Rainer Buchty is now a member of the University of Karlsruhe, Institute for Computer Design and Fault Tolerance, Chair for Computer Architecture and Parallel Processing. You can reach him via [email protected]
100x larger and consume over 100x more power than dedicated hardware with comparable performance. A general purpose processor is simply too expensive and too power hungry to be used in embedded applications. Our goal in this paper is to provide a better tradeoff between flexibility and performance/area/power in the context of embedded systems, especially networking systems. Our approach is to develop a programmable architecture dedicated to cryptographic applications. While this architecture may not be as flexible as a general purpose processor, it provides a substantially better performance/area/power tradeoff. This approach is not new. An early example is the PLD001 processor [17]. While this processor is specifically designed for the IDEA and RSA algorithms, it is in fact a microprogrammable processor and could in principle be used for variations of these algorithms. A more recent approach to building a fully programmable security processor is CryptoManiac [28]. The CryptoManiac architecture is based on a 4-way VLIW processor with a standard 32-bit instruction set. This instruction set is enhanced for cryptographic processing [7] by the addition of crypto instructions that combine arithmetic and memory operations with logical operations such as XOR, and embedded RAM for table-based permutation (called the SBOX Cache). These crypto instructions take one to three cycles. The instruction set enhancement is based on an analysis of cryptographic applications. However, the analysis made assumptions about key generation which are not suitable for embedded environments. CryptoManiac is very flexible since it is built on a general purpose RISC-like instruction set. It provides a moderate level of performance over a wide variety of algorithms, largely due to the special crypto instructions. However, the CryptoManiac architecture puts enormous pressure on the register file, which is a centralized resource shared by all functional units. The register file provides 3 source operands and one result operand per functional unit [28], giving a total of at least 16-ports on a 32x32-bit register array, running at 360 MHz. This is still a challenge for a custom design using today’s 0.13µm technology. It would be even more difficult using the 0.13µm library-based ASIC tool flows typical for embedded system chips. Unlike CryptoManiac, Cryptonite was designed from the ground up, and not based on a pre-existing instruction set. Our starting point was an in-depth application analysis in which we decomposed standard cryptographic algorithms down to their core functionality as described in [6]. Cryptonite was then designed to directly implement this core functionality. The result is a simple light-weight processor with a smaller instruction set, a distributed register file with significantly less register port pressure, and a two-cluster architecture to further reduce routing constraints. It natively supports both 64-bit and 32-bit computation. All instructions execute in a single cycle. Memory access and arithmetic functions are strictly separated. Another major difference from CryptoManiac is in the design of specialized instructions for cryptographic operations. CryptoManiac provides instructions that combine up to three standard logic, arithmetic, and memory operations. In contrast, Cryptonite supports several instructions that are much more closely tailored to cryptographic algorithms such as parallel 8-way permutation lookups, parameterized 64-bit/32-bit rotation, and a set of XOR-based fold operations. 
Cryptonite was designed to minimize implementation complexity, including register ports, and the size, number and length of the internal data paths. A companion paper [10] focuses on AES and the software for implementing AES on Cryptonite. It describes novel techniques for AES implementation, the AES-relevant aspects of the Cryptonite architecture, and how AES influenced the design of Cryptonite. In this paper we focus on the overall architecture and design methodology as well as give details of those aspects of the architecture influenced by DES and hashing algorithms such as MD5, including the distributed XOR unit and the DES unit of Cryptonite.
2
Key Design Ideas
Cryptonite was explicitly designed for high throughput. Our approach combines single-cycle instruction execution with a three-stage pipeline consisting of simple stages that can be clocked at a high rate. Our architecture is tailored for cryptographic algorithms, since the more closely an architecture reflects the structure of an application, the more efficiently the application can be run on that architecture. The architecture also addresses system issues. For example, we have designed Cryptonite to generate round keys needed for encryption and decryption within embedded systems. To achieve these goals while keeping implementation complexity to a minimum, Cryptonite employs a number of architectural concepts which will be discussed in this section. These concepts arose from an in-depth analysis of several cryptographic algorithms described in [6], namely DES/3DES [4], AES/Rijndael [9,8], RC6 [21], IDEA [22], and several hash algorithms (MD4 [19,18], MD5 [20], and SHA-1 [5]).
2.1 Two-Cluster Architecture
Most other work on implementing cryptographic algorithms on a programmable processor focuses solely on the core encryption algorithm and does not include round key generation (i.e. the round keys have to be precomputed). For embedded system solutions, however, on-the-fly round key generation is vital because storing/retrieving the round keys for thousands or millions of connections is not feasible. Coarse-grain parallelism can be exploited: round key calculation is usually independent of the core cryptographic operation. For example, in DES [4], the round key generation is completely independent of encryption or decryption; the only (loose) coupling is the point where the round key is fed into the encryption process. Certain coarse-grained parallelism also exists within hash algorithms like MD5 [20] or SHA [5]: these algorithms consist of the application of a non-linear function (NLF) followed by the addition of table-based values. In particular, the hash function's NLF can be calculated in parallel with summing up the table-based values. Our analysis revealed that many algorithms show a similar structure and would benefit from an architecture providing two independent computing clusters. Algorithm analysis further indicated that two clusters are a reasonable compromise between algorithm support and chip complexity. Adding further clusters would rarely speed up computation but would increase silicon area.
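The coarse-grained parallelism in MD5 can be illustrated with the familiar round-1 step operation a = b + ((a + F(b,c,d) + X[k] + T[i]) <<< s). The Java sketch below is purely illustrative (it models the data flow, not Cryptonite instructions): the non-linear function and the table-based sum are independent and could be computed on the two clusters in parallel before being combined.

// One MD5 round-1 step; the two commented sub-expressions have no data dependence
// on each other and can be evaluated concurrently on two clusters.
static int md5Step(int a, int b, int c, int d, int xk, int ti, int s) {
    int nlf = (b & c) | (~b & d);   // cluster 1: non-linear function F(b, c, d)
    int sum = a + xk + ti;          // cluster 2: sum of table-based values
    return b + Integer.rotateLeft(nlf + sum, s);
}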
2.2
XOR Unit
XOR is a critical operation in cryptography1. Several algorithms employ more than one XOR operation per computation round or combine more than two input values. Therefore, using common two-input functions causes an unnecessary sequentiality; a multi-input XOR function would avoid this sequentiality. Such a unit is easy to realize as the XOR function is fast, simple and cheap in terms of die size and complexity. Thus, Cryptonite employs a 6-input XOR unit which can take any number of inputs from one to six. These inputs are the four ALU registers, data coming from the memory unit, result data linked from the sibling ALU, and an immediate value. As the XOR unit can additionally complement its result, it turns into a negation unit when only one input is selected. Signal routing becomes an issue with larger data widths. For this reason the 6-input XOR function was embedded into the data path: instead of routing ALU registers individually into the XOR unit, the XOR function has been partially embedded into the data path from the registers to the XOR unit. Thus only one intermediate result instead of four source values has to be routed across the chip, and routing pressure is removed from the overall design. This reduces die size and increases the speed of operation.
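The behavior of the XOR unit can be summarized by the following minimal Java sketch. It models only the functional behavior under the assumption that the selected operands are identified by a bit mask; the selection mechanism and operand ordering are illustrative, not part of the actual hardware description.

// Functional model: XOR of any subset of up to six inputs, with optional
// complement of the result (a single selected input thus yields negation).
static long xorUnit(long[] inputs, int selectMask, boolean complement) {
    long result = 0;
    for (int i = 0; i < inputs.length && i < 6; i++) {
        if ((selectMask & (1 << i)) != 0) {
            result ^= inputs[i];
        }
    }
    return complement ? ~result : result;
}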
2.3 Parameterizable Permutation Engine
Another basic operation of cryptographic operations is permutation, commonly implemented as a table lookup. In typical hardware designs of DES, these lookup tables are hardwired. However, for a programmable architecture, hardwired permutation tables are not feasible as they would limit the architecture to the provided tables. Supporting several algorithms would require separate tables and hence increase die size. S−box index LAR (providing page)
00 11 11 00 00 11 00 11
11 00 00 11 11 00 00 11 00 11 00 11
1111 0000 0000 1111
111 000 000 111 000 111 000 111 00 11 00 11 11 00 00 11
Resulting Data
111 000 000 111 111 000 111 000
00 11 00 111 000 00011 111 001111 11 0000 00 00 11 111 000 00011 111 001111 11 0000
111 000 00 00 00011 111 0011 11 00 11 000 111 00 00 00011 111 0011 11 00 11
Fig. 1. Vectored Memory Access
1 XOR is a self-invertible operation. We can XOR data and key to generate cipher text and then XOR the cipher text with the key to recover the data.
Instead, a reconfigurable permutation engine is necessary. Cryptonite employs a novel vector memory unit as its reconfigurable permutation engine. Algorithm analysis showed that permutation lookups are mostly done on a per-byte or smaller basis (e.g. DES: 6-bit address, 4-bit output; AES: permutation based on an 8-bit multiplication table with 256 entries). Depending on the input data size and algorithm, up to 8 parallel lookups are performed. In Cryptonite, the vector memory unit receives a vector of indexes and a scalar base address. This is used to address a vector of memories (i.e. n independent memory lookups are performed). The result is a data vector. This differs from a typical vector memory unit which, when given a scalar address, returns a data vector (i.e. the n data elements are sequentially stored at the specified address). In non-vector addressing mode, the memory address used is the sum of a base address (from a local address register) and an optional index value. Each memory in the vector of memories receives the same address, and the results are concatenated to return a 64-bit scalar. The vector addressing mode is a slight modification to this scheme: we mask out the lower 8 bits of the base address provided by a local address register (LAR) and the 64 bits of the index vector are interpreted as eight 8-bit offsets from the base address as illustrated by Figure 1. Cryptonite’s vector memory unit is built from eight 8-bit memory banks.
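The vector addressing mode can be modeled in software as follows. This Java sketch is only an illustration of the addressing scheme described above; the bank depth (matching the 4096-entry local memories), the byte ordering of the index vector, and the wrap-around behavior are assumptions made for the example.

// Software model of the vectored memory access of Fig. 1: eight 8-bit banks are
// indexed in parallel with (masked base address + one 8-bit offset each, taken
// from the 64-bit index vector); the eight result bytes form one 64-bit value.
public final class VectorMemoryModel {
    private final byte[][] banks = new byte[8][4096]; // assumed bank depth

    public long vectorLookup(int baseAddress, long indexVector) {
        int base = baseAddress & ~0xFF;   // mask out the lower 8 bits of the base
        long result = 0;
        for (int i = 0; i < 8; i++) {
            int offset = (int) ((indexVector >>> (8 * i)) & 0xFF);
            int value = banks[i][(base + offset) % banks[i].length] & 0xFF;
            result |= ((long) value) << (8 * i);
        }
        return result;
    }
}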
2.4 AES-Supporting Functions
The vector memory unit described above is important for AES performance. Cryptonite also provides some additional support instructions for AES2. Eight supporting functions are listed in Table 1. Unlike typical DES functions, the AES-supporting functions implement relatively general fold, rotate and interleave functionality and should be applicable to other crypto algorithms.
Table 1. AES-supporting ALU functions
swap(x32, y32): f(x, y) = y | x
upper(x64, y64): f(x, y) = x7 | x3 | y7 | y3 | x6 | x2 | y6 | y2 (indices denote bytes)
lower(x64, y64): f(x, y) = x5 | x1 | y5 | y1 | x4 | x0 | y4 | y0 (indices denote bytes)
rbl_m(x64): f(x) = (x63..32 ≪ (m * 8)) | (x31..0 ≪ ((m + 1) * 8)) (indices denote bits)
rbr_m(x64): f(x) = (x63..32 ≫ (m * 8)) | (x31..0 ≫ ((m + 1) * 8)) (indices denote bits)
xor_rbl_m(x64, y64): f(x, y) = rbl_m(x ⊕ y)
fold(x64, y64): f(x, y) = (x1 ⊕ y0) | (x0 ⊕ y1 ⊕ y0) (indices denote 32-bit words)
ifold(x64, y64): f(x, y) = (x0 ⊕ y1) | (y1 ⊕ y0) (indices denote 32-bit words)
With these functions, it is possible to implement an AES decryption routine in 81 cycles (8 cycles per round, 6 cycles setup, 3 cycles post-processing) and encryption in just 70 cycles (7 cycles per round, 1 cycle setup, 6 cycles post-processing)3. Both routines include on-the-fly round-key generation. [10] elaborates on the AES analysis and how the supporting functions were determined. It also presents the AES implementation on the Cryptonite architecture.
2 We note that the need for such functions mainly arises from AES round key generation.
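Two of the Table 1 functions are modeled below in Java as an illustration. The sketch assumes that "|" in Table 1 denotes concatenation of the upper and lower 32-bit result halves and that rbl rotates each 32-bit half left by the given byte amounts; these interpretations, and the software form itself, are assumptions rather than the hardware definition.

// fold(x, y) = (x1 ^ y0) | (x0 ^ y1 ^ y0), indices denoting 32-bit words
static long fold(long x, long y) {
    int x1 = (int) (x >>> 32), x0 = (int) x;
    int y1 = (int) (y >>> 32), y0 = (int) y;
    int hi = x1 ^ y0;
    int lo = x0 ^ y1 ^ y0;
    return ((long) hi << 32) | (lo & 0xFFFFFFFFL);
}

// rbl_m(x): rotate the upper half left by m bytes, the lower half by (m+1) bytes
static long rbl(long x, int m) {
    int hi = Integer.rotateLeft((int) (x >>> 32), m * 8);
    int lo = Integer.rotateLeft((int) x, (m + 1) * 8);
    return ((long) hi << 32) | (lo & 0xFFFFFFFFL);
}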
Fig. 2. Overview of the Cryptonite architecture
3 The Cryptonite Architecture

A high-level view of Cryptonite is pictured in Figure 2. As mentioned previously, application analysis led to the two-cluster architecture. Each cluster consists of an ALU and its accompanying data I/O unit (DIO) managing accesses to the cluster's local data memory. A crosslinking mechanism enables data exchange between the ALUs of both clusters. The overall system is controlled by the control unit (CU), which parses the instruction stream and generates control signals for all other units. A simple external access unit (EAU) provides an easy method to access or update the contents stored in both local data memories: on external access, the CU puts all other units on hold and grants the EAU access to the internal data paths. The CU also supplies a set of 16 registers for looping and conditional branching. Twelve of these are 8-bit counter registers; the remaining four are virtual registers reflecting the two ALUs' states: we use these registers to realize conditional branches on ALU results such as zero result return (BEQ/BNE or JZ/JNZ) or carry overflow/borrow (BCC/BCS or JC/JNC). The Cryptonite CU is depicted in Figure 3.
Fig. 3. The Cryptonite Control Unit
The use of special-purpose looping registers reduces register port pressure and routing issues. In addition, our application analysis made clear that most cryptographic algorithms have relatively small static loop bounds; data-dependent branching is rare (IDEA being one exception). Finally, the use of special-purpose registers in conjunction with a post-decrement loop counter strategy allows us to reduce the branch penalty to 1 cycle.

3.1 The Cryptonite ALU

Much effort was put into the development of the Cryptonite ALU. Our target clock frequency was 400 MHz in TSMC's 0.13 µm process. To reach this goal, we had to carefully balance the ALU's features (as required by the crypto algorithms) against its complexity (as dictated by technology constraints). One result of this tradeoff is that the number of 64-bit ALU registers in each cluster was limited to four. Based on our application analysis, this was judged sufficient. To compensate for the low register count, each register can be used either as one 64-bit quantity or as two individually addressable 32-bit quantities. The use of a 64-bit architecture was motivated by the requirements of DES and AES as well as by parameterizable algorithms like RC6. To enable data exchange between both clusters, the first register of each ALU is crosslinked with the first register of the other cluster. This crosslink eases register pressure as it allows cooperative computation on both ALUs (this is critical for AES and MD5). To further reduce register pressure, each ALU employs an accumulator for storing intermediate results. The ALU itself consists of the arithmetic unit (AU) and a dedicated XOR unit (XU). The AU provides conventional arithmetic and boolean operations but also specialized functions supporting certain algorithms. It follows the common 3-address model with two source registers and one destination register. These registers are limited to the ALU's accumulator, the four ALU registers, the data input and output registers of the associated memory unit, and an immediate value provided by the CU. The XU may take up to six source operands. In addition, the XU may optionally complement the output of the XOR; thus, with only one source operand, the XU can act as a negation unit.
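The behavior of the multi-input XOR unit can be summarized in a few lines (a simplified functional model, not the actual hardware interface):

#include <stdint.h>
#include <stddef.h>

/* XOR unit (XU): XOR of up to six source operands with an optional
 * complement of the result.  With a single operand and the complement
 * enabled it degenerates to a negation (NOT). */
uint64_t xor_unit(const uint64_t *src, size_t n_src, int complement) {
    uint64_t r = 0;
    for (size_t i = 0; i < n_src && i < 6; i++)
        r ^= src[i];
    return complement ? ~r : r;
}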
Fig. 4. Overview of Cryptonite's ALU
The 64-bit results of AU and XU operations are placed on separate result buses. From these buses, either the upper 32 bits, the lower 32 bits, or the entire 64-bit value can be selected and stored in the assigned register (or register half). It is not possible to combine two individual 32-bit results from both result buses into one 64-bit register. Results may also be forwarded to the data unit. Figure 4 illustrates the Cryptonite ALU with its sub-units.
3.2 The Cryptonite Memory Unit

Access to local data memory is handled by the memory unit. It is composed of an address generation unit (AGU) and a data I/O unit (DIO). The address generation unit is depicted in Figure 5. It generates the address for local memory access using the local address registers (LAR). The AGU contains a small add/sub/and ALU for address arithmetic. This supports a number of addressing modes such as indexed, auto-increment, and wraparound table access, as listed in Table 2. Furthermore, the S-Box addressing mode performs eight parallel lookups into the 64-bit memory with an individual 8-bit index for each lookup. For a detailed description of this addressing mode please refer to Section 2.3. The DIO, shown in Figure 6, contains two buffer registers, the data input and data output registers (DIR and DOR). They buffer data from and to local memory. The DOR can also be used as an auxiliary register by the ALU. The DIR also serves as the S-Box index to the AGU.
Fig. 5. The Address Generation Unit

Table 2. Addressing modes supported by Cryptonite's AGU

  Addressing Mode              Address Computation                         LAR Update
  direct                       addr = LAR                                  -
   ", w/ register modulo       addr = LARx                                 LARx = LARx % LARy
   ", w/ immediate modulo      addr = LARx                                 LARx = LARx % idx
  S-Box                        ∀ 0 ≤ i ≤ 7: addr_i = (LAR ∧ 0x7f00) ∨ idx_i  (LAR unchanged)
  immediate-indexed            addr = LAR                                  LAR = LAR + idx
   ditto, w/ register modulo   addr = LARx                                 LARx = (LARx + idx) % LARy
  register-indexed             addr = LARx                                 LARx = LARx + LARy
   ditto, w/ immediate modulo  addr = LARx                                 LARx = (LARx + LARy) % idx

  Addressing modes written in italics are based on architectural side-effects and have not been designed in on purpose.
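Two of the LAR-updating modes from Table 2 can be sketched as follows (register naming and the update order are assumptions for illustration only):

#include <stdint.h>

/* immediate-indexed with register modulo:
 * addr = LARx;  LARx = (LARx + idx) % LARy  (wraparound table access) */
uint16_t agu_imm_indexed_mod(uint16_t *larx, uint16_t lary, uint16_t idx) {
    uint16_t addr = *larx;
    *larx = (uint16_t)((*larx + idx) % lary);
    return addr;
}

/* register-indexed (auto-increment by another LAR):
 * addr = LARx;  LARx = LARx + LARy */
uint16_t agu_reg_indexed(uint16_t *larx, uint16_t lary) {
    uint16_t addr = *larx;
    *larx = (uint16_t)(*larx + lary);
    return addr;
}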
Fig. 6. Data I/O from embedded SRAM
The DIO also contains a specialized DES unit. Fast DES execution not only requires highly specialized operations but also S-Box access to memory; hence the DES support instructions are integrated into the memory unit rather than into the ALU.

3.3 The Cryptonite DES Unit

As mentioned in Section 3.2, the DES unit has been implemented in the memory unit instead of the ALU. The reason for doing so was to avoid bloating the ALU with functions which are purely DES-specific and cannot be reused for other algorithms. Even the most primitive permutation, compression, and expansion functions required for DES computation are clearly algorithm-specific, as discussed in detail in [6]. In addition to these bit-shuffling functions, DES computation is based on a table-based transposition realized through an S-Box lookup and therefore needs access to the data memory. For this reason, the DES unit has been placed directly into the memory unit rather than incorporated into the ALU. This way, no unnecessary complexity (i.e. die size and signal delay) is added to the ALU. Furthermore, penalty cycles resulting from data transfers from memory to the ALU and back are avoided.
Fig. 7. Cryptonite's DES Unit
The DES unit is pictured in Figure 7. It consists of data (L and R) and key (K) registers, a round counter and a constant memory of 16x2 bits providing the round constant for key shifting. The computation circuitry provides two selectable, monolithic functions performing the following operations:
1. expand R(i-1), shift and compress the key, and XOR the results;
2. permute the S-Box result using P-Box shuffling and XOR this result with L(i-1); forward R(i-1) to L(i).

These functions can be selected independently to enable either the initial computation (function 1) or the final computation (function 2) needed for the first and last round, or back-to-back execution (function 2 followed by function 1) for the inner rounds of computation.
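A rough functional model of how these two selectable functions compose into one inner Feistel round is sketched below; the primitives are placeholder stubs (the real DES permutation tables are omitted), so this illustrates the data flow only, not the Cryptonite microcode:

#include <stdint.h>

/* Placeholder primitives; real DES bit shuffles and S-Boxes omitted. */
static uint64_t expand(uint32_t r)               { return r; }          /* 32 -> 48 bit */
static uint64_t compress_key(uint64_t *k, int rc){ (void)rc; return *k; }
static uint64_t sbox_lookup(uint64_t x)          { return x; }          /* via memory   */
static uint32_t pbox(uint64_t s)                 { return (uint32_t)s; }

/* One inner round = function 1 followed by function 2. */
static void des_round(uint32_t *L, uint32_t *R, uint64_t *key, int round_const) {
    /* function 1: expand R, shift & compress the key, XOR the results */
    uint64_t sbox_addr = expand(*R) ^ compress_key(key, round_const);

    /* S-Box lookup performed through the memory unit */
    uint64_t sbox_out = sbox_lookup(sbox_addr);

    /* function 2: P-Box shuffle, XOR with L, forward R to L */
    uint32_t new_r = pbox(sbox_out) ^ *L;
    *L = *R;
    *R = new_r;
}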
4
Results
Several algorithms were investigated and implemented on a custom architecture simulator. Based on the simulation results, the architecture was fine-tuned to provide minimum cycle count while maintaining maximum flexibility. In particular, the decision to incorporate the DES support instructions within the memory unit instead of the ALU (see Section 3.2) was directly motivated by simulation results. In this section we will now present the simulation results for a set of algorithm implementations which were run on our architecture simulator.
4.1
DES and 3DES
As Cryptonite employs a dedicated DES unit, the results for the DES [4] and 3DES implementations were not surprising. Cryptonite reaches a throughput of 732 MBit/s for DES and 244 MBit/s for 3DES. In contrast, the programmable CryptoManiac processor [28] achieves 68 MBit/s for 3DES. To quantify the tradeoff of programmability versus performance, we give some performance numbers for DES hardware implementations. Hifn's range of cores ([16]: 7711 [11], 7751 [12], 7811 [13], and 790x [14,15]) achieves 143-245 MBit/s for DES and 78-252 MBit/s for 3DES. The OpenCores implementation of DES [27] achieves 629 MBit/s. Arguably the state-of-the-art DES hardware implementation is by SecuCore [26]: SecuCore's high-performance DES/3DES core [24] achieves 2 GBit/s, just a factor of 2.73 better than Cryptonite. These results are summarized in Figure 8.
4.2 Advanced Encryption Standard (AES)

Figure 9 compares the AES performance of Cryptonite against a set of hardware implementations from Amphion [1], Hifn, and SecuCore as well as against the programmable CryptoManiac. Cryptonite running at 400 MHz outperforms a number of hardware implementations by a factor of 1.25 to 2.6.
Fig. 8. DES and 3DES Performance Comparison
Compared with CryptoManiac, Cryptonite shows an almost two times better performance. (We remark that the CryptoManiac results appear to exclude round-key generation, whereas Cryptonite includes it: in the RC4 discussion, [28] mentions the impact of writing back into the key table, but a similar note is missing for the AES implementation, which suggests that only the main encryption algorithm, i.e. excluding round-key generation, was coded; the cycle count of just 9 cycles per round without significant AES instruction support seems consistent with this assumption.) This result justifies our decision to go for simple ALUs providing more specialized functionality. High-performance hardware AES implementations provided by Amphion (CS5210-40 High Performance AES Encryption Cores [2] and CS5250-80 High Performance AES Decryption Cores [3]) and SecuCore (SecuCore AES/Rijndael Core [23]) are able to outperform Cryptonite by a factor of 2.64. In addition, an extremely fast implementation from Amphion is even able to reach 25.6 GBit/s. This performance, however, is paid for with an enormous gate count (10x bigger than that of other hardware solutions), which is why this version has not been included in the comparison chart shown in Figure 9.
4.3 MD5 Hashing Algorithm
Cryptonite's performance on MD5 was 421 MBit/s at a 400 MHz clock speed. It outperforms the Hifn hardware cores (7711 [11], 7751 [12], 7811 [13], and 790x [14,15])
by factors of 1.12 to 7.02. SecuCore's high-performance MD5 core (SecuCore SHA-1/MD5/HMAC Core [25]) is a factor of 2.97 faster than Cryptonite, highlighting the programmability tradeoff. A comparison with CryptoManiac is omitted because the performance of MD5 is not reported in [28]. Figure 10 summarizes the results for MD5.
Fig. 9. AES-128/128 Performance Comparison
Fig. 10. MD5 Performance Comparison
5
Summary
We have presented Cryptonite, a programmable processor architecture targeting cryptographic algorithms. The starting point of this architecture was an in-depth application analysis in which we decomposed standard cryptographic algorithms down to their core functionality. The Cryptonite architecture has several novel features, including a distributed multi-input XOR unit and a parameterizable permutation unit built from a new form of vector-memory block. A central design constraint was simple implementation,
and many aspects of the architecture seek to reduce the port counts on register files, the number and width of internal buses, and the number and size of registers. In contrast, CryptoManiac has a number of implementation challenges, including a heavyweight 16-port register file. We expect the Cryptonite die size to be significantly smaller than that of CryptoManiac. A number of algorithms (including AES, DES, and MD5) were implemented on the architecture simulator with promising results. Cryptonite was able to outperform numerous hardware cores. It outperformed the programmable CryptoManiac processor by factors of between two and three at comparable clock speeds. To determine the tradeoff between programmability and dedicated high-performance hardware cores, Cryptonite was compared to cores from Amphion and SecuCore: these outperform Cryptonite by about a factor of 3.
References

1. Amphion Semiconductor Ltd. Corporate Web Site. 2001. http://www.amphion.com.
2. Amphion Semiconductor Ltd. CS5210-40 High Performance AES Encryption Cores Product Information. 2001. http://www.amphion.com/acrobat/DS5210-40.pdf.
3. Amphion Semiconductor Ltd. CS5250-80 High Performance AES Decryption Cores Product Information. 2002. http://www.amphion.com/acrobat/DS5250-80.pdf.
4. Ronald H. Brown, Mary L. Good, and Arati Prabhakar. Data Encryption Standard (DES) (FIPS 46-2). Federal Information Processing Standards Publication (FIPS), Dec 1993. http://www.itl.nist.gov/fipspubs/fip46-2.html (initial version from Jan 15, 1977).
5. Ronald H. Brown and Arati Prabhakar. FIPS 180-1: Secure Hash Standard (SHA). Federal Information Processing Standards Publication (FIPS), May 1993. http://www.itl.nist.gov/fipspubs/fip180-1.htm.
6. Rainer Buchty. Cryptonite – A Programmable Crypto Processor Architecture for High-Bandwidth Applications. PhD thesis, Technische Universität München, LRR, September 2002. http://tumb1.biblio.tu-muenchen.de/publ/diss/in/2002/buchty.pdf.
7. Jerome Burke, John McDonald, and Todd Austin. Architectural support for fast symmetric-key cryptography. In Proceedings of the 9th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 2000), November 2000.
8. J. Daemen and V. Rijmen. The block cipher Rijndael, 2000. LNCS 1820, Eds: J.-J. Quisquater and B. Schneier.
9. J. Daemen and V. Rijmen. Advanced Encryption Standard (AES) (FIPS 197). Technical report, Katholieke Universiteit Leuven / ESAT, Nov 2001. http://csrc.nist.gov/publications/fips/fips197/fips-197.pdf.
10. Dino Oliva, Rainer Buchty, and Nevin Heintze. AES and the Cryptonite Crypto Processor. In CASES'03 Conference Proceedings, pages 198-209, October 2003.
11. Hifn Inc. 7711 Encryption Processor Data Sheet. 2002. http://www.hifn.com/docs/a/DS-0001-04-7711.pdf.
12. Hifn Inc. 7751 Encryption Processor Data Sheet. 2002. http://www.hifn.com/docs/a/DS-0013-03-7751.pdf.
13. Hifn Inc. 7811 Network Security Processor Data Sheet. 2002. http://www.hifn.com/docs/a/DS-0018-02-7811.pdf.
14. Hifn Inc. 7901 Network Security Processor Data Sheet. 2002. http://www.hifn.com/docs/a/DS-0023-01-7901.pdf.
15. Hifn Inc. 7902 Network Security Processor Data Sheet. 2002. http://www.hifn.com/docs/a/DS-0040-00-7902.pdf.
16. Hifn Inc. Corporate Web Site. 2002. http://www.hifn.com.
17. Jüri Pöldre. Cryptoprocessor PLD001. Master's thesis, June 1998.
18. R. Rivest. RFC 1186: The MD4 Message-Digest Algorithm. October 1990. http://www.ietf.org/rfc/rfc1186.txt.
19. R. Rivest. The MD4 message digest algorithm. Advances in Cryptology - CRYPTO '90 Proceedings, pages 303-311, 1991.
20. R. Rivest. RFC 1321: The MD5 Message-Digest Algorithm, April 1992. http://www.ietf.org/rfc/rfc1321.txt.
21. Ronald R. Rivest, M.J.B. Robshaw, R. Sidney, and Y.L. Yin. The RC6 Block Cipher. August 1998. http://www.rsasecurity.com/rsalabs/rc6/.
22. Bruce Schneier. 13.9: IDEA. Angewandte Kryptographie: Protokolle, Algorithmen und Sourcecode in C, pages 370-377, 1996. ISBN 3-89319-854-7.
23. SecuCore Consulting Services. SecuCore AES/Rijndael Core. 2001. http://www.secucore.com/secucore aes.pdf.
24. SecuCore Consulting Services. SecuCore DES/3DES Core. 2001. http://www.secucore.com/secucore des.pdf.
25. SecuCore Consulting Services. SecuCore SHA-1/MD5/HMAC Core. 2001. http://www.secucore.com/secucore hmac.pdf.
26. SecuCore Consulting Services. Corporate Web Site. 2002. http://www.secucore.com/.
27. Rudolf Usselmann. OpenCores DES Core. Sep 2001. http://www.opencores.org/projects/des/.
28. Lisa Wu, Chris Weaver, and Todd Austin. CryptoManiac: A fast flexible architecture for secure communication. In 28th Annual International Symposium on Computer Architecture (ISCA 2001), June 2001.
STAFF: State Transition Applied Fast Flash Translation Layer

Tae-Sun Chung, Stein Park, Myung-Jin Jung, and Bumsoo Kim

Software Center, Samsung Electronics Co., Ltd., Seoul 135-893, Korea
{ts.chung,steinpark,m.jung,bumsoo}@samsung.com
Abstract. Recently, flash memory has come into wide use in embedded applications since it has strong points: non-volatility, fast access speed, shock resistance, and low power consumption. However, due to its hardware characteristics, it requires a software layer called FTL (flash translation layer). The main functionality of the FTL is to convert logical addresses from the host to physical addresses of flash memory. We present a new FTL algorithm called STAFF (State Transition Applied Fast Flash Translation Layer). Compared to previous FTL algorithms, STAFF shows higher performance and requires less memory. We provide performance results based on our implementation of STAFF and previous FTL algorithms.
1
Introduction
Flash memory has strong points: non-volatility, fast access speed, shock resistance, and low power consumption. Therefore, it is widely used in embedded applications, mobile devices, and so on. However, due to its hardware characteristics, specific software operations are required to use it. The basic hardware characteristic of flash memory is its erase-before-write architecture [4]. That is, in order to update data on flash memory, if the physical location was previously written, it has to be erased before the new data can be written again. Moreover, the size of the memory unit for erasing is not the same as the size of the unit for reading or writing [4] (flash memory produced by Hitachi is an exception: there, the erase unit equals the read/write unit), which is the main source of performance degradation in the overall flash memory system. Thus, a system software layer called FTL (Flash Translation Layer) [2,3,5,7,8,9] is required. The basic scheme of an FTL is as follows: using a logical-to-physical address mapping table, if the physical location corresponding to a logical address has already been written, the input data is written to another, not yet written physical location and the mapping table is updated. In applying an FTL algorithm to real embedded applications, there are two major concerns: the storage performance and the memory requirement.
Regarding storage performance, since flash memory has the special hardware characteristics mentioned above, the overall system performance is mainly affected by the write performance. In particular, as the erase cost is much higher than the write and read cost, the number of erase operations needs to be minimized. Additionally, the memory requirement for mapping information is important in real embedded applications: if an FTL algorithm requires a large amount of memory for mapping information, its cost may not be acceptable in embedded applications. In this paper, we propose a high-speed FTL algorithm called STAFF (State Transition Applied Fast FTL) for flash memory systems. Compared to previous FTL algorithms, our solution shows higher performance and requires less memory. This paper is organized as follows. The problem definition and previous work are described in Section 2. Section 3 presents our FTL algorithm and Section 4 presents performance results. Finally, Section 5 concludes.
2 Problem Definition and Previous Work

2.1 Problem Definition
In this paper, we assume that the sector is the unit of read and write operations, and the block is the unit of the erase operation on flash memory. The size of a block is some multiple of the size of a sector. Figure 1 shows the software architecture of a flash file system; we will consider the FTL layer in Figure 1. The File System layer issues a series of read and write commands, each with a logical sector number, to read data from and write data to specific addresses of flash memory. The given logical sector number is converted to a real physical sector number of flash memory by a mapping algorithm provided by the FTL layer.
Fig. 1. Software architecture of flash memory system
Thus, the problem definition of FTL is as follows. We assume that flash memory is composed of n physical sectors and the file system regards flash memory as m logical sectors. The number m is less than or equal to n.
Definition 1. Flash memory is composed of blocks and each block is composed of sectors. Flash memory has the following characteristic: if a physical sector location on flash memory was previously written, it has to be erased at the granularity of a block before new data can be written there again. The FTL algorithm has to produce the physical sector number in flash memory from the logical sector number given by the file system.
2.2 Previous FTL Algorithms
Sector Mapping. The first, intuitive algorithm is sector mapping [2]. In sector mapping, if there are m logical sectors seen by the file system, the raw size of the logical-to-physical mapping table is m. Figure 2 shows an example of sector mapping. In the example, a block is composed of 4 sectors and there are 16 physical sectors. If we assume that there are 16 logical sectors, the raw size of the mapping table is 16. When the file system issues the command "write some data to LSN (Logical Sector Number) 9", the FTL algorithm writes the data to PSN (Physical Sector Number) 3 according to the mapping table. If the physical sector location on flash memory was previously written, the FTL algorithm finds another sector that has not been written yet. If it finds one, it writes the data to that sector location and changes the mapping table. If it cannot find one, a block has to be erased, the corresponding sectors have to be backed up, and the mapping table has to be changed.
Fig. 2. Sector mapping
Block Mapping. Since the sector mapping algorithm requires a large amount of mapping information, the block mapping FTL algorithm [3,5,8] was proposed. The
basic idea is that the logical sector offset within the logical block corresponds to the physical sector offset within the physical block. In the block mapping method, if there are m logical blocks seen by the file system, the raw size of the logical-to-physical mapping table is m. Figure 3 shows an example of the block mapping algorithm. If we assume that there are 4 logical blocks, the raw size of the mapping table is 4. When the file system issues the command "write some data to LSN (Logical Sector Number) 9", the FTL algorithm calculates the logical block number (9/4 = 2) and the sector offset (9%4 = 1), and then finds the physical block number (1) according to the mapping table. Since the physical sector offset is identical to the logical sector offset (1), the physical sector location can be determined. Although the block mapping algorithm requires only a small amount of mapping information, if the file system issues write commands to the same sector frequently, the performance of the system degrades since all the sectors in the block have to be copied to another block.
Fig. 3. Block mapping
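A minimal sketch of this block-mapping address translation (the table contents are the toy values from the example, not a real device layout):

#include <stdio.h>

#define SECTORS_PER_BLOCK 4

/* Toy logical-to-physical block mapping table: logical block 2 maps to
 * physical block 1 as in the example; the other entries are illustrative. */
static int block_map[4] = {3, 0, 1, 2};

int block_map_translate(int lsn) {
    int lbn    = lsn / SECTORS_PER_BLOCK;   /* logical block number  */
    int offset = lsn % SECTORS_PER_BLOCK;   /* sector offset         */
    int pbn    = block_map[lbn];            /* physical block number */
    return pbn * SECTORS_PER_BLOCK + offset;
}

int main(void) {
    printf("LSN 9 -> PSN %d\n", block_map_translate(9));  /* block 1, offset 1 */
    return 0;
}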
Hybrid Mapping. Since both sector and block mapping have disadvantages, the hybrid technique [7,9] was proposed. The hybrid technique first uses block mapping to find the physical block corresponding to the logical block, and then sector mapping is used to find an available sector within that physical block. Figure 4 shows an example of the hybrid technique. When the file system issues the command "write some data to LSN (Logical Sector Number) 9", the FTL algorithm calculates the logical block number (9/4 = 2) and then finds the physical block number (1) according to the mapping table. After finding the block number, any available sector can be chosen to write the input data. In the example, since the first sector in block 1 is empty, the data is written to the
first sector. In this case, since the logical sector offset within the logical block is not identical to the physical sector offset within the physical block, the logical sector number (9) has to be written to the sector as well. When reading data from flash memory, the FTL algorithm first finds the physical block number from the logical block number according to the mapping table; then, by reading the logical sector numbers stored within the physical block, it can locate and read the requested data.
Fig. 4. Hybrid mapping
Comparison. We can compare the previous FTL algorithms from two points of view: the read/write performance and the memory requirement for mapping information. The read/write performance of the system can be measured by the number of flash I/O operations, since the read/write performance is I/O bound. If we assume that the access cost of the mapping table is zero for each FTL algorithm presented in the previous section (the table resides in RAM), the read/write cost can be measured by the following equations:

    Cread  = x Tr                                              (1)
    Cwrite = p (Tr + Tw) + (1 - p) (Tr + (Te + Tw) + Tc)       (2)
Here, Cread and Cwrite are the read and write costs, respectively; Tr, Tw, and Te are the flash read, write, and erase costs. Tc is the copy cost needed to move the sectors within a block to another free block before erasing and to copy them back after erasing. p is the probability that the write command does not require an erase operation. We assume that the input logical sector within the logical block is mapped to one physical sector within one physical block.
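As a small worked illustration of equation (2), the expected write cost can be evaluated for a few values of p (the cost values are made-up relative units roughly following the 1:7:63 read:write:erase ratio quoted in Section 4.2, and Tc is an assumption):

#include <stdio.h>

#define TR  1.0   /* flash read  */
#define TW  7.0   /* flash write */
#define TE 63.0   /* block erase */
#define TC 28.0   /* assumed copy-back cost, e.g. four sector moves */

/* Cwrite = p*(Tr+Tw) + (1-p)*(Tr + (Te+Tw) + Tc), per equation (2). */
double expected_write_cost(double p) {
    return p * (TR + TW) + (1.0 - p) * (TR + (TE + TW) + TC);
}

int main(void) {
    for (double p = 0.5; p <= 1.0; p += 0.25)
        printf("p = %.2f -> Cwrite = %.1f\n", p, expected_write_cost(p));
    return 0;
}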
In the sector and block mapping methods, the variable x in equation (1) is 1, because the sector location to be read can be found directly through the mapping table. However, in the hybrid technique the value of x lies in 1 ≤ x ≤ n, where n is the number of sectors within a block. This is because the requested data can be read only after scanning the logical sector numbers stored in flash memory. Thus, hybrid mapping has a higher read cost compared to sector and block mapping. For writing, we assume that a read operation is needed before writing to see whether the corresponding sector can be written; thus, Tr is added in equation (2). Since Te and Tc are high costs compared to Tr and Tw, the variable p is the key factor in write performance. Sector mapping has the smallest probability of requiring the erase operation and block mapping has the worst. The other comparison criterion is the memory requirement for mapping information, shown in Table 1. Here, we assume a 128 MB flash device that is composed of 8192 blocks, each block being composed of 32 sectors [4]. In sector mapping, 3 bytes are needed to address all sectors; in block mapping, 2 bytes are necessary. In hybrid mapping, 2 bytes are needed for block mapping and 1 byte for sector mapping within a block. Table 1 shows that block mapping is superior to the others.

Table 1. Memory requirement for mapping information

  Mapping method   Addressing (B: Byte)   Total
  Sector mapping   3 B                    3 B * 8192 * 32 = 768 KB
  Block mapping    2 B                    2 B * 8192 = 16 KB
  Hybrid mapping   2 B + 1 B              2 B * 8192 + 1 B * 32 * 8192 = 272 KB
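The Table 1 figures can be reproduced with a few lines of arithmetic (a sketch using the device geometry assumed above):

#include <stdio.h>

int main(void) {
    const long blocks            = 8192;   /* 128 MB device from Table 1 */
    const long sectors_per_block = 32;

    long sector_map = 3 * blocks * sectors_per_block;              /* 3 B per sector      */
    long block_map  = 2 * blocks;                                  /* 2 B per block       */
    long hybrid_map = 2 * blocks + 1 * blocks * sectors_per_block; /* 2 B + 1 B per block */

    printf("sector mapping: %ld KB\n", sector_map / 1024);   /* 768 KB */
    printf("block mapping:  %ld KB\n", block_map  / 1024);   /*  16 KB */
    printf("hybrid mapping: %ld KB\n", hybrid_map / 1024);   /* 272 KB */
    return 0;
}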
3
STAFF (State Transition Applied Fast FTL)
STAFF is our FTL algorithm for flash memory. The purpose of STAFF is to provide a device driver for flash memory with maximum performance and a small memory requirement.
3.1 Block State Machine
Compared to previous work, we introduce the notion of block states. A block in STAFF has the following states:

– F state: If a block is an F state block, the block is free. That is, the block is erased and has not been written.
– O state: If a block is an O state block, the block is old. The old state means that the contents of the block are not valid any more.
– M state: The M state block is in-order and free. The M state block is the first state reached from a free block, and its data is in place; that is, the logical sector offset within the logical block is identical to the physical sector offset within the physical block.
– S state: The S state block is in-order and full. The S state block is created by the swap merging operation, which is described in Section 3.2.
– N state: The N state block is out-of-order and is converted from an M state block.
Fig. 5. Block state machine
We have constructed a state machine according to the states defined above and the various events that occur during FTL operations. The state machine is formally defined as follows; here, we use the notation for automata from [6]. An automaton is denoted by a five-tuple (Q, Σ, δ, q0, F), and the meanings of the tuple elements are as follows.

– Q is a finite set of states, namely Q = {F, O, M, S, N}.
– Σ is a finite input alphabet; in our definition, it corresponds to the set of various events during FTL operations.
– δ is the transition function which maps Q × Σ to Q.
– q0 is the start state, that is, the free state.
– F is the set of all final states.

Figure 5 shows the block state machine. The initial block state is the F state. When an F state block gets the first write request, it is converted to an M state block. The M state block can be converted to one of the two states S and N according to specific events during FTL operations. The S and N state blocks are converted to the O state block on events e4 and e5, and the O state block is converted to the F state block on event e6. A detailed description of the events is presented in Section 3.2.
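The block states and a partial transition function can be modeled compactly; the event numbering follows Figure 5, but any event-to-transition assignment beyond what the text states explicitly is an assumption:

/* Block states of STAFF as described in Section 3.1. */
typedef enum { F_STATE, O_STATE, M_STATE, S_STATE, N_STATE } block_state;

/* FTL events e1..e6 (numbering as in Figure 5). */
typedef enum { E1, E2, E3, E4, E5, E6 } ftl_event;

/* Partial transition function delta: Q x Sigma -> Q.  Only transitions
 * described in the text are modeled; all other inputs keep the state. */
block_state next_state(block_state q, ftl_event e) {
    switch (q) {
    case F_STATE: return (e == E1) ? M_STATE : q;  /* first write request         */
    case M_STATE: if (e == E2) return S_STATE;     /* swap merging                */
                  if (e == E3) return N_STATE;     /* assumed: out-of-order write */
                  return q;
    case S_STATE: return (e == E4) ? O_STATE : q;  /* invalidated                 */
    case N_STATE: return (e == E5) ? O_STATE : q;  /* smart merging / invalidated */
    case O_STATE: return (e == E6) ? F_STATE : q;  /* erased                      */
    default:      return q;
    }
}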
3.2 FTL Operation
The basic FTL operations are read and write operations.
Algorithm 1 Write algorithm

 1: Input: Logical sector number (lsn), data to be written
 2: Output: None
 3: Procedure FTL_write(lsn, data)
 4: if merging operation is needed then
 5:   do merging operation;
 6: end if
 7: if the logical block corresponding to the lsn has an M or N state block then
 8:   if the corresponding sector is empty then
 9:     write the input data to the M or N state block;
10:   else
11:     if the block is the M state block then
12:       the block is converted to the N state block;
13:     end if
14:     write the input data to the N state block;
15:   end if
16: else
17:   get an F state block;
18:   the F state block is converted to the M state block;
19:   write the input data to the M state block;
20: end if
Write Algorithm. Algorithm 1 shows the write algorithm of STAFF. The input of the algorithm is the logical sector number and the data to be written. The first operation of the algorithm is checking whether a merging operation is needed. We have two kinds of merging operation: swap and smart merging. The swap merging operation occurs when a write operation is requested to an M state block which has no more space. Figure 6-(a) shows how the swap merging operation is performed; here, we assume that a block is composed of 4 sectors. The M state block is converted to an S state block and one logical block is mapped to two physical blocks. In Figure 5, the event e2 corresponds to the swap merging operation.
Fig. 6. Various write scenarios
The smart merging operation is illustrated in Figure 6-(c). It occurs when a write operation is requested to an N state block which has no more space. In the smart merging operation, a new F state block is allocated and the valid data in the N state block is copied to the F state block, which then becomes an M state block. In Figure 6-(c), since only the data corresponding to lsn 0 is valid, it is copied to the newly allocated block. The smart merging operation is related to the events e5 and e1 in Figure 5. In line 7 of Algorithm 1, if the logical block corresponding to the input logical sector number does not have an M or N state block, this means that the logical block corresponding to the lsn has not been written yet. Thus, a new F state block is allocated and the input data is written to the F state block, which then becomes an M state block (lines 17-19). If the logical block corresponding to the input logical sector number has an M state block or an N state block, the write algorithm checks whether the sector corresponding to the lsn is empty. If the sector is empty, the data corresponding to the lsn is written to it. Otherwise, the data is written to the N state block. In our write algorithm, a logical block can be mapped to at most two physical blocks. Thus, when there is no space, one block is converted to an O state block; Figure 6-(b) shows this scenario. When allocating an F state block (line 17), if there is no free block available, the merging operation is performed explicitly and erase operations may also be needed.

Read Algorithm. Algorithm 2 shows the read algorithm of STAFF. The input of the algorithm is the logical sector number and the data buffer to read into. The data to be read may be stored in an M, N, or S state block. If a logical block is mapped to two physical blocks, the two physical blocks are an S and an M state block or an S and an N state block. In this case, if the input lsn corresponds to both the S state block and the (N or M) state block, the data in the N or M state block is the valid one. If the M or N state block has no data corresponding to the lsn, the data may be stored in an S state block. Thus, the data can be read from the S state block, or an error message is printed. When reading data from the N state block, since the data may be stored at a physical sector offset which is not identical to the logical sector offset, the valid data has to be found. The valid data can be determined according to the write algorithm. The detailed algorithm is omitted because it is trivial.
4 Experimental Evaluation

4.1 Cost Estimation
As mentioned earlier, we can compare FTL algorithms from two points of view: the memory requirement for mapping information and the flash I/O performance. Since STAFF is based on block mapping, it requires little memory for mapping information, as presented in Section 2.2. Compared with the 1:1 block
Algorithm 2 Read algorithm

 1: Input: Logical sector number (lsn), data buffer to read
 2: Output: None
 3: Procedure FTL_read(lsn, data buffer)
 4: if the logical block corresponding to the lsn has an M or N state block then
 5:   if the block is the M state block then
 6:     if the corresponding sector is set then
 7:       read from the M state block;
 8:     else
 9:       if the logical block has an S state block then
10:         read from the S state block;
11:       else
12:         print: "the logical sector has not been written";
13:       end if
14:     end if
15:   else {the block is the N state block}
16:     if the sector is in the N state block then
17:       read from the N state block;
18:     else
19:       if the logical block has an S state block then
20:         read from the S state block;
21:       else
22:         print: "the logical sector has not been written";
23:       end if
24:     end if
25:   end if
26: else
27:   if the logical block has an S state block then
28:     read from the S state block;
29:   else
30:     print: "the logical sector has not been written";
31:   end if
32: end if
mapping technique presented in Section 2.2, STAFF is a hybrid of 1:1 and 1:2 block mapping. Additionally, the N state block has to keep sector mapping information. Regarding the flash I/O performance, the read/write cost can be measured by the following equations:
    Cread  = pM Tr + pN k1 Tr + pS Tr          (where pM + pN + pS = 1)                              (3)

    Cwrite = pfirst (Tf + Tw)
             + (1 - pfirst) [ pmerge {Tm + pe1 Tw + (1 - pe1)(k2 Tr + Tw)}
                            + (1 - pmerge) {pe2 (Tr + Tw) + (1 - pe2)(k3 Tr + Tw) + Tr + pMN Tw} ]   (4)
where 1 ≤ k1, k2, k3 ≤ n. Here, n is the number of sectors within a block. In equation (3), pM, pN, and pS are the probabilities that data is stored in the M, N, and S state block, respectively. In equation (4), pfirst is the probability that the write command is the first write operation to the input logical block, and pmerge is the probability that the write command requires the merging operation. pe1 and pe2 are the probabilities that the input logical sector can be written to its in-place location with and without the merging operation, respectively. Tf is the cost of allocating a free block; it may require merging and erase operations. Tm is the cost of the merging operation. Finally, pMN is the probability that the write operation converts the M state block to the N state block; when this conversion happens, a flash write operation is needed for marking the states. The cost function shows that read and write operations to the N state block require some more flash read operations than those to the M or S state block. However, in flash memory the read cost is very low compared to the write and erase cost. Thus, since Tf and Tm may require the flash erase operation, they are the dominant factors in evaluating overall system performance. STAFF is designed to minimize the Tf and Tm terms that require the erase operation.
4.2 Experimental Result
In the overall flash system architecture presented in Figure 1, we implemented various FTL algorithms and compared them. The physical flash memory layer is simulated by a flash emulator [1] which has the same characteristics as real flash memory. We have compared three FTL algorithms: Mitsubishi FTL [8], SSR [9], and STAFF. The Mitsubishi FTL algorithm is based on the block mapping algorithm presented in Section 2.2, and the SSR algorithm is based on hybrid mapping. We have not implemented the sector mapping algorithm; it is not a realistic FTL algorithm since it requires too much memory. The FAT file system is widely used in embedded systems. Thus, we recorded the access patterns that the FAT file system on the Symbian operating system [10] issues to the block device driver when it receives a 1 MB file write request. These access patterns are very similar to the real workload in embedded applications. Figure 7 shows the total elapsed time. The x axis is the test count and the y axis is the total elapsed time in milliseconds. At first, flash memory is empty, and it becomes occupied as the iteration count increases. The result shows that STAFF has similar performance to hybrid mapping and much better performance than block mapping. Since STAFF requires much less memory than the hybrid mapping technique, it may be used efficiently in embedded applications. Figure 8 shows the erase count. The result is similar to that for the total elapsed time, because the erase count is a dominant factor in overall system performance. In particular, when flash memory is empty, STAFF shows better performance. This is because STAFF delays the erase operation. That is,
Fig. 7. The total elapsed time

Fig. 8. Erase counts
by using the O state block, STAFF delays the erase operation until there is no more space available. If the system provides concurrent operation, the O state blocks can be converted to F state blocks by another process in STAFF. In addition, STAFF shows consistent performance even when flash memory is fully occupied. Figure 9 and Figure 10 show the read and write counts, respectively. STAFF shows reasonable read counts and the best write counts. [4] states that the running time ratio of read (1 sector), write (1 sector), and erase (1 block) is approximately 1:7:63. Thus, STAFF is a reasonable FTL algorithm.
Fig. 9. Read counts

Fig. 10. Write counts
5
Conclusion
In this paper, we propose a novel FTL algorithm called STAFF. The key idea of STAFF is to minimize the number of erase operations by introducing the concept of state transitions for the erase blocks of flash memory. That is, according to the input patterns, the state of an erase block is converted to the appropriate state, which minimizes the erase operations. Additionally, we provided low-cost merging operations: swap and smart merging. Compared to previous work, our cost function and experimental results show that STAFF has reasonable performance and requires fewer resources.
For future work, we plan to generate intensive workloads from real embedded applications. We can then customize our algorithm according to these real workloads. Acknowledgments. The authors wish to thank Jae Sung Jung and Seon Taek Kim for their FTL implementations. We are also grateful to the rest of the Embedded Subsystem Storage group for enabling this research.
References

1. Sunghwan Bae. SONA Programmer's Guide. Technical report, Samsung Electronics, Co., Ltd., 2003.
2. Amir Ban. Flash file system, 1995. United States Patent, no. 5,404,485.
3. Amir Ban. Flash file system optimized for page-mode flash technologies, 1999. United States Patent, no. 5,937,425.
4. Samsung Electronics. NAND flash memory & SmartMedia data book, 2002.
5. Petro Estakhri and Berhanu Iman. Moving sequential sectors within a block of information in a flash memory mass storage architecture, 1999. United States Patent, no. 5,930,815.
6. John E. Hopcroft and Jeffrey D. Ullman. Introduction to Automata Theory, Languages, and Computation. Addison-Wesley Publishing Company, 1979.
7. Jesung Kim, Jong Min Kim, Sam H. Noh, Sang Lyul Min, and Yookun Cho. A space-efficient flash translation layer for CompactFlash systems. IEEE Transactions on Consumer Electronics, 48(2), 2002.
8. Takayuki Shinohara. Flash memory card with block memory address arrangement, 1999. United States Patent, no. 5,905,993.
9. Bum soo Kim and Gui young Lee. Method of driving remapping in flash memory and flash memory architecture suitable therefore, 2002. United States Patent, no. 6,381,176.
10. Symbian. http://www.symbian.com, 2003.
Simultaneously Exploiting Dynamic Voltage Scaling, Execution Time Variations, and Multiple Methods in Energy-Aware Hard Real-Time Scheduling

Markus Ramsauer

Chair of Computer Architecture (Prof. Dr.-Ing. Werner Grass)
University of Passau, Innstrasse 33, 94032 Passau, Germany
[email protected]
Abstract. In this paper we present a novel energy-aware scheduling algorithm that simultaneously exploits three effects to yield energy savings. Savings are achieved by using dynamic voltage scaling (DVS), by exploiting the flexibility provided by slack time, and by dynamically selecting for each task one of several alternative methods that can be used to implement the task. The algorithm is split into two parts. The first part is an off-line optimizer that prepares a conditional scheduling precedence graph with timing conditions defining, for any decision point in time, which branch should be taken given the assumed elapsed time. The second part is an efficient runtime dispatcher that evaluates the timing conditions. This separation of optimization complexity and runtime efficiency allows our algorithm to be used on mobile devices that have only small energy resources and are driven to the edge by the applications running on them, e.g., creating a video on a mobile phone. We show that our approach typically yields more energy savings than worst-case execution time based approaches while still guaranteeing all real-time constraints. Our application model includes periodic non-preemptive tasks with release times, hard deadlines, and data dependencies. Multiple methods having different execution times and energy demands can be specified for each task, and an arbitrary number of processor speeds is supported.
1
Introduction
Today, there is an increasing number of mobile, battery-operated devices, and the applications being run on them become more and more complex. A typical scenario one can think of is the transmission of live video with a multi-media capable mobile phone. The phone may utilize an embedded processor that provides a choice between two speed modes which trade processor speed for energy consumption. This allows for an economically sound usage of the restricted energy provided by the phone's small battery. We provide a scheduling algorithm that exploits the limited processor speed to execute processor-intensive applications while it ensures that all hard real-time constraints are met and as little energy as
possible is used. Mobile devices are usually equipped with small, low-capacity batteries to keep the device itself small and lightweight. Therefore, great attention must be paid to the energy consumption of the device to ensure long periods of operation without recharging the battery. Today, many processors can be configured to run at different speeds by changing the processor's supply voltage. The processor's speed approximately doubles if the voltage is doubled, and the power consumption quadruples. As modern processors are not fully utilized by some real-time applications, this effect can be used to save energy by reducing the supply voltage whenever possible. E.g., if there is enough time to finish a task in half-speed mode, we need only half the energy that would be needed in full-speed mode: the processor needs twice the time to complete the task, but it consumes only one fourth of the full-speed power.

Of course, preserving energy is not the user's main concern. He also wants to fully exploit the mobile device's computational power to run very demanding real-time applications such as live video transmission. These applications include tasks with deadlines and release times, and therefore it is not sufficient to simply execute tasks as fast as possible. Tasks must be scheduled at the right time; e.g., an image cannot be encoded before it is provided by the camera hardware, and it has to be sent before the next image arrives to avoid stuttering. Although these applications may demand the full processor speed in the worst case, they usually do not need the full processor speed all the time in the average case. Slack time of preceding tasks can be used to reduce the speed for succeeding tasks, and thus energy can be saved. Additional energy can be saved by dynamically choosing between methods that implement a task. We allow different implementations to be specified for a task, and at runtime we select the implementation that should be executed. Thereby, energy savings can be achieved even if the processor does not support multiple speed modes, because implementations differ not only in their worst-case execution time but also in their average execution time. We schedule the implementation with the lower expected execution time to save energy if possible. We also use the implementation with the lower worst-case execution time to meet deadlines if timing constraints are tight.

Our scheduling algorithm is customized to schedule demanding real-time applications with release times and hard deadlines on energy-restricted mobile devices. It simultaneously exploits the flexibility provided by the processor speed modes, it uses an explicit model of tasks' slack times, and it allows different implementations to be specified for tasks. We calculate an optimized data structure statically and interpret it dynamically to select the best task, the best processor speed, and the best implementation to schedule. The paper is organized as follows: First, we give some references to related work. Then we introduce our underlying model before we describe our scheduling algorithm. In Section 5, we present benchmark results, and we finish with a conclusion and an outlook in Section 6.
2 Related Work
Energy-aware scheduling is getting increasing attention. An introduction to processor power consumption and processor features can be found in Wolf [1]. Concrete power and energy specifications of processors from Intel and Transmeta are available, e.g., in [2,3]. Unsal and Koren [4] give a survey on current powerand energy-related research. They advocate energy-aware scheduling techniques to cope with several problems in modern system design. Our work addresses several properties of real-time systems they state: real-time systems are “typically battery-operated and therefore have a limited energy budget”, they are “relatively more time-constrained compared to general-purpose systems”, and they are “typically over-designed to ensure that the temporal deadline guarantees are still met even if all tasks take up their worst-case execution time (WCET) to finish”. Additionally, they propose exploiting the differences in the execution time distributions of different implementations (multiple methods) instead of relying on WCET-based analysis. Our design-to-time algorithm [5], which is the basis for the algorithm presented here, has been specifically designed to do so. We will show the potential of exploiting execution time variations and multiple methods to save energy by adapting and applying our algorithm to energy-aware scheduling. Melhem et al. [6] give an overview on, and a classification for power-aware scheduling algorithms. They propose a concept of power management points (PMPs) to reduce the energy consumption of applications on systems with variable processor speeds. Simulated annealing has become a well known technique for very complex combinatorial optimization problems [7]. It has successfully been used for scheduling [5,8,9,10] and it is applied to energy-aware scheduling in this work. We use it to optimize applications that are too complex to be handled within a sufficiently short time by our exhaustive optimizer. Our design-to-time algorithm matches the hard real-time system constraints given above: It exploits DVS-based (Dynamic Voltage Scaling) energy savings, utilizes slack time, and it considers differences in execution time histograms to select energy optimal implementation dynamically. Our dispatcher algorithm fits very well into the PMP concept. We define that a PMP is reached every time a task finishes, and processor speed/voltage can be adjusted at each PMP.
3
Application Model
An application (see Figure 1) is made up of periodic tasks. A task instance may have a relative release time as well as a relative hard deadline, both measured from the beginning of the task instance's period. Additionally, a task can be data-dependent on one or more tasks having the same period duration. Every task can be implemented by alternative methods, and only one of them has to be executed to fulfill the task. The execution time of a method may vary between task instances. As the processor can run at different speeds, we do not specify a method's execution time directly.
Fig. 1. Live Video Application
Instead, we specify the number of instruction units (IUs) that a method needs, and we define an instruction unit to be a certain number of CPU clock cycles (e.g., compare [6]).

Tasks. In our model, a periodic task Ti with period pTi represents an abstract piece of work, e.g., encode, that has to be done regularly. There may be real-time constraints such as a release time RT(Ti) and a hard deadline DL(Ti), both being relative to the beginning of each task instance's period. In our example in Figure 1, the task send has a deadline of 40 ms to ensure the transmission of 25 images per second. A task may be data-dependent on other tasks, e.g., there is a data-dependency DD(encode, send). A task Ti being data-dependent on a task Tj means that the first instance of Ti is data-dependent on the first instance of Tj, the second instance of Ti on the second instance of Tj, etc.

Methods. A method is an implementation of an algorithm that is able to perform a task. The number of instruction units needed by a method to complete varies for several reasons: variation can be caused by user input during runtime, or the complexity of the calculation may depend on the input data. To reflect these variations, we do not restrict our model to the specification of worst-case values; instead, every method's number of instruction units is specified by means of a discrete probability distribution. Figure 2 shows the probability distribution of a JPG encoder (jpg encoder 1) which needs 6 instruction units or less in 90 percent of its executions and 7 to 12 instruction units in the remaining cases. Methods differ in the number of instruction units they need to complete; moreover, the number of instruction units that a method needs depends on the input data.
Simultaneously Exploiting Dynamic Voltage Scaling probability
217
probability
0.8
0.8
0.6
jpg encoder 1
0.6
average IUs = 6.6
average IUs = 8.4
0.4
0.4
0.2
0.2
2
4
6
8
10
12
IUs
jpg encoder 2
2
4
6
8
10
12
IUs
Fig. 2. Probability Distributions of two Methods
first method’s worst-case number of instruction units is higher than the second one’s. Although we can expect a lower energy consumption for the first method, we have to use the second method to guarantee that the deadline is met if the number of remaining instruction units before the deadline is less than 12. Of course, if the number of instruction units is 12 or more, we will prefer the first method to save energy. Processor The processor is responsible for the execution of methods. It can have several speed modes s = (c(s), ei (s), eb (s)) which differ in speed c(s) and in energy consumption per time unit during idle ei (s) and busy time eb (s). Speed is the number of instruction units that are executed per time unit. Due to 2 f (see [11]), power consumption Pdynamic quadruples Pdynamic = CL NSW VDD if the supply voltage VDD is doubled. As the processor’s speed (clock frequency) changes linearly with the supply voltage, speed is proportional to the square value of the power. Thus executing a method at a lower speed saves energy although the method needs more time. We derive the actual execution time et of a method m in mode s as et = IUa · c(s) · timeunit where IUa is the actual number of instruction units that m needs to complete. Our example processor in Figure 1 has two speed modes: full speed (1, 0.4, 4) is a mode that executes one instruction unit per time unit with an energy consumption of 0.4 energy units for one time unit idle time and 4 energy units for one busy time unit. The mode half speed (0.5, 0.1, 1) executes half an instruction unit per time unit with an energy consumption of 0.1 energy units for one time unit idle time and 1 energy units for one busy time unit. Thus, in the first mode one instruction unit needs one unit of time and 4 units of energy and in the second mode one instruction unit needs two units of time and 2 units of energy. This means if we have enough units of time to perform an instruction unit in the slower mode, we can reduce energy consumption by 2 energy units. In our example we define one instruction unit as the number of cpu clock cycles being performed in one millisecond if the cpu is set to its maximum speed, and we define one time unit to be 1ms.
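As a small illustration of this processor model, the following sketch expresses the two example speed modes and the execution time/energy relation; class and field names are hypothetical and not taken from the paper's implementation:

```java
// Illustrative sketch of the processor model above; names are not from the paper's code.
final class SpeedMode {
    final double c;   // speed c(s): instruction units per time unit
    final double ei;  // e_i(s): energy per idle time unit
    final double eb;  // e_b(s): energy per busy time unit
    SpeedMode(double c, double ei, double eb) { this.c = c; this.ei = ei; this.eb = eb; }

    // et = (IU_a / c(s)) * timeunit, with one time unit = 1 ms in the example
    double executionTime(double actualIUs) { return actualIUs / c; }

    // busy energy consumed while executing the given number of instruction units
    double busyEnergy(double actualIUs) { return executionTime(actualIUs) * eb; }
}

class SpeedModeDemo {
    public static void main(String[] args) {
        SpeedMode full = new SpeedMode(1.0, 0.4, 4.0); // full speed (1, 0.4, 4)
        SpeedMode half = new SpeedMode(0.5, 0.1, 1.0); // half speed (0.5, 0.1, 1)
        // One instruction unit costs 1 ms and 4 energy units at full speed, but
        // 2 ms and 2 energy units at half speed: 2 energy units saved if time permits.
        System.out.println(full.busyEnergy(1)); // 4.0
        System.out.println(half.busyEnergy(1)); // 2.0
    }
}
```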
Energy. Our aim is to minimize the expected energy consumption E per hyperperiod. We do not consider the energy consumption of the device's memory: the memory subsystem can dissipate as much as 50% of the total power in embedded systems [12], but the data-memory size is not changed by using multiple methods. Storing multiple methods does increase the size of the instruction memory, but power dissipation does not grow if alternative methods for the same task are stored on different memory banks; only the memory bank holding the chosen method is activated by the dispatcher, while all other banks can be put into sleep mode and do not consume additional energy if only dynamic power is considered [13].
To calculate E we sum up the expected energy consumptions of all task instances within one hyperperiod, which are in fact the energy consumptions of the methods used to perform the task instances. A method's energy consumption depends on its executed number of instruction units, and on the energy consumption and speed of the current processor mode, which determines the actual time needed to complete the method. (We do not consider the effects of cache misses, memory stall cycles, or main memory energy consumption, and we assume averaged power consumptions for the specified speed modes.) Thus, we compute the expected energy consumption E(Ti) of the task instance Ti being performed in processor mode si with method m(Ti) as

E(Ti) = idle(Ti) · e_i(si) + (iu(m(Ti)) / c(si)) · e_b(si),    (1)

where iu(m(Ti)) is the average number of instruction units method m(Ti) needs and idle(Ti) is the time in which the processor is idle. This may be the time between the finishing time of the preceding task instance and the beginning of Ti, or the time until the beginning of the next hyperperiod. The first happens, e.g., if the release time of Ti is later than the finishing time of the preceding task instance; the second is the case if the last task instance of a hyperperiod does not need all the time left before the next hyperperiod starts.

Example Application. In our accompanying hypothetical application (Figure 1), the processor is an embedded two-speed-step processor in a multimedia-capable mobile phone that is equipped with a camera. The user wants to send a video of himself to a friend's mobile phone, and he can insert text information, e.g. date and time, in real time by pressing buttons on the phone. The application is modeled by four tasks: resize (the pictures taken by the camera are too big for transmission), insert text (date and time code), encode (compress each image to save bandwidth), and send. Resize is implemented as a bilinear scaling algorithm that completes in a constant number of instruction units. Insert text is implemented by copying the date and/or time strings into the image and it needs three instruction units per string and one unit for overhead operations. As the user changes the information that has to be inserted during runtime, the method has a non-deterministic
demand of instruction units. JPG-encoding is the most complex task. It has been implemented by two methods which both need a non-deterministic number of instruction units to complete, depending on the input data. The first algorithm has a lower expected but a higher worst-case number of instruction units to complete than the second one. Thus, it is preferable to use the first method to save energy if enough time is available. The number of instruction units needed by the transmit method depends on the amount of data that has to be sent and the actual transmission bandwidth; therefore, this method too needs a non-deterministic number of instruction units to complete. As we want to transmit 25 frames per second to deliver smooth video, the hard deadline for the send task is 40 ms, which in this example is also the period duration of all tasks. Therefore, the hyperperiod is 40 ms, too, and for every task exactly one instance has to be scheduled per hyperperiod. The release time of the resize task is 0 ms, because a period starts when a picture has been taken. Starting with this release time and deadline, we derive the effective release times and deadlines shown in Figure 1. The probability distributions of the methods' numbers of instructions are also shown, and the processor executes one instruction unit per millisecond in full-speed mode and half an instruction unit per millisecond in half-speed mode. The energy consumptions per millisecond for the two modes are 4 energy units in full speed mode and 1 energy unit in half speed mode.
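To make formula (1) concrete, here is a minimal sketch of the expected energy of one task instance; it assumes the SpeedMode class from the earlier listing and hypothetical names:

```java
// Sketch of formula (1); assumes the SpeedMode class from the previous listing.
final class TaskEnergy {
    // iuAvg: average number of instruction units of the chosen method m(Ti);
    // idle: idle time preceding Ti (or preceding the next hyperperiod); si: processor mode.
    static double expected(double idle, double iuAvg, SpeedMode si) {
        return idle * si.ei + (iuAvg / si.c) * si.eb;
    }
}
// Example: jpg encoder 1 (6.6 IUs on average) at half speed (0.5, 0.1, 1) with no idle time:
// TaskEnergy.expected(0, 6.6, half) = (6.6 / 0.5) * 1 = 13.2 energy units.
```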
4 Scheduling Algorithm
Our scheduling algorithm is a two-part approach. First, we calculate an optimal CDAG (conditional directed acyclic graph). The CDAG is a conditional precedence graph. The second part is a very efficient dispatcher that executes task instances according to the information stored in the CDAG. Each node of the CDAG specifies which task instance and method to schedule at which speed for a given condition. A condition is described by the set of instances that still have to be scheduled and the maximum time that is allowed to be consumed by the predecessors. Therefore, the root node of a CDAG contains the task instance, method, and speed to start scheduling with under the initial condition (maximum elapsed time 0, ALL INSTANCES). If a method has finished at a certain time t, the dispatcher branches to the current node's son whose time condition specifies the smallest value that is bigger than t. During optimization, the time conditions of a node's sons are calculated by adding the possible execution times of the specified method to the maximum of the node's time condition and the specified task's release time. Figure 3 shows a possible energy-optimal CDAG for the live video application.

Fig. 3. A CDAG for the Live Video Application

The root node of the CDAG has starting time 0 and all task instances {resize, insert title, encode, send} have to be scheduled. The third row in the root node tells the dispatcher to start scheduling with task resize using method bilinear and to set the processor to speed mode half. The last row shows the expected energy consumption of the sub-CDAG that the node is the root of. This value is used during optimization only (see formula 2 at the end of this section). After the method
bilinear has finished, the dispatcher executes task insert title with method paste text at speed half. Method paste text consumes a non-deterministic number of instruction units. Thus, to select the next node in the CDAG, the dispatcher has to figure out how much time the method needed to complete. If it needed 2 ms or less (half speed!), the dispatcher branches to the left son of the node, which tells it that it is optimal to schedule task encode with method jpg-1 at half speed in this situation. In case of 3 to 8 ms, the dispatcher continues with the middle son. For this condition, scheduling task encode with method jpg-2 at half speed is the best choice to save energy and leave enough time to schedule the last task at half speed. In the third case (9 to 14 ms), method jpg-1 is scheduled at full speed to leave sufficient time for the last task to run at half speed. After finishing the encoding method, there are again only three conditions that may arise, because the nodes that schedule the encode task share sons. (Note that there is no edge from node (time 6, {encode, send}) to node (time 24, {send}), because we do not make any assumptions about the execution time distributions in our model: the probabilities of the specified execution times are used as weights during optimization, and weights have been omitted in Figure 3 for clarity. As we do not know the weight of this edge, because there is no corresponding execution time entry in the distribution of jpg-1, we cannot consider it during optimization. Our CDAG is therefore optimal for methods like paste text which always execute for one of the specified numbers of instruction units; as this is not the case for many methods, we can insert this edge during post-processing to achieve further energy savings with such methods.) All leaf nodes schedule task send with
method transmit at half speed (nodes that schedule the same task and method at the same speed under conditions that differ only in their time are collapsed into one node during post-processing). As soon as the chosen node has finished, the dispatcher starts over with the root node of the CDAG to schedule the next hyperperiod.
Our CDAG optimization algorithm interleaves standard scheduling algorithms with search algorithms such as, e.g., simulated annealing or depth-first search, and with a nodepool (a kind of solution cache). The interleaving of the algorithms is done in a way that avoids pruning of optimal solutions. Such pruning often happens if heuristics are concatenated, because the pruning process does not fully evaluate the investigated solutions, or because it prunes on a preliminary representation of solutions. Our idea is to interleave single steps of the heuristics, instead of concatenating the heuristics, so that the quality (energy consumption) of a solution can be quantified fully. This allows us to avoid pruning optimal solutions. Moreover, our approach is designed to preserve the benefits of combining the heuristics to find good solutions fast. Taking a closer look at our scheduling problem shows that the optimizer has to build solutions for conditions characterized by the set of task instances that have to be scheduled starting at a certain time. We split this problem into deciding which task instance, method, and speed are best scheduled first, and then recursively applying our algorithm to the follow-up conditions given by the remaining task instances and the possible remaining times. To decide which task instance to schedule, we use scheduling heuristics such as, e.g., earliest-deadline-first or rate-monotonic scheduling with backtracking to obtain feasible schedules. The choice of the method and processor speed to perform the selected task instance is made by, e.g., greedily choosing the method and processor mode minimizing the energy consumption of the selected task instance.
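The dispatcher's branching step described above can be pictured with a small sketch; the data structures are hypothetical, since the paper describes the dispatcher only informally:

```java
import java.util.*;

// Hypothetical CDAG node as seen by the dispatcher; the fields follow the textual
// description above, not the actual implementation.
final class CdagNode {
    final String task, method, speed;                       // decision stored in the node
    final TreeMap<Double, CdagNode> sons = new TreeMap<>(); // upper time bound -> follow-up node

    CdagNode(String task, String method, String speed) {
        this.task = task; this.method = method; this.speed = speed;
    }

    // When the scheduled method finishes at time t, branch to the son whose time
    // condition specifies the smallest value that is bigger than t.
    CdagNode next(double t) {
        Map.Entry<Double, CdagNode> e = sons.higherEntry(t);
        return e == null ? null : e.getValue(); // null: condition not covered by the CDAG
    }
}
```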
Fig. 4. A three task instances application (T1, T2, T3, all with period 12; methods M1 through M4 with two-valued instruction-unit distributions; a single CPU running at full speed)
A naive implementation would be to apply the algorithm recursively to optimize the follow-up conditions and to use the scheduling, method, and speed choice heuristics within the recursion. As this would repeatedly solve the same follow-up conditions, it would result in a very bad runtime complexity of O((NMKC)^N)
to find an optimal solution. Here N is the number of instances within one hyperperiod H, M is the number of methods per instance, K is the number of execution times specified per method, and C is the number of speed steps. Our demand-driven approach is an anytime algorithm [14], and therefore it delivers a series of improving CDAGs as optimization progresses. The worst-case complexity to actually verify that an optimal CDAG G has been found remains O(HMKC · 2^N), but G is output much earlier. If we look, e.g., at an application that has three task instances within one hyperperiod (Figure 4), the naive implementation would not recognize identical follow-up conditions. This is the case if the time consumptions of the first two instances sum up to the same value or if only the ordering of the first two instances has been reversed (see Figure 5).

Fig. 5. Two complete solutions with many nodes in common

Our algorithm avoids this problem by separately storing decisions for conditions instead of complete solutions. We use the conditions as keys and store the best known task instance, method, and speed mode to start scheduling with under these conditions. Therefore we can provide a solution for the whole scheduling problem by accessing the information in the node with key (maximum elapsed time 0, ALL INSTANCES) and continuing with the nodes for its follow-up problems. Figure 6 shows the implicit CDAG representation of the two solution trees shown in Figure 5. Only two new nodes have to be generated to get from solution A to solution B.

Fig. 6. Representing two solutions by nodes and implicit links to follow-up nodes

The search for an optimal CDAG can be described as the search for the root node minimizing the sum of the expected energy consumptions of its sons and of its own expected energy consumption, i.e., the energy consumption of the method scheduled in it. Clearly, the optimization could be separated recursively into searching for the optimal nodes for all instance sets containing one task instance less than the node's own set, but the possible points in time are not known in advance. Instead of generating optimal nodes for all points in time and all subsets of ALL INSTANCES, our algorithm generates only nodes for conditions that are requested by a node that is currently being optimized, and the generated nodes are stored for future access. Since the accessed node conditions change gradually during optimization, early generated nodes are less frequently requested
as optimization progresses if, e.g., depth-first search is used. This allows us to delete early generated nodes without serious performance drawbacks if memory resources become scarce.
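The nodepool described above can be thought of as a memoization table keyed by conditions. A rough sketch, with hypothetical types and names:

```java
import java.util.*;

// Sketch of the nodepool: decisions are stored per condition instead of per complete solution.
final class Condition {
    final double maxElapsedTime;          // maximum time consumed by the predecessors
    final Set<String> remainingInstances; // task instances still to be scheduled
    Condition(double t, Set<String> rem) {
        this.maxElapsedTime = t;
        this.remainingInstances = Set.copyOf(rem);
    }
    @Override public boolean equals(Object o) {
        if (!(o instanceof Condition)) return false;
        Condition c = (Condition) o;
        return maxElapsedTime == c.maxElapsedTime
            && remainingInstances.equals(c.remainingInstances);
    }
    @Override public int hashCode() {
        return Objects.hash(maxElapsedTime, remainingInstances);
    }
}

final class Decision {
    final String task, method, speed;     // best known choice under the condition
    double expectedEnergy;                // E of the sub-CDAG rooted at this node
    Decision(String task, String method, String speed) {
        this.task = task; this.method = method; this.speed = speed;
    }
}

// The nodepool itself: best known decision per condition, generated on demand.
final class NodePool {
    private final Map<Condition, Decision> nodes = new HashMap<>();
    Decision get(Condition c) { return nodes.get(c); }
    void put(Condition c, Decision d) { nodes.put(c, d); }
}
```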
Fig. 7. Scheduling Algorithm (the optimizer decides where to change the current solution, the guide decides how to change it, and the nodepool, initialized beforehand, performs the change and yields the final solution)
Figure 7 illustrates our overall scheduling algorithm that calculates an optimized CDAG. First, an initial solution is generated by initializing the nodepool, which can currently be done by using earliest-deadline-first scheduling, either considering only the fastest methods and speed or trying to schedule the slowest speed that is sufficient to meet the task's deadline. (Using only the fastest methods and speed quickly finds a feasible CDAG. Greedily using the slowest speed generates a lower-energy CDAG first, but may result in longer initialization times if timing constraints are tight or if the application is unschedulable, because under tight timing constraints backtracking has to be applied more often, as the greedy selection of methods leaves less time to succeeding tasks.) Starting with such an initial solution (stored implicitly as separate nodes), optimization begins. The optimizer component (e.g., simulated annealing or depth-first search) decides where
to change the current solution, i.e., it decides which node to change. Then the guide (scheduling heuristic) decides how to change the chosen node, i.e., which task instance to schedule first and which method and speed to use. The nodepool performs the change by searching or generating the corresponding node, and by searching or initializing the nodes for the follow-up conditions. To finish the overall optimization step, the energy consumption of the new solution is calculated by summing up the weighted energy consumptions of the contained nodes. If an optimal or sufficiently low-energy root node has been generated, optimization is stopped. Starting with the root node, the nodes are recursively linked to the nodes for their follow-up conditions to obtain the solution CDAG G. With p(n) being the probability of branching to the follow-up node n, the expected energy consumption E(G) of this CDAG can be calculated recursively as

E(G) = E(root(G)) = E(T_root(G)) + Σ_{n ∈ sons(root(G))} p(n) · E(n).    (2)
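Formula (2) is a simple weighted recursion over the CDAG; as a sketch with hypothetical field names, it can be written as:

```java
import java.util.*;

// Sketch of formula (2): expected energy of a (sub-)CDAG; names are hypothetical.
final class EnergyNode {
    double ownEnergy;                      // E(T_root): energy of the method scheduled in this node
    final Map<EnergyNode, Double> pOfSon = new LinkedHashMap<>(); // son -> branch probability p(n)

    double expectedEnergy() {
        double e = ownEnergy;
        for (Map.Entry<EnergyNode, Double> s : pOfSon.entrySet())
            e += s.getValue() * s.getKey().expectedEnergy();
        return e;
    }
}
```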
5 Results
Now we show the energy savings that can be achieved for our example live video application, as well as the results of benchmarks with applications of different complexity.

Live Video Application Energy Savings. The expected energy consumptions in energy units per hyperperiod of our live video application are shown in Table 1. It was scheduled with (DVS) or without (¬DVS) voltage scaling, with (SL) or without (¬SL) slack time, and by implementing only method jpg-1 (M1), only jpg-2 (M2), or jpg-1 and jpg-2 (M12).

Table 1. Energy Consumptions of Live Video Application

        ¬DVS          DVS
        ¬SL    SL     ¬SL    SL
  M1    70.72  70.72  48.04  37.02
  M2    77.20  77.20  42.16  37.86
  M12   70.72  70.72  42.16  34.89

Clearly, exploiting all three effects yields the highest energy savings compared to scheduling statically without voltage scaling and considering only worst-case times and one method. If no voltage scaling is available or if only worst-case execution times are considered, our algorithm automatically selects the best method (see columns one through three). In case of voltage scaling and slack time specifications via probability distributions, it optimally chooses the best processor speed and dynamically selects the method that is best suited for the current slack value. Combining different speed modes, slack time specifications, and multiple methods yields even higher energy savings than specifying just one method. By exploiting all three effects the energy consumption is lowered by 55 percent compared to not exploiting them.

Benchmarks. We optimized 10 applications of different complexities to show the energy savings that can be achieved by combinations of DVS, SL, and multiple methods. The topology (periods, data-dependencies, method WCETs, etc.) of the applications has been generated generically from a set of manually specified tasks and
methods. Every task is implemented by two methods that have execution time distributions containing two values. The smaller number of instruction units has been generated randomly. The average number of instruction units is in the range of 38 to 68 percent of the worst-case value for each method. The numbers of task instances range from 14 to 26. (These applications are rather small, but they are sufficient to demonstrate the impact of DVS, SL, and multiple methods on the achievable energy consumption. We use our simulated annealing optimizer [5] for applications that feature more task instances; here, we applied our exhaustive optimizer to provide exact results that show the possible energy savings.) The processor has two speed modes if DVS is activated, and it always runs at maximum speed if DVS is deactivated. Every application was optimized with variants similar to the ones used for the live video application. Variants M1 and M2 have been replaced by one_a and one_b, because every task is implemented by two methods in the benchmark applications. In variant one_a every task is implemented by only one method. In variant one_b every task is implemented only by the method that had been removed in variant one_a. The minimal worst-case utilization (fastest CPU speed and using the method with the lowest WCET) of the applications is in the range of 0.6 to 0.98, and thus none of the applications is fully schedulable at half speed only. The optimal energy consumptions for the variants are shown in Table 2. The values are normalized to the values in the first column for each application (¬DVS, ¬SL, one_a). In each column the last row shows the averaged values.

Table 2. Energy Consumptions (normalized to ¬DVS & ¬SL & one_a)

        ¬DVS ¬SL            ¬DVS SL             DVS ¬SL             DVS SL
 BM   one_a one_b both    one_a one_b both    one_a one_b both    one_a one_b both
  1   1.00  0.72  0.72    1.00  0.72  0.72    1.00  0.53  0.50    0.62  0.32  0.32
  2   1.00  0.63  0.63    1.00  0.63  0.63    0.98  0.44  0.41    0.66  0.28  0.28
  3   1.00  0.93  0.78    1.00  0.93  0.78    0.86  0.74  0.60    0.55  0.45  0.35
  4   1.00  0.72  0.67    1.00  0.72  0.67    1.00  0.55  0.55    0.62  0.33  0.31
  5   1.00  0.72  0.68    1.00  0.72  0.68    0.97  0.54  0.53    0.68  0.35  0.33
  6   1.00  0.90  0.80    1.00  0.90  0.80    0.81  0.61  0.50    0.46  0.39  0.34
  7   1.00  0.94  0.84    1.00  0.94  0.84    0.87  0.82  0.67    0.54  0.46  0.39
  8   1.00  0.90  0.83    1.00  0.90  0.83    0.87  0.66  0.64    0.53  0.43  0.39
  9   1.00  0.84  0.70    1.00  0.84  0.70    0.25  0.21  0.18    0.25  0.21  0.18
 10   1.00  0.90  0.82    1.00  0.90  0.82    0.89  0.74  0.67    0.59  0.45  0.39
AVG   1.00  0.82  0.74    1.00  0.82  0.74    0.85  0.59  0.52    0.55  0.37  0.33

The averaged values are plotted in Figure 8. The back row shows the energy consumptions for ¬DVS and the front row shows the DVS values; the rightmost group of bars displays the energy consumptions of the multiple-methods approach.

Fig. 8. Averaged Normalized Energy Consumptions of Optimized Variants

Our optimizer (implemented in Java) took 8 seconds to 25 minutes to fully optimize a variant on a Pentium IV 2.8 GHz PC. The results are consistent with the live video results and show that using multiple methods yields energy savings even on systems without voltage scaling capabilities and without considering slack time. If DVS and slack time distributions are additionally available, even more energy is saved.
6 Conclusion
We presented a two-part energy-aware scheduling algorithm for hard real-time, data-dependent, non-preemptive tasks. Our algorithm allows three effects to be combined simultaneously to save energy. Multiple methods that implement the same
task can be specified, the processor's speed is varied by dynamic voltage scaling, and tasks exploit slack time caused by preceding tasks that did not run to their worst-case execution time. Our algorithm separates complex optimization from runtime-efficient dispatching to reduce runtime overhead and yield good system predictability (due to the predictable execution time of the dispatcher). Our approach reduces energy consumption with a very small and predictable runtime overhead while all real-time constraints are still met. We showed that energy savings can be achieved by specifying multiple methods. This is true even on systems that are not capable of voltage scaling and/or if slack time is not specified in the application model. Work on a refined energy model is ongoing: we will refine the energy model by considering the energy and time needed to change processor speed/voltage and to execute the PMPs. To do this, we only need to extend the keys of the CDAG nodes with the processor speed that is set when the node is entered (this equals the speed of the previously scheduled method). As the execution time of our dispatcher is predictable, we can choose to schedule idle time before changing processor speed if the speed is going to be raised, or after changing processor speed if the speed is going to be lowered. This yields additional energy savings, because at the lower speed
the processor consumes less energy during idle time than it does at a higher speed. We utilize the concept of power management points, which is not only suitable for energy reduction at the operating system level; it could furthermore be used in energy-aware compilers, where tasks would be represented by interfaces and methods by interface implementations. The compiler would then insert control flow statements to call the best-suited implementation at runtime. Two different implementations of our dispatcher are feasible: storing the CDAG in a compressed form and implementing the dispatcher as a CDAG interpreter, or automatically transforming the CDAG into a branching structure, which is more efficient for smaller applications.
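As an illustration of the second dispatcher variant, the first levels of the CDAG in Figure 3 might roughly be compiled into a branching structure like the following; this is a hand-written sketch (not generated output), and run() and elapsedMs() stand in for the real dispatcher primitives:

```java
// Hand-written sketch of a generated branching structure for the first levels of the
// CDAG in Figure 3; run() and elapsedMs() are assumed helpers, not the paper's API.
public class BranchingDispatcher {
    enum Speed { FULL, HALF }

    static double clock = 0;                              // elapsed time in the current hyperperiod

    static void run(String task, String method, Speed s) {
        /* set the processor speed, execute the method, advance clock accordingly */
    }
    static double elapsedMs() { return clock; }

    static void dispatchHyperperiod() {
        run("resize", "bilinear", Speed.HALF);            // root node, time 0
        run("insert title", "paste text", Speed.HALF);    // condition 0 < time <= 4
        double t = elapsedMs();
        if (t <= 6)       run("encode", "jpg-1", Speed.HALF);  // 2 < time <= 6
        else if (t <= 12) run("encode", "jpg-2", Speed.HALF);  // 6 < time <= 12
        else              run("encode", "jpg-1", Speed.FULL);  // 12 < time <= 18
        run("send", "transmit", Speed.HALF);              // all leaf nodes: send at half speed
    }
}
```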
References

1. Wolf, W.: Computers as Components: Principles of Embedded Computing System Design. Morgan Kaufman Publishers (2001)
2. Fleischmann, M.: LongRun(TM) Power Management: Dynamic Power Management for Crusoe(TM) Processors. Transmeta Corporation (2001)
3. Intel: Mobile Intel Pentium III Processors: Intel SpeedStep Technology, http://www.intel.com/support/processors/mobile/pentiumiii/ss.htm, Intel Corporation (2003)
4. Unsal, O.S., Koren, I.: System-Level Power-Aware Design Techniques in Real-Time Systems. In: Proceedings of the IEEE, Volume 91 (2003)
5. Ramsauer, M.: Using Simulated Annealing for Hard Real-Time Design-to-Time Scheduling. In: ESA 2003: Proceedings of the 2003 International Conference on Embedded Systems and Applications, Las Vegas, 23.-26. June (2003)
6. Melhem, R., AbouGhazaleh, N., Aydin, H., Mossé, D.: Power Management Points in Power-Aware Real-Time Systems. In: Power Aware Computing, ed. by R. Graybill and R. Melhem, Plenum/Kluwer Publishers (2002)
7. Kirkpatrick, S., Gelatt, C.D., Vecchi, M.P.: Optimization by simulated annealing. Science, Number 4598, 13 May 1983 (1983) 671-680
8. Catoni, O.: Solving scheduling problems by simulated annealing. SIAM Journal on Control and Optimization, Volume 36, Number 5 (1998) 1539-1575
9. Natale, M.D., Stankovic, J.A.: Applicability of simulated annealing methods to real-time scheduling and jitter control. In: IEEE Real-Time Systems Symposium (1995) 190-199
10. Steinhöfel, K.: Stochastic Algorithms in Scheduling Theory. PhD thesis, DISKI 218, infix-Verlag, ISBN 3-89601-218-5 (1999)
11. Lee, Y.H., Krishna, C.M.: Voltage-Clock Scaling for Low Energy Consumption in Fixed-Priority Real-Time Systems. In: Real-Time Systems, Kluwer Academic Publishers (2003) 303-317
12. Levy, R., Crilly, B., Narahari, B., Simha, R.: Memory Issues in Power-aware Design of Embedded Systems: An Overview. In: Second International Workshop on Compiler and Architecture Support for Embedded Systems, CASES'99 (1999)
13. Unsal, O.S., Koren, I., Krishna, C.M.: High-Level Power-Reduction Heuristics in Large Scale Real-Time Systems. In: IEEE International Workshop On Embedded Fault-Tolerant Systems (2000)
14. Zilberstein, S.: Operational Rationality through Compilation of Anytime Algorithms. PhD thesis, Computer Science Division, University of California at Berkeley (1993)
Application Characterization for Wireless Network Power Management

Andreas Weissel, Matthias Faerber, and Frank Bellosa

University of Erlangen, Department of Computer Science 4
{weissel,faerber,bellosa}@cs.fau.de
Abstract. The popular IEEE 802.11 standard defines a power saving mode that keeps the network interface in a low power sleep state and periodically powers it up to synchronize with the base station. The length of the sleep interval, the so-called beacon period, affects two dimensions, namely application performance and energy consumption. The main disadvantage of this power saving policy lies in its static nature: a short beacon period wastes energy due to frequent activations of the interface, while a long beacon period can cause diminished application responsiveness and performance. While the first aspect, reduction of power consumption, has been studied extensively, the implications on application performance have received only little attention. We argue that the tolerable reduction of performance or quality depends on the application and the user. As an example, a beacon period of only 100 ms slows down RPC-based operations like NFS dramatically, while the user will probably not recognize the additional delay when using a web browser. Known power management algorithms, if they address performance at all, guarantee only a system-wide limit on performance degradation without differentiating between application profiles. This work presents an approach to identify on-line the currently running application class by mapping network traffic characteristics to a predefined set of application profiles. We propose a power saving policy which dynamically adapts to the detected application profile, thus identifying the application- and user-specific power/performance trade-off. An implementation of the characterization algorithm is presented and evaluated running several typical applications for mobile devices.
1 Introduction
We present an approach to on-line characterization of applications based solely on information obtained from the network link layer. This information is used to realize dynamic power management for mobile communication which aims at maximizing power savings without degrading application-dependent, user-perceived performance and quality.
1.1 Motivation
For wireless networking, the IEEE 802.11 standard defines two operating modes: the continuously-aware mode CAM, which leaves the interface fully powered, and the power saving mode PSP, which keeps the network interface in a low power sleep mode
and periodically activates it to synchronize with the base station. These periodic synchronizations are called beacons; the length of the sleep interval is the so-called beacon period, with a default value of 100 ms. Some network interfaces support additional power saving modes, e.g. PSPCAM or PSP-adaptive, which automatically switch between the two modes depending on the network traffic. While this mechanism is defined for managed as well as ad-hoc networks, for reasons of simplicity we concentrate our discussion on a configuration with one base station, the access point, communicating with at least one client. The beacon period affects two dimensions, namely performance, by generating network delays, and energy savings. While this policy dramatically reduces the time the network interface has to be fully powered, receiving data is only possible after synchronizing with the base station. Incoming traffic is buffered at the access point and signalled in the "traffic indication map" (TIM) sent at each beacon. If data is waiting, the client activates the network interface and polls the data from the access point. After the transmission, the sleep cycle is established again. If data is buffered at the access point, the client is not aware of the incoming data for up to one whole beacon period. Thus, additional network delays are introduced and user-perceived application performance or quality can be affected. The amount of performance or quality degradation depends not only on the beacon period, but also on the type of application, the send/receive characteristics, and the user's sensitivity and tolerance. This aspect is often neglected by power management algorithms. Typically, only a system-wide performance level is guaranteed without differentiating between different application profiles. As performance or latency is a subjective measure, the leverage of these factors differs from application to application and from user to user. For interactive processes, small delays are usually tolerated depending on the application and the sensitivity of the user. Performance degradation should be completely avoided for jobs that the user wants to finish as fast as possible, e.g. a download operation. Streaming applications like a radio or video player have to provide a certain quality of service which can not be achieved if the additional network delays caused by power management exceed the playtime of the buffered data. We argue that the tolerable reduction of performance or quality depends on the application and the user, and therefore has to be considered by operating system power management policies.
1.2 Contributions of This Work
An approach to on-line characterization of networked applications is presented. Several parameters derived from send/receive statistics are mapped to a predefined set of application profiles. The necessary data is already available in the kernel network stack and the computational overhead to determine the parameters is low. If a profile is identified with enough certainty, an application-specific power management setting is triggered. We examine the characteristics of typical networked applications for mobile devices and identify their power/performance trade-offs. Alternative approaches to characterize applications are outlined and discussed. The application profiles are characterized by several different parameters regarding network traffic, e.g. the proportion of data received to data sent, the average length of idle
or active periods, the deviation of these values, etc. A secure shell (SSH) session, e.g., can be identified by rather small packets together with short active and long passive periods, while an audio stream can be recognized by periodic transmissions, i.e., a small deviation from the average length of periods of inactivity. Information on packet timing gets coarser with longer beacon intervals due to the increasing delays in receiving packets. For the range of beacon intervals considered in this work (up to 500 ms), the chosen parameters are quite robust and show only little variation over different delays. The target parameters for the set of profiles are identified using numerous traces from application runs with different power management settings. The characterization is evaluated with another set of recorded traces and with extensive on-line tests including power management. We show that applications running in isolation are identified correctly. If the user works with two programs that generate network traffic simultaneously, the algorithm either detects no profile at all or identifies both correctly, but switches frequently between the two. The application-dependent power management decision is configurable from user space and can be extended with more sophisticated power management algorithms. We present related work in the following section. Our approach to application characterization is outlined in section 3. Section 4 describes the implementation in detail, followed by an evaluation of the characterization algorithm in section 5.
2 Related Work
2.1 Application Characterization

The operating system can be identified by analyzing the network traffic originating from it [11]. The IP implementation slightly differs from operating system to operating system, depending on RFC interpretation. The TCP SYN packets provide enough information to accurately determine the system. To gather more information about the system and the services available on the network, the packet payload can be analyzed. With tools like ngrep, a regular-expression-based analysis of network traffic is possible. A straightforward and simple approach to identify applications is to use the port number and the protocol (TCP, UDP) from the headers of network packets. Unfortunately, ports can easily be mapped to or tunnelled through other ports. Firewalls often restrict connections to only a few open ports, e.g. port 80 for HTTP and 22 for SSH. To enable networked applications based on other ports to run, tunnelling of connections has become a common technique. A proxy (caching) server outside of the firewall serves not only HTTP requests, but also multimedia streams. Identification using this method is also problematic with applications that use dynamically assigned ports, such as FTP and RPC. For all these cases, the proposed technique can complement the simple method of mapping port numbers to applications. A more sophisticated method is to look at the contents of the packets. By reassembling the packets and by recognizing certain patterns, the application can be identified from the contents of the data stream. This classification can be made without relying on the port numbers. Projects like "l7-filter" [10] classify packets based on patterns in layer
7 (application layer) at the cost of high processor utilization. The overhead of packet introspection can negatively affect power consumption.
2.2 Power Management for Wireless Networking
Stemm et al. [12] and Feeney et al. [6] investigate the energy consumption of wireless network interfaces and different network protocols in detail. Power management policies can be classified into three categories.

2.2.1 Static Protocols

Static protocols use one fixed, system-wide beacon period, time-out value, or inactivity threshold to trigger transitions from active to low power modes with reduced performance. The IEEE 802.11 power management algorithm is a typical representative of this class. Static protocols are often implemented in hardware because they are simple and do not require much storage space or computational effort.

2.2.2 Adaptive Protocols

Dynamic link-layer protocols adapt the beacon period or time-out threshold to the current usage characteristics of the device. These algorithms are often called history-based as they draw upon the observed device utilization of the past. Krashinsky and Balakrishnan present the Bounded Slowdown (BSD) protocol [8]. This power management protocol minimizes energy consumption while guaranteeing that the round trip time (RTT) does not increase by more than a predefined factor p over the RTT without power management. The factor controls the maximum percentage slowdown, defining the trade-off between energy savings and latency. If at time t1 the network interface has not received a response to a request sent at time t0, the interface can switch to sleep mode for a duration of up to p(t1 - t0); i.e., the RTT, which is at least t1 - t0, will not be increased by more than the factor p. Thus, the beacon period is dynamically adapted to the length of the inactivity period. When data is transmitted, the wireless interface is set to continuously-aware mode. As this approach requires only information available at the link layer, it can be implemented in hardware. Chandra examined the energy characteristics of streams of different multimedia formats, namely Microsoft Media, Real, and Quicktime, received by a wireless network interface and under varying network conditions [3]. A simple history-based policy is presented which predicts the length of the next idle phase according to the average of the last idle phases. As Microsoft Media exhibits regular transmission intervals, high energy savings can be achieved using this policy. Chandra and Vahdat [4] propose energy-aware traffic shaping for multimedia streams in order to create predictable transmission intervals. Varying the transmission periods reveals a trade-off between frequent mode transitions and added delays in the multimedia stream reception. Traffic shaping can be performed in the origin server, in the network infrastructure, or in the access point itself. Regular packet arrival times enable client-side mechanisms to effectively utilize the low power sleep state of the wireless interface. The approach of Chandra and Vahdat addresses the trade-off between energy savings and performance, but is limited to streaming applications. As traffic shaping is performed at the server, user-specific preferences can not be taken into account. Application-specific
server-side traffic shaping and client-side power management should add to each other nicely. Several proposals for energy-efficient transport layer protocols can be found in the literature. Bertozzi et al. show that the TCP buffering mechanism can be exploited to increase the energy efficiency of the transport layer with minimal performance overhead [2].

2.2.3 Application-Specific Protocols

These protocols require the support of applications to enable operating system power management. To achieve this, applications typically have to use a certain API to inform the operating system about their intended use of the network interface. Anand et al. [1] propose STPM, self-tuning wireless network power management. STPM considers the time and energy costs of changing power modes. Applications can provide hints to the operating system about the time and volume of data transmission over the network interface. STPM dynamically adapts its power management policy. Kravets and Krishnan propose a power management algorithm which shuts down the wireless interface after a certain period of inactivity and reactivates it periodically [9]. Variations of the algorithm with fixed and variable sleep periods are evaluated. Predictive algorithms are proposed to determine the length of the sleep periods. An application-level interface to the power management protocol allows applications to control the policies used for determining sleep durations. Predictive algorithms can be used to adjust the power management parameters (inactivity time-out and sleep duration) using application-specific strategies. An implementation of a simple adaptive algorithm is presented which responds to communication activity by reducing the sleep duration to 250 ms and to idle periods by doubling the sleep duration up to 5 minutes. Energy-aware adaptation as presented by Flinn [7] is another approach to application-dependent power management. By dynamically modifying their behavior, applications can reduce system power consumption, e.g. to achieve a specific battery lifetime. The operating system monitors the battery status, selects the correct power/quality-of-service trade-off and informs the currently running applications, guiding their adaptation. While application-specific power management strategies address the power/performance trade-off, applications have to be rewritten to support the operating system in its efforts to save energy. Our approach circumvents this drawback by providing the necessary information separately from the applications and, at the same time, allowing not only application-, but also user-specific power management policies.
3 Application Characterization

We examined the network characteristics of several typical applications for mobile systems (laptops and hand-helds):

• Mozilla (web browser)
• Secure Shell (SSH) session
• Network File System (NFS) operations
• FTP download
• Netradio: low bandwidth Real audio stream
• MP3 audio stream
• high bandwidth Real video stream

These applications can roughly be divided into three groups: interactive, foreground applications (web browser and SSH), non-interactive applications where execution time is key (network file system, download), and streaming applications. With interactive applications, the user is sensitive to reduced application responsiveness. Small additional network delays may be tolerated or probably not even recognized. Thus, power management with small beacon intervals would be an option for these applications. Figure 1 shows the energy consumption of a five-minute run of Mozilla under different power management settings. Power management increases the round trip time of network packets by an average of (beacon period / 2). The sensitivity to application responsiveness can not be measured directly; it depends on the individual user.
Fig. 1. Energy consumption of Mozilla and average packet delays under different power management settings
As Anand et al. [1] demonstrate, power management can dramatically increase the execution time and, as a consequence, the energy consumption of NFS. The same applies to a download or copy operation over the network. It is important to identify these applications and switch off power management to avoid wasting energy. Figure 2 shows the runtime of a simple find operation on a directory mounted via NFS. Without power management, the job finished after a few seconds and consumed approximately 4 J. For a beacon period of 100 ms, the runtime increases to over 250 seconds and the energy consumption to 67 J. For all network services based on RPC, as well as copy or download operations, power management should be disabled.

Fig. 2. Energy consumption and performance degradation of NFS

Streaming applications buffer a certain amount of data to reduce the impact of network delays or varying bandwidth. As a consequence, they are insensitive to small delays introduced by power management algorithms. In case of high delays, the quality of the presentation, e.g. the number of frames per second, is reduced, pauses are introduced, or
frames are skipped. In our experiments, a beacon interval of 500 ms can be used without influencing a 200 kbit video stream.

In order to achieve energy savings, not only the power consumption but also the runtime of the task has to be considered. The beacon mechanism reduces the average power consumption but introduces additional delays when receiving packets. This leads to an increased runtime for certain types of networked applications, e.g. RPC-based and copy or download operations. The resulting higher energy consumption can outweigh the savings due to the reduced average power. In contrast to this, for many interactive and streaming applications, the runtime is mainly determined by the user think time or other factors. In these cases, reduced power consumption will lead to reduced energy consumption because the runtime is not or only marginally influenced by the power management algorithm. To sum up, the performance or, more generally, the quality-of-service degradation a user is willing to tolerate depends on the currently running application and the user expectations. Therefore, we propose a power management policy which incorporates user preferences into its decisions. Different application profiles are identified on-line and the user-defined, application-specific power management setting is chosen.

Table 1. Power management settings for parameter training
Table 2. Characterization of different profiles
3.1 Application Profiles

In order to be able to distinguish different application profiles during runtime, we examined several different parameters identifying the network characteristics in a single-user system. At the link layer, information about the number of packets and the size of the packets sent or received is available. Using these values, we constructed the following parameters:

1. average size of packets received (= bytes received / number of packets received)
2. average size of packets sent
3. average length of inactive periods (time intervals with no transmissions)
4. average length of active periods (time intervals with transmissions)
5. traffic volume received
6. traffic volume sent

We collected traces for every application under six different power management settings (see Table 1) and computed the parameters presented above, together with the averages and standard deviations. In addition to that, we experimented with several ratios and combinations of the values. We decided to drop several parameters that showed high deviations or low correlation to the corresponding applications. The final, reduced set consists of the following parameters (a small computation sketch is given after the list):

1. average size of packets received
2. average size of packets sent
3. ratio of average length of inactive to length of active periods
4. ratio of average size of packets received to size of packets sent
5. ratio of traffic volume received to traffic volume sent
6. standard deviation of the average size of packets received
7. standard deviation of the average length of inactive periods
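As an illustration only (class and field names are hypothetical and not taken from the implementation described in Section 4), a few of these parameters could be derived from per-slot traffic counters like this:

```java
import java.util.*;

// Illustrative computation of some profile parameters from per-slot (e.g. 100 ms)
// traffic counters; names and structure are assumptions, not the paper's code.
final class Slot {
    long rxBytes, rxPackets, txBytes, txPackets;
    boolean idle() { return rxPackets + txPackets == 0; }
}

final class TrafficParameters {
    // parameter 1: average size of packets received
    static double avgRxPacketSize(List<Slot> slots) {
        long bytes = 0, pkts = 0;
        for (Slot s : slots) { bytes += s.rxBytes; pkts += s.rxPackets; }
        return pkts == 0 ? 0 : (double) bytes / pkts;
    }

    // parameter 5: ratio of traffic volume received to traffic volume sent
    static double rxTxVolumeRatio(List<Slot> slots) {
        long rx = 0, tx = 0;
        for (Slot s : slots) { rx += s.rxBytes; tx += s.txBytes; }
        return tx == 0 ? 0 : (double) rx / tx;
    }

    // parameter 3: ratio of the average length of inactive periods to active periods
    static double idleToActiveRatio(List<Slot> slots) {
        List<Integer> idleRuns = new ArrayList<>(), activeRuns = new ArrayList<>();
        int run = 0; Boolean idle = null;
        for (Slot s : slots) {
            if (idle == null || s.idle() == idle) { run++; }
            else { (idle ? idleRuns : activeRuns).add(run); run = 1; }
            idle = s.idle();
        }
        if (idle != null) (idle ? idleRuns : activeRuns).add(run);
        double avgIdle = idleRuns.stream().mapToInt(Integer::intValue).average().orElse(0);
        double avgActive = activeRuns.stream().mapToInt(Integer::intValue).average().orElse(0);
        return avgActive == 0 ? 0 : avgIdle / avgActive;
    }
}
```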
Table 2 shows the highest correlations of parameters to application profiles. The numbers in brackets depict the range of parameter values. With some parameters, all profiles can be distinguished while others, e.g. the size of sent packets, can only be used to confirm classifications.
4 Implementation
The implementation of the algorithm for the Linux operating system consists of two parts, the collector module located inside the kernel and the characterization module in user space.

4.1 Collector Module
The kernel part of the system is responsible for retrieving data from the network interface card. The Linux kernel already maintains a data structure that contains statistical data about the traffic for each network interface. Thus, no additional overhead is imposed on the system to obtain the necessary information. The collector module periodically (every 100 ms) retrieves the number of bytes and packets that have been received and transmitted during the last time slot from the kernel structure and passes the statistics to user space via the proc interface.
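The collector itself is a kernel module; the following user-space sketch only approximates what it samples (the interface name and the use of /proc/net/dev are illustrative assumptions, not the paper's implementation):

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.List;

// Rough user-space approximation of the collector: the same per-interface counters
// are visible in /proc/net/dev on Linux.
public class Collector {
    // Returns {rxBytes, rxPackets, txBytes, txPackets} for the given interface, or null.
    static long[] sample(String iface) throws IOException {
        List<String> lines = Files.readAllLines(Paths.get("/proc/net/dev"));
        for (String line : lines) {
            String l = line.trim();
            if (!l.startsWith(iface + ":")) continue;
            String[] f = l.substring(l.indexOf(':') + 1).trim().split("\\s+");
            return new long[] { Long.parseLong(f[0]), Long.parseLong(f[1]),
                                Long.parseLong(f[8]), Long.parseLong(f[9]) };
        }
        return null;
    }

    public static void main(String[] args) throws Exception {
        long[] prev = sample("wlan0");              // hypothetical interface name
        while (true) {
            Thread.sleep(100);                      // 100 ms sampling period as in the paper
            long[] cur = sample("wlan0");
            if (prev != null && cur != null)
                System.out.printf("rx %d B / tx %d B in last slot%n",
                        cur[0] - prev[0], cur[2] - prev[2]);
            prev = cur;
        }
    }
}
```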
4.2 Characterization Module
The user space part of the characterization algorithm performs a mapping of the network statistics to application profiles. To perform the identification, the module maintains tables containing the target values for all parameters and all applications. Periodically, these parameters are extracted from the network traffic statistics and compared to the parameters of the training runs. For every parameter, the application with the minimum difference between the current and the target value is chosen as a candidate. From the set of candidates, an application is only selected if it has the majority, i.e., if at least four of the seven candidates indicate the same application. If no majority is reached, the decision is considered uncertain and the algorithm stops, i.e., the last identification is retained. We measured the overhead of mode transitions of a Cisco Aironet wireless adapter (Table 3).

Table 3. Overhead of mode transitions

It can be seen that changing the beacon interval and changing the operating mode are equally expensive. In order to avoid frequent mode transitions, which would prevent the interface from achieving any energy savings, we introduce a minimum time span t_wait between two mode transitions. In our experiments, this time span was set to 10 seconds.
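A compact sketch of this decision rule follows; the names are hypothetical, and the target tables would be filled from the training traces:

```java
import java.util.*;

// Sketch of the characterization decision rule: per parameter, pick the profile whose
// trained target value is closest; accept a profile only if at least 4 of 7 parameters agree.
public class Characterizer {
    // targets.get(profile)[i] is the trained value of parameter i for that profile
    private final Map<String, double[]> targets;
    private static final int PARAMETERS = 7, MAJORITY = 4;

    Characterizer(Map<String, double[]> targets) { this.targets = targets; }

    // Returns the identified profile, or null if no majority is reached
    // (in that case the previous identification is retained by the caller).
    String classify(double[] current) {
        Map<String, Integer> votes = new HashMap<>();
        for (int i = 0; i < PARAMETERS; i++) {
            String best = null; double bestDiff = Double.MAX_VALUE;
            for (Map.Entry<String, double[]> e : targets.entrySet()) {
                double diff = Math.abs(current[i] - e.getValue()[i]);
                if (diff < bestDiff) { bestDiff = diff; best = e.getKey(); }
            }
            votes.merge(best, 1, Integer::sum);
        }
        for (Map.Entry<String, Integer> v : votes.entrySet())
            if (v.getValue() >= MAJORITY) return v.getKey();
        return null;
    }
}
```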
4.3 Profile Management
The corresponding power management settings for the different application profiles are read from a configuration file or can be specified as command line parameters when running the characterization module. This way, varying preferences or preferences of different users can be taken into account by, e.g., switching to another set of power management settings when a new user logs into the system. A user who wants to change
the currently active settings just needs to invoke the characterization module with the new set of parameters as command line attributes. It is possible to activate and deactivate the characterization module and use another power management algorithm instead, e.g. PSPCAM.
4.4 Future Work
We plan to integrate additional information available on the network and transport layer into our decision rule. Examples are the port number of a network connection, the protocol used (TCP or UDP), and the user id of the receiving or sending process. This information can support a decision based purely on link layer statistics. With the port number, some applications can be identified immediately. However, if the original port is mapped to port 80 due to firewall restrictions, this information is not available. The user id enables a distinction of user-specific application profiles or different power management policies for different users. Statistical data stored in the task structure of the receiving or sending process, e.g. the runtime or priority, could be used to distinguish between interactive and non-interactive, background processes, similar to the approach taken in typical schedulers. We would also like to investigate a combination of application-specific server-side traffic shaping, like the approach presented by Chandra and Vahdat [4], with application-specific power management on the client side.
5 Evaluation

5.1 Data Acquisition
For power measurements we used the Cisco Aironet wireless adapter [5], connected to a notebook via a PC Card extender card from Sycard Technology. The extender card makes it possible to isolate the power buses, so we attached a 4-terminal precision resistor of 50 mOhm to the 5 V supply line. The voltage drop at the sense resistor was measured with an A/D converter with up to 40000 samples per second and a resolution of 256 steps. The maximum voltage drop that is correctly converted is 50 mV. Figure 3 shows the power consumption of different power management settings of the Cisco Aironet card.
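For reference, converting a measured voltage drop into the power drawn on the 5 V line is a one-line calculation; this sketch ignores the negligible voltage lost over the sense resistor itself:

```java
// Converting the voltage drop over the 50 mOhm sense resistor into power on the 5 V line.
final class PowerProbe {
    static double powerWatts(double voltageDropVolts) {
        double current = voltageDropVolts / 0.050; // I = U_drop / R_sense
        return 5.0 * current;                      // P ~= U_supply * I
    }
}
// e.g. the maximum measurable drop of 50 mV corresponds to 1 A, i.e. about 5 W
```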
Fig. 3. Average power consumption of different power management settings
5.2 Offline Application Characterization
We determined parameters for the following application profiles:

• Web browser/text
• Web browser/images
• NFS operations (kernel compile run, find)
• SSH (terminal session)
• FTP download
• Low-bandwidth stream (< 64 kbit, audio, Realplayer and MP3)
• High-bandwidth stream (video)
The two web browser profiles are treated as one application, as are the different types of streams. In order to evaluate the characterization algorithm, we recorded several traces of application runs with two different users. Figure 4 shows the percentage of time the algorithm identifies the correct application, reaches no decision (keeping the correct decision), and performs a wrong classification. The last value is determined by counting identifications of other profiles and the time that follows until the correct application is identified again. "Mozilla 1" is a browser session viewing web pages with many pictures; in "Mozilla 2" mainly text is viewed. SSH with X-forwarding shows a high error rate; in this case, an additional application profile should be created. During the "Mozilla 2" run, SSH was detected several times. As both are interactive applications, they could also be combined into one profile. When combining Mozilla and SSH into one profile (which could be called "Interactive"), the percentage of time this profile is correctly classified is almost 100%. The two NFS sessions represent different file system operations (find & grep and chmod). Different streaming applications and download operations are identified correctly.
Fig. 4. Offline application characterization
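The characterization itself maps a small set of traffic-derived parameters to the predefined profiles. The actual parameter set and decision rule are those described earlier in the paper; the following sketch only illustrates the general idea with invented feature names, invented reference values, and a simple nearest-profile rule, including the "no decision" case shown in Figure 4.

# Illustrative only: the feature names, reference values and the distance-based
# decision rule below are assumptions, not the decision rule of the paper.
PROFILES = {
    "web":         {"pkt_rate": 40.0,  "mean_size": 600.0,  "idle_frac": 0.70},
    "ssh":         {"pkt_rate": 10.0,  "mean_size": 100.0,  "idle_frac": 0.85},
    "nfs":         {"pkt_rate": 300.0, "mean_size": 1200.0, "idle_frac": 0.05},
    "ftp":         {"pkt_rate": 400.0, "mean_size": 1400.0, "idle_frac": 0.02},
    "stream_low":  {"pkt_rate": 25.0,  "mean_size": 500.0,  "idle_frac": 0.40},
    "stream_high": {"pkt_rate": 200.0, "mean_size": 1300.0, "idle_frac": 0.05},
}

def classify(observed, max_distance=1.0):
    """Return the best-matching profile, or None if no profile is close
    enough (the 'no decision' case, which keeps the previous setting)."""
    best, best_dist = None, float("inf")
    for name, ref in PROFILES.items():
        # normalized squared distance over all features of the profile
        dist = sum(((observed[f] - ref[f]) / ref[f]) ** 2 for f in ref)
        if dist < best_dist:
            best, best_dist = name, dist
    return best if best_dist <= max_distance else None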
5.3 Online Application Characterization
In order to be of practical use, our system has to characterize running applications correctly. We performed several online tests with a mix of networked applications and recorded the decisions of the characterization module and the power consumption of the system. Figure 5 shows an 8-minute run of four different programs: Mozilla, an SSH session, find on an NFS-mounted directory, and Netradio (low-bandwidth Real audio stream). Approximately every 2 minutes the user performing the test switched to another application. The characterization module changes the power management setting according to the identified application. For Mozilla, we chose a beacon period of 200 ms, for SSH 100 ms, and for Netradio 500 ms. When running NFS, we activated the continuously-aware mode (CAM). The figure shows the power consumption during the test run together with the detected applications. Mozilla (time stamp 0–136 s) is identified except for a short period between 74 and 88 s. The following two application profiles are recognized correctly (SSH: 136–255 s and NFS: 255–394 s). At time stamp 394 s the user activates the Real audio stream; however, Mozilla is detected. The characterization module needs 11 seconds to correctly identify Netradio. This profile is kept until the end of the test. In order to evaluate the system, we recorded the decisions of the characterization module during several test runs of over 70 minutes. For these tests, we used two different power management settings: CAM mode for NFS and download/copy operations and PSP mode with a beacon period of 100 ms for other application profiles. If the user switches to a different application, the algorithm needs some time to recognize this change. We determined the length of these recognition delays and the time intervals during which wrong decisions are taken, i.e. during which the running application is not correctly detected. In these cases, the user either experiences delays
(if the power management setting is more aggressive than it should be) or energy is wasted (if the power management setting could be more aggressive). We determined an error rate of 6.5% over all test runs, composed of 4.5% wrong classifications and 2% recognition delays.
Fig. 5. Power consumption during a run of four different applications
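The error figures above can be derived from a logged decision trace. One possible way to separate wrong classifications from recognition delays is sketched below; the trace format and the counting rule are our own simplifying assumptions, not the exact procedure used for the numbers reported in the text.

# decisions: time-ordered list of (timestamp_s, detected_profile)
# schedule:  ground truth as (start_s, end_s, actual_profile) entries
def error_breakdown(decisions, schedule):
    wrong = delay = total = 0.0
    for start, end, actual in schedule:
        total += end - start
        span = [(t, p) for t, p in decisions if start <= t < end]
        seen_correct = False
        for i, (t, detected) in enumerate(span):
            t_next = span[i + 1][0] if i + 1 < len(span) else end
            if detected == actual:
                seen_correct = True
            elif seen_correct:
                wrong += t_next - t    # misclassification after the switch was already recognized
            else:
                delay += t_next - t    # recognition delay before the first correct detection
    return wrong / total, delay / total   # fractions of the total run time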
5.4 Running Applications in Parallel
In order to evaluate our characterization algorithm when a mixture of application profiles is observed, we performed tests in which two applications run in parallel. The first scenario is a run of Netradio and Mozilla, with the user browsing the web while listening to the radio. In this case, both applications are recognized, but the algorithm switches frequently between the two (almost every 10 seconds). A solution would be to choose the application profile for which the user is more sensitive to network delays (the application with the smaller beacon interval or with power management disabled). Our second test is an SSH session with Netradio running in the background. This time, the characterization algorithm does not reach a decision for the whole test run.
If no profile is detected for a certain period of time, the algorithm could disable power management, switch to the card's own power management mechanism (PSPCAM), or activate a user-specified default setting.

5.5 Comparison with PSPCAM
In order to compare the energy savings achieved by our characterization approach to the card's own adaptive power management algorithm, PSPCAM, we recorded the power consumption of test runs with several applications under different power management settings. During periods of network activity, PSPCAM stays in CAM mode. After a short period of inactivity (approx. 2 seconds) the card transitions to PSP mode. In contrast to "manual" (de)activation, this transition comes with almost no overhead in energy and time. Table 4 shows the energy consumption of 5-minute test runs. It can be seen that for all applications which are robust to the beacon mechanism, i.e. all except NFS, more energy can be saved in PSP than in PSPCAM mode. As a consequence, a power management algorithm which distinguishes different profiles can achieve higher energy savings than an energy-aware link-layer protocol. When running e.g. Netradio, PSPCAM would unnecessarily deactivate power management during data transfers, although the audio stream works equally well when the beacon mechanism is active.

Table 4. Energy consumption over 5 minutes
6 Conclusion
We argue that power management for mobile communication has to take the application- and user-specific power/performance trade-off into account. This work presents an approach to online characterization of applications, enabling application-specific power management. A small set of parameters based on network traffic is mapped to predefined application profiles. We present an implementation with low computational overhead and evaluate it under several typical applications for mobile devices. Higher energy savings can be achieved by exploiting the tolerable performance degradation of the different profiles than by using the card's own power management mechanism. This approach does not require source code modifications to programs. The user can specify individual power management settings for all application profiles. Thus, not only application- but also user-specific power management can be realized, addressing the power/performance trade-off of the individual user.
References
[1] Anand, M., Nightingale, E.B., and Flinn, J. Self-tuning wireless network power management. In Proceedings of the 9th Annual International Conference on Mobile Computing and Networking (MOBICOM '03) (September 2003).
[2] Bertozzi, D., Raghunathan, A., Benini, L., and Srivaths, R. Transport protocol optimization for energy efficient wireless embedded systems. In Proceedings of the Design, Automation and Test Conference in Europe (DATE '03) (March 2003).
[3] Chandra, S. Wireless network interface energy consumption implications of popular streaming formats. In Proc. Multimedia Computing and Networking (MMCN '02) (January 2002).
[4] Chandra, S., and Vahdat, A. Application-specific network management for energy-aware streaming of popular multimedia formats. In Proceedings of the 2002 USENIX Annual Technical Conference (June 2002).
[5] Cisco Systems, Inc. Cisco Aironet 350 Series Client Adapters, June 2003.
[6] Feeney, L., and Nilsson, M. Investigating the energy consumption of a wireless network interface in an ad hoc networking environment. In Proceedings of IEEE Infocom (April 2001).
[7] Flinn, J., and Satyanarayanan, M. Energy-aware adaptation for mobile applications. In Proceedings of the 17th Symposium on Operating System Principles (SOSP '99) (December 1999).
[8] Krashinsky, R., and Balakrishnan, H. Minimizing energy for wireless web access with bounded slowdown. In Proceedings of the Eighth Annual ACM/IEEE International Conference on Mobile Computing and Networking (MOBICOM 2002) (September 2002).
[9] Kravets, R., and Krishnan, P. Application-driven power management for mobile communication. ACM/URSI/Baltzer Wireless Networks (WINET), special issue of best papers from MobiCom '98 (1998).
[10] Levandoski, J., Sommer, E., and Strait, M. Application layer packet classifier for Linux.
[11] Spitzner, L. Passive fingerprinting. Security Focus (2000).
[12] Stemm, M., and Katz, R.H. Measuring and reducing energy consumption of network interfaces in hand-held devices. IEEE Transactions on Communications E80-B, 8 (1997), 1125-31.
Frame of Interest Approach on Quality of Prediction for Agent-Based Network Monitoring

Stefan Schulz¹, Michael Schulz², and Andreas Tanner¹

¹ Intelligent Networks and Distributed Systems Management, TU-Berlin, EN6, Einsteinufer 17, 10587 Berlin, Germany
{schulz, tanner}@ivs.tu-berlin.de
² accsis Informatik – Forschung und Praxis GmbH, Kühhornshofweg 8, 60320 Frankfurt am Main, Germany
[email protected]
Abstract. We present an approach to compute the quality of prediction for network monitoring. The monitoring is part of a proactive mobile-agent-based management system for network health (magmaNH). To allow prediction of a system's behavior, magmaNH contains prediction services placed on core nodes of a network. To make predictions as precise as possible, a measure and a process have to be defined which make it possible to determine the quality of predictions. This measure of quality enables magmaNH to optimize the prediction services and to become a reliable support system for automated network management.
Keywords. Distributed Network Management, Mobile Agents, Quality of Service, Prediction, Proactiveness, Self-Optimization
1 Introduction
Decentralization, heterogeneity, and mobility have become essential requirements in today's enterprise IT environments. The management of networks, systems, and applications has to encompass and master these requirements. Modern ways of doing business necessitate agile and flexible solutions for managing enterprise IT infrastructures. We claim that mobile agent technology can satisfy these requirements if it is employed in management platforms in a way that recognizes its potential benefits and drawbacks. Clearly, the use of mobile management agents does not automatically guarantee more effective management of distributed systems. It takes a careful and well-thought-out design to exploit the full potential of mobile agent technology. An important advantage of agent-based network management is the ability of agents to decide and act autonomously and, hence, to become responsible for certain network sections. Autonomous actions range from simple notification to active intervention in the system's operations. Obviously, currently gathered data only allows reacting to problems. Our monitoring agents, so-called HealthAgents, are deployed as close as possible to the network elements. When monitoring system variables, the agents are capable of translating time series statistics into a prediction of the likely
evolution of the monitored network parameters. Thus, we try to react proactively to emerging resource problem situations. Of course, the prediction has to be as accurate as possible for this to work effectively. The focus of this paper is on determining the quality of prediction. Section 2 gives an overview of related work. In Section 3 we present the underlying architecture and functionality of our mobile-agent-based management approach. Section 4 presents our approach to measuring the quality of prediction. In Section 5 we describe the evaluation of our quality measure based on an on-site test in a real industrial distributed network environment. Finally, in Section 6 we summarize our findings and give an outlook on future work.
2 Related Work
For current computational learning methods, e.g. PAC¹-learning-like algorithms, quality only appears in the training phase. It is seen as a kind of threshold or criterion to decide upon sufficient learning. Once the algorithm has built the decision tree or artificial neural network (ANN) based on training with samples, i.e., once the learning criterion has been reached, the algorithm is finished. To our knowledge, there is no discussion of how to detect the situation in which the trained algorithm no longer provides accurate predictions. In dynamic software systems such as network monitoring, such a situation will eventually occur due to changes in the network's configuration or the users' behavior (cf. [1]). Applying learning to such a system necessitates periodically measuring the quality of the algorithm in providing correct predictions of management data. In magmaNH, ANNs are applied in network monitoring to predict the behavior of hosts and the network. This way, network managers or HealthAgents are enabled to proactively prevent failures or bottleneck situations. Bad predictions can lead to system faults that cause (at least) financial damage. So far, investigations have mainly addressed the number of samples needed for sufficient learning. In [2], Valiant shows that upper and lower bounds exist on the number of samples, i.e., too many samples will overlearn a system and too few samples will not give enough accuracy, respectively. These bounds promise to establish a good hypothesis in a PAC-learning algorithm. This hypothesis is trained by samples, so that with a given probability (1 − δ) the hypothesis h will fall below a given threshold θ for any data having the same distribution as the samples. That is, h fulfils a predefined quality of prediction. In [3] it is shown that the number of needed samples can be reduced drastically in practical applications. Here, instead of using a specific number of samples, with each new sample it is tested whether the hypothesis can be established. This results in a sequential approach to learning. Investigations on the general verification of reliability in system management can be found in [1]. It provides support for manual verification of a system's quality by automated data collection. The authors give a metric for computing the average fault behavior of a system, e.g. by monitoring system crashes and uptime. Transferred to
¹ PAC: Probably Approximately Correct. A model that provides the ability to learn categorization and approximation of functions by providing samples for training.
prediction, this metric would be sufficient for comparing and analyzing the predictions' reliability in the form of charts, but it could not give enough information for automatically improving a system, as it is not sufficient to detect wrong fault notifications in order to decide upon the quality of a prediction. The approach closest to ours, though not applied to automated measurement, can be found in [4]. Bajić describes a comparison of prediction software for sequence analysis of DNA and proteins. The different measures used to compare predictions are based on the numbers of positive and negative predictions, each again differentiated into true and false variants (cf. Section 4). Each of the measures provides a computed distance of the predictions to the ideal predictor. As a result, Bajić states that nearly every measure computes a different ranking for the prediction software systems tested. It seems obvious that a measure for determining the quality of predictions has to be suited to the application and its environmental context.
3 Mobile Agent Based Management for Network Health
In magmaNH we apply the approach of Management by Delegation (MbD) [5]. In MbD the management application becomes autonomous and self-organizing, and it grows with the managed network instead of being limited by the resources available at some central management node. Moreover, this avoids bottlenecks and central points of failure, since failures and crashes at a local scale only affect certain parts of the management application, leaving the other autonomous parts operational. The result is a flexible, scalable, and robust network management system which fits well into the vision of universal global networking. We make use of the mobile agent platform AMETAS [6] as the basic middleware platform. AMETAS provides distinct properties concerning the security, flexibility, and decoupling of mobile agents. The main feature responsible for these properties is the inherently asynchronous communication between application components (agents and services), based on a mailbox system for secure message exchange. On each device that is to host mobile agents, a runtime daemon process has to be installed. In AMETAS, this process is called a place and provides the execution environment for agents and services.

3.1 The magmaNH Architecture
Figure 1 depicts the management platform we ultimately aim for. It is composed of horizontal and vertical functions. The horizontal functions are essential for a variety of network management tasks. Among several well-known horizontal aspects such as events and support for multiple management protocols, the functions specific to magmaNH are decentralization by means of mobility and proactivity by means of prediction. The vertical functions correspond to the well-known FCAPS management functions (fault, configuration, accounting, performance, and security management, cf. [7]) that are enriched by mobility and prediction. An Integration Platform provides interfaces to dynamically extend horizontal and vertical functions via plug-in mechanisms and integrates the administrator into the management system
via appropriate user interfaces. This architecture is supplemented with development tools like HEAT [8] which can be used by administrative personnel to customize existing functions and create new ones in a simple and intuitive way. Another vital component that is closely related to the development tools is the HealthAgent Repository. This storage facility holds the developed mobile agents so that they can be quickly configured and deployed or adapted in an additional development cycle.
Fig. 1. magmaNH management architecture.
3.2 Basic HealthAgent Functionality
Since mobile magmaNH agents essentially calculate a so-called health function, they are called HealthAgents. They are dynamically placed on a network node that is to be managed or that manages other nodes. Agent placement is triggered manually by a human administrator or automatically by authorized software agents. As access to management data or other system resources is only available via magmaNH services, agents have to check the availability of the required services at a place before operation. Each place can be equipped with generic and specialized services. A generic service would, for example, be an SNMP service allowing agents to access a local SNMP management information base. Specialized services may grant access to other system-specific management functions. At a certain frequency, an agent measures system parameters and combines them into a health function. A health function H may be an arbitrary mathematical function of arbitrary parameters, depending on the task of the agent. H may be formally defined as:
H : R^n → R,  (x_1, ..., x_n) ↦ H(x_1, ..., x_n)
The result of this function is a scalar value that describes a particular aspect of the overall system health. An example of a function that calculates the current traffic of a network interface is the following:
H_traffic = (ifInOctets + ifOutOctets) / (ifSpeed · sysUpTime)
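A minimal sketch of such a health computation is given below; get_value stands for whatever SNMP or system-specific service the agent finds at its place and is not part of the magmaNH API, it is only assumed here for illustration.

def health_traffic(get_value):
    """Traffic health function following the formula above."""
    if_in = get_value("ifInOctets")
    if_out = get_value("ifOutOctets")
    if_speed = get_value("ifSpeed")    # interface capacity
    up_time = get_value("sysUpTime")   # time over which the counters have accumulated
    return (if_in + if_out) / (if_speed * up_time)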
On the basis of this calculation, the agent analyses the scalar value (or a series of values) to decide upon actions to be taken. A simple action can be to transmit the current value to a central management application. In general, the agent provides the administrator with an aggregated view of the network condition. Since only relevant data is transmitted, communication consumes only a small amount of network resources.

3.3 Prediction

As mentioned before, proactiveness is an important issue for system management in order to prevent foreseen problems rather than to react to occurring ones. One way to do so is to build a prediction based on the analysis of historic data. We have studied these questions for some time and have come up with a solution based on Artificial Neural Networks (ANNs) [9]. An ANN is a simple computing model inspired by the structure observed in biological brains. Very simple computing units (neurons) are interconnected by weighted edges (axons). An artificial neuron essentially takes the input values, computes their weighted sum based on the axons' definitions, and filters this sum through a non-linear activation function. The output is a single value which is then propagated through the neuron's output and may serve again as input or as the ANN's output. The attractiveness of this model lies in its architectural simplicity, its highly modular composition, its ability to learn and adapt to different situations, and its ability to solve virtually any classification problem. magmaNH allows HealthAgents to carry ANNs and use them to predict the future development of a health function. The input for a prediction system such as an ANN is a value of a computed health function as described in Section 3.2. Health function values and predicted indices can be stored and used as training data for an ANN. It is possible to generate a forecast that reaches several time steps into the future. This prediction may be sent to a central administrator for manual analysis and it may be fed into the aforementioned agent's decision process. If the current developments at a managed node will eventually lead to a critical system state, the administrator as well as the local HealthAgent may attempt to take preventive actions before the critical state is reached.
4 How to Measure Quality of Prediction
As stated earlier, network monitoring operates in a highly dynamic environment. The accuracy of the predictions made strongly depends on the network's configuration, the installed software systems and their versions, and the network users' behavior. Each of these criteria changes frequently, for example by introducing software updates, extending
the network, or simply with the seasons. In larger networks, the effort for administrative personnel to actually monitor the monitors with respect to their quality would considerably reduce the personnel's productivity. Further, a network administrator usually does not have knowledge about prediction or the training of learning algorithms. Hence, automating the measurement of the quality of prediction and the retraining of prediction services becomes a fundamental requirement. In the following, we present the approach for measuring quality of prediction that is part of magmaNH.

Environment. In the measures, two different kinds of predicted events are considered, each having two variants:
• Positives. A positive describes a fault prediction regarding a specific threshold. The two variants are true positive (TP), which describes a correct prediction of a fault situation, and false positive (FP), which describes a predicted fault that did not occur.
• Negatives. A negative describes a non-fault prediction regarding a specific threshold. The two variants are true negative (TN), which describes a correct prediction of a non-fault situation, and false negative (FN), which describes a predicted non-fault where a fault occurred.

The monitored and predicted values have a restricted value range: they are not lower than 0 and do not exceed the upper bound U. Predicted values at a time t are represented by p(t), measured values at a time t by m(t), and the distance between predicted and measured value at t by d(t). T is the set of times t at which values have been taken:
|T| = TP + FP + TN + FN

4.1 Counting Based Approach

A simple approach is to count all false predictions (FP and FN) in a fixed time interval. The quality of prediction is normalized to (0,...,1). In this approach, the quality of prediction based on counting false predictions (CFP) decreases as the number of false predictions over time increases:
Q_CFP(T) = (|T| − (FP + FN)) / |T| = (TP + TN) / |T|
In [1], a choice of improvements to this measure is described. The major disadvantage of this kind of approach is that the measure does not consider the distances d(t), i.e., critical faults are treated the same way as trivial faults.
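For a single threshold θ, the counting-based measure can be computed as in the following sketch (our notation only; events is the list of (predicted, measured) pairs in the considered interval):

def q_cfp(events, theta):
    """Counting-based quality of prediction: (TP + TN) / |T|."""
    correct = sum(1 for p, m in events if (p > theta) == (m > theta))
    return correct / len(events)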
4.2 Deviation Based Approach
This approach takes the distances between predicted and measured values into account. The distances are normalized to an absolute percentage of deviation D(t):
D(t) = ( |p(t) − m(t)| / max(p(t), m(t)) ) · 100   if p(t) > 0 or m(t) > 0
D(t) = 0   otherwise

Applied to a fixed time interval, the percentage deviation quality of prediction computes to:

Q_PD(T) = ( Σ_{t∈T} D(t) ) / |T|
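A direct transcription of this measure reads as follows (note that, unlike the other measures, it yields an average deviation, so smaller values are better):

def q_pd(events):
    """Average absolute percentage deviation between prediction and measurement."""
    total = 0.0
    for p, m in events:
        if p > 0 or m > 0:
            total += abs(p - m) / max(p, m) * 100.0
        # D(t) = 0 when both values are 0
    return total / len(events)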
In contrast to the approach in Section 4.1, the difference between prediction and reality plays a major role here. The problem is to determine a good T: depending on the number of values sampled in the time interval and the size of the time interval, predictions of high quality, i.e., D(t) ≈ 0, simply blanket high or critical loads. On the other hand, with a very short time interval or a small number of sampled values, critical loads strike hard and may cause the prediction system to fluctuate. Another disadvantage of this approach is that it does not distinguish between positives and negatives. Generally, in the management of networks false positives are considered less fatal than false negatives.
4.3 Frame of Interest Approach

Based on the investigations briefly introduced in the previous sections, we propose the frame of interest approach to build a measure for the quality of prediction. The goal is to handle the dynamic behavior of the ANNs' quality in network monitoring. Our approach is based on the fact that when monitoring a network, attention has to be paid to certain parameters passing critical thresholds. That means that to judge the usability of a certain prognosis, the question to be answered is whether this prognosis reliably predicts these parameters as long as their values lie in a certain frame of interest around the critical thresholds. To achieve this, we generalize the counting-based approach from above by attaching weights to the positives and negatives defined above. As stated before, the choice of an adequate time frame T is important to obtain good quality measures. The frame spans the events (positives and negatives) registered from the last one back to a specified number of events (the frame size). To cope with small deviations, a system-parameter-specific deviation channel is defined. This way, deviations of predicted from measured values are mapped to binary
values, taking specific fluctuations into account. The channel is represented by the function ∆ and the distance ε:
∆(t) = 1   if |p(t) − m(t)| > ε
∆(t) = 0   otherwise

Additionally, a weight is computed for each event, which depends on the kind and variant of the prediction made. The value range of the weights is (0,...,1) and depends on the experience gained with specific system parameters. In the case of a single threshold θ separating two states of the network, the weights w_vk for the four events (positive and negative, each in the two variants true and false) are given by:

ω_θ(t) = w_TP   if p(t) > θ ∧ m(t) > θ
ω_θ(t) = w_FP   if p(t) > θ ∧ m(t) ≤ θ
ω_θ(t) = w_TN   if p(t) ≤ θ ∧ m(t) ≤ θ
ω_θ(t) = w_FN   if p(t) ≤ θ ∧ m(t) > θ
We compute the frame of interest quality of prediction to:
Q_FI(T) = 1 − ( Σ_{t∈T} ω_θ(t) · ∆(t) ) / |T|
Consequently, the following holds:
1 ≥ Q_FI(T) ≥ 1 − max_{t∈T} ω_θ(t) ≥ 0
As a result, we can compare the quality of prediction by defining appropriate thresholds. The closer Q_FI(T) is to 1, the better the quality of the prediction.
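Put together, the frame of interest measure for a single threshold can be computed as sketched below; the weights and channel width shown are the values used later in the evaluation (Section 5), and frame holds the most recent (predicted, measured) pairs.

WEIGHTS = {"TP": 0.1, "FP": 0.3, "TN": 0.01, "FN": 0.9}   # values from the evaluation setting
EPSILON = 0.2                                             # deviation channel width

def event_kind(p, m, theta):
    """Classify one event into TP, FP, TN or FN with respect to theta."""
    if p > theta:
        return "TP" if m > theta else "FP"
    return "TN" if m <= theta else "FN"

def q_fi(frame, theta, weights=WEIGHTS, eps=EPSILON):
    """Frame of interest quality: 1 - sum(omega * delta) / |T|."""
    penalty = 0.0
    for p, m in frame:
        delta = 1 if abs(p - m) > eps else 0   # deviation channel
        penalty += weights[event_kind(p, m, theta)] * delta
    return 1.0 - penalty / len(frame)

A value close to 1 indicates a usable predictor; once the value drops towards the lower bound given above, a retraining of the ANN could be triggered.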
4.4 An Example for Frame of Interest

In network management, usually three states (conditions green, yellow, and red) of a system's component are considered, i.e., the network monitor handles two thresholds: the system will handle the thresholds θ_yellow and θ_red and provides positives and negatives for each threshold. The measure described above can easily be extended to a higher number of thresholds by redefining ∆ and ω_θ. Figure 2 shows an example of ω in the network monitoring domain with three states. This instantiation of ω pays more attention to false predictions. It depends on the distance of the states from each other.
Fig. 2. Instance of ωθ with states green, yellow, and red.
In figure 3, examples are given of the graphs of measured and predicted values without testing the quality of prediction. The yellow and red thresholds and the respective state zones are shown. The important areas are the dark marked areas between the two graphs. At t1, the measured values exceed the yellow threshold, whereas the prediction shows green until t2. Hence, for the range from t1 to t2 the prediction failed to notify about a yellow state. A similar situation appears from t3 to t4, where a red state never gets predicted. The opposite happens from t6 to t7, where the prediction gives a false positive of state yellow.
Fig. 3. Zones for states and graphs of measured (black) and predicted (blue) values.
Finally, figure 4 shows an example of a frame of interest. Within this frame, the measure will encounter a series of false negatives with respect to the yellow state. Hence, the value of Q_FI will drop clearly below 1 and force a retraining of the ANN.
Fig. 4. Frame of Interest
5 Evaluation of Quality Measures
For a first evaluation of the different measures of quality, various system parameters like storage usage, memory usage, and CPU load have been collected. In total, continuous measurements over about 28 days have been performed, taking data every 5 seconds. That is, we created one time series per parameter per day, each containing 17,280 events.

Training and Predictions. To train the ANNs, a reduced set of values has been computed from each series by averaging over about one hundred values at a time. This resulted in about 170 values as training data. Always starting with the same untrained ANN, a trained ANN has been built from each reduced set. Afterwards, clustered by parameter affiliation, each of the time series was presented to all of the ANNs. That is, each ANN produced 28 time series of predicted values, one series for each day.

Frame of Interest Setting. The important factor in the frame of interest approach is the ω-function, which defines the weight for predicted events, as described in Section 4.3. Moreover, a channel is defined by the distance ε. For the evaluation, the following setting has been applied:
w_TP = 0.1, w_FP = 0.3, w_TN = 0.01, w_FN = 0.9, ε = 0.2

Here, the false negatives have the highest weight, followed by the false positives. Correct predictions, either positive or negative, have low weights and contribute only marginally to the quality. The channel covers the imprecision of predictions resulting from the fluctuations caused by the reduced set of training data.

Example Setting. For this paper, we selected the memory usage as an example to verify the measures. As threshold we set θ = 97.3. To measure the quality we apply a time frame of 100 events. We chose the ANN that resulted from the day 2 training set and its prediction for day 5. Graphs for both the measured and predicted values for day 5 are shown in figure 5. For later explanation, we will look at a clip of about 4,000 values in detail.
A full graphical comparison of the qualities computed by each of the approaches can be seen in figure 6.
Fig. 5. Measured (black) and predicted (grey) values for day 5. Prediction based on day 2 ANN. (vertical lines demarcate clip)
Fig. 6. Graphical comparison of the approaches of QCFP (grey), QPD (grey dashed), and QFI (black). (vertical lines demarcate clip)
In general, the figure shows that Q_PD is not a real quality measure but merely shows the deviation of the prediction. This is because it ignores the existence of a threshold separating two states of predictions. The quality here would be a threshold,
which defines the maximum allowed deviation. It would be a valid quality measure if the goal were to allow only very precise predictions, e.g., for mathematical functions. The graph for Q_CFP implies an overall bad quality for the prediction, regardless of what kind the false predictions are. Hence, this measure of quality only allows for black-and-white decisions: as soon as false predictions occur, the quality is bad. This kind of quality measure is valid if all states are equally important. Finally, the graph for Q_FI reflects the influence of the weights. Obviously, for the chosen weights, the quality of prediction is still fair most of the time.
Fig. 7. Clip showing measured (black) and predicted (grey) values for day 5. Prediction based on day 2 ANN. Threshold is shown as horizontal line (black dashed). (vertical lines demarcate sections, black dotted)
Comparison Details. For a detailed look, we chose a clip from event 6,000 to event 10,000. Figure 7 shows a comparison between measured and predicted values for this range and the threshold θ. Except for some peaks, the measured values form 5 sections with respect to the threshold. In general, the prediction seems to follow the measurement smoothly. However, the prediction only forms two such sections with respect to the threshold, and for the first 1,800 events its prediction lies continuously above the threshold. In terms of fault monitoring, the system falsely gives a warning of exceeding the threshold for these events. Much worse, the prediction misses an excess between events 3333 and 3388.
The corresponding clip of the quality measurement can be seen in figure 8, where the graph for Q_PD has been left out. The figure shows that Q_CFP gives a quality of 0 for the first and third sections, where the prediction falsely gives a warning, but has a quality of 0.5 where the prediction falsely misses the excess. In contrast, Q_FI gives a more "natural" relation between the quality for the different sections in relation to the kind of deviation. It gives nearly the same quality for the
important deviation in the fourth section, but only denotes a “warning” level for the first and third section.
Fig. 8. Graphical comparison of the approaches of QCFP (grey) and QFI (black). (vertical lines demarcate sections, black dotted)
Another observation in figure 8 concerns the difference in the second section: while Q_CFP rates the prediction with a quality of 1, Q_FI does not reach this best quality level. This is because of deviations that exceed ε, and true positives are weighted with 0.1. As a comparison, at the end of section three and in section five it is obvious that true negatives have no real influence on the quality.
6 Conclusions
We presented a frame of interest approach for measuring the quality of prediction. The measure is applied in the domain of network monitoring, considering peaks as well as fluctuations in the measurement of system parameters. The kinds and variants of predictions are weighted depending on their importance. Further, we explained its scalability. The features mentioned are necessary for monitoring large systems. Its simplicity allows prediction quality measurement to be integrated into distributed, agent-based network management systems such as magmaNH. We compared three measures for the quality of prediction to study their effectiveness in deciding when to retrain an ANN. To this end, test runs operated on productive hosts, measuring system parameters like CPU load, memory usage, and storage usage. Predictions were made for these parameters and manually compared with the real time series. The evaluation showed the advantages of the frame of interest approach in the area of network monitoring, where different states regarding a threshold are of different importance. Of course, this comparison can only be a first step and will be followed by further evaluation including retraining when bad quality is detected. This way, we will be able to compare the dynamic behavior of the whole system and the success of each measure in long-term prediction. An open topic is to automatically determine good samples for training. Research on this topic will be done after the test runs, to decide whether samples can be taken from the real time series directly or whether they have to be totally or partially independent.
Acknowledgements. This paper is based on research results obtained in the SysteMATech project, which is supported in the Fifth Framework program by the European Commission in the IST initiative and is included in the cluster of projects EUTIST-AMI (www.eutist-ami.org) regarding Agents and Middleware Technologies applied in real industrial environments. We further thank Deutsche Flugsicherung DFS GmbH for great cooperation in testing and evaluation of magmaNH and network monitoring and for test-bed provisioning.
References
1. Murphy, B., Gent, T.: Measuring System and Software Reliability using an Automated Data Collection Process. Quality and Reliability Engineering International, 341-353, 1995.
2. Valiant, L.: A theory of the learnable. Communications of the ACM, 27(11):1134-1142, 1984.
3. Schuurmans, D., Greiner, R.: Practical PAC Learning. Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence (IJCAI-95), 1995.
4. Bajić, V.B.: Comparing the success of different prediction software in sequence analysis: a review. Briefings in Bioinformatics, 1(3):214-228, 2000.
5. Goldszmidt, G.: On Distributed System Management. In Proceedings of the Third IBM/CAS Conference, Toronto, Canada, 1993.
6. Herrmann, K., Zapf, M.: The AMETAS White Paper Series. July 2000. URL: http://www.accsis.com/ametas/docs/white/AllWhite.pdf.
7. Hegering, H.-G., Abeck, S., Neumair, B.: Integrated Management of Networked Systems. Morgan Kaufmann, 1999.
8. Schulz, S., Herrmann, K., Geihs, K.: HEAT - A HealthAgent-Toolbox for Distributed Network Management. In Proceedings of the 10th Annual Workshop of the HP OpenView University Association, Geneva, Switzerland, 2003.
9. Herrmann, K., Geihs, K.: Integrating mobile agents and neural networks for proactive management. In IFIP International Working Conference on Distributed Applications and Interoperable Systems (DAIS 01), Krakow, Poland, 2001. Chapman-Hall.
Bluetooth Scatternet Formation – State of the Art and a New Approach

Markus Augel and Rudi Knorr

Fraunhofer Institute for Communication Systems ESK, Hansastrasse 32, 80686 Munich, Germany
{markus.augel, rudi.knorr}@esk.fraunhofer.de
http://www.esk.fraunhofer.de
Abstract. The emerging radio technology Bluetooth offers configuration- and administration-free ad hoc networking between mobile terminals. A simple so-called Bluetooth piconet consisting of up to eight active devices can easily be formed with products available today. The Bluetooth specification, however, goes a step further and indicates the combination of several piconets into a larger network, a so-called scatternet. But the details of this self-organization are far beyond the scope of the specification - scatternet formation is an active research area. This paper first gives a survey of the state of research and development in the area of scatternet formation. Secondly, a new approach to scatternet formation is presented: formation dependent on the QoS requirements of the applications.
1 Introduction
An evolution is taking place in the area of networking. In the beginning there was a desire to exchange data between stand-alone computers in a convenient way, so they were first interconnected using wires. The logical next step was to add more flexibility to the network structure by omitting the wires and using radio waves to transmit the data. These first wireless networks still required configuration and administration by the user or by an engineer, whereas ad hoc networking aims to overcome these disadvantages and offers wireless communication without any administrative overhead. To cover larger areas, several radio subnetworks have to be coupled together due to the limited range of the radio transmission. The network has to include forwarding nodes to facilitate communication between devices far away from each other over several hops. This evolution can finally lead to the vision of ubiquitous computing: omnipresent tiny "computers" communicating wirelessly with one another, acting as cooperating "smart objects" with situation- and context-based behavior. Bluetooth [3,4,7] was especially designed and developed for ad hoc networking. It not only offers the linking of up to eight devices to a piconet (consisting of a master and up to seven active slaves) but also the interconnection of several piconets to a scatternet. However, the Bluetooth specification gives no details
about this self-organization, and therefore there is a lot of ongoing research in this area. This paper is organized as follows. Sect. 2 gives a short introduction to Bluetooth technology. Sect. 3 points out which requirements a scatternet formation algorithm should fulfill. Sect. 4 gives an overview of the state of research and development in the area of scatternet formation. In Sect. 5 a new approach to scatternet formation is presented: formation dependent on the QoS requirements of the applications.
2 Bluetooth Technology
Bluetooth networks have a twofold topology. In the first stage of the network formation, several devices join into a piconet consisting of exactly one master and up to seven active slaves. The master coordinates the communication inside the piconet by polling the slaves. If slaves have to exchange data, it has to be forwarded by the master. Each piconet offers a gross data rate of 1 Mbit/s. In the second stage of the network formation several piconets join into one larger network, the scatternet. The inter-piconet data exchange is done by bridge nodes which communicate on a time-division basis in several piconets, i.e. a master or a slave in one piconet can join other piconets as a slave and thereby becomes a master/slave bridge or a slave/slave bridge, see Fig. 1. It is the formation algorithm's task to keep masters and bridges from becoming bottlenecks.
Fig. 1. A Bluetooth piconet and a scatternet.
Bluetooth defines four communication phases to be used for formation: inquiry, paging, and the corresponding states inquiry scan and page scan. In the inquiry state, a device searches for other devices in radio range and collects the information necessary for a later connection establishment with the page procedure. Both procedures, inquiry and paging, can only be successful if the designated communication partners perform the corresponding scan procedure at the same time. Activation, duration, and regularity of the four communication phases can be controlled by software and especially by a formation algorithm, which has a strong impact on the resulting network topology. Consider Fig. 2, in which two topologies for four devices each are depicted. In the upper part of the figure
exactly one device is performing inquiry and afterwards pages the found devices and therefore becomes master of the piconet. The three other devices alternate in performing inquiry scan and page scan (abbreviated iscan, pscan). In the lower part of the figure the scenario is only slightly different. The two devices performing inquiry and paging become masters of two different piconets. In order to obtain a connected scatternet, a slave has to act as a bridge node by being shared between the two masters.
Fig. 2. Influencing network topology by controlling communication phases
Fig. 3 indicates the manifold possibilities for building up a scatternet out of nine devices. Four exemplary topologies are depicted. Bhagwat et al. [2] derived by graph theoretical considerations that, with only ten devices in radio range, over one million - mostly inefficient - topologies are possible. A formation algorithm has to find a well-suited, efficient topology with desired properties within this large set of topologies.
Fig. 3. Four possible scatternet topologies for nine Bluetooth devices, three trees and one mesh
3 Formation Algorithm Requirements
A formation algorithm should fulfill several requirements, which are described in the following, in order to cope with the character of Bluetooth ad hoc networks.
First and foremost, a formation algorithm should not depend on a central component, which would be a clear contradiction to the character of ad hoc networks. In ad hoc networks every node may leave on an arbitrary basis. If a central coordinator leaves during formation, the process has to be restarted from the beginning in the worst case. Another possibility is to recognize in time that the coordinator is about to leave and let another node continue its work. Binding the coordinator role to a stationary node restricts the application area to networks in which such a node is present. A better approach is to design a formation algorithm in a decentralized and distributed manner.
A formation algorithm should have low complexity, including time, memory, and message complexity. This refers to the character of the devices used in ad hoc networks, and especially in ubiquitous computing scenarios, which are mostly battery powered and are equipped with only a small amount of memory and low computational power. Clearly the network setup delay should be as small as possible, so that the end-user can use the network quite quickly. The message complexity is directly related to the energy needs of the algorithm.
The QoS requirements of the applications should be considered. As pointed out above, the controlling of the different formation phases can have a strong impact on the resulting network topology. The topology should be set up in such a way that the QoS requirements of the user applications in the network can be met. Another important point is to incorporate the energy reserves and the computational power of the devices involved. For example, it would make no sense to have a headset with minimal energy reserves act as a bridge in several piconets.
Furthermore, the algorithm should not assume that all devices are in radio range. Especially in larger in-house scenarios, this condition does not hold. The algorithm should support mobile devices: it should not only support devices entering and leaving the network but also devices moving within the network during communication. Finally, the algorithm should not require proprietary enhancements of the Bluetooth specification - especially the hardware - in order to ensure realistic and timely practicability.
4 State of the Art in Scatternet Formation
Due to the large number of possible scatternet topologies - most of them being inefficient (bottlenecks) - a formation algorithm to generate a "suitable" topology is needed. Several authors deal with scatternet formation and differ in what they see as "suitable". The following paragraphs give an overview of current research in the area of scatternet formation. There is no consensus in the literature as to whether master/slave bridges should be used in a scatternet or not. This kind of bridge has the disadvantage that all communication in the bridge's own piconet has to be suspended if the bridge is absent and acting as a slave in another piconet. Depending on the number of nodes in the bridge's piconet this may be reasonable or not. Bhagwat et al. [2] state that "such configurations result in poor bandwidth utilization and are therefore undesirable." Kalia et al. [8] derived by simulation that, regarding
average system delays and system throughput, a slave/slave bridge is better than a master/slave bridge. Mišić et al. [15] conclude from a queueing-theoretic analysis that delays caused by a master/slave bridge can be reduced by setting the time interval between bridge exchanges appropriately.

4.1 Tree Topologies
Several authors deal with Bluetooth tree topologies. Tan et al. [23,24] developed the distributed Tree Scatternet Formation algorithm TSF, which assigns master/slave roles to nodes while connecting them in a tree structure in which every node is active in at most two piconets. TSF supports entering and leaving of nodes by healing partitions but does not support mobile nodes. A disadvantage of the algorithm is that throughput is not taken into consideration. Záruba et al. [26] propose a similar approach called Bluetree. To speed up formation, their algorithm also creates several disjoint smaller trees in parallel, which are afterwards joined into one big tree. Each node is active in up to three piconets. The authors assume low node mobility. Sun et al. [21] present an approach - also called Bluetree - aimed at simplifying the routing. The basic idea is to order the nodes in a tree structure consisting of master/slave-bridges sorted by their device addresses. Due to this special structure, no routing tables are needed. The algorithm requires all nodes to be in proximity of each other and generates maximally filled piconets. At each point in time, several nodes can leave or at most one node can enter the network. Node mobility is not considered.

4.2 Mesh Topologies
Law et al. [9,10] propose a randomized distributed algorithm for scatternet formation which assumes that all devices are in radio range of each other. The algorithm is based on the merging of components, which are sets of interconnected devices (single device, piconet, scatternet). Interconnection of piconets is done by shared slaves, which are intended to be active in at most two piconets. The algorithm supports joining of devices but no node mobility. In [10] an outline of how to support leaving devices is given. Salonidis et al. [18] propose the Bluetooth Topology Construction Protocol BTCP. The algorithm requires all nodes to be in radio coverage and assumes a conference scenario without node mobility. BTCP is based on a central coordinator which determines the role of each device using a formula which is only valid for up to 36 nodes. BTCP uses only slave/slave-bridges and limits their degree to two. Wang et al. [25] present the distributed algorithm Bluenet, which is also only suitable for stationary nodes. To ensure that each bridge node has at most one connection to each piconet, a proprietary modification of Bluetooth's paging procedure is proposed. Petrioli and Basagni describe in [17] the BlueMesh algorithm, which is an enhancement of BlueStar [1]. The idea behind BlueMesh is to elect some "well-
suited" nodes to serve as masters. The election is based on a "weight" which each node assigns to itself and which can be based, for example, on battery power. BlueMesh does not support dynamic topologies.

4.3 Ring Topologies
Lin et al. [12] propose to connect all masters in a ring structure interleaved by slave/slave-bridges. Each master can have further slaves which are outside the ring. Foo et al. [5] present a similar ring structure consisting of master/slave-bridges only, which simplifies the routing and offers a large fraction of bandwidth to each device. The presented ring structure suffers from its large diameter of n/2.

4.4 Graph Theory
Several authors model scatternet formation as a graph-theoretical problem. Guerin et al. [6] interpret scatternets as bipartite graphs (with the disjoint partitions masters and slaves, i.e. there are no master/slave-bridges) and investigate centralized algorithms to generate such graphs. To cope with dynamic scenarios and to decentralize the algorithms, further research is necessary. Li [11] and Stojmenović [22] present graph algorithms based on partial Delaunay triangulation and dominating sets, respectively, first constructing a bounded-degree sparse topology and then assigning roles to each node. Both approaches require each node to know its current position.

4.5 Further Approaches
Marsan et al. formulate in [14] scatternet formation as a linear optimization problem whose only goal is to minimize the load on the most congested node in the network. Due to the centralized approach and, in particular, the complexity of the algorithm (NP-complete), it is only suited for stationary environments. Siegemund [20] discusses context-based scatternet formation for sensor networks. The basic idea is to include sensor information in the formation process and to arrange nodes with the same context in the same piconet. For example, nodes which measure the same light irradiance, the same ambient temperature, and the same noise are assumed to be in the same context and to communicate with each other with high probability. Liu et al. [13] propose to set up routes between Bluetooth nodes on demand instead of creating a scatternet containing all nodes, for energy saving reasons. However, the route setup requires inquiry operations at each hop, which are carried out in a sequential manner and introduce a huge formation delay. Zhen et al. [27] suggest a related approach called "blue-star" which lowers the formation delay compared to [13] but also needs to perform several inquiry operations one after another.
4.6 Discussion
In contrast to meshed topologies, tree topologies generally have the disadvantage that a higher effort is needed to compensate for dynamic changes. In a tree there is exactly one path between every two nodes. If only one node on a path fails (moving out of range, power-down, defect), the net will, without countermeasures, at once be divided into two separate trees. A further drawback of a tree topology is the root node, which is very likely to become a bottleneck. Meshed topologies, in contrast, with multi-redundant paths can cope with topology changes more easily, but require more complex routing and are also susceptible to bottlenecks. Rings are a good compromise between tree and mesh topology, especially for a smaller number of nodes. Due to the special structure, the routing can be nearly as simple as in a tree topology. Rings do not suffer as much from bottlenecks as trees or meshes, due to the uniform network structure. But for larger numbers of nodes the network diameter grows too high for delay-sensitive applications.
Independent of the kind of topology, it is important to note that reducing the network diameter - and thereby communication delay - by allowing larger node degrees, i.e. more devices per piconet and more piconets per bridge, has a bad influence on throughput, since the gross data rate of 1 Mbit/s per piconet has to be shared among more devices. However, if the node degrees grow too high, especially the bridge nodes will become bottlenecks with a bad influence on both delay and throughput: the throughput will further decrease and the delay will increase. It should be the formation algorithm's task to find a "good" compromise between delay and throughput. Minimizing the number of piconets for interference reasons is a secondary aspect. According to [28] the interference problem is not so severe that coexisting piconets should be avoided.
Postulating that each node knows its current position is a condition which cannot be met. Equipping each node with a GPS module seems improbable, especially in ubiquitous computing scenarios. The workaround suggested in [11] of estimating positions based on incoming signal strengths is not applicable in most cases, since the corresponding RSSI feature¹ is not mandatory for Bluetooth devices.
5 New Approach: Application-Oriented Formation
Most approaches proposed in the literature aim to optimize several parameters of the scatternet topology while disregarding others, e.g. trying to minimize the average path length while not considering throughput issues; or they have further restrictions which are not feasible in ubiquitous computing scenarios (all nodes in radio range, low node mobility, central coordinator required etc., see Sect. 4). One-sided topology optimization, however, restricts the algorithm's application areas to the ones in which the disregarded parameters are not important. In the authors' opinion, a formation algorithm's goal should not be the one-sided optimization of a few parameters while neglecting others. In fact, the network topology should be controlled from the point of view of the most important factors involved, namely the applications. In other words, there should be a
¹ RSSI = Received Signal Strength Indicator
Bluetooth Scatternet Formation – State of the Art and a New Approach
267
flexible formation algorithm which is able to control and maintain the network topology depending on the QoS requirements of the applications (data rate, delay, jitter, bit and packet error rate). A "good" formation algorithm is not one which generates topologies with the smallest diameter or the highest throughput but, rather, one which fulfills the applications' requirements as well as possible. This issue has not been addressed in detail so far. Pabuwal et al. [16] discuss a Java-based API to deploy scatternet-based applications over Bluetooth. The authors mention the possibility of switching between several different formation algorithms dependent on application requirements, but no information about possible criteria on which this switching can be based is given. With this approach no flexible control of the network topology is possible. Instead of switching between various formation algorithms, it would be better to have only one algorithm which controls the topology in an application-oriented manner. The application areas focused on in this paper are dynamic home and small office/home office (SOHO) in-house networks which can consist of various kinds of devices with different capabilities and QoS requirements. (Wireless applications with huge bandwidth demands, e.g. broadband video transmission or hotspots, are mainly the domain of IEEE 802.11x and are therefore not considered here.) Depending on the devices present in the network, some QoS parameters may be of interest while others are not. If the network only consists of notebooks, file transfers are to be expected and there is a need to set up a network with high throughput. If there is a more heterogeneous device structure, e.g. PDAs, notebooks, a projector, and a printer, there are several different QoS requirements. From the point of view of the printer, a network with high throughput should be set up, whereas delay, jitter, and error rates are less important. Communication between projector and notebook, however, requires low delay and low error rates. In a ubiquitous computing scenario with smart objects, sensors, and actuators, low delay and low error rates are most important whereas throughput is not. In a scenario in which a user wears a headset, for example to communicate with people at the door or to control home appliances by speech, low delay and low jitter are needed; throughput and error rates are less important in this case. These few examples indicate the need for application-oriented scatternet formation. The basic idea of the new approach is to create a parameterized formation algorithm which facilitates the application- and device-dependent control of the formation parameters. Formation parameters are: the node degrees (slaves per master and piconets per bridge), the variance of the node degrees (a high variance indicates a rather star-shaped network with bottlenecks), the bridge type (slave/slave, master/slave), and the activation and duration of the communication phases (inquiry etc., see Sect. 2). As a basic example of how formation parameters influence the network topology, refer to Fig. 4, which shows how controlling the node degrees directly influences the network diameter and average shortest path lengths. Higher node degrees (here no distinction between master and bridge degrees is made) lead to shorter path lengths, as can be expected, and therewith to lower communication
delay (this does not hold if the degree grows too high, see Sect. 4.6). In contrast, a lower master degree offers each piconet member a larger fraction of the bandwidth. This bandwidth fraction can be preserved on a multihop communication link if the degree of the involved bridges is also low, so that not too much bridging overhead is introduced.
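To make the parameter set tangible, the following sketch collects the formation parameters listed above in a single configuration object. It is an illustrative data structure only; all field names, value ranges, and defaults are assumptions made for this sketch and are not taken from the algorithm itself.

from dataclasses import dataclass

@dataclass
class FormationParameters:
    """Illustrative container for the formation parameters named in the text.
    All field names and default values are assumptions for illustration."""
    max_slaves_per_master: int = 4       # node degree of a master
    max_piconets_per_bridge: int = 2     # node degree of a bridge
    max_degree_variance: float = 1.0     # a high variance indicates a star-shaped network
    bridge_type: str = "slave/slave"     # alternatively "master/slave"
    inquiry_enabled: bool = True         # activation of the inquiry phase
    inquiry_duration_s: float = 5.0      # duration of the inquiry phase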
[Plot: network diameter and average shortest path (in hops, roughly 1 to 7 on the y-axis) versus maximum node degree (2 to 8 on the x-axis).]
Fig. 4. Impact of restricted node degrees on diameter and path length. Randomized formation for eight stationary nodes in radio coverage. Each node acts with a probability of 50% as master or as slave (i.e. no master/slave-bridges are considered in this case). Duration of the formation phases is randomized among the intervals given by the Bluetooth specification.
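The following small simulation sketch reproduces the qualitative behavior shown in Fig. 4: it generates random degree-bounded connected topologies over eight nodes and reports the diameter and the average shortest path length per maximum node degree. It is not the simulator used for the figure; the topology model and all numeric choices (number of trials, number of extra links) are simplifying assumptions made purely for illustration.

import random
from collections import deque
from itertools import combinations

def random_degree_bounded_graph(n, max_degree, rng):
    """Grow a random connected graph on n nodes where no node exceeds max_degree."""
    adj = {v: set() for v in range(n)}
    nodes = list(range(n))
    rng.shuffle(nodes)
    # Connect every new node to a random already-connected node with spare degree.
    for i in range(1, n):
        candidates = [u for u in nodes[:i] if len(adj[u]) < max_degree]
        u = rng.choice(candidates)
        adj[u].add(nodes[i])
        adj[nodes[i]].add(u)
    # Add a few extra random links while respecting the degree bound.
    for u, v in rng.sample(list(combinations(range(n), 2)), k=min(10, n * (n - 1) // 2)):
        if v not in adj[u] and len(adj[u]) < max_degree and len(adj[v]) < max_degree:
            adj[u].add(v)
            adj[v].add(u)
    return adj

def bfs_distances(adj, source):
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def diameter_and_avg_path(adj):
    dists = [d for v in adj for d in bfs_distances(adj, v).values() if d > 0]
    return max(dists), sum(dists) / len(dists)

rng = random.Random(1)
for max_degree in range(2, 9):
    samples = [diameter_and_avg_path(random_degree_bounded_graph(8, max_degree, rng))
               for _ in range(200)]
    diam = sum(s[0] for s in samples) / len(samples)
    avg = sum(s[1] for s in samples) / len(samples)
    print(f"max degree {max_degree}: diameter {diam:.2f}, avg shortest path {avg:.2f}")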
5.1 Structure of the Algorithm
The algorithm executed on each device works as follows:
1. Gathering the device's own properties and applications. Device properties such as energy reserves and computational power affect the role the device can play in the scatternet. For example, a node with low computational power cannot be made a bridge between several piconets or a master of several devices; it will therefore leave the inquiry scan and page scan modes once connected. It is assumed that a suitable middleware is available which is capable of reporting which applications are present and what QoS requirements they have.
2. Gathering the class of device of neighboring devices (inquiry). The answer to an inquiry packet (FHS packet) contains the useful class-of-device field, which includes information about the type of the answering device. From the results of steps 1 and 2, conclusions about the expected communication connections and especially their QoS requirements are drawn. For example, if a notebook finds a camera it might be useful to set up a network with high throughput; if several sensors find each other, there is a need to set up a scatternet with low delay properties.
3. Identifying suitable formation parameters to fulfill the QoS requirements and assigning suitable values to them. This is the most important part of the algorithm and is discussed in Sect. 5.2 below.
4. Formation. This step includes paging as well as the transmission of new control messages to devices already connected, e.g. a node with a high degree stops paging and instructs a neighboring node with a low degree to start paging instead. In a heterogeneous scenario each device may try to influence the topology in a different way, depending on its QoS requirements. The resulting network will be a compromise which fulfills the requirements best. It may include domains with different local topology structures as needed.
5. Optimizing the topology and maintenance. Further parameters which cannot be known before a connection is established (e.g. link quality) are included in the formation process. Maintenance includes support for mobile nodes. For example, a node leaving its link partner (which is identified by decreasing link quality) will be handed over to a neighbor of its partner which best fulfills the QoS requirements.
5.2 Determining Suitable Formation Parameters
The core task in application-oriented scatternet formation is to establish a relation between the QoS requirements of the applications and the QoS properties of the network. This is done in two steps. First, the relation between the formation parameters and the properties of the network is derived by simulation. Secondly, the relation between the applications' needs and the formation parameters has to be modeled. This can be done using decision theory, which is a part of operations research (economics) and deals with the formal modeling of decision problems. Decision theory distinguishes between actions a_i, which can be affected by a decision maker, and states s_j, which cannot be affected. The decision for an action and the occurrence of a state lead to an effect e_ij ∈ E, where E is the set of effects, as illustrated in Fig. 5. Each effect is rated with a utility function u, i.e. the utility function assigns a real number to each possible effect, u : E → ℝ (see Fig. 5). Finally, decision rules aid in choosing an action with respect to its utility. Fig. 6 shows how decision theory can be applied to scatternet formation, where the (abstract) decision maker is the formation algorithm. In this context an action is a decision to assign specific values to specific formation parameters. This is illustrated here by "node degree = low" and "node degree = high" to keep the example simple; in reality this would be something like "slave degree = 2". The effects (in terms of decision theory) of these actions are the QoS properties of the network; here delay and bandwidth are singled out. The relation between node degrees, delay, and bandwidth has already been discussed in Sect. 4.6; the detailed relationship has to be investigated by simulation. In the further columns of the table, the utilities for three exemplary applications (states in terms of decision theory) are depicted.
Fig. 5. An effect matrix (left) and a decision matrix (right). In the effect matrix, the rows are the actions a1, ..., an (which the decision maker can influence), the columns are the states s1, ..., sm (no influence), and the entries are the effects e11, ..., enm. The decision matrix has the same rows and columns, with the utilities u11, ..., unm as entries.
For example, because high delay is bad for voice applications, this combination of action and state is rated with a low utility. In this example the decision rule is simply to maximize the utility. If a node, or rather the formation algorithm on this node, has acquired the information that itself and a neighboring device are capable of voice communication, it looks for the maximum utility for voice applications and performs the action related to this utility.

Values of formation parameters | Effect                                  | Utility (voice / video / Internet)
node degree = low              | high delay, high fraction of bandwidth  | utility 1 / utility 2 / utility 3
node degree = high             | low delay, low fraction of bandwidth    | utility 5 / utility 3 / utility 2
Fig. 6. Applying decision theory to scatternet formation (combined depiction of effect and decision matrix). Low numbers denote low utility. Bottlenecks are not considered here to keep the example simple.
The mapping between formation parameters and QoS properties is done before the formation takes place, i.e. all devices are equipped with simple tables in which the appropriate values are looked up as needed during formation. All complex calculations are carried out independently of the formation so that they cannot cause significant formation delays. For unknown applications several classes of QoS properties are provided. During formation an unknown application is mapped to the class which fits best.
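As an illustration of this table-driven decision step, the following sketch encodes the combined effect/decision matrix of Fig. 6 and applies the maximize-utility rule for a given set of detected applications. The utility numbers are those of the example in Fig. 6; everything else (the function name, the simplification of summing utilities over several applications) is an assumption made for illustration and not part of the proposed algorithm.

# Actions: values assigned to the formation parameter "node degree".
# For each action, Fig. 6 lists the resulting network effect and the utility
# of that effect for three exemplary application classes (the "states").
DECISION_MATRIX = {
    "node degree = low":  {"effect": "high delay, high fraction of bandwidth",
                           "utility": {"voice": 1, "video": 2, "internet": 3}},
    "node degree = high": {"effect": "low delay, low fraction of bandwidth",
                           "utility": {"voice": 5, "video": 3, "internet": 2}},
}

def choose_action(detected_applications):
    """Decision rule: pick the action with the maximum total utility for the
    applications detected on this node and its neighbors (summing utilities
    over several applications is a simplification made for this sketch)."""
    def total_utility(action):
        utilities = DECISION_MATRIX[action]["utility"]
        return sum(utilities.get(app, 0) for app in detected_applications)
    return max(DECISION_MATRIX, key=total_utility)

# Example: a node learns that it and a neighboring device support voice.
print(choose_action({"voice"}))     # -> node degree = high (low delay)
print(choose_action({"internet"}))  # -> node degree = low  (larger bandwidth fraction)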
6 Conclusion and Outlook
In this paper a new approach to facilitate scatternet formation depending on the QoS requirements of the applications was presented. Its designated scope of application is home and SOHO in-house communication. The aim of the new approach is not the isolated optimization of a few network parameters such as delay or throughput, but a flexible control of the network's topology to fit the applications' needs as well as possible. The advantage of the presented approach is its broad application area. Unlike other approaches it does not focus on a few parameters while neglecting others and does not perform a one-sided topology optimization. Rather, it facilitates a flexible control of the network's topology as needed. If there are applications with strongly contradicting QoS requirements (e.g. video and real-time sensor data), the new approach may lead to a bad compromise; in such a case it may be better to separate the communication into disjoint scatternets. Further research has to be carried out. First, the relation between the formation parameters and the QoS properties of the network has to be analyzed in depth. Secondly, the decision-theoretic definition of the utility function (to rate the utility of each QoS property for different applications' needs) and the definition of decision rules are subjects of ongoing research.
References
1. Basagni, S., Petrioli, C.: Multihop Scatternet Formation for Bluetooth Networks. IEEE Vehicular Technology Conference (2002)
2. Bhagwat, P., Rao, S.P.: On the Characterization of Bluetooth Scatternet Topologies. Submitted for publication
3. Bluetooth Special Interest Group: Specification of the Bluetooth System, Version 1.1 (2001)
4. Bray, J., Sturman, C.F.: Bluetooth 1.1, Connect without Cables. Prentice Hall (2001)
5. Foo, C.-C., Chua, K.-C.: Bluerings - Bluetooth Scatternets with Ring Structures. IASTED Wireless and Optical Communications (2002)
6. Guerin, R., Sarkar, S., Vergetis, E.: Forming Connected Topologies in Bluetooth Adhoc Networks. University of Pennsylvania, Technical Report (2002)
7. Haartsen, J.C.: The Bluetooth radio system. IEEE Personal Communications Magazine, Vol. 7, No. 1 (2000)
8. Kalia, M., Garg, S., Shorey, R.: Scatternet Structure and Inter-Piconet Communication in the Bluetooth System. IEEE National Conference on Communications (2000)
9. Law, C., Mehta, A.K., Siu, K.-Y.: Performance of a New Scatternet Formation Protocol. ACM Symposium on Mobile Ad Hoc Networking and Computing (2001)
10. Law, C., Mehta, A.K., Siu, K.-Y.: A New Bluetooth Scatternet Formation Protocol. ACM Mobile Networks and Applications Journal 8(5) (2002)
11. Li, X.-Y., Stojmenović, I.: Partial Delaunay Triangulation and Degree Limited Localized Bluetooth Scatternet Formation. Proc. Ad-hoc Networks and Wireless, Fields Institute (2002)
12. Lin, T.-Y., Tseng, Y.-C., Chang, K.-M., Tu, C.-L.: Formation, Routing, and Maintenance Protocols for the BlueRing Scatternet of Bluetooths. 36th Hawaii International Conference on System Sciences (2003)
13. Liu, Y., Lee, M.J., Saadawi, T.N.: A Bluetooth Scatternet-Route Structure for Multihop Ad Hoc Networks. IEEE Journal on Selected Areas in Communications, Vol. 21, No. 2 (2003)
14. Marsan, M.A., Chiasserini, C.F., Nucci, A., Carello, G., De Giovanni, L.: Optimizing the Topology of Bluetooth Wireless Personal Area Networks. IEEE Infocom (2002)
15. Mišić, V.B., Mišić, J.: Bluetooth Scatternet With a Master/Slave Bridge: A Queuing Theoretic Analysis. IEEE Globecom (2002)
16. Pabuwal, N., Jain, N., Jain, B.N.: An Architectural Framework to Deploy Scatternet-based Applications over Bluetooth. IEEE International Conference on Communications (2003)
17. Petrioli, C., Basagni, S.: Degree-Constrained Multihop Scatternet Formation for Bluetooth Networks. IEEE Globecom (2002)
18. Salonidis, T., Bhagwat, P., Tassiulas, L., LaMaire, R.: Distributed Topology Construction of Bluetooth Personal Area Networks. IEEE Infocom (2001)
19. Siegemund, F., Rohs, M.: Rendezvous Layer Protocols for Bluetooth-Enabled Smart Devices. To be published: ACM Journal for Personal and Ubiquitous Computing (2003)
20. Siegemund, F.: Kontextbasierte Bluetooth-Scatternetz-Formierung in ubiquitären Systemen. First German Workshop on Mobile Ad hoc Networks (2002)
21. Sun, M.-T., Chang, C.-K., Lai, T.-H.: A Self-Routing Topology For Bluetooth Scatternets. International Symposium on Parallel Architectures, Algorithms and Networks (2002)
22. Stojmenović, I.: Dominating set based Bluetooth scatternet formation with localized maintenance. IEEE Int. Parallel and Distributed Processing Symposium and Workshops (2002)
23. Tan, G., Miu, A., Guttag, J., Balakrishnan, H.: Forming Scatternets from Bluetooth Personal Area Networks. MIT Technical Report, MIT-LCS-TR-826 (2001)
24. Tan, G., Miu, A., Guttag, J., Balakrishnan, H.: An Efficient Scatternet Formation Algorithm for Dynamic Environments. IASTED Communications and Computer Networks (2002)
25. Wang, Z., Thomas, R.J., Haas, Z.: Bluenet - a New Scatternet Formation Scheme. 35th Hawaii International Conference on System Sciences (2002)
26. Záruba, G.V., Basagni, S., Chlamtac, I.: Bluetrees - Scatternet Formation to Enable Bluetooth-Based Ad Hoc Networks. IEEE International Conference on Communications (2001)
27. Zhen, B., Park, J., Kim, Y.: Scatternet Formation of Bluetooth Ad Hoc Networks. 36th Hawaii International Conference on System Sciences (2003)
28. Zürbes, S.: Considerations on link and system throughput of Bluetooth networks. 11th IEEE International Symposium on Personal, Indoor and Mobile Radio Communications (2000)
A Note on Certificate Path Verification in Next Generation Mobile Communications Matthias Enzmann, Elli Giessler, Michael Haisch, Brian Hunter, Mohammad Ilyas, and Markus Schneider Fraunhofer Institute for Secure Telecooperation (FhG-SIT), Rheinstr. 75, 64295 Darmstadt, Germany {firstname.lastname}@sit.fraunhofer.de
Abstract. Certificate-based authentication of parties provides a powerful means for verifying claimed identities, since communicating partners do not have to exchange secrets in advance for authentication. This is especially valuable for roaming scenarios in mobile communications. When dealing with certificates, one must cope with the verification of complete certificate paths for security reasons. In mobile communications, there exist special conditions for this verification work: mobile devices may have limited capacity for computation, and mobile communication links may have limited bandwidth. In this paper, we propose to apply PKI servers, such as the one implemented at FhG-SIT, that allow the delegation of certificate path validation in order to speed up verification. Furthermore, we propose a special structure for PKI components and specific cooperation models that force certificate paths to be short. Additionally, we deal with the problem of users who do not have Internet access during the authentication phase. We explain how we have solved this problem and show a gap in existing standards.
1 Introduction
With the ongoing development in the area of mobile communication technologies we are approaching the next generation mobile Internet step by step. However, there are several security problems, e.g., confidentiality, access control, and entity authentication, that affect roaming users. Certificate-based authentication provides a powerful means for communicating parties to verify claimed identities without the necessity of distributing shared secrets beforehand. Thus, certificate-based authentication is attractive to support roaming in mobile communications. Furthermore, the concept of digital signatures, which requires the application of certificates, allows the introduction of complex business models. Strong authentication is useful for purposes of authorization and accounting. During public key based authentication protocols several cryptographic computations must be performed by both parties. However, in addition to the computations directly related to the protocol, the verifying parties must validate the certificate that belongs to the corresponding public key of the communication partner. We consider X.509v3 conformant certificates. In practice, the
verifier must verify the correctness (i.e., verification of the certificate signature and validity time) and revocation status of not one certificate but rather every certificate in a certificate path. In order to construct and verify this path, the verifier fetches the certificates of the issuing CAs up to the trust anchor. In addition, up-to-date revocation information must be retrieved for every certificate. A trust anchor is a public key —with its associated certificate— that is trusted by the verifier. If the verifier trusts a CA’s public key and the certificate then no further path construction and verification is necessary beyond this certificate. A party defines its own trust anchor set containing all of its trust anchors. In this work, we propose to apply PKI servers for certificate-based authentication in order to make certificate path validation more efficient. At the Fraunhofer Institute for Secure Telecooperation, such a PKI server has been developed. The delegation of validation work to PKI servers reduces the time for path validation and retrieval costs. This is very valuable for mobile devices with restricted capacity and beneficial in low bandwidth mobile networks. Furthermore, we propose a specific structure regarding the location of PKI components and specific cooperation models that restrict the length of certificate paths to a small value. Short certificate paths reduce the time needed for certificate path verification. We describe our ideas for Internet Service Provider (ISP) roaming models that are supported by a Roaming Service Provider (RSP) that allows the efficient introduction of roaming agreements. The applicability is not restricted to this specific roaming model. It can also be applied to other mobile communication contexts, e.g., involvement of WLAN providers. Furthermore, we propose how PKI servers can support mobile users who do not have Internet access yet and need to validate an ISP certificate. Unfortunately, there is no standard available that solves this problem. We explain how we have solved this problem by using a new extension for TLS [4,5] and modifying the semantics of the Online Certificate Status Protocol (OCSP) [13]. However, a standard is needed.
2 Requirements for Certificate-Based Authentication in Mobile Contexts
The verification of public key certificates can become very complex, as already pointed out in [3]. Today's existing standard applications, e.g., web browsers, are still far from supporting the processing of arbitrary certificate paths. The situation becomes even more difficult when we focus on mobile communications. In the following, we state specific problems that do not exist in the non-mobile context.
1. Limited capacity. Mobile devices may be limited in computational capacity. This means that not all devices can necessarily carry out heavy computational work within a reasonable period of time. Thus, the construction and verification of longer certificate paths is not possible for them.
2. Limited bandwidth. Potentially, mobile devices can only have network connections with rather small bandwidth. Thus, the amount of data to be transferred for certificate verification should be small.
3. Demanding processes. There are processes that have special requirements regarding the maximum time for the complete authentication process. One of these is handover. Before a handover is completed, the mobile node and the attendant must authenticate each other. Since the quality of the connection should not be affected too much by the handover, the time that is necessary for the verification of certificates should be as short as possible.
4. Verification without network access. When a mobile node establishes an Internet connection to an attendant without having an existing Internet connection to another attendant, the mobile node cannot carry out the work that is necessary to verify the attendant's certificate, which is presented to it within the authentication protocol. In order to verify the certificate, the mobile node requires information to construct the certificate path and to check the status of the certificates. One cannot assume that the mobile node has this information locally available.
The idea of this work is to apply PKI servers (PKIS) that support the authenticating parties either by providing them with relevant information (e.g., certificate status information) or by offering them complete certificate verification services (e.g., path construction, verification of certificate correctness, status information evaluation). The services of a PKIS are requested when carrying out the authentication protocol. When the mobile node does not have Internet access, it is possible that it asks the attendant to delegate some desired verification work to a selected PKIS on its behalf. The same principle is already described in RFC 3546 [4], where the intention is to ask the authenticating party, from within the TLS handshake protocol, to provide the verifier with an OCSP response covering its own certificate, generated by a trusted responder [13].
3 Framework for Mobile User Roaming
In this work, we assume that authentication of entities is based on appropriate protocols that apply public key cryptography with corresponding certificates which state that a public key really belongs to a given identity. The usage of public key cryptography for certificate-based authentication has many additional advantages compared to other approaches in which mutual secrets are shared. This has an impact on the set of business models that allow reasonably secure cooperation among the parties involved, especially when considering the directions of money flows in these models, as described in [15]. In general, the application of public key cryptography allows the introduction of stronger solutions for authentication, authorization, and accounting (AAA), so that the trust requirements are easier to fulfil in potential business models in mobile communications.
3.1 Parties in ISP Roaming Models
In order to present our solutions to apply certificate-based entity authentication in the context of mobile communication and to support mobile user roaming, we
Fig. 1. RSP, ISP, mobile users, and PKI aspects. (The diagram shows the RSP operating CA(R), CRI(R), and PKIS(R); the large ISP1 operating its own CA(I), CRI(I), and PKIS(I); the smaller ISP2 without its own PKI components; and the customers of ISP1 and ISP2 with the certificates issued to them.)
restrict our considerations to those users that roam to different ISPs. Even if we restrict the focus of our considerations to roaming users that use Internet access services from distinct ISPs, this does not affect the generality of our solution's underlying principles. These principles also hold in other roaming scenarios that deal with certificate-based entity authentication. We assume that a user has a specific relationship with one ISP offering Internet access, with which the mobile user has subscribed for ISP services. After service provision, contacted ISPs are recompensed by the subscribed ISP. Usually, an ISP has an interest in allowing its customers to roam to many other ISPs, since a more extensive mobile network increases the attractiveness of its services. Since the number of ISPs is expected to increase, bilateral roaming agreements cause a considerable workload for ISPs. We use a specialized party, called Roaming Service Provider (RSP), that supports ISPs in roaming contexts. The RSP acts as an intermediary for ISPs by reducing the effort for an ISP to establish roaming agreements with other ISPs. However, it is not our intention to present the complete functionality of an RSP. Instead, we will restrict our considerations exclusively to those aspects that are related to PKI.
3.2 Location of PKI-Related Components
In order to support certificate-based entity authentication in mobile communications, we propose a special structure for the location of specific PKI components. This structure guarantees that the length of the certificate paths that have to be processed for entity authentication is upper-bounded by a small value. In the following, we explain which PKI-related functions are associated with which entities. In our proposal, an RSP operates a certification authority CA(R), which issues certificates. This is depicted in Figure 1, which shows the association of certificates and issuing CAs. Furthermore, the RSP operates a component CRI(R) that provides certificate revocation information, e.g., as certificate revocation lists (CRL) or as certificate status information via OCSP [9,13]. Additionally, an RSP has a PKI server PKIS(R) to which certificate path
validation work can be delegated by other parties, e.g., in accordance with SCVP, which is currently being standardized [12]. There are different types of ISPs. Large ISPs, e.g., ISP1 in Figure 1, may operate their own certification authority CA(I), a component CRI(I) for the provision of revocation information for certificates issued by CA(I), and their own server PKIS(I). Smaller ISPs, e.g., ISP2 in Figure 1, may use the PKI services of another party, e.g., the RSP. CA(I) of ISP1 can either be certified by an additional third party or have a self-signed certificate (Figure 1 shows the self-signed case). Furthermore, ISP1 also has an additional cross-certificate issued by CA(R). CA(I) issues certificates for ISP1's customers. When a contractual relationship to a customer is canceled before the expiration date of the customer's certificate, or if the certificate should be invalidated, then CA(I) revokes the certificate, i.e., this is reflected in the revocation information of CRI(I). From a PKI-related point of view, ISP2 only plays the role of a registration authority. When a customer registers with ISP2, ISP2 checks the customer data and provides the customer with a certificate generated by CA(R). Consequently, when the customer certificate has been revoked, CRI(R) provides the corresponding information. ISPs may have an interest in preventing their competitors from accessing the complete revocation information on all their customers, e.g., in order to hide customer fluctuation from competitors. If this kind of secrecy is required, then access to a corresponding CRI component is exclusive to specific parties. PKISs also have their own public keys and certificates, which they obtain from their associated CAs. These keys are applied by parties that have delegated validation work to a PKIS in order to verify the integrity of responses given by a PKIS.
3.3 RSP Cooperation
In practice, there may be several RSPs, each holding business relationships with many ISPs. In order to allow customers to roam to ISPs that are not associated with the RSP of the customer's home ISP, it is necessary to introduce special cooperations among RSPs. We say that all ISPs and their customers that belong to the same RSP span an RSP domain. We propose three variants for RSP cooperation to support roaming of mobile users to ISPs where the mobile users and the ISPs belong to distinct RSP domains, for given RSP_1, ..., RSP_n with n > 1. T(e) denotes the trust anchor set of entity e, and c(e) denotes its certificate. Furthermore, we assume that c(CA_i^(R)) ∈ T(e) if e belongs to the domain of RSP_i, e.g., e is a mobile user, an ISP, or an RSP_i that operates a PKIS_i^(R). Trust anchor sets are applied for the construction of certificate paths, i.e., in our proposal, they are applied by PKISs. In general, the trust anchor set to be applied can be specific to the party that delegates certificate path construction to a PKIS_i^(R), or PKIS_i^(R) can apply its own T(PKIS_i^(R)) for all verification requests. We consider the following variants for RSP cooperation:
1. RSP cross-certification. The certification authorities associated with RSPs, e.g., CA_k^(R) with RSP_k and CA_l^(R) with RSP_l, cross-certify each other. Thus, CA_k^(R) has several certificates, i.e., a public key certificate (potentially self-signed) and cross-certificates issued by CA_i^(R) for i = 1, ..., k-1, k+1, ..., n. The number of CA_k^(R)'s cross-certificates grows with the number of RSPs that RSP_k cooperates with. The introduction of cross-certificates enables PKIS_i^(R) to construct a certificate path to CA_i^(R) from a certificate that is associated with another RSP domain. This type of cooperation among RSPs is canceled by revoking the corresponding cross-certificate.
2. Definition of trust anchor sets. Certificate paths to be constructed and validated can be shortened by an appropriate definition of trust anchor sets. If PKIS_k^(R) modifies its trust anchor set such that CA_i^(R)'s certificate is in T(PKIS_k^(R)), then the verification work can be reduced when the certificate to be verified belongs to a party that is associated with the domain of RSP_i. Thus, in order to allow roaming across RSP domains, RSPs can define appropriate trust anchor sets. This type of cooperation between RSP domains is canceled by deleting the corresponding CA, or more precisely the CA's public key with its certificate, from the trust anchor set.
3. Re-delegation. RSPs can operate their PKISs in such a way that re-delegation of a verification request to a corresponding PKIS is supported. When a party from the domain of RSP_k delegates the verification of a certificate belonging to the domain of RSP_l to PKIS_k^(R), then PKIS_k^(R) can re-delegate the verification request to PKIS_l^(R). Re-delegation can be advantageous when PKIS_k^(R) has no permission to access revocation information for the corresponding certificate, but PKIS_l^(R) has the required permission, e.g., since it belongs to the same RSP domain as the certificate holder. The party that initiates re-delegation has to be provided with the address of the party to which the work is re-delegated. In order to solve this problem we propose a special X.509v3 certificate extension PKISRedelegationExt according to the rules for certificate extensions in [9].
In practice, RSPs can choose whether they establish their cooperation based on cross-certification, on the appropriate definition of trust anchor sets, or on re-delegation. All these variants are based on a direct cooperation of RSPs. Note that certificate path verification across distinct RSP domains would also be possible without direct cooperation of RSPs, e.g., by cross-certification chains of arbitrary length. However, the verification of such certificate paths can become very time-consuming, and thus we do not recommend it.
4 Authentication of Mobile Users
In this section, we show how PKISs can be applied to support contacted ISPs that authenticate mobile users. We consider mechanisms for certificate path verification of user certificates during the authentication protocol. When a mobile user sends his certificate to a contacted ISP within an authentication protocol, then the contacted ISP delegates the verification of the
Fig. 2. Contacted ISP authenticates mobile user in distinct RSP domains (left) and same RSP domain (right). (The diagram shows the contacted ISP delegating the verification to the PKIS(R) of its own RSP domain; in the distinct-domain case this PKIS may re-delegate the request to the PKIS(R) of the mobile user's RSP domain, which accesses the CRI(R) there, before the delegation response is returned to the ISP.)
complete path, including the mobile user's certificate, to a specific PKIS. This PKIS is operated by the RSP of the ISP's own domain. This is depicted in Figure 2. The work to be done by a PKIS differs depending on whether the mobile user and the contacted ISP belong to the same or distinct RSP domains. When the certificate path has been verified by the PKIS, it sends a delegation response to the contacted ISP. According to the content of this message, the mobile user can be authenticated or not, i.e., if the result of the delegation is negative, the claimed identity should not be accepted by the contacted ISP even if further verification results in the authentication protocol are positive. The usage of a PKIS for certificate path verification in mobile user authentication has several advantages for contacted ISPs:
– Delegation to PKISs, or re-delegation, respectively, can be useful when only specific parties have the right to access components that provide certificate revocation information. Potentially, there are CAs that do not allow certificate revocation information retrieval by all ISPs.
– PKISs can cache certificates of CAs which may be contained in certificate paths. Since the core work of a PKIS requires that the PKIS has CA certificates available, the probability is rather high that cached certificates can be used several times. Thus, caching speeds up the verification process.
– PKISs can cache revocation information. Thus, a PKIS can use cached revocation information within certain periods, e.g., some hours. If cached revocation information can be used for several verification delegation requests, the verification time is reduced.
When the contacted ISP sends a delegation request for certificate path verification to the PKIS provided by the RSP of its own domain, the PKIS processes this request as shown in Figure 3. In the first step, the PKIS decides if re-delegation is required. Note that re-delegation may only occur when the mobile user and the contacted ISP belong to distinct RSP domains. In our work, we assume that for each delegation request by ISPs only one re-delegation is allowed, in order to avoid re-delegation chains; this reduces the time for path verification. If the PKIS opts for re-delegation, then it sends the request for verification
Fig. 3. Algorithm to be carried out by a PKIS. (Flowchart: decide whether re-delegation is required; if so, start the re-delegation, wait for the re-delegation response, and evaluate it; otherwise fetch issuer certificates until the issuer is in the trust anchor set T or the maximum chain length is reached, check the correctness of the certificates, and check their revocation status; in all cases, create and send the delegation response.)
re-delegation to a PKIS of the mobile user's RSP domain. For this case, the mobile user's certificate contains an X.509v3 extension PKISRedelegationExt that refers to the correct PKIS address. Upon receipt of the re-delegation request, the contacted PKIS starts the same algorithm as depicted in Figure 3, without the option of further re-delegation. The PKIS that has initiated the re-delegation then waits for the re-delegation response. Upon receipt of this response, the PKIS evaluates it and creates a delegation response for the requesting ISP. If a PKIS does not re-delegate the verification request, it carries out the complete work for certificate path construction and verification. The PKIS fetches the issuer certificate for the presently considered certificate in the path until the issuing CA is contained in the trust anchor set or the maximum acceptable path length is reached. The PKIS may obtain the CA certificates directly from the CA or from its local database. When the complete certificate path has been constructed in this way, the PKIS checks the correctness of the certificates by applying the corresponding public keys. If all certificates in the chain are correct, the PKIS checks the revocation status of these certificates by exploiting the information that is provided by the corresponding CRI components. Depending on the result of the verification, the PKIS creates a delegation response and sends it to the requesting ISP. The effort required by a PKIS for the construction of a certificate path, the verification of its correctness, and the status checks depends on several factors: Do the mobile user and the ISP belong to the same or distinct RSP domains? Is the mobile user's certificate issued by his subscribed ISP's CA service or by an RSP CA? What is the type of cooperation among RSPs?
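A compact sketch of this delegation handling (cf. Fig. 3) is given below. It is an illustrative reconstruction, not the implemented PKIS: the certificate object model and the helper methods (may_access_cri, redelegate, fetch_issuer_certificate, verify_signature, within_validity, revocation_status, sign_response) are placeholders for the operations named in the text, and the maximum path length is an assumed constant.

MAX_PATH_LENGTH = 4  # assumption: longest acceptable chain, cf. the cases below

def handle_delegation_request(pkis, cert, allow_redelegation=True):
    """Process a delegation request for certificate path validation (cf. Fig. 3)."""
    # 1. Re-delegate if the needed revocation information is not accessible here
    #    and the certificate names a responsible PKIS (at most one re-delegation).
    target = cert.extensions.get("PKISRedelegationExt")
    if allow_redelegation and target and not pkis.may_access_cri(cert):
        response = pkis.redelegate(target, cert)   # wait for the re-delegation response
        return pkis.sign_response(response)        # evaluate it and re-sign for the ISP

    # 2. Construct the path up to a trust anchor or the maximum length.
    #    pkis.trust_anchors is assumed to map issuer names to anchor public keys.
    path = [cert]
    while path[-1].issuer not in pkis.trust_anchors:
        if len(path) >= MAX_PATH_LENGTH:
            return pkis.sign_response("invalid: no trust anchor reachable")
        path.append(pkis.fetch_issuer_certificate(path[-1]))  # from the CA or a local cache

    # 3. Check correctness (signature, validity time) of every certificate in the path.
    anchor_key = pkis.trust_anchors[path[-1].issuer]
    issuer_keys = [c.public_key for c in path[1:]] + [anchor_key]
    for child, issuer_key in zip(path, issuer_keys):
        if not pkis.verify_signature(child, issuer_key) or not pkis.within_validity(child):
            return pkis.sign_response("invalid: incorrect certificate")

    # 4. Check the revocation status of every certificate in the constructed path
    #    (the trust anchor's own certificate is not checked, as described in the text).
    for c in path:
        if pkis.revocation_status(c) != "good":    # via the responsible CRI component
            return pkis.sign_response("invalid: revoked certificate")

    return pkis.sign_response("valid")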
In the following we consider the certificate paths that have to be verified for a certificate of a mobile user U. We denote the certificate of party X by c(X). The arrows '→' indicate the order in which a PKIS constructs a path. Note that the PKIS has the certificate and the public key of the applied trust anchor stored locally. We assume that verifying PKISs apply their own trust anchor sets (it would also be possible for the ISPs which delegate the verification to the PKIS to provide the PKIS with their own trust anchor set to be applied for path verification). For the trust anchor's certificate no revocation check is done. Certificates that require a revocation check are marked by '↓'. Revocation information for a certificate is provided by the CRI component of the CA that follows next in the path. A PKIS has to fetch those certificates that are contained after the first and before the last certificate in the chains shown below. There are the following certificate paths:
1. U and contacted ISP in the RSP_i domain, U has obtained c(U) from CA_i^(R):
   c(U)↓ → c(CA_i^(R)) ∈ T(PKIS_i^(R))
2. U and contacted ISP in the RSP_i domain, U has obtained c(U) from the CA^(I) operated by his subscribed ISP:
   c(U)↓ → c(CA^(I))↓ → c(CA_i^(R)) ∈ T(PKIS_i^(R))
3. U in the RSP_i domain, contacted ISP in the RSP_j domain, U has obtained c(U) from CA_i^(R), RSP cooperation based on cross-certification:
   c(U)↓ → c(CA_i^(R))↓ → c(CA_j^(R)) ∈ T(PKIS_j^(R))
4. U in the RSP_i domain, contacted ISP in the RSP_j domain, U has obtained c(U) from the CA^(I) operated by his subscribed ISP, RSP cooperation based on cross-certification:
   c(U)↓ → c(CA^(I))↓ → c(CA_i^(R))↓ → c(CA_j^(R)) ∈ T(PKIS_j^(R))
5. U in the RSP_i domain, contacted ISP in the RSP_j domain, U has obtained c(U) from CA_i^(R), RSP cooperation based on the modification of T:
   c(U)↓ → c(CA_i^(R)) ∈ T(PKIS_j^(R))
6. U in the RSP_i domain, contacted ISP in the RSP_j domain, U has obtained c(U) from the CA^(I) operated by his subscribed ISP, RSP cooperation based on the modification of T:
   c(U)↓ → c(CA^(I))↓ → c(CA_i^(R)) ∈ T(PKIS_j^(R))
7. U in the RSP_i domain, contacted ISP in the RSP_j domain, U has obtained c(U) from CA_i^(R), RSP cooperation based on re-delegation to PKIS_i^(R):
   c(U)↓ → c(CA_i^(R)) ∈ T(PKIS_i^(R))
8. U in the RSP_i domain, contacted ISP in the RSP_j domain, U has obtained c(U) from the CA^(I) operated by his subscribed ISP, RSP cooperation based on re-delegation to PKIS_i^(R):
   c(U)↓ → c(CA^(I))↓ → c(CA_i^(R)) ∈ T(PKIS_i^(R))
As can be seen above, certificate paths are shorter when ISPs do not provide their own CA service. When RSP cooperation is based on the appropriate definition of trust anchor sets or on re-delegation, the effort for verification is reduced compared to cross-certification. The biggest effort is required in case 4. However, even there the effort for certificate path verification involves only fetching two certificates (c(CA^(I)), c(CA_i^(R))) and the correctness and status check of three certificates (c(U), c(CA^(I)), c(CA_i^(R))).
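The effort figures quoted for case 4 can be reproduced with a small helper that, for a path given in the above notation, counts the certificates a PKIS has to fetch and check. The function below is only a worked illustration of the counting rule stated in the text (fetch everything after the first and before the last certificate in the chain; check correctness and revocation status of everything except the trust anchor's certificate); the certificate names are plain strings used for illustration.

def verification_effort(path):
    """path: list of certificate names, ending with the trust anchor's certificate."""
    fetches = len(path) - 2             # certificates after the first and before the last
    correctness_checks = len(path) - 1  # every certificate except the trust anchor's
    status_checks = len(path) - 1       # revocation is not checked for the trust anchor
    return fetches, correctness_checks, status_checks

# Case 4: c(U) -> c(CA_I) -> c(CA_i_R) -> c(CA_j_R), the last being the trust anchor.
print(verification_effort(["c(U)", "c(CA_I)", "c(CA_i_R)", "c(CA_j_R)"]))  # (2, 3, 3)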
5 Authentication of ISPs
In this section, we focus on the opposite direction of mutual authentication: mobile users authenticate contacted ISPs. In order to do this, we propose that PKISs support mobile users in the certificate path verification of ISP certificates. Besides the advantages of PKIS-based certificate path verification mentioned in the previous section, there are some additional reasons that make the usage of PKISs attractive:
– Limited bandwidth of mobile connections. The construction of certificate paths may require the collection of certificates from several CAs until a trust anchor is reached. Furthermore, revocation status information has to be collected for these certificates.
– Limited capacity of mobile devices. The verification of certificate correctness requires extensive computational work for all certificates in the path.
– Support of mobile users that do not have Internet access at the authentication phase. (Note that they connect to an ISP to obtain Internet access. However, this aspect is not relevant for mobile users that authenticate ISPs when their Internet connection is handed over from one ISP to another ISP, i.e., they can use their existing connection for certificate path verification, e.g., in case of a soft handover.)
Referring to the last reason, there are different ways to deal with this problem. As a first possibility, the mobile user could carry out the authentication protocol without having the result of the certificate path verification and certificate status information. Then, he could use the Internet connection, obtained from the not yet completely authenticated ISP, to validate the ISP's certificate. If the verification yields that the certificate path is not valid, the mobile user can decide to terminate the connection. However, this solution is rather cumbersome. As a second possibility, the mobile user asks the contacted ISP within the authentication protocol to provide him with the desired verification results regarding the validity of the ISP's own certificate. This is sketched in Figure 4. In order to achieve a high level of security, the verification result should include the
Fig. 4. Mobile user authenticates contacted ISP in distinct RSP domains (left) and same RSP domain (right). (The diagram shows the mobile user sending a validation request to the contacted ISP, which delegates the verification to the PKIS(R) stipulated by the user, i.e., the one of the user's RSP domain; in the distinct-domain case this PKIS may re-delegate the verification to the PKIS(R) of the ISP's RSP domain; the delegation response is returned via the ISP in the validation response to the user.)
verification of the complete certificate path for some given set of trust anchors. Therefore, the mobile user sends a validation request for the ISP's certificate to the ISP itself. This validation request contains a unique reference to the component that should carry out the validation of this certificate. This component is selected by the mobile user and is trusted by the mobile user to carry out the validation correctly. In our scenario, this component is a PKIS that is provided by the RSP the mobile user is associated with. When the contacted ISP has received the validation request from the mobile user, it sends a message for verification delegation to the PKIS that is stipulated by the mobile user. When the PKIS has received the verification delegation message from the contacted ISP, it behaves in essentially the same way as described in the previous section: either it carries out the verification work on its own, or it re-delegates the work to the PKIS that is operated by the RSP that the contacted ISP is associated with; Figure 3 provides a rather abstract sketch of the sequence of steps carried out by the PKIS. Re-delegation requires an adequate X.509v3 extension PKISRedelegationExt in the ISP certificate. When the re-delegating PKIS obtains the re-delegation response, it has to evaluate it and create a new delegation response that is signed with its own secret key; such a delegation response is also created in the non-delegation case. Then, the delegation response is sent to the ISP, who sends it in the validation response to the mobile user. The mobile user evaluates the validation response by checking its integrity and applies its result in the authentication protocol. There are several reasons why the mobile user specifies which PKIS should carry out the verification. First, it can be assumed that the mobile user trusts this RSP. Second, it can be assumed that the mobile user has the public key of this RSP stored on his mobile device. This public key is necessary for the mobile user in order to be able to check the integrity of the validation response. It cannot be assumed that the mobile user has trustworthy public keys of arbitrary PKISs. Thus, a PKIS that initiated a re-delegation creates a new message when it has received the verification result from the PKIS to which the verification was re-delegated.
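The two ends of this exchange can be summarized in a short sketch: the mobile user builds a validation request naming the PKIS it trusts, and later checks the integrity of the returned delegation response with the PKIS/RSP public key stored on the device. The message fields and the signature-check callback below are assumptions made for illustration; no concrete message format is specified here.

def build_validation_request(isp_certificate, trusted_pkis_reference, trust_anchors):
    """Step 1: the mobile user asks the contacted ISP to have the ISP's own certificate
    validated by a PKIS the user trusts (the user holds that PKIS's public key)."""
    return {"certificate": isp_certificate,
            "validator": trusted_pkis_reference,   # unique reference to the chosen PKIS
            "trust_anchors": list(trust_anchors)}  # anchors the path must end in

def accept_validation_response(response, pkis_public_key, verify_signature):
    """Final step: the user checks the integrity of the (possibly re-created) delegation
    response with the stored PKIS/RSP public key before trusting its verdict.
    verify_signature is a placeholder for a real signature check."""
    if not verify_signature(response["payload"], response["signature"], pkis_public_key):
        return False
    return response["payload"].get("result") == "valid"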
The PKIS carrying out the verification work (either the PKIS that is contacted by an ISP, or the PKIS to which the verification was re-delegated) follows the steps sketched in Figure 4. The amount of work for certificate path validation depends on several factors: Do the mobile user and the contacted ISP belong to the same or distinct RSP domains? Is the ISP's certificate issued by its RSP's CA or by the ISP's own CA? What is the underlying RSP cooperation type? Depending on these conditions, there are the following certificate paths:
1. U and contacted ISP in the RSP_i domain, ISP has obtained c(ISP) from CA_i^(R):
   c(ISP)↓ → c(CA_i^(R)) ∈ T(PKIS_i^(R))
2. U and contacted ISP in the RSP_i domain, ISP has obtained c(ISP) from the CA^(I) it operates itself:
   c(ISP)↓ → c(CA^(I))↓ → c(CA_i^(R)) ∈ T(PKIS_i^(R))
3. U in the RSP_i domain, contacted ISP in the RSP_j domain, ISP has obtained c(ISP) from CA_j^(R), RSP cooperation based on cross-certification:
   c(ISP)↓ → c(CA_j^(R))↓ → c(CA_i^(R)) ∈ T(PKIS_i^(R))
4. U in the RSP_i domain, contacted ISP in the RSP_j domain, ISP has obtained c(ISP) from the CA^(I) it operates itself, RSP cooperation based on cross-certification:
   c(ISP)↓ → c(CA^(I))↓ → c(CA_j^(R))↓ → c(CA_i^(R)) ∈ T(PKIS_i^(R))
5. U in the RSP_i domain, contacted ISP in the RSP_j domain, ISP has obtained c(ISP) from CA_j^(R), RSP cooperation based on the modification of T:
   c(ISP)↓ → c(CA_j^(R)) ∈ T(PKIS_i^(R))
6. U in the RSP_i domain, contacted ISP in the RSP_j domain, ISP has obtained c(ISP) from the CA^(I) it operates itself, RSP cooperation based on the modification of T:
   c(ISP)↓ → c(CA^(I))↓ → c(CA_j^(R)) ∈ T(PKIS_i^(R))
7. U in the RSP_i domain, contacted ISP in the RSP_j domain, ISP has obtained c(ISP) from CA_j^(R), RSP cooperation based on re-delegation to PKIS_j^(R):
   c(ISP)↓ → c(CA_j^(R)) ∈ T(PKIS_j^(R))
8. U in the RSP_i domain, contacted ISP in the RSP_j domain, ISP has obtained c(ISP) from the CA^(I) it operates itself, RSP cooperation based on re-delegation to PKIS_j^(R):
   c(ISP)↓ → c(CA^(I))↓ → c(CA_j^(R)) ∈ T(PKIS_j^(R))
The properties of these variants for certificate path validation are the same as those in the previous section. The longest certificate path occurs in case 4, where the cooperation is based on cross-certification. The other cooperation models lead to the same path lengths as in the single-domain case.
6 PKIS Technology
In the previous sections, we have explained why and how PKISs should be applied for certificate path validation in mutual authentication scenarios, especially in mobile communications. We have developed a PKIS, described in [10], that can be used in mobile communication scenarios. Due to space limitations, we cannot describe the details of our PKIS solution. Currently, our PKIS is based on DPV technology [14], i.e., in our implementation this technology is applied for the validation delegation and the delegation response. The adaptation of our PKIS to the current status of SCVP technology [12] is under way. SCVP is currently being developed and standardized by the PKIX group; it is a protocol that may be used for the implementation of the validation delegation and the delegation response. The PKIS supports several approaches to deal with certificate revocation information: CRLs and OCSP [9,13]. Which technology is used in a specific instance depends on the way CAs provide their revocation information, i.e., how they operate their CRI component. Thus, our PKIS supports both technologies. So far, the PKIS is based on existing or presently emerging standards. For one aspect of our solution, however, there is no standardized solution in sight. When the mobile user authenticates the contacted ISP, there is the problem that the mobile user does not necessarily have Internet access at this stage. Thus, the validation of the ISP's certificate is delegated via the ISP to a PKIS which is trusted by the mobile user. In the current version of the TLS protocol and its newly standardized extensions [5,4], there is an option to ask for the status of a certificate with the extension CertificateStatusRequest. TLS can be used in mobile contexts over EAP (EAP TLS [1]). The newly defined TLS extension allows the mobile user to request an OCSP response on the ISP's certificate, to be generated by a trusted OCSP responder. The standardization of this extension was motivated by the goal of supporting constrained clients in checking the validity of certificates by using a certificate status protocol such as OCSP. This avoids the transmission of CRLs, and therefore saves bandwidth in networks with limited capacity. However, an OCSP response only provides information about the status of the certificate contained in the request; it says nothing about the validity of a certificate path for this certificate with respect to a given trust anchor. In OCSP, there are three indicators for the status value of a certificate: good, revoked, and unknown. In order to obtain information about the validity of a complete certificate path, a TLS extension that allows the initiation of a complete certificate path validation for some given certificate and a given trust anchor set would be more appropriate. Therefore, an extension that invites a contacted ISP to send an SCVP request to a PKIS instead of an OCSP request to an OCSP responder is necessary. However, such a standard does not yet exist. In order to circumvent this gap, our solution tackles the problem by using the existing TLS extension. We cope with the problem by using OCSP with modified semantics. The PKIS generates an OCSP response good in order to indicate that the certificate
path is valid. An OCSP response revoked is produced in case of an invalid path, and unknown is used in case no path could be built. With the validation delegation request, the contacted ISP asks the PKIS for an OCSP response with the appropriately modified semantics. When the contacted ISP receives the OCSP response, it sends it to the mobile user, who interprets the modified semantics in the required way. This solution can be applied as long as there is no TLS extension to initiate a complete certificate path validation request. However, once SCVP is standardized, standardization work should be done to extend the current TLS extension to include support for SCVP requests.
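The modified OCSP semantics amount to a simple mapping from the outcome of the certificate path validation to the three OCSP status values. The sketch below states this mapping explicitly; the outcome labels on the left-hand side are illustrative names for the three situations described above, not identifiers from any standard.

# Path-validation outcome (as determined by the PKIS)  ->  OCSP status sent back
OCSP_STATUS_FOR_PATH_RESULT = {
    "path_valid":    "good",     # a complete, correct, unrevoked path was found
    "path_invalid":  "revoked",  # a path was built but failed the correctness or status checks
    "no_path_found": "unknown",  # no path to a trust anchor could be constructed
}

def ocsp_status_for(path_result):
    """Map a path validation result onto the OCSP status with modified semantics."""
    return OCSP_STATUS_FOR_PATH_RESULT[path_result]

print(ocsp_status_for("path_valid"))  # -> good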
7 Related Work
Certificate-based authentication and related problems have already been discussed in other work. In [7,8], the authors propose a solution to minimize the handover delay. The proposed solution assumes an indirect trust generation: in order to reduce the time for re-authentication in case of a handover, the old domain controller provides the newly contacted domain controller with the mobile user's identity. However, it is questionable whether the real trust relationships among domain controllers are always adequate for this solution. Furthermore, the approach focuses mainly on speeding up re-authentication in case of handover; no solution is given for speeding up authentication in general, e.g., by limiting certificate paths or by PKISs. Typical problems that arise in the context of certificate path validation and the advantages of PKISs have been presented in [3,10]. However, these works do not address the benefits of PKISs in mobile communications. The authors of [11] considered the application of PKI servers for restricted mobile devices, but only for application contexts (e.g., m-commerce) and not for roaming scenarios. In the SHAMAN project, certificate-based authentication for roaming devices in Personal Area Networks (i.e., spheres of about 10 meters radius) has been considered, e.g., see [6,16]. There, the new concept of a personal CA has been introduced, which differs from a usual large CA. All certificates in such PANs are issued by the same CA, whose public key is shared by all entities. Thus, there is no necessity for complicated certificate path validation. However, this model does not seem to be applicable in general roaming scenarios. Even though we have motivated the use of PKISs for certificate-based authentication in the case of ISP roaming, our ideas can be applied in other roaming contexts. According to [2], for today's 4G visions there is a need for integrated certificate-based authentication at the network access and service levels. The underlying principles of our solution are general enough to be applied to the problems that arise in the context of certificate-based authentication for future mobile networks and their services.
8 Conclusion
The goal of this work was to provide solutions for the validation of certificate paths for certificate-based authentication in mobile communication scenarios. We
have given proposals on where to locate PKI components, especially PKI servers that allow the authentication of mobile users and contacted ISPs to be sped up in the case of roaming users. We have shown how to apply PKI servers in different roaming cooperations. Furthermore, we have dealt with the problem of mobile users that do not have Internet access and thus cannot verify certificate paths on their own. We have also shown a gap in existing standards for authentication and explained how we work around this problem until an appropriate standard is available.
References
1. B. Aboba, D. Simon. PPP EAP TLS Authentication Protocol. RFC 2716, 1999
2. J. v. Bemmel, H. Teunissen, G. Hoekstra. Security Aspects of 4G Services. In Wireless World Research Forum (WWRF 9), 2003.
3. D. Berbecaru, A. Lioy, M. Marian. On the Complexity of Public-Key Certificate Validation. In Information Security (ISC01), 4th International Conference, LNCS 2200. Springer Verlag, 2001.
4. S. Blake-Wilson, M. Nystrom, D. Hopwood, J. Mikkelsen, T. Wright. Transport Layer Security (TLS) Extensions. RFC 3546, 2003.
5. T. Dierks, C. Allen. The TLS protocol version 1.0. RFC 2246, 1999.
6. C. Gehrmann. Detailed technical specification of distributed mobile terminal system security. In D10, SHAMAN Project (IST-2000-25350), 2002.
7. J. Gu, S. Park, O. Song, J. Lee. A PKI-Based Authentication Framework for Next Generation Mobile Internet. In Web Communication Technologies and Internet-Related Social Issues (HSI 2003), LNCS 2713. Springer Verlag, 2003.
8. J. Gu, S. Park, O. Song, J. Lee, J. Nah, S. Sohn. Mobile PKI: A PKI-Based Authentication Framework for the Next Generation Mobile Communications. In Information Security and Privacy, 8th Australasian Conference (ACISP 2003), LNCS 2727. Springer Verlag, 2003.
9. R. Housley, W. Polk, W. Ford, D. Solo. Internet X.509 Public Key Infrastructure Certificate and Certificate Revocation List (CRL) Profile. RFC 3280, 2002.
10. B. Hunter. Simplifying PKI Usage through a Client-Server Architecture and Dynamic Propagation of Certificate Paths and Repository Addresses. In Trust and Privacy in Digital Business (TrustBus 2002), 2002.
11. M. Jalali-Sohi, P. Ebinger. Towards Efficient PKIs for Restricted Mobile Devices. In IASTED Int. Conf. Comm. and Comp. Networks (CCN02), 2003.
12. A. Malpani, R. Housley, T. Freeman. Simple Certificate Validation Protocol (SCVP). Internet Draft, 2003.
13. M. Myers, R. Ankney, A. Malpani, S. Galperin, C. Adams. X.509 Internet Public Key Infrastructure Online Certificate Status Protocol – OCSP. RFC 2560, 1999.
14. D. Pinkas, R. Housley. Delegated Path Validation and Delegated Path Discovery Protocol Requirements. RFC 3379, 2002.
15. D. Westhoff. The role of mobile device authentication with respect to domain overlapping business models. In 6th World Multiconference on Systemics, Cybernetics and Informatics (SCI02), 2002.
16. T. Wright. Intermediate specification of PKI for heterogeneous roaming and distributed terminals. In D07, SHAMAN Project (IST-2000-25350), 2002.
The Value of Handhelds in Smart Environments Frank Siegemund, Christian Floerkemeier, and Harald Vogt Institute for Pervasive Computing Department of Computer Science ETH Zurich, Switzerland {siegemund|floerkem|vogt}@inf.ethz.ch
Abstract. The severe resource restrictions of computer-augmented everyday artifacts imply substantial problems for the design of applications in smart environments. Some of these problems can be overcome by exploiting the resources, I/O interfaces, and computing capabilities of nearby mobile devices in an ad hoc fashion. We identify the means by which smart objects can make use of handheld devices such as PDAs and mobile phones, and derive the following major roles of handhelds in smart environments: (1) mobile infrastructure access point, (2) user interface, (3) remote sensor, (4) mobile storage medium, (5) remote resource provider, and (6) weak user identifier. We present concrete applications that illustrate these roles, and describe how handhelds can serve as mobile mediators between computer-augmented everyday artifacts, their users, and background infrastructure services. The presented applications include a remote interaction scenario, a smart medicine cabinet, and an inventory monitoring application.
1
Introduction
As pointed out by Weiser and Brown [13], "Ubiquitous Computing is fundamentally characterized by the connection of everyday things in the real world with computation". Computer-augmented everyday artifacts, also called smart everyday objects, epitomize this vision of Ubiquitous Computing in that they are everyday objects augmented with small sensor-based computing platforms (cf. Fig. 1). Smart objects are aware of their environment, can perceive their surroundings through sensors, collaborate with peers using short-range wireless communication technologies, and provide context-aware services to users in smart environments. But the computational capabilities of smart objects are very limited because their computing platform needs to be small and unobtrusive. Furthermore, they do not possess conventional I/O interfaces such as keyboards or displays, which restricts the interaction with users. And finally, because of their limited energy resources, smart objects support only short-range communication technologies, which makes it difficult to access background infrastructure services when no access point is nearby. Combined, all these limitations cause severe problems for the design of applications in environments of smart objects.
Parts of this work were supported by the Smart-Its project, which is funded by the European Commission (contract No. IST-2000-25428) and the Swiss Federal Office for Education and Science (BBW No. 00.0281).
Fig. 1. A smart everyday object: an everyday item augmented with a sensor-based computing platform.
We argue that most of these problems can be overcome when smart objects can spontaneously access the capabilities of nearby handheld devices. In smart environments, people carrying their personal devices move around. By exploiting the features of nearby handhelds in an ad hoc fashion, new possibilities for the design of applications on smart objects emerge. We identify and illustrate six different means by which computer-augmented everyday artifacts can make use of handhelds: (1) as mobile infrastructure access point, (2) as user interface, (3) as remote sensor, (4) as mobile storage medium, (5) as remote resource provider, and (6) as weak user identifier. Given these roles, handhelds can enrich the interactions among smart objects, users, and background infrastructure services (cf. Fig. 2).
Fig. 2. Handhelds as mediators in smart environments: handheld devices enrich the interaction between different smart objects, between smart objects and their users, and between smart objects and background infrastructure services
As mobile access points, handhelds facilitate the ad hoc interaction between smart objects and a background infrastructure. A handheld's input and display capabilities enable new forms of user interaction with smart objects. And finally, the cooperation among smart objects themselves can be improved by utilizing handheld devices as remote resource providers. As mediators between smart objects, users, and background infrastructure services, handhelds enable new forms of applications in smart environments. To support this thesis, we present three applications that make extensive use of handheld devices. We start with a remote interaction application. Here, interaction patterns that people associate with a specific type of handheld device (e.g., making phone calls) are translated to smart environments. In particular, we assign phone numbers to smart objects and describe how users can "call" and interact with objects from remote locations. The remote interaction application uses handhelds as mobile storage medium, user interface, and weak user identifier. We then present the Smart Medicine Cabinet. It improves medical compliance and facilitates a more effective treatment of mobile patients by using handheld devices together with "smart medicine". In this application, handhelds serve as mobile infrastructure access points, and again as user interface and mobile storage medium. Finally, we present an inventory monitoring application. Its goal is to illustrate how smart objects can spontaneously outsource computations to nearby handheld devices. It illustrates a handheld's ability to serve as remote resource provider and also as user interface for smart objects. In the following section we review related work. In Section 3 we identify the roles of handhelds in smart environments. Section 4 discusses how these roles can actually be implemented. Sections 5 through 7 present three applications that illustrate these roles. Section 8 concludes the paper by summarizing the lessons learned from our applications.
2
Related Work
Gellersen et al. [6] propose to integrate low-cost sensors into everyday objects and mobile user devices to facilitate the development of context-aware applications. In their MediaCup project [2], active sensor tags are embedded into everyday objects to derive and provide services based on the situational context of users. For example, the context "meeting" can be inferred from the presence of many hot cups in a meeting room. However, their work assumes stationary access points that facilitate the cooperation between active artifacts and existing infrastructure services, while we explicitly focus on utilizing nearby handheld devices in an ad hoc fashion. Hartwig et al. [7] integrate Web servers into active Bluetooth-enabled tags, attach them to physical objects, and control augmented items with nearby Bluetooth-enabled mobile phones by means of a WAP (Wireless Application Protocol) interface. In their approach, WAP is used for local interactions with augmented items,
whereas we use WAP (among other technologies) for remote interactions with smart objects. Want et al. [12] augment physical objects with passive RFID tags to associate objects with a representation in the virtual world. In the Cooltown project [8], everyday items are also equipped with tags in order to link them with a representation in the World Wide Web. In both approaches, the functionality of a tag consists primarily of providing a link to information in the virtual world. As tags are usually read out by a mobile user device, the actual application is implemented on the handheld or in the backend infrastructure, but not on the tag. In our approach, the active tags and sensors attached to physical objects process data autonomously, derive context information in collaboration with other smart objects, and coordinate the actual applications. Nearby handheld devices do not implement applications for smart objects, but are only a means by which computer-augmented artifacts can dynamically extend their own capabilities.
3
Characteristics and Roles of Handhelds in Smart Environments
The roles of handheld devices in environments of computer-augmented everyday artifacts are manifold. Handhelds can serve as primary user interface, they can be a mobile infrastructure access point, provide mobile data storage, act as a user identifier, supply energy and computational resources, or offer sensing capabilities. In this section we identify the main reasons for this versatility: we name important characteristics of handheld devices and derive from them the major roles of handhelds in smart environments.
Habitual presence. As mobile phones, PDAs, and other handheld devices are habitually carried around by their owners, they are always in range of a smart object when a physical interaction with it is about to take place. This is especially important because the smart objects themselves generally do not have access to resources beyond their peers, and handheld devices are the only local devices able to provide powerful resources and sophisticated services. The habitual presence of handheld devices during physical interactions with smart objects is the most important characteristic of handhelds in smart environments. It entails their general function as mediator between smart objects, users, and background infrastructure services, and is therefore a precondition for the roles of handhelds presented in this paper.
Wireless network diversity. Mobile phones and PDAs usually support both short-range and long-range wireless communication technologies, such as Bluetooth, IrDA, WLAN, GSM, or UMTS. This enables handhelds not only to interact with smart objects directly via short-range communication standards but also to relay data from augmented items to powerful computers in a distant infrastructure. The characteristic of wireless network diversity makes it possible for handhelds to serve as mobile infrastructure access points.
User interface and input capabilities. Tags attached to everyday objects have to be small and unobtrusive, and are ideally invisible to human users.
Consequently, they do not possess conventional buttons, keyboards, or screens. Interaction with augmented objects therefore has to take place either implicitly, by considering the sensory data of smart objects, or explicitly, by using the input and display capabilities of other devices [10]. As people are usually familiar with the features provided by their handhelds, interactions with smart objects that are based on these well-known interfaces should make the usage of smart objects more comfortable and easier. As a result, handhelds often serve as the primary user interface for smart objects.
Perception. Handheld devices can serve as remote sensors for a smart object; they are accessed wirelessly using a communication technology supported by all participating devices. The way handheld devices perceive their environment strongly depends on their functionality. Cellular phones, for example, know to which cell they currently belong and can serve as remote location sensors for augmented items.
Mobility. Active tags can transfer data, such as how to reach a smart device from remote locations, to a handheld device, where it is permanently stored and accessible to users independently of their current location. Here, handhelds serve as mobile storage medium for smart objects.
Table 1. The roles of handhelds in smart environments and the underlying characteristics that entail these roles
Handheld's role                      Underlying characteristic
Mobile infrastructure access point   Wireless network diversity
User interface                       Input and display capabilities
Remote sensor                        Perception
Mobile storage medium                Mobility
Remote resource provider             Computational resources; regularly refilled mobile energy reservoirs
Weak user identifier                 Personalization
Computational resources and regularly refilled mobile energy reservoirs. Although the energy consumption of a handheld device such as a cellular phone should be as small as possible, people are used to recharging its batteries at regular intervals. PDAs are often shipped with a cradle that offers both host access to the device and automatic recharging. A similar procedure, however, is not feasible for smart objects because there are just too many of them. As a result, smart objects may exploit handheld devices in range as remote energy reservoirs, for example for carrying out complex and energy-consuming computations. Because of their regularly renewed energy resources, handhelds can also offer more powerful resources regarding memory and bandwidth, which allows smart objects to use them as remote resource providers.
Personalization. PDAs and mobile phones are most often personalized, i.e., they belong to a certain person who uses the device exclusively. Smart devices
can therefore adapt their behavior according to the handheld devices currently in range and thereby offer functions tailored towards certain persons. In this context, handhelds can serve as weak user identifiers. The relation between the described characteristics of handheld devices and the roles we derived from them is summarized in Tab. 1.
4
Interfacing Handhelds from Smart Objects
After having identified the major roles of handhelds in smart environments on a more conceptual level (cf. Tab. 1), we now describe how these roles can actually be implemented. In doing so, we focus on smart objects that are equipped with BTnodes [3]. BTnodes are small computing platforms consisting of a microcontroller, Bluetooth communication modules, an autonomous power supply, and externally attached sensor boards (cf. Fig. 3). As Bluetooth is integrated into an increasing number of consumer devices, BTnodes are well suited to illustrate the roles of handhelds in smart environments. These roles, however, do not depend on Bluetooth or any other specific communication standard (cf. [9] for a discussion of communication issues in smart environments).
Fig. 3. BTnodes are used as a device platform to make everyday objects "smart"
Mobile infrastructure access point. Because of their wireless network diversity (i.e., their support of short-range as well as long-range communication technologies), handheld devices can serve as mobile gateways to background infrastructure services. Technically, this is achieved by establishing a local short-range connection from a smart object to a handheld device, and a long-range communication link from the handheld to a background infrastructure server (cf. Fig. 4). The wireless technology used to build up the long-range connection to the backend infrastructure depends on the capabilities of the handheld device. In
case of PDAs this might be an IEEE 802.11 link to a base station, and a GSM (Global System for Mobile Communication) or GPRS (General Packet Radio Service) connection in case of mobile phones.
Fig. 4. Mobile access points: smart objects use nearby handheld devices to communicate with background infrastructure services in an ad hoc fashion
In our applications, we have realized a handheld's role as mobile infrastructure access point as follows. When a handheld device (e.g., a mobile phone) comes within range of a smart object that needs to access the background infrastructure, the object establishes a local Bluetooth connection to the handheld. The smart object then sends AT commands over this local Bluetooth link in order to establish a long-range GSM data connection from the mobile phone to a background infrastructure server. Given this connection, arbitrary data can be exchanged between the smart object and the background server. There is a standardized set of AT commands supported by all GSM-enabled mobile phones. Besides using explicitly established GSM data connections, it is also possible for smart objects to embed data into an SMS (Short Message Service) message. As the latter approach does not require the overhead of GSM data connection establishment, it is the preferred way to exchange data with a background infrastructure server in our applications.
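The paper itself does not list the command sequence; purely as an illustration, the following minimal Java sketch shows how such an exchange might be driven over an already opened Bluetooth serial (RFCOMM) stream to the phone. The class name, method names, and stream handling are assumptions made for this sketch; only the AT commands themselves (AT+CMGF for text mode, AT+CMGS for submitting a message) are standard GSM commands, and on a real BTnode the equivalent code would be written in C rather than Java.

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: 'in' and 'out' belong to an already established
// RFCOMM (Bluetooth serial) channel to the user's GSM-enabled phone.
public class PhoneGateway {

    private final InputStream in;
    private final OutputStream out;

    public PhoneGateway(InputStream in, OutputStream out) {
        this.in = in;
        this.out = out;
    }

    /** Embeds a payload into an SMS text message addressed to the background server. */
    public void sendSms(String serverNumber, String payload) throws IOException {
        writeCommand("AT+CMGF=1");                        // switch the phone to SMS text mode
        waitFor("OK");
        writeCommand("AT+CMGS=\"" + serverNumber + "\"");
        waitFor(">");                                     // phone prompts for the message body
        out.write(payload.getBytes(StandardCharsets.US_ASCII));
        out.write(0x1A);                                  // Ctrl-Z terminates and sends the message
        out.flush();
        waitFor("OK");                                    // "+CMGS: <ref>" and "OK" signal success
    }

    private void writeCommand(String command) throws IOException {
        out.write((command + "\r").getBytes(StandardCharsets.US_ASCII));
        out.flush();
    }

    private void waitFor(String token) throws IOException {
        StringBuilder seen = new StringBuilder();         // naive matching, no error handling
        int c;
        while ((c = in.read()) != -1) {
            seen.append((char) c);
            if (seen.indexOf(token) >= 0) return;
        }
        throw new IOException("connection closed before \"" + token + "\" was received");
    }
}

Querying the phone for its current cell, as described below for the remote sensor role, follows the same pattern of writing a command and parsing the response.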
User interface. The user interface and input capabilities of handheld devices can be exploited by smart objects to notify users acoustically or to allow interactions with smart items based on a graphical user interface. Mobile phones and PDAs offer several popular features by means of which such an interaction can be realized. They range from (1) custom alarms, (2) SMS messages, (3) WAP pages, and (4) calendar entries and business cards to (5) whole Java user interfaces that can be downloaded over a local connection from a smart object to a handheld device. We have prototypically implemented all of these means to facilitate the user interaction with smart objects, using BTnodes as the prototyping platform to augment everyday objects, and mobile phones or PDAs as handheld devices.
(1) Alarms are written to cellular phones by transmitting standardized AT commands from smart objects over a local Bluetooth connection to a mobile phone.
(2) Similarly, smart objects initiate the exchange of SMS messages with remote users by sending AT commands to a local GSM gateway. This GSM gateway transmits the SMS messages to remote users and relays incoming messages to the corresponding smart objects (cf. Sect. 5).
(3) User interaction with smart objects can also take place via WAP interfaces. In our implementation, user interaction with WAP takes place by means of a background infrastructure service, the background infrastructure representative (BIRT) of a smart object. Smart objects synchronize their state with the BIRT whenever an access point is in range. Based on this information, the BIRT provides WAP pages that reflect the current state of an object. By means of their handheld devices, users can then access these WAP pages and exchange information with the BIRT of a smart object. User input is relayed to the actual smart object during the next synchronization phase (cf. Sect. 6).
(4) Calendar entries and business cards can be exchanged via OBEX (the Bluetooth object exchange protocol) with Bluetooth-enabled mobile phones and PDAs in range of a smart object. Calendar entries can be used as an alternative to custom alarms in mobile phones. Their advantage is that they not only trigger an acoustic alarm at the specified time but also display information on the handheld's screen.
(5) Finally, we have also implemented means for more sophisticated interactions with smart objects. Here, a Java user interface is stored on the smart object. People can select smart objects in their environment by means of a small program on their handheld device and download the user interface from the selected object (cf. Sect. 7).
Remote sensor. The percepts of handheld devices are of potential interest to smart objects, which often simply do not have sufficient resources to deploy sophisticated sensors. Some sensor data, for example the current cell id of a mobile phone, can easily be retrieved from nearby handheld devices. In our implementation, mobile phones can serve as remote location sensors for smart objects: a short-range connection is established from a smart object to a mobile phone, whose location information is queried by exchanging AT commands on top of the local communication link.
Mobile storage medium. Data transmitted from a smart object to a mobile device is available to users independently of their current location and their overall situational context. In our applications, smart objects transmit contact information in the form of telephone book entries, as well as templates that specify the commands supported by a specific smart object, to mobile phones (cf. Sect. 5). This information enables users to start an interaction with smart objects from anywhere. In our software package, such data are transmitted to mobile phones by sending standardized AT commands over a local Bluetooth communication link when a user is in range of a smart object.
Remote resource provider. The previously described roles show how handhelds mediate between smart objects and their users, or between smart objects and background infrastructure services. As remote resource provider, however, a handheld primarily enriches the interaction among smart objects
themselves in that it provides a platform for outsourcing complex computations and offers sophisticated data storage capabilities. Our goal was to spontaneously integrate handheld devices into already existing groups of collaborating smart objects. This goal is achieved by introducing an infrastructure layer facilitating the collaboration among computational entities. This layer is a distributed tuple space [4] for smart objects and handheld devices, which is part of our implementation. Smart objects that want to collaborate form a tuple space and write their sensory data into the space. When a handheld device comes into the range of collaborating objects, it also joins this distributed tuple space. Thereby, smart objects can instantaneously make use of the memory resources of handheld devices on the basis of resource-aware tuple space operations. Our resource-aware tuple space operations try to identify the most suitable place to store sensor tuples. As the actual location of a tuple becomes transparent through the tuple space, tuples are stored on the device with the most spare resources, which is often the handheld device. The most important reason for introducing the tuple space abstraction, however, is that the location where code is executed becomes transparent. This is because all devices operate on the same data basis of the distributed tuple space. Smart objects can therefore simply transfer code to a nearby handheld device participating in their tuple space and thereby exploit its computational resources. In our implementation of this concept, Java classes are stored on a smart object and are spontaneously transmitted to nearby handheld devices when a handheld joins the distributed tuple space (cf. Sect. 7).
Weak user identification. In many interaction scenarios, the identity of an involved user is important for authorizing certain actions or adapting services. Handheld devices offer user identification capabilities in varying flavors. They range from a PIN that is necessary to operate a mobile phone to fingerprint sensors. Knowledge of the PIN gives a hint about the user's identity if one can assume that a device is personalized to a single user, while a biometric sensor provides much stronger confidence in the user's identity. From the point of view of the smart object, the downside of using an external identification mechanism is the necessity of putting trust in the handheld's correct operation (and in the user, e.g. to keep the PIN secret). Since the assurance level on the identity of the current user is usually rather low, we call this feature weak identification. The software package we developed to demonstrate the roles of handhelds in smart environments supports authentication using PIN codes as part of the implemented Bluetooth protocols.
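The paper does not give the programming interface of its tuple space implementation; the following Java sketch merely illustrates what such a shared space might look like. All names (Tuple, TupleListener, DistributedTupleSpace) and the wildcard-matching convention are assumptions made for this example.

import java.util.Arrays;
import java.util.List;

/** A tuple is an ordered list of fields, e.g. ("noise", "office-2.31", 47). */
final class Tuple {
    private final List<Object> fields;
    Tuple(Object... fields) { this.fields = Arrays.asList(fields); }  // null fields act as wildcards in templates
    Object field(int i) { return fields.get(i); }
    int size() { return fields.size(); }
}

/** Invoked by the space when a tuple matching a registered template is written. */
interface TupleListener {
    void tupleWritten(Tuple tuple);
}

/**
 * Minimal view of a distributed tuple space shared by smart objects and handhelds.
 * Where a tuple is physically stored (BTnode or handheld) is decided by the
 * resource-aware implementation, e.g. on the node with the most spare memory.
 */
interface DistributedTupleSpace {
    void write(Tuple tuple);                          // publish sensor data or user input
    Tuple read(Tuple template);                       // non-destructive, blocking read
    Tuple take(Tuple template);                       // destructive read
    void registerCallback(Tuple template, TupleListener listener);
}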
5
Smart Object-Human Interaction: As Easy as Making Phone Calls
In this and the following two sections we present applications that illustrate the previously identified roles of handhelds in smart environments. The application in this section demonstrates how mobile devices serve as mobile storage medium,
weak user identifier, and as user interface for remote interactions with smart objects. People usually associate a specific type of handheld device with a specific way of interacting with communication partners: teenagers write SMS messages to arrange a fun meeting using their mobile phones, and business people organize their appointments with PDAs. Transferring such device-specific behavior to smart environments, while maintaining the interaction patterns people expect from their handheld devices, is a key approach for successfully integrating handhelds into smart environments.
Fig. 5. Remote interaction with smart objects using handheld devices: when people are in range of smart objects, handhelds serve as mobile storage medium for interaction stubs (1); when far away, interaction stubs are executed to trigger interactions with remote objects using the handheld as user interface (2)
In the following, we illustrate how device-specific interaction patterns like making phone calls can be used as a metaphor for implementing remote interactions with smart objects. Enabling remote interactions is a two-step process: (1) when a user is in proximity of a smart object, it stores interaction stubs on the user's handheld device; (2) later, when not in the vicinity of the object, the user selects a suitable interaction stub stored on the handheld to trigger a remote interaction with the augmented item (cf. Fig. 5). The interaction stubs are the key mechanism for establishing the remote communication link. They consist of a human-readable name for a smart object, a set of commands that can be executed by it, and its address. In our current implementation, mobile phones serve as handhelds and the actual communication with a smart object takes place by exchanging SMS messages. Here, an interaction stub is composed of a phone book entry for the smart object and an SMS template. The phone book entry indicates the human-readable name and the object's address, which in this case is a telephone number. The SMS template
contains a range of predefined commands that can be activated and sent to the smart object (cf. Fig. 6). We illustrate this approach with an office room as an example of a (rather large) smart object. The "smart office" knows who is currently working in it and what the noise level inside is.
Fig. 6. Interaction stubs transmitted from smart objects to a mobile phone (phone book entries (1), SMS templates (2)), an edited SMS template with activated command (3), and the corresponding reply from a remote smart object (4).
A BTnode equipped with several sensors, such as a microphone, is placed in the office and provides information about the noise level inside (cf. [1] for a description of the sensor boards used). Furthermore, according to the concept of weak identification, we can infer from a handheld's presence who is currently in the room and utilize the handheld as a weak user identifier. Given these capabilities, the smart office can keep track of entering and leaving persons, maintain a short history of events, and derive its current situational context (e.g., an ongoing meeting). Based on this information, interaction stubs (phone book entries and SMS templates) are transmitted to a user's handheld device. For example, the person most frequently present in the office is given a special stub that allows her to remotely interact with the office after hours. Interaction stubs are transferred over a short-range wireless link between the smart object and the handheld (cf. Sect. 4). These transmissions are completely transparent to the user and are initiated by a smart object based on its current context and history information. Later, when people want to remotely interact with the smart office, they select the corresponding phone book entry and compose a new SMS message using the appropriate SMS template. The SMS message is received by a stationary access point with a GSM gateway and relayed to the corresponding smart object in range of the access point. The smart object then executes the
embedded commands and returns a message to the user’s mobile phone (cf. Fig. 6). Consecutive messages can be exchanged between user and smart object. Besides the described SMS-based approach we have also implemented a similar solution based on WAP. Direct remote interaction with a smart object requires a nearby stationary access point. The next section shows how we can get rid of this stationary gateway by using nearby handheld devices as mobile infrastructure access points.
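To make the notion of an interaction stub more concrete, the following Java sketch shows one possible representation of the data transferred to the phone. The class, its fields, and the example commands are invented for this illustration and are not taken from the authors' software.

import java.util.List;

/**
 * Illustrative representation of an interaction stub as stored on the user's phone:
 * a phone book entry (human-readable name plus the object's address, here the
 * telephone number of the GSM gateway serving the object) and an SMS template
 * listing the commands the object understands.
 */
public final class InteractionStub {

    private final String objectName;      // e.g. "Smart Office"
    private final String phoneNumber;     // address under which the object can be reached
    private final List<String> commands;  // e.g. "NOISE?", "WHO IS IN?", "LIGHTS OFF"

    public InteractionStub(String objectName, String phoneNumber, List<String> commands) {
        this.objectName = objectName;
        this.phoneNumber = phoneNumber;
        this.commands = commands;
    }

    /** The SMS template the user edits: activating a command means keeping its line. */
    public String smsTemplate() {
        return String.join("\n", commands);
    }

    public String objectName() { return objectName; }
    public String phoneNumber() { return phoneNumber; }
}

A GSM gateway receiving such a message would then match the activated command lines against the commands registered for the addressed object and relay them to it over Bluetooth, as described above.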
6
The Smart Medicine Cabinet
The smart medicine cabinet illustrates a handheld's role as mobile infrastructure access point, mobile storage medium, and user interface. It was designed to improve the drug compliance of patients with chronic diseases by reminding them to take their medicine. It also knows about its contents and can be queried remotely with a WAP-enabled mobile phone. Interaction with the information technology inside the cabinet is implicit, i.e., transparent for the patients, who might not even know that the cabinet is "smart". By using small RFID tags attached to the folding boxes and an off-the-shelf medicine cabinet equipped with a BTnode connected to an RFID reader, the information technology becomes completely invisible to the users (cf. Fig. 7). When a patient removes a certain kind of medicine she needs to take during the day, the active tag in the cabinet establishes a connection through the user's Bluetooth-enabled mobile phone to a background infrastructure service and queries it for prescription information concerning this medicine. The active tag then utilizes the user's mobile phone as a mobile storage medium and stores in it a corresponding alarm and calendar entry that are activated when the patient has to take the medicine.
Fig. 7. The smart medicine cabinet
A WAP-based interface allows patients to remotely interact with the smart medicine cabinet, or rather with its representation in the background infrastructure: the background infrastructure representative (BIRT) of the cabinet. Since the patient's mobile phone serves as the cabinet's mobile infrastructure access point, the cabinet operates in a disconnected mode whenever no mobile phone is present. It therefore requires a BIRT in the background infrastructure that represents the medicine cabinet continuously. Whenever a mobile phone is in the vicinity of the medicine cabinet and provides connectivity, the cabinet synchronizes its state with the BIRT. Remote WAP-based queries are then directed at the BIRT of the cabinet. The information displayed on WAP pages regarding the contents of the cabinet and recently taken medicine therefore reflects the state of the cabinet at the last synchronization.
Fig. 8. When the patient is in range of the cabinet, her mobile phone serves as mobile access point to the background infrastructure server (1); when not in range of the cabinet, she interacts with the background infrastructure representative of the cabinet using the handheld as user interface (2)
The following technologies have been incorporated into an ordinary medicine cabinet: (1) passive RFID tags on the folding boxes, (2) an RFID reader attached to the medicine cabinet, and (3) a BTnode that processes the information from the RFID reader and communicates via Bluetooth with a (4) mobile phone (cf. Fig. 7).
An actual use case is the following: a patient approaches the smart medicine cabinet and takes a package out of it. The active tag in the cabinet notices that a box of medicine has disappeared and connects to a background service (the BIRT of the cabinet). Here, the patient's mobile phone acts as a mobile infrastructure access point to the cellular phone network for the smart object, and no stationary access point is required near the cabinet. The Bluetooth-enabled active tag in the cabinet queries the BIRT about when the patient has to take the medicine. It then stores a corresponding calendar entry in her mobile phone that reminds the patient to take the medicine during the day. While there is a connection to the background infrastructure through the patient's mobile phone, the cabinet also synchronizes its state with that of the BIRT, which provides WAP pages based on this information. When the patient visits a pharmacist, the WAP interface can be used to query the contents of the cabinet (cf. Fig. 8). This information is a good basis for the pharmacist to decide whether another kind of medicine is compatible with that in the smart cabinet. The concept of a medicine cabinet augmented by information technology has been demonstrated previously by Wan [11], who integrated a personal computer, an LCD screen, and a broadband Internet connection into a medicine cabinet. The medicine cabinet presented as part of our work, however, was designed with the goal of leaving the cabinet practically unmodified from a user's perspective.
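As a rough sketch of the cabinet-side logic just described, the following Java fragment ties the detection of a removed folding box to the BIRT query and the calendar-entry reminder. All interfaces, names, and the record layout are hypothetical: the real cabinet logic runs on a BTnode, talks to the phone via AT commands, and reaches the BIRT through the phone's GSM connection.

import java.time.LocalTime;
import java.util.HashSet;
import java.util.Set;

public final class SmartCabinetController {

    /** Hypothetical interfaces standing in for the cabinet's building blocks. */
    interface BirtClient {                     // background infrastructure representative
        Prescription queryPrescription(String tagId);
        void reportRemoval(String tagId);
    }
    interface Phone {                          // Bluetooth link to the patient's phone
        void addCalendarEntry(LocalTime at, String text);
    }
    record Prescription(String drugName, Set<LocalTime> intakeTimes) {}

    private final BirtClient birt;
    private final Phone phone;
    private Set<String> lastInventory = new HashSet<>();

    SmartCabinetController(BirtClient birt, Phone phone) {
        this.birt = birt;
        this.phone = phone;
    }

    /** Called after each RFID scan with the tag ids currently seen in the cabinet. */
    void onInventoryScan(Set<String> currentTags) {
        for (String tagId : lastInventory) {
            if (!currentTags.contains(tagId)) {            // a folding box was taken out
                Prescription p = birt.queryPrescription(tagId);
                for (LocalTime t : p.intakeTimes()) {
                    phone.addCalendarEntry(t, "Take " + p.drugName());
                }
                birt.reportRemoval(tagId);                 // keeps the WAP view in sync
            }
        }
        lastInventory = new HashSet<>(currentTags);
    }
}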
7
Spontaneous Integration of Handhelds into Smart Environments
The inventory monitoring application presented in this section illustrates a handheld's role as remote resource provider and user interface. As remote resource provider, a handheld provides data storage capabilities and serves as a platform for executing computations on behalf of smart objects. The possibility to outsource computations to nearby handheld devices also allows us to transmit sophisticated user interfaces, which facilitate the local user interaction with smart objects. As already described in Sect. 4, a handheld's role as remote resource provider is based on a distributed tuple space implementation. In a typical pervasive computing environment, smart objects need to collaborate in order to implement an application. Such collaborating entities form a distributed tuple space as a shared data structure distributed over the participating entities (and without the use of a background infrastructure). Smart objects write data (e.g., perceived sensory data) into the tuple space, read data from it, process these data, and write the corresponding results back into the space. Thereby, the origin of data becomes transparent for participating objects when it is not explicitly coded into tuples. Consequently, and this is the major point, the location where code is executed becomes irrelevant, because the code operates on the same data basis regardless of the node in the distributed tuple space chosen for its execution. In order for a handheld to serve as resource provider for smart objects, it joins
their distributed tuple space and receives code from these objects. As mentioned previously, resource-aware tuple space operations can automatically use the memory capacity of a handheld after it has joined the tuple space. We have implemented the idea of using handhelds as remote resource providers and as user interfaces for local interactions with smart objects in a software framework called Smoblets. The main goals of Smoblets are to enable interactions with smart objects without the need for a supporting backend infrastructure, to outsource computations to handheld devices in order to save energy, and to foster the collaboration among smart objects and handheld computers.
Fig. 9. The Smoblet concept: ad hoc interaction with nearby handheld devices
The main components of a Smoblet system are (1) a set of Java classes – the actual Smoblets – stored in the program memory of an active tag (a BTnode in our case), (2) a runtime environment for the execution of Smoblets on a handheld device, (3) a mechanism to select smart objects and to transfer Smoblets from smart objects to handheld devices, and (4) a distributed tuple space [5] implementation that serves as a shared content-based data structure for participating entities (cf. Fig. 9). It is important to note that smart objects themselves cannot execute Java code; their program memory merely serves as storage medium for the code. Java classes can only be executed on nearby handheld devices, which communicate with the smart objects using a distributed tuple space abstraction. A Smoblet transmission can either be initiated automatically by smart objects or manually triggered by human users. To manually initiate the transmission of Smoblets from a smart object to a handheld device, the user selects a smart object for interaction by means of a small program on her handheld. The device address of the object is then used to establish a connection and to retrieve the Java classes from the smart object.
Together with the Smoblets, their functionality is moved, and therefore outsourced, from a smart object to a handheld device. Because of the distributed tuple space, the actual source of data becomes transparent to the Smoblets. Although they are executed on the handheld, they can operate on the data provided by the object they originate from and perform computations on its behalf that were not feasible on the object itself because of its limited resources. Outsourcing computations from smart objects and using handhelds as remote resource providers is the core motivation for the Smoblet idea. We have created a Java framework that simplifies the development of Smoblets. It provides methods to access data stored in a distributed tuple space, which can be used to transparently access remote sensor data of smart objects via Bluetooth. The distributed tuple space implementation provides a convenient high-level communication abstraction to the application programmer, who does not have to care about low-level communication issues. The tuple space also detaches Smoblets from the particular communication technology used by the smart objects.
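To give an impression of what a Smoblet might look like, the following sketch reuses the hypothetical Tuple and DistributedTupleSpace types introduced in Sect. 4. The class name, the tuple layout, and the use of acceleration samples are illustrative assumptions and do not reproduce the authors' framework.

/**
 * A hypothetical Smoblet: Java code stored in a BTnode's program memory and
 * executed by the Smoblet runtime on a nearby handheld after the transfer.
 */
public final class VibrationAnalysisSmoblet implements Runnable {

    private final DistributedTupleSpace space;   // joined by the handheld on arrival
    private final String originId;               // id of the smart object the code came from

    public VibrationAnalysisSmoblet(DistributedTupleSpace space, String originId) {
        this.space = space;
        this.originId = originId;
    }

    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            // Raw samples are written into the space by the BTnode; the
            // energy-hungry analysis runs here, on the handheld's behalf.
            Tuple sample = space.take(new Tuple("acceleration", originId, null));
            double magnitude = ((Number) sample.field(2)).doubleValue();
            if (magnitude > 2.0) {                              // arbitrary example threshold
                space.write(new Tuple("alert", originId, "possible damage"));
            }
        }
    }
}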
Fig. 10. A typical Smoblet interaction – after a user has searched for smart objects (1), Smoblets are transferred to a handheld device and offer a small user interface (2), which allows users to graphically interact with active tags and to access data shared through the distributed tuple space (3)
To illustrate the idea of Smoblets, we have implemented an inventory monitoring application (cf. Fig. 10). Here, BTnodes together with sensor boards are attached to expensive products, in our case a video cassette recorder (VCR), but it could also be a bottle of expensive wine, a book, etc., in order to notify the owner or another person when a product is being stolen or damaged. Smoblets stored on the smart video cassette recorder allow a user to customize the behaviour of the product. The scenario is as follows: when nearby, a person selects a smart object by using the SmobletFinder application, which triggers the transmission of the
corresponding Java code from the selected object to her handheld device (cf. Fig. 10). The code contains a small user interface for adapting the behaviour of the smart object. Here, a user can specify a telephone number and associate messages with certain situations the product can be in. For example, in case of damage a notification message must be sent to a nearby repair service. The user input (e.g., the telephone number and the message text entered into the user interface) is embedded into a tuple and written into the underlying distributed tuple space. As the smart object that provided the user interface is also a member of the tuple space, it can access the information input by the user. In our implementation, the smart object registers a callback on tuples that represent relevant user input. When a corresponding tuple is written into the space, the smart object is automatically notified by the tuple space implementation. Therefore, the smart object can immediately react to the user input and adapt its behaviour accordingly. When the user closes the application, the handheld leaves the distributed tuple space.
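The handheld side of this exchange could then look as follows, again in terms of the hypothetical tuple space types from Sect. 4; the tuple layout and the handler name are invented for the example.

/**
 * Hypothetical handler invoked when the user presses "Save" in the user
 * interface that was downloaded from the monitored product.
 */
final class SmobletUiHandler {

    private final DistributedTupleSpace space;
    private final String objectId;            // id of the smart object the UI came from

    SmobletUiHandler(DistributedTupleSpace space, String objectId) {
        this.space = space;
        this.objectId = objectId;
    }

    void onSave(String phoneNumber, String messageText) {
        // The product's BTnode has registered a callback on ("config", objectId, ...)
        // tuples, so writing the tuple is all that is needed to reconfigure it.
        space.write(new Tuple("config", objectId, phoneNumber, messageText));
    }
}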
8
Conclusion
This paper formulated and supported the hypothesis that smart objects can provide increasingly sophisticated services to users in smart environments when they are able to exploit the capabilities of nearby handheld devices in an ad hoc fashion. We identified six roles that describe how smart objects can spontaneously make use of nearby handhelds: (1) as mobile infrastructure access point, (2) as user interface, (3) as remote sensor, (4) as mobile storage medium, (5) as remote resource provider, and as (6) weak user identifier. We then presented an implementation of these roles tailored towards everyday objects which have been augmented with Bluetooth-enabled sensor nodes. Finally, three applications – a remote interaction scenario, a smart medicine cabinet, and an inventory monitoring application – illustrated some of the usage scenarios that become feasible when smart objects are able to spontaneously collaborate with nearby handheld computers. Table 2 summarizes the roles of handhelds in the presented applications. Table 2. The roles of handhelds in the applications presented in this paper
Application              Handheld's role
Remote interaction       Mobile storage medium, user interface, weak user identifier
Smart Medicine Cabinet   Mobile infrastructure access point, mobile storage medium, user interface
Inventory monitoring     Remote resource provider, user interface
Considering the presented applications, the basic function of handheld devices is to mediate between smart objects and infrastructure services, between smart objects and their users, and among smart objects themselves. For example, in the Smart Medicine Cabinet application, a patient's mobile phone serves both as mobile infrastructure access point (i.e., it enables smart objects to communicate with background infrastructure services) and as primary user interface (i.e., it facilitates the user interaction with smart objects). As a remote resource provider, a handheld provides data storage capabilities and serves as a platform for outsourcing computations from smart objects. By providing resources for smart objects, handhelds enrich smart objects' computational abilities and their interaction with each other.
References
1. M. Beigl and H. W. Gellersen. Smart-Its: An Embedded Platform for Smart Objects. In Smart Objects Conference (SOC 2003), Grenoble, France, May 2003.
2. M. Beigl, H. W. Gellersen, and A. Schmidt. MediaCups: Experience with Design and Use of Computer-Augmented Everyday Objects. Computer Networks, Special Issue on Pervasive Computing, 25(4):401–409, March 2001.
3. J. Beutel, O. Kasten, F. Mattern, K. Roemer, F. Siegemund, and L. Thiele. Prototyping Sensor Network Applications with BTnodes. January 2004. Accepted for publication, IEEE European Workshop on Wireless Sensor Networks (EWSN 04).
4. N. Davies, S. Wade, A. Friday, and G. Blair. Limbo: A Tuple Space Based Platform for Adaptive Mobile Applications. In Proceedings of the International Conference on Open Distributed Processing/Distributed Platforms (ICODP/ICDP '97), pages 291–302, Toronto, Canada, May 1997.
5. D. Gelernter. Generative Communication in Linda. ACM Transactions on Programming Languages and Systems, 7(1):80–112, January 1985.
6. H. W. Gellersen, A. Schmidt, and M. Beigl. Multi-Sensor Context-Awareness in Mobile Devices and Smart Artifacts. Mobile Networks and Applications (MONET), 7(5):341–351, October 2002.
7. S. Hartwig, J.-P. Strömann, and P. Resch. Wireless Microservers. Pervasive Computing, 1(2):58–66, 2002.
8. T. Kindberg et al. People, Places, Things: Web Presence for the Real World. In WMCSA 2000, Monterey, USA, December 2000.
9. A. Schmidt, F. Siegemund, M. Beigl, S. Antifakos, F. Michahelles, and H. W. Gellersen. Mobile Ad-hoc Communication Issues in Ubiquitous Computing. In Personal Wireless Communication (PWC 03), September 2003.
10. F. Siegemund and C. Floerkemeier. Interaction in Pervasive Computing Settings using Bluetooth-enabled Active Tags and Passive RFID Technology together with Mobile Phones. In IEEE Intl. Conference on Pervasive Computing and Communications (PerCom 2003), pages 378–387, March 2003.
11. D. Wan. Magic Medicine Cabinet: A Situated Portal for Consumer Healthcare. In 1st Intl. Symposium on Handheld and Ubiquitous Computing (HUC '99), Karlsruhe, Germany, September 1999.
12. R. Want, K. Fishkin, A. Gujar, and B. Harrison. Bridging Physical and Virtual Worlds with Electronic Tags. In ACM Conference on Human Factors in Computing Systems (CHI 99), Pittsburgh, USA, May 1999.
13. M. Weiser and J. S. Brown. The Coming Age of Calm Technology, October 1996.
Extending the MVC Design Pattern towards a Task-Oriented Development Approach for Pervasive Computing Applications
Patrick Sauter1, Gabriel Vögler2, Günther Specht1, and Thomas Flor2
1 Universität Ulm, Fakultät für Informatik, Abt. Datenbanken und Informationssysteme, 89069 Ulm, Germany
{patrick.sauter, specht}@informatik.uni-ulm.de
2 DaimlerChrysler, Research & Technology, Software Architectures, Postfach 2360, 89013 Ulm, Germany
{gabriel.voegler, thomas.flor}@daimlerchrysler.com
Abstract. This paper addresses the implementation of pervasive Java Web applications using a development approach that is based on the Model-View-Controller (MVC) design pattern. We combine the MVC methodology with a hierarchical task-based state transition model in order to achieve the distinction between the task state and the view state of an application. More precisely, we propose to add a device-independent TaskStateBean and a device-specific ViewStateBean for each task state as an extension to the J2EE Service to Worker design pattern. Furthermore, we suggest representing the task state and view state transition models as finite state automata in two sets of XML files.
1
Introduction
The use of the Model-View-Controller design pattern (MVC, cf. [1]) is very common in user interface development. Yet, an important practical goal of pervasive application development, namely maximizing the number of supported devices while minimizing code redundancy, is not necessarily achieved by employing an arbitrary MVC-based development methodology. In particular, a typical problem of pervasive application development is to provide client-adapted user interfaces [2]. For example, a dialog that fits onto a single Web page on a PC (e.g. a complex form) must be fragmented into several sub-dialogs on a device with a small screen size (e.g. a PDA or a cellphone). MVC-based approaches allow supporting such adaptations without rewriting the entire user interface code. A result of the decomposition of the architecture into model, view, and controller is that device-independent and device-specific code is decoupled into separate components. In order to support an additional device, only the view (and for more complex applications sometimes the controller) must be rewritten or at least adapted.
However, this might lead to many "similar" view and controller components with partial code redundancy. A widely used MVC-based pattern for J2EE applications is the Service to Worker design pattern (cf. [3]). It covers many aspects of practical interest when implementing a single-client (mainly desktop browser) Java-based Web application. Although it does suggest using an explicit state transition model that is stored in an XML file, this model does not fully satisfy the needs of pervasive applications with differing screen flows on different devices. Only global transitions such as global JSP state transitions (e.g. "switch from searchmask.jsp to resultlist.jsp") are specified, so a device with a different screen flow would require its very own screen flow model and thus probably a different set of JSP files. It is therefore necessary to identify similarities between the screen flow models of the devices and introduce a more abstract task-based model (as described in [4]) that all devices have in common. To take on this issue, we suggest using a two-stage hierarchical model of JavaBeans to store information separately about the state of the (device-independent) task flow and the state of the (device-specific) view flow of the application. This paper extends the Service to Worker design pattern in order to support multi-device Web applications by introducing a task-based two-stage hierarchical state transition model. The rest of the paper is organized as follows: in section 2, existing work in the area of UI development for pervasive computing is discussed. Section 3 presents the task-based development methodology, describes a task-based implementation algorithm for pervasive Web applications, and gives a suitable definition for the term "task". In section 4, we take a look at the Service to Worker design pattern from the viewpoint of pervasive computing. We then introduce an extension and describe an implementation approach for pervasive applications in greater detail, illustrated by several UML diagrams and code examples. In section 5, we discuss the additional design-time and run-time benefits of the separation between an application's task state and its view state; more specifically, we consider the possibilities to facilitate the development process and the ability to change devices "on the fly".
2
Related Work
The specific requirements of pervasive applications have been widely discussed in [2, 4, 5, 6, 7]. In particular, Banavar et al. [4] describe a programming model that strictly treats task logic and user interaction separately. They suggest starting with a superior task-based model of the program structure that covers the user's abstract interaction and the application logic, and then creating a subordinate navigation model that covers the flow of the view elements. At present, this model-based methodology is not yet supported by major programming tools or design patterns. Our suggested development strategy is fundamentally based on this task-driven development methodology.
There are several pervasive computing projects that aim at devising a high-level UI design language for abstract user interaction (cf. [8, 9] as well as IBM's PIMA project [http://www.research.ibm.com/PIMA/]). All of these approaches proceed similarly: the modeling phase of the abstract user interaction in the respective language is followed by a semi-automatic generation of the device-specific code. Although our paper describes a less automated development process, the two-stage hierarchical state transition model presented in the subsequent section could well be used by any of these languages to support the advanced run-time characteristics described later in this paper. Furthermore, the two JavaBeans specified in section 4 could also be created by the code generator of the respective high-level language. In general, the contribution of this paper is the combination of the well-proven MVC methodology with the separation of an application's task state and its view state. The result is a practical task-based implementation guideline for pervasive Java-based Web applications.
3
The Task-Based Development Approach
As a general approach for developing pervasive applications with multi-device capabilities, Banavar et al. [4] suggest modeling the task logic prior to the user interaction (i.e. the screen flow). The task flow of an application, as we define it, must, in contrast to its screen flow and without loss of generality, be common to all devices. We therefore need to redefine, or at least narrow, the definition of the term task so that it satisfies our requirements. A task, as defined by Bergman et al. [10], is a unit of work to be performed by the user. This definition, however, does not specify the granularity of a task in comparison with a subtask. A task, as we use it, must comply with the following requirements:
• It must not be too fine-grained, such that a step in a subsequent task could be part of the current task without affecting the functionality of the application. For example, all text fields of a search mask that a user can fill in without requiring any additional system interaction must necessarily belong to the same task.
• It must not be too coarse-grained, such that there exist two steps within a single task with one step requiring input data that is to be produced as the output of another step within the same task. For example, the search mask and the result list of a search engine query ought to belong to separate tasks.
In other words, a task must consist of steps which logically belong together from both the user's and the system's point of view. The purpose of this definition is that a task could well be displayed on a single view, i.e. on a single Web page.
Example
The database application discussed in this section is a typical example of a Web-based intranet application: a company wants to make its corporate employee directory (cf. Figure 1) accessible from various client platforms, e.g. a typical desktop browser and
a PDA (with an HTML browser). On a desktop PC, due to its ample screen size, the application might consist of only three Web pages: the search mask, the result list, and an additional, more detailed employee profile page. Although it might be only a rule of thumb, in this example each of the three desktop browser pages should represent exactly one task according to the preceding definition: the entire search mask as the entry point, the result list displaying the results of the directory query, and the detailed employee profile page displaying the results of a second query for the selected employee with a greater number of attributes from the directory. That is, task flow and screen flow are identical on the desktop browser.
Fig. 1. The first page of this example Web application as it might be rendered on a desktop browser and on a PDA browser, where only the most frequently used input fields are displayed at first.
On the PDA, however, task flow and screen flow might differ because of its limited screen size. For example, it might be sensible to display only a reduced search mask as an entry point, with a subset of only the most frequently used input fields (e.g. only first and last name as well as the department code) and an additional "Expand search mask" button that brings up all search fields. A click on the "Expand search mask" button would only alter the view state of the application, not its task state. In the expanded state, the user would be expected to scroll down in order to find all search fields. We shall refer to this example in subsequent sections.
A Task-Based Design Cycle for Pervasive Computing Applications
The idea of a task-driven development methodology must now be turned into a more concrete and practical development guideline. As an implication of our definition of the term "task", in combination with the task-based development approach, we suggest that the development of a pervasive Web application should proceed as follows:
1. Model the task flow of the application in order to express user requirements, e.g. using a UML activity diagram.
2. Since each of these tasks should comply with the above definition, each of them is to be represented by a single view component. That is, create a view component (e.g. an empty JSP file) for each device and for each task.
3. For each target device and for each task, check whether the capabilities of the device allow a one-to-one relationship between a task and a view. In other words, determine whether it is feasible and sensible to display the task on the particular device as a single view component.
4. If this is the case for a particular view component, the view flow and the task flow are identical for this page and no changes have to be made. Go to step 7.
5. If this is not the case, the task has to be assigned several view states. That is, parts of the task have to be grouped on separate screens or maybe even left out completely, e.g. on a cellphone with an extremely restricted screen size and very little memory.
6. Use the additionally available view state information to enable the user to switch from one view of the same task to another. In a J2EE Web application environment, this can be accomplished by adding embedded Java code to a given JSP file (which represents the task state). This code retrieves the view state information and thereby decides which segment of the JSP file (e.g. either the "reduced" or the "expanded" part) to display. Furthermore, view state change buttons must be provided, e.g. by inserting an "Expand search mask" button; a JSP sketch of this mechanism is given after Fig. 2.
7. Optimize the device-specific presentation, for example by adjusting font sizes or defining CSS files.
The advantages of this approach are the clear-cut separation between global task state and view state, both at design time (a Web design team could create complex view flow structures without affecting the team implementing task flow and business logic) and at run time (as we will show in section 5). See Figure 2 for how this algorithm might transform into application states and state transitions for the previously mentioned employee search example.
[Figure 2: a state diagram with the task states searchmask.jsp, resultlist.jsp and employeeprofile.jsp. The task state searchmask.jsp contains the view states showReducedSearchMask and showEntireSearchMask with the view state transitions expand and reduce; task state transitions include startSearch and backToSearchMask.]
Fig. 2. A possible result of the algorithm for the previously mentioned corporate directory application example on a PDA device. The labels of the arrows correspond to action names. Each task state is represented by a single JSP file.
4 Design Patterns for the Implementation
After introducing our task-based design cycle for pervasive computing applications, we now discuss a design pattern for its realization within a J2EE environment. In order to add multi-device capabilities, we take up the idea of separating task logic and user interaction and therefore extend a well-known design pattern of the J2EE presentation tier [3].

The Service to Worker Pattern

The Service to Worker pattern is a design pattern that primarily addresses the development of single-client Web applications. In particular, it decouples business logic from the view components and allows explicit state transition handling. We therefore consider it suitable for the type of Web applications discussed in this paper, i.e. typical thin-client intranet applications. Although many of the following concepts can also be used in combination with other Core J2EE Patterns [http://java.sun.com/blueprints/corej2eepatterns/] or the more general MVC design pattern, the Service to Worker pattern provides the best basis for our extensions to facilitate pervasive computing application development. The Service to Worker pattern is a combination of the Front Controller and the View Helper patterns in the J2EE presentation tier. For a more rigorous discussion of the Service to Worker J2EE design pattern, cf. [3] and [http://java.sun.com/j2ee/]. Its main feature added to the MVC architecture is that the state transition logic is encapsulated in a dedicated dispatcher class rather than in the controller. The dispatcher
accesses an XML file that defines all existing states and associates user actions with state transitions. In contrast to existing MVC-based frameworks for pervasive computing Web applications (e.g. the MVC Portlets of IBM WebSphere Everyplace Access Portal Server 4.2.1), the concept described in this paper does not use multiple device-specific controller classes. In our approach, the controller has only two responsibilities: to pass every request on to the dispatcher and, when required, to call the appropriate business service. These business services are only called when the application's task state changes, and since all devices have the underlying task state transition model in common, a single device-independent controller class is sufficient.

Extension of the Service to Worker Pattern

The Service to Worker pattern does not yet fully satisfy the needs of pervasive computing development. As mentioned in the previous subsection, Sun Microsystems' specification of the Service to Worker pattern introduces a dispatcher class to handle state transitions. However, it neither prescribes how to model the state transition logic (inside the XML file), nor does it include a description of the runtime component which realizes this model. Furthermore, since the Service to Worker pattern is designed for single-device access, the issue of device-specific views has to be addressed. Even the algorithm provided in Section 3 does not give any implementation details. We take on these issues now by describing an implementation approach in greater technical detail.

For the purpose of modeling the device-independent state transition logic we use an XML file named mappings.xml (as suggested by [http://java.sun.com/blueprints/patterns/FrontController.html]). For the device-specific view logic we utilize an additional file named viewmappings.xml for each supported device. Both describe a finite state automaton, defining states, actions and transitions. We suggest implementing these automata as two dedicated JavaBeans classes: TaskStateBean and ViewStateBean (cf. Figure 3). Both are initialized with the data of the XML files described above. At runtime, they provide a getState() and a doAction(String) method. The former returns the current state; the latter performs the assigned action and either returns the new state or throws an exception if the action is illegal. Additionally, the TaskStateBean can reference a device-specific ViewStateBean for each task state if the view state of a previously completed task has to be retained. For example, if the user left the search mask in the expanded state and clicks on "Back to search mask", it can again be rendered in the expanded state when the user returns. Device-specific renderings are achieved by view components (JSP files) that are particular to exactly one device. There is a single JSP file for every task/device combination which contains view-specific markup and therefore must also include code for accessing the ViewStateBean information. All JSP files may be placed together with the appropriate viewmappings.xml in a device-specific subfolder (e.g. /pda) in order to establish straightforward naming conventions.
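To make the two beans more tangible, the following Java fragment is a minimal sketch (not the authors' actual implementation) of what a ViewStateBean built on such an automaton might look like; the map-based constructor is an assumption made for the example.

  import java.util.Map;

  // Hypothetical sketch of a view state automaton bean as described above; the
  // transition map would be populated from the device-specific viewmappings.xml.
  public class ViewStateBean {

      // transitions.get(state).get(action) yields the follow-up view state
      private final Map<String, Map<String, String>> transitions;
      private String currentState;

      public ViewStateBean(String defaultViewState,
                           Map<String, Map<String, String>> transitions) {
          this.currentState = defaultViewState;
          this.transitions = transitions;
      }

      // returns the current view state, e.g. "showReducedSearchMask"
      public String getState() {
          return currentState;
      }

      // performs an action such as "expand"; returns the new view state or throws
      // an exception if the action is not defined for the current state
      public String doAction(String action) {
          Map<String, String> outgoing = transitions.get(currentState);
          if (outgoing == null || !outgoing.containsKey(action)) {
              throw new IllegalStateException(
                  "Action " + action + " is illegal in state " + currentState);
          }
          currentState = outgoing.get(action);
          return currentState;
      }
  }

A TaskStateBean could follow the same structure, additionally holding a reference to a device-specific ViewStateBean per task state as described above.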
[Figure 3: class diagram showing Client, Controller, Dispatcher, View, Helper, TaskStateBean and ViewStateBean with 1..* associations between them.]
Fig. 3. The class diagram of the extended Service to Worker design pattern. The added elements are the two JavaBeans classes.
After setting up the general structure, we now discuss the component interaction by means of Figures 4 and 5, which refer to the example described above. Consider the situation in which a PDA user fills in the search mask of the corporate employee directory example. In this scenario, the TaskStateBean is in the state "searchmask.jsp" and references the PDA version of the ViewStateBean, which initially is in the state "showReducedSearchMask". Next, the user clicks the "Start search" button. This leads to the creation of an HTTP GET request with the action parameter "startSearch", which is then sent to the controller (e.g. http://host/controller?action=startSearch). At this point, the request object is enriched with information about the delivery context by the pervasive computing environment. For example, the IBM WebSphere Everyplace Access Server looks up the User-Agent string of the HTTP header in its device database and maps it to one of its pre-defined device classes (e.g. "Pocket Internet Explorer" is recognized as a "PDA" device). An attribute indicating the device class is then added to the request object. The controller reads the action value ("startSearch"), calls some helper class to handle the query (i.e. invoke an LDAP service) and then calls the dispatcher to display the results. Next, the dispatcher calls doAction("startSearch") on the TaskStateBean in order to determine which JSP file to display next. In this case, the action causes a task state transition, so the return value is "resultlist.jsp". Additionally, the TaskStateBean initializes a ViewStateBean for the new task state and sets it to the default view state. According to the device class detected earlier, the dispatcher now displays "resultlist.jsp" of the /pda directory. Before generating the final markup, the embedded code of the JSP file queries the ViewStateBean to decide which view elements to display. Figure 4 gives an impression of how the internal handling sequence of the application for the "Start search" scenario might look.
[Figure 4: sequence diagram involving Client, Controller, Dispatcher, TaskStateBean, ViewStateBean, Helper, Business Service, ValueBean and View, showing the delegation of the request, the "startSearch" action, the retrieval of content and the final get state / get property calls.]
Fig. 4. The internal sequence for handling a click on the "Start search" button, i.e. a typical task state changing action.
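As a deliberately simplified, self-contained illustration of the dispatch decision in this sequence (in the real pattern the transition would of course depend on the current task state as defined in mappings.xml), a sketch might look as follows; the class and method names are assumptions.

  import java.util.Map;

  // Hypothetical sketch of the dispatcher decision for the "startSearch" scenario.
  // The map stands in for the task state transitions loaded from mappings.xml.
  public class DispatchExample {

      // returns the device-specific JSP path for an incoming action
      static String dispatch(String action, String deviceClass,
                             Map<String, String> taskTransitions,
                             String currentTaskState) {
          // a task state changing action such as "startSearch" yields a new JSP,
          // otherwise the current JSP is rendered again with a changed view state
          String nextJsp = taskTransitions.getOrDefault(action, currentTaskState);
          return "/" + deviceClass + "/" + nextJsp;
      }

      public static void main(String[] args) {
          Map<String, String> taskTransitions =
                  Map.of("startSearch", "resultlist.jsp",
                         "backToSearchMask", "searchmask.jsp");
          // a PDA user on the search mask clicks "Start search"
          System.out.println(dispatch("startSearch", "pda",
                  taskTransitions, "searchmask.jsp"));   // -> /pda/resultlist.jsp
      }
  }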
The internal object interaction triggered by a click on the "Expand search mask" button (instead of "Start search"), however, looks fundamentally different. Figure 5 (view state change) differs from Figure 4 (task state change) in that no business service has to be accessed.
[Figure 5: sequence diagram involving Client, Controller, Dispatcher, View, TaskStateBean, ViewStateBean and ValueBean; the "reduce" action is delegated to the ViewStateBean, the view is reloaded, and get state / get property return "showReducedSearchMask".]
Fig. 5. A typical view state changing action, such as a click on the "Reduce search mask" button on the PDA.
In the situation of Figure 5, the submitted action ("reduce") is not a task state changing action, so the TaskStateBean delegates the action handling to the ViewStateBean. After the ViewStateBean has changed its state, the TaskStateBean returns the old task state ("searchmask.jsp"). Consequently, the dispatcher calls the same JSP file ("searchmask.jsp"). But this time, the getState() method inside the embedded JSP code returns a different view state, which causes different markup to be displayed.
Figures 6 and 7 give an impression of how the previously mentioned search mask of the employee directory example might be put into XML and JSP code.

[Figure 6: XML listing defining, for the task state searchmask.jsp, the default view state (<defaultviewstate>showReducedSearchMask), the view states showReducedSearchMask and showEntireSearchMask, and the actions expand and reduce that switch (<switchTo>) between them.]

Fig. 6. How the /pda/viewmappings.xml file might look; cf. also Fig. 2.
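The following is merely one plausible reading of how such a viewmappings.xml entry could be structured; the element names <taskstate>, <viewstate> and <action> are assumptions, since the paper leaves the XML schemas unspecified.

  <taskstate name="searchmask.jsp">
    <defaultviewstate>showReducedSearchMask</defaultviewstate>
    <viewstate name="showReducedSearchMask">
      <action name="expand">
        <switchTo>showEntireSearchMask</switchTo>
      </action>
      ...
    </viewstate>
    <viewstate name="showEntireSearchMask">
      <action name="reduce">
        <switchTo>showReducedSearchMask</switchTo>
      </action>
      ...
    </viewstate>
  </taskstate>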
  ...
  <% if (myViewStateBean.getState().equals("showEntireSearchMask")) { %>
    ... Reduce Mask
  <% } else if (myViewStateBean.getState().equals("showReducedSearchMask")) { %>
    ... Expand Mask
  <% } %>
  ...

Fig. 7. The corresponding JSP file /pda/searchmask.jsp.
Notice that addressing actions is a mere implementation detail and depends on the set of available technologies of the Web application server. For a JSP-based portal
environment, for example, the Java Specification Request (JSR) 168 (cf. [http://www.jcp.org/en/jsr/detail?id=168] and Figure 8) defines a dedicated set of JSP tags for generating dynamic Uniform Resource Identifiers (URIs) which, when requested, trigger the invocation of an action handling method.
  <a href="<portlet:actionURL>
             <portlet:param name="action" value="expand"/>
           </portlet:actionURL>">Expand Search Mask</a>

Fig. 8. Sample JSP code according to the JSR 168 (version 1.0) specification.
In general, this section points out that the separation of task state and view state affects the implementation of Web applications in two ways: both the dynamic information about the current state (which in this case is represented by JavaBeans objects) and the static state transition information (which we suggested storing in XML files) have to be subdivided into their task state and view state components.
5 Additional Benefits of the Separation between Task State and View State
The previous section describes a general solution for designing task-based pervasive computing applications. We now discuss some additional benefits of our approach, which provides a solution to some typical pervasive computing issues.

Changing Devices "on the Fly"

The strict separation of task state and view state allows the implementation of a rather advanced feature that moves the behavior of the application closer to the vision of pervasive and ubiquitous computing, namely that applications should seamlessly follow the user from one device to another [6, 7]. Consider the situation that a user of a Web application submits a query to the employee search engine of his company on his PDA and alters the view state of the result list in a way that is particular to this device. As he enters his office, he does not want to enter the query again on his desktop PC, so he logs in, and provided the task state of the previous session and its value beans have been left unchanged, he is shown the result list in a desktop PC browser format, and not the search mask. This behavior could obviously not be realized if the two devices had different task models, e.g. different JSP file names, because even if the state of the application were handed over to the other device, it would not be guaranteed that this state (e.g. "somePdaSpecificResultList.jsp") is still defined for the new device (e.g. if the desktop browser only features a single "resultlist.jsp" file). The separation of task state and view state is the key to supporting such behavior. When using the method described in this paper, it is ensured that the two devices have their task state models in common. So, a dynamic device change could be implemented by simply handing over the
"old" TaskStateBean to the "new" dispatcher class of the device. More precisely, the login procedure of the application would have to accomplish this task, and the "old" ViewStateBean could either be discarded or retained as a separate object in order to support the use of a single Web application with multiple devices in succession. That is, the implementation effort would merely be limited to providing a centralized management of TaskStateBeans (and value beans), while the underlying application and business logic architecture would not have to be changed at all.

Using the XML Task Definition for Code Generation

All the information required for both task state and view state transitions is now completely available in the XML files. This allows a partial automation of the development process as defined in Section 3. One way to utilize this information is to manually create one JSP file for each task state as specified in the mappings.xml file (e.g. including typical HTML or WML headers) and to create one subsection inside the appropriate JSP file for each view state, based on the information in the viewmappings.xml file. (This is essentially the functionality of steps 2 and 6 of the algorithm in Section 3.) For small projects, manual execution of the algorithm might be sufficient. However, the task model included in the XML files described above is a good starting point for code generation, which could be helpful for larger projects with changing requirements. Furthermore, since the task state transition information is also included in the mappings.xml file, the development environment could offer the developer the appropriate set of actions to be used in the respective JSP file. For example, based on the information given in the mappings.xml file, the development environment might suggest that the searchmask.jsp file contain a link or a button with the action "startSearch". The viewmappings.xml data could be used in a similar way. In the example of the finite view state automaton of the search mask (cf. Figure 2), the development environment could automatically generate navigation elements (e.g. a navigation bar) inside the search mask, as sketched below. This navigation element would offer the user the functionality to switch from one view to another, e.g. from the reduced to the expanded state. The main benefit of this approach is, on the one hand, less implementation effort and, on the other hand, assistance in maintaining a consistent user interface, because changes made to the XML files are cascaded down to the JSP files.
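The following Java fragment is only a sketch of how such a generator might derive view-switching navigation links from the view state automaton; the map standing in for the parsed viewmappings.xml data and the emitted markup are assumptions.

  import java.util.List;
  import java.util.Map;

  // Hypothetical generator: emits one link per action defined for a view state,
  // so that the navigation bar always matches the automaton in viewmappings.xml.
  public class NavigationGenerator {

      // actionsPerViewState would be filled by parsing viewmappings.xml
      static String navigationBar(String viewState,
                                  Map<String, List<String>> actionsPerViewState) {
          StringBuilder html = new StringBuilder();
          for (String action : actionsPerViewState.getOrDefault(viewState, List.of())) {
              html.append("<a href=\"controller?action=").append(action)
                  .append("\">").append(action).append("</a> ");
          }
          return html.toString();
      }

      public static void main(String[] args) {
          Map<String, List<String>> actions = Map.of(
                  "showReducedSearchMask", List.of("expand"),
                  "showEntireSearchMask", List.of("reduce"));
          // prints a link with the action "expand" for the reduced search mask
          System.out.println(navigationBar("showReducedSearchMask", actions));
      }
  }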
6 Conclusions
This paper describes a development approach for Web applications that support multiple devices in a J2EE environment. The separation of task state and view state allows a more structured development approach and advanced run-time characteristics. In particular, device changes on the fly can be supported with little effort by retaining the device-independent task state information. Adding support for another device to a given Web application merely involves the presentation and view state
layer. No changes have to be made to the task state transition model, while developers are still offered full control of design and usability (e.g. grouping and ordering) issues. However, this approach does not work for applications with device-specific task flow, e.g. if an internet shop requires a cellphone customer to sign another agreement before an order can be legally accepted. Furthermore, the XML file schemas used in the example are only suggestions and have not yet been formally specified.
References

1. E. Gamma, R. Helm, R. Johnson, J. Vlissides: Design Patterns: Elements of Reusable Object-Oriented Software. Addison-Wesley, 1995.
2. G. Voegler, T. Flor: Die integrierte Client-Entwicklung für PC und Mobile Endgeräte – am Beispiel der Portaltechnologie. In: K. Dittrich, W. König, A. Oberweis, K. Rannenberg, W. Wahlster (eds.): Informatik 2003 – Innovative Informatikanwendungen (Band 1). GI-Edition – Lecture Notes in Informatics (LNI), P-34. Köllen Verlag, Bonn, 2003, 207-211.
3. Sun Microsystems: Core J2EE Patterns. 2002. http://java.sun.com/blueprints/corej2eepatterns/Patterns/ServiceToWorker.html
4. G. Banavar, J. Beck, E. Gluzberg, J. Munson, J. Sussman, D. Zukowski: Challenges: An Application Model for Pervasive Computing. Proc. 6th ACM MOBICOM, Boston, MA, August 2000.
5. J. Burkhardt, H. Henn, S. Hepper, K. Rindtorff, T. Schäck: Pervasive Computing: Technology and Architecture of Mobile Internet Applications. Addison-Wesley, 2002.
6. U. Hansmann, L. Merk, M. Nicklous, T. Stober: Pervasive Computing Handbook. Springer, 2001.
7. G. Banavar, A. Bernstein: Software Infrastructure and Design Challenges for Ubiquitous Computing Applications. Communications of the ACM, December 2002.
8. F. Giannetti: Device Independence Web Application Framework (DIWAF). HP Labs. W3C Workshop on Device Independent Authoring Techniques, 25-26 September 2002, SAP University, St. Leon-Rot, Germany.
9. F. Paternò, C. Santoro: One Model, Many Interfaces. Proceedings of the Fourth International Conference on Computer-Aided Design of User Interfaces, pp. 143-154, Kluwer Academic Publishers, Valenciennes, May 2002.
10. L. Bergman, T. Kichkaylo, G. Banavar, J. Sussman: Pervasive Application Development and the WYSIWYG Pitfall. Lecture Notes in Computer Science, vol. 2254, pp. 157-172, May 2001.
Adaptive Workload Balancing for Storage Management Applications in Multi Node Environments Jens-Peter Akelbein and Ute Schröfel IBM Deutschland Speichersysteme GmbH, Hechtsheimer Str. 2, 55131 Mainz, Germany [email protected]
Abstract. Cluster file systems provide access for multiple computing nodes to the same data via a storage area network with the same performance as a local drive. Such clusters allow applications distributed over multiple nodes to access storage via parallel data paths. Storage management applications can also take advantage of this parallelism by distributing their data transfers over multiple nodes. This paper presents an approach for client/server oriented backup solutions in which an adaptive method distributes the workload of successive backups to a storage repository server over multiple client nodes. An outlook is given on how the method can also be applied to distribute the workload of several clients across multiple servers.
1 Introduction

For decades the capacities of hard disks have been growing fast, up to doubling each year. Currently, no end to this trend is foreseeable, and the amount of data stored in some computing centers is growing even faster. By integrating multiple hard disks in RAID arrays or in enterprise storage servers, applications are enabled to store dozens of terabytes or more. Striping data over multiple hard disks provides a high bandwidth for read and write accesses of up to several gigabytes per second. Furthermore, storage area networks and cluster file systems like IBM GPFS, SGI CXFS, or Sun SAMFS give multiple computing nodes access to the same data. Applications distributed over several cluster nodes benefit from the parallelism in the underlying layers through a high overall performance. Growing storage capacity usually leads to a larger number of files and an increased average size of data objects (files and directories). The amount of time needed to access a single data object is related to the average seek time to position the heads. During the past two decades the duration of a single seek was reduced from 70 to a few milliseconds, while the capacity of hard disk drives was multiplied by four orders of magnitude (20 MByte to 200 GByte). Therefore, the duration of scanning through a whole file system has grown over time because of the number of contained objects. This trend affects the scalability of incremental backup solutions like progressive incremental backup, which need to scan each object for changes [1]. If the scan of a file system alone takes several hours because millions of objects have to be processed, the backup window extends unacceptably. The amount of data being backed up
also grows constantly. Combining both trends leads to the conclusion that traditional incremental backups of large file systems require unacceptable processing time. File servers storing millions of files already demonstrate such limitations.

1.1 Backup Methods for Large Amounts of Data

Several methods are known to address the scalability problem of incremental backups. Backing up images of logical volumes avoids scanning objects. Nevertheless, all used data blocks in a logical volume need to be transferred. An incremental image backup removes the need to transfer all data. To provide such functionality, the file system or a component below this layer needs to memorize which blocks changed since the last backup. Current implementations of the block-oriented layers of storage devices or software do not provide such functionality. Another approach is called journal-based backup. All file system activities are monitored to create a journal containing a list of all changed objects, so no scan is needed anymore to back up the incremental changes. This facility requires integration into the operating system, e.g. into the file system code or by intercepting all file system activities. Many file systems already provide a journal which can also be used for backup purposes. Snapshot and mirroring facilities allow saving a state of a file system which will not change anymore. The original data remain online while a backup can be taken from the snapshot. Snapshot facilities reduce the offline time of the data to a few seconds. Nevertheless, a snapshot does not reduce the period of time needed for incremental backups. Parallelism in hardware and software can also be used to reduce the time of a backup by splitting the single task into parallel activities using multiple data paths. Current data protection products can use multiple processes or threads to back up data into storage repositories. For instance, scanning the file system, reading data and writing them to tape are performed concurrently, each by its own thread. Furthermore, multiple instances of the same application can be started. If multiple applications are started on different nodes of a cluster, each node can back up another part of the storage by using parallel data paths. In such a configuration a higher total throughput can be achieved compared to a single-node configuration, depending on the parallelism of the underlying storage system and the number of disk drives contained. All described methods are commonly known and used to back up data. This paper focuses on backups performed in parallel on multiple nodes (see Fig. 1). Such configurations are not easy to administer because the backup application on each node needs to be configured separately. Furthermore, no current data protection product provides autonomous workload balancing between multiple nodes; in the worst case, the system administrator has to adjust the configuration on all nodes again to achieve a new workload distribution. Unfortunately, the amount of workload cannot be easily determined and a distribution is not easy to calculate. Therefore, a solution is required which auto-configures the workload distribution in a cluster and adapts the configuration to deviations of the
workload afterwards automatically by self-optimization.

Fig. 1. Multiple backup application instances running on different nodes in a cluster allow copying different parts of a cluster file system via parallel data paths to backup media.

Section 2 gives an overview of what needs to be considered for workload balancing of backups in a cluster and how it can be integrated into existing technology. Section 3 describes the prototype solution and implementation details. Section 4 presents some results gained with the prototype and describes environments where performance improvements can be achieved. Section 5 gives an outlook on other application areas of adaptive workload balancing for data management applications in multi node environments.
2 Workload Balancing of Parallel Backups in a Cluster

If parallelism is to be applied to backups of large amounts of shared data, the workload has to be divided into a number of subtasks. All subtasks are assigned to a set of nodes within the cluster. Therefore, a distribution of the workload has to be determined so that all nodes have to back up a similar amount of data. If a separation into subtasks without intersections can be achieved, n nodes can back up the whole set of data in 1/n of the time. Subtasks can be defined in different ways. This paper presents an approach that uses the hierarchy of file systems and the data objects contained in them to define parts of the hierarchy as subtasks. Whole file systems, directories, files or even file stripes can define subtasks assigned to different nodes.

2.1 Separating the Backup Workload Adaptively

Compared to methods for dynamic workload balancing already used in network or parallel computing systems [3][4], a system to balance the workload for backups has to fulfill some other requirements. Online scheduling algorithms like the round-robin strategy assign discrete amounts of workload immediately. In contrast, backup events are discrete events. The backup itself lasts over a period of time and should fit into a given backup window. No data transfer occurs between two backups, so this time frame can be used for scheduling without an impact on the backups themselves. Therefore,
performing backups can be compared to scheduling a set of batch jobs on multiprocessor machines, where offline scheduling algorithms can be used. In contrast to scheduling batch jobs with a fixed amount of computing time, the scope of backup tasks and their duration can be varied by the algorithm which distributes the workload. A static approach of separating the workload into subtasks distributes all subtasks only once, so a new assignment will not take place automatically. Typically, the configuration C0 is set up by a system administrator who identifies file systems or parts of them as subtasks. For all backup events E the workload of the subtasks is distributed over all nodes by the same assignment C0. If the workload deviates over time, a new configuration C1 has to be determined by the system administrator to balance the workload once again. With an increasing number of participating client nodes the configuration becomes quite complex and hard to administer. In contrast to configuring all client tasks manually, a new configuration can be determined by a scheduling algorithm. An offline scheduling approach allows modifying the configuration C adaptively, while online scheduling leads to changing C dynamically during the backup. Adaptive workload balancing means that a new configuration Cn+1 is computed after each backup event En. In most cases, it is not necessary to apply a new configuration after each backup. Instead, many reconfigurations can be avoided by specifying limits for an allowed deviation; a reconfiguration is only needed when a significant change of the workload or its distribution occurs. Computing a new configuration requires information about how much workload each subtask generated in past backups. Progressive incremental backup solutions using a database already provide most of this information by storing the size of each data object and whether it was backed up or not. Nevertheless, adaptive workload balancing does not depend on how this information is stored. Such information is taken as a set of statistical data available for each backup event Ei to compute new configurations without requiring a special format. Adaptive methods change the configuration C between two backup events Ei-1 and Ei. In contrast, balancing the workload dynamically changes the configuration while the backup Ei itself is ongoing. Such an approach is very similar to workload balancing already known in other application areas. Therefore, three different approaches can be distinguished by considering the following sets of statistical data of backup events (see Fig. 2):

1. Ei: dynamic workload balancing without considering historical data
2. E0, …, Ei: dynamic workload balancing considering historical data
3. E0, …, Ei-1: adaptive workload balancing based on historical data only
An algorithm cannot compute an ideal configuration if, as in case 1, only the workload of the current backup Ei is considered, because the workload is only partly known before the backup is completed. The workload needs to be distributed dynamically while the backup takes place, leading to communication between the different
nodes to coordinate the activities. The communication overhead increases significantly as more nodes are added.
Fig. 2. Sequence of backup events E and the different workload balancing approaches with the scope of data they consider when computing a new workload distribution
Case 3 is based on historical data only, so the computation of a new configuration can take place between two backups. An ideal configuration for previous backups can be computed which might also fit subsequent backups. No additional communication is needed within the backup window. On the other hand, the dynamic approach can try to balance the workload if it deviates significantly from previous backups; an adaptive solution like case 3 will not be able to handle such situations appropriately. Therefore, case 3 is a very handy way to determine an appropriate configuration for environments where the workload of future backup events correlates well with previous backups. Case 2 combines the potential of both approaches. Solutions considering Ei are known from scheduling algorithms for CPU usage, network traffic workload balancing, and other application areas as well [5]. As no previous work was known in the literature about an approach applied to storage management for case 3, a prototype solution was built for adaptive workload balancing to evaluate its potential and drawbacks in typical customer environments [6].

2.2 Prototyping Based on Existing Technology

Backups have been performed for decades, and no one will leave data unprotected if it is identified as important or mission critical. Many backup applications on the market have been commonly known and used for years. Therefore, one important goal for a new feature is that it can be easily integrated into existing technology. The prototype presented in this paper uses the IBM Tivoli Storage Manager (TSM) as an existing client/server oriented solution for storage management to perform the backups [2][7]. The TSM server manages all storage devices as a repository where data
objects can be stored. A database contains the relation between a data object and its location on the storage media. A TSM client resides on the machine which needs to be protected by creating a backup. Data objects are sent by the client to the server via a LAN. A variety of historical data can be used as the input for a workload balancing algorithm. The following is a list of typical information available for backups of file systems or directories:

• number of objects contained in a file system or in each directory, NDirAll
• number of objects backed up in each directory (new and modified objects), NDirDelta
• the total number of bytes stored in all objects, NBAll
• bytes transferred during the backup, NBXfer
• number of objects deleted since the last backup, NDirDel
• number of subdirectories in a directory, NDirSub

The presented approach recognizes directories as the lowest level which can be assigned independently to a single client as a subtask. Files contained in a directory cannot be distinguished by the separation algorithm. One reason to limit the historical information to the directory level is the amount of data which would be created if every file had to be tracked. In addition, large environments tend to spread over thousands of directories or more. Therefore, a granularity at the directory level with a large number of individual workload measurements guarantees enough possibilities of separating the workload into equal parts. The following section shows how the functionality of existing data protection products needs to be extended to allow adaptive workload balancing. Furthermore, a modification of an existing algorithm is described which implements a method to distribute the backup operations.
3 Functionality of the Prototype

Today's data protection products in a client/server architecture focus on backing up data objects from a single client node. For instance, a TSM server assigns a node name to a client based on the host name. All data from a client node can be found in file spaces associated with the node name. If the client node backing up the data changes over time, the server needs to be able to detect that the files being stored remain the same. This is supported by the TSM client by specifying another node name instead of the host name. All clients involved in a backup of shared data need to use the same node name. Furthermore, single-node backup clients define their backup domain in their local configuration. If the backup domain is reconfigured, the backup copies of all objects no longer residing in the backup domain are expired on the TSM server and deleted later on. This should not occur for shared objects being backed up by another node in a multi node environment. In addition, the path names of the shared file systems have to be the same on all client nodes so that a single object can be identified as the same one.
3.1 Separation of a File System Tree

Backup solutions typically offer options for including and excluding file spaces, directories, and files, defining the scope of a backup domain. If the backup workload is to be distributed between a set of clients on different nodes in a cluster, a similar mechanism is required to separate a shared backup domain into different parts. Therefore, separation points are defined dividing the shared domain into subtasks (see Fig. 3). The subtasks are mutually exclusive, so no object is backed up twice. All subtasks assigned to a client define its local backup domain, and all local backup domains together sum up to the shared backup domain. A client scans its backup domain by parsing the hierarchical tree structure of file systems and directories in linear order. The separation points intercept this scan so that only the local backup domain is parsed. Two additional options, CONTINUE <path name> and PRUNE <path name>, are required to define a separation point for the TSM client. A subtask is assigned to a client using the CONTINUE option, specifying the path name of the root directory of the subtask. If the traversal through the file system reaches the next subtask, the search has to be stopped by specifying PRUNE and the path name of the directory. All clients have to prune their scan at a separation point while only one client continues its traversal. Therefore, all instances of the client get their own separation list (see an example of separation lists in Fig. 4, derived from the subtasks shown in Fig. 3). The TSM client used by the prototype supports separation lists by reading a file containing the options.

A distribution of the workload W generated by a backup of a shared backup domain can address different goals. The workload can be balanced so that all clients need the same amount of time to back up their local domain. Another goal can be to transfer the same number of bytes for each client. Both approaches lead to equal workloads Wi = W / n for each of the n client instances, regardless of whether the duration or the number of bytes is the base measurement W. By using a unitless workload value, the same algorithm allows achieving either of these goals by transforming the historical data. To balance the number of transferred bytes, this value is taken as the workload, W = NBXfer. The backup duration for a directory, TDir, can be measured directly by the client. If the duration itself cannot be determined, an approximation has to be made. For instance, a single TSM backup client allows using multiple sessions to the server for backing up a number of objects simultaneously. Such embedded parallelism within the client reduces the total backup duration while the time used for a single object stays the same or even increases. Therefore, the prototype does not consider real time measurements, but uses the historical information to compute a workload value. One assumption is that scanning a single object requires a fixed amount of time TScan. In addition, the network bandwidth BCS for all transfers between the client instance and the server is assumed to be constant, so no other application uses the same LAN connection at the same time. By applying these simplifications, the backup duration TDir can be determined as

TDir = NDirAll * TScan + NBXfer / BCS
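As an illustration only (the prototype's actual data structures are not given in the paper), the per-directory workload value of the formula above could be derived from the listed statistics roughly as follows; the class and field names are assumptions.

  // Hypothetical record of the per-directory backup statistics listed in section 2.2,
  // together with the workload value W_Dir = T_Dir derived from them.
  public class DirectoryStats {

      long nDirAll;    // number of objects contained in the directory
      long nDirDelta;  // number of new and modified objects backed up
      long nBAll;      // total number of bytes stored in all objects
      long nBXfer;     // bytes transferred during the backup
      long nDirDel;    // number of objects deleted since the last backup
      long nDirSub;    // number of subdirectories

      // T_Dir = N_DirAll * T_Scan + N_BXfer / B_CS
      // tScanSeconds: assumed fixed scan time per object,
      // bandwidthBytesPerSecond: assumed constant client/server bandwidth B_CS
      double workloadValue(double tScanSeconds, double bandwidthBytesPerSecond) {
          return nDirAll * tScanSeconds + nBXfer / bandwidthBytesPerSecond;
      }
  }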
Fig. 3. A shared backup domain divided into subtasks assigned to the local domain of three clients
Fig. 4. Separation lists for the three clients, derived from the subtasks shown in Fig. 3
The transformation results in a single workload value WDir = TDir for each backup. If more than one backup event is considered by the workload balancing algorithm, a second transformation is needed to determine a single workload value WDir. The prototype allows computing the mean value over all backup events or, by specifying a multiple of the standard deviation, a transformation using the Gaussian error distribution curve. The latter eliminates single peak-workload events from the list of values being considered. After these computations are performed, all directories have a single workload value WDir associated with them, so algorithms for separating a
weighted tree can be used. The modified TSM prototype client writes out backup protocols into files containing the statistical data needed for computing the workload value.
Fig. 5. The components of the workload dispatcher prototype and their inputs and outputs relate to the model of an autonomic manager with the four different phases of monitoring, analyzing, planning, and executing [12]
3.2 Selecting the Algorithm to Separate the Workload

Known algorithms for computing the optimal separation of weighted trees lead to a non-polynomial complexity and run time. The time to compute a workload distribution is limited by the time between two backup windows. Large numbers of file systems and directories lead to a high number of subtasks, so a non-polynomial complexity results in a run time much longer than the time frame between two backup windows. In most cases, an approximation of the optimal distribution will also fulfill the requirements of a balanced separation [8]. Therefore, another class of algorithms can be chosen which guarantees a polynomial complexity under all circumstances. A similar task has to be performed as for job scheduling solutions [9]. A list containing the workload for each job and the number of computing nodes represent the input in such a case. The goal of such job scheduling is to assign each workload to one of the computing nodes in order to minimize the maximum workload per node [10]. In [11] the offline greedy algorithms are introduced as a typical class of job scheduling algorithms, which work like this:

1. The list is sorted by the workload of each job in decreasing order.
2. The job with the biggest workload is removed from the list and assigned to the computing node with the least workload at this moment.
3. Step 2 is repeated until the list is empty.
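A minimal, self-contained Java sketch of this offline greedy (longest-processing-time-first) assignment might look as follows; it merely illustrates the three steps above and is not the prototype's code.

  import java.util.Arrays;
  import java.util.Comparator;

  // Offline greedy scheduling: sort workloads in decreasing order and always
  // assign the next workload to the currently least loaded node.
  public class OfflineGreedy {

      // returns, for each workload index, the node it was assigned to
      static int[] assign(double[] workloads, int nodeCount) {
          Integer[] order = new Integer[workloads.length];
          for (int i = 0; i < order.length; i++) order[i] = i;
          // step 1: sort jobs by workload, biggest first
          Arrays.sort(order, Comparator.comparingDouble((Integer i) -> workloads[i]).reversed());

          double[] load = new double[nodeCount];
          int[] assignment = new int[workloads.length];
          for (int job : order) {
              // step 2: pick the node with the least accumulated workload
              int best = 0;
              for (int n = 1; n < nodeCount; n++) {
                  if (load[n] < load[best]) best = n;
              }
              assignment[job] = best;
              load[best] += workloads[job];
          }
          return assignment;   // step 3: all jobs processed
      }

      public static void main(String[] args) {
          double[] fileSystemWorkloads = {7.0, 5.0, 4.0, 3.0, 2.0};
          System.out.println(Arrays.toString(assign(fileSystemWorkloads, 3)));
      }
  }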
As [11] shows, the worst case is Tapprox / Toptimal = 4/3 − 1/(3n), where n is the number of clients, Tapprox is the amount of time needed by the approximation algorithm, and Toptimal the time of an optimal distribution. So for n >> 1, the worst case needs about 33% more time than the optimal distribution. Nevertheless, the worst case shows up only for special sizes of subtasks which are not to be expected on any file server. Looking at this example for job scheduling, the separation problem can be derived from the job scheduling problem. In both application areas, the algorithm has to assign multiple workloads to a number of instances. The differences to the separation of backup workloads are:
2. 3.
The workload represents computing time for job scheduling problems, the value is derived from the previous backup workload history for separating directory trees. The instance of the separation problem is a tree while the instance of the job scheduling problem is a list. The result of the job scheduling problem is the lowest possible for the given set of workloads. It does not take into consideration splitting one task into subtasks.
The first point does not make a difference to the algorithm itself because generic numbers are used in both cases. Both of the other points can be addressed in the following way. A list is created out of the tree by using the workloads of all file systems WFSi as elements. WFSi represents the sum of the workload of all directories contained in a file system. By running the offline greedy algorithm with this initial list, an assignment of all file systems to client nodes is computed. In a second step, it has to be verified that the resulting separation guarantees having workloads for all clients less than the maximum amount of time or bytes to be transferred. For instance, a backup window specifies a time period which should not be exceeded. The prototype divides the whole workload by the number of clients to compute the average workload to be assigned to each client. After assigning all subtasks no client should perform more than a given percentage of workload above the average. If no client exceeds this limit the separation is taken as the final result. If one or more of the clients exceed the upper bound one workload WFSi will be replaced in the list by splitting it into workload values WDiri. They represent the workloads for each of the subdirectories in the file system i. After the modified list is created the next iteration is performed by running the offline Greedy algorithm again. Further steps may also split directories into their subdirectories. This leads to the following iterative approximation algorithm solving the separation problem: 1. 2. 3. 4. 5.
Copy the file system workloads WFSi contained in the tree into a list Perform the offline greedy algorithm Test, whether all client workloads are less than a specified bound If a client's workload is above the upper bound then split a workload into the workloads of the direct subdirectories Perform step 2 again
332
J.-P. Akelbein and U. Schröfel
Looking at step 4 of the algorithm, the decision is not described in detail which node has to be split. There are several alternatives of choosing an element e.g. 1. Remove the element that has been added last to the client's workload and try to split it. If this element cannot be split the solution will not improve. 2. Remove the element that has been added first to the client's workload and try to split it. If this element cannot be split the solution will not improve. The algorithm described above provides a solution for the separation problem in polynomial time. The complexity of the offline greedy algorithm itself is linear while 2 the presented algorithm shows O(N) = N in its worst case for N being the number of file systems and directories in the file system tree [6]. For common environments found on large file servers, the approximation provides reasonable workload distributions nearly equal to the optimal solution. The offline greedy algorithm was chosen because it was easy to adapt to the problem described above. It would be worth trying to adopt other load balancing algorithms to the problem to analyze whether they provide better results. The separation algorithm is implemented as a component called workload dispatcher (see fig. 5) as part of the prototype solution. It uses previous backup protocols of all clients instead of the TSM server database as the input and generates separation lists as output avoiding any changes in the server code. It can be configured to use different transformations of the historical data etc. which applies to the requirement of the environment. Most of the efforts to create the prototype implementing adaptive workload balancing were spent on developing the workload dispatcher component. Only a few lines of code needed to be changed in existing TSM client to be integrated into the prototype solution while the server was used without modifications. The small amount of efforts used for integrating the new component with the existing ones shows that adaptive workload balancing can easily be integrated into existing products fulfilling the requirement described in section 2.2. The next section presents some results generated by using the prototype in a simulated environment.
4 Results To analyze the results of distributing a backup workload five TSM clients backed up a shared GPFS file system. Different scenarios were performed like in the system verification tests of TSM clients which represent typical customer environments [6]. 4.1 Scenario One – Balanced Workload After each backup about 5% of the files are modified and created as new ones. All changes are randomly distributed over the file system. The file sizes also vary randomly following a typical distribution which can be found on file servers [13]. To
Adaptive Workload Balancing for Storage Management Applications
333
5 clients 1 client
18:00
duration 12:00 [min]
6:00
0:00 1
3
5
7
9
11
13
15
17
19
21
backup event E
Fig. 6. Backup durations of a file system parsed by one and by five clients in scenario 1
show the influence of peak workloads, 25% of the files are changed after the 15th backup. Fig. 6 shows the backup durations needed for all backups. By using five clients in parallel the backup duration takes no more than 30% of the time needed by one. Most of the backups take slightly more than 20%. Because the changes are distributed randomly over the whole file system a peak workload like in the 16th backup results in a longer backup time. Nevertheless, the backup is performed in about 30% of the time needed by a single client. The only exception can be found for the first backup. As no statistical data is available at this point in time one client parses the whole file system to gain this information. So the first backup can be seen as an initial configuration procedure. Other reconfigurations are not needed in this scenario. For a progressive incremental method the first backup is the only full backup. Dedicated time should be reserved to perform it. 4.2 Scenario Two – Unbalanced Workload Scenario one demonstrated that the workload dispatcher created only one list of separation points after the first backup. In this scenario the workload changes over the time, but the distribution of changes in the different subtasks remain the same. A second scenario shows how the configuration is adapted for successive backups. Therefore, after each backup 5% of the files are modified or created. In addition, a large directory containing subdirectories and files with about 25% of the whole data is added after the second and two other large directories after the twelfth backup.
334
J.-P. Akelbein and U. Schröfel
Fig. 7 shows the backup duration of each of the five TSM clients. Because the large directory has to be backed up by client 1 in backup 3, it needs more than double the time compared to the other clients. A similar situation occurs in the thirteenth backup event. So if a large amount of data is added at single place within the file system increasing the number of clients will not reduce the backup duration of the single client which has to backup significantly more data like in the backup three and thirteen.
Client 1 Client 2
6:00
Client 3 Client 4 Client 5 Backup duration [min]
3:00
0:00 2
4
6
8
10
12
14
Fig. 7. Comparison of the backup duration of five clients running in parallel as described in scenario 2 without showing the first full backup
After the backup took place the workload dispatcher distributes the workload of the additional data between all clients so their backups align to the same period of time again for the forth and 14th backup. Based on experiences with a large variety of different product environments such peak workloads tend to become less probable with an increasing amount data as most of the data is stored only for reference not changing anymore. One way to avoid the influence of peak workloads is running a dedicated backup apart from the normal schedule for only the large set of new or modified data. In [6], other experiments show further analyses of other situations which might occur in customer environments. 4.3 Introducing a Correlation Factor If a quantitative statement about the suitability of the presented adaptive workload algorithm in a specific environment is needed for a backup event Ei, the actual workload Wi,k of each client k will need to be compared with the forecasted workload WFi,k = Wi-1,k. Because the total workload Wi of all clients might deviate also like seen in scenario one, the normalized workload ratio Wi,k/Wi must be considered instead of Wi,k itself. By normalizing also the forecast, the correlation factor, ci,k = 1 - | WFi,k / WFi Wi,k / Wi | represents the deviation of the workload for client k in comparison to the expected ratio of the total workload Wi. Therefore, ci = MIN ( ci,1, …, ci,n ) defines the
Adaptive Workload Balancing for Storage Management Applications
335
worst ratio of a client for a backup Ei. So in a balanced workload distribution like in scenario the overall correlation factor c = MIN ( c0, …, ci ) for all backup events E0, …, Ei is near by one (in scenario one c=0,98). If c is significantly lower than one an unbalanced workload situation is indicated (in scenario two c=0,86).
5 Outlook After the prototype was developed and the chosen approach demonstrated its strengths and drawbacks, considerations of other areas in storage management seem worthwhile. The prototype assigns storage management clients to data objects like files or directories adaptively [14]. The prototype only considers backups, but restoring data can also be carried out in parallel by multiple client nodes. The scope of a restore is specified when this activity is started. The amount of data to be transferred cannot be forecasted by any previous event. Therefore, workload distributions can be computed only for predefined scopes. Such an approach is feasible for situations like a full restore.
Fig. 8. Adaptive assignment of client nodes and their workloads to server nodes by an adaptive assignment table realized as a proxy
Also on the server side an adaptive assignment of clients to different servers can lead to a workload balancing. Current distributed storage management applications assign a client statically to a server. If a client node and its data should be moved from one server to another one it needs to be exported and imported manually. Managing hundreds or even thousands of clients can be very time consuming if the workload should be assigned and balanced between several servers. A self-optimizing solution can be achieved by an equivalent adaptive approach as shown in section 2.1. If the backup workload generated by each client is known the same algorithm is able to generate a new assignment table between clients and servers. Such an assignment table is the equivalent of separation lists used in the prototype. The client nodes and their data have to be moved between the servers according to the assignment table (see fig. 8). Therefore, it makes sense to modify the algorithm to achieve a new assignment with a minimum of client node movements.
336
J.-P. Akelbein and U. Schröfel
A client needs to know which server should be used for backing up its data. So a proxy stores the assignment table. If a client contacts its server the proxy can tell it the appropriate IP address. By sharing the data of all client nodes on the managed storage devices in a SAN no movement of data objects is needed reducing the data traffic significantly. Sharing a distributed database between all server instances removes any meta data movements at all. A client node will not need to be assigned totally to a single server. If a proxy tracks also file spaces this level of granularity can also be used to distribute the whole workload. The method of adaptively assigning data objects to data management applications does not distinguish between such levels. Only the number of subtasks to be assigned by the algorithm varies. Adaptively assigning data objects to data management applications can automate a lot of tasks to configure complex and distributed storage management applications. Typical TSM customers rarely optimize their environment because of the efforts needed to perform such tasks. Furthermore, it needs time to collect the workload information and a good workload distribution cannot be achieved easily by manually configuring the whole environment. This paper demonstrates that backup windows can be reduced significantly by parallel data transfers of multiple client nodes sharing the same data. By approximating a good workload distribution the prototype automatically gains such a benefit. If only historical information is considered the design of existing storage management application does not need to be changed significantly leading to an easy integration of the concept of adaptive workload balancing. Peak workloads might lead to longer backup durations so an analysis of the correlation factor c should be made for environments to determine whether such events can take place.
References 1. 2. 3.
4.
5. 6. 7. 8.
Kaczmarksi, M., Jiang, T., Pease, D.A.: Beyond backup toward storage management. IBM Systems Journal, vol. 42, no. 2, pp. 223, (2003). Tivoli Storage Manager for UNIX. Backup-Archive Client Installation and User Guide, Version 5 Release 2, fourth Edition, (2003) Shen, K., Yang, T., Chu, L.: Cluster Load Balancing for Fine-grain Network Services, IEEE Proc. of the Int. Parallel and Distributed Processing Symposium IPDPS’02, pp. 51b, (2002) Thomé, V., Vianna, D., Costa, R., Plastino, A., Filho, O.T.: “Exploring Load Balancing in a Scientific SPMD Parallel Application, IEEE Proc. Of the Int. Conf. on Parallel Processing Workshops ICPPW’02, pp. 419, (2002) Tanenhaus, A.: Operating Systems – Design and Implementation, Prentice Hall, (1992) Schröfel, U.: Separation eines Dateisystembaums zum dynamischen Lastenausgleich, master thesis, University of Paderborn, Germany, 2002. Tivoli Storage Manager for AIX: Administrator’s Guide, Version 5 Release 2, second Edition, (2003) Wanka, R.: Approximationsalgorithmen, http://www.upd.de/cs/agmadh/ vorl/Approx02, version 2.1, 2002.
Adaptive Workload Balancing for Storage Management Applications 9. 10.
11. 12. 13.
14.
337
Borodin, A., El-Yaniv, R.: Online Computation and Competitive Analysis, Cambridge University Press, (1998) Graham, R.L., Lawler, E.L., Lenstra, J.K., and Rinnoy Kan, A.: Optimization and approximation in deterministic sequencing and scheduling: a survey, Annals of discrete Mathematics 5, pp. 287, (1979) Graham, R.L.: Bounds on multiprocessing timing anomalies, SIAM J. Appl. Math. 17, pp. 263, (1969) Autonomic computing Overview, IBM research, http://www.research.ibm.com/autonomic/overview/ faqs.html, (2003) Akelbein, J.-P.: Ein Massenspeicher mit höherer logischer Schnittstelle – Design und Analyse, PhD thesis, University of the federal Armed Forces, Hamburg, Shaker Verlag, (1998) Akelbein, J.P., Schröfel, U.: A method for adaptively assigning of data management applications to data objects, patent application 02019559.0, European patent office, (2002)
Author Index
Ahmadinia, Ali 125 Akelbein, Jens-Peter 322 Arat´ o, P´eter 169 Augel, Markus 260 Bellosa, Frank 231 Belmans, Ronnie 78 Berger, Michael 107 Bobda, Christophe 125 Braunes, Jens 156 Buchty, Rainer 184 Chen, Ming 63 Choi, Lynn 47 Chung, Tae-Sun 199 Deconinck, Geert 78 Dolev, Shlomi 31 Enzmann, Matthias 273 Eschmann, Frank 9 Faerber, Matthias 231 Floerkemeier, Christian 291 Flor, Thomas 309 Giessler, Elli
273
Haase, Jan 9 Haisch, Michael 273 Haviv, Yinnon A. 31 Heintze, Nevin 184 Hunter, Brian 273 Ilyas, Mohammad Jung, Myung-Jin
273 199
Kim, Bumsoo 199 Klauer, Bernd 9 Knorr, Rudi 260 K¨ ohler, Steffen 156
LaMarca, Anthony 92 Liu, Xuezheng 63 Maier, Andreas 3 ´ am Mann, Zolt´ an Ad´ Norden, Erik
169
4
Oliva, Dino 184 Orb´ an, Andr´ as 169 Park, Stein
199
Ramsauer, Markus 213 Richter, Harald 140 Rodrig, Maya 92 Sauter, Patrick 309 Schneider, Markus 273 Schr¨ ofel, Ute 322 Schulz, Michael 246 Schulz, Stefan 246 Seitz, Christian 107 Shin, Yong 47 Siegemund, Frank 291 Siemers, Christian 140 Spallek, Rainer G. 156 Specht, G¨ unther 309 Tanner, Andreas 246 Teich, J¨ urgen 125 Tresp, Volker 20 Vanthournout, Koen 78 V¨ ogler, Gabriel 309 Vogt, Harald 291 Waldschmidt, Klaus 9 Weissel, Andreas 231 Wiegand, Christian 140 Yang, Guangwen Yu, Kai 20
63